Cloud Agent Health, Resource Locking, and Scheduling

Use this page to understand how BenchCI decides whether a cloud-connected bench is ready to receive hardware test runs.

BenchCI v0.7 adds a reliability path across the Agent, backend, scheduler, and dashboard:

Agent checks bench health
        ↓
Backend stores bench health
        ↓
Scheduler avoids unhealthy benches
        ↓
Dashboard explains bench health and failure source

This helps users distinguish between:

firmware failures
test logic failures
bench infrastructure problems
Agent/cloud problems
configuration problems

Health states

BenchCI uses four bench health states:

Health	Meaning	Scheduler behavior
`healthy`	The bench passed required health checks.	Eligible for cloud runs.
`degraded`	The bench has warnings or optional issues.	Eligible for cloud runs in v0.7.
`failing`	The bench has a health problem that can make runs unreliable.	Not scheduled.
`unknown`	No valid health report is available.	Not scheduled.

For the v0.7 baseline, the scheduler assigns runs only to:

healthy
degraded

The scheduler avoids:

failing
unknown
missing health
malformed health

Requesting a specific bench with --bench-id does not bypass this health filter.
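The filter above can be sketched in Python as follows. This is an illustration only, not BenchCI's scheduler code: the names `normalize_health` and `is_schedulable` are hypothetical, and the sketch assumes the status arrives as an exact lowercase string.

```python
# Hypothetical sketch of the v0.7 health filter; not BenchCI's actual code.
SCHEDULABLE = {"healthy", "degraded"}
KNOWN_STATES = {"healthy", "degraded", "failing", "unknown"}


def normalize_health(raw):
    """Map a reported status to one of the four states.

    Missing or malformed values (None, non-strings, unrecognized
    strings) are treated as "unknown".
    """
    if not isinstance(raw, str):
        return "unknown"
    return raw if raw in KNOWN_STATES else "unknown"


def is_schedulable(raw):
    """Return True only for benches the scheduler may assign runs to."""
    return normalize_health(raw) in SCHEDULABLE
```

Under these assumptions, `is_schedulable("degraded")` is true, while `is_schedulable(None)` and `is_schedulable("failing")` are false.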

Agent startup self-check

When a cloud Agent starts, it can run a non-destructive self-test for the registered bench and include health in the backend sync payload.

Start a cloud Agent:

benchci agent cloud \
  --backend https://api.benchci.dev \
  --token YOUR_AGENT_TOKEN \
  --bench bench.yaml \
  --bench-id raspi-nucleo-demo \
  --agent-name "Lab Agent 01"

The Agent can report:

health
health_status
health_checked_at
health_summary

The backend stores those fields and exposes them through Cloud and Dashboard APIs.

Agent self-test safety boundary

Agent startup health checks are non-destructive by default.

They do not:

flash firmware
reset the target
toggle relays
power-cycle outlets
drive GPIO outputs
send UART/CAN/Modbus commands
read inputs or measurements unless explicitly enabled

This makes startup checks safe for normal Agent operation.

Agent self-test controls

Environment variables control how deep startup checks go:

BENCHCI_AGENT_SELF_TEST_ON_STARTUP=1
BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE=1
BENCHCI_AGENT_SELF_TEST_READ_INPUTS=0
BENCHCI_AGENT_SELF_TEST_READ_MEASUREMENTS=0

Recommended starting point:

export BENCHCI_AGENT_SELF_TEST_ON_STARTUP=1
export BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE=1
export BENCHCI_AGENT_SELF_TEST_READ_INPUTS=0
export BENCHCI_AGENT_SELF_TEST_READ_MEASUREMENTS=0

Enable input or measurement reads only after confirming they are safe for your bench.
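Before starting the Agent, it can help to confirm which self-test settings are in effect in the current shell. The sketch below reads the four variables listed above. It assumes that the value `1` means enabled and anything else means disabled, and it falls back to the recommended starting point when a variable is unset; the Agent's own built-in defaults and parsing rules may differ. The function name `self_test_settings` is hypothetical.

```python
import os

# Fallbacks mirror the recommended starting point in this page.
# Assumption: these are not necessarily the Agent's built-in defaults.
RECOMMENDED = {
    "BENCHCI_AGENT_SELF_TEST_ON_STARTUP": "1",
    "BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE": "1",
    "BENCHCI_AGENT_SELF_TEST_READ_INPUTS": "0",
    "BENCHCI_AGENT_SELF_TEST_READ_MEASUREMENTS": "0",
}


def self_test_settings(environ=None):
    """Return {variable name: bool} for the four self-test controls."""
    environ = os.environ if environ is None else environ
    return {
        name: environ.get(name, fallback).strip() == "1"
        for name, fallback in RECOMMENDED.items()
    }


if __name__ == "__main__":
    for name, enabled in self_test_settings().items():
        print(f"{name}: {'enabled' if enabled else 'disabled'}")
```

Running the script in the shell that will launch the Agent shows at a glance whether input or measurement reads have been switched on.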

Health report artifacts on the Agent

Health reports are written under the Agent work directory:

benchci-agent-results/
  bench-health/
    <bench_id>/
      self-test.log
      self-test-summary.json
      nodes/
      resources/

These files help lab owners debug why a bench is marked degraded or failing.
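A small helper can locate and load the summary file for one bench. The sketch below relies only on the directory layout shown above; the JSON schema of `self-test-summary.json` is not documented on this page, so the code makes no assumptions about its keys. The function names and the `results_dir` parameter (the path to `benchci-agent-results/`) are hypothetical.

```python
import json
from pathlib import Path


def summary_path(results_dir, bench_id):
    """Build the path to a bench's self-test summary from the layout above."""
    return Path(results_dir) / "bench-health" / bench_id / "self-test-summary.json"


def load_summary(results_dir, bench_id):
    """Return the parsed summary, or None if it is missing or not valid JSON."""
    path = summary_path(results_dir, bench_id)
    if not path.is_file():
        return None
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None
```

For example, `load_summary("benchci-agent-results", "raspi-nucleo-demo")` returns the parsed content when the file exists and `None` otherwise, which can then be pretty-printed with `json.dumps(..., indent=2)` for review.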

Resource locking

BenchCI protects hardware resources with a file-based cross-process lock layer.

This prevents two runs on the same machine from using the same physical interface at the same time.

By default, locks are stored under:

~/.benchci/locks/

You can customize the lock directory:

export BENCHCI_LOCK_DIR=/path/to/locks

Emergency opt-out for debugging only:

export BENCHCI_DISABLE_RESOURCE_LOCKS=1

Do not disable locks for normal lab or CI operation.

Resources protected

Resource locking covers:

serial transports such as UART and Modbus RTU ports
CAN interfaces
Modbus TCP endpoints
flash/reset tools and probe/port usage
GPIO lines
power relay resources and outlets
measurement resources

If a second run tries to use a locked resource, it should fail early with a resource-lock failure instead of producing confusing hardware results.
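To illustrate how a file-based cross-process lock can produce this fail-early behavior, here is a minimal sketch using `fcntl.flock` on Linux. It is not BenchCI's implementation: the `ResourceLockError` class, the `resource_lock` helper, and the one-file-per-resource naming scheme are all hypothetical. Only the `BENCHCI_LOCK_DIR` variable and the `~/.benchci/locks/` default come from this page.

```python
import fcntl
import os
import re
from contextlib import contextmanager
from pathlib import Path


class ResourceLockError(RuntimeError):
    """Raised when another run already holds the resource (hypothetical name)."""


def lock_dir():
    # BENCHCI_LOCK_DIR and the default location are taken from this page.
    return Path(os.environ.get("BENCHCI_LOCK_DIR", "~/.benchci/locks")).expanduser()


@contextmanager
def resource_lock(resource, directory=None):
    """Hold an exclusive, non-blocking lock for one resource identifier."""
    directory = Path(directory) if directory is not None else lock_dir()
    directory.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: sanitize the identifier into a file name.
    name = re.sub(r"[^A-Za-z0-9_.-]", "_", resource) + ".lock"
    fd = os.open(directory / name, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes a contended lock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise ResourceLockError(f"{resource} is locked by another run") from exc
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

A run would wrap hardware access in `with resource_lock("/dev/ttyUSB0"):`. Because the operating system releases an `flock` when the holding process exits, a crashed run does not leave a stale lock behind, which is one reason this approach is common for lab tooling.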

Backend health storage

The backend stores bench health fields in the bench inventory:

health_status
health_checked_at
health_summary
health_json

This allows Cloud APIs, the scheduler, and the dashboard to use the same health state.

You can inspect visible benches:

benchci benches list
benchci benches show raspi-nucleo-demo

The dashboard also shows health fields in the Benches view.

Scheduler behavior

The scheduler only assigns queued runs to benches that are:

online
idle
accessible to the workspace
matching requested tags/capabilities
healthy or degraded

A queued run remains queued if no suitable bench is available.

This is intentional. Leaving a run queued is better than assigning it to a bench that is known to be unhealthy.
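The eligibility rules above can be expressed as a single predicate. The sketch below is an assumption-laden illustration rather than the backend's code: the `Bench` and `QueuedRun` types, their field names, and the first-match selection policy are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Bench:
    # Hypothetical bench record; field names are illustrative only.
    bench_id: str
    online: bool
    idle: bool
    workspaces: set
    tags: set
    health_status: str = "unknown"


@dataclass
class QueuedRun:
    workspace: str
    required_tags: set = field(default_factory=set)
    bench_id: Optional[str] = None  # set when a specific bench is requested


def eligible(bench, run):
    """Apply all five criteria; a requested bench ID narrows, never bypasses."""
    return (
        bench.online
        and bench.idle
        and run.workspace in bench.workspaces
        and run.required_tags <= bench.tags
        and bench.health_status in {"healthy", "degraded"}
        and (run.bench_id is None or run.bench_id == bench.bench_id)
    )


def pick_bench(benches, run):
    """Return the first eligible bench, or None so the run stays queued."""
    return next((b for b in benches if eligible(b, run)), None)
```

In this sketch, a run that explicitly requests a `failing` bench gets `None` back and therefore remains queued, matching the behavior described above.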

Dashboard behavior

The dashboard shows health on bench cards:

Healthy
Degraded
Failing
Unknown

The bench health panel can show:

health summary
last checked timestamp
scheduling eligibility message
pass/warn/fail/skip counts
failing or warning diagnostic checks
categories and suggested fixes

The Runs view can also show failure source labels such as:

Firmware
Test logic
Bench infrastructure
Agent / cloud
Configuration
Unknown

Example failure explanation:

Likely source: Bench infrastructure
Category: Transport Open Failed
The physical bench, wiring, instrument, or local interface likely needs attention.

Troubleshooting unhealthy benches

Bench is `unknown`

Common causes:

Agent has not synced health yet
Agent is older than v0.7
health self-test is disabled
backend has not received a new bench sync
Agent token/workspace mismatch

Try:

benchci benches list

Then restart the Agent with health enabled:

export BENCHCI_AGENT_SELF_TEST_ON_STARTUP=1
export BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE=1

benchci agent cloud \
  --backend https://api.benchci.dev \
  --token YOUR_AGENT_TOKEN \
  --bench bench.yaml \
  --bench-id raspi-nucleo-demo \
  --agent-name "Lab Agent 01"

Bench is `failing`

Run self-test manually on the hardware machine:

benchci bench self-test \
  --bench bench.yaml \
  --open-hardware \
  --log-dir bench-health

Then inspect:

bench-health/self-test.log
bench-health/self-test-summary.json
bench-health/nodes/
bench-health/resources/

Typical fixes include:

reconnecting USB-UART adapters
correcting /dev/ttyUSB* paths
fixing GPIO chip/line numbers
installing missing flash tools
checking relay power and permissions
checking HTTP relay or measurement controller URLs
confirming the correct Agent token/workspace/bench ID
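Several of these fixes can be checked quickly on the hardware machine before rerunning the self-test. The sketch below uses only the Python standard library and is independent of BenchCI. The helper names are hypothetical, and the tool names in the usage section are examples only; substitute whatever flash tools your bench actually uses.

```python
import glob
import os
import shutil


def find_serial_ports(pattern="/dev/ttyUSB*"):
    """List device nodes matching the pattern, e.g. attached USB-UART adapters."""
    return sorted(glob.glob(pattern))


def inaccessible_ports(ports):
    """Return ports the current user cannot both read and write."""
    return [p for p in ports if not os.access(p, os.R_OK | os.W_OK)]


def missing_tools(tools):
    """Return the tools that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


if __name__ == "__main__":
    ports = find_serial_ports()
    print("serial ports:", ports or "none found")
    print("no read/write access:", inaccessible_ports(ports) or "none")
    # Example tool names only; adjust to match your bench configuration.
    print("missing tools:", missing_tools(["openocd", "st-flash"]) or "none")
```

An empty port list usually points to a disconnected adapter or a wrong device path, while a port listed under "no read/write access" suggests a permissions problem for the user running the Agent.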

Bench is `degraded`

A degraded bench can still receive runs in v0.7.

Review warnings before relying on results, especially for release evidence or customer demos.

Possible causes:

optional measurement resource unavailable
optional readback unsupported
non-critical tool warning
resource warning that is not required by the current suite

Future scheduler versions may choose whether a degraded bench is acceptable based on the queued run’s exact required resources.

Reference demo bench

A practical starter bench for real embedded validation can be:

Raspberry Pi Zero 2 W
Nucleo F072RB
powered USB hub
USB relay
TTL-USB adapter
TTL-RS485 adapter
RS485-USB adapter
2 GPIO lines from Raspberry Pi to Nucleo

This type of bench can demonstrate:

UART boot validation
RS-485/Modbus tests
GPIO ready/reset checks
relay-based power cycling
Cloud Agent scheduling
dashboard health visibility
run evidence and failure classification

Start simple with UART and flashing, then add power control, GPIO, RS-485, and measurements as the bench matures.