Cloud Agent Health, Resource Locking, and Scheduling¶
Use this page to understand how BenchCI decides whether a cloud-connected bench is ready to receive hardware test runs.
BenchCI v0.7 adds a reliability path across the Agent, backend, scheduler, and dashboard:
Agent checks bench health
↓
Backend stores bench health
↓
Scheduler avoids unhealthy benches
↓
Dashboard explains bench health and failure source
This helps users distinguish between:
firmware failures
test logic failures
bench infrastructure problems
Agent/cloud problems
configuration problems
Health states¶
BenchCI uses four bench health states:
| Health | Meaning | Scheduler behavior |
|---|---|---|
| healthy | The bench passed required health checks. | Eligible for cloud runs. |
| degraded | The bench has warnings or optional issues. | Eligible for cloud runs in v0.7. |
| failing | The bench has a health problem that can make runs unreliable. | Not scheduled. |
| unknown | No valid health report is available. | Not scheduled. |
For the v0.7 baseline, the scheduler assigns runs only to:
healthy
degraded
The scheduler avoids:
failing
unknown
missing health
malformed health
Even a specifically requested --bench-id cannot bypass this health filter.
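The filter above can be sketched in Python. This is a minimal illustration under stated assumptions, not BenchCI's scheduler code: the helper names `normalize_health` and `is_schedulable` are hypothetical, and treating missing or malformed values as `unknown` is one plausible way to implement the rule described here.

```python
# Health states named on this page; only the first two are schedulable in v0.7.
SCHEDULABLE_STATES = {"healthy", "degraded"}
KNOWN_STATES = SCHEDULABLE_STATES | {"failing", "unknown"}


def normalize_health(raw: object) -> str:
    """Map a stored health value to one of the four states.

    Hypothetical helper: missing (None) or malformed values collapse to "unknown".
    """
    if not isinstance(raw, str):
        return "unknown"
    # Assumption: comparison ignores case and surrounding whitespace.
    status = raw.strip().lower()
    return status if status in KNOWN_STATES else "unknown"


def is_schedulable(raw: object) -> bool:
    """Return True only for benches the v0.7 scheduler may assign runs to."""
    return normalize_health(raw) in SCHEDULABLE_STATES
```

Collapsing every unrecognized value into `unknown` keeps the decision to a single set-membership test, so a bench is never scheduled by accident because of an unexpected value.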
Agent startup self-check¶
When a cloud Agent starts, it can run a non-destructive self-test for the registered bench and include health in the backend sync payload.
Start a cloud Agent:
benchci agent cloud --backend https://api.benchci.dev --token YOUR_AGENT_TOKEN --bench bench.yaml --bench-id raspi-nucleo-demo --agent-name "Lab Agent 01"
The Agent can report:
health
health_status
health_checked_at
health_summary
The backend stores those fields and exposes them through Cloud and Dashboard APIs.
Agent self-test safety boundary¶
Agent startup health checks are non-destructive by default.
They do not:
flash firmware
reset the target
toggle relays
power-cycle outlets
drive GPIO outputs
send UART/CAN/Modbus commands
read inputs or measurements unless explicitly enabled
This makes startup checks safe for normal Agent operation.
Agent self-test controls¶
Environment variables control how deep startup checks go:
BENCHCI_AGENT_SELF_TEST_ON_STARTUP=1
BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE=1
BENCHCI_AGENT_SELF_TEST_READ_INPUTS=0
BENCHCI_AGENT_SELF_TEST_READ_MEASUREMENTS=0
Recommended starting point:
export BENCHCI_AGENT_SELF_TEST_ON_STARTUP=1
export BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE=1
export BENCHCI_AGENT_SELF_TEST_READ_INPUTS=0
export BENCHCI_AGENT_SELF_TEST_READ_MEASUREMENTS=0
Enable input or measurement reads only after confirming they are safe for your bench.
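As an illustration of how these four variables combine, the sketch below parses them into a settings object. It is not the Agent's own parser: the `SelfTestSettings` class and `load_self_test_settings` function are hypothetical, the rule that only the value `1` enables a check is an assumption, and the fallback values simply mirror the recommended starting point rather than confirmed Agent defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SelfTestSettings:
    on_startup: bool
    open_hardware: bool
    read_inputs: bool
    read_measurements: bool


def _flag(env: Mapping[str, str], name: str, default: str) -> bool:
    # Assumption: "1" enables a check and any other value disables it.
    return env.get(name, default).strip() == "1"


def load_self_test_settings(env: Mapping[str, str] = os.environ) -> SelfTestSettings:
    # Fallbacks mirror the recommended starting point, not confirmed Agent defaults.
    return SelfTestSettings(
        on_startup=_flag(env, "BENCHCI_AGENT_SELF_TEST_ON_STARTUP", "1"),
        open_hardware=_flag(env, "BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE", "1"),
        read_inputs=_flag(env, "BENCHCI_AGENT_SELF_TEST_READ_INPUTS", "0"),
        read_measurements=_flag(env, "BENCHCI_AGENT_SELF_TEST_READ_MEASUREMENTS", "0"),
    )


if __name__ == "__main__":
    # Prints the settings derived from the current shell environment.
    print(load_self_test_settings())
```

Running the script in the shell where the Agent will start is a quick way to confirm which checks that environment would enable under these assumptions.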
Health report artifacts on the Agent¶
Health reports are written under the Agent work directory:
benchci-agent-results/
  bench-health/
    <bench_id>/
      self-test.log
      self-test-summary.json
      nodes/
      resources/
These files help lab owners debug why a bench is marked degraded or failing.
Resource locking¶
BenchCI protects hardware resources with a file-based cross-process lock layer.
This prevents two runs on the same machine from using the same physical interface at the same time.
By default, locks are stored under:
~/.benchci/locks/
You can customize the lock directory:
export BENCHCI_LOCK_DIR=/path/to/locks
Emergency opt-out for debugging only:
export BENCHCI_DISABLE_RESOURCE_LOCKS=1
Do not disable locks for normal lab or CI operation.
Resources protected¶
Resource locking covers:
serial transports such as UART and Modbus RTU ports
CAN interfaces
Modbus TCP endpoints
flash/reset tools and probe/port usage
GPIO lines
power relay resources and outlets
measurement resources
If a second run tries to use a locked resource, it should fail early with a resource-lock failure instead of producing confusing hardware results.
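To make the fail-early behavior concrete, here is a minimal sketch of a file-based cross-process lock. It shows one common technique (atomic lock-file creation with `O_EXCL`) and should not be read as BenchCI's internal mechanism. `ResourceLockError`, `resource_lock`, and the lock-file naming scheme are hypothetical; only `BENCHCI_LOCK_DIR`, `BENCHCI_DISABLE_RESOURCE_LOCKS`, and the `~/.benchci/locks/` default come from this page.

```python
import os
import re
from contextlib import contextmanager
from pathlib import Path


class ResourceLockError(RuntimeError):
    """Raised when another process already holds the resource (hypothetical name)."""


def lock_dir() -> Path:
    # BENCHCI_LOCK_DIR and the ~/.benchci/locks/ default are taken from this page.
    return Path(os.environ.get("BENCHCI_LOCK_DIR", "~/.benchci/locks")).expanduser()


@contextmanager
def resource_lock(resource: str):
    """Hold an exclusive lock file for one resource, e.g. "/dev/ttyUSB0"."""
    if os.environ.get("BENCHCI_DISABLE_RESOURCE_LOCKS") == "1":
        yield  # debugging opt-out: no protection at all
        return
    directory = lock_dir()
    directory.mkdir(parents=True, exist_ok=True)
    # Turn the resource name into a safe file name (naming scheme is an assumption).
    path = directory / (re.sub(r"[^A-Za-z0-9_.-]", "_", resource) + ".lock")
    try:
        # O_EXCL makes creation atomic: exactly one process can create the file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise ResourceLockError(f"{resource} is locked by another run") from exc
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        path.unlink(missing_ok=True)
```

A run would wrap hardware access in `with resource_lock("/dev/ttyUSB0"):` and treat `ResourceLockError` as a resource-lock failure. This sketch does not recover stale lock files left behind by a crashed process; a production lock layer needs some strategy for that, such as OS-level advisory locks.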
Backend health storage¶
The backend stores bench health fields in the bench inventory:
health_status
health_checked_at
health_summary
health_json
This allows Cloud APIs, the scheduler, and the dashboard to use the same health state.
You can inspect visible benches:
benchci benches list
benchci benches show raspi-nucleo-demo
The dashboard also shows health fields in the Benches view.
Scheduler behavior¶
The scheduler only assigns queued runs to benches that are:
online
idle
accessible to the workspace
a match for the requested tags/capabilities
healthy or degraded
A queued run remains queued if no suitable bench is available.
This is intentional. A queued run is better than assigning a run to a bench that is known to be unhealthy.
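The selection rules above can be expressed as a small Python sketch. The `Bench` and `QueuedRun` data classes, their field names, and the `pick_bench` function are all hypothetical and exist only for illustration; the real scheduler's data model and tie-breaking order are not described on this page.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set

SCHEDULABLE_HEALTH = {"healthy", "degraded"}


@dataclass
class Bench:
    bench_id: str
    online: bool
    idle: bool
    workspaces: Set[str]
    tags: Set[str]
    health_status: Optional[str] = None  # None models a missing health report


@dataclass
class QueuedRun:
    workspace: str
    required_tags: Set[str] = field(default_factory=set)
    bench_id: Optional[str] = None  # set when the user passed --bench-id


def pick_bench(run: QueuedRun, benches: Iterable[Bench]) -> Optional[Bench]:
    """Return the first eligible bench, or None so the run stays queued."""
    for bench in benches:
        if run.bench_id is not None and bench.bench_id != run.bench_id:
            continue
        if not (bench.online and bench.idle):
            continue
        if run.workspace not in bench.workspaces:
            continue
        if not run.required_tags <= bench.tags:
            continue  # bench lacks at least one requested tag
        if bench.health_status not in SCHEDULABLE_HEALTH:
            continue  # applies even when a bench was requested by ID
        return bench
    return None
```

Returning `None` rather than falling back to a less suitable bench is what keeps the run in the queue until a bench satisfies every condition.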
Dashboard behavior¶
The dashboard shows health on bench cards:
Healthy
Degraded
Failing
Unknown
The bench health panel can show:
health summary
last checked timestamp
scheduling eligibility message
pass/warn/fail/skip counts
failing or warning diagnostic checks
categories and suggested fixes
The Runs view can also show failure source labels such as:
Firmware
Test logic
Bench infrastructure
Agent / cloud
Configuration
Unknown
Example failure explanation:
Likely source: Bench infrastructure
Category: Transport Open Failed
The physical bench, wiring, instrument, or local interface likely needs attention.
Troubleshooting unhealthy benches¶
Bench is unknown¶
Common causes:
Agent has not synced health yet
Agent is not v0.7+
health self-test is disabled
backend has not received a new bench sync
Agent token/workspace mismatch
Try:
benchci benches list
Then restart the Agent with health enabled:
export BENCHCI_AGENT_SELF_TEST_ON_STARTUP=1
export BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE=1
benchci agent cloud --backend https://api.benchci.dev --token YOUR_AGENT_TOKEN --bench bench.yaml --bench-id raspi-nucleo-demo --agent-name "Lab Agent 01"
Bench is failing¶
Run self-test manually on the hardware machine:
benchci bench self-test --bench bench.yaml --open-hardware --log-dir bench-health
Then inspect:
bench-health/self-test.log
bench-health/self-test-summary.json
bench-health/nodes/
bench-health/resources/
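A short script can make the summary file quicker to scan. The sketch below is illustrative only: the layout of `self-test-summary.json` is not documented on this page, so the `checks` list and its `name` and `status` keys are a hypothetical schema. Inspect your own file and adjust the keys before relying on it.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, List


def summarize(summary: dict) -> Dict[str, int]:
    """Count checks per status, assuming a hypothetical "checks" list."""
    checks: List[dict] = summary.get("checks", [])
    return dict(Counter(str(c.get("status", "unknown")) for c in checks))


def problem_checks(summary: dict) -> List[str]:
    """Names of checks whose (assumed) status is "warn" or "fail"."""
    return [
        str(c.get("name", "<unnamed>"))
        for c in summary.get("checks", [])
        if c.get("status") in {"warn", "fail"}
    ]


if __name__ == "__main__":
    # Path matches the --log-dir used in the manual self-test command.
    path = Path("bench-health/self-test-summary.json")
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
        print(summarize(data))
        print(problem_checks(data))
```

The names printed by `problem_checks` indicate which per-node or per-resource logs to open first.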
Typical fixes include:
reconnecting USB-UART adapters
correcting /dev/ttyUSB* paths
fixing GPIO chip/line numbers
installing missing flash tools
checking relay power and permissions
checking HTTP relay or measurement controller URLs
confirming the correct Agent token/workspace/bench ID
Bench is degraded¶
A degraded bench can still receive runs in v0.7.
Review warnings before relying on results, especially for release evidence or customer demos.
Possible causes:
optional measurement resource unavailable
optional readback unsupported
non-critical tool warning
resource warning that is not required by the current suite
Future scheduler versions may choose whether a degraded bench is acceptable based on the queued run’s exact required resources.
Reference demo bench¶
A practical starter bench for real embedded validation can consist of:
Raspberry Pi Zero 2 W
Nucleo F072RB
powered USB hub
USB relay
TTL-USB adapter
TTL-RS485 adapter
RS485-USB adapter
2 GPIO lines from Raspberry Pi to Nucleo
This type of bench can demonstrate:
UART boot validation
RS-485/Modbus tests
GPIO ready/reset checks
relay-based power cycling
Cloud Agent scheduling
dashboard health visibility
run evidence and failure classification
Start simple with UART and flashing, then add power, GPIO, RS-485, and measurements as the bench matures.