# Cloud Agent Health, Resource Locking, and Scheduling

Use this page to understand how BenchCI decides whether a cloud-connected bench is ready to receive hardware test runs.

BenchCI v0.7 adds a reliability path across the Agent, backend, scheduler, and dashboard:

```text
Agent checks bench health
        ↓
Backend stores bench health
        ↓
Scheduler avoids unhealthy benches
        ↓
Dashboard explains bench health and failure source
```

This helps users distinguish between:

- firmware failures
- test logic failures
- bench infrastructure problems
- Agent/cloud problems
- configuration problems

---

## Health states

BenchCI uses four bench health states:

| Health | Meaning | Scheduler behavior |
|---|---|---|
| `healthy` | The bench passed required health checks. | Eligible for cloud runs. |
| `degraded` | The bench has warnings or optional issues. | Eligible for cloud runs in v0.7. |
| `failing` | The bench has a health problem that can make runs unreliable. | Not scheduled. |
| `unknown` | No valid health report is available. | Not scheduled. |

For the v0.7 baseline, the scheduler assigns runs only to:

```text
healthy
degraded
```

The scheduler avoids:

```text
failing
unknown
missing health
malformed health
```

Even a specifically requested `--bench-id` cannot bypass this health filter.

---

## Agent startup self-check

When a cloud Agent starts, it can run a non-destructive self-test for the registered bench and include health in the backend sync payload.

Start a cloud Agent:

```bash
benchci agent cloud --backend https://api.benchci.dev --token YOUR_AGENT_TOKEN --bench bench.yaml --bench-id raspi-nucleo-demo --agent-name "Lab Agent 01"
```

The Agent can report:

- `health`
- `health_status`
- `health_checked_at`
- `health_summary`

The backend stores those fields and exposes them through Cloud and Dashboard APIs.

---

## Agent self-test safety boundary

Agent startup health checks are non-destructive by default.
They do **not**:

- flash firmware
- reset the target
- toggle relays
- power-cycle outlets
- drive GPIO outputs
- send UART/CAN/Modbus commands
- read measurements unless explicitly enabled

This makes startup checks safe for normal Agent operation.

---

## Agent self-test controls

Environment variables control how deep startup checks go:

```bash
BENCHCI_AGENT_SELF_TEST_ON_STARTUP=1
BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE=1
BENCHCI_AGENT_SELF_TEST_READ_INPUTS=0
BENCHCI_AGENT_SELF_TEST_READ_MEASUREMENTS=0
```

Recommended starting point:

```bash
export BENCHCI_AGENT_SELF_TEST_ON_STARTUP=1
export BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE=1
export BENCHCI_AGENT_SELF_TEST_READ_INPUTS=0
export BENCHCI_AGENT_SELF_TEST_READ_MEASUREMENTS=0
```

Enable input or measurement reads only after confirming they are safe for your bench.

---

## Health report artifacts on the Agent

Health reports are written under the Agent work directory:

```text
benchci-agent-results/
  bench-health/
    /
      self-test.log
      self-test-summary.json
      nodes/
      resources/
```

These files help lab owners debug why a bench is marked degraded or failing.

---

## Resource locking

BenchCI protects hardware resources with a file-based cross-process lock layer.

This prevents two runs on the same machine from using the same physical interface at the same time.

By default, locks are stored under:

```bash
~/.benchci/locks/
```

You can customize the lock directory:

```bash
export BENCHCI_LOCK_DIR=/path/to/locks
```

Emergency opt-out for debugging only:

```bash
export BENCHCI_DISABLE_RESOURCE_LOCKS=1
```

Do not disable locks for normal lab or CI operation.
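
This page does not describe how BenchCI implements its lock layer internally. As an illustration of the general technique only, the sketch below takes an advisory, non-blocking `flock` on one file per resource inside the lock directory, assuming a POSIX system. `resource_lock`, `ResourceLockError`, and the lock-file naming scheme are hypothetical names invented for this example; only `BENCHCI_LOCK_DIR`, `BENCHCI_DISABLE_RESOURCE_LOCKS`, and the `~/.benchci/locks/` default come from this page.

```python
import fcntl
import os
import re
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional


class ResourceLockError(RuntimeError):
    """Raised when another holder already owns the resource lock (hypothetical)."""


def lock_dir() -> Path:
    # BENCHCI_LOCK_DIR overrides the default ~/.benchci/locks/ directory.
    return Path(os.environ.get("BENCHCI_LOCK_DIR") or Path.home() / ".benchci" / "locks")


def lock_path(resource: str, directory: Path) -> Path:
    # Assumed naming scheme: replace characters that are unsafe in file names.
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", resource)
    return directory / f"{safe}.lock"


@contextmanager
def resource_lock(resource: str, directory: Optional[Path] = None) -> Iterator[None]:
    """Hold an exclusive lock on `resource` for the duration of the block."""
    if os.environ.get("BENCHCI_DISABLE_RESOURCE_LOCKS") == "1":
        # Emergency opt-out: run without taking any lock.
        yield
        return

    directory = directory or lock_dir()
    directory.mkdir(parents=True, exist_ok=True)
    fd = os.open(lock_path(resource, directory), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise ResourceLockError(f"resource busy: {resource}") from exc
        yield
    finally:
        # Closing the descriptor releases the lock if it was acquired.
        os.close(fd)
```

In this sketch, wrapping hardware access in `with resource_lock("/dev/ttyUSB0"):` lets a second run fail early with `ResourceLockError` instead of sharing the port.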
### Resources protected

Resource locking covers:

- serial transports such as UART and Modbus RTU ports
- CAN interfaces
- Modbus TCP endpoints
- flash/reset tools and probe/port usage
- GPIO lines
- power relay resources and outlets
- measurement resources

If a second run tries to use a locked resource, it should fail early with a resource-lock failure instead of producing confusing hardware results.

---

## Backend health storage

The backend stores bench health fields in the bench inventory:

```text
health_status
health_checked_at
health_summary
health_json
```

This allows Cloud APIs, the scheduler, and the dashboard to use the same health state.

You can inspect visible benches:

```bash
benchci benches list
benchci benches show raspi-nucleo-demo
```

The dashboard also shows health fields in the Benches view.

---

## Scheduler behavior

The scheduler only assigns queued runs to benches that are:

- online
- idle
- accessible to the workspace
- matching requested tags/capabilities
- healthy or degraded

A queued run remains queued if no suitable bench is available.

This is intentional. A queued run is better than assigning a run to a bench that is known to be unhealthy.

---

## Dashboard behavior

The dashboard shows health on bench cards:

```text
Healthy
Degraded
Failing
Unknown
```

The bench health panel can show:

- health summary
- last checked timestamp
- scheduling eligibility message
- pass/warn/fail/skip counts
- failing or warning diagnostic checks
- categories and suggested fixes

The Runs view can also show failure source labels such as:

```text
Firmware
Test logic
Bench infrastructure
Agent / cloud
Configuration
Unknown
```

Example failure explanation:

```text
Likely source: Bench infrastructure
Category: Transport Open Failed

The physical bench, wiring, instrument, or local interface likely needs attention.
```

---

## Troubleshooting unhealthy benches

### Bench is `unknown`

Common causes:

- Agent has not synced health yet
- Agent is not v0.7+
- health self-test is disabled
- backend has not received a new bench sync
- Agent token/workspace mismatch

Try:

```bash
benchci benches list
```

Then restart the Agent with health enabled:

```bash
export BENCHCI_AGENT_SELF_TEST_ON_STARTUP=1
export BENCHCI_AGENT_SELF_TEST_OPEN_HARDWARE=1

benchci agent cloud --backend https://api.benchci.dev --token YOUR_AGENT_TOKEN --bench bench.yaml --bench-id raspi-nucleo-demo --agent-name "Lab Agent 01"
```

### Bench is `failing`

Run self-test manually on the hardware machine:

```bash
benchci bench self-test --bench bench.yaml --open-hardware --log-dir bench-health
```

Then inspect:

```text
bench-health/self-test.log
bench-health/self-test-summary.json
bench-health/nodes/
bench-health/resources/
```

Typical fixes include:

- reconnecting USB-UART adapters
- correcting `/dev/ttyUSB*` paths
- fixing GPIO chip/line numbers
- installing missing flash tools
- checking relay power and permissions
- checking HTTP relay or measurement controller URLs
- confirming the correct Agent token/workspace/bench ID

### Bench is `degraded`

A degraded bench can still receive runs in v0.7.

Review warnings before relying on results, especially for release evidence or customer demos.

Possible causes:

- optional measurement resource unavailable
- optional readback unsupported
- non-critical tool warning
- resource warning that is not required by the current suite

Future scheduler versions may choose whether a degraded bench is acceptable based on the queued run’s exact required resources.

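
For reference, the v0.7 health filter described on this page can be sketched as below. This is not BenchCI's scheduler code: the `Bench` record, `normalize_health`, `is_schedulable`, and `pick_bench` are hypothetical names, and workspace access and tag/capability matching are left out to keep the example short. The sketch assumes that missing or malformed health values are treated the same as `unknown`, and that a requested bench ID narrows the candidates without bypassing the filter.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

KNOWN_STATES = {"healthy", "degraded", "failing", "unknown"}
SCHEDULABLE_STATES = {"healthy", "degraded"}  # v0.7 baseline from this page


def normalize_health(value: object) -> str:
    """Map missing or malformed health values to 'unknown'."""
    if isinstance(value, str) and value in KNOWN_STATES:
        return value
    return "unknown"


@dataclass
class Bench:
    """Hypothetical, simplified view of a bench inventory record."""
    bench_id: str
    online: bool
    idle: bool
    health_status: object = None


def is_schedulable(bench: Bench) -> bool:
    # Workspace access and tag/capability matching are omitted here.
    return (
        bench.online
        and bench.idle
        and normalize_health(bench.health_status) in SCHEDULABLE_STATES
    )


def pick_bench(
    benches: Iterable[Bench], requested_bench_id: Optional[str] = None
) -> Optional[Bench]:
    """Return the first eligible bench, or None so the run stays queued."""
    for bench in benches:
        if requested_bench_id is not None and bench.bench_id != requested_bench_id:
            continue
        # A requested bench ID does not bypass the health filter.
        if is_schedulable(bench):
            return bench
    return None
```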

---

## Reference demo bench

A practical starter bench for real embedded validation can be:

```text
Raspberry Pi Zero 2 W
Nucleo F072RB
powered USB hub
USB relay
TTL-USB adapter
TTL-RS485 adapter
RS485-USB adapter
2 GPIO lines from Raspberry Pi to Nucleo
```

This type of bench can demonstrate:

- UART boot validation
- RS-485/Modbus tests
- GPIO ready/reset checks
- relay-based power cycling
- Cloud Agent scheduling
- dashboard health visibility
- run evidence and failure classification

Start simple with UART and flashing, then add power, GPIO, RS-485, and measurements as the bench becomes more professional.