# Quick Reference: Worker Lifecycle & Heartbeat Validation

**Last Updated:** 2026-02-04

**Status:** Production Ready

## Overview

Workers use graceful shutdown and heartbeat validation to ensure reliable execution scheduling.

## Worker Lifecycle

### Startup

1. Load configuration
2. Connect to database and message queue
3. Detect runtime capabilities
4. Register in database (status = `Active`)
5. Start heartbeat loop
6. Start consuming execution messages

### Normal Operation

- **Heartbeat:** Updates `worker.last_heartbeat` every 30 seconds (default)
- **Status:** Remains `Active`
- **Executions:** Processes messages from worker-specific queue

### Shutdown (Graceful)

1. Receive SIGINT or SIGTERM signal
2. Stop heartbeat loop
3. Mark worker as `Inactive` in database
4. Exit cleanly

### Shutdown (Crash/Kill)

- Worker does not deregister
- Status remains `Active` in database
- Heartbeat stops updating
- **Executor detects as stale after 90 seconds**

## Heartbeat Validation

### Configuration

```yaml
worker:
  heartbeat_interval: 30  # seconds (default)
```

### Staleness Threshold

- **Formula:** `heartbeat_interval * 3 = 90 seconds`
- **Rationale:** Allows 2 missed heartbeats + buffer
- **Detection:** Executor checks on every scheduling attempt

### Worker States

| Last Heartbeat Age | Status | Schedulable |
|--------------------|--------|-------------|
| < 90 seconds       | Fresh  | ✅ Yes      |
| ≥ 90 seconds       | Stale  | ❌ No       |
| None/NULL          | Stale  | ❌ No       |

## Executor Scheduling Flow

```
Execution Requested
        ↓
Find Action Workers
        ↓
Filter by Runtime Compatibility
        ↓
Filter by Active Status
        ↓
Filter by Heartbeat Freshness ← NEW
        ↓
Select Best Worker
        ↓
Queue to Worker
```

## Signal Handling

### Supported Signals

- **SIGINT** (Ctrl+C) - Graceful shutdown
- **SIGTERM** (docker stop, k8s termination) - Graceful shutdown
- **SIGKILL** (force kill) - No cleanup possible

### Docker Example

```bash
# Graceful shutdown (10s grace period)
docker compose stop worker-shell

# Force kill (immediate)
docker compose kill worker-shell
```

### Kubernetes Example

```yaml
spec:
  terminationGracePeriodSeconds: 30  # Time for graceful shutdown
```

## Monitoring & Debugging

### Check Worker Status

```sql
SELECT id, name, status, last_heartbeat,
       EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) as seconds_ago
FROM worker
WHERE worker_role = 'action'
ORDER BY last_heartbeat DESC;
```

### Identify Stale Workers

```sql
SELECT id, name, status,
       EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) as seconds_ago
FROM worker
WHERE worker_role = 'action'
  AND status = 'active'
  AND (last_heartbeat IS NULL
       OR last_heartbeat < NOW() - INTERVAL '90 seconds');
```

### View Worker Logs

```bash
# Docker Compose
docker compose logs -f worker-shell

# Look for:
# - "Worker registered with ID: X"
# - "Heartbeat sent successfully" (debug level)
# - "Received SIGTERM signal"
# - "Deregistering worker ID: X"
```

### View Executor Logs

```bash
docker compose logs -f executor

# Look for:
# - "Worker X heartbeat is stale: last seen N seconds ago"
# - "No workers with fresh heartbeats available"
```

## Common Issues

### Issue: "No workers with fresh heartbeats available"

**Causes:**

1. All workers crashed/terminated
2. Workers paused/frozen
3. Network partition between workers and database
4. Database connection issues

**Solutions:**

1. Check if workers are running: `docker compose ps`
2. Restart workers: `docker compose restart worker-shell`
3. Check worker logs for errors
4. Verify database connectivity

### Issue: Worker not deregistering on shutdown

**Causes:**

1. SIGKILL used instead of SIGTERM
2. Grace period too short
3. Database connection lost before deregister

**Solutions:**

1. Use `docker compose stop` not `docker compose kill`
2. Increase grace period: `docker compose down -t 30`
3. Check network connectivity

### Issue: Worker stuck in Active status after crash

**Behavior:** Normal - executor will detect as stale after 90s

**Manual Cleanup (if needed):**

```sql
UPDATE worker
SET status = 'inactive'
WHERE last_heartbeat < NOW() - INTERVAL '5 minutes';
```

## Testing

### Test Graceful Shutdown

```bash
# Start worker
docker compose up -d worker-shell

# Wait for registration
sleep 5

# Check status (should be 'active')
docker compose exec postgres psql -U attune -c \
  "SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"

# Graceful shutdown
docker compose stop worker-shell

# Check status (should be 'inactive')
docker compose exec postgres psql -U attune -c \
  "SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"
```

### Test Heartbeat Validation

```bash
# Pause worker (simulate freeze)
docker compose pause worker-shell

# Wait for staleness (90+ seconds)
sleep 100

# Try to schedule execution (should fail)
# Use API or CLI to trigger execution
attune execution create --action core.echo --param message="test"

# Should see: "No workers with fresh heartbeats available"
```

## Configuration Reference

### Worker Config

```yaml
worker:
  name: "worker-01"
  heartbeat_interval: 30    # Heartbeat update frequency (seconds)
  max_concurrent_tasks: 10  # Concurrent execution limit
  task_timeout: 300         # Per-task timeout (seconds)
```

### Relevant Constants

```rust
// crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;
// Max age = 90 seconds
```

## Best Practices

1. **Use Graceful Shutdown:** Always use SIGTERM, not SIGKILL
2. **Monitor Heartbeats:** Alert when workers go stale
3. **Set Grace Periods:** Allow 10-30s for worker shutdown in production
4. **Health Checks:** Implement liveness probes in Kubernetes
5. **Auto-Restart:** Configure restart policies for crashed workers

## Related Documentation

- `work-summary/2026-02-worker-graceful-shutdown-heartbeat-validation.md` - Implementation details
- `docs/architecture/worker-service.md` - Worker architecture
- `docs/architecture/executor-service.md` - Executor architecture
- `AGENTS.md` - Project conventions

## Future Enhancements

- [ ] Configurable staleness multiplier
- [ ] Active health probing
- [ ] Graceful work completion before shutdown
- [ ] Worker reconnection logic
- [ ] Load-based worker selection
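
## Example: Staleness Check (Sketch)

The staleness rule from the Heartbeat Validation section (`heartbeat_interval * 3`, with a missing heartbeat counted as stale) can be sketched in Rust as follows. This is an illustration only, not the code in `crates/executor/src/scheduler.rs`: the function names `max_heartbeat_age` and `is_heartbeat_fresh` are hypothetical, and treating a heartbeat timestamp that lies in the future as fresh is an assumption about how clock skew might be handled.

```rust
use std::time::{Duration, SystemTime};

// Values copied from the "Relevant Constants" section.
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

/// Hypothetical helper: maximum heartbeat age before a worker counts as stale.
/// With the defaults this is 30 * 3 = 90 seconds.
fn max_heartbeat_age(heartbeat_interval_secs: u64) -> Duration {
    Duration::from_secs(heartbeat_interval_secs * HEARTBEAT_STALENESS_MULTIPLIER)
}

/// Hypothetical helper mirroring the "Worker States" table:
/// age < max_age is fresh, age >= max_age is stale, no heartbeat is stale.
fn is_heartbeat_fresh(
    last_heartbeat: Option<SystemTime>,
    now: SystemTime,
    max_age: Duration,
) -> bool {
    match last_heartbeat {
        // NULL heartbeat: never schedulable.
        None => false,
        Some(ts) => match now.duration_since(ts) {
            Ok(age) => age < max_age,
            // Assumption: a timestamp slightly in the future (clock skew)
            // is treated as fresh rather than stale.
            Err(_) => true,
        },
    }
}
```

With the default interval, `max_heartbeat_age(DEFAULT_HEARTBEAT_INTERVAL)` is 90 seconds, so a worker last seen 89 seconds ago passes the check while one last seen exactly 90 seconds ago does not. Passing `now` as a parameter instead of calling `SystemTime::now()` inside the function keeps the check deterministic in tests.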