# Worker Graceful Shutdown and Heartbeat Validation

**Date:** 2026-02-04
**Status:** Complete
**Services Modified:** `attune-worker`, `attune-executor`

## Overview

Implemented graceful shutdown handling for workers and added heartbeat validation in the executor to prevent scheduling executions to stale or unavailable workers.

## Problem Statement

Workers were not marking themselves as offline when shutting down, leading to:

- Executors attempting to schedule work to terminated workers
- Failed executions due to worker unavailability
- No validation of worker health before scheduling

## Changes Implemented

### 1. Worker Graceful Shutdown (`attune-worker`)

**File:** `crates/worker/src/main.rs`

- **Signal Handling:** Added proper handling for `SIGINT` and `SIGTERM` using tokio's Unix signal API
- **Shutdown Flow:** Workers now deregister (mark themselves as inactive) before shutdown
- **Service Lifecycle:** Separated the `start()` and `stop()` calls from the signal-handling logic

**Key Changes:**

```rust
use tokio::signal::unix::{signal, SignalKind};

// Set up signal handlers for graceful shutdown
let mut sigint = signal(SignalKind::interrupt())?;
let mut sigterm = signal(SignalKind::terminate())?;

tokio::select! {
    _ = sigint.recv() => { info!("Received SIGINT signal"); }
    _ = sigterm.recv() => { info!("Received SIGTERM signal"); }
}

// Stop the service and mark the worker as inactive
service.stop().await?;
```

**File:** `crates/worker/src/service.rs`

- **Removed:** the `run()` method, which mixed signal handling with service logic
- **Rationale:** Signal handling now lives cleanly in `main.rs`, making the service module more focused and testable

### 2. Executor Heartbeat Validation (`attune-executor`)

**File:** `crates/executor/src/scheduler.rs`

Added heartbeat freshness validation before scheduling executions to workers.
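As a rough illustration of the check described below, here is a minimal, std-only sketch. It is not the actual scheduler code: the `Worker` struct, the explicit `now` parameter, and the `SystemTime` timestamp are simplified stand-ins for the worker model's `last_heartbeat` field and the scheduler's internals.

```rust
use std::time::{Duration, SystemTime};

const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30; // seconds
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3; // max age = 90 seconds

/// Hypothetical stand-in for the worker model; the real field is
/// `last_heartbeat` on the Worker model in `crates/common/src/models.rs`.
struct Worker {
    last_heartbeat: Option<SystemTime>,
}

/// Returns true if the worker reported a heartbeat within the staleness window.
fn is_worker_heartbeat_fresh(worker: &Worker, now: SystemTime) -> bool {
    let max_age =
        Duration::from_secs(DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER);
    match worker.last_heartbeat {
        // No heartbeat recorded: treat the worker as stale.
        None => false,
        // Fresh only if the heartbeat is no older than max_age; a timestamp
        // in the future (clock skew) is treated as fresh.
        Some(ts) => now
            .duration_since(ts)
            .map(|age| age <= max_age)
            .unwrap_or(true),
    }
}

fn main() {
    let now = SystemTime::now();
    let fresh = Worker { last_heartbeat: Some(now - Duration::from_secs(30)) };
    let stale = Worker { last_heartbeat: Some(now - Duration::from_secs(100)) };
    let silent = Worker { last_heartbeat: None };
    println!(
        "{} {} {}",
        is_worker_heartbeat_fresh(&fresh, now),
        is_worker_heartbeat_fresh(&stale, now),
        is_worker_heartbeat_fresh(&silent, now)
    ); // prints "true false false"
}
```

Taking `now` as a parameter rather than calling `SystemTime::now()` inside keeps the boundary behavior (e.g. exactly 90s old) easy to unit test.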
**Constants:**

```rust
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30; // seconds
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3; // 3x interval
// Max age = 90 seconds (3 * 30s)
```

**New Function:** `is_worker_heartbeat_fresh()`

- Checks whether the worker's `last_heartbeat` timestamp exists
- Validates that the heartbeat is within `HEARTBEAT_INTERVAL * STALENESS_MULTIPLIER` (90 seconds)
- Logs a warning for stale workers
- Returns `false` if no heartbeat has been recorded

**Integration:** Added heartbeat filtering to the `select_worker()` flow:

```rust
// Filter by heartbeat freshness (only workers with recent heartbeats)
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();

if fresh_workers.is_empty() {
    return Err(anyhow::anyhow!(
        "No workers with fresh heartbeats available"
    ));
}
```

**Worker Selection Order:**

1. Filter by runtime compatibility
2. Filter by active status
3. **NEW:** Filter by heartbeat freshness
4. Select the best worker (currently the first available)

### 3. Unit Tests

**File:** `crates/executor/src/scheduler.rs`

Added comprehensive unit tests for heartbeat validation:

- `test_heartbeat_freshness_with_recent_heartbeat` - 30s old (fresh)
- `test_heartbeat_freshness_with_stale_heartbeat` - 100s old (stale)
- `test_heartbeat_freshness_at_boundary` - 90s old (boundary case)
- `test_heartbeat_freshness_with_no_heartbeat` - no heartbeat (stale)
- `test_heartbeat_freshness_with_very_recent` - 5s old (fresh)

**Test Results:** All tests pass ✅

## Technical Details

### Heartbeat Staleness Calculation

- **Default Heartbeat Interval:** 30 seconds (from `WorkerConfig::default_heartbeat_interval`)
- **Staleness Threshold:** 3x the heartbeat interval = 90 seconds
- **Rationale:** Allows for up to two missed heartbeats plus buffer time before a worker is considered stale

### Shutdown Sequence

1. Worker receives a SIGINT/SIGTERM signal
2. The signal handler triggers graceful shutdown
3.
   `service.stop()` is called:
   - Stops the heartbeat manager
   - Waits 100ms for the heartbeat to stop
   - Calls `registration.deregister()`
   - Updates the worker status to `Inactive` in the database
4. Worker exits cleanly

### Error Handling

- **Stale Workers:** The executor logs a warning and excludes them from scheduling
- **No Fresh Workers:** Execution scheduling fails with a descriptive error message
- **Heartbeat Validation:** Runs on every execution scheduling attempt

## Benefits

1. **Improved Reliability:** Prevents scheduling to dead workers
2. **Faster Failure Detection:** Workers mark themselves offline immediately on shutdown
3. **Better Observability:** Clear logging when workers are stale or unavailable
4. **Graceful Degradation:** The system continues operating with the remaining healthy workers
5. **Production Ready:** Proper signal handling for containerized environments (Docker, Kubernetes)

## Docker Compatibility

The SIGTERM handling is especially important in containerized environments:

- Docker sends SIGTERM on `docker stop`
- Kubernetes sends SIGTERM during pod termination
- Workers now have 10s (the default grace period) to mark themselves offline before a forced SIGKILL

## Configuration

No new configuration required. Uses the existing `WorkerConfig::heartbeat_interval` (default: 30s).

**Future Enhancement Opportunity:** Add a configurable staleness multiplier:

```yaml
worker:
  heartbeat_interval: 30
  heartbeat_staleness_multiplier: 3  # Optional, defaults to 3
```

## Testing Recommendations

### Manual Testing

1. **Worker Graceful Shutdown:**

   ```bash
   # Start worker
   docker compose up worker-shell

   # Send SIGTERM
   docker compose stop worker-shell

   # Verify in logs: "Deregistering worker ID: X"
   # Verify in DB: worker status = 'inactive'
   ```

2.
   **Heartbeat Validation:**

   ```bash
   # Stop worker heartbeat (simulate a crash)
   docker compose pause worker-shell

   # Wait 100 seconds

   # Attempt to schedule an execution
   # Should fail with "No workers with fresh heartbeats available"
   ```

### Integration Testing

- Test execution scheduling with stale workers
- Test execution scheduling with no workers
- Test worker restart with an existing registration
- Test multiple workers with varying heartbeat states

## Related Files

- `crates/worker/src/main.rs` - Signal handling
- `crates/worker/src/service.rs` - Service lifecycle
- `crates/worker/src/registration.rs` - Worker registration/deregistration
- `crates/worker/src/heartbeat.rs` - Heartbeat manager
- `crates/executor/src/scheduler.rs` - Execution scheduling with heartbeat validation
- `crates/common/src/config.rs` - Worker configuration (`heartbeat_interval`)
- `crates/common/src/models.rs` - Worker model (`last_heartbeat` field)

## Migration Notes

**No database migration required.** Uses the existing `worker.last_heartbeat` column.

**No configuration changes required.** Uses the existing heartbeat interval settings.

**Backward Compatible:** Works with existing workers; old workers without proper shutdown are detected as stale after 90s.

## Future Enhancements

1. **Configurable Staleness Multiplier:** Allow tuning the staleness threshold per environment
2. **Worker Health Checks:** Add active health probing beyond passive heartbeat monitoring
3. **Graceful Work Completion:** Allow in-progress executions to complete before shutdown (requires execution state tracking)
4. **Worker Reconnection:** Handle network partitions vs. actual worker failures
5.
   **Load-Based Selection:** Consider worker load alongside heartbeat freshness

## Conclusion

These changes significantly improve the robustness of the worker infrastructure by ensuring that:

- Workers cleanly deregister on shutdown
- The executor only schedules to healthy, responsive workers
- The system gracefully handles worker failures and restarts

All changes are backward compatible and require no configuration updates.