# Worker Graceful Shutdown and Heartbeat Validation
**Date:** 2026-02-04
**Status:** Complete
**Services Modified:** `attune-worker`, `attune-executor`
## Overview
Implemented graceful shutdown handling for workers and added heartbeat validation in the executor to prevent scheduling executions to stale or unavailable workers.
## Problem Statement
Workers were not properly marking themselves as offline when shutting down, leading to:
- Executors attempting to schedule work to terminated workers
- Failed executions due to worker unavailability
- No validation of worker health before scheduling
## Changes Implemented
### 1. Worker Graceful Shutdown (attune-worker)
**File:** `crates/worker/src/main.rs`
- **Signal Handling:** Added proper handling for `SIGINT` and `SIGTERM` signals using tokio's Unix signal API
- **Shutdown Flow:** Workers now properly deregister (mark themselves as inactive) before shutdown
- **Service Lifecycle:** Separated `start()` and `stop()` calls from the signal-handling logic
**Key Changes:**
```rust
use tokio::signal::unix::{signal, SignalKind};

// Set up signal handlers for graceful shutdown
let mut sigint = signal(SignalKind::interrupt())?;
let mut sigterm = signal(SignalKind::terminate())?;

tokio::select! {
    _ = sigint.recv() => {
        info!("Received SIGINT signal");
    }
    _ = sigterm.recv() => {
        info!("Received SIGTERM signal");
    }
}

// Stop the service and mark the worker as inactive
service.stop().await?;
```
**File:** `crates/worker/src/service.rs`
- **Removed:** the `run()` method that mixed signal handling with service logic
- **Rationale:** signal handling is now cleanly separated in `main.rs`, making the service module more testable and focused
### 2. Executor Heartbeat Validation (attune-executor)
**File:** `crates/executor/src/scheduler.rs`
Added heartbeat freshness validation before scheduling executions to workers.
**Constants:**
```rust
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30; // seconds
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3; // 3x interval
// Max age = 90 seconds (3 * 30s)
```
**New Function:** `is_worker_heartbeat_fresh()`
- Checks whether the worker's `last_heartbeat` timestamp exists
- Validates that the heartbeat falls within `DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER` (90 seconds)
- Logs a warning for stale workers
- Returns `false` if no heartbeat has been recorded
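A minimal sketch of what this check might look like, under stated assumptions: the `Worker` model is assumed to expose an `id` field and `last_heartbeat: Option<chrono::DateTime<chrono::Utc>>` (the actual types live in `crates/common/src/models.rs` and may differ):

```rust
use chrono::Utc;

// Sketch only; assumes a `Worker` model exposing `id` and
// `last_heartbeat: Option<chrono::DateTime<chrono::Utc>>`.
fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    let max_age_secs = DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER; // 90s
    match worker.last_heartbeat {
        Some(last) => {
            // Clamp negative ages (minor clock skew) to zero.
            let age_secs = Utc::now()
                .signed_duration_since(last)
                .num_seconds()
                .max(0) as u64;
            if age_secs > max_age_secs {
                tracing::warn!(worker_id = %worker.id, age_secs, "Worker heartbeat is stale");
                return false;
            }
            true
        }
        None => {
            tracing::warn!(worker_id = %worker.id, "Worker has no recorded heartbeat");
            false
        }
    }
}
```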
**Integration:** Added heartbeat filtering to the `select_worker()` flow:
```rust
// Filter by heartbeat freshness (only workers with recent heartbeats)
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();

if fresh_workers.is_empty() {
    return Err(anyhow::anyhow!(
        "No workers with fresh heartbeats available"
    ));
}
```
**Worker Selection Order:**
1. Filter by runtime compatibility
2. Filter by active status
3. **NEW:** Filter by heartbeat freshness
4. Select the best worker (currently the first available)
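Put together, the selection pipeline might look like the sketch below; the `runtime` and `is_active` field names are illustrative assumptions, not the actual model fields:

```rust
impl Scheduler {
    // Hypothetical end-to-end sketch of the selection order above;
    // `runtime` and `is_active` are assumed field names.
    fn select_worker(&self, workers: Vec<Worker>, required_runtime: &str) -> anyhow::Result<Worker> {
        workers
            .into_iter()
            .filter(|w| w.runtime == required_runtime)    // 1. runtime compatibility
            .filter(|w| w.is_active)                      // 2. active status
            .find(|w| Self::is_worker_heartbeat_fresh(w)) // 3 + 4. fresh heartbeat, first match
            .ok_or_else(|| anyhow::anyhow!("No workers with fresh heartbeats available"))
    }
}
```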
### 3. Unit Tests
**File:** `crates/executor/src/scheduler.rs`
Added comprehensive unit tests for heartbeat validation:
- `test_heartbeat_freshness_with_recent_heartbeat` - 30s old (fresh)
- `test_heartbeat_freshness_with_stale_heartbeat` - 100s old (stale)
- `test_heartbeat_freshness_at_boundary` - 90s old (boundary case)
- `test_heartbeat_freshness_with_no_heartbeat` - no heartbeat (stale)
- `test_heartbeat_freshness_with_very_recent` - 5s old (fresh)
**Test Results:** All tests pass ✅
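For illustration, the boundary case might be exercised roughly as follows; this sketch assumes `Worker` implements `Default` and that the boundary is inclusive (a heartbeat exactly 90s old still counts as fresh), neither of which is confirmed by the actual test code:

```rust
#[test]
fn test_heartbeat_freshness_at_boundary() {
    // Assumes an inclusive boundary and a Default impl on Worker
    // (both illustrative assumptions).
    let worker = Worker {
        last_heartbeat: Some(chrono::Utc::now() - chrono::Duration::seconds(90)),
        ..Worker::default()
    };
    assert!(Scheduler::is_worker_heartbeat_fresh(&worker));
}
```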
## Technical Details
### Heartbeat Staleness Calculation
- **Default Heartbeat Interval:** 30 seconds (from `WorkerConfig::default_heartbeat_interval`)
- **Staleness Threshold:** 3x the heartbeat interval = 90 seconds
- **Rationale:** allows for up to two missed heartbeats plus buffer time before considering a worker stale
### Shutdown Sequence
1. Worker receives a SIGINT/SIGTERM signal
2. Signal handler triggers graceful shutdown
3. `service.stop()` is called:
   - Stops the heartbeat manager
   - Waits 100ms for the heartbeat to stop
   - Calls `registration.deregister()`
   - Updates the worker status to `Inactive` in the database
4. Worker exits cleanly
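A sketch of what `stop()` could look like given these steps; the `heartbeat` and `registration` handles are named here for illustration and may not match the actual fields in `service.rs`:

```rust
// Illustrative sketch of the shutdown flow described above;
// `self.heartbeat` and `self.registration` are assumed field names.
pub async fn stop(&mut self) -> anyhow::Result<()> {
    // Stop the heartbeat manager so no further heartbeats are sent.
    self.heartbeat.stop();

    // Give any in-flight heartbeat a moment to finish.
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;

    // Mark the worker as Inactive in the database.
    self.registration.deregister().await?;

    tracing::info!("Worker deregistered, exiting cleanly");
    Ok(())
}
```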
### Error Handling
- **Stale Workers:** the executor logs a warning and excludes them from scheduling
- **No Fresh Workers:** execution scheduling fails with a descriptive error message
- **Heartbeat Validation:** runs on every execution scheduling attempt
## Benefits
- **Improved Reliability:** prevents scheduling to dead workers
- **Faster Failure Detection:** workers mark themselves offline immediately on shutdown
- **Better Observability:** clear logging when workers are stale or unavailable
- **Graceful Degradation:** the system continues operating with the remaining healthy workers
- **Production Ready:** proper signal handling for containerized environments (Docker, Kubernetes)
## Docker Compatibility
The SIGTERM handling is especially important for containerized environments:
- Docker sends SIGTERM on `docker stop`
- Kubernetes sends SIGTERM during pod termination
- Workers now have the default grace period (10s) to mark themselves offline before a forced SIGKILL
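If deregistration ever needs more headroom than the default 10s, the grace period can be extended; for example, in Docker Compose (an optional tweak, not required by this change):

```yaml
services:
  worker-shell:
    # Give the worker extra time to deregister before SIGKILL
    stop_grace_period: 30s
```

The Kubernetes equivalent is `terminationGracePeriodSeconds` in the pod spec.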
## Configuration
No new configuration required. Uses the existing `WorkerConfig::heartbeat_interval` (default: 30s).
**Future Enhancement Opportunity:** add a configurable staleness multiplier:
```yaml
worker:
  heartbeat_interval: 30
  heartbeat_staleness_multiplier: 3  # Optional, defaults to 3
```
## Testing Recommendations
### Manual Testing
1. **Worker Graceful Shutdown:**

   ```bash
   # Start worker
   docker compose up worker-shell

   # Send SIGTERM
   docker compose stop worker-shell

   # Verify in logs: "Deregistering worker ID: X"
   # Verify in DB: worker status = 'inactive'
   ```

2. **Heartbeat Validation:**

   ```bash
   # Stop worker heartbeat (simulate crash)
   docker compose pause worker-shell

   # Wait 100 seconds

   # Attempt to schedule an execution
   # Should fail with "No workers with fresh heartbeats available"
   ```
### Integration Testing
- Test execution scheduling with stale workers
- Test execution scheduling with no workers
- Test worker restart with existing registration
- Test multiple workers with varying heartbeat states
## Related Files
- `crates/worker/src/main.rs` - Signal handling
- `crates/worker/src/service.rs` - Service lifecycle
- `crates/worker/src/registration.rs` - Worker registration/deregistration
- `crates/worker/src/heartbeat.rs` - Heartbeat manager
- `crates/executor/src/scheduler.rs` - Execution scheduling with heartbeat validation
- `crates/common/src/config.rs` - Worker configuration (`heartbeat_interval`)
- `crates/common/src/models.rs` - Worker model (`last_heartbeat` field)
## Migration Notes
- No database migration required. Uses the existing `worker.last_heartbeat` column.
- No configuration changes required. Uses the existing heartbeat interval settings.
- **Backward Compatible:** works with existing workers; old workers without proper shutdown will be detected as stale after 90s.
## Future Enhancements
- **Configurable Staleness Multiplier:** allow tuning the staleness threshold per environment
- **Worker Health Checks:** add active health probing beyond passive heartbeat monitoring
- **Graceful Work Completion:** allow in-progress executions to complete before shutdown (requires execution state tracking)
- **Worker Reconnection:** distinguish network partitions from actual worker failures
- **Load-Based Selection:** consider worker load alongside heartbeat freshness
## Conclusion
These changes significantly improve the robustness of the worker infrastructure by ensuring:
- Workers cleanly deregister on shutdown
- Executor only schedules to healthy, responsive workers
- System gracefully handles worker failures and restarts
All changes are backward compatible and require no configuration updates.