Files
attune/work-summary/2026-02-worker-graceful-shutdown-heartbeat-validation.md

7.6 KiB

Worker Graceful Shutdown and Heartbeat Validation

Date: 2026-02-04
Status: Complete
Services Modified: attune-worker, attune-executor

Overview

Implemented graceful shutdown handling for workers and added heartbeat validation in the executor to prevent scheduling executions to stale or unavailable workers.

Problem Statement

Workers were not properly marking themselves as offline when shutting down, leading to:

  • Executors attempting to schedule work to terminated workers
  • Failed executions due to worker unavailability
  • No validation of worker health before scheduling

Changes Implemented

1. Worker Graceful Shutdown (attune-worker)

File: crates/worker/src/main.rs

  • Signal Handling: Added proper handling for SIGINT and SIGTERM signals using tokio's Unix signal API
  • Shutdown Flow: Workers now properly deregister (mark as inactive) before shutdown
  • Service Lifecycle: Separated start() and stop() calls from signal handling logic

Key Changes:

// Setup signal handlers for graceful shutdown
let mut sigint = signal(SignalKind::interrupt())?;
let mut sigterm = signal(SignalKind::terminate())?;

tokio::select! {
    _ = sigint.recv() => {
        info!("Received SIGINT signal");
    }
    _ = sigterm.recv() => {
        info!("Received SIGTERM signal");
    }
}

// Stop the service and mark worker as inactive
service.stop().await?;

File: crates/worker/src/service.rs

  • Removed: run() method that mixed signal handling with service logic
  • Rationale: Signal handling is now cleanly separated in main.rs, making the service module more testable and focused

2. Executor Heartbeat Validation (attune-executor)

File: crates/executor/src/scheduler.rs

Added heartbeat freshness validation before scheduling executions to workers.

Constants:

const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;          // seconds
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;        // 3x interval
// Max age = 90 seconds (3 * 30s)

New Function: is_worker_heartbeat_fresh()

  • Checks if worker's last_heartbeat timestamp exists
  • Validates heartbeat is within HEARTBEAT_INTERVAL * STALENESS_MULTIPLIER (90 seconds)
  • Logs warnings for stale workers
  • Returns false if no heartbeat recorded

Integration: Added heartbeat filtering in select_worker() flow:

// Filter by heartbeat freshness (only workers with recent heartbeats)
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();

if fresh_workers.is_empty() {
    return Err(anyhow::anyhow!(
        "No workers with fresh heartbeats available"
    ));
}

Worker Selection Order:

  1. Filter by runtime compatibility
  2. Filter by active status
  3. NEW: Filter by heartbeat freshness
  4. Select best worker (currently first available)

3. Unit Tests

File: crates/executor/src/scheduler.rs

Added comprehensive unit tests for heartbeat validation:

  • test_heartbeat_freshness_with_recent_heartbeat - 30s old (fresh)
  • test_heartbeat_freshness_with_stale_heartbeat - 100s old (stale)
  • test_heartbeat_freshness_at_boundary - 90s old (boundary case)
  • test_heartbeat_freshness_with_no_heartbeat - no heartbeat (stale)
  • test_heartbeat_freshness_with_very_recent - 5s old (fresh)

Test Results: All 6 tests pass

Technical Details

Heartbeat Staleness Calculation

  • Default Heartbeat Interval: 30 seconds (from WorkerConfig::default_heartbeat_interval)
  • Staleness Threshold: 3x heartbeat interval = 90 seconds
  • Rationale: Allows for up to 2 missed heartbeats plus buffer time before considering worker stale

Shutdown Sequence

  1. Worker receives SIGINT/SIGTERM signal
  2. Signal handler triggers graceful shutdown
  3. service.stop() is called:
    • Stops heartbeat manager
    • Waits 100ms for heartbeat to stop
    • Calls registration.deregister()
    • Updates worker status to Inactive in database
  4. Worker exits cleanly

Error Handling

  • Stale Workers: Executor logs warning and excludes from scheduling
  • No Fresh Workers: Execution scheduling fails with descriptive error message
  • Heartbeat Validation: Runs on every execution scheduling attempt

Benefits

  1. Improved Reliability: Prevents scheduling to dead workers
  2. Faster Failure Detection: Workers mark themselves offline immediately on shutdown
  3. Better Observability: Clear logging when workers are stale or unavailable
  4. Graceful Degradation: System continues operating with remaining healthy workers
  5. Production Ready: Proper signal handling for containerized environments (Docker, Kubernetes)

Docker Compatibility

The SIGTERM handling is especially important for containerized environments:

  • Docker sends SIGTERM on docker stop
  • Kubernetes sends SIGTERM during pod termination
  • Workers now have 10s (default grace period) to mark themselves offline before forced SIGKILL

Configuration

No new configuration required. Uses existing WorkerConfig::heartbeat_interval (default: 30s).

Future Enhancement Opportunity: Add configurable staleness multiplier:

worker:
  heartbeat_interval: 30
  heartbeat_staleness_multiplier: 3  # Optional, defaults to 3

Testing Recommendations

Manual Testing

  1. Worker Graceful Shutdown:

    # Start worker
    docker compose up worker-shell
    
    # Send SIGTERM
    docker compose stop worker-shell
    
    # Verify in logs: "Deregistering worker ID: X"
    # Verify in DB: worker status = 'inactive'
    
  2. Heartbeat Validation:

    # Stop worker heartbeat (simulate crash)
    docker compose pause worker-shell
    
    # Wait 100 seconds
    
    # Attempt to schedule execution
    # Should fail with "No workers with fresh heartbeats available"
    

Integration Testing

  • Test execution scheduling with stale workers
  • Test execution scheduling with no workers
  • Test worker restart with existing registration
  • Test multiple workers with varying heartbeat states
  • crates/worker/src/main.rs - Signal handling
  • crates/worker/src/service.rs - Service lifecycle
  • crates/worker/src/registration.rs - Worker registration/deregistration
  • crates/worker/src/heartbeat.rs - Heartbeat manager
  • crates/executor/src/scheduler.rs - Execution scheduling with heartbeat validation
  • crates/common/src/config.rs - Worker configuration (heartbeat_interval)
  • crates/common/src/models.rs - Worker model (last_heartbeat field)

Migration Notes

No database migration required. Uses existing worker.last_heartbeat column.

No configuration changes required. Uses existing heartbeat interval settings.

Backward Compatible: Works with existing workers; old workers without proper shutdown will be detected as stale after 90s.

Future Enhancements

  1. Configurable Staleness Multiplier: Allow tuning staleness threshold per environment
  2. Worker Health Checks: Add active health probing beyond passive heartbeat monitoring
  3. Graceful Work Completion: Allow in-progress executions to complete before shutdown (requires execution state tracking)
  4. Worker Reconnection: Handle network partitions vs. actual worker failures
  5. Load-Based Selection: Consider worker load alongside heartbeat freshness

Conclusion

These changes significantly improve the robustness of the worker infrastructure by ensuring:

  • Workers cleanly deregister on shutdown
  • Executor only schedules to healthy, responsive workers
  • System gracefully handles worker failures and restarts

All changes are backward compatible and require no configuration updates.