# Worker Graceful Shutdown and Heartbeat Validation

**Date:** 2026-02-04

**Status:** Complete

**Services Modified:** `attune-worker`, `attune-executor`

## Overview

Implemented graceful shutdown handling for workers and added heartbeat validation in the executor to prevent scheduling executions to stale or unavailable workers.

## Problem Statement

Workers were not marking themselves offline when shutting down, leading to:

- Executors attempting to schedule work to terminated workers
- Failed executions due to worker unavailability
- No validation of worker health before scheduling

## Changes Implemented
### 1. Worker Graceful Shutdown (`attune-worker`)

**File:** `crates/worker/src/main.rs`

- **Signal Handling:** Added handling for `SIGINT` and `SIGTERM` using tokio's Unix signal API
- **Shutdown Flow:** Workers now deregister (mark themselves inactive) before exiting
- **Service Lifecycle:** Separated the `start()` and `stop()` calls from the signal-handling logic

**Key Changes:**

```rust
// Set up signal handlers for graceful shutdown
let mut sigint = signal(SignalKind::interrupt())?;
let mut sigterm = signal(SignalKind::terminate())?;

tokio::select! {
    _ = sigint.recv() => {
        info!("Received SIGINT signal");
    }
    _ = sigterm.recv() => {
        info!("Received SIGTERM signal");
    }
}

// Stop the service and mark the worker as inactive
service.stop().await?;
```

**File:** `crates/worker/src/service.rs`

- **Removed:** the `run()` method that mixed signal handling with service logic
- **Rationale:** Signal handling now lives entirely in `main.rs`, leaving the service module more focused and testable

### 2. Executor Heartbeat Validation (`attune-executor`)

**File:** `crates/executor/src/scheduler.rs`

Added heartbeat freshness validation before scheduling executions to workers.

**Constants:**

```rust
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30; // seconds
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3; // 3x interval
// Max age = 90 seconds (3 * 30s)
```

**New Function:** `is_worker_heartbeat_fresh()`

- Checks whether the worker's `last_heartbeat` timestamp exists
- Validates that the heartbeat is within `HEARTBEAT_INTERVAL * STALENESS_MULTIPLIER` (90 seconds)
- Logs a warning for each stale worker
- Returns `false` if no heartbeat has been recorded
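
For reference, the check can be sketched as a small self-contained function. The `Worker` struct below is a hypothetical stand-in for the real model in `crates/common/src/models.rs`, and `eprintln!` stands in for the tracing-based warning:

```rust
use std::time::{Duration, SystemTime};

const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30; // seconds
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3; // 3x interval

/// Hypothetical minimal worker record (stand-in for the real model).
struct Worker {
    id: u64,
    last_heartbeat: Option<SystemTime>,
}

/// Returns true when the worker heartbeated within the staleness window
/// (interval * multiplier = 90s). A missing heartbeat counts as stale.
fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    let max_age =
        Duration::from_secs(DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER);
    match worker.last_heartbeat {
        Some(ts) => match ts.elapsed() {
            Ok(age) => age <= max_age,
            // A timestamp slightly in the future (clock skew) counts as fresh.
            Err(_) => true,
        },
        None => {
            eprintln!("worker {} has no recorded heartbeat, treating as stale", worker.id);
            false
        }
    }
}
```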

**Integration:** Added heartbeat filtering to the `select_worker()` flow:

```rust
// Filter by heartbeat freshness (keep only workers with recent heartbeats)
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();

if fresh_workers.is_empty() {
    return Err(anyhow::anyhow!(
        "No workers with fresh heartbeats available"
    ));
}
```


**Worker Selection Order:**

1. Filter by runtime compatibility
2. Filter by active status
3. **NEW:** Filter by heartbeat freshness
4. Select the best worker (currently the first available)

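The selection order can be illustrated with a compact sketch. The `Worker` fields and the `select_worker` signature here are simplified assumptions; the real implementation returns `anyhow::Result` and checks the heartbeat timestamp rather than a precomputed boolean:

```rust
#[derive(Clone, Debug)]
struct Worker {
    id: u64,
    runtime: String,       // e.g. "shell"
    active: bool,
    heartbeat_fresh: bool, // stand-in for the timestamp-based check
}

/// Applies the selection steps in order and returns the first surviving worker.
fn select_worker(workers: Vec<Worker>, runtime: &str) -> Result<Worker, String> {
    workers
        .into_iter()
        .filter(|w| w.runtime == runtime) // 1. runtime compatibility
        .filter(|w| w.active)             // 2. active status
        .find(|w| w.heartbeat_fresh)      // 3. heartbeat freshness; 4. first available
        .ok_or_else(|| "No workers with fresh heartbeats available".to_string())
}
```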
### 3. Unit Tests

**File:** `crates/executor/src/scheduler.rs`

Added comprehensive unit tests for heartbeat validation:

- `test_heartbeat_freshness_with_recent_heartbeat` - 30s old (fresh)
- `test_heartbeat_freshness_with_stale_heartbeat` - 100s old (stale)
- `test_heartbeat_freshness_at_boundary` - 90s old (boundary case)
- `test_heartbeat_freshness_with_no_heartbeat` - no heartbeat (stale)
- `test_heartbeat_freshness_with_very_recent` - 5s old (fresh)

**Test Results:** All 6 tests pass ✅

## Technical Details
### Heartbeat Staleness Calculation

- **Default Heartbeat Interval:** 30 seconds (from `WorkerConfig::default_heartbeat_interval`)
- **Staleness Threshold:** 3x the heartbeat interval = 90 seconds
- **Rationale:** Allows up to two missed heartbeats plus buffer time before a worker is considered stale

### Shutdown Sequence

1. Worker receives a SIGINT/SIGTERM signal
2. The signal handler triggers graceful shutdown
3. `service.stop()` is called, which:
   - Stops the heartbeat manager
   - Waits 100ms for the heartbeat loop to stop
   - Calls `registration.deregister()`
   - Updates the worker status to `Inactive` in the database
4. The worker exits cleanly

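The sequence in step 3 can be sketched with stub types. The real service is async (tokio) and persists the status change to the database, so treat this synchronous version as illustrative only; `HeartbeatManager`, `Registration`, and `WorkerStatus` are simplified stand-ins:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum WorkerStatus {
    Active,
    Inactive,
}

/// Stubbed heartbeat manager: a flag that the heartbeat loop would poll.
struct HeartbeatManager {
    running: AtomicBool,
}

impl HeartbeatManager {
    fn stop(&self) {
        self.running.store(false, Ordering::SeqCst);
    }
}

/// Stubbed registration; the real `deregister()` updates the database row.
struct Registration {
    status: WorkerStatus,
}

impl Registration {
    fn deregister(&mut self) {
        self.status = WorkerStatus::Inactive;
    }
}

struct WorkerService {
    heartbeat: HeartbeatManager,
    registration: Registration,
}

impl WorkerService {
    /// Mirrors the shutdown sub-steps: stop heartbeats, wait briefly for
    /// the loop to observe the flag, then mark the worker inactive.
    fn stop(&mut self) {
        self.heartbeat.stop();
        thread::sleep(Duration::from_millis(100));
        self.registration.deregister();
    }
}
```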
### Error Handling

- **Stale Workers:** The executor logs a warning and excludes them from scheduling
- **No Fresh Workers:** Execution scheduling fails with a descriptive error message
- **Heartbeat Validation:** Runs on every execution-scheduling attempt

## Benefits

1. **Improved Reliability:** Prevents scheduling to dead workers
2. **Faster Failure Detection:** Workers mark themselves offline immediately on shutdown
3. **Better Observability:** Clear logging when workers are stale or unavailable
4. **Graceful Degradation:** The system continues operating with the remaining healthy workers
5. **Production Ready:** Proper signal handling for containerized environments (Docker, Kubernetes)

## Docker Compatibility

SIGTERM handling is especially important in containerized environments:

- Docker sends SIGTERM on `docker stop`
- Kubernetes sends SIGTERM during pod termination
- Workers now use the grace period before the forced SIGKILL (Docker's default stop timeout is 10s; Kubernetes' default termination grace period is 30s) to mark themselves offline

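If the default stop timeout ever proves too short for deregistration under load, Compose allows extending it per service. The service and image names below are assumptions based on the testing section:

```yaml
services:
  worker-shell:
    image: attune-worker   # hypothetical image name
    stop_grace_period: 30s # time between SIGTERM and SIGKILL (Compose default: 10s)
```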
## Configuration

No new configuration is required. Uses the existing `WorkerConfig::heartbeat_interval` (default: 30s).

**Future Enhancement Opportunity:** Add a configurable staleness multiplier:

```yaml
worker:
  heartbeat_interval: 30
  heartbeat_staleness_multiplier: 3  # Optional, defaults to 3
```

## Testing Recommendations
### Manual Testing

1. **Worker Graceful Shutdown:**

   ```bash
   # Start worker
   docker compose up worker-shell

   # Send SIGTERM
   docker compose stop worker-shell

   # Verify in logs: "Deregistering worker ID: X"
   # Verify in DB: worker status = 'inactive'
   ```

2. **Heartbeat Validation:**

   ```bash
   # Stop worker heartbeat (simulate a crash)
   docker compose pause worker-shell

   # Wait 100 seconds

   # Attempt to schedule an execution
   # Should fail with "No workers with fresh heartbeats available"
   ```

### Integration Testing

- Test execution scheduling with stale workers
- Test execution scheduling with no workers
- Test worker restart with an existing registration
- Test multiple workers with varying heartbeat states

## Related Files

- `crates/worker/src/main.rs` - Signal handling
- `crates/worker/src/service.rs` - Service lifecycle
- `crates/worker/src/registration.rs` - Worker registration/deregistration
- `crates/worker/src/heartbeat.rs` - Heartbeat manager
- `crates/executor/src/scheduler.rs` - Execution scheduling with heartbeat validation
- `crates/common/src/config.rs` - Worker configuration (`heartbeat_interval`)
- `crates/common/src/models.rs` - Worker model (`last_heartbeat` field)

## Migration Notes

**No database migration required.** Uses the existing `worker.last_heartbeat` column.

**No configuration changes required.** Uses the existing heartbeat interval settings.

**Backward Compatible:** Works with existing workers; old workers that do not shut down cleanly are detected as stale after 90s.

## Future Enhancements

1. **Configurable Staleness Multiplier:** Allow tuning the staleness threshold per environment
2. **Worker Health Checks:** Add active health probing beyond passive heartbeat monitoring
3. **Graceful Work Completion:** Allow in-progress executions to complete before shutdown (requires execution state tracking)
4. **Worker Reconnection:** Distinguish network partitions from actual worker failures
5. **Load-Based Selection:** Consider worker load alongside heartbeat freshness

## Conclusion

These changes significantly improve the robustness of the worker infrastructure by ensuring that:

- Workers cleanly deregister on shutdown
- The executor only schedules to healthy, responsive workers
- The system gracefully handles worker failures and restarts

All changes are backward compatible and require no configuration updates.