working out the worker/execution interface
# Worker Graceful Shutdown and Heartbeat Validation

**Date:** 2026-02-04
**Status:** Complete
**Services Modified:** `attune-worker`, `attune-executor`
## Overview

Implemented graceful shutdown handling for workers and added heartbeat validation in the executor to prevent scheduling executions to stale or unavailable workers.
## Problem Statement

Workers were not marking themselves as offline when shutting down, leading to:

- Executors attempting to schedule work to terminated workers
- Failed executions due to worker unavailability
- No validation of worker health before scheduling
## Changes Implemented

### 1. Worker Graceful Shutdown (`attune-worker`)

**File:** `crates/worker/src/main.rs`

- **Signal Handling:** Added handling for `SIGINT` and `SIGTERM` signals using tokio's Unix signal API
- **Shutdown Flow:** Workers now deregister (mark themselves as inactive) before shutdown
- **Service Lifecycle:** Separated `start()` and `stop()` calls from the signal-handling logic

**Key Changes:**
```rust
use tokio::signal::unix::{signal, SignalKind};

// Set up signal handlers for graceful shutdown
let mut sigint = signal(SignalKind::interrupt())?;
let mut sigterm = signal(SignalKind::terminate())?;

tokio::select! {
    _ = sigint.recv() => {
        info!("Received SIGINT signal");
    }
    _ = sigterm.recv() => {
        info!("Received SIGTERM signal");
    }
}

// Stop the service and mark the worker as inactive
service.stop().await?;
```

**File:** `crates/worker/src/service.rs`

- **Removed:** the `run()` method, which mixed signal handling with service logic
- **Rationale:** Signal handling is now cleanly separated in `main.rs`, making the service module more focused and testable
### 2. Executor Heartbeat Validation (`attune-executor`)

**File:** `crates/executor/src/scheduler.rs`

Added heartbeat freshness validation before scheduling executions to workers.

**Constants:**
```rust
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30; // seconds
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3; // 3x interval
// Max age = 90 seconds (3 * 30s)
```
**New Function:** `is_worker_heartbeat_fresh()`

- Checks whether the worker's `last_heartbeat` timestamp exists
- Validates that the heartbeat is within `HEARTBEAT_INTERVAL * STALENESS_MULTIPLIER` (90 seconds)
- Logs a warning for stale workers
- Returns `false` if no heartbeat has been recorded
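The function body is not reproduced in this summary; a minimal, hypothetical sketch of the freshness check under the constants above (the `Worker` struct here is an illustrative stand-in for the model in `crates/common/src/models.rs`) might look like:

```rust
use std::time::{Duration, SystemTime};

const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30; // seconds
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3; // 3x interval

// Hypothetical minimal worker shape for illustration.
struct Worker {
    last_heartbeat: Option<SystemTime>,
}

// Returns true when the worker reported a heartbeat within the
// staleness window (interval * multiplier, i.e. 90s by default).
fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    let max_age =
        Duration::from_secs(DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER);
    match worker.last_heartbeat {
        None => false, // no heartbeat ever recorded: treat as stale
        Some(ts) => match SystemTime::now().duration_since(ts) {
            Ok(age) => age <= max_age,
            // Timestamp slightly in the future (clock skew): treat as fresh.
            Err(_) => true,
        },
    }
}
```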
**Integration:** Added heartbeat filtering in the `select_worker()` flow:
```rust
// Filter by heartbeat freshness (only workers with recent heartbeats)
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();

if fresh_workers.is_empty() {
    return Err(anyhow::anyhow!(
        "No workers with fresh heartbeats available"
    ));
}
```
**Worker Selection Order:**

1. Filter by runtime compatibility
2. Filter by active status
3. **NEW:** Filter by heartbeat freshness
4. Select best worker (currently first available)
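The ordering above can be sketched as a single iterator pipeline; the worker fields here (`runtime`, `active`, `fresh`) are illustrative stand-ins for the real model and checks:

```rust
#[derive(Clone, Debug)]
struct Worker {
    runtime: String,
    active: bool,
    fresh: bool, // result of the heartbeat-freshness check
}

// Sketch of the documented selection order: runtime -> active ->
// heartbeat freshness, then take the first remaining worker.
fn select_worker(workers: Vec<Worker>, runtime: &str) -> Option<Worker> {
    workers
        .into_iter()
        .filter(|w| w.runtime == runtime) // 1. runtime compatibility
        .filter(|w| w.active)             // 2. active status
        .filter(|w| w.fresh)              // 3. NEW: heartbeat freshness
        .next()                           // 4. first available worker
}
```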
### 3. Unit Tests

**File:** `crates/executor/src/scheduler.rs`

Added unit tests for heartbeat validation:

- `test_heartbeat_freshness_with_recent_heartbeat` - 30s old (fresh)
- `test_heartbeat_freshness_with_stale_heartbeat` - 100s old (stale)
- `test_heartbeat_freshness_at_boundary` - 90s old (boundary case)
- `test_heartbeat_freshness_with_no_heartbeat` - no heartbeat (stale)
- `test_heartbeat_freshness_with_very_recent` - 5s old (fresh)

**Test Results:** All 6 tests pass ✅
## Technical Details

### Heartbeat Staleness Calculation

- **Default Heartbeat Interval:** 30 seconds (from `WorkerConfig::default_heartbeat_interval`)
- **Staleness Threshold:** 3x heartbeat interval = 90 seconds
- **Rationale:** Allows up to two missed heartbeats plus buffer time before a worker is considered stale
### Shutdown Sequence

1. Worker receives a SIGINT/SIGTERM signal
2. Signal handler triggers graceful shutdown
3. `service.stop()` is called:
   - Stops the heartbeat manager
   - Waits 100ms for the heartbeat to stop
   - Calls `registration.deregister()`
   - Updates worker status to `Inactive` in the database
4. Worker exits cleanly
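The sequence can be sketched synchronously with hypothetical stand-ins for the heartbeat manager and registration types (the real `stop()` is async and the deregistration issues a database update):

```rust
use std::{thread, time::Duration};

// Hypothetical stand-ins for the worker service's collaborators.
struct HeartbeatManager { running: bool }
impl HeartbeatManager {
    fn stop(&mut self) { self.running = false; }
}

#[derive(Debug, PartialEq)]
enum WorkerStatus { Active, Inactive }

struct Registration { status: WorkerStatus }
impl Registration {
    // The real version updates the database; here it flips a flag.
    fn deregister(&mut self) { self.status = WorkerStatus::Inactive; }
}

struct WorkerService { heartbeat: HeartbeatManager, registration: Registration }
impl WorkerService {
    // Mirrors the documented shutdown sequence.
    fn stop(&mut self) {
        self.heartbeat.stop();                     // stop the heartbeat manager
        thread::sleep(Duration::from_millis(100)); // let the loop wind down
        self.registration.deregister();            // mark the worker Inactive
    }
}
```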
### Error Handling

- **Stale Workers:** The executor logs a warning and excludes them from scheduling
- **No Fresh Workers:** Execution scheduling fails with a descriptive error message
- **Heartbeat Validation:** Runs on every execution scheduling attempt
## Benefits

1. **Improved Reliability:** Prevents scheduling to dead workers
2. **Faster Failure Detection:** Workers mark themselves offline immediately on shutdown
3. **Better Observability:** Clear logging when workers are stale or unavailable
4. **Graceful Degradation:** System continues operating with the remaining healthy workers
5. **Production Ready:** Proper signal handling for containerized environments (Docker, Kubernetes)
## Docker Compatibility

The SIGTERM handling is especially important in containerized environments:

- Docker sends SIGTERM on `docker stop`
- Kubernetes sends SIGTERM during pod termination
- Workers now use the stop grace period (10s by default for Docker, 30s for Kubernetes) to mark themselves offline before a forced SIGKILL
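If the default grace period is too tight for a busy worker, Compose allows extending it per service. This override is illustrative; `worker-shell` matches the service name used in the manual testing steps below:

```yaml
# Hypothetical compose override: give the worker more time to deregister
services:
  worker-shell:
    stop_grace_period: 15s  # Docker's default is 10s
```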
## Configuration

No new configuration is required. Uses the existing `WorkerConfig::heartbeat_interval` (default: 30s).

**Future Enhancement Opportunity:** Add a configurable staleness multiplier:
```yaml
worker:
  heartbeat_interval: 30
  heartbeat_staleness_multiplier: 3  # Optional, defaults to 3
```
## Testing Recommendations

### Manual Testing

1. **Worker Graceful Shutdown:**

   ```bash
   # Start the worker
   docker compose up worker-shell

   # Send SIGTERM
   docker compose stop worker-shell

   # Verify in the logs: "Deregistering worker ID: X"
   # Verify in the DB: worker status = 'inactive'
   ```
2. **Heartbeat Validation:**

   ```bash
   # Stop the worker's heartbeat (simulate a crash)
   docker compose pause worker-shell

   # Wait 100 seconds (past the 90s staleness threshold)

   # Attempt to schedule an execution
   # Should fail with "No workers with fresh heartbeats available"
   ```
### Integration Testing

- Test execution scheduling with stale workers
- Test execution scheduling with no workers
- Test worker restart with an existing registration
- Test multiple workers with varying heartbeat states
## Related Files

- `crates/worker/src/main.rs` - Signal handling
- `crates/worker/src/service.rs` - Service lifecycle
- `crates/worker/src/registration.rs` - Worker registration/deregistration
- `crates/worker/src/heartbeat.rs` - Heartbeat manager
- `crates/executor/src/scheduler.rs` - Execution scheduling with heartbeat validation
- `crates/common/src/config.rs` - Worker configuration (`heartbeat_interval`)
- `crates/common/src/models.rs` - Worker model (`last_heartbeat` field)
## Migration Notes

**No database migration required.** Uses the existing `worker.last_heartbeat` column.

**No configuration changes required.** Uses the existing heartbeat interval settings.

**Backward Compatible:** Works with existing workers; workers that shut down without deregistering are detected as stale after 90s.
## Future Enhancements

1. **Configurable Staleness Multiplier:** Allow tuning the staleness threshold per environment
2. **Worker Health Checks:** Add active health probing beyond passive heartbeat monitoring
3. **Graceful Work Completion:** Allow in-progress executions to complete before shutdown (requires execution state tracking)
4. **Worker Reconnection:** Distinguish network partitions from actual worker failures
5. **Load-Based Selection:** Consider worker load alongside heartbeat freshness
## Conclusion

These changes improve the robustness of the worker infrastructure by ensuring that:

- Workers cleanly deregister on shutdown
- The executor only schedules to healthy, responsive workers
- The system gracefully handles worker failures and restarts

All changes are backward compatible and require no configuration updates.