# Worker Availability Handling - Gap Analysis

**Date**: 2026-02-09
**Status**: Investigation Complete - Implementation Pending
**Priority**: High
**Impact**: Operational Reliability

## Issue Reported

User reported that when workers are brought down (e.g., `docker compose down worker-shell`), the executor continues attempting to send executions to the unavailable workers, resulting in stuck executions that never complete or fail.

## Investigation Summary

Investigated the executor's worker selection and scheduling logic to understand how worker availability is determined and what happens when workers become unavailable.

### Current Architecture

**Heartbeat-Based Availability:**

- Workers send heartbeats to the database every 30 seconds (configurable)
- The scheduler filters workers based on heartbeat freshness
- Workers are considered "stale" if the heartbeat is older than 90 seconds (3x the heartbeat interval)
- Only workers with fresh heartbeats are eligible for scheduling

**Scheduling Flow:**

```
Execution (REQUESTED)
  → Scheduler finds worker with fresh heartbeat
  → Execution status updated to SCHEDULED
  → Message published to worker-specific queue
  → Worker consumes and executes
```

### Root Causes Identified

1. **Heartbeat Staleness Window**: Workers can stop within the 90-second staleness window and still appear "available"
   - Worker sends heartbeat at T=0
   - Worker stops at T=30
   - Scheduler can still select this worker until T=90
   - 60-second window where a dead worker appears healthy

2. **No Execution Timeout**: Once scheduled, executions have no timeout mechanism
   - Execution remains in SCHEDULED status indefinitely
   - No background process monitors scheduled executions
   - No automatic failure after a reasonable time period
3. **Message Queue Accumulation**: Messages sit in worker-specific queues forever
   - Worker-specific queues: `attune.execution.worker.{worker_id}`
   - No TTL configured on these queues
   - No dead letter queue (DLQ) for expired messages
   - Messages never expire even if the worker is permanently down

4. **No Graceful Shutdown**: Workers don't update their status when stopping
   - Docker's SIGTERM signal is not handled
   - Worker status remains "active" in the database
   - No notification that the worker is shutting down

5. **Retry Logic Issues**: Failed scheduling doesn't trigger meaningful retries
   - The scheduler returns an error if no workers are available
   - The error triggers a message requeue (via nack)
   - But if a worker WAS available during scheduling, the message is successfully published
   - No mechanism exists to detect that the worker never picked up the message

### Code Locations

**Heartbeat Check:**

```rust
// crates/executor/src/scheduler.rs:226-241
fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
    ); // 30 * 3 = 90 seconds
    // `age` is computed earlier in the function from the worker's last
    // heartbeat timestamp (elided in this excerpt)
    let is_fresh = age.to_std().unwrap_or(Duration::MAX) <= max_age;
    // ...
}
```

**Worker Selection:**

```rust
// crates/executor/src/scheduler.rs:171-246
async fn select_worker(pool: &PgPool, action: &Action) -> Result {
    // 1. Find action workers
    // 2. Filter by runtime compatibility
    // 3. Filter by active status
    // 4. Filter by heartbeat freshness ← Gap: 90s window
    // 5. Select first available (no load balancing)
}
```

**Message Queue Consumer:**

```rust
// crates/common/src/mq/consumer.rs:150-175
match handler(envelope.clone()).await {
    Err(e) => {
        let requeue = e.is_retriable(); // Only retries connection errors
        channel.basic_nack(delivery_tag, BasicNackOptions {
            requeue,
            ..
        })
    }
}
```
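The staleness window described in root cause 1 reduces to simple arithmetic, which can be sketched as a self-contained check. The constant and function names below are illustrative, not the identifiers used in `scheduler.rs`:

```rust
use std::time::Duration;

// Illustrative defaults mirroring the documented configuration.
const HEARTBEAT_INTERVAL_SECS: u64 = 30;
const STALENESS_MULTIPLIER: u64 = 3;

/// A worker counts as available while its last heartbeat is at most
/// interval * multiplier seconds old (90s with the defaults above).
fn is_heartbeat_fresh(age: Duration) -> bool {
    age <= Duration::from_secs(HEARTBEAT_INTERVAL_SECS * STALENESS_MULTIPLIER)
}

/// Worst-case window during which a dead worker still looks healthy:
/// the worker heartbeats, dies just before its next beat, and is only
/// filtered out once the staleness threshold passes.
fn dead_worker_window_secs() -> u64 {
    (HEARTBEAT_INTERVAL_SECS * STALENESS_MULTIPLIER) - HEARTBEAT_INTERVAL_SECS
}

fn main() {
    // A worker that died 60 seconds ago is still selectable...
    assert!(is_heartbeat_fresh(Duration::from_secs(60)));
    // ...and only drops out after the 90-second threshold.
    assert!(!is_heartbeat_fresh(Duration::from_secs(91)));
    assert_eq!(dead_worker_window_secs(), 60);
}
```

Shrinking the heartbeat interval (Phase 2, solution 4 below) shrinks this window proportionally: 10 * 3 − 10 = 20 seconds.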
## Impact Analysis

### User Experience

- **Stuck executions**: Appear to be running but never complete
- **No feedback**: Users don't know the execution failed until they check manually
- **Confusion**: Status shows SCHEDULED but nothing happens
- **Lost work**: Executions that could have been routed to healthy workers are stuck

### System Impact

- **Queue buildup**: Messages accumulate in unavailable workers' queues
- **Database pollution**: SCHEDULED executions remain in the database indefinitely
- **Resource waste**: Memory and disk consumed by stuck state
- **Monitoring gaps**: No clear way to detect this condition

### Severity

**HIGH** - This affects core functionality (execution reliability) and user trust in the system. In production, this would result in:

- Failed automations with no notification
- Debugging difficulties (why didn't my rule execute?)
- Potential data loss (an execution intended to process an event is lost)

## Proposed Solutions

Comprehensive solution document created at: `docs/architecture/worker-availability-handling.md`

### Phase 1: Immediate Fixes (HIGH PRIORITY)

#### 1. Execution Timeout Monitor

**Purpose**: Fail executions that remain SCHEDULED too long

**Implementation:**

- Background task in the executor service
- Checks every 60 seconds for stale scheduled executions
- Fails executions older than 5 minutes
- Updates status to FAILED with a descriptive error
- Publishes an ExecutionCompleted notification

**Impact**: Prevents indefinitely stuck executions

#### 2. Graceful Worker Shutdown

**Purpose**: Mark workers inactive before they stop

**Implementation:**

- Add a SIGTERM handler to the worker service
- Update worker status to INACTIVE in the database
- Stop consuming from the queue
- Wait for in-flight tasks to complete (30s timeout)
- Then exit

**Impact**: Reduces the window where a dead worker appears available
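The periodic sweep performed by the proposed Execution Timeout Monitor reduces to a pure partition over scheduled executions. A minimal sketch follows; the `Scheduled` struct and `sweep` function are invented for illustration, and the real background task would query and update the database instead of operating on an in-memory list:

```rust
use std::time::Duration;

// Illustrative value matching the Phase 1 proposal (5 minutes).
const SCHEDULED_TIMEOUT: Duration = Duration::from_secs(300);

/// Minimal model of a scheduled execution: an id plus how long it has
/// been sitting in SCHEDULED status.
struct Scheduled {
    id: u64,
    age: Duration,
}

/// One pass of the timeout monitor: split executions into those to fail
/// (older than the timeout) and those still within the window.
fn sweep(executions: Vec<Scheduled>) -> (Vec<u64>, Vec<u64>) {
    let mut to_fail = Vec::new();
    let mut still_waiting = Vec::new();
    for e in executions {
        if e.age > SCHEDULED_TIMEOUT {
            // In the real service this would set status = FAILED and
            // publish an ExecutionCompleted notification.
            to_fail.push(e.id);
        } else {
            still_waiting.push(e.id);
        }
    }
    (to_fail, still_waiting)
}

fn main() {
    let (fail, wait) = sweep(vec![
        Scheduled { id: 1, age: Duration::from_secs(400) },
        Scheduled { id: 2, age: Duration::from_secs(30) },
    ]);
    assert_eq!(fail, vec![1]);
    assert_eq!(wait, vec![2]);
}
```

In the real service this sweep would run on the configured check interval (every 60 seconds in the proposal), so worst-case detection latency is timeout plus one interval.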
### Phase 2: Medium-Term Improvements (MEDIUM PRIORITY)

#### 3. Worker Queue TTL + Dead Letter Queue

**Purpose**: Expire messages that sit too long in worker queues

**Implementation:**

- Configure `x-message-ttl: 300000` (5 minutes) on worker queues
- Configure `x-dead-letter-exchange` to route expired messages
- Create the DLQ exchange and queue
- Add a dead letter handler to fail executions from the DLQ

**Impact**: Prevents message queue buildup

#### 4. Reduced Heartbeat Interval

**Purpose**: Detect unavailable workers faster

**Configuration Changes:**

```yaml
worker:
  heartbeat_interval: 10  # Down from 30 seconds
executor:
  # Staleness = 10 * 3 = 30 seconds (down from 90s)
```

**Impact**: 60-second window reduced to 20 seconds

### Phase 3: Long-Term Enhancements (LOW PRIORITY)

#### 5. Active Health Probes

**Purpose**: Verify worker availability beyond heartbeats

**Implementation:**

- Add a health endpoint to the worker service
- Background health checker in the executor
- Pings workers periodically
- Marks workers INACTIVE if unresponsive

**Impact**: More reliable availability detection

#### 6. Intelligent Retry with Worker Affinity

**Purpose**: Reschedule failed executions to different workers

**Implementation:**

- Track which worker was assigned to each execution
- On timeout, reschedule to a different worker
- Implement exponential backoff
- Enforce a maximum retry limit

**Impact**: Better fault tolerance

## Recommended Immediate Actions

1. **Deploy Execution Timeout Monitor** (Week 1)
   - Add the timeout check to the executor service
   - Configure a 5-minute timeout for SCHEDULED executions
   - Monitor the timeout rate to tune values

2. **Add Graceful Shutdown to Workers** (Week 1)
   - Implement a SIGTERM handler
   - Update Docker Compose `stop_grace_period: 45s`
   - Test worker restart scenarios

3. **Reduce Heartbeat Interval** (Week 1)
   - Update config: `worker.heartbeat_interval: 10`
   - Reduces the staleness window from 90s to 30s
   - Low-risk configuration change
4. **Document Known Limitation** (Week 1)
   - Add operational notes about worker restart behavior
   - Document the expected timeout duration
   - Provide a troubleshooting guide

## Testing Strategy

### Manual Testing

1. Start the system with a worker running
2. Create an execution
3. Immediately stop the worker: `docker compose stop worker-shell`
4. Observe the execution status over 5 minutes
5. Verify the execution fails with a timeout error
6. Verify a notification is sent to the user

### Integration Tests

```rust
#[tokio::test]
async fn test_execution_timeout_on_worker_unavailable() {
    // 1. Create worker and start heartbeat
    // 2. Schedule execution
    // 3. Stop worker (no graceful shutdown)
    // 4. Wait > timeout duration
    // 5. Assert execution status = FAILED
    // 6. Assert error message contains "timeout"
}

#[tokio::test]
async fn test_graceful_worker_shutdown() {
    // 1. Create worker with active execution
    // 2. Send SIGTERM
    // 3. Verify worker status → INACTIVE
    // 4. Verify existing execution completes
    // 5. Verify new executions not scheduled to this worker
}
```

### Load Testing

- Test with multiple workers
- Stop workers randomly during execution
- Verify executions redistribute to healthy workers
- Measure timeout detection latency

## Metrics to Monitor Post-Deployment

1. **Execution Timeout Rate**: Track how often executions time out
2. **Timeout Latency**: Time from worker stop to execution failure
3. **Queue Depth**: Monitor worker-specific queue lengths
4. **Heartbeat Gaps**: Track time between the last heartbeat and status change
5. **Worker Restart Impact**: Measure execution disruption during restarts

## Configuration Recommendations

### Development

```yaml
executor:
  scheduled_timeout: 120      # 2 minutes (faster feedback)
  timeout_check_interval: 30  # Check every 30 seconds
worker:
  heartbeat_interval: 10
  shutdown_timeout: 15
```

### Production

```yaml
executor:
  scheduled_timeout: 300      # 5 minutes
  timeout_check_interval: 60  # Check every minute
worker:
  heartbeat_interval: 10
  shutdown_timeout: 30
```

## Related Work

This investigation complements:

- **2026-02-09 DOTENV Parameter Flattening**: Fixes action execution parameters
- **2026-02-09 URL Query Parameter Support**: Improves web UI filtering
- **Worker Heartbeat Monitoring**: Existing heartbeat mechanism (needs enhancement)

Together, these improvements address both execution correctness (parameter passing) and execution reliability (worker availability).

## Documentation Created

1. `docs/architecture/worker-availability-handling.md` - Comprehensive solution guide
   - Problem statement and current architecture
   - Detailed solutions with code examples
   - Implementation priorities and phases
   - Configuration recommendations
   - Testing strategies
   - Migration path

## Next Steps

1. **Review solutions document** with the team
2. **Prioritize implementation** based on urgency and resources
3. **Create implementation tickets** for each solution
4. **Schedule deployment** of Phase 1 fixes
5. **Establish monitoring** for new metrics
6. **Document operational procedures** for worker management

## Conclusion

The executor lacks robust handling for worker unavailability, relying solely on heartbeat staleness checks with a wide time window.
Multiple complementary solutions are needed:

- **Short-term**: Timeout monitor + graceful shutdown (prevents indefinite stuck state)
- **Medium-term**: Queue TTL + DLQ (prevents message buildup)
- **Long-term**: Health probes + retry logic (improves reliability)

**Priority**: Phase 1 solutions should be implemented immediately, as they address critical operational gaps that affect system reliability and user experience.
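The retry logic listed under "Long-term" could use capped exponential backoff with a retry budget. A sketch follows; the base delay, cap, and retry limit are assumptions for illustration, not values from the codebase:

```rust
use std::time::Duration;

// Illustrative retry policy: 5s base, doubling per attempt, 300s cap,
// at most 5 attempts before the execution is failed permanently.
const BASE_DELAY_SECS: u64 = 5;
const MAX_DELAY_SECS: u64 = 300;
const MAX_RETRIES: u32 = 5;

/// Delay before retry number `attempt` (0-based): 5s, 10s, 20s, 40s, 80s.
/// Returns None once the retry budget is exhausted.
fn retry_delay(attempt: u32) -> Option<Duration> {
    if attempt >= MAX_RETRIES {
        return None;
    }
    // Clamp the shift so large attempt numbers cannot overflow.
    let delay = BASE_DELAY_SECS.saturating_mul(1u64 << attempt.min(16));
    Some(Duration::from_secs(delay.min(MAX_DELAY_SECS)))
}

fn main() {
    assert_eq!(retry_delay(0), Some(Duration::from_secs(5)));
    assert_eq!(retry_delay(3), Some(Duration::from_secs(40)));
    assert_eq!(retry_delay(5), None); // budget exhausted → fail permanently
}
```

Combined with worker affinity tracking (solution 6), each retry would exclude the previously assigned worker so the execution lands on a different, healthy one.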