attune/work-summary/2026-02-09-worker-availability-gaps.md

Worker Availability Handling - Gap Analysis

Date: 2026-02-09
Status: Investigation Complete - Implementation Pending
Priority: High
Impact: Operational Reliability

Issue Reported

User reported that when workers are brought down (e.g., docker compose down worker-shell), the executor continues attempting to send executions to the unavailable workers, resulting in stuck executions that never complete or fail.

Investigation Summary

Investigated the executor's worker selection and scheduling logic to understand how worker availability is determined and what happens when workers become unavailable.

Current Architecture

Heartbeat-Based Availability:

  • Workers send heartbeats to database every 30 seconds (configurable)
  • Scheduler filters workers based on heartbeat freshness
  • Workers are considered "stale" if heartbeat is older than 90 seconds (3x heartbeat interval)
  • Only workers with fresh heartbeats are eligible for scheduling

Scheduling Flow:

Execution (REQUESTED) 
  → Scheduler finds worker with fresh heartbeat
  → Execution status updated to SCHEDULED
  → Message published to worker-specific queue
  → Worker consumes and executes

Root Causes Identified

  1. Heartbeat Staleness Window: Workers can stop within the 90-second staleness window and still appear "available"

    • Worker sends heartbeat at T=0
    • Worker stops at T=30
    • Scheduler can still select this worker until T=90
    • 60-second window where dead worker appears healthy
  2. No Execution Timeout: Once scheduled, executions have no timeout mechanism

    • Execution remains in SCHEDULED status indefinitely
    • No background process monitors scheduled executions
    • No automatic failure after reasonable time period
  3. Message Queue Accumulation: Messages sit in worker-specific queues forever

    • Worker-specific queues: attune.execution.worker.{worker_id}
    • No TTL configured on these queues
    • No dead letter queue (DLQ) for expired messages
    • Messages never expire even if worker is permanently down
  4. No Graceful Shutdown: Workers don't update their status when stopping

    • Docker SIGTERM signal not handled
    • Worker status remains "active" in database
    • No notification that worker is shutting down
  5. Retry Logic Issues: Failed scheduling doesn't trigger meaningful retries

    • Scheduler returns error if no workers available
    • Error triggers message requeue (via nack)
    • But if worker WAS available during scheduling, message is successfully published
    • No mechanism to detect that worker never picked up the message

Code Locations

Heartbeat Check:

// crates/executor/src/scheduler.rs:226-241
fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
    ); // 30 * 3 = 90 seconds
    
    // `age` is the time elapsed since the worker's last heartbeat
    // (computation elided here)
    let is_fresh = age.to_std().unwrap_or(Duration::MAX) <= max_age;
    // ...
}

Worker Selection:

// crates/executor/src/scheduler.rs:171-246
async fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
    // 1. Find action workers
    // 2. Filter by runtime compatibility
    // 3. Filter by active status
    // 4. Filter by heartbeat freshness ← Gap: 90s window
    // 5. Select first available (no load balancing)
}

Message Queue Consumer:

// crates/common/src/mq/consumer.rs:150-175
match handler(envelope.clone()).await {
    Err(e) => {
        let requeue = e.is_retriable(); // Only retries connection errors
        channel.basic_nack(delivery_tag, BasicNackOptions { requeue, .. })
    }
}

Impact Analysis

User Experience

  • Stuck executions: Appear to be running but never complete
  • No feedback: Users don't know execution failed until they check manually
  • Confusion: Status shows SCHEDULED but nothing happens
  • Lost work: Executions that could have been routed to healthy workers are stuck

System Impact

  • Queue buildup: Messages accumulate in unavailable worker queues
  • Database pollution: SCHEDULED executions remain in database indefinitely
  • Resource waste: Memory and disk consumed by stuck state
  • Monitoring gaps: No clear way to detect this condition

Severity

HIGH - This affects core functionality (execution reliability) and user trust in the system. In production, this would result in:

  • Failed automations with no notification
  • Debugging difficulties (why didn't my rule execute?)
  • Potential data loss (execution intended to process event is lost)

Proposed Solutions

Comprehensive solution document created at: docs/architecture/worker-availability-handling.md

Phase 1: Immediate Fixes (HIGH PRIORITY)

1. Execution Timeout Monitor

Purpose: Fail executions that remain SCHEDULED too long

Implementation:

  • Background task in executor service
  • Checks every 60 seconds for stale scheduled executions
  • Fails executions older than 5 minutes
  • Updates status to FAILED with descriptive error
  • Publishes ExecutionCompleted notification

Impact: Prevents indefinitely stuck executions

2. Graceful Worker Shutdown

Purpose: Mark workers inactive before they stop

Implementation:

  • Add SIGTERM handler to worker service
  • Update worker status to INACTIVE in database
  • Stop consuming from queue
  • Wait for in-flight tasks to complete (30s timeout)
  • Then exit

Impact: Reduces window where dead worker appears available

Phase 2: Medium-Term Improvements (MEDIUM PRIORITY)

3. Worker Queue TTL + Dead Letter Queue

Purpose: Expire messages that sit too long in worker queues

Implementation:

  • Configure x-message-ttl: 300000 (5 minutes) on worker queues
  • Configure x-dead-letter-exchange to route expired messages
  • Create DLQ exchange and queue
  • Add dead letter handler to fail executions from DLQ

Impact: Prevents message queue buildup
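Rather than redeclaring every worker queue with x-arguments, the same effect can be achieved with a RabbitMQ policy, which applies to existing and future queues matching a pattern. A sketch (the policy name and the attune.execution.dlx exchange name are illustrative and would need to match whatever DLQ topology is created):

```shell
# Apply a 5-minute TTL and a dead-letter exchange to all worker queues
# via a RabbitMQ policy (exchange name here is illustrative).
rabbitmqctl set_policy worker-queue-expiry \
  '^attune\.execution\.worker\.' \
  '{"message-ttl": 300000, "dead-letter-exchange": "attune.execution.dlx"}' \
  --apply-to queues
```

A policy is also easier to tune later than per-queue x-arguments, since changing x-arguments on an existing queue requires deleting and redeclaring it.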

4. Reduced Heartbeat Interval

Purpose: Detect unavailable workers faster

Configuration Changes:

worker:
  heartbeat_interval: 10  # Down from 30 seconds

executor:
  # Staleness = 10 * 3 = 30 seconds (down from 90s)

Impact: Dead-worker visibility window reduced from 60 seconds to 20 seconds

Phase 3: Long-Term Enhancements (LOW PRIORITY)

5. Active Health Probes

Purpose: Verify worker availability beyond heartbeats

Implementation:

  • Add health endpoint to worker service
  • Background health checker in executor
  • Pings workers periodically
  • Marks workers INACTIVE if unresponsive

Impact: More reliable availability detection

6. Intelligent Retry with Worker Affinity

Purpose: Reschedule failed executions to different workers

Implementation:

  • Track which worker was assigned to execution
  • On timeout, reschedule to different worker
  • Implement exponential backoff
  • Maximum retry limit

Impact: Better fault tolerance

Recommended Implementation Plan

  1. Deploy Execution Timeout Monitor (Week 1)

    • Add timeout check to executor service
    • Configure 5-minute timeout for SCHEDULED executions
    • Monitor timeout rate to tune values
  2. Add Graceful Shutdown to Workers (Week 1)

    • Implement SIGTERM handler
    • Update Docker Compose stop_grace_period: 45s
    • Test worker restart scenarios
  3. Reduce Heartbeat Interval (Week 1)

    • Update config: worker.heartbeat_interval: 10
    • Reduces staleness window from 90s to 30s
    • Low-risk configuration change
  4. Document Known Limitation (Week 1)

    • Add operational notes about worker restart behavior
    • Document expected timeout duration
    • Provide troubleshooting guide

Testing Strategy

Manual Testing

  1. Start system with worker running
  2. Create execution
  3. Immediately stop worker: docker compose stop worker-shell
  4. Observe execution status over 5 minutes
  5. Verify execution fails with timeout error
  6. Verify notification sent to user

Integration Tests

#[tokio::test]
async fn test_execution_timeout_on_worker_unavailable() {
    // 1. Create worker and start heartbeat
    // 2. Schedule execution
    // 3. Stop worker (no graceful shutdown)
    // 4. Wait > timeout duration
    // 5. Assert execution status = FAILED
    // 6. Assert error message contains "timeout"
}

#[tokio::test]
async fn test_graceful_worker_shutdown() {
    // 1. Create worker with active execution
    // 2. Send SIGTERM
    // 3. Verify worker status → INACTIVE
    // 4. Verify existing execution completes
    // 5. Verify new executions not scheduled to this worker
}

Load Testing

  • Test with multiple workers
  • Stop workers randomly during execution
  • Verify executions redistribute to healthy workers
  • Measure timeout detection latency

Metrics to Monitor Post-Deployment

  1. Execution Timeout Rate: Track how often executions timeout
  2. Timeout Latency: Time from worker stop to execution failure
  3. Queue Depth: Monitor worker-specific queue lengths
  4. Heartbeat Gaps: Track time between last heartbeat and status change
  5. Worker Restart Impact: Measure execution disruption during restarts

Configuration Recommendations

Development

executor:
  scheduled_timeout: 120  # 2 minutes (faster feedback)
  timeout_check_interval: 30  # Check every 30 seconds

worker:
  heartbeat_interval: 10
  shutdown_timeout: 15

Production

executor:
  scheduled_timeout: 300  # 5 minutes
  timeout_check_interval: 60  # Check every minute

worker:
  heartbeat_interval: 10
  shutdown_timeout: 30

This investigation complements:

  • 2026-02-09 DOTENV Parameter Flattening: Fixes action execution parameters
  • 2026-02-09 URL Query Parameter Support: Improves web UI filtering
  • Worker Heartbeat Monitoring: Existing heartbeat mechanism (needs enhancement)

Together, these improvements address both execution correctness (parameter passing) and execution reliability (worker availability).

Documentation Created

  1. docs/architecture/worker-availability-handling.md - Comprehensive solution guide
    • Problem statement and current architecture
    • Detailed solutions with code examples
    • Implementation priorities and phases
    • Configuration recommendations
    • Testing strategies
    • Migration path

Next Steps

  1. Review solutions document with team
  2. Prioritize implementation based on urgency and resources
  3. Create implementation tickets for each solution
  4. Schedule deployment of Phase 1 fixes
  5. Establish monitoring for new metrics
  6. Document operational procedures for worker management

Conclusion

The executor lacks robust handling for worker unavailability, relying solely on heartbeat staleness checks with a wide time window. Multiple complementary solutions are needed:

  • Short-term: Timeout monitor + graceful shutdown (prevents indefinite stuck state)
  • Medium-term: Queue TTL + DLQ (prevents message buildup)
  • Long-term: Health probes + retry logic (improves reliability)

Priority: Phase 1 solutions should be implemented immediately as they address critical operational gaps that affect system reliability and user experience.