Worker Availability Handling - Gap Analysis
Date: 2026-02-09
Status: Investigation Complete - Implementation Pending
Priority: High
Impact: Operational Reliability
Issue Reported
User reported that when workers are brought down (e.g., docker compose down worker-shell), the executor continues attempting to send executions to the unavailable workers, resulting in stuck executions that never complete or fail.
Investigation Summary
Investigated the executor's worker selection and scheduling logic to understand how worker availability is determined and what happens when workers become unavailable.
Current Architecture
Heartbeat-Based Availability:
- Workers send heartbeats to database every 30 seconds (configurable)
- Scheduler filters workers based on heartbeat freshness
- Workers are considered "stale" if heartbeat is older than 90 seconds (3x heartbeat interval)
- Only workers with fresh heartbeats are eligible for scheduling
Scheduling Flow:
Execution (REQUESTED)
→ Scheduler finds worker with fresh heartbeat
→ Execution status updated to SCHEDULED
→ Message published to worker-specific queue
→ Worker consumes and executes
Root Causes Identified
1. Heartbeat Staleness Window: Workers can stop within the 90-second staleness window and still appear "available"
   - Worker sends heartbeat at T=0
   - Worker stops at T=30
   - Scheduler can still select this worker until T=90
   - 60-second window where dead worker appears healthy
2. No Execution Timeout: Once scheduled, executions have no timeout mechanism
   - Execution remains in SCHEDULED status indefinitely
   - No background process monitors scheduled executions
   - No automatic failure after a reasonable time period
3. Message Queue Accumulation: Messages sit forever in the worker-specific queues (`attune.execution.worker.{worker_id}`)
   - No TTL configured on these queues
   - No dead letter queue (DLQ) for expired messages
   - Messages never expire even if the worker is permanently down
4. No Graceful Shutdown: Workers don't update their status when stopping
   - Docker SIGTERM signal not handled
   - Worker status remains "active" in database
   - No notification that worker is shutting down
5. Retry Logic Issues: Failed scheduling doesn't trigger meaningful retries
   - Scheduler returns error if no workers available
   - Error triggers message requeue (via nack)
   - But if the worker WAS available during scheduling, the message is successfully published
   - No mechanism to detect that the worker never picked up the message
Code Locations
Heartbeat Check:
```rust
// crates/executor/src/scheduler.rs:226-241
fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER,
    ); // 30 * 3 = 90 seconds
    // `age` is the elapsed time since the worker's last recorded heartbeat
    let is_fresh = age.to_std().unwrap_or(Duration::MAX) <= max_age;
    // ...
}
```
Worker Selection:
```rust
// crates/executor/src/scheduler.rs:171-246
async fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
    // 1. Find action workers
    // 2. Filter by runtime compatibility
    // 3. Filter by active status
    // 4. Filter by heartbeat freshness ← Gap: 90s window
    // 5. Select first available (no load balancing)
}
```
Message Queue Consumer:
```rust
// crates/common/src/mq/consumer.rs:150-175
match handler(envelope.clone()).await {
    Ok(_) => { /* ack (elided) */ }
    Err(e) => {
        let requeue = e.is_retriable(); // Only retries connection errors
        channel.basic_nack(delivery_tag, BasicNackOptions { requeue, ..Default::default() })
    }
}
```
Impact Analysis
User Experience
- Stuck executions: Appear to be running but never complete
- No feedback: Users don't know execution failed until they check manually
- Confusion: Status shows SCHEDULED but nothing happens
- Lost work: Executions that could have been routed to healthy workers are stuck
System Impact
- Queue buildup: Messages accumulate in unavailable worker queues
- Database pollution: SCHEDULED executions remain in database indefinitely
- Resource waste: Memory and disk consumed by stuck state
- Monitoring gaps: No clear way to detect this condition
Severity
HIGH - This affects core functionality (execution reliability) and user trust in the system. In production, this would result in:
- Failed automations with no notification
- Debugging difficulties (why didn't my rule execute?)
- Potential data loss (execution intended to process event is lost)
Proposed Solutions
Comprehensive solution document created at: docs/architecture/worker-availability-handling.md
Phase 1: Immediate Fixes (HIGH PRIORITY)
1. Execution Timeout Monitor
Purpose: Fail executions that remain SCHEDULED too long
Implementation:
- Background task in executor service
- Checks every 60 seconds for stale scheduled executions
- Fails executions older than 5 minutes
- Updates status to FAILED with descriptive error
- Publishes ExecutionCompleted notification
Impact: Prevents indefinitely stuck executions
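A minimal sketch of the core decision the monitor would make. It assumes each execution carries a `scheduled_at` timestamp; the function name and signature are illustrative, not the actual schema:

```rust
use std::time::{Duration, SystemTime};

/// Returns true if an execution that entered SCHEDULED at `scheduled_at`
/// has exceeded `timeout` as of `now` and should be failed by the monitor.
fn is_scheduled_timed_out(scheduled_at: SystemTime, now: SystemTime, timeout: Duration) -> bool {
    now.duration_since(scheduled_at)
        .map(|age| age > timeout)
        // If `scheduled_at` is in the future (clock skew), treat as not timed out.
        .unwrap_or(false)
}
```

The background task would apply this check every 60 seconds across all SCHEDULED rows, flip matches to FAILED with a descriptive error, and publish the ExecutionCompleted notification.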
2. Graceful Worker Shutdown
Purpose: Mark workers inactive before they stop
Implementation:
- Add SIGTERM handler to worker service
- Update worker status to INACTIVE in database
- Stop consuming from queue
- Wait for in-flight tasks to complete (30s timeout)
- Then exit
Impact: Reduces window where dead worker appears available
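On the deployment side, the Compose service needs a stop grace period longer than the 30-second drain timeout, otherwise Docker sends SIGKILL while in-flight tasks are still completing. A possible fragment (service name assumed from the reported command):

```yaml
services:
  worker-shell:
    stop_grace_period: 45s  # SIGTERM first, then 45s to drain before SIGKILL
```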
Phase 2: Medium-Term Improvements (MEDIUM PRIORITY)
3. Worker Queue TTL + Dead Letter Queue
Purpose: Expire messages that sit too long in worker queues
Implementation:
- Configure `x-message-ttl: 300000` (5 minutes) on worker queues
- Configure `x-dead-letter-exchange` to route expired messages
- Create DLQ exchange and queue
- Add dead letter handler to fail executions from DLQ
Impact: Prevents message queue buildup
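The queue arguments would look roughly like this; the DLX exchange name `attune.execution.dlx` is an assumption, not an existing name:

```json
{
  "x-message-ttl": 300000,
  "x-dead-letter-exchange": "attune.execution.dlx"
}
```

These can be set per queue at declaration time or applied across all worker queues via a RabbitMQ policy, which avoids redeclaring existing queues.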
4. Reduced Heartbeat Interval
Purpose: Detect unavailable workers faster
Configuration Changes:
```yaml
worker:
  heartbeat_interval: 10  # Down from 30 seconds
executor:
  # Staleness = 10 * 3 = 30 seconds (down from 90s)
```
Impact: 60-second window reduced to 20 seconds
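The arithmetic behind that window: a worker can die just before its next heartbeat, so the worst case is the staleness threshold minus the heartbeat interval, i.e. interval * (multiplier - 1). A small helper makes the numbers above concrete:

```rust
/// Worst-case seconds a dead worker still looks available: it heartbeats at
/// T=0, dies just before the next beat at T=interval, and remains selectable
/// until T = interval * multiplier.
fn worst_case_stale_window_secs(heartbeat_interval: u64, staleness_multiplier: u64) -> u64 {
    heartbeat_interval * (staleness_multiplier - 1)
}
```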
Phase 3: Long-Term Enhancements (LOW PRIORITY)
5. Active Health Probes
Purpose: Verify worker availability beyond heartbeats
Implementation:
- Add health endpoint to worker service
- Background health checker in executor
- Pings workers periodically
- Marks workers INACTIVE if unresponsive
Impact: More reliable availability detection
6. Intelligent Retry with Worker Affinity
Purpose: Reschedule failed executions to different workers
Implementation:
- Track which worker was assigned to execution
- On timeout, reschedule to different worker
- Implement exponential backoff
- Maximum retry limit
Impact: Better fault tolerance
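For the exponential backoff piece, a sketch of a capped schedule (base and cap values are illustrative, not decided):

```rust
use std::time::Duration;

/// Delay before the nth reschedule attempt: base * 2^attempt, capped.
fn retry_backoff(attempt: u32, base: Duration, cap: Duration) -> Duration {
    base.checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(cap) // overflow => use the cap
        .min(cap)
}
```

The cap keeps a repeatedly failing execution from backing off indefinitely; the maximum retry limit would sit on top of this, failing the execution permanently once exhausted.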
Recommended Immediate Actions
1. Deploy Execution Timeout Monitor (Week 1)
   - Add timeout check to executor service
   - Configure 5-minute timeout for SCHEDULED executions
   - Monitor timeout rate to tune values
2. Add Graceful Shutdown to Workers (Week 1)
   - Implement SIGTERM handler
   - Update Docker Compose: `stop_grace_period: 45s`
   - Test worker restart scenarios
3. Reduce Heartbeat Interval (Week 1)
   - Update config: `worker.heartbeat_interval: 10`
   - Reduces staleness window from 90s to 30s
   - Low-risk configuration change
4. Document Known Limitation (Week 1)
   - Add operational notes about worker restart behavior
   - Document expected timeout duration
   - Provide troubleshooting guide
Testing Strategy
Manual Testing
1. Start system with worker running
2. Create execution
3. Immediately stop worker: `docker compose stop worker-shell`
4. Observe execution status over 5 minutes
5. Verify execution fails with timeout error
6. Verify notification sent to user
Integration Tests
```rust
#[tokio::test]
async fn test_execution_timeout_on_worker_unavailable() {
    // 1. Create worker and start heartbeat
    // 2. Schedule execution
    // 3. Stop worker (no graceful shutdown)
    // 4. Wait > timeout duration
    // 5. Assert execution status = FAILED
    // 6. Assert error message contains "timeout"
}

#[tokio::test]
async fn test_graceful_worker_shutdown() {
    // 1. Create worker with active execution
    // 2. Send SIGTERM
    // 3. Verify worker status → INACTIVE
    // 4. Verify existing execution completes
    // 5. Verify new executions not scheduled to this worker
}
```
Load Testing
- Test with multiple workers
- Stop workers randomly during execution
- Verify executions redistribute to healthy workers
- Measure timeout detection latency
Metrics to Monitor Post-Deployment
- Execution Timeout Rate: Track how often executions timeout
- Timeout Latency: Time from worker stop to execution failure
- Queue Depth: Monitor worker-specific queue lengths
- Heartbeat Gaps: Track time between last heartbeat and status change
- Worker Restart Impact: Measure execution disruption during restarts
Configuration Recommendations
Development
```yaml
executor:
  scheduled_timeout: 120       # 2 minutes (faster feedback)
  timeout_check_interval: 30   # Check every 30 seconds
worker:
  heartbeat_interval: 10
  shutdown_timeout: 15
```
Production
```yaml
executor:
  scheduled_timeout: 300       # 5 minutes
  timeout_check_interval: 60   # Check every minute
worker:
  heartbeat_interval: 10
  shutdown_timeout: 30
```
Related Work
This investigation complements:
- 2026-02-09 DOTENV Parameter Flattening: Fixes action execution parameters
- 2026-02-09 URL Query Parameter Support: Improves web UI filtering
- Worker Heartbeat Monitoring: Existing heartbeat mechanism (needs enhancement)
Together, these improvements address both execution correctness (parameter passing) and execution reliability (worker availability).
Documentation Created
`docs/architecture/worker-availability-handling.md` - Comprehensive solution guide:
- Problem statement and current architecture
- Detailed solutions with code examples
- Implementation priorities and phases
- Configuration recommendations
- Testing strategies
- Migration path
Next Steps
- Review solutions document with team
- Prioritize implementation based on urgency and resources
- Create implementation tickets for each solution
- Schedule deployment of Phase 1 fixes
- Establish monitoring for new metrics
- Document operational procedures for worker management
Conclusion
The executor lacks robust handling for worker unavailability, relying solely on heartbeat staleness checks with a wide time window. Multiple complementary solutions are needed:
- Short-term: Timeout monitor + graceful shutdown (prevents indefinite stuck state)
- Medium-term: Queue TTL + DLQ (prevents message buildup)
- Long-term: Health probes + retry logic (improves reliability)
Priority: Phase 1 solutions should be implemented immediately as they address critical operational gaps that affect system reliability and user experience.