attune/docs/architecture/worker-queue-ttl-dlq.md
Worker Queue TTL and Dead Letter Queue (Phase 2)

Overview

Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.

Architecture

Message Flow

┌─────────────┐
│  Executor   │
│  Scheduler  │
└──────┬──────┘
       │ Publishes ExecutionRequested
       │ routing_key: execution.dispatch.worker.{id}
       │
       ▼
┌──────────────────────────────────┐
│  worker.{id}.executions queue    │
│                                  │
│  Properties:                     │
│  - x-message-ttl: 300000ms (5m)  │
│  - x-dead-letter-exchange: dlx   │
└──────┬───────────────────┬───────┘
       │                   │
       │ Worker consumes   │ TTL expires
       │ (normal flow)     │ (worker unavailable)
       │                   │
       ▼                   ▼
┌──────────────┐    ┌──────────────────┐
│   Worker     │    │  attune.dlx      │
│   Service    │    │  (Dead Letter    │
│              │    │   Exchange)      │
└──────────────┘    └────────┬─────────┘
                             │
                             │ Routes to DLQ
                             │
                             ▼
                    ┌──────────────────────┐
                    │  attune.dlx.queue    │
                    │  (Dead Letter Queue) │
                    └────────┬─────────────┘
                             │
                             │ Consumes
                             │
                             ▼
                    ┌──────────────────────┐
                    │  Dead Letter Handler │
                    │  (in Executor)       │
                    │                      │
                    │  - Identifies exec   │
                    │  - Marks as FAILED   │
                    │  - Logs failure      │
                    └──────────────────────┘

Components

1. Worker Queue TTL

Configuration:

  • Default: 5 minutes (300,000 milliseconds)
  • Configurable via rabbitmq.worker_queue_ttl_ms

Implementation:

  • Applied during queue declaration in Connection::setup_worker_infrastructure()
  • Uses RabbitMQ's x-message-ttl queue argument
  • Only applies to worker-specific queues (worker.{id}.executions)

Behavior:

  • When a message remains in the queue longer than the TTL, RabbitMQ automatically dead-letters it to the configured dead letter exchange
  • The original message body, properties, and headers are preserved
  • An x-death header is added with the expiration details (reason, source queue, time)
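
For reference, the x-death header on an expired message looks roughly like the following (the queue, worker ID, and exchange names here are illustrative, not taken from the codebase):

```json
[
  {
    "count": 1,
    "reason": "expired",
    "queue": "worker.42.executions",
    "time": 1700000000,
    "exchange": "attune.executions",
    "routing-keys": ["execution.dispatch.worker.42"]
  }
]
```

The `reason: "expired"` entry distinguishes TTL expiry from other dead-lettering causes such as rejection or queue-length overflow.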

2. Dead Letter Exchange (DLX)

Configuration:

  • Exchange name: attune.dlx
  • Type: direct
  • Durable: true

Setup:

  • Created in Connection::setup_common_infrastructure()
  • Bound to the dead letter queue with routing key # (matches all messages; note the # wildcard assumes a topic exchange, while a direct exchange matches only exact routing keys)
  • Shared across all services

3. Dead Letter Queue

Configuration:

  • Queue name: attune.dlx.queue
  • Durable: true
  • TTL: 24 hours (configurable via rabbitmq.dead_letter.ttl_ms)

Properties:

  • Retains messages for debugging and analysis
  • Messages auto-expire after retention period
  • No DLX on the DLQ itself (prevents infinite loops)

4. Dead Letter Handler

Location: crates/executor/src/dead_letter_handler.rs

Responsibilities:

  1. Consume messages from attune.dlx.queue
  2. Deserialize message envelope
  3. Extract execution ID from payload
  4. Verify execution is in non-terminal state
  5. Update execution to FAILED status
  6. Add descriptive error information
  7. Acknowledge message (remove from DLQ)

Error Handling:

  • Invalid messages: Acknowledged and discarded
  • Missing executions: Acknowledged (already processed)
  • Terminal state executions: Acknowledged (no action needed)
  • Database errors: Nacked with requeue (retry later)
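
The acknowledgement rules above can be sketched as a decision table. This is a hypothetical illustration, not the actual handler in crates/executor/src/dead_letter_handler.rs:

```rust
// Sketch of the DLQ acknowledgement decision table (illustrative only).
#[derive(Debug, PartialEq)]
enum DlqOutcome {
    InvalidMessage,   // Undeserializable or malformed envelope
    MissingExecution, // Execution no longer exists (already processed)
    AlreadyTerminal,  // Execution reached a terminal state by other means
    DbError,          // Transient database failure
}

#[derive(Debug, PartialEq)]
enum Disposition {
    Ack,         // Remove the message from the DLQ
    NackRequeue, // Return it to the DLQ for a later retry
}

fn disposition(outcome: &DlqOutcome) -> Disposition {
    match outcome {
        // Nothing useful can be done with these; drop them from the DLQ.
        DlqOutcome::InvalidMessage
        | DlqOutcome::MissingExecution
        | DlqOutcome::AlreadyTerminal => Disposition::Ack,
        // Transient infrastructure failure: requeue and retry later.
        DlqOutcome::DbError => Disposition::NackRequeue,
    }
}
```

The key property is that only transient failures are requeued; everything else is acknowledged so the DLQ cannot grow unboundedly on bad input.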

Configuration

RabbitMQ Configuration Structure

message_queue:
  rabbitmq:
    # Worker queue TTL - how long messages wait before DLX
    worker_queue_ttl_ms: 300000  # 5 minutes (default)
    
    # Dead letter configuration
    dead_letter:
      enabled: true                # Enable DLQ system
      exchange: attune.dlx         # DLX name
      ttl_ms: 86400000            # DLQ retention (24 hours)
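
For illustration, the configuration above maps onto a struct along these lines. This is a hypothetical sketch with the document's defaults baked in; the real field and type names in the codebase may differ:

```rust
// Hypothetical mirror of the rabbitmq configuration section above.
struct DeadLetterConfig {
    enabled: bool,    // Enable DLQ system
    exchange: String, // DLX name
    ttl_ms: u64,      // DLQ retention
}

struct RabbitMqConfig {
    worker_queue_ttl_ms: u64, // How long messages wait before DLX
    dead_letter: DeadLetterConfig,
}

impl Default for RabbitMqConfig {
    fn default() -> Self {
        Self {
            worker_queue_ttl_ms: 300_000, // 5 minutes
            dead_letter: DeadLetterConfig {
                enabled: true,
                exchange: "attune.dlx".to_string(),
                ttl_ms: 86_400_000, // 24 hours
            },
        }
    }
}
```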

Environment-Specific Settings

Development (config.development.yaml)

message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours

Production (config.docker.yaml)

message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours

Tuning Guidelines

Worker Queue TTL (worker_queue_ttl_ms):

  • Too short: Legitimate slow workers may have executions failed prematurely
  • Too long: Unavailable workers cause delayed failure detection
  • Recommendation: 2-5x typical execution time, minimum 2 minutes
  • Default (5 min): Good balance for most workloads
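
The recommendation above (2-5x the typical execution time, with a two-minute floor) can be sketched as a small helper. This is illustrative arithmetic, not a function in the codebase:

```rust
/// Suggest a worker-queue TTL from the typical execution time, following the
/// tuning guideline: roughly 2-5x the typical run, never below 2 minutes.
fn suggested_worker_queue_ttl_ms(typical_execution_ms: u64) -> u64 {
    const MIN_TTL_MS: u64 = 2 * 60 * 1000; // 2-minute floor
    // 3x is a middle-of-the-road multiplier within the recommended 2-5x band.
    (3 * typical_execution_ms).max(MIN_TTL_MS)
}
```

For example, a workload with a typical 100-second execution would get a 300-second TTL, matching the 5-minute default.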

DLQ Retention (dead_letter.ttl_ms):

  • Purpose: Debugging and forensics
  • Too short: May lose data before analysis
  • Too long: Accumulates stale data
  • Recommendation: 24-48 hours in production
  • Default (24 hours): Adequate for most troubleshooting

Code Structure

Queue Declaration with TTL

// crates/common/src/mq/connection.rs

pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<u64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();
    
    // Configure DLX
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );
    
    // Configure TTL if specified
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            // lapin's LongInt is a 32-bit value; use LongLongInt for a u64 TTL
            AMQPValue::LongLongInt(ttl as i64),
        );
    }
    
    // Declare the queue with the arguments (`channel` is this connection's AMQP channel)
    channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}

Dead Letter Handler

// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<Mutex<bool>>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        // Clone the pool handle so the handler closure can own it
        let pool = self.pool.clone();
        self.consumer
            .consume_with_handler(move |envelope| {
                let pool = pool.clone();
                async move {
                    match envelope.message_type {
                        MessageType::ExecutionRequested => {
                            handle_execution_requested(&pool, &envelope).await
                        }
                        _ => {
                            // Unexpected message type - acknowledge and discard
                            Ok(())
                        }
                    }
                }
            })
            .await
    }
}

async fn handle_execution_requested(
    pool: &PgPool,
    envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
    // Extract execution ID
    let execution_id = envelope.payload.get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or_else(|| /* error */)?;
    
    // Fetch current state
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    
    // Only fail if in non-terminal state
    if !execution.status.is_terminal() {
        ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
            status: Some(ExecutionStatus::Failed),
            result: Some(json!({
                "error": "Worker queue TTL expired",
                "message": "Worker did not process execution within configured TTL",
            })),
            ended: Some(Some(Utc::now())),
            ..Default::default()
        }).await?;
    }
    
    Ok(())
}
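
The is_terminal() check above might be implemented along these lines. The status set here is an assumption for illustration; only Scheduled and Failed are named elsewhere in this document:

```rust
// Hypothetical execution status enum; the real definition lives in the codebase.
#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl ExecutionStatus {
    /// Terminal states are never overwritten by the dead letter handler,
    /// which is what makes DLQ processing safe to run alongside normal flow.
    fn is_terminal(&self) -> bool {
        matches!(self, Self::Completed | Self::Failed | Self::Cancelled)
    }
}
```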

Integration with Executor Service

The dead letter handler is started automatically by the executor service if DLQ is enabled:

// crates/executor/src/service.rs

pub async fn start(&self) -> Result<()> {
    // ... other components ...
    
    // Start dead letter handler (if enabled)
    if self.inner.mq_config.rabbitmq.dead_letter.enabled {
        let dlq_name = format!("{}.queue", 
            self.inner.mq_config.rabbitmq.dead_letter.exchange);
        let dlq_consumer = Consumer::new(
            &self.inner.mq_connection,
            create_dlq_consumer_config(&dlq_name, "executor.dlq"),
        ).await?;
        
        let dlq_handler = Arc::new(
            DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
        );
        
        handles.push(tokio::spawn(async move {
            dlq_handler.start().await
        }));
    }
    
    // ... wait for completion ...
}

Operational Considerations

Monitoring

Key Metrics:

  • DLQ message rate (messages/sec entering DLQ)
  • DLQ queue depth (current messages in DLQ)
  • DLQ processing latency (time from DLX to handler)
  • Failed execution count (executions failed via DLQ)

Alerting Thresholds:

  • DLQ rate > 10/min: Workers may be unhealthy or TTL too aggressive
  • DLQ depth > 100: Handler may be falling behind
  • High failure rate: Systematic worker availability issues
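
The first two thresholds above can be expressed as a simple evaluation function; a sketch for illustration, not part of any monitoring stack:

```rust
/// Evaluate the documented alerting thresholds against current DLQ metrics.
fn dlq_alerts(rate_per_min: f64, depth: u64) -> Vec<&'static str> {
    let mut alerts = Vec::new();
    if rate_per_min > 10.0 {
        alerts.push("DLQ rate > 10/min: workers may be unhealthy or TTL too aggressive");
    }
    if depth > 100 {
        alerts.push("DLQ depth > 100: handler may be falling behind");
    }
    alerts
}
```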

RabbitMQ Management

View DLQ:

# List queues with their message counts
rabbitmqadmin list queues name messages

# Get DLQ details (message count, consumers)
rabbitmqctl list_queues name messages consumers | grep attune.dlx.queue

# Purge DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue

View Dead Letters:

# Get message from DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1

# Check message death history
# Look for x-death header in message properties

Troubleshooting

High DLQ Rate

Symptoms: Many executions failing via DLQ

Causes:

  1. Workers down or restarting frequently
  2. Worker queue TTL too aggressive
  3. Worker overloaded (not consuming fast enough)
  4. Network issues between executor and workers

Resolution:

  1. Check worker health and logs
  2. Verify worker heartbeats in database
  3. Consider increasing worker_queue_ttl_ms
  4. Scale worker fleet if overloaded

DLQ Handler Not Processing

Symptoms: DLQ depth increasing, executions stuck

Causes:

  1. Executor service not running
  2. DLQ disabled in configuration
  3. Database connection issues
  4. Handler crashed or deadlocked

Resolution:

  1. Check executor service logs
  2. Verify dead_letter.enabled = true
  3. Check database connectivity
  4. Restart executor service if needed

Messages Not Reaching DLQ

Symptoms: Executions stuck, DLQ empty

Causes:

  1. Worker queues not configured with DLX
  2. DLX exchange not created
  3. DLQ not bound to DLX
  4. TTL not configured on worker queues

Resolution:

  1. Restart services to recreate infrastructure
  2. Verify RabbitMQ configuration
  3. Check queue properties in RabbitMQ management UI

Testing

Unit Tests

#[tokio::test]
async fn test_expired_execution_handling() {
    let pool = setup_test_db().await;
    
    // Create execution in SCHEDULED state
    let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;
    
    // Simulate DLQ message
    let envelope = MessageEnvelope::new(
        MessageType::ExecutionRequested,
        json!({ "execution_id": execution.id }),
    );
    
    // Process message
    handle_execution_requested(&pool, &envelope).await.unwrap();
    
    // Verify execution failed
    let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
    assert_eq!(updated.status, ExecutionStatus::Failed);
    assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}

Integration Tests

# 1. Start all services
docker compose up -d

# 2. Create execution targeting a stopped worker (worker 999 does not exist)
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Verify execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"

# 5. Check DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)

Relationship to Other Phases

Phase 1 (Completed)

  • Execution timeout monitor: Handles executions stuck in SCHEDULED
  • Graceful shutdown: Prevents new tasks from being dispatched to stopping workers
  • Reduced heartbeat: Faster stale worker detection

Interaction: Phase 1 timeout monitor acts as a backstop if DLQ processing fails

Phase 2 (Current)

  • Worker queue TTL: Automatic message expiration
  • Dead letter queue: Capture expired messages
  • Dead letter handler: Process and fail expired executions

Benefit: More precise failure detection at the message queue level

Phase 3 (Planned)

  • Health probes: Proactive worker health checking
  • Intelligent retry: Retry transient failures
  • Load balancing: Distribute work across healthy workers

Integration: Phase 3 will use Phase 2 DLQ data to inform routing decisions

Benefits

  1. Automatic Failure Detection: No manual intervention needed for unavailable workers
  2. Precise Timing: TTL provides exact failure window (vs polling-based Phase 1)
  3. Resource Efficiency: Prevents message accumulation in worker queues
  4. Debugging Support: DLQ retains messages for forensic analysis
  5. Graceful Degradation: System continues functioning even with worker failures

Limitations

  1. TTL Precision: RabbitMQ TTL is approximate, not guaranteed to the millisecond
  2. Race Conditions: Worker may start processing just as TTL expires (rare)
  3. DLQ Capacity: Very high failure rates may overwhelm DLQ
  4. No Retry Logic: Phase 2 always fails; Phase 3 will add intelligent retry

Future Enhancements (Phase 3)

  • Conditional Retry: Retry messages based on failure reason
  • Priority DLQ: Prioritize critical execution failures
  • DLQ Analytics: Aggregate statistics on failure patterns
  • Auto-scaling: Scale workers based on DLQ rate
  • Custom TTL: Per-action or per-execution TTL configuration
