attune/docs/architecture/worker-queue-ttl-dlq.md
Worker Queue TTL and Dead Letter Queue (Phase 2)

Overview

Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.

Architecture

Message Flow

┌─────────────┐
│  Executor   │
│  Scheduler  │
└──────┬──────┘
       │ Publishes ExecutionRequested
       │ routing_key: execution.dispatch.worker.{id}
       │
       ▼
┌──────────────────────────────────┐
│  worker.{id}.executions queue    │
│                                  │
│  Properties:                     │
│  - x-message-ttl: 300000ms (5m)  │
│  - x-dead-letter-exchange: dlx   │
└──────┬───────────────────┬───────┘
       │                   │
       │ Worker consumes   │ TTL expires
       │ (normal flow)     │ (worker unavailable)
       │                   │
       ▼                   ▼
┌──────────────┐    ┌──────────────────┐
│   Worker     │    │  attune.dlx      │
│   Service    │    │  (Dead Letter    │
│              │    │   Exchange)      │
└──────────────┘    └────────┬─────────┘
                             │
                             │ Routes to DLQ
                             │
                             ▼
                    ┌──────────────────────┐
                    │  attune.dlx.queue    │
                    │  (Dead Letter Queue) │
                    └────────┬─────────────┘
                             │
                             │ Consumes
                             │
                             ▼
                    ┌──────────────────────┐
                    │  Dead Letter Handler │
                    │  (in Executor)       │
                    │                      │
                    │  - Identifies exec   │
                    │  - Marks as FAILED   │
                    │  - Logs failure      │
                    └──────────────────────┘

Components

1. Worker Queue TTL

Configuration:

  • Default: 5 minutes (300,000 milliseconds)
  • Configurable via rabbitmq.worker_queue_ttl_ms

Implementation:

  • Applied during queue declaration in Connection::setup_worker_infrastructure()
  • Uses RabbitMQ's x-message-ttl queue argument
  • Only applies to worker-specific queues (worker.{id}.executions)

Behavior:

  • When a message remains in the queue longer than the TTL, RabbitMQ automatically dead-letters it to the configured dead letter exchange
  • The original message body, properties, and headers are preserved
  • An x-death header is added with the expiration details (reason, source queue, time)
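
For reference, the x-death header on an expired message looks roughly like the following (the queue, worker ID, and exchange names here are illustrative, not taken from the codebase):

```json
[
  {
    "count": 1,
    "reason": "expired",
    "queue": "worker.42.executions",
    "time": 1700000000,
    "exchange": "attune.executions",
    "routing-keys": ["execution.dispatch.worker.42"]
  }
]
```

The `reason: "expired"` entry distinguishes TTL expiry from other dead-lettering causes such as rejection or queue-length overflow.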

2. Dead Letter Exchange (DLX)

Configuration:

  • Exchange name: attune.dlx
  • Type: direct
  • Durable: true

Setup:

  • Created in Connection::setup_common_infrastructure()
  • Bound to the dead letter queue with routing key # (matches all messages; note the # wildcard assumes a topic exchange, while a direct exchange matches only exact routing keys)
  • Shared across all services

3. Dead Letter Queue

Configuration:

  • Queue name: attune.dlx.queue
  • Durable: true
  • TTL: 24 hours (configurable via rabbitmq.dead_letter.ttl_ms)

Properties:

  • Retains messages for debugging and analysis
  • Messages auto-expire after retention period
  • No DLX on the DLQ itself (prevents infinite loops)

4. Dead Letter Handler

Location: crates/executor/src/dead_letter_handler.rs

Responsibilities:

  1. Consume messages from attune.dlx.queue
  2. Deserialize message envelope
  3. Extract execution ID from payload
  4. Verify execution is in non-terminal state
  5. Update execution to FAILED status
  6. Add descriptive error information
  7. Acknowledge message (remove from DLQ)

Error Handling:

  • Invalid messages: Acknowledged and discarded
  • Missing executions: Acknowledged (already processed)
  • Terminal state executions: Acknowledged (no action needed)
  • Database errors: Nacked with requeue (retry later)
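
The acknowledgement rules above can be sketched as a decision table. This is a hypothetical illustration, not the actual handler in crates/executor/src/dead_letter_handler.rs:

```rust
// Sketch of the DLQ acknowledgement decision table (illustrative only).
#[derive(Debug, PartialEq)]
enum DlqOutcome {
    InvalidMessage,   // Undeserializable or malformed envelope
    MissingExecution, // Execution no longer exists (already processed)
    AlreadyTerminal,  // Execution reached a terminal state by other means
    DbError,          // Transient database failure
}

#[derive(Debug, PartialEq)]
enum Disposition {
    Ack,         // Remove the message from the DLQ
    NackRequeue, // Return it to the DLQ for a later retry
}

fn disposition(outcome: &DlqOutcome) -> Disposition {
    match outcome {
        // Nothing useful can be done with these; drop them from the DLQ.
        DlqOutcome::InvalidMessage
        | DlqOutcome::MissingExecution
        | DlqOutcome::AlreadyTerminal => Disposition::Ack,
        // Transient infrastructure failure: requeue and retry later.
        DlqOutcome::DbError => Disposition::NackRequeue,
    }
}
```

The key property is that only transient failures are requeued; everything else is acknowledged so the DLQ cannot grow unboundedly on bad input.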

Configuration

RabbitMQ Configuration Structure

message_queue:
  rabbitmq:
    # Worker queue TTL - how long messages wait before DLX
    worker_queue_ttl_ms: 300000  # 5 minutes (default)
    
    # Dead letter configuration
    dead_letter:
      enabled: true                # Enable DLQ system
      exchange: attune.dlx         # DLX name
      ttl_ms: 86400000            # DLQ retention (24 hours)
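
For illustration, the configuration above maps onto a struct along these lines. This is a hypothetical sketch with the document's defaults baked in; the real field and type names in the codebase may differ:

```rust
// Hypothetical mirror of the rabbitmq configuration section above.
struct DeadLetterConfig {
    enabled: bool,    // Enable DLQ system
    exchange: String, // DLX name
    ttl_ms: u64,      // DLQ retention
}

struct RabbitMqConfig {
    worker_queue_ttl_ms: u64, // How long messages wait before DLX
    dead_letter: DeadLetterConfig,
}

impl Default for RabbitMqConfig {
    fn default() -> Self {
        Self {
            worker_queue_ttl_ms: 300_000, // 5 minutes
            dead_letter: DeadLetterConfig {
                enabled: true,
                exchange: "attune.dlx".to_string(),
                ttl_ms: 86_400_000, // 24 hours
            },
        }
    }
}
```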

Environment-Specific Settings

Development (config.development.yaml)

message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours

Production (config.docker.yaml)

message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours

Tuning Guidelines

Worker Queue TTL (worker_queue_ttl_ms):

  • Too short: Legitimate slow workers may have executions failed prematurely
  • Too long: Unavailable workers cause delayed failure detection
  • Recommendation: 2-5x typical execution time, minimum 2 minutes
  • Default (5 min): Good balance for most workloads
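
The recommendation above (2-5x the typical execution time, with a two-minute floor) can be sketched as a small helper. This is illustrative arithmetic, not a function in the codebase:

```rust
/// Suggest a worker-queue TTL from the typical execution time, following the
/// tuning guideline: roughly 2-5x the typical run, never below 2 minutes.
fn suggested_worker_queue_ttl_ms(typical_execution_ms: u64) -> u64 {
    const MIN_TTL_MS: u64 = 2 * 60 * 1000; // 2-minute floor
    // 3x is a middle-of-the-road multiplier within the recommended 2-5x band.
    (3 * typical_execution_ms).max(MIN_TTL_MS)
}
```

For example, a workload with a typical 100-second execution would get a 300-second TTL, matching the 5-minute default.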

DLQ Retention (dead_letter.ttl_ms):

  • Purpose: Debugging and forensics
  • Too short: May lose data before analysis
  • Too long: Accumulates stale data
  • Recommendation: 24-48 hours in production
  • Default (24 hours): Adequate for most troubleshooting

Code Structure

Queue Declaration with TTL

// crates/common/src/mq/connection.rs

pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<u64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();
    
    // Configure DLX
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );
    
    // Configure TTL if specified
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            // lapin's LongInt is a 32-bit value; use LongLongInt for a u64 TTL
            AMQPValue::LongLongInt(ttl as i64),
        );
    }
    
    // Declare the queue with the arguments (`channel` is this connection's AMQP channel)
    channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}

Dead Letter Handler

// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<Mutex<bool>>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        // Clone the pool handle so the handler closure can own it
        let pool = self.pool.clone();
        self.consumer
            .consume_with_handler(move |envelope| {
                let pool = pool.clone();
                async move {
                    match envelope.message_type {
                        MessageType::ExecutionRequested => {
                            handle_execution_requested(&pool, &envelope).await
                        }
                        _ => {
                            // Unexpected message type - acknowledge and discard
                            Ok(())
                        }
                    }
                }
            })
            .await
    }
}

async fn handle_execution_requested(
    pool: &PgPool,
    envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
    // Extract execution ID
    let execution_id = envelope.payload.get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or_else(|| /* error */)?;
    
    // Fetch current state
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    
    // Only fail if in non-terminal state
    if !execution.status.is_terminal() {
        ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
            status: Some(ExecutionStatus::Failed),
            result: Some(json!({
                "error": "Worker queue TTL expired",
                "message": "Worker did not process execution within configured TTL",
            })),
            ended: Some(Some(Utc::now())),
            ..Default::default()
        }).await?;
    }
    
    Ok(())
}
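
The is_terminal() check above might be implemented along these lines. The status set here is an assumption for illustration; only Scheduled and Failed are named elsewhere in this document:

```rust
// Hypothetical execution status enum; the real definition lives in the codebase.
#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl ExecutionStatus {
    /// Terminal states are never overwritten by the dead letter handler,
    /// which is what makes DLQ processing safe to run alongside normal flow.
    fn is_terminal(&self) -> bool {
        matches!(self, Self::Completed | Self::Failed | Self::Cancelled)
    }
}
```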

Integration with Executor Service

The dead letter handler is started automatically by the executor service if DLQ is enabled:

// crates/executor/src/service.rs

pub async fn start(&self) -> Result<()> {
    // ... other components ...
    
    // Start dead letter handler (if enabled)
    if self.inner.mq_config.rabbitmq.dead_letter.enabled {
        let dlq_name = format!("{}.queue", 
            self.inner.mq_config.rabbitmq.dead_letter.exchange);
        let dlq_consumer = Consumer::new(
            &self.inner.mq_connection,
            create_dlq_consumer_config(&dlq_name, "executor.dlq"),
        ).await?;
        
        let dlq_handler = Arc::new(
            DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
        );
        
        handles.push(tokio::spawn(async move {
            dlq_handler.start().await
        }));
    }
    
    // ... wait for completion ...
}

Operational Considerations

Monitoring

Key Metrics:

  • DLQ message rate (messages/sec entering DLQ)
  • DLQ queue depth (current messages in DLQ)
  • DLQ processing latency (time from DLX to handler)
  • Failed execution count (executions failed via DLQ)

Alerting Thresholds:

  • DLQ rate > 10/min: Workers may be unhealthy or TTL too aggressive
  • DLQ depth > 100: Handler may be falling behind
  • High failure rate: Systematic worker availability issues
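
The first two thresholds above can be expressed as a simple evaluation function; a sketch for illustration, not part of any monitoring stack:

```rust
/// Evaluate the documented alerting thresholds against current DLQ metrics.
fn dlq_alerts(rate_per_min: f64, depth: u64) -> Vec<&'static str> {
    let mut alerts = Vec::new();
    if rate_per_min > 10.0 {
        alerts.push("DLQ rate > 10/min: workers may be unhealthy or TTL too aggressive");
    }
    if depth > 100 {
        alerts.push("DLQ depth > 100: handler may be falling behind");
    }
    alerts
}
```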

RabbitMQ Management

View DLQ:

# List queues with their message counts
rabbitmqadmin list queues name messages

# Get DLQ details (message count, consumers)
rabbitmqctl list_queues name messages consumers | grep attune.dlx.queue

# Purge DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue

View Dead Letters:

# Get message from DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1

# Check message death history
# Look for x-death header in message properties

Troubleshooting

High DLQ Rate

Symptoms: Many executions failing via DLQ

Causes:

  1. Workers down or restarting frequently
  2. Worker queue TTL too aggressive
  3. Worker overloaded (not consuming fast enough)
  4. Network issues between executor and workers

Resolution:

  1. Check worker health and logs
  2. Verify worker heartbeats in database
  3. Consider increasing worker_queue_ttl_ms
  4. Scale worker fleet if overloaded

DLQ Handler Not Processing

Symptoms: DLQ depth increasing, executions stuck

Causes:

  1. Executor service not running
  2. DLQ disabled in configuration
  3. Database connection issues
  4. Handler crashed or deadlocked

Resolution:

  1. Check executor service logs
  2. Verify dead_letter.enabled = true
  3. Check database connectivity
  4. Restart executor service if needed

Messages Not Reaching DLQ

Symptoms: Executions stuck, DLQ empty

Causes:

  1. Worker queues not configured with DLX
  2. DLX exchange not created
  3. DLQ not bound to DLX
  4. TTL not configured on worker queues

Resolution:

  1. Restart services to recreate infrastructure
  2. Verify RabbitMQ configuration
  3. Check queue properties in RabbitMQ management UI

Testing

Unit Tests

#[tokio::test]
async fn test_expired_execution_handling() {
    let pool = setup_test_db().await;
    
    // Create execution in SCHEDULED state
    let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;
    
    // Simulate DLQ message
    let envelope = MessageEnvelope::new(
        MessageType::ExecutionRequested,
        json!({ "execution_id": execution.id }),
    );
    
    // Process message
    handle_execution_requested(&pool, &envelope).await.unwrap();
    
    // Verify execution failed
    let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
    assert_eq!(updated.status, ExecutionStatus::Failed);
    assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}

Integration Tests

# 1. Start all services
docker compose up -d

# 2. Create execution targeting a stopped worker (worker 999 does not exist)
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Verify execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"

# 5. Check DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)

Relationship to Other Phases

Phase 1 (Completed)

  • Execution timeout monitor: Handles executions stuck in SCHEDULED
  • Graceful shutdown: Prevents new tasks from being dispatched to stopping workers
  • Reduced heartbeat: Faster stale worker detection

Interaction: Phase 1 timeout monitor acts as a backstop if DLQ processing fails

Phase 2 (Current)

  • Worker queue TTL: Automatic message expiration
  • Dead letter queue: Capture expired messages
  • Dead letter handler: Process and fail expired executions

Benefit: More precise failure detection at the message queue level

Phase 3 (Planned)

  • Health probes: Proactive worker health checking
  • Intelligent retry: Retry transient failures
  • Load balancing: Distribute work across healthy workers

Integration: Phase 3 will use Phase 2 DLQ data to inform routing decisions

Benefits

  1. Automatic Failure Detection: No manual intervention needed for unavailable workers
  2. Precise Timing: TTL provides exact failure window (vs polling-based Phase 1)
  3. Resource Efficiency: Prevents message accumulation in worker queues
  4. Debugging Support: DLQ retains messages for forensic analysis
  5. Graceful Degradation: System continues functioning even with worker failures

Limitations

  1. TTL Precision: RabbitMQ TTL is approximate, not guaranteed to the millisecond
  2. Race Conditions: Worker may start processing just as TTL expires (rare)
  3. DLQ Capacity: Very high failure rates may overwhelm DLQ
  4. No Retry Logic: Phase 2 always fails; Phase 3 will add intelligent retry

Future Enhancements (Phase 3)

  • Conditional Retry: Retry messages based on failure reason
  • Priority DLQ: Prioritize critical execution failures
  • DLQ Analytics: Aggregate statistics on failure patterns
  • Auto-scaling: Scale workers based on DLQ rate
  • Custom TTL: Per-action or per-execution TTL configuration
