# Worker Queue TTL and Dead Letter Queue (Phase 2)

## Overview

Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.

## Architecture

### Message Flow

```
┌─────────────┐
│  Executor   │
│  Scheduler  │
└──────┬──────┘
       │ Publishes ExecutionRequested
       │ routing_key: execution.dispatch.worker.{id}
       ▼
┌──────────────────────────────────┐
│  worker.{id}.executions queue    │
│                                  │
│  Properties:                     │
│  - x-message-ttl: 300000ms (5m)  │
│  - x-dead-letter-exchange: dlx   │
└──────┬───────────────────┬───────┘
       │                   │
       │ Worker consumes   │ TTL expires
       │ (normal flow)     │ (worker unavailable)
       ▼                   ▼
┌──────────────┐   ┌──────────────────┐
│   Worker     │   │   attune.dlx     │
│   Service    │   │  (Dead Letter    │
│              │   │   Exchange)      │
└──────────────┘   └────────┬─────────┘
                            │ Routes to DLQ
                            ▼
                   ┌──────────────────────┐
                   │  attune.dlx.queue    │
                   │ (Dead Letter Queue)  │
                   └────────┬─────────────┘
                            │ Consumes
                            ▼
                   ┌──────────────────────┐
                   │ Dead Letter Handler  │
                   │   (in Executor)      │
                   │                      │
                   │ - Identifies exec    │
                   │ - Marks as FAILED    │
                   │ - Logs failure       │
                   └──────────────────────┘
```

### Components

#### 1. Worker Queue TTL

**Configuration:**
- Default: 5 minutes (300,000 milliseconds)
- Configurable via `rabbitmq.worker_queue_ttl_ms`

**Implementation:**
- Applied during queue declaration in `Connection::setup_worker_infrastructure()`
- Uses RabbitMQ's `x-message-ttl` queue argument
- Only applies to worker-specific queues (`worker.{id}.executions`)

**Behavior:**
- When a message remains in the queue longer than the TTL, RabbitMQ automatically moves it to the configured dead letter exchange
- Original message properties and headers are preserved
- An `x-death` header with expiration details is added
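When RabbitMQ dead-letters a message, it records why in the `x-death` header. As an illustrative sketch (not code from this repository — the real header is a typed AMQP field table, simplified here to plain fields), a consumer could distinguish TTL expiry from other dead-letter reasons like this:

```rust
/// Simplified view of one entry in RabbitMQ's `x-death` header.
/// (Real AMQP headers are typed field tables; this sketch flattens
/// them to plain fields for illustration.)
#[derive(Debug)]
struct XDeath {
    reason: String, // e.g. "expired" when a queue TTL fired
    queue: String,  // queue the message was dead-lettered from
    count: u64,     // how many times this queue dead-lettered it
}

/// True when a dead-lettered message reached the DLQ because a
/// worker-queue TTL expired (as opposed to a reject/nack).
fn expired_from_worker_queue(death: &XDeath) -> bool {
    death.reason == "expired" && death.queue.starts_with("worker.")
}

fn main() {
    let death = XDeath {
        reason: "expired".to_string(),
        queue: "worker.42.executions".to_string(),
        count: 1,
    };
    println!("{}", expired_from_worker_queue(&death)); // prints "true"
}
```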
#### 2. Dead Letter Exchange (DLX)

**Configuration:**
- Exchange name: `attune.dlx`
- Type: `topic` (so the `#` binding below matches all routing keys)
- Durable: `true`

**Setup:**
- Created in `Connection::setup_common_infrastructure()`
- Bound to the dead letter queue with routing key `#` (all messages)
- Shared across all services

#### 3. Dead Letter Queue

**Configuration:**
- Queue name: `attune.dlx.queue`
- Durable: `true`
- TTL: 24 hours (configurable via `rabbitmq.dead_letter.ttl_ms`)

**Properties:**
- Retains messages for debugging and analysis
- Messages auto-expire after the retention period
- No DLX on the DLQ itself (prevents infinite loops)

#### 4. Dead Letter Handler

**Location:** `crates/executor/src/dead_letter_handler.rs`

**Responsibilities:**
1. Consume messages from `attune.dlx.queue`
2. Deserialize the message envelope
3. Extract the execution ID from the payload
4. Verify the execution is in a non-terminal state
5. Update the execution to FAILED status
6. Add descriptive error information
7. Acknowledge the message (remove from DLQ)

**Error Handling:**
- Invalid messages: acknowledged and discarded
- Missing executions: acknowledged (already processed)
- Terminal-state executions: acknowledged (no action needed)
- Database errors: nacked with requeue (retry later)

## Configuration

### RabbitMQ Configuration Structure

```yaml
message_queue:
  rabbitmq:
    # Worker queue TTL - how long messages wait before DLX
    worker_queue_ttl_ms: 300000  # 5 minutes (default)

    # Dead letter configuration
    dead_letter:
      enabled: true          # Enable DLQ system
      exchange: attune.dlx   # DLX name
      ttl_ms: 86400000       # DLQ retention (24 hours)
```

### Environment-Specific Settings

#### Development (`config.development.yaml`)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours
```

#### Production (`config.docker.yaml`)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours
```
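The YAML above maps onto a small configuration type. A minimal stdlib-only sketch with the documented defaults baked into `Default` (the struct and field names are assumptions mirroring the YAML keys; the real crate presumably deserializes these with serde):

```rust
/// Dead letter settings (mirrors `message_queue.rabbitmq.dead_letter`).
#[derive(Debug, Clone, PartialEq)]
struct DeadLetterConfig {
    enabled: bool,
    exchange: String,
    ttl_ms: u64,
}

/// RabbitMQ settings relevant to Phase 2.
#[derive(Debug, Clone, PartialEq)]
struct RabbitMqConfig {
    worker_queue_ttl_ms: u64,
    dead_letter: DeadLetterConfig,
}

impl Default for RabbitMqConfig {
    fn default() -> Self {
        Self {
            worker_queue_ttl_ms: 300_000, // 5 minutes
            dead_letter: DeadLetterConfig {
                enabled: true,
                exchange: "attune.dlx".to_string(),
                ttl_ms: 86_400_000, // 24 hours
            },
        }
    }
}

fn main() {
    let cfg = RabbitMqConfig::default();
    // The DLQ name is derived from the exchange name as "{exchange}.queue".
    let dlq_name = format!("{}.queue", cfg.dead_letter.exchange);
    println!("{}", dlq_name); // prints "attune.dlx.queue"
}
```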
### Tuning Guidelines

**Worker Queue TTL (`worker_queue_ttl_ms`):**
- **Too short:** Legitimate slow workers may have executions failed prematurely
- **Too long:** Unavailable workers cause delayed failure detection
- **Recommendation:** 2-5x typical execution time, minimum 2 minutes
- **Default (5 min):** Good balance for most workloads

**DLQ Retention (`dead_letter.ttl_ms`):**
- **Purpose:** Debugging and forensics
- **Too short:** May lose data before analysis
- **Too long:** Accumulates stale data
- **Recommendation:** 24-48 hours in production
- **Default (24 hours):** Adequate for most troubleshooting

## Code Structure

### Queue Declaration with TTL

```rust
// crates/common/src/mq/connection.rs
pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<i64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();

    // Configure DLX
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );

    // Configure TTL if specified
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            AMQPValue::LongLongInt(ttl),
        );
    }

    // Declare a durable queue with the arguments
    let options = QueueDeclareOptions {
        durable: true,
        ..Default::default()
    };
    self.channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}
```

### Dead Letter Handler

```rust
// crates/executor/src/dead_letter_handler.rs
pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<AtomicBool>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        let pool = self.pool.clone();
        self.consumer.consume_with_handler(move |envelope| {
            let pool = pool.clone();
            async move {
                match envelope.message_type {
                    MessageType::ExecutionRequested => {
                        handle_execution_requested(&pool, &envelope).await
                    }
                    _ => {
                        // Unexpected message type - acknowledge and discard
                        Ok(())
                    }
                }
            }
        }).await
    }
}

async fn handle_execution_requested(
    pool: &PgPool,
    envelope: &MessageEnvelope,
) -> MqResult<()> {
    // Extract execution ID
    let execution_id = envelope.payload.get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or_else(|| /* error */)?;

    // Fetch current state
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Only fail if in non-terminal state
    if !execution.status.is_terminal() {
        ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
            status: Some(ExecutionStatus::Failed),
            result: Some(json!({
                "error": "Worker queue TTL expired",
                "message": "Worker did not process execution within configured TTL",
            })),
            ended: Some(Some(Utc::now())),
            ..Default::default()
        }).await?;
    }

    Ok(())
}
```

## Integration with Executor Service

The dead letter handler is started automatically by the executor service if the DLQ is enabled:

```rust
// crates/executor/src/service.rs
pub async fn start(&self) -> Result<()> {
    // ... other components ...

    // Start dead letter handler (if enabled)
    if self.inner.mq_config.rabbitmq.dead_letter.enabled {
        let dlq_name = format!("{}.queue", self.inner.mq_config.rabbitmq.dead_letter.exchange);
        let dlq_consumer = Consumer::new(
            &self.inner.mq_connection,
            create_dlq_consumer_config(&dlq_name, "executor.dlq"),
        ).await?;

        let dlq_handler = Arc::new(
            DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
        );

        handles.push(tokio::spawn(async move {
            dlq_handler.start().await
        }));
    }

    // ... wait for completion ...
}
```
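The dead letter handler above only fails executions that are still in a non-terminal state, which keeps it from clobbering the result of a worker that finished just before the TTL fired. A hedged sketch of what the `is_terminal()` predicate could look like (the variant names are assumptions for illustration, not the codebase's actual enum):

```rust
/// Execution lifecycle states (illustrative variant names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ExecutionStatus {
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl ExecutionStatus {
    /// Terminal states are never overwritten by the dead letter handler:
    /// once an execution has completed, failed, or been cancelled, a late
    /// TTL expiry must not change its outcome.
    fn is_terminal(self) -> bool {
        matches!(self, Self::Completed | Self::Failed | Self::Cancelled)
    }
}

fn main() {
    // A SCHEDULED execution whose message expired should be failed...
    println!("{}", !ExecutionStatus::Scheduled.is_terminal()); // prints "true"
    // ...but an already-completed one is left alone.
    println!("{}", ExecutionStatus::Completed.is_terminal()); // prints "true"
}
```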
## Operational Considerations

### Monitoring

**Key Metrics:**
- DLQ message rate (messages/sec entering the DLQ)
- DLQ queue depth (current messages in the DLQ)
- DLQ processing latency (time from DLX to handler)
- Failed execution count (executions failed via the DLQ)

**Alerting Thresholds:**
- DLQ rate > 10/min: workers may be unhealthy or the TTL too aggressive
- DLQ depth > 100: the handler may be falling behind
- High failure rate: systematic worker availability issues

### RabbitMQ Management

**View DLQ:**

```bash
# List messages in DLQ
rabbitmqadmin list queues name messages

# Get DLQ details
rabbitmqadmin show queue name=attune.dlx.queue

# Purge DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue
```

**View Dead Letters:**

```bash
# Get a message from the DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1

# Check message death history:
# look for the x-death header in the message properties
```

### Troubleshooting

#### High DLQ Rate

**Symptoms:** Many executions failing via the DLQ

**Causes:**
1. Workers down or restarting frequently
2. Worker queue TTL too aggressive
3. Workers overloaded (not consuming fast enough)
4. Network issues between executor and workers

**Resolution:**
1. Check worker health and logs
2. Verify worker heartbeats in the database
3. Consider increasing `worker_queue_ttl_ms`
4. Scale the worker fleet if overloaded

#### DLQ Handler Not Processing

**Symptoms:** DLQ depth increasing, executions stuck

**Causes:**
1. Executor service not running
2. DLQ disabled in configuration
3. Database connection issues
4. Handler crashed or deadlocked

**Resolution:**
1. Check executor service logs
2. Verify `dead_letter.enabled = true`
3. Check database connectivity
4. Restart the executor service if needed

#### Messages Not Reaching DLQ

**Symptoms:** Executions stuck, DLQ empty

**Causes:**
1. Worker queues not configured with a DLX
2. DLX exchange not created
3. DLQ not bound to the DLX
4. TTL not configured on worker queues

**Resolution:**
1. Restart services to recreate the infrastructure
2. Verify the RabbitMQ configuration
3. Check queue properties in the RabbitMQ management UI

## Testing

### Unit Tests

```rust
#[tokio::test]
async fn test_expired_execution_handling() {
    let pool = setup_test_db().await;

    // Create execution in SCHEDULED state
    let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;

    // Simulate DLQ message
    let envelope = MessageEnvelope::new(
        MessageType::ExecutionRequested,
        json!({ "execution_id": execution.id }),
    );

    // Process message
    handle_execution_requested(&pool, &envelope).await.unwrap();

    // Verify execution failed
    let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
    assert_eq!(updated.status, ExecutionStatus::Failed);
    assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}
```

### Integration Tests

```bash
# 1. Start all services
docker compose up -d

# 2. Create an execution targeting a stopped (non-existent) worker
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Verify the execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"
```
```bash
# 5. Check that the DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```

## Relationship to Other Phases

### Phase 1 (Completed)

- Execution timeout monitor: handles executions stuck in SCHEDULED
- Graceful shutdown: prevents new tasks from being dispatched to stopping workers
- Reduced heartbeat interval: faster stale-worker detection

**Interaction:** The Phase 1 timeout monitor acts as a backstop if DLQ processing fails

### Phase 2 (Current)

- Worker queue TTL: automatic message expiration
- Dead letter queue: captures expired messages
- Dead letter handler: processes and fails expired executions

**Benefit:** More precise failure detection at the message queue level

### Phase 3 (Planned)

- Health probes: proactive worker health checking
- Intelligent retry: retry transient failures
- Load balancing: distribute work across healthy workers

**Integration:** Phase 3 will use Phase 2 DLQ data to inform routing decisions

## Benefits

1. **Automatic Failure Detection:** No manual intervention needed for unavailable workers
2. **Precise Timing:** TTL provides an exact failure window (vs. the polling-based Phase 1)
3. **Resource Efficiency:** Prevents message accumulation in worker queues
4. **Debugging Support:** DLQ retains messages for forensic analysis
5. **Graceful Degradation:** System continues functioning even with worker failures

## Limitations

1. **TTL Precision:** RabbitMQ TTL is approximate, not guaranteed to the millisecond
2. **Race Conditions:** A worker may start processing just as the TTL expires (rare)
3. **DLQ Capacity:** Very high failure rates may overwhelm the DLQ
4. **No Retry Logic:** Phase 2 always fails expired executions; Phase 3 will add intelligent retry

## Future Enhancements (Phase 3)

- **Conditional Retry:** Retry messages based on failure reason
- **Priority DLQ:** Prioritize critical execution failures
- **DLQ Analytics:** Aggregate statistics on failure patterns
- **Auto-scaling:** Scale workers based on DLQ rate
- **Custom TTL:** Per-action or per-execution TTL configuration

## References

- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`