# Quick Reference: Worker Queue TTL and Dead Letter Queue (Phase 2)

## Overview

Phase 2 implements message TTL on worker queues and dead letter queue processing to automatically fail executions when workers are unavailable.

**Key Concept:** If a worker doesn't consume an execution message within the TTL (5 minutes by default), the message expires and the execution is automatically marked as FAILED.

## How It Works

```
Execution → Worker Queue (TTL: 5 min) → Worker Processing ✓
                     ↓ (if timeout)
            Dead Letter Exchange
                     ↓
             Dead Letter Queue
                     ↓
          DLQ Handler (in Executor)
                     ↓
          Execution marked FAILED
```

## Configuration

### Default Settings (All Environments)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours DLQ retention
```

### Tuning TTL

**Worker Queue TTL** (`worker_queue_ttl_ms`):

- **Default:** 300000 (5 minutes)
- **Purpose:** How long to wait before declaring worker unavailable
- **Tuning:** Set to 2-5x your typical execution time
- **Too short:** Slow executions fail prematurely
- **Too long:** Delayed failure detection for unavailable workers

**DLQ Retention** (`dead_letter.ttl_ms`):

- **Default:** 86400000 (24 hours)
- **Purpose:** How long to keep expired messages for debugging
- **Tuning:** Based on your debugging/forensics needs

## Components

### 1. Worker Queue TTL

- Applied to all `worker.{id}.executions` queues
- Configured via RabbitMQ queue argument `x-message-ttl`
- Messages expire if not consumed within TTL
- Expired messages are routed to the dead letter exchange

### 2. Dead Letter Exchange (DLX)

- **Name:** `attune.dlx`
- **Type:** `direct`
- Receives all expired messages from worker queues
- Routes to dead letter queue

### 3. Dead Letter Queue (DLQ)

- **Name:** `attune.dlx.queue`
- Stores expired messages for processing
- Retains messages for 24 hours (configurable)
- Processed by dead letter handler

### 4. Dead Letter Handler

- Runs in executor service
- Consumes messages from DLQ
- Updates executions to FAILED status
- Provides descriptive error messages

## Monitoring

### Key Metrics

```bash
# Check DLQ depth
rabbitmqadmin list queues name messages | grep attune.dlx.queue

# View DLQ rate
# Watch for sustained DLQ message rate > 10/min

# Check failed executions (URL quoted so the shell does not glob the "?")
curl "http://localhost:8080/api/v1/executions?status=failed"
```

### Health Checks

**Good:**

- DLQ depth: 0-10
- DLQ rate: < 5 messages/min
- Most executions complete successfully

**Warning:**

- DLQ depth: 10-100
- DLQ rate: 5-20 messages/min
- May indicate worker instability

**Critical:**

- DLQ depth: > 100
- DLQ rate: > 20 messages/min
- Workers likely down or overloaded

## Troubleshooting

### High DLQ Rate

**Symptoms:** Many executions failing via DLQ

**Common Causes:**

1. Workers stopped or restarting
2. Workers overloaded (not consuming fast enough)
3. TTL too aggressive for your workload
4. Network connectivity issues

**Resolution:**

```bash
# 1. Check worker status
docker compose ps | grep worker
docker compose logs -f worker-shell

# 2. Verify worker heartbeats
psql -c "SELECT name, status, last_heartbeat FROM worker;"

# 3. Check worker queue depths
rabbitmqadmin list queues name messages | grep "worker\."

# 4. Consider increasing TTL if legitimate slow executions
# Edit config and restart executor:
# worker_queue_ttl_ms: 600000  # 10 minutes
```

### DLQ Not Processing

**Symptoms:** DLQ depth increasing, executions stuck

**Common Causes:**

1. Executor service not running
2. DLQ disabled in config
3. Database connection issues

**Resolution:**

```bash
# 1. Verify executor is running
docker compose ps executor
docker compose logs -f executor | grep "dead letter"

# 2. Check configuration
grep -A 3 "dead_letter:" config.docker.yaml

# 3. Restart executor if needed
docker compose restart executor
```

### Messages Not Expiring

**Symptoms:** Executions stuck in SCHEDULED, DLQ empty

**Common Causes:**

1. Worker queues not configured with TTL
2. Worker queues not configured with DLX
3. Infrastructure setup failed

**Resolution:**

```bash
# 1. Check queue properties (lists the declared arguments of each worker queue)
rabbitmqadmin list queues name arguments | grep "worker\."
# Look for:
# - x-message-ttl: 300000
# - x-dead-letter-exchange: attune.dlx

# 2. Recreate infrastructure (safe, idempotent)
docker compose restart executor worker-shell
```

## Testing

### Manual Test: Verify TTL Expiration

```bash
# 1. Stop all workers
docker compose stop worker-shell worker-python worker-node

# 2. Create execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"}
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Check execution status (replace {id} with the ID returned in step 2)
curl http://localhost:8080/api/v1/executions/{id} | jq '.data.status'
# Should be "failed"

# 5. Check error message
curl http://localhost:8080/api/v1/executions/{id} | jq '.data.result'
# Should contain "Worker queue TTL expired"

# 6. Verify DLQ processed it
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```

## Relationship to Phase 1

**Phase 1 (Timeout Monitor):**

- Monitors executions in SCHEDULED state
- Fails executions after configured timeout
- Acts as backup safety net

**Phase 2 (Queue TTL + DLQ):**

- Expires messages at queue level
- More precise failure detection
- Provides better visibility (DLQ metrics)

**Together:** Provide defense-in-depth for worker unavailability

## Common Operations

### View DLQ Messages

```bash
# Get messages from DLQ (doesn't remove)
rabbitmqadmin get queue=attune.dlx.queue count=10

# View x-death header for expiration details
rabbitmqadmin get queue=attune.dlx.queue count=1 --format=long
```

### Manually Purge DLQ

```bash
# Use with caution - removes all messages
rabbitmqadmin purge queue name=attune.dlx.queue
```

### Temporarily Disable DLQ

```yaml
# config.docker.yaml
message_queue:
  rabbitmq:
    dead_letter:
      enabled: false  # Disables DLQ handler
```

**Note:** Messages will still expire but won't be processed.

### Adjust TTL Without Restart

Not possible: queue TTL is set at queue creation time. To change it:

```bash
# 1. Stop all services
#    (RabbitMQ must still be reachable for step 2; if it runs in the same
#    compose project, stop only the executor and workers instead)
docker compose down

# 2. Delete worker queues (forces recreation)
rabbitmqadmin delete queue name=worker.1.executions
# Repeat for all worker queues

# 3. Update config
# Edit worker_queue_ttl_ms

# 4. Restart services (queues recreated with new TTL)
docker compose up -d
```

## Key Files

### Configuration

- `config.docker.yaml` - Production settings
- `config.development.yaml` - Development settings

### Implementation

- `crates/common/src/mq/config.rs` - TTL configuration
- `crates/common/src/mq/connection.rs` - Queue setup with TTL
- `crates/executor/src/dead_letter_handler.rs` - DLQ processing
- `crates/executor/src/service.rs` - DLQ handler integration

### Documentation

- `docs/architecture/worker-queue-ttl-dlq.md` - Full architecture
- `docs/architecture/worker-availability-handling.md` - Phase 1 (backup)

## When to Use

**Enable DLQ (default):**

- Production environments
- Development with multiple workers
- Any environment requiring high reliability

**Disable DLQ:**

- Local development with single worker
- Testing scenarios where you want manual control
- Debugging worker behavior

## Next Steps (Phase 3)

- **Health probes:** Proactive worker health checking
- **Intelligent retry:** Retry transient failures
- **Per-action TTL:** Custom timeouts per action type
- **DLQ analytics:** Aggregate failure statistics

## See Also

- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
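
## Example: Scripting the Health Check Thresholds

The Good/Warning/Critical bands listed under Health Checks can be turned into a small shell helper for use in a cron job or alert script. This is only a sketch: `dlq_health` is a hypothetical function name, not part of Attune or RabbitMQ, and it takes the DLQ depth and the per-minute rate as whole numbers that you have already collected. Because the documented bands overlap at their edges, the sketch assumes a depth of exactly 10 counts as Good and a rate of exactly 5 or 20 counts as Warning; the worse of the two readings decides the result.

```bash
# Hypothetical helper, not part of Attune or RabbitMQ.
# Usage: dlq_health <depth> <rate_per_min>   (both whole numbers)
# Prints "good", "warning", or "critical" based on the Health Checks bands.
dlq_health() {
  depth=$1
  rate=$2
  if [ "$depth" -gt 100 ] || [ "$rate" -gt 20 ]; then
    # Either reading is in the Critical band
    echo critical
  elif [ "$depth" -gt 10 ] || [ "$rate" -ge 5 ]; then
    # Neither is Critical, but at least one is in the Warning band
    echo warning
  else
    echo good
  fi
}

# Example: 42 messages in the DLQ, arriving at 3 per minute
dlq_health 42 3   # prints "warning"
```

The depth argument could come from the `rabbitmqadmin list queues name messages` check shown under Key Metrics; how the rate is measured is left to whatever monitoring you already run.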