Worker Queue TTL and Dead Letter Queue (Phase 2)
Overview
Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.
Architecture
Message Flow
┌─────────────┐
│ Executor │
│ Scheduler │
└──────┬──────┘
│ Publishes ExecutionRequested
│ routing_key: execution.dispatch.worker.{id}
│
▼
┌──────────────────────────────────┐
│ worker.{id}.executions queue │
│ │
│ Properties: │
│ - x-message-ttl: 300000ms (5m) │
│ - x-dead-letter-exchange: dlx │
└──────┬───────────────────┬───────┘
│ │
│ Worker consumes │ TTL expires
│ (normal flow) │ (worker unavailable)
│ │
▼ ▼
┌──────────────┐ ┌──────────────────┐
│ Worker │ │ attune.dlx │
│ Service │ │ (Dead Letter │
│ │ │ Exchange) │
└──────────────┘ └────────┬─────────┘
│
│ Routes to DLQ
│
▼
┌──────────────────────┐
│ attune.dlx.queue │
│ (Dead Letter Queue) │
└────────┬─────────────┘
│
│ Consumes
│
▼
┌──────────────────────┐
│ Dead Letter Handler │
│ (in Executor) │
│ │
│ - Identifies exec │
│ - Marks as FAILED │
│ - Logs failure │
└──────────────────────┘
Components
1. Worker Queue TTL
Configuration:
- Default: 5 minutes (300,000 milliseconds)
- Configurable via rabbitmq.worker_queue_ttl_ms
Implementation:
- Applied during queue declaration in Connection::setup_worker_infrastructure()
- Uses RabbitMQ's x-message-ttl queue argument
- Only applies to worker-specific queues (worker.{id}.executions)
Behavior:
- When a message remains in the queue longer than the TTL, RabbitMQ automatically moves it to the configured dead letter exchange
- Original message properties and headers are preserved
- An x-death header with expiration details is added to the message
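For reference, the x-death header on a dead-lettered message looks roughly like this (field names follow RabbitMQ's dead-lettering documentation; the exchange name and timestamp here are illustrative, not taken from the codebase):

```json
[
  {
    "count": 1,
    "reason": "expired",
    "queue": "worker.42.executions",
    "exchange": "attune.executions",
    "routing-keys": ["execution.dispatch.worker.42"],
    "time": "2024-01-15T10:05:00Z"
  }
]
```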
2. Dead Letter Exchange (DLX)
Configuration:
- Exchange name:
attune.dlx - Type:
direct - Durable:
true
Setup:
- Created in
Connection::setup_common_infrastructure() - Bound to dead letter queue with routing key
#(all messages) - Shared across all services
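If the infrastructure ever needs to be recreated by hand (for example in a scratch environment), the equivalent declarations can be sketched with rabbitmqadmin. This is a sketch assuming a local broker with default credentials; in normal operation the services create these objects themselves:

```shell
# Declare the dead letter exchange (direct, durable)
rabbitmqadmin declare exchange name=attune.dlx type=direct durable=true

# Declare the dead letter queue (durable)
rabbitmqadmin declare queue name=attune.dlx.queue durable=true

# Bind the DLQ to the DLX
rabbitmqadmin declare binding source=attune.dlx destination=attune.dlx.queue routing_key="#"
```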
3. Dead Letter Queue
Configuration:
- Queue name: attune.dlx.queue
- Durable: true
- TTL: 24 hours (configurable via rabbitmq.dead_letter.ttl_ms)
Properties:
- Retains messages for debugging and analysis
- Messages auto-expire after retention period
- No DLX on the DLQ itself (prevents infinite loops)
4. Dead Letter Handler
Location: crates/executor/src/dead_letter_handler.rs
Responsibilities:
- Consume messages from attune.dlx.queue
- Deserialize message envelope
- Extract execution ID from payload
- Verify execution is in non-terminal state
- Update execution to FAILED status
- Add descriptive error information
- Acknowledge message (remove from DLQ)
Error Handling:
- Invalid messages: Acknowledged and discarded
- Missing executions: Acknowledged (already processed)
- Terminal state executions: Acknowledged (no action needed)
- Database errors: Nacked with requeue (retry later)
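The acking policy above can be sketched as a small decision function. The `Outcome` and `Disposition` names here are illustrative, not the actual types in dead_letter_handler.rs:

```rust
// Sketch of the DLQ acking policy: permanent outcomes are acked
// (removed from the DLQ), transient failures are nacked for retry.
#[derive(Debug, PartialEq)]
enum Outcome {
    InvalidMessage,   // could not deserialize
    ExecutionMissing, // already cleaned up elsewhere
    AlreadyTerminal,  // execution finished before TTL fired
    MarkedFailed,     // handler updated the execution to FAILED
    DatabaseError,    // transient infrastructure failure
}

#[derive(Debug, PartialEq)]
enum Disposition {
    Ack,         // remove the message from the DLQ
    NackRequeue, // leave it in the DLQ for a later retry
}

fn disposition(outcome: &Outcome) -> Disposition {
    match outcome {
        // These can never succeed on retry: acknowledge and move on
        Outcome::InvalidMessage
        | Outcome::ExecutionMissing
        | Outcome::AlreadyTerminal
        | Outcome::MarkedFailed => Disposition::Ack,
        // Transient: requeue so the handler retries later
        Outcome::DatabaseError => Disposition::NackRequeue,
    }
}

fn main() {
    assert_eq!(disposition(&Outcome::AlreadyTerminal), Disposition::Ack);
    assert_eq!(disposition(&Outcome::DatabaseError), Disposition::NackRequeue);
    println!("disposition policy ok");
}
```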
Configuration
RabbitMQ Configuration Structure
message_queue:
rabbitmq:
# Worker queue TTL - how long messages wait before DLX
worker_queue_ttl_ms: 300000 # 5 minutes (default)
# Dead letter configuration
dead_letter:
enabled: true # Enable DLQ system
exchange: attune.dlx # DLX name
ttl_ms: 86400000 # DLQ retention (24 hours)
Environment-Specific Settings
Development (config.development.yaml)
message_queue:
rabbitmq:
worker_queue_ttl_ms: 300000 # 5 minutes
dead_letter:
enabled: true
exchange: attune.dlx
ttl_ms: 86400000 # 24 hours
Production (config.docker.yaml)
message_queue:
rabbitmq:
worker_queue_ttl_ms: 300000 # 5 minutes
dead_letter:
enabled: true
exchange: attune.dlx
ttl_ms: 86400000 # 24 hours
Tuning Guidelines
Worker Queue TTL (worker_queue_ttl_ms):
- Too short: Legitimate slow workers may have executions failed prematurely
- Too long: Unavailable workers cause delayed failure detection
- Recommendation: 2-5x typical execution time, minimum 2 minutes
- Default (5 min): Good balance for most workloads
DLQ Retention (dead_letter.ttl_ms):
- Purpose: Debugging and forensics
- Too short: May lose data before analysis
- Too long: Accumulates stale data
- Recommendation: 24-48 hours in production
- Default (24 hours): Adequate for most troubleshooting
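The TTL recommendation can be expressed as a small helper. The 3x multiplier and 2-minute floor below are one reading of the "2-5x typical execution time, minimum 2 minutes" guidance, not values from the codebase:

```rust
/// Suggest a worker queue TTL from the typical execution time.
/// Illustrative only: follows the tuning guidance above.
fn suggested_worker_queue_ttl_ms(typical_execution_ms: u64) -> u64 {
    const MIN_TTL_MS: u64 = 120_000; // 2-minute floor
    (typical_execution_ms * 3).max(MIN_TTL_MS)
}

fn main() {
    // 100 s typical execution -> 300 s TTL (the 5-minute default)
    assert_eq!(suggested_worker_queue_ttl_ms(100_000), 300_000);
    // Very short executions still get the 2-minute floor
    assert_eq!(suggested_worker_queue_ttl_ms(10_000), 120_000);
    println!("ttl suggestions ok");
}
```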
Code Structure
Queue Declaration with TTL
// crates/common/src/mq/connection.rs
pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<u64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();
    // Route expired/rejected messages to the dead letter exchange
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );
    // Configure per-message TTL if specified
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            AMQPValue::LongLongInt(ttl as i64),
        );
    }
    // Declare the queue with the arguments above
    // (`channel` and `options` come from the surrounding method body)
    channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}
Dead Letter Handler
// crates/executor/src/dead_letter_handler.rs
pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<Mutex<bool>>,
}
impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        let pool = self.pool.clone();
        self.consumer
            .consume_with_handler(move |envelope| {
                let pool = pool.clone();
                async move {
                    match envelope.message_type {
                        MessageType::ExecutionRequested => {
                            handle_execution_requested(&pool, &envelope).await
                        }
                        _ => {
                            // Unexpected message type: acknowledge and discard
                            Ok(())
                        }
                    }
                }
            })
            .await
    }
}
async fn handle_execution_requested(
pool: &PgPool,
envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
// Extract execution ID
let execution_id = envelope.payload.get("execution_id")
.and_then(|v| v.as_i64())
.ok_or_else(|| /* error */)?;
// Fetch current state
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
// Only fail if in non-terminal state
if !execution.status.is_terminal() {
ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
status: Some(ExecutionStatus::Failed),
result: Some(json!({
"error": "Worker queue TTL expired",
"message": "Worker did not process execution within configured TTL",
})),
ended: Some(Some(Utc::now())),
..Default::default()
}).await?;
}
Ok(())
}
Integration with Executor Service
The dead letter handler is started automatically by the executor service if DLQ is enabled:
// crates/executor/src/service.rs
pub async fn start(&self) -> Result<()> {
// ... other components ...
// Start dead letter handler (if enabled)
if self.inner.mq_config.rabbitmq.dead_letter.enabled {
let dlq_name = format!("{}.queue",
self.inner.mq_config.rabbitmq.dead_letter.exchange);
let dlq_consumer = Consumer::new(
&self.inner.mq_connection,
create_dlq_consumer_config(&dlq_name, "executor.dlq"),
).await?;
let dlq_handler = Arc::new(
DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
);
handles.push(tokio::spawn(async move {
dlq_handler.start().await
}));
}
// ... wait for completion ...
}
Operational Considerations
Monitoring
Key Metrics:
- DLQ message rate (messages/sec entering DLQ)
- DLQ queue depth (current messages in DLQ)
- DLQ processing latency (time from DLX to handler)
- Failed execution count (executions failed via DLQ)
Alerting Thresholds:
- DLQ rate > 10/min: Workers may be unhealthy or TTL too aggressive
- DLQ depth > 100: Handler may be falling behind
- High failure rate: Systematic worker availability issues
RabbitMQ Management
View DLQ:
# List messages in DLQ
rabbitmqadmin list queues name messages
# Inspect the DLQ specifically
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Purge DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue
View Dead Letters:
# Get message from DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1
# Check message death history
# Look for x-death header in message properties
Troubleshooting
High DLQ Rate
Symptoms: Many executions failing via DLQ
Causes:
- Workers down or restarting frequently
- Worker queue TTL too aggressive
- Worker overloaded (not consuming fast enough)
- Network issues between executor and workers
Resolution:
- Check worker health and logs
- Verify worker heartbeats in database
- Consider increasing worker_queue_ttl_ms
- Scale worker fleet if overloaded
DLQ Handler Not Processing
Symptoms: DLQ depth increasing, executions stuck
Causes:
- Executor service not running
- DLQ disabled in configuration
- Database connection issues
- Handler crashed or deadlocked
Resolution:
- Check executor service logs
- Verify dead_letter.enabled: true
- Check database connectivity
- Restart executor service if needed
Messages Not Reaching DLQ
Symptoms: Executions stuck, DLQ empty
Causes:
- Worker queues not configured with DLX
- DLX exchange not created
- DLQ not bound to DLX
- TTL not configured on worker queues
Resolution:
- Restart services to recreate infrastructure
- Verify RabbitMQ configuration
- Check queue properties in RabbitMQ management UI
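To confirm the DLX and TTL arguments are actually set on the worker queues, the queue arguments can also be listed from the command line (assuming broker access with default credentials):

```shell
# Show the x-dead-letter-exchange / x-message-ttl arguments on each queue
rabbitmqctl list_queues name arguments

# Or via the management plugin
rabbitmqadmin list queues name arguments
```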
Testing
Unit Tests
#[tokio::test]
async fn test_expired_execution_handling() {
let pool = setup_test_db().await;
// Create execution in SCHEDULED state
let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;
// Simulate DLQ message
let envelope = MessageEnvelope::new(
MessageType::ExecutionRequested,
json!({ "execution_id": execution.id }),
);
// Process message
handle_execution_requested(&pool, &envelope).await.unwrap();
// Verify execution failed
let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
assert_eq!(updated.status, ExecutionStatus::Failed);
assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}
Integration Tests
# 1. Start all services
docker compose up -d
# 2. Create execution targeting a stopped worker (worker_id 999 does not exist)
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'
# 3. Wait for TTL expiration (5+ minutes)
sleep 330
# 4. Verify execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"
# 5. Check DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
Relationship to Other Phases
Phase 1 (Completed)
- Execution timeout monitor: Handles executions stuck in SCHEDULED
- Graceful shutdown: Prevents new tasks from being dispatched to stopping workers
- Reduced heartbeat: Faster stale worker detection
Interaction: Phase 1 timeout monitor acts as a backstop if DLQ processing fails
Phase 2 (Current)
- Worker queue TTL: Automatic message expiration
- Dead letter queue: Capture expired messages
- Dead letter handler: Process and fail expired executions
Benefit: More precise failure detection at the message queue level
Phase 3 (Planned)
- Health probes: Proactive worker health checking
- Intelligent retry: Retry transient failures
- Load balancing: Distribute work across healthy workers
Integration: Phase 3 will use Phase 2 DLQ data to inform routing decisions
Benefits
- Automatic Failure Detection: No manual intervention needed for unavailable workers
- Precise Timing: TTL provides exact failure window (vs polling-based Phase 1)
- Resource Efficiency: Prevents message accumulation in worker queues
- Debugging Support: DLQ retains messages for forensic analysis
- Graceful Degradation: System continues functioning even with worker failures
Limitations
- TTL Precision: RabbitMQ TTL is approximate, not guaranteed to the millisecond
- Race Conditions: Worker may start processing just as TTL expires (rare)
- DLQ Capacity: Very high failure rates may overwhelm DLQ
- No Retry Logic: Phase 2 always fails; Phase 3 will add intelligent retry
Future Enhancements (Phase 3)
- Conditional Retry: Retry messages based on failure reason
- Priority DLQ: Prioritize critical execution failures
- DLQ Analytics: Aggregate statistics on failure patterns
- Auto-scaling: Scale workers based on DLQ rate
- Custom TTL: Per-action or per-execution TTL configuration
References
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
- Phase 1 Documentation: docs/architecture/worker-availability-handling.md
- Queue Architecture: docs/architecture/queue-architecture.md