# Worker Queue TTL and Dead Letter Queue (Phase 2)
## Overview
Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.
## Architecture
### Message Flow
```
┌─────────────┐
│  Executor   │
│  Scheduler  │
└──────┬──────┘
       │ Publishes ExecutionRequested
       │ routing_key: execution.dispatch.worker.{id}
       ▼
┌──────────────────────────────────┐
│  worker.{id}.executions queue    │
│                                  │
│  Properties:                     │
│  - x-message-ttl: 300000ms (5m)  │
│  - x-dead-letter-exchange: dlx   │
└──────┬───────────────────┬───────┘
       │                   │
       │ Worker consumes   │ TTL expires
       │ (normal flow)     │ (worker unavailable)
       │                   │
       ▼                   ▼
┌──────────────┐    ┌──────────────────┐
│    Worker    │    │    attune.dlx    │
│    Service   │    │   (Dead Letter   │
│              │    │     Exchange)    │
└──────────────┘    └────────┬─────────┘
                             │ Routes to DLQ
                             ▼
                  ┌──────────────────────┐
                  │   attune.dlx.queue   │
                  │ (Dead Letter Queue)  │
                  └────────┬─────────────┘
                           │ Consumes
                           ▼
                  ┌──────────────────────┐
                  │ Dead Letter Handler  │
                  │    (in Executor)     │
                  │                      │
                  │ - Identifies exec    │
                  │ - Marks as FAILED    │
                  │ - Logs failure       │
                  └──────────────────────┘
```
### Components
#### 1. Worker Queue TTL
**Configuration:**

- Default: 5 minutes (300,000 milliseconds)
- Configurable via `rabbitmq.worker_queue_ttl_ms`

**Implementation:**

- Applied during queue declaration in `Connection::setup_worker_infrastructure()`
- Uses RabbitMQ's `x-message-ttl` queue argument
- Only applies to worker-specific queues (`worker.{id}.executions`)

**Behavior:**

- When a message remains in the queue longer than the TTL, RabbitMQ automatically moves it to the configured dead letter exchange
- Original message properties and headers are preserved
- An `x-death` header is added with expiration details
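The two queue arguments involved can be assembled independently of any AMQP client. A minimal sketch, using a plain `BTreeMap` to stand in for the `FieldTable` the real declaration code passes to RabbitMQ:

```rust
use std::collections::BTreeMap;

/// Build the RabbitMQ arguments for a worker-specific queue.
/// Illustrative only: the actual code uses lapin's `FieldTable`,
/// but the key/value pairs are the same.
fn worker_queue_args(dlx: &str, ttl_ms: Option<u64>) -> BTreeMap<String, String> {
    let mut args = BTreeMap::new();
    // Expired (or rejected) messages are re-published to this exchange
    args.insert("x-dead-letter-exchange".to_string(), dlx.to_string());
    // Messages older than the TTL are dead-lettered automatically
    if let Some(ttl) = ttl_ms {
        args.insert("x-message-ttl".to_string(), ttl.to_string());
    }
    args
}

fn main() {
    let args = worker_queue_args("attune.dlx", Some(300_000));
    assert_eq!(args.get("x-message-ttl").unwrap(), "300000");
    assert_eq!(args.get("x-dead-letter-exchange").unwrap(), "attune.dlx");
    println!("{:?}", args);
}
```

Omitting the TTL yields a queue that still dead-letters rejected messages but never expires them, which is why the TTL argument is optional in the declaration helper below.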
#### 2. Dead Letter Exchange (DLX)
**Configuration:**

- Exchange name: `attune.dlx`
- Type: `direct`
- Durable: `true`

**Setup:**

- Created in `Connection::setup_common_infrastructure()`
- Bound to the dead letter queue with routing key `#` (all messages)
- Shared across all services
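The services declare this infrastructure themselves on startup, but for experimentation the same topology can be created by hand with `rabbitmqadmin` (default vhost and credentials assumed):

```bash
# Declare the dead letter exchange (idempotent; safe to re-run)
rabbitmqadmin declare exchange name=attune.dlx type=direct durable=true

# Declare the dead letter queue
rabbitmqadmin declare queue name=attune.dlx.queue durable=true

# Bind the queue to the exchange
rabbitmqadmin declare binding source=attune.dlx destination=attune.dlx.queue routing_key="#"
```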
#### 3. Dead Letter Queue
**Configuration:**

- Queue name: `attune.dlx.queue`
- Durable: `true`
- TTL: 24 hours (configurable via `rabbitmq.dead_letter.ttl_ms`)

**Properties:**

- Retains messages for debugging and analysis
- Messages auto-expire after the retention period
- No DLX on the DLQ itself (prevents infinite dead-letter loops)
#### 4. Dead Letter Handler
**Location:** `crates/executor/src/dead_letter_handler.rs`

**Responsibilities:**

1. Consume messages from `attune.dlx.queue`
2. Deserialize the message envelope
3. Extract the execution ID from the payload
4. Verify the execution is in a non-terminal state
5. Update the execution to FAILED status
6. Add descriptive error information
7. Acknowledge the message (remove it from the DLQ)

**Error Handling:**

- Invalid messages: acknowledged and discarded
- Missing executions: acknowledged (already processed)
- Terminal-state executions: acknowledged (no action needed)
- Database errors: nacked with requeue (retried later)
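The acknowledgement policy above is a small pure decision: only transient infrastructure failures are worth retrying. A sketch of how it might be encoded (the error and action types here are illustrative, not the crate's actual API):

```rust
/// Outcome for a dead-lettered message (illustrative type).
#[derive(Debug, PartialEq)]
enum DlqAction {
    Ack,         // remove from the DLQ
    NackRequeue, // leave in the DLQ for a later retry
}

/// Failure categories the handler can encounter (illustrative type).
#[derive(Debug)]
enum DlqError {
    InvalidMessage,    // undeserializable envelope
    ExecutionNotFound, // already cleaned up elsewhere
    AlreadyTerminal,   // nothing left to do
    Database,          // transient infrastructure failure
}

/// Only database errors are requeued; everything else is acknowledged
/// so the DLQ cannot fill up with unprocessable "poison" messages.
fn decide(outcome: Result<(), DlqError>) -> DlqAction {
    match outcome {
        Ok(()) => DlqAction::Ack,
        Err(DlqError::Database) => DlqAction::NackRequeue,
        Err(_) => DlqAction::Ack,
    }
}

fn main() {
    assert_eq!(decide(Ok(())), DlqAction::Ack);
    assert_eq!(decide(Err(DlqError::Database)), DlqAction::NackRequeue);
    assert_eq!(decide(Err(DlqError::InvalidMessage)), DlqAction::Ack);
}
```

Keeping the decision in one place makes the ack/nack behavior easy to unit-test without a broker.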
## Configuration
### RabbitMQ Configuration Structure
```yaml
message_queue:
  rabbitmq:
    # Worker queue TTL - how long messages wait before DLX
    worker_queue_ttl_ms: 300000  # 5 minutes (default)

    # Dead letter configuration
    dead_letter:
      enabled: true         # Enable DLQ system
      exchange: attune.dlx  # DLX name
      ttl_ms: 86400000      # DLQ retention (24 hours)
```
### Environment-Specific Settings
#### Development (`config.development.yaml`)
```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours
```
#### Production (`config.docker.yaml`)
```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours
```
### Tuning Guidelines
**Worker Queue TTL (`worker_queue_ttl_ms`):**

- **Too short:** Legitimate slow workers may have executions failed prematurely
- **Too long:** Unavailable workers cause delayed failure detection
- **Recommendation:** 2-5x typical execution time, minimum 2 minutes
- **Default (5 min):** Good balance for most workloads

**DLQ Retention (`dead_letter.ttl_ms`):**

- **Purpose:** Debugging and forensics
- **Too short:** May lose data before analysis
- **Too long:** Accumulates stale data
- **Recommendation:** 24-48 hours in production
- **Default (24 hours):** Adequate for most troubleshooting
## Code Structure
### Queue Declaration with TTL
```rust
// crates/common/src/mq/connection.rs
pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<u64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();

    // Configure DLX: expired messages are re-published to this exchange
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );

    // Configure TTL if specified
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            AMQPValue::LongLongInt(ttl as i64),
        );
    }

    // Declare the queue with the arguments
    self.channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}
```
### Dead Letter Handler
```rust
// crates/executor/src/dead_letter_handler.rs
pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<Mutex<bool>>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        let pool = self.pool.clone();
        self.consumer.consume_with_handler(move |envelope| {
            let pool = pool.clone();
            async move {
                match envelope.message_type {
                    MessageType::ExecutionRequested => {
                        handle_execution_requested(&pool, &envelope).await
                    }
                    _ => {
                        // Unexpected message type - acknowledge and discard
                        Ok(())
                    }
                }
            }
        }).await
    }
}

async fn handle_execution_requested(
    pool: &PgPool,
    envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
    // Extract the execution ID
    let execution_id = envelope.payload.get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or_else(|| /* error */)?;

    // Fetch the current state
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Only fail if in a non-terminal state
    if !execution.status.is_terminal() {
        ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
            status: Some(ExecutionStatus::Failed),
            result: Some(json!({
                "error": "Worker queue TTL expired",
                "message": "Worker did not process execution within configured TTL",
            })),
            ended: Some(Some(Utc::now())),
            ..Default::default()
        }).await?;
    }
    Ok(())
}
```
## Integration with Executor Service
The dead letter handler is started automatically by the executor service if DLQ is enabled:
```rust
// crates/executor/src/service.rs
pub async fn start(&self) -> Result<()> {
    // ... other components ...

    // Start the dead letter handler (if enabled)
    if self.inner.mq_config.rabbitmq.dead_letter.enabled {
        let dlq_name = format!("{}.queue",
            self.inner.mq_config.rabbitmq.dead_letter.exchange);
        let dlq_consumer = Consumer::new(
            &self.inner.mq_connection,
            create_dlq_consumer_config(&dlq_name, "executor.dlq"),
        ).await?;
        let dlq_handler = Arc::new(
            DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
        );
        handles.push(tokio::spawn(async move {
            dlq_handler.start().await
        }));
    }

    // ... wait for completion ...
}
```
## Operational Considerations
### Monitoring
**Key Metrics:**

- DLQ message rate (messages/sec entering the DLQ)
- DLQ queue depth (current messages in the DLQ)
- DLQ processing latency (time from DLX to handler)
- Failed execution count (executions failed via the DLQ)

**Alerting Thresholds:**

- DLQ rate > 10/min: workers may be unhealthy or the TTL too aggressive
- DLQ depth > 100: the handler may be falling behind
- High failure rate: systematic worker availability issues
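Both depth and rate can be scraped from the RabbitMQ management HTTP API (default port 15672; the credentials and vhost here are assumptions for a local setup):

```bash
# Current DLQ depth - messages waiting for the handler
curl -s -u guest:guest \
  "http://localhost:15672/api/queues/%2F/attune.dlx.queue" | jq '.messages'

# Incoming dead-letter rate (messages/s entering the DLQ)
curl -s -u guest:guest \
  "http://localhost:15672/api/queues/%2F/attune.dlx.queue" \
  | jq '.message_stats.publish_details.rate'
```

These are the same numbers a Prometheus exporter or the management UI would report; polling them on a schedule is enough to drive the alert thresholds above.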
### RabbitMQ Management
**View DLQ:**
```bash
# List all queues with their message counts
rabbitmqadmin list queues name messages

# Get DLQ details
rabbitmqadmin list queues name messages messages_ready messages_unacknowledged | grep attune.dlx.queue

# Purge the DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue
```
**View Dead Letters:**
```bash
# Get message from DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1
# Check message death history
# Look for x-death header in message properties
```
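An expired message's `x-death` header looks roughly like this (the field set follows the RabbitMQ DLX documentation; queue, exchange, and timestamp values are illustrative):

```json
"x-death": [
  {
    "count": 1,
    "reason": "expired",
    "queue": "worker.42.executions",
    "exchange": "attune.execution",
    "routing-keys": ["execution.dispatch.worker.42"],
    "time": "2024-01-01T00:05:00Z"
  }
]
```

The `reason` field distinguishes TTL expiration (`expired`) from consumer rejection (`rejected`), and `queue` identifies which worker queue the message timed out in.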
### Troubleshooting
#### High DLQ Rate
**Symptoms:** Many executions failing via the DLQ

**Causes:**

1. Workers down or restarting frequently
2. Worker queue TTL too aggressive
3. Workers overloaded (not consuming fast enough)
4. Network issues between the executor and workers

**Resolution:**

1. Check worker health and logs
2. Verify worker heartbeats in the database
3. Consider increasing `worker_queue_ttl_ms`
4. Scale the worker fleet if overloaded
#### DLQ Handler Not Processing
**Symptoms:** DLQ depth increasing, executions stuck

**Causes:**

1. Executor service not running
2. DLQ disabled in configuration
3. Database connection issues
4. Handler crashed or deadlocked

**Resolution:**

1. Check the executor service logs
2. Verify `dead_letter.enabled: true`
3. Check database connectivity
4. Restart the executor service if needed
#### Messages Not Reaching DLQ
**Symptoms:** Executions stuck, DLQ empty

**Causes:**

1. Worker queues not configured with a DLX
2. DLX exchange not created
3. DLQ not bound to the DLX
4. TTL not configured on worker queues

**Resolution:**

1. Restart services to recreate the infrastructure
2. Verify the RabbitMQ configuration
3. Check queue properties in the RabbitMQ management UI
## Testing
### Unit Tests
```rust
#[tokio::test]
async fn test_expired_execution_handling() {
    let pool = setup_test_db().await;

    // Create an execution in SCHEDULED state
    let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;

    // Simulate a DLQ message
    let envelope = MessageEnvelope::new(
        MessageType::ExecutionRequested,
        json!({ "execution_id": execution.id }),
    );

    // Process the message
    handle_execution_requested(&pool, &envelope).await.unwrap();

    // Verify the execution failed
    let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
    assert_eq!(updated.status, ExecutionStatus::Failed);
    assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}
```
### Integration Tests
```bash
# 1. Start all services
docker compose up -d

# 2. Create an execution targeting a stopped worker
#    (worker_id 999 does not exist)
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Verify the execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"

# 5. Check that the DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```
## Relationship to Other Phases
### Phase 1 (Completed)

- Execution timeout monitor: handles executions stuck in SCHEDULED
- Graceful shutdown: prevents new tasks from being dispatched to stopping workers
- Reduced heartbeat interval: faster stale-worker detection

**Interaction:** The Phase 1 timeout monitor acts as a backstop if DLQ processing fails

### Phase 2 (Current)

- Worker queue TTL: automatic message expiration
- Dead letter queue: captures expired messages
- Dead letter handler: processes and fails expired executions

**Benefit:** More precise failure detection at the message queue level

### Phase 3 (Planned)

- Health probes: proactive worker health checking
- Intelligent retry: retry transient failures
- Load balancing: distribute work across healthy workers

**Integration:** Phase 3 will use Phase 2 DLQ data to inform routing decisions
## Benefits
1. **Automatic Failure Detection:** No manual intervention needed for unavailable workers
2. **Precise Timing:** TTL provides exact failure window (vs polling-based Phase 1)
3. **Resource Efficiency:** Prevents message accumulation in worker queues
4. **Debugging Support:** DLQ retains messages for forensic analysis
5. **Graceful Degradation:** System continues functioning even with worker failures
## Limitations
1. **TTL Precision:** RabbitMQ TTL is approximate, not guaranteed to the millisecond
2. **Race Conditions:** Worker may start processing just as TTL expires (rare)
3. **DLQ Capacity:** Very high failure rates may overwhelm DLQ
4. **No Retry Logic:** Phase 2 always fails; Phase 3 will add intelligent retry
## Future Enhancements (Phase 3)
- **Conditional Retry:** Retry messages based on failure reason
- **Priority DLQ:** Prioritize critical execution failures
- **DLQ Analytics:** Aggregate statistics on failure patterns
- **Auto-scaling:** Scale workers based on DLQ rate
- **Custom TTL:** Per-action or per-execution TTL configuration
## References
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`