more internal polish, resilient workers
322
docs/QUICKREF-worker-queue-ttl-dlq.md
Normal file
@@ -0,0 +1,322 @@
# Quick Reference: Worker Queue TTL and Dead Letter Queue (Phase 2)

## Overview

Phase 2 implements message TTL on worker queues and dead letter queue processing to automatically fail executions when workers are unavailable.

**Key Concept:** If no worker consumes an execution's message within 5 minutes (the default TTL), the message expires and the execution is automatically marked as FAILED.

## How It Works

```
Execution → Worker Queue (TTL: 5 min) → Worker Processing ✓
                  ↓ (if timeout)
            Dead Letter Exchange
                  ↓
            Dead Letter Queue
                  ↓
            DLQ Handler (in Executor)
                  ↓
            Execution marked FAILED
```

## Configuration

### Default Settings (All Environments)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours DLQ retention
```

### Tuning TTL

**Worker Queue TTL** (`worker_queue_ttl_ms`):
- **Default:** 300000 (5 minutes)
- **Purpose:** How long a message may wait in a worker queue before the worker is treated as unavailable
- **Tuning:** Set to 2-5x your typical execution time
- **Too short:** Executions queued behind slow ones fail prematurely
- **Too long:** Delayed failure detection for unavailable workers

**DLQ Retention** (`dead_letter.ttl_ms`):
- **Default:** 86400000 (24 hours)
- **Purpose:** How long to keep expired messages for debugging
- **Tuning:** Based on your debugging/forensics needs
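
The 2-5x guideline above is simple arithmetic, sketched here as a small shell helper. `suggest_ttl_ms` is a hypothetical name used only for illustration and is not part of the project; it assumes the typical execution time is given in whole seconds.

```bash
# Hypothetical helper (not part of the project): print the lower (2x) and
# upper (5x) worker_queue_ttl_ms values suggested by the tuning guideline.
# Usage: suggest_ttl_ms <typical_execution_seconds>
suggest_ttl_ms() {
  local typical_s="$1"
  # Convert seconds to milliseconds and apply the 2x and 5x multipliers.
  echo "$(( typical_s * 2 * 1000 )) $(( typical_s * 5 * 1000 ))"
}
```

For example, `suggest_ttl_ms 60` prints `120000 300000`, so the 5-minute default sits at the upper end of the range for executions that typically take about a minute.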

## Components

### 1. Worker Queue TTL

- Applied to all `worker.{id}.executions` queues
- Configured via RabbitMQ queue argument `x-message-ttl`
- Messages expire if not consumed within TTL
- Expired messages are routed to the dead letter exchange
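
To make the two queue arguments concrete, here is a minimal sketch that builds them as JSON. The helper name `worker_queue_args` is hypothetical; the project itself declares its queues in Rust (`crates/common/src/mq/connection.rs`), so this is only useful for experimenting against a scratch broker. The commented `declare` line assumes the classic `rabbitmqadmin` CLI and a durable queue, neither of which is confirmed by this document.

```bash
# Hypothetical helper: build the queue-arguments JSON for a worker queue.
# Usage: worker_queue_args <ttl_ms> <dead_letter_exchange>
worker_queue_args() {
  printf '{"x-message-ttl": %d, "x-dead-letter-exchange": "%s"}\n' "$1" "$2"
}

# Example (assumption: scratch broker, classic rabbitmqadmin, durable queue):
#   rabbitmqadmin declare queue name=worker.1.executions durable=true \
#     arguments="$(worker_queue_args 300000 attune.dlx)"
```

With the defaults, `worker_queue_args 300000 attune.dlx` prints `{"x-message-ttl": 300000, "x-dead-letter-exchange": "attune.dlx"}`.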

### 2. Dead Letter Exchange (DLX)

- **Name:** `attune.dlx`
- **Type:** `direct`
- Receives all expired messages from worker queues
- Routes to dead letter queue

### 3. Dead Letter Queue (DLQ)

- **Name:** `attune.dlx.queue`
- Stores expired messages for processing
- Retains messages for 24 hours (configurable)
- Processed by dead letter handler

### 4. Dead Letter Handler

- Runs in executor service
- Consumes messages from DLQ
- Updates executions to FAILED status
- Provides descriptive error messages

## Monitoring

### Key Metrics

```bash
# Check DLQ depth
rabbitmqadmin list queues name messages | grep attune.dlx.queue

# View DLQ rate
# Watch for sustained DLQ message rate > 10/min

# Check failed executions (URL quoted so the shell does not glob on "?")
curl "http://localhost:8080/api/v1/executions?status=failed"
```

### Health Checks

**Good:**
- DLQ depth: 0-10
- DLQ rate: < 5 messages/min
- Most executions complete successfully

**Warning:**
- DLQ depth: 10-100
- DLQ rate: 5-20 messages/min
- May indicate worker instability

**Critical:**
- DLQ depth: > 100
- DLQ rate: > 20 messages/min
- Workers likely down or overloaded
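
The bands above can be turned into a small check for scripts or alerts. This is a sketch under stated assumptions, not project code: `dlq_health` is a hypothetical name, inputs are integers, the worse of the two signals wins, and a depth of exactly 10 (listed in both the Good and Warning bands) is treated as good.

```bash
# Hypothetical helper: classify DLQ health from depth and rate (messages/min).
# Assumptions: integer inputs; depth 10 counts as good; the worse signal wins.
# Usage: dlq_health <depth> <rate_per_min>
dlq_health() {
  local depth="$1" rate="$2"
  if [ "$depth" -gt 100 ] || [ "$rate" -gt 20 ]; then
    echo critical
  elif [ "$depth" -gt 10 ] || [ "$rate" -ge 5 ]; then
    echo warning
  else
    echo good
  fi
}
```

For example, `dlq_health 50 2` prints `warning` (depth in the 10-100 band) and `dlq_health 5 25` prints `critical` (rate above 20/min).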

## Troubleshooting

### High DLQ Rate

**Symptoms:** Many executions failing via DLQ

**Common Causes:**
1. Workers stopped or restarting
2. Workers overloaded (not consuming fast enough)
3. TTL too aggressive for your workload
4. Network connectivity issues

**Resolution:**
```bash
# 1. Check worker status
docker compose ps | grep worker
docker compose logs -f worker-shell

# 2. Verify worker heartbeats
psql -c "SELECT name, status, last_heartbeat FROM worker;"

# 3. Check worker queue depths
rabbitmqadmin list queues name messages | grep "worker\."

# 4. Consider increasing TTL if executions are legitimately slow
# Edit config, e.g.:
#   worker_queue_ttl_ms: 600000  # 10 minutes
# Existing queues keep the TTL they were created with, so recreate them
# as described under "Adjust TTL Without Restart".
```

### DLQ Not Processing

**Symptoms:** DLQ depth increasing, executions stuck

**Common Causes:**
1. Executor service not running
2. DLQ disabled in config
3. Database connection issues

**Resolution:**
```bash
# 1. Verify executor is running
docker compose ps executor
docker compose logs -f executor | grep "dead letter"

# 2. Check configuration
grep -A 3 "dead_letter:" config.docker.yaml

# 3. Restart executor if needed
docker compose restart executor
```

### Messages Not Expiring

**Symptoms:** Executions stuck in SCHEDULED, DLQ empty

**Common Causes:**
1. Worker queues not configured with TTL
2. Worker queues not configured with DLX
3. Infrastructure setup failed

**Resolution:**
```bash
# 1. Check queue properties (the arguments column holds TTL and DLX settings)
rabbitmqadmin list queues name arguments | grep "worker\."

# Look for:
# - x-message-ttl: 300000
# - x-dead-letter-exchange: attune.dlx

# 2. Recreate infrastructure (safe, idempotent)
# Note: RabbitMQ does not change arguments on an existing queue; a queue
# declared without TTL/DLX has to be deleted before it can be redeclared.
docker compose restart executor worker-shell
```

## Testing

### Manual Test: Verify TTL Expiration

```bash
# 1. Stop all workers
docker compose stop worker-shell worker-python worker-node

# 2. Create execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"}
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Check execution status
# Set EXECUTION_ID to the id returned in step 2 (a literal {id} in the URL
# would be expanded by curl's URL globbing).
curl -s "http://localhost:8080/api/v1/executions/$EXECUTION_ID" | jq '.data.status'
# Should be "failed"

# 5. Check error message
curl -s "http://localhost:8080/api/v1/executions/$EXECUTION_ID" | jq '.data.result'
# Should contain "Worker queue TTL expired"

# 6. Verify DLQ processed it
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```
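
Instead of a fixed `sleep`, the wait in step 3 can be written as a polling loop that returns as soon as the status changes. The sketch below is illustrative only: `poll_until` and `execution_status` are hypothetical names, the interval must be a positive whole number of seconds, and the commented usage assumes `EXECUTION_ID` holds the id returned when the execution was created.

```bash
# Hypothetical helper: run a command repeatedly until its output equals the
# expected string, or give up once the timeout has elapsed.
# Usage: poll_until <expected> <timeout_s> <interval_s> <command> [args...]
poll_until() {
  local expected="$1" timeout_s="$2" interval_s="$3"
  shift 3
  local waited=0
  while [ "$waited" -le "$timeout_s" ]; do
    # Compare the command's stdout with the expected value.
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      return 0
    fi
    sleep "$interval_s"
    waited=$(( waited + interval_s ))
  done
  return 1
}

# Example (assumes EXECUTION_ID is set; jq -r strips the JSON quotes):
#   execution_status() {
#     curl -s "http://localhost:8080/api/v1/executions/$EXECUTION_ID" | jq -r '.data.status'
#   }
#   poll_until failed 360 10 execution_status && echo "execution failed as expected"
```

Polling keeps the test short when the TTL has been lowered for testing, while the timeout still bounds the wait if the message never expires.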

## Relationship to Phase 1

**Phase 1 (Timeout Monitor):**
- Monitors executions in SCHEDULED state
- Fails executions after configured timeout
- Acts as backup safety net

**Phase 2 (Queue TTL + DLQ):**
- Expires messages at queue level
- More precise failure detection
- Provides better visibility (DLQ metrics)

**Together:** Provide defense-in-depth for worker unavailability

## Common Operations

### View DLQ Messages

```bash
# Get messages from DLQ (doesn't remove)
rabbitmqadmin get queue=attune.dlx.queue count=10

# View x-death header for expiration details
rabbitmqadmin get queue=attune.dlx.queue count=1 --format=long
```

### Manually Purge DLQ

```bash
# Use with caution - removes all messages
rabbitmqadmin purge queue name=attune.dlx.queue
```

### Temporarily Disable DLQ

```yaml
# config.docker.yaml
message_queue:
  rabbitmq:
    dead_letter:
      enabled: false  # Disables DLQ handler
```

**Note:** Messages still expire, but no handler consumes them, so Phase 2 does not mark the affected executions as FAILED.

### Adjust TTL Without Restart

Not possible: `x-message-ttl` is a queue argument fixed when the queue is declared, and RabbitMQ rejects redeclaring an existing queue with different arguments. To change it:

```bash
# 1. Stop the services that use the worker queues
#    (RabbitMQ must stay up so step 2 can reach it)
docker compose stop executor worker-shell worker-python worker-node

# 2. Delete worker queues (forces recreation)
rabbitmqadmin delete queue name=worker.1.executions
# Repeat for all worker queues

# 3. Update config
# Edit worker_queue_ttl_ms

# 4. Restart services (queues recreated with new TTL)
docker compose up -d
```

## Key Files

### Configuration
- `config.docker.yaml` - Production settings
- `config.development.yaml` - Development settings

### Implementation
- `crates/common/src/mq/config.rs` - TTL configuration
- `crates/common/src/mq/connection.rs` - Queue setup with TTL
- `crates/executor/src/dead_letter_handler.rs` - DLQ processing
- `crates/executor/src/service.rs` - DLQ handler integration

### Documentation
- `docs/architecture/worker-queue-ttl-dlq.md` - Full architecture
- `docs/architecture/worker-availability-handling.md` - Phase 1 (backup)

## When to Use

**Enable DLQ (default):**
- Production environments
- Development with multiple workers
- Any environment requiring high reliability

**Disable DLQ:**
- Local development with single worker
- Testing scenarios where you want manual control
- Debugging worker behavior

## Next Steps (Phase 3)

- **Health probes:** Proactive worker health checking
- **Intelligent retry:** Retry transient failures
- **Per-action TTL:** Custom timeouts per action type
- **DLQ analytics:** Aggregate failure statistics

## See Also

- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html