more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions
--- a/work-summary/2026-02-09-worker-queue-ttl-phase2.md
+++ b/work-summary/2026-02-09-worker-queue-ttl-phase2.md
@@ -0,0 +1,273 @@
+# Work Summary: Worker Queue TTL and Dead Letter Queue (Phase 2)
+
+**Date:** 2026-02-09  
+**Author:** AI Assistant  
+**Phase:** Worker Availability Handling - Phase 2
+
+## Overview
+
+Implemented Phase 2 of worker availability handling: message TTL (time-to-live) on worker queues and dead letter queue (DLQ) processing. This ensures executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.
+
+## Motivation
+
+Phase 1 (timeout monitor) provided a safety net by periodically checking for stale SCHEDULED executions. Phase 2 adds message-level expiration at the queue layer, providing:
+
+1. **More precise timing:** Messages expire exactly after TTL (vs polling interval)
+2. **Better visibility:** DLQ metrics show worker availability issues
+3. **Resource efficiency:** Prevents message accumulation in dead worker queues
+4. **Forensics support:** Expired messages retained in DLQ for debugging
+
+## Changes Made
+
+### 1. Configuration Updates
+
+**Added TTL Configuration:**
+- `crates/common/src/mq/config.rs`:
+  - Added `worker_queue_ttl_ms` field to `RabbitMqConfig` (default: 5 minutes)
+  - Added `worker_queue_ttl()` helper method
+  - Added test for TTL configuration
+
+**Updated Environment Configs:**
+- `config.docker.yaml`: Added RabbitMQ TTL and DLQ settings
+- `config.development.yaml`: Added RabbitMQ TTL and DLQ settings
+
+### 2. Queue Infrastructure
+
+**Enhanced Queue Declaration:**
+- `crates/common/src/mq/connection.rs`:
+  - Added `declare_queue_with_dlx_and_ttl()` method
+  - Updated `declare_queue_with_dlx()` to call new method
+  - Added `declare_queue_with_optional_dlx_and_ttl()` helper
+  - Updated `setup_worker_infrastructure()` to apply TTL to worker queues
+  - Added warning for queues with TTL but no DLX
+
+**Queue Arguments Added:**
+- `x-message-ttl`: Message expiration time (milliseconds)
+- `x-dead-letter-exchange`: Target exchange for expired messages
+
+### 3. Dead Letter Handler
+
+**New Module:** `crates/executor/src/dead_letter_handler.rs`
+
+**Components:**
+- `DeadLetterHandler` struct: Manages DLQ consumption and processing
+- `handle_execution_requested()`: Processes expired execution messages
+- `create_dlq_consumer_config()`: Creates consumer configuration
+
+**Behavior:**
+- Consumes from `attune.dlx.queue`
+- Extracts execution ID from message payload
+- Verifies execution is in non-terminal state (SCHEDULED or RUNNING)
+- Updates execution to FAILED with descriptive error
+- Handles edge cases (missing execution, already terminal, database errors)
+
+**Error Handling:**
+- Invalid messages: Acknowledged and discarded
+- Missing executions: Acknowledged (already processed)
+- Terminal state executions: Acknowledged (no action needed)
+- Database errors: Nacked with requeue for retry
+
+### 4. Service Integration
+
+**Executor Service:**
+- `crates/executor/src/service.rs`:
+  - Integrated `DeadLetterHandler` into startup sequence
+  - Creates DLQ consumer if `dead_letter.enabled = true`
+  - Spawns DLQ handler as background task
+  - Logs DLQ handler status at startup
+
+**Module Declarations:**
+- `crates/executor/src/lib.rs`: Added public exports
+- `crates/executor/src/main.rs`: Added module declaration
+
+### 5. Documentation
+
+**Architecture Documentation:**
+- `docs/architecture/worker-queue-ttl-dlq.md`: Comprehensive 493-line guide
+  - Message flow diagrams
+  - Component descriptions
+  - Configuration reference
+  - Code structure examples
+  - Operational considerations
+  - Monitoring and troubleshooting
+
+**Quick Reference:**
+- `docs/QUICKREF-worker-queue-ttl-dlq.md`: 322-line practical guide
+  - Configuration examples
+  - Monitoring commands
+  - Troubleshooting procedures
+  - Testing procedures
+  - Common operations
+
+## Technical Details
+
+### Message Flow
+
+```
+Executor → worker.{id}.executions (TTL: 5min) → Worker ✓
+                     ↓ (timeout)
+              attune.dlx (DLX)
+                     ↓
+           attune.dlx.queue (DLQ)
+                     ↓
+         Dead Letter Handler → Execution FAILED
+```
+
+### Configuration Structure
+
+```yaml
+message_queue:
+  rabbitmq:
+    worker_queue_ttl_ms: 300000  # 5 minutes
+    dead_letter:
+      enabled: true
+      exchange: attune.dlx
+      ttl_ms: 86400000  # 24 hours
+```
+
+### Key Implementation Details
+
+1. **TTL Type Conversion:** RabbitMQ expects `i32` for `x-message-ttl`, not `i64`
+2. **Queue Recreation:** TTL is set at queue creation time, cannot be changed dynamically
+3. **No Redundant Ended Field:** `UpdateExecutionInput` only supports status, result, executor, workflow_task
+4. **Arc<PgPool> Wrapping:** Dead letter handler requires Arc-wrapped pool
+5. **Module Imports:** Both lib.rs and main.rs need module declarations
+
+## Testing
+
+### Compilation
+- ✅ All crates compile cleanly (`cargo check --workspace`)
+- ✅ No errors, only expected dead_code warnings (public API methods)
+
+### Manual Testing Procedure
+
+```bash
+# 1. Stop all workers
+docker compose stop worker-shell worker-python worker-node
+
+# 2. Create execution
+curl -X POST http://localhost:8080/api/v1/executions \
+  -H "Authorization: Bearer $TOKEN" \
+  -d '{"action_ref": "core.echo", "parameters": {"message": "test"}}'
+
+# 3. Wait 5+ minutes for TTL expiration
+sleep 330
+
+# 4. Verify execution failed with appropriate error
+curl http://localhost:8080/api/v1/executions/{id}
+# Expected: status="failed", result contains "Worker queue TTL expired"
+```
+
+## Benefits
+
+1. **Automatic Failure Detection:** No manual intervention for unavailable workers
+2. **Precise Timing:** Exact TTL-based expiration (not polling-based)
+3. **Operational Visibility:** DLQ metrics expose worker health issues
+4. **Resource Efficiency:** Prevents unbounded queue growth
+5. **Debugging Support:** Expired messages retained for analysis
+6. **Defense in Depth:** Works alongside Phase 1 timeout monitor
+
+## Configuration Recommendations
+
+### Worker Queue TTL
+- **Default:** 300000ms (5 minutes)
+- **Tuning:** 2-5x typical execution time, minimum 2 minutes
+- **Too Short:** Legitimate slow executions fail prematurely
+- **Too Long:** Delayed failure detection for unavailable workers
+
+### DLQ Retention
+- **Default:** 86400000ms (24 hours)
+- **Purpose:** Forensics and debugging
+- **Tuning:** Based on operational needs (24-48 hours recommended)
+
+## Monitoring
+
+### Key Metrics
+- **DLQ message rate:** Messages/sec entering DLQ
+- **DLQ queue depth:** Current messages in DLQ
+- **DLQ processing latency:** Time from expiration to handler
+- **Failed execution count:** Executions failed via DLQ
+
+### Alert Thresholds
+- **Warning:** DLQ rate > 10/min (worker instability)
+- **Critical:** DLQ depth > 100 (handler falling behind)
+
+## Relationship to Other Phases
+
+### Phase 1 (Completed)
+- Execution timeout monitor: Polls for stale executions
+- Graceful shutdown: Prevents new tasks to stopping workers
+- Reduced heartbeat: 10s interval for faster detection
+
+**Interaction:** Phase 1 acts as backup if Phase 2 DLQ processing fails
+
+### Phase 2 (Current)
+- Worker queue TTL: Automatic message expiration
+- Dead letter queue: Captures expired messages
+- Dead letter handler: Processes and fails executions
+
+**Benefit:** More precise and efficient than polling
+
+### Phase 3 (Planned)
+- Health probes: Proactive worker health checking
+- Intelligent retry: Retry transient failures
+- Load balancing: Distribute across healthy workers
+
+**Integration:** Phase 3 will use DLQ data to inform routing decisions
+
+## Known Limitations
+
+1. **TTL Precision:** RabbitMQ TTL is approximate, not millisecond-precise
+2. **Race Conditions:** Worker may consume just as TTL expires (rare, harmless)
+3. **No Dynamic TTL:** Requires queue recreation to change TTL
+4. **Single TTL Value:** All workers use same TTL (Phase 3 may add per-action TTL)
+
+## Files Modified
+
+### Core Implementation
+- `crates/common/src/mq/config.rs` (+25 lines)
+- `crates/common/src/mq/connection.rs` (+60 lines)
+- `crates/executor/src/dead_letter_handler.rs` (+263 lines, new file)
+- `crates/executor/src/service.rs` (+29 lines)
+- `crates/executor/src/lib.rs` (+2 lines)
+- `crates/executor/src/main.rs` (+1 line)
+
+### Configuration
+- `config.docker.yaml` (+6 lines)
+- `config.development.yaml` (+6 lines)
+
+### Documentation
+- `docs/architecture/worker-queue-ttl-dlq.md` (+493 lines, new file)
+- `docs/QUICKREF-worker-queue-ttl-dlq.md` (+322 lines, new file)
+
+### Total Changes
+- **New Files:** 3
+- **Modified Files:** 8
+- **Lines Added:** ~1,207
+- **Lines Removed:** ~10
+
+## Deployment Notes
+
+1. **No Breaking Changes:** Fully backward compatible with existing deployments
+2. **Automatic Setup:** Queue infrastructure created on service startup
+3. **Default Enabled:** DLQ processing enabled by default in all environments
+4. **Idempotent:** Safe to restart services, infrastructure recreates correctly
+
+## Next Steps (Phase 3)
+
+1. **Active Health Probes:** Proactively check worker health
+2. **Intelligent Retry Logic:** Retry transient failures before failing
+3. **Per-Action TTL:** Custom timeouts based on action type
+4. **Worker Load Balancing:** Distribute work across healthy workers
+5. **DLQ Analytics:** Aggregate statistics on failure patterns
+
+## References
+
+- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
+- Work Summary: `work-summary/2026-02-09-worker-availability-phase1.md`
+- RabbitMQ DLX: https://www.rabbitmq.com/dlx.html
+- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
+
+## Conclusion
+
+Phase 2 successfully implements message-level TTL and dead letter queue processing, providing automatic and precise failure detection for unavailable workers. The system now has two complementary mechanisms (Phase 1 timeout monitor + Phase 2 DLQ) working together for robust worker availability handling. The implementation is production-ready, well-documented, and provides a solid foundation for Phase 3 enhancements.