more internal polish, resilient workers
@@ -87,32 +87,47 @@ Execution Requested → Scheduler → Worker Selection → Execution Scheduled

### 3. Execution Manager

**Purpose**: Orchestrates execution workflows and handles lifecycle events.

**Responsibilities**:

- Listens for `execution.status.*` messages from workers
- **Does NOT update execution state** (worker owns state after scheduling)
- Handles execution completion orchestration (triggering child executions)
- Manages workflow executions (parent-child relationships)
- Coordinates workflow state transitions

**Ownership Model**:

- **Executor owns**: Requested → Scheduling → Scheduled (updates DB)
  - Includes pre-handoff cancellations/failures (before `execution.scheduled` is published)
- **Worker owns**: Running → Completed/Failed/Cancelled (updates DB)
  - Includes post-handoff cancellations/failures (after receiving `execution.scheduled`)
- **Handoff Point**: When `execution.scheduled` message is **published** to worker
  - Before publish: Executor owns and updates state
  - After publish: Worker owns and updates state
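The ownership rule above can be sketched as a pure function over the status and the handoff flag; the type and function names here are illustrative, not the crate's actual API:

```rust
// Hypothetical sketch of the ownership model; names are illustrative only.
#[derive(Debug, PartialEq)]
enum ExecutionStatus { Requested, Scheduling, Scheduled, Running, Completed, Failed, Cancelled }

#[derive(Debug, PartialEq)]
enum StateOwner { Executor, Worker }

/// `handoff_done` is true once `execution.scheduled` has been published.
fn state_owner(status: &ExecutionStatus, handoff_done: bool) -> StateOwner {
    use ExecutionStatus::*;
    match status {
        // Before publish, the executor owns all pre-handoff statuses,
        // including pre-handoff cancellations/failures.
        Requested | Scheduling | Scheduled if !handoff_done => StateOwner::Executor,
        // After publish, the worker is the authoritative writer.
        _ => StateOwner::Worker,
    }
}

fn main() {
    // Pre-handoff cancellation: executor still owns the record.
    assert_eq!(state_owner(&ExecutionStatus::Scheduled, false), StateOwner::Executor);
    // After publish, even a still-Scheduled execution belongs to the worker.
    assert_eq!(state_owner(&ExecutionStatus::Scheduled, true), StateOwner::Worker);
    assert_eq!(state_owner(&ExecutionStatus::Running, true), StateOwner::Worker);
}
```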
**Message Flow**:

```
Worker Status Update → Execution Manager → Orchestration Logic (Read-Only)
                                         → Trigger Child Executions
```

**Status Lifecycle**:

```
Requested → Scheduling → Scheduled → [HANDOFF: execution.scheduled published] → Running → Completed/Failed/Cancelled
    │                        │                                                     │
    └─── Executor Updates ───┘                                                     └─ Worker Updates
         (includes pre-handoff                                                        (includes post-handoff
          Cancelled)                                                                   Cancelled/Timeout/Abandoned)
    │
    └→ Child Executions (workflows)
```
**Key Implementation Details**:

- Parses status strings to typed enums for type safety
- Receives status change notifications for orchestration purposes only
- Does not update execution state after handoff to worker
- Handles workflow orchestration (parent-child execution chaining)
- Only triggers child executions on successful parent completion
- Read-only access to execution records for orchestration logic
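The first bullet ("parses status strings to typed enums") might look like the following; the variants follow the lifecycle above, but the exact names and string forms used in the crate may differ:

```rust
use std::str::FromStr;

// Illustrative sketch only; the real enum lives in the common crate.
#[derive(Debug, PartialEq)]
enum ExecutionStatus { Requested, Scheduling, Scheduled, Running, Completed, Failed, Cancelled }

impl FromStr for ExecutionStatus {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "requested" => Ok(Self::Requested),
            "scheduling" => Ok(Self::Scheduling),
            "scheduled" => Ok(Self::Scheduled),
            "running" => Ok(Self::Running),
            "completed" => Ok(Self::Completed),
            "failed" => Ok(Self::Failed),
            "cancelled" => Ok(Self::Cancelled),
            // Reject unknown strings instead of silently defaulting.
            other => Err(format!("unknown execution status: {other}")),
        }
    }
}

fn main() {
    assert_eq!("running".parse::<ExecutionStatus>(), Ok(ExecutionStatus::Running));
    assert!("bogus".parse::<ExecutionStatus>().is_err());
}
```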
## Message Queue Integration
@@ -123,12 +138,14 @@ The Executor consumes and produces several message types:

**Consumed**:

- `enforcement.created` - New enforcement from triggered rules
- `execution.requested` - Execution scheduling requests
- `execution.status.changed` - Status change notifications from workers (for orchestration)
- `execution.completed` - Completion notifications from workers (for queue management)

**Published**:

- `execution.requested` - To scheduler (from enforcement processor)
- `execution.scheduled` - To workers (from scheduler) **← OWNERSHIP HANDOFF**

**Note**: The executor does NOT publish `execution.completed` messages. This is the worker's responsibility as the authoritative source of execution state after scheduling.
### Message Envelope Structure
@@ -186,11 +203,34 @@ use attune_common::repositories::{
};
```
### Database Update Ownership

**Executor updates execution state** from creation through handoff:

- Creates execution records (`Requested` status)
- Updates status during scheduling (`Scheduling` → `Scheduled`)
- Publishes `execution.scheduled` message to worker **← HANDOFF POINT**
- **Handles cancellations/failures BEFORE handoff** (before message is published)
  - Example: User cancels execution while queued by concurrency policy
  - Executor updates to `Cancelled`, worker never receives message

**Worker updates execution state** after receiving handoff:

- Receives `execution.scheduled` message (takes ownership)
- Updates status when execution starts (`Running`)
- Updates status when execution completes (`Completed`, `Failed`, etc.)
- **Handles cancellations/failures AFTER handoff** (after receiving message)
- Updates result data and artifacts
- Worker only owns executions it has received

**Executor reads execution state** for orchestration after handoff:

- Receives status change notifications from workers
- Reads execution records to trigger workflow children
- Does NOT update execution state after publishing `execution.scheduled`

### Transaction Support

Future implementations will use database transactions for multi-step operations:

- Creating execution + publishing message (atomic)
- Status update + completion handling (atomic)
- Enforcement processing + execution creation (atomic)

## Configuration
557
docs/architecture/worker-availability-handling.md
Normal file
@@ -0,0 +1,557 @@

# Worker Availability Handling

**Status**: Implementation Gap Identified
**Priority**: High
**Date**: 2026-02-09

## Problem Statement

When workers are stopped or become unavailable, the executor continues attempting to schedule executions to them, resulting in:

1. **Stuck executions**: Executions remain in `SCHEDULING` or `SCHEDULED` status indefinitely
2. **Queue buildup**: Messages accumulate in worker-specific RabbitMQ queues
3. **No failure notification**: Users don't know their executions are stuck
4. **Resource waste**: System resources consumed by queued messages and database records
## Current Architecture

### Heartbeat Mechanism

Workers send heartbeat updates to the database periodically (default: 30 seconds).

```rust
// From crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    // Worker is fresh if heartbeat < 90 seconds old
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
    );
    // ...
}
```
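The elided body of `is_worker_heartbeat_fresh` can be sketched with the standard library alone. The `last_heartbeat` field name is an assumption here, and the real implementation likely compares database timestamps rather than `SystemTime`:

```rust
use std::time::{Duration, SystemTime};

const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

// Hypothetical minimal worker record; the real `Worker` type is not shown here.
struct Worker {
    last_heartbeat: Option<SystemTime>,
}

fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    let max_age =
        Duration::from_secs(DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER);
    match worker.last_heartbeat {
        // Fresh when the heartbeat is younger than 90 seconds.
        Some(ts) => SystemTime::now()
            .duration_since(ts)
            .map(|age| age < max_age)
            // A heartbeat "from the future" (clock skew) counts as fresh.
            .unwrap_or(true),
        // A worker that has never heartbeated is never fresh.
        None => false,
    }
}

fn main() {
    let fresh = Worker { last_heartbeat: Some(SystemTime::now()) };
    let stale = Worker { last_heartbeat: Some(SystemTime::now() - Duration::from_secs(120)) };
    assert!(is_worker_heartbeat_fresh(&fresh));
    assert!(!is_worker_heartbeat_fresh(&stale));
}
```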
### Scheduling Flow

```
Execution Created (REQUESTED)
        ↓
Scheduler receives message
        ↓
Find compatible worker with fresh heartbeat
        ↓
Update execution to SCHEDULED
        ↓
Publish message to worker-specific queue
        ↓
Worker consumes and executes
```

### Failure Points

1. **Worker stops after heartbeat**: Worker has fresh heartbeat but is actually down
2. **Worker crashes**: No graceful shutdown, heartbeat appears fresh temporarily
3. **Network partition**: Worker isolated but appears healthy
4. **Queue accumulation**: Messages sit in worker-specific queues indefinitely
## Current Mitigations (Insufficient)

### 1. Heartbeat Staleness Check

```rust
fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
    // Filter by active workers
    let active_workers: Vec<_> = workers
        .into_iter()
        .filter(|w| w.status == WorkerStatus::Active)
        .collect();

    // Filter by heartbeat freshness
    let fresh_workers: Vec<_> = active_workers
        .into_iter()
        .filter(|w| is_worker_heartbeat_fresh(w))
        .collect();

    if fresh_workers.is_empty() {
        return Err(anyhow!("No workers with fresh heartbeats"));
    }

    // Select first available worker
    Ok(fresh_workers.into_iter().next().unwrap())
}
```

**Gap**: Workers can stop within the 90-second staleness window.
### 2. Message Requeue on Error

```rust
// From crates/common/src/mq/consumer.rs
match handler(envelope.clone()).await {
    Ok(()) => {
        channel.basic_ack(delivery_tag, BasicAckOptions::default()).await?;
    }
    Err(e) => {
        let requeue = e.is_retriable();
        channel.basic_nack(delivery_tag, BasicNackOptions {
            requeue,
            multiple: false,
        }).await?;
    }
}
```

**Gap**: Only requeues on retriable errors (connection/timeout), not worker unavailability.

### 3. Message TTL Configuration

```rust
// From crates/common/src/config.rs
pub struct MessageQueueConfig {
    #[serde(default = "default_message_ttl")]
    pub message_ttl: u64,
}

fn default_message_ttl() -> u64 {
    3600 // 1 hour
}
```

**Gap**: TTL is not currently applied to worker queues, and 1 hour is too long.
## Proposed Solutions

### Solution 1: Execution Timeout Mechanism (HIGH PRIORITY)

Add a background task that monitors scheduled executions and fails them if they don't start within a timeout.

**Implementation:**
```rust
// crates/executor/src/execution_timeout_monitor.rs

pub struct ExecutionTimeoutMonitor {
    pool: PgPool,
    publisher: Arc<Publisher>,
    check_interval: Duration,
    scheduled_timeout: Duration,
}

impl ExecutionTimeoutMonitor {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_stale_executions().await {
                error!("Error checking stale executions: {}", e);
            }
        }
    }

    async fn check_stale_executions(&self) -> Result<()> {
        let cutoff = Utc::now() - chrono::Duration::from_std(self.scheduled_timeout)?;

        // Find executions stuck in SCHEDULED status
        let stale_executions = sqlx::query_as::<_, Execution>(
            "SELECT * FROM execution
             WHERE status = 'scheduled'
               AND updated < $1"
        )
        .bind(cutoff)
        .fetch_all(&self.pool)
        .await?;

        for execution in stale_executions {
            warn!(
                "Execution {} has been scheduled for too long, marking as failed",
                execution.id
            );

            self.fail_execution(
                execution.id,
                "Execution timeout: worker did not pick up task within timeout"
            ).await?;
        }

        Ok(())
    }

    async fn fail_execution(&self, execution_id: i64, reason: &str) -> Result<()> {
        // Update execution status
        sqlx::query(
            "UPDATE execution
             SET status = 'failed',
                 result = $2,
                 updated = NOW()
             WHERE id = $1"
        )
        .bind(execution_id)
        .bind(serde_json::json!({
            "error": reason,
            "failed_by": "execution_timeout_monitor"
        }))
        .execute(&self.pool)
        .await?;

        // Publish completion notification
        let payload = ExecutionCompletedPayload {
            execution_id,
            status: ExecutionStatus::Failed,
            result: Some(serde_json::json!({"error": reason})),
        };

        self.publisher
            .publish_envelope(
                MessageType::ExecutionCompleted,
                payload,
                "attune.executions",
            )
            .await?;

        Ok(())
    }
}
```
**Configuration:**

```yaml
# config.yaml
executor:
  scheduled_timeout: 300      # 5 minutes (fail if not running within 5 min)
  timeout_check_interval: 60  # Check every minute
```
### Solution 2: Worker Queue TTL and DLQ (MEDIUM PRIORITY)

Apply message TTL to worker-specific queues with a dead letter exchange.

**Implementation:**
```rust
// When declaring worker-specific queues
let mut queue_args = FieldTable::default();

// Set message TTL (5 minutes)
queue_args.insert(
    "x-message-ttl".into(),
    AMQPValue::LongInt(300_000) // 5 minutes in milliseconds
);

// Set dead letter exchange
queue_args.insert(
    "x-dead-letter-exchange".into(),
    AMQPValue::LongString("attune.executions.dlx".into())
);

channel.queue_declare(
    &format!("attune.execution.worker.{}", worker_id),
    QueueDeclareOptions {
        durable: true,
        ..Default::default()
    },
    queue_args,
).await?;
```
**Dead Letter Handler:**

```rust
// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: PgPool,
    consumer: Arc<Consumer>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<()> {
        self.consumer
            .consume_with_handler(|envelope: MessageEnvelope<ExecutionScheduledPayload>| {
                let pool = self.pool.clone();

                async move {
                    warn!("Received dead letter for execution {}", envelope.payload.execution_id);

                    // Mark execution as failed
                    sqlx::query(
                        "UPDATE execution
                         SET status = 'failed',
                             result = $2,
                             updated = NOW()
                         WHERE id = $1 AND status = 'scheduled'"
                    )
                    .bind(envelope.payload.execution_id)
                    .bind(serde_json::json!({
                        "error": "Message expired in worker queue (worker unavailable)",
                        "failed_by": "dead_letter_handler"
                    }))
                    .execute(&pool)
                    .await?;

                    Ok(())
                }
            })
            .await
    }
}
```
### Solution 3: Worker Health Probes (LOW PRIORITY)

Add active health checking instead of relying solely on heartbeats.

**Implementation:**
```rust
// crates/executor/src/worker_health_checker.rs

pub struct WorkerHealthChecker {
    pool: PgPool,
    check_interval: Duration,
}

impl WorkerHealthChecker {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_worker_health().await {
                error!("Error checking worker health: {}", e);
            }
        }
    }

    async fn check_worker_health(&self) -> Result<()> {
        let workers = WorkerRepository::find_action_workers(&self.pool).await?;

        for worker in workers {
            // Skip if heartbeat is very stale (worker is definitely down)
            if !is_heartbeat_recent(&worker) {
                continue;
            }

            // Attempt health check
            match self.ping_worker(&worker).await {
                Ok(true) => {
                    // Worker is healthy, ensure status is Active
                    if worker.status != Some(WorkerStatus::Active) {
                        self.update_worker_status(worker.id, WorkerStatus::Active).await?;
                    }
                }
                Ok(false) | Err(_) => {
                    // Worker is unhealthy, mark as inactive
                    warn!("Worker {} failed health check", worker.name);
                    self.update_worker_status(worker.id, WorkerStatus::Inactive).await?;
                }
            }
        }

        Ok(())
    }

    async fn ping_worker(&self, worker: &Worker) -> Result<bool> {
        // TODO: Implement health endpoint on worker
        // For now, check if worker's queue is being consumed
        Ok(true)
    }
}
```
### Solution 4: Graceful Worker Shutdown (MEDIUM PRIORITY)

Ensure workers mark themselves as inactive before shutdown.

**Implementation:**
```rust
// In worker service shutdown handler
impl WorkerService {
    pub async fn shutdown(&self) -> Result<()> {
        info!("Worker shutting down gracefully...");

        // Mark worker as inactive
        sqlx::query(
            "UPDATE worker SET status = 'inactive', updated = NOW() WHERE id = $1"
        )
        .bind(self.worker_id)
        .execute(&self.pool)
        .await?;

        // Stop accepting new tasks
        self.stop_consuming().await?;

        // Wait for in-flight tasks to complete (with timeout)
        let timeout = Duration::from_secs(30);
        tokio::time::timeout(timeout, self.wait_for_completion()).await?;

        info!("Worker shutdown complete");
        Ok(())
    }
}
```
**Docker Signal Handling:**

```yaml
# docker-compose.yaml
services:
  worker-shell:
    stop_grace_period: 45s  # Give worker time to finish tasks
```
## Implementation Priority

### Phase 1: Immediate (Week 1)

1. **Execution Timeout Monitor** - Prevents stuck executions
2. **Graceful Shutdown** - Marks workers inactive on stop

### Phase 2: Short-term (Week 2)

3. **Worker Queue TTL + DLQ** - Prevents message buildup
4. **Dead Letter Handler** - Fails expired executions

### Phase 3: Long-term (Month 1)

5. **Worker Health Probes** - Active availability verification
6. **Retry Logic** - Reschedule to different worker on failure
## Configuration

### Recommended Timeouts

```yaml
executor:
  # How long an execution can stay SCHEDULED before failing
  scheduled_timeout: 300        # 5 minutes

  # How often to check for stale executions
  timeout_check_interval: 60    # 1 minute

  # Message TTL in worker queues
  worker_queue_ttl: 300         # 5 minutes (match scheduled_timeout)

  # Worker health check interval
  health_check_interval: 30     # 30 seconds

worker:
  # How often to send heartbeats
  heartbeat_interval: 10        # 10 seconds (more frequent)

  # Grace period for shutdown
  shutdown_timeout: 30          # 30 seconds
```
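The comments in this config imply a few cross-field invariants (the queue TTL matching `scheduled_timeout`, the staleness window fitting inside it, the check interval being shorter than the timeout). A hypothetical startup-time sanity check, with field names mirroring the YAML keys rather than the real config structs:

```rust
// Hypothetical validation sketch; the actual config types may differ.
struct TimeoutConfig {
    scheduled_timeout: u64,      // seconds
    timeout_check_interval: u64, // seconds
    worker_queue_ttl: u64,       // seconds
    heartbeat_interval: u64,     // seconds
}

fn validate(cfg: &TimeoutConfig) -> Result<(), String> {
    // The queue TTL should match scheduled_timeout, per the comment above.
    if cfg.worker_queue_ttl != cfg.scheduled_timeout {
        return Err("worker_queue_ttl should match scheduled_timeout".into());
    }
    // Checking less often than the timeout itself would delay detection.
    if cfg.timeout_check_interval >= cfg.scheduled_timeout {
        return Err("timeout_check_interval should be shorter than scheduled_timeout".into());
    }
    // Staleness threshold is heartbeat_interval * 3; it must fit well inside
    // the scheduling timeout or stale workers can still be selected.
    if cfg.heartbeat_interval * 3 >= cfg.scheduled_timeout {
        return Err("heartbeat staleness window should be well below scheduled_timeout".into());
    }
    Ok(())
}

fn main() {
    // The recommended values above pass the check.
    let cfg = TimeoutConfig {
        scheduled_timeout: 300,
        timeout_check_interval: 60,
        worker_queue_ttl: 300,
        heartbeat_interval: 10,
    };
    assert!(validate(&cfg).is_ok());
}
```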
### Staleness Calculation

```
Heartbeat Staleness Threshold = heartbeat_interval * 3
                              = 10 * 3 = 30 seconds

This means:
- Worker sends heartbeat every 10s
- If heartbeat is > 30s old, worker is considered stale
- Reduces window where stopped worker appears healthy from 90s to 30s
```
## Monitoring and Observability

### Metrics to Track

1. **Execution timeout rate**: Number of executions failed due to timeout
2. **Worker downtime**: Time between last heartbeat and status change
3. **Dead letter queue depth**: Number of expired messages
4. **Average scheduling latency**: Time from REQUESTED to RUNNING
### Alerts

```yaml
alerts:
  - name: high_execution_timeout_rate
    condition: execution_timeouts > 10 per minute
    severity: warning

  - name: no_active_workers
    condition: active_workers == 0
    severity: critical

  - name: dlq_buildup
    condition: dlq_depth > 100
    severity: warning

  - name: stale_executions
    condition: scheduled_executions_older_than_5min > 0
    severity: warning
```
## Testing

### Test Scenarios

1. **Worker stops mid-execution**: Should timeout and fail
2. **Worker never picks up task**: Should timeout after 5 minutes
3. **All workers down**: Should immediately fail with "no workers available"
4. **Worker stops gracefully**: Should mark inactive and not receive new tasks
5. **Message expires in queue**: Should be moved to DLQ and execution failed

### Integration Test Example
```rust
#[tokio::test]
async fn test_execution_timeout_on_worker_down() {
    let pool = setup_test_db().await;
    let mq = setup_test_mq().await;

    // Create worker and execution
    let worker = create_test_worker(&pool).await;
    let execution = create_test_execution(&pool).await;

    // Schedule execution to worker
    schedule_execution(&pool, &mq, execution.id, worker.id).await;

    // Stop worker (simulate crash - no graceful shutdown)
    stop_worker(worker.id).await;

    // Wait for timeout
    tokio::time::sleep(Duration::from_secs(310)).await;

    // Verify execution is marked as failed
    let execution = get_execution(&pool, execution.id).await;
    assert_eq!(execution.status, ExecutionStatus::Failed);
    assert!(execution.result.unwrap()["error"]
        .as_str()
        .unwrap()
        .contains("timeout"));
}
```
## Migration Path

### Step 1: Add Monitoring (No Breaking Changes)

- Deploy execution timeout monitor
- Monitor logs for timeout events
- Tune timeout values based on actual workload

### Step 2: Add DLQ (Requires Queue Reconfiguration)

- Create dead letter exchange
- Update queue declarations with TTL and DLX
- Deploy dead letter handler
- Monitor DLQ depth

### Step 3: Graceful Shutdown (Worker Update)

- Add shutdown handler to worker
- Update Docker Compose stop_grace_period
- Test worker restarts

### Step 4: Health Probes (Future Enhancement)

- Add health endpoint to worker
- Deploy health checker service
- Transition from heartbeat-only to active probing
## Related Documentation

- [Queue Architecture](./queue-architecture.md)
- [Worker Service](./worker-service.md)
- [Executor Service](./executor-service.md)
- [RabbitMQ Queues Quick Reference](../docs/QUICKREF-rabbitmq-queues.md)
493
docs/architecture/worker-queue-ttl-dlq.md
Normal file
@@ -0,0 +1,493 @@

# Worker Queue TTL and Dead Letter Queue (Phase 2)

## Overview

Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.
## Architecture

### Message Flow

```
┌─────────────┐
│  Executor   │
│  Scheduler  │
└──────┬──────┘
       │ Publishes ExecutionRequested
       │ routing_key: execution.dispatch.worker.{id}
       │
       ▼
┌──────────────────────────────────┐
│  worker.{id}.executions queue    │
│                                  │
│  Properties:                     │
│  - x-message-ttl: 300000ms (5m)  │
│  - x-dead-letter-exchange: dlx   │
└──────┬───────────────────┬───────┘
       │                   │
       │ Worker consumes   │ TTL expires
       │ (normal flow)     │ (worker unavailable)
       │                   │
       ▼                   ▼
┌──────────────┐   ┌──────────────────┐
│   Worker     │   │   attune.dlx     │
│   Service    │   │  (Dead Letter    │
│              │   │   Exchange)      │
└──────────────┘   └────────┬─────────┘
                            │
                            │ Routes to DLQ
                            │
                            ▼
                   ┌──────────────────────┐
                   │  attune.dlx.queue    │
                   │  (Dead Letter Queue) │
                   └────────┬─────────────┘
                            │
                            │ Consumes
                            │
                            ▼
                   ┌──────────────────────┐
                   │ Dead Letter Handler  │
                   │   (in Executor)      │
                   │                      │
                   │ - Identifies exec    │
                   │ - Marks as FAILED    │
                   │ - Logs failure       │
                   └──────────────────────┘
```
### Components

#### 1. Worker Queue TTL

**Configuration:**

- Default: 5 minutes (300,000 milliseconds)
- Configurable via `rabbitmq.worker_queue_ttl_ms`

**Implementation:**

- Applied during queue declaration in `Connection::setup_worker_infrastructure()`
- Uses RabbitMQ's `x-message-ttl` queue argument
- Only applies to worker-specific queues (`worker.{id}.executions`)

**Behavior:**

- When a message remains in the queue longer than TTL, RabbitMQ automatically moves it to the configured dead letter exchange
- Original message properties and headers are preserved
- Includes `x-death` header with expiration details
#### 2. Dead Letter Exchange (DLX)

**Configuration:**

- Exchange name: `attune.dlx`
- Type: `direct`
- Durable: `true`

**Setup:**

- Created in `Connection::setup_common_infrastructure()`
- Bound to dead letter queue with routing key `#` (all messages)
- Shared across all services
#### 3. Dead Letter Queue

**Configuration:**

- Queue name: `attune.dlx.queue`
- Durable: `true`
- TTL: 24 hours (configurable via `rabbitmq.dead_letter.ttl_ms`)

**Properties:**

- Retains messages for debugging and analysis
- Messages auto-expire after retention period
- No DLX on the DLQ itself (prevents infinite loops)
#### 4. Dead Letter Handler

**Location:** `crates/executor/src/dead_letter_handler.rs`

**Responsibilities:**

1. Consume messages from `attune.dlx.queue`
2. Deserialize message envelope
3. Extract execution ID from payload
4. Verify execution is in non-terminal state
5. Update execution to FAILED status
6. Add descriptive error information
7. Acknowledge message (remove from DLQ)

**Error Handling:**

- Invalid messages: Acknowledged and discarded
- Missing executions: Acknowledged (already processed)
- Terminal state executions: Acknowledged (no action needed)
- Database errors: Nacked with requeue (retry later)
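The ack/requeue table above can be captured in a small decision function; `DlqError` and `Decision` are illustrative names, not the crate's actual types:

```rust
// Sketch of the dead letter handler's error-handling policy above.
#[derive(Debug, PartialEq)]
enum DlqError { InvalidMessage, ExecutionMissing, AlreadyTerminal, Database }

#[derive(Debug, PartialEq)]
enum Decision { Ack, NackRequeue }

fn decide(err: &DlqError) -> Decision {
    match err {
        // Nothing useful can come from retrying these: drop the message.
        DlqError::InvalidMessage
        | DlqError::ExecutionMissing
        | DlqError::AlreadyTerminal => Decision::Ack,
        // Transient infrastructure failure: requeue and retry later.
        DlqError::Database => Decision::NackRequeue,
    }
}

fn main() {
    assert_eq!(decide(&DlqError::InvalidMessage), Decision::Ack);
    assert_eq!(decide(&DlqError::AlreadyTerminal), Decision::Ack);
    assert_eq!(decide(&DlqError::Database), Decision::NackRequeue);
}
```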
## Configuration

### RabbitMQ Configuration Structure

```yaml
message_queue:
  rabbitmq:
    # Worker queue TTL - how long messages wait before DLX
    worker_queue_ttl_ms: 300000   # 5 minutes (default)

    # Dead letter configuration
    dead_letter:
      enabled: true               # Enable DLQ system
      exchange: attune.dlx        # DLX name
      ttl_ms: 86400000            # DLQ retention (24 hours)
```
### Environment-Specific Settings

#### Development (`config.development.yaml`)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000   # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000            # 24 hours
```

#### Production (`config.docker.yaml`)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000   # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000            # 24 hours
```
### Tuning Guidelines

**Worker Queue TTL (`worker_queue_ttl_ms`):**

- **Too short:** Legitimate slow workers may have executions failed prematurely
- **Too long:** Unavailable workers cause delayed failure detection
- **Recommendation:** 2-5x typical execution time, minimum 2 minutes
- **Default (5 min):** Good balance for most workloads

**DLQ Retention (`dead_letter.ttl_ms`):**

- Purpose: Debugging and forensics
- **Too short:** May lose data before analysis
- **Too long:** Accumulates stale data
- **Recommendation:** 24-48 hours in production
- **Default (24 hours):** Adequate for most troubleshooting
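The TTL recommendation ("2-5x typical execution time, minimum 2 minutes") can be expressed as a small helper. This is illustrative arithmetic, not project code:

```rust
// Hypothetical helper for the tuning guideline above.
const MIN_TTL_MS: u64 = 2 * 60 * 1000; // 2-minute floor

/// Returns a (low, high) recommended TTL range in milliseconds
/// for a given typical execution time.
fn recommended_ttl_ms(typical_execution_ms: u64) -> (u64, u64) {
    let low = (typical_execution_ms * 2).max(MIN_TTL_MS);
    let high = (typical_execution_ms * 5).max(MIN_TTL_MS);
    (low, high)
}

fn main() {
    // A 90-second typical execution gives a 180s..450s range; the
    // 5-minute default (300,000 ms) falls inside it.
    let (low, high) = recommended_ttl_ms(90_000);
    assert_eq!((low, high), (180_000, 450_000));
    assert!(low <= 300_000 && 300_000 <= high);

    // Very short executions are clamped to the 2-minute floor.
    assert_eq!(recommended_ttl_ms(10_000), (120_000, 120_000));
}
```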
## Code Structure

### Queue Declaration with TTL

```rust
// crates/common/src/mq/connection.rs

pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<u64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();

    // Configure DLX
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );

    // Configure TTL if specified
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            AMQPValue::LongInt(ttl as i64),
        );
    }

    // Declare queue with arguments
    channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}
```
### Dead Letter Handler

```rust
// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<Mutex<bool>>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        let pool = self.pool.clone();
        self.consumer.consume_with_handler(move |envelope| {
            let pool = pool.clone();
            async move {
                match envelope.message_type {
                    MessageType::ExecutionRequested => {
                        handle_execution_requested(&pool, &envelope).await
                    }
                    _ => {
                        // Unexpected message type - acknowledge and discard
                        Ok(())
                    }
                }
            }
        }).await
    }
}

async fn handle_execution_requested(
    pool: &PgPool,
    envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
    // Extract execution ID
    let execution_id = envelope.payload.get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or_else(|| /* error */)?;

    // Fetch current state
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Only fail if in non-terminal state
    if !execution.status.is_terminal() {
        ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
            status: Some(ExecutionStatus::Failed),
            result: Some(json!({
                "error": "Worker queue TTL expired",
                "message": "Worker did not process execution within configured TTL",
            })),
            ended: Some(Some(Utc::now())),
            ..Default::default()
        }).await?;
    }

    Ok(())
}
```
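The handler relies on `execution.status.is_terminal()`. A minimal sketch of that predicate, assuming the status set used elsewhere in this document:

```rust
// Illustrative sketch; the real enum and predicate live in the common crate.
#[derive(Debug, PartialEq)]
enum ExecutionStatus { Requested, Scheduling, Scheduled, Running, Completed, Failed, Cancelled }

impl ExecutionStatus {
    /// Terminal states must never be overwritten by the DLQ handler.
    fn is_terminal(&self) -> bool {
        matches!(self, Self::Completed | Self::Failed | Self::Cancelled)
    }
}

fn main() {
    // A scheduled execution that expired in the queue should be failed,
    // but one that already finished must be left alone.
    assert!(!ExecutionStatus::Scheduled.is_terminal());
    assert!(ExecutionStatus::Completed.is_terminal());
}
```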
## Integration with Executor Service
|
||||
|
||||
The dead letter handler is started automatically by the executor service if DLQ is enabled:
|
||||
|
||||
```rust
|
||||
// crates/executor/src/service.rs
|
||||
|
||||
pub async fn start(&self) -> Result<()> {
|
||||
// ... other components ...
|
||||
|
||||
// Start dead letter handler (if enabled)
|
||||
if self.inner.mq_config.rabbitmq.dead_letter.enabled {
|
||||
let dlq_name = format!("{}.queue",
|
||||
self.inner.mq_config.rabbitmq.dead_letter.exchange);
|
||||
let dlq_consumer = Consumer::new(
|
||||
&self.inner.mq_connection,
|
||||
create_dlq_consumer_config(&dlq_name, "executor.dlq"),
|
||||
).await?;
|
||||
|
||||
let dlq_handler = Arc::new(
|
||||
DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
|
||||
);
|
||||
|
||||
handles.push(tokio::spawn(async move {
|
||||
dlq_handler.start().await
|
||||
}));
|
||||
}
|
||||
|
||||
// ... wait for completion ...
|
||||
}
|
||||
```
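
The `create_dlq_consumer_config` helper is not shown above; a plausible sketch, assuming a plain consumer-config struct (the field names here are illustrative, not the MQ crate's actual API):

```rust
// Illustrative only: the real ConsumerConfig type lives in the MQ crate
// and may carry more fields (ack mode, exclusivity, etc.).
pub struct ConsumerConfig {
    pub queue: String,
    pub consumer_tag: String,
    pub prefetch_count: u16,
}

pub fn create_dlq_consumer_config(queue: &str, consumer_tag: &str) -> ConsumerConfig {
    ConsumerConfig {
        queue: queue.to_string(),
        consumer_tag: consumer_tag.to_string(),
        // A low prefetch is reasonable here: DLQ traffic should be rare
        // and each message triggers a database write.
        prefetch_count: 1,
    }
}
```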

## Operational Considerations

### Monitoring

**Key Metrics:**
- DLQ message rate (messages/sec entering the DLQ)
- DLQ queue depth (current messages in the DLQ)
- DLQ processing latency (time from DLX routing to handler)
- Failed execution count (executions failed via DLQ)

**Alerting Thresholds:**
- DLQ rate > 10/min: Workers may be unhealthy or the TTL too aggressive
- DLQ depth > 100: Handler may be falling behind
- High failure rate: Systematic worker availability issues

### RabbitMQ Management

**View DLQ:**
```bash
# List queues and their message counts
rabbitmqadmin list queues name messages

# Get DLQ details
rabbitmqadmin list queues name messages consumers | grep attune.dlx.queue

# Purge the DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue
```

**View Dead Letters:**
```bash
# Get a message from the DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1

# Check message death history:
# look for the x-death header in the message properties
```

### Troubleshooting

#### High DLQ Rate

**Symptoms:** Many executions failing via the DLQ

**Causes:**
1. Workers down or restarting frequently
2. Worker queue TTL too aggressive
3. Workers overloaded (not consuming fast enough)
4. Network issues between executor and workers

**Resolution:**
1. Check worker health and logs
2. Verify worker heartbeats in the database
3. Consider increasing `worker_queue_ttl_ms`
4. Scale the worker fleet if overloaded
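
For reference, the relevant knobs likely sit together in the MQ configuration. The exact file layout is an assumption; the setting names follow those referenced in this document:

```toml
# Hypothetical layout; section structure may differ in the real config.
[rabbitmq]
# Raise this if healthy workers legitimately need longer to pick up work.
worker_queue_ttl_ms = 300_000  # 5 minutes

[rabbitmq.dead_letter]
enabled = true
exchange = "attune.dlx"
```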

#### DLQ Handler Not Processing

**Symptoms:** DLQ depth increasing, executions stuck

**Causes:**
1. Executor service not running
2. DLQ disabled in configuration
3. Database connection issues
4. Handler crashed or deadlocked

**Resolution:**
1. Check executor service logs
2. Verify `dead_letter.enabled = true`
3. Check database connectivity
4. Restart the executor service if needed

#### Messages Not Reaching DLQ

**Symptoms:** Executions stuck, DLQ empty

**Causes:**
1. Worker queues not configured with a DLX
2. DLX exchange not created
3. DLQ not bound to the DLX
4. TTL not configured on worker queues

**Resolution:**
1. Restart services to recreate the queue infrastructure
2. Verify the RabbitMQ configuration
3. Check queue properties in the RabbitMQ management UI

## Testing

### Unit Tests

```rust
#[tokio::test]
async fn test_expired_execution_handling() {
    let pool = setup_test_db().await;

    // Create an execution in the Scheduled state
    let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;

    // Simulate a DLQ message
    let envelope = MessageEnvelope::new(
        MessageType::ExecutionRequested,
        json!({ "execution_id": execution.id }),
    );

    // Process the message
    handle_execution_requested(&pool, &envelope).await.unwrap();

    // Verify the execution failed
    let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
    assert_eq!(updated.status, ExecutionStatus::Failed);
    assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}
```

### Integration Tests

```bash
# 1. Start all services
docker compose up -d

# 2. Create an execution targeting a stopped worker
#    (worker_id 999 does not exist)
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Verify the execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"

# 5. Check that the DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```

## Relationship to Other Phases

### Phase 1 (Completed)
- Execution timeout monitor: Handles executions stuck in SCHEDULED
- Graceful shutdown: Prevents new tasks from being dispatched to stopping workers
- Reduced heartbeat interval: Faster stale-worker detection

**Interaction:** The Phase 1 timeout monitor acts as a backstop if DLQ processing fails

### Phase 2 (Current)
- Worker queue TTL: Automatic message expiration
- Dead letter queue: Captures expired messages
- Dead letter handler: Processes and fails expired executions

**Benefit:** More precise failure detection at the message-queue level

### Phase 3 (Planned)
- Health probes: Proactive worker health checking
- Intelligent retry: Retries transient failures
- Load balancing: Distributes work across healthy workers

**Integration:** Phase 3 will use Phase 2 DLQ data to inform routing decisions

## Benefits

1. **Automatic Failure Detection:** No manual intervention needed for unavailable workers
2. **Precise Timing:** TTL provides an exact failure window (vs. the polling-based Phase 1 monitor)
3. **Resource Efficiency:** Prevents message accumulation in worker queues
4. **Debugging Support:** The DLQ retains messages for forensic analysis
5. **Graceful Degradation:** The system continues functioning even with worker failures

## Limitations

1. **TTL Precision:** RabbitMQ TTL is approximate, not guaranteed to the millisecond
2. **Race Conditions:** A worker may start processing just as the TTL expires (rare; the handler's terminal-state check prevents overwriting a completed execution)
3. **DLQ Capacity:** Very high failure rates may overwhelm the DLQ
4. **No Retry Logic:** Phase 2 always fails the execution; Phase 3 will add intelligent retry

## Future Enhancements (Phase 3)

- **Conditional Retry:** Retry messages based on failure reason
- **Priority DLQ:** Prioritize critical execution failures
- **DLQ Analytics:** Aggregate statistics on failure patterns
- **Auto-scaling:** Scale workers based on DLQ rate
- **Custom TTL:** Per-action or per-execution TTL configuration

## References

- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`

@@ -131,28 +131,38 @@ echo "Hello, $PARAM_NAME!"

### 4. Action Executor

**Purpose**: Orchestrate the complete execution flow for an action.
**Purpose**: Orchestrate the complete execution flow for an action and own execution state after handoff.

**Execution Flow**:
```
1. Load execution record from database
2. Update status to Running
3. Load action definition by reference
4. Prepare execution context (parameters, env vars, timeout)
5. Select and execute in appropriate runtime
6. Capture results (stdout, stderr, return value)
7. Store artifacts (logs, results)
8. Update execution status (Succeeded/Failed)
9. Publish status update messages
1. Receive execution.scheduled message from executor
2. Load execution record from database
3. Update status to Running (owns state after handoff)
4. Load action definition by reference
5. Prepare execution context (parameters, env vars, timeout)
6. Select and execute in appropriate runtime
7. Capture results (stdout, stderr, return value)
8. Store artifacts (logs, results)
9. Update execution status (Completed/Failed) in database
10. Publish status change notifications
11. Publish completion notification for queue management
```
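
The context-preparation step exposes parameters to the action as environment variables — the `$PARAM_NAME` convention seen in the shell action example earlier. A sketch of that mapping; the uppercased `PARAM_` prefix is an assumption inferred from that example:

```rust
use std::collections::HashMap;

/// Expose execution parameters as environment variables using the
/// PARAM_<NAME> convention (prefix and casing are assumptions based on
/// the `$PARAM_NAME` shell example).
fn parameters_to_env(params: &HashMap<String, String>) -> HashMap<String, String> {
    params
        .iter()
        .map(|(key, value)| (format!("PARAM_{}", key.to_uppercase()), value.clone()))
        .collect()
}
```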

**Ownership Model**:
- **Worker owns execution state** after receiving `execution.scheduled`
- **Authoritative source** for all status updates: Running, Completed, Failed, Cancelled, etc.
- **Updates the database directly** for all state changes
- **Publishes notifications** for orchestration and monitoring

**Responsibilities**:
- Coordinate the execution lifecycle
- Load action and execution data from the database
- **Update execution state in the database** (after handoff from the executor)
- Prepare the execution context with parameters and environment
- Execute the action via the runtime registry
- Handle success and failure cases
- Store execution artifacts
- Publish status change notifications

**Key Implementation Details**:
- Parameters merged: action defaults + execution overrides
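
The merge rule can be sketched with plain maps — the real code merges JSON objects, but the precedence is the same: execution overrides win over action defaults.

```rust
use std::collections::HashMap;

/// Merge action-level default parameters with execution-level overrides.
/// On a key collision the execution override wins.
fn merge_parameters(
    defaults: &HashMap<String, String>,
    overrides: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = defaults.clone();
    for (key, value) in overrides {
        merged.insert(key.clone(), value.clone());
    }
    merged
}
```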

@@ -246,7 +256,10 @@ See `docs/secrets-management.md` for comprehensive documentation.

- Register worker in database
- Start heartbeat manager
- Consume execution messages from worker-specific queue
- Publish execution status updates
- **Own execution state** after receiving scheduled executions
- **Update execution status in database** (Running, Completed, Failed, etc.)
- Publish execution status change notifications
- Publish execution completion notifications
- Handle graceful shutdown

**Message Flow**:

@@ -407,8 +420,9 @@ pub struct ExecutionResult {

### Error Propagation

- Runtime errors captured in `ExecutionResult.error`
- Execution status updated to Failed in database
- Error published in status update message
- **Worker updates** execution status to Failed in database (owns state)
- Error published in status change notification message
- Error published in completion notification message
- Artifacts still stored for failed executions
- Logs preserved for debugging
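
The first two bullets amount to a pure mapping from a finished run to its final status. A sketch under stated assumptions — the field names here are illustrative, not necessarily those of the real `ExecutionResult`:

```rust
// Illustrative only: the actual ExecutionResult struct may differ.
pub struct ExecutionResult {
    pub exit_code: i32,
    pub error: Option<String>,
}

pub enum ExecutionStatus {
    Completed,
    Failed,
}

/// A run is Failed if the runtime reported an error or a non-zero exit
/// code; otherwise it is Completed. The worker writes this status to the
/// database, then publishes the notifications listed above.
pub fn final_status(result: &ExecutionResult) -> ExecutionStatus {
    if result.error.is_some() || result.exit_code != 0 {
        ExecutionStatus::Failed
    } else {
        ExecutionStatus::Completed
    }
}
```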