more internal polish, resilient workers

docs/ARCHITECTURE-execution-state-ownership.md (new file, 367 lines)

# Execution State Ownership Model

**Date**: 2026-02-09
**Status**: Implemented
**Related Issues**: Duplicate completion notifications, unnecessary database updates

## Overview

This document defines the **ownership model** for execution state management in Attune. It clarifies which service is responsible for updating execution records at each stage of the lifecycle, eliminating race conditions and redundant database writes.

## The Problem

Prior to this change, both the executor and worker were updating execution state in the database, causing:

1. **Race conditions** - unclear which service's update would happen first
2. **Redundant writes** - both services writing the same status value
3. **Architectural confusion** - no clear ownership boundaries
4. **Warning logs** - duplicate completion notifications

## The Solution: Lifecycle-Based Ownership

Execution state ownership is divided based on **lifecycle stage**, with a clear handoff point:

```
┌─────────────────────────────────────────────────────────────────┐
│                       EXECUTOR OWNERSHIP                        │
│                                                                 │
│   Requested → Scheduling → Scheduled                            │
│                                        │                        │
│   (includes cancellations/failures     │                        │
│    before execution.scheduled          │                        │
│    message is published)               │                        │
│                                        │                        │
│                         Handoff Point:                          │
│              execution.scheduled message PUBLISHED              │
│                                        ▼                        │
└─────────────────────────────────────────────────────────────────┘
                                         │
                                         │  Worker receives message
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                        WORKER OWNERSHIP                         │
│                                                                 │
│   Running → Completed / Failed / Cancelled / Timeout            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Executor Responsibilities

The **Executor Service** owns execution state from creation through scheduling:

- ✅ Creates execution records (`Requested`)
- ✅ Updates status during scheduling (`Scheduling`)
- ✅ Updates status when scheduled to worker (`Scheduled`)
- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
- ✅ Handles cancellations/failures BEFORE `execution.scheduled` is published
- ❌ Does NOT update status after `execution.scheduled` is published

**Lifecycle stages**: `Requested` → `Scheduling` → `Scheduled`

**Important**: If an execution is cancelled or fails before the executor publishes `execution.scheduled`, the executor is responsible for updating the status (e.g., to `Cancelled`). The worker never learns about executions that don't reach the handoff point.

### Worker Responsibilities

The **Worker Service** owns execution state after receiving the handoff:

- ✅ Receives `execution.scheduled` message **← TAKES OWNERSHIP**
- ✅ Updates status when execution starts (`Running`)
- ✅ Updates status when execution completes (`Completed`, `Failed`, etc.)
- ✅ Handles cancellations AFTER receiving `execution.scheduled`
- ✅ Updates execution result data
- ✅ Publishes `execution.status_changed` notifications
- ✅ Publishes `execution.completed` notifications
- ❌ Does NOT update status for executions it hasn't received

**Lifecycle stages**: `Running` → `Completed` / `Failed` / `Cancelled` / `Timeout`

**Important**: The worker only owns executions it has received via `execution.scheduled`. If a cancellation happens before this message is sent, the worker is never involved.

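The ownership rule above can be sketched as a small, self-contained Rust example. This is an illustration only; `Owner`, `may_update`, and `handoff_published` are hypothetical names, not part of the Attune codebase. The key point it encodes: ownership is decided by whether the `execution.scheduled` handoff message has been published, not by the status value itself.

```rust
// Hypothetical sketch of the ownership rule; names are illustrative.
#[derive(Debug, PartialEq)]
enum Owner {
    Executor,
    Worker,
}

struct Execution {
    // True once the executor has published execution.scheduled.
    handoff_published: bool,
}

impl Execution {
    // Before the handoff message is published, the executor owns all state
    // updates (including cancellations); afterwards, the worker does.
    fn owner(&self) -> Owner {
        if self.handoff_published {
            Owner::Worker
        } else {
            Owner::Executor
        }
    }

    fn may_update(&self, service: Owner) -> bool {
        self.owner() == service
    }
}

fn main() {
    let queued = Execution { handoff_published: false };
    let scheduled = Execution { handoff_published: true };

    // Pre-handoff cancellation is the executor's job; the worker never
    // learns about this execution.
    assert!(queued.may_update(Owner::Executor));
    assert!(!queued.may_update(Owner::Worker));

    // After execution.scheduled is published, only the worker writes state.
    assert!(scheduled.may_update(Owner::Worker));
    assert!(!scheduled.may_update(Owner::Executor));

    println!("ownership rule holds");
}
```

Note that a terminal status like `Cancelled` alone is ambiguous (either service can write it, depending on timing); only the handoff fact disambiguates, which is why the sketch keys off it.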
## Message Flow

### 1. Executor Creates and Schedules

```
Executor Service
├─> Creates execution (status: Requested)
├─> Updates status: Scheduling
├─> Selects worker
├─> Updates status: Scheduled
└─> Publishes: execution.scheduled → worker-specific queue
```

### 2. Worker Receives and Executes

```
Worker Service
├─> Receives: execution.scheduled
├─> Updates DB: Scheduled → Running
├─> Publishes: execution.status_changed (running)
├─> Executes action
├─> Updates DB: Running → Completed/Failed
├─> Publishes: execution.status_changed (completed/failed)
└─> Publishes: execution.completed
```

### 3. Executor Handles Orchestration

```
Executor Service (ExecutionManager)
├─> Receives: execution.status_changed
├─> Does NOT update database
├─> Handles orchestration logic:
│   ├─> Triggers workflow children (if parent completed)
│   ├─> Updates workflow state
│   └─> Manages parent-child relationships
└─> Logs event for monitoring
```

### 4. Queue Management

```
Executor Service (CompletionListener)
├─> Receives: execution.completed
├─> Releases queue slot
├─> Notifies waiting executions
└─> Updates queue statistics
```

## Database Update Rules

### Executor (Pre-Scheduling)

**File**: `crates/executor/src/scheduler.rs`

```rust
// ✅ Executor updates DB before scheduling
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;

// Publish to worker
Self::queue_to_worker(...).await?;
```

### Worker (Post-Scheduling)

**File**: `crates/worker/src/executor.rs`

```rust
// ✅ Worker updates DB when starting
async fn execute(&self, execution_id: i64) -> Result<ExecutionResult> {
    // Update status to running
    self.update_execution_status(execution_id, ExecutionStatus::Running).await?;

    // Execute action...
}

// ✅ Worker updates DB when completing
async fn handle_execution_success(&self, execution_id: i64, result: &ExecutionResult) -> Result<()> {
    let input = UpdateExecutionInput {
        status: Some(ExecutionStatus::Completed),
        result: Some(result_data),
        // ...
    };
    ExecutionRepository::update(&self.pool, execution_id, input).await?;
}
```

### Executor (Post-Scheduling)

**File**: `crates/executor/src/execution_manager.rs`

```rust
// ❌ Executor does NOT update DB after scheduling
async fn process_status_change(...) -> Result<()> {
    // Fetch execution (for orchestration logic only)
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Handle orchestration, but do NOT update DB
    match status {
        ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled => {
            Self::handle_completion(pool, publisher, &execution).await?;
        }
        _ => {}
    }

    Ok(())
}
```

## Benefits

### 1. Clear Ownership Boundaries

- No ambiguity about who updates what
- Easy to reason about system behavior
- Reduced cognitive load for developers

### 2. Eliminated Race Conditions

- Only one service updates each lifecycle stage
- No competing writes to the same fields
- Predictable state transitions

### 3. Better Performance

- No redundant database writes
- Reduced database contention
- Lower network overhead (fewer queries)

### 4. Cleaner Logs

Before:
```
executor | Updated execution 9061 status: Scheduled -> Running
executor | Updated execution 9061 status: Running -> Running
executor | Updated execution 9061 status: Completed -> Completed
executor | WARN: Completion notification for action 3 but active_count is 0
```

After:
```
executor | Execution 9061 scheduled to worker 29
worker   | Starting execution: 9061
worker   | Execution 9061 completed successfully in 142ms
executor | Execution 9061 reached terminal state: Completed, handling orchestration
```

### 5. Idempotent Message Handling

- Executor can safely receive duplicate status change messages
- Worker updates are authoritative
- No special logic needed for retries

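The skip-if-unchanged rule behind this idempotency can be illustrated with a small self-contained sketch, using an in-memory map as a stand-in for the executions table (names like `Store` are illustrative, not from the codebase):

```rust
// Hypothetical sketch: replaying the same status message is a no-op,
// so duplicate deliveries never cause extra database writes.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Status {
    Running,
    Completed,
}

struct Store {
    rows: HashMap<i64, Status>,
    writes: u32,
}

impl Store {
    fn apply(&mut self, id: i64, status: Status) {
        if self.rows.get(&id) == Some(&status) {
            return; // status unchanged: skip the write entirely
        }
        self.rows.insert(id, status);
        self.writes += 1;
    }
}

fn main() {
    let mut store = Store { rows: HashMap::new(), writes: 0 };
    store.apply(9061, Status::Running);
    store.apply(9061, Status::Completed);
    store.apply(9061, Status::Completed); // duplicate message: ignored
    assert_eq!(store.writes, 2);
    println!("writes = {}", store.writes);
}
```

Because the second `Completed` message changes nothing, a retry or redelivered message needs no dedup bookkeeping: the comparison against current state is the dedup.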
## Edge Cases & Error Handling

### Cancellation Before Handoff

**Scenario**: Execution is queued due to a concurrency policy; the user cancels before scheduling.

**Handling**:
- Execution is in the `Requested` or `Scheduling` state
- Executor updates the status to `Cancelled`
- Worker never receives `execution.scheduled`
- No worker resources consumed ✅

### Cancellation After Handoff

**Scenario**: Execution is already scheduled to a worker; the user cancels while it is running.

**Handling**:
- Worker has received `execution.scheduled` and owns the execution
- Worker updates status: `Running` → `Cancelled`
- Worker publishes a status change notification
- Executor handles orchestration (e.g., skips workflow children)

### Worker Crashes Before Updating Status

**Scenario**: Worker receives `execution.scheduled` but crashes before updating the status to `Running`.

**Handling**:
- Execution remains in the `Scheduled` state
- Worker owned the execution but failed to update it
- Executor's heartbeat monitoring detects stale scheduled executions
- After a timeout, the executor can reschedule to another worker or mark the execution as abandoned
- Idempotent: if the worker already started, duplicate scheduling is rejected
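The staleness check described above can be sketched as a self-contained Rust predicate (a hypothetical illustration; the field names and the timeout value are not taken from the codebase):

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    Scheduled,
    Running,
}

struct Execution {
    status: ExecutionStatus,
    scheduled_at: SystemTime,
}

// An execution is stale when it has sat in `Scheduled` longer than the
// heartbeat timeout without the worker ever moving it to `Running`.
fn is_stale(execution: &Execution, now: SystemTime, timeout: Duration) -> bool {
    execution.status == ExecutionStatus::Scheduled
        && now
            .duration_since(execution.scheduled_at)
            .map(|age| age > timeout)
            .unwrap_or(false)
}

fn main() {
    let now = SystemTime::now();
    let timeout = Duration::from_secs(300);

    let stale = Execution {
        status: ExecutionStatus::Scheduled,
        scheduled_at: now - Duration::from_secs(600),
    };
    let fresh = Execution {
        status: ExecutionStatus::Scheduled,
        scheduled_at: now,
    };
    // Once the worker has moved it to Running, it is no longer a candidate.
    let running = Execution {
        status: ExecutionStatus::Running,
        scheduled_at: now - Duration::from_secs(600),
    };

    assert!(is_stale(&stale, now, timeout));
    assert!(!is_stale(&fresh, now, timeout));
    assert!(!is_stale(&running, now, timeout));
}
```

A periodic task running this predicate over `Scheduled` rows is all the detection requires; the reschedule-or-abandon decision stays with the executor, consistent with the ownership model.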

### Message Delivery Delays

**Scenario**: Worker updates the DB but the `execution.status_changed` message is delayed.

**Handling**:
- Database reflects the correct state (source of truth)
- Executor eventually receives the notification and handles orchestration
- Orchestration logic is idempotent (safe to call multiple times)
- Critical: workflows may see a slight delay, but remain consistent

### Partial Failures

**Scenario**: Worker updates the DB successfully but fails to publish the notification.

**Handling**:
- Database has the correct state (worker succeeded)
- Executor won't trigger orchestration until the notification arrives
- Future enhancement: periodic executor polling for stale completions
- Workaround: worker retries message publishing with exponential backoff

## Migration Notes

### Changes Required

1. **Executor Service** (`execution_manager.rs`):
   - ✅ Removed database updates from `process_status_change()`
   - ✅ Changed to read-only orchestration handler
   - ✅ Updated logs to reflect observer role

2. **Worker Service** (`service.rs`):
   - ✅ Already updates DB directly (no changes needed)
   - ✅ Updated comment: "we'll update the database directly"

3. **Documentation**:
   - ✅ Updated module docs to reflect ownership model
   - ✅ Added ownership boundaries to architecture docs

### Backward Compatibility

- ✅ No breaking changes to external APIs
- ✅ Message formats unchanged
- ✅ Database schema unchanged
- ✅ Workflow behavior unchanged

## Testing Strategy

### Unit Tests

- ✅ Executor tests verify no DB updates after scheduling
- ✅ Worker tests verify DB updates at all lifecycle stages
- ✅ Message handler tests verify orchestration without DB writes

### Integration Tests

- Test full execution lifecycle end-to-end
- Verify status transitions in database
- Confirm orchestration logic (workflow children) still works
- Test failure scenarios (worker crashes, message delays)

### Monitoring

Monitor for:
- Executions stuck in `Scheduled` state (worker not picking up)
- Large delays between status changes (message queue lag)
- Workflow children not triggering (orchestration failure)

## Future Enhancements

### 1. Executor Polling for Stale Completions

If `execution.status_changed` messages are lost, the executor could periodically poll for completed executions that haven't triggered orchestration.

### 2. Worker Health Checks

More robust detection of worker failures before scheduled executions time out.

### 3. Explicit Handoff Messages

Consider adding an `execution.handoff` message to explicitly mark the ownership transfer point.

## References

- **Architecture Doc**: `docs/architecture/executor-service.md`
- **Work Summary**: `work-summary/2026-02-09-duplicate-completion-fix.md`
- **Bug Fix Doc**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
- **ExecutionManager**: `crates/executor/src/execution_manager.rs`
- **Worker Executor**: `crates/worker/src/executor.rs`
- **Worker Service**: `crates/worker/src/service.rs`

## Summary

The execution state ownership model provides **clear, lifecycle-based boundaries** for who updates execution records:

- **Executor**: Owns state from creation through scheduling (including pre-handoff cancellations)
- **Worker**: Owns state after receiving the `execution.scheduled` message
- **Handoff**: Occurs when the `execution.scheduled` message is **published to the worker**
- **Key Principle**: The worker only knows about executions it receives; pre-handoff cancellations are the executor's responsibility

This eliminates race conditions, reduces database load, and provides a clean architectural foundation for future enhancements.

docs/BUGFIX-duplicate-completion-2026-02-09.md (new file, 342 lines)

# Bug Fix: Duplicate Completion Notifications & Unnecessary Database Updates

**Date**: 2026-02-09
**Component**: Executor Service (ExecutionManager)
**Issue Type**: Performance & Correctness

## Overview

Fixed two related inefficiencies in the executor service:
1. **Duplicate completion notifications** causing queue manager warnings
2. **Unnecessary database updates** writing unchanged status values

---

## Problem 1: Duplicate Completion Notifications

### Symptom
```
WARN crates/executor/src/queue_manager.rs:320:
Completion notification for action 3 but active_count is 0
```

### Before Fix - Message Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                         Worker Service                          │
│                                                                 │
│  1. Completes action execution                                  │
│  2. Updates DB: status = "Completed"                            │
│  3. Publishes: execution.status_changed (status: "completed")   │
│  4. Publishes: execution.completed                              │
└───────────────┬───────────────────────────────┬─────────────────┘
                │ execution.status_changed      │ execution.completed
                ▼                               ▼
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│  ExecutionManager               │   │  CompletionListener             │
│                                 │   │                                 │
│  Receives:                      │   │  Receives: execution.completed  │
│  execution.status_changed       │   │                                 │
│                                 │   │  → notify_completion()          │
│  → handle_completion()          │   │  → Decrements active_count ✅   │
│  → publish_completion_notif()   │   └─────────────────────────────────┘
│                                 │
│  Publishes: execution.completed │
└───────────────┬─────────────────┘
                │
                ▼
┌─────────────────────────────────────────────┐
│  CompletionListener (again)                 │
│                                             │
│  Receives: execution.completed (2nd time!)  │
│  → notify_completion()                      │
│  → active_count already 0                   │
│  → ⚠️ WARNING LOGGED                        │
└─────────────────────────────────────────────┘

Result: 2x completion notifications, 1x warning
```

### After Fix - Message Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                         Worker Service                          │
│                                                                 │
│  1. Completes action execution                                  │
│  2. Updates DB: status = "Completed"                            │
│  3. Publishes: execution.status_changed (status: "completed")   │
│  4. Publishes: execution.completed                              │
└───────────────┬───────────────────────────────┬─────────────────┘
                │ execution.status_changed      │ execution.completed
                ▼                               ▼
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│  ExecutionManager               │   │  CompletionListener             │
│                                 │   │                                 │
│  Receives:                      │   │  Receives: execution.completed  │
│  execution.status_changed       │   │                                 │
│                                 │   │  → notify_completion()          │
│  → handle_completion()          │   │  → Decrements active_count ✅   │
│  → Handles workflow children    │   └─────────────────────────────────┘
│  → NO completion publish ✅     │
└─────────────────────────────────┘

Result: 1x completion notification, 0x warnings ✅
```

---

## Problem 2: Unnecessary Database Updates

### Symptom
```
INFO crates/executor/src/execution_manager.rs:108:
Updated execution 9061 status: Completed -> Completed
```

### Before Fix - Status Update Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                         Worker Service                          │
│                                                                 │
│  1. Completes action execution                                  │
│  2. ExecutionRepository::update()                               │
│     status: Running → Completed ✅                              │
│  3. Publishes: execution.status_changed (status: "completed")   │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 │  Message Queue
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        ExecutionManager                         │
│                                                                 │
│  1. Receives: execution.status_changed (status: "completed")    │
│  2. Fetches execution from DB                                   │
│     Current status: Completed                                   │
│  3. Sets: execution.status = Completed (same value)             │
│  4. ExecutionRepository::update()                               │
│     status: Completed → Completed ❌                            │
│  5. Logs: "Updated execution 9061 status:                       │
│     Completed -> Completed"                                     │
└─────────────────────────────────────────────────────────────────┘

Result: 2x database writes for same status value
```

### After Fix - Status Update Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                         Worker Service                          │
│                                                                 │
│  1. Completes action execution                                  │
│  2. ExecutionRepository::update()                               │
│     status: Running → Completed ✅                              │
│  3. Publishes: execution.status_changed (status: "completed")   │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 │  Message Queue
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        ExecutionManager                         │
│                                                                 │
│  1. Receives: execution.status_changed (status: "completed")    │
│  2. Fetches execution from DB                                   │
│     Current status: Completed                                   │
│  3. Compares: old_status (Completed) == new_status (Completed)  │
│  4. Skips database update ✅                                    │
│  5. Still handles orchestration (workflow children)             │
│  6. Logs: "Execution 9061 status unchanged, skipping update"    │
└─────────────────────────────────────────────────────────────────┘

Result: 1x database write (only when status changes) ✅
```

---

## Code Changes

### Change 1: Remove Duplicate Completion Publication

**File**: `crates/executor/src/execution_manager.rs`

```rust
// BEFORE
async fn handle_completion(...) -> Result<()> {
    // Handle workflow children...

    // Publish completion notification
    Self::publish_completion_notification(pool, publisher, execution).await?;
    // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    // DUPLICATE - worker already did this!

    Ok(())
}
```

```rust
// AFTER
async fn handle_completion(...) -> Result<()> {
    // Handle workflow children...

    // NOTE: Completion notification is published by the worker, not here.
    // This prevents duplicate execution.completed messages that would cause
    // the queue manager to decrement active_count twice.

    Ok(())
}

// Removed entire publish_completion_notification() method
```

### Change 2: Skip Unnecessary Database Updates

**File**: `crates/executor/src/execution_manager.rs`

```rust
// BEFORE
async fn process_status_change(...) -> Result<()> {
    let mut execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    let old_status = execution.status.clone();
    execution.status = status; // Always set, even if same

    ExecutionRepository::update(pool, execution.id, execution.clone().into()).await?;
    // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    // ALWAYS writes, even if unchanged!

    info!("Updated execution {} status: {:?} -> {:?}", execution_id, old_status, status);

    // Handle completion logic...
    Ok(())
}
```

```rust
// AFTER
async fn process_status_change(...) -> Result<()> {
    let mut execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    let old_status = execution.status.clone();

    // Skip update if status hasn't changed
    if old_status == status {
        debug!("Execution {} status unchanged ({:?}), skipping database update",
            execution_id, status);

        // Still handle completion logic for orchestration (e.g., workflow children)
        if matches!(status, ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled) {
            Self::handle_completion(pool, publisher, &execution).await?;
        }

        return Ok(()); // Early return - no DB write
    }

    execution.status = status;
    ExecutionRepository::update(pool, execution.id, execution.clone().into()).await?;

    info!("Updated execution {} status: {:?} -> {:?}", execution_id, old_status, status);

    // Handle completion logic...
    Ok(())
}
```

---

## Impact & Benefits

### Performance Improvements

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Completion messages per execution | 2 | 1 | **50% reduction** |
| Queue manager warnings | Frequent | None | **100% elimination** |
| Database writes (no status change) | Always | Never | **100% elimination** |
| Log noise | High | Low | **Significant reduction** |

### Typical Execution Flow

**Before fixes**:
- 1x execution completed
- 2x `execution.completed` messages published
- 1x unnecessary database write (Completed → Completed)
- 1x queue manager warning
- Noisy logs with redundant "status: Completed -> Completed" messages

**After fixes**:
- 1x execution completed
- 1x `execution.completed` message published (worker only)
- 0x unnecessary database writes
- 0x queue manager warnings
- Clean, informative logs

### High-Throughput Scenarios

At **1000 executions/minute**:

**Before**:
- 2000 completion messages/min
- ~1000 unnecessary DB writes/min
- ~1000 warning logs/min

**After**:
- 1000 completion messages/min (50% reduction)
- 0 unnecessary DB writes (100% reduction)
- 0 warning logs (100% reduction)

---

## Testing

- ✅ All 58 executor unit tests pass
- ✅ Zero compiler warnings
- ✅ No breaking changes to external behavior
- ✅ Orchestration logic (workflow children) still works correctly

---

## Architecture Clarifications

### Separation of Concerns

| Component | Responsibility |
|-----------|----------------|
| **Worker** | Authoritative source for execution completion, publishes completion notifications |
| **Executor** | Orchestration (workflows, child executions), NOT completion notifications |
| **CompletionListener** | Queue management (releases slots for queued executions) |

### Idempotency

The executor is now **idempotent** with respect to status change messages:
- Receiving the same status change multiple times has no effect after the first
- Database is only written when state actually changes
- Orchestration logic (workflows) runs correctly regardless

---

## Lessons Learned

1. **Message publishers should be explicit** - Only one component should publish a given message type
2. **Always check for actual changes** - Don't blindly write to the database without comparing old and new values
3. **Separate orchestration from notification** - Workflow logic shouldn't trigger duplicate notifications
4. **Log levels matter** - Changed redundant updates from INFO to DEBUG to reduce noise
5. **Trust the source** - The worker owns the execution lifecycle; the executor shouldn't second-guess it

---

## Related Documentation

- Work Summary: `attune/work-summary/2026-02-09-duplicate-completion-fix.md`
- Queue Manager: `attune/crates/executor/src/queue_manager.rs`
- Completion Listener: `attune/crates/executor/src/completion_listener.rs`
- Execution Manager: `attune/crates/executor/src/execution_manager.rs`

docs/QUICKREF-dotenv-shell-actions.md (new file, 337 lines)

# Quick Reference: DOTENV Shell Actions Pattern

**Purpose:** Standard pattern for writing portable shell actions without external dependencies like `jq`.

## Core Principles

1. **Use POSIX shell** (`#!/bin/sh`), not bash
2. **Read parameters in DOTENV format** from stdin
3. **No external JSON parsers** (jq, yq, etc.)
4. **Minimal dependencies** (only POSIX utilities + curl)

## Complete Template

```sh
#!/bin/sh
# Action Name - Core Pack
# Brief description of what this action does
#
# This script uses pure POSIX shell without external dependencies like jq.
# It reads parameters in DOTENV format from stdin until the delimiter.

set -e

# Initialize variables with defaults
param1=""
param2="default_value"
bool_param="false"
numeric_param="0"

# Read DOTENV-formatted parameters from stdin until delimiter
while IFS= read -r line; do
    # Check for parameter delimiter
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*)
            break
            ;;
    esac
    [ -z "$line" ] && continue

    key="${line%%=*}"
    value="${line#*=}"

    # Remove quotes if present (both single and double)
    case "$value" in
        \"*\")
            value="${value#\"}"
            value="${value%\"}"
            ;;
        \'*\')
            value="${value#\'}"
            value="${value%\'}"
            ;;
    esac

    # Process parameters
    case "$key" in
        param1)
            param1="$value"
            ;;
        param2)
            param2="$value"
            ;;
        bool_param)
            bool_param="$value"
            ;;
        numeric_param)
            numeric_param="$value"
            ;;
    esac
done

# Normalize boolean values
case "$bool_param" in
    true|True|TRUE|yes|Yes|YES|1) bool_param="true" ;;
    *) bool_param="false" ;;
esac

# Validate numeric parameters
case "$numeric_param" in
    ''|*[!0-9]*)
        echo "ERROR: numeric_param must be a positive integer" >&2
        exit 1
        ;;
esac

# Validate required parameters
if [ -z "$param1" ]; then
    echo "ERROR: param1 is required" >&2
    exit 1
fi

# Action logic goes here
echo "Processing with param1=$param1, param2=$param2"

# Exit successfully
exit 0
```
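To sanity-check the parsing loop locally, the core of the template can be run against a here-document standing in for the executor's stdin (a smoke-test sketch, not part of the pack; `param1` and `bool_param` are the template's own sample names):

```sh
#!/bin/sh
# Exercise the DOTENV parsing loop from the template against sample input.
param1=""
bool_param="false"
while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;
    esac
    [ -z "$line" ] && continue
    key="${line%%=*}"
    value="${line#*=}"
    case "$value" in
        \"*\") value="${value#\"}"; value="${value%\"}" ;;
    esac
    case "$key" in
        param1) param1="$value" ;;
        bool_param) bool_param="$value" ;;
    esac
done <<EOF
param1="hello"
bool_param=true
---ATTUNE_PARAMS_END---
ignored_after_delimiter=1
EOF
echo "param1=$param1 bool_param=$bool_param"
# Prints: param1=hello bool_param=true
```

Note that the line after the delimiter is never read: the `break` fires first, which is exactly the behavior actions rely on.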

## YAML Metadata Configuration

```yaml
ref: core.action_name
label: "Action Name"
description: "Brief description"
enabled: true
runner_type: shell
entry_point: action_name.sh

# IMPORTANT: Use dotenv format for POSIX shell compatibility
parameter_delivery: stdin
parameter_format: dotenv

# Output format (text or json)
output_format: text

parameters:
  type: object
  properties:
    param1:
      type: string
      description: "First parameter"
    param2:
      type: string
      description: "Second parameter"
      default: "default_value"
    bool_param:
      type: boolean
      description: "Boolean parameter"
      default: false
  required:
    - param1
```

## Common Patterns

### 1. Parameter Parsing

**Read until delimiter:**
```sh
while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;
    esac
done
```

**Extract key-value:**
```sh
key="${line%%=*}"    # Everything before first =
value="${line#*=}"   # Everything after first =
```

**Remove quotes:**
```sh
case "$value" in
    \"*\") value="${value#\"}"; value="${value%\"}" ;;
    \'*\') value="${value#\'}"; value="${value%\'}" ;;
esac
```

### 2. Boolean Normalization

```sh
case "$bool_param" in
    true|True|TRUE|yes|Yes|YES|1) bool_param="true" ;;
    *) bool_param="false" ;;
esac
```

### 3. Numeric Validation

```sh
case "$number" in
    ''|*[!0-9]*)
        echo "ERROR: must be a number" >&2
        exit 1
        ;;
esac
```

### 4. JSON Output (without jq)

**Escape special characters:**
```sh
escaped=$(printf '%s' "$value" | sed 's/\\/\\\\/g; s/"/\\"/g')
```

**Build JSON:**
```sh
cat <<EOF
{
  "field": "$escaped",
  "boolean": $bool_value,
  "number": $number
}
EOF
```
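Putting the two snippets together, a value containing quotes and backslashes round-trips into valid JSON like this (a standalone sketch; `message` is an arbitrary field name chosen for the example):

```sh
#!/bin/sh
# Escape, then interpolate into a heredoc - no jq required.
value='say "hi" to C:\temp'
escaped=$(printf '%s' "$value" | sed 's/\\/\\\\/g; s/"/\\"/g')
cat <<EOF
{"message": "$escaped"}
EOF
# Prints: {"message": "say \"hi\" to C:\\temp"}
```

The backslash substitution must run first; reversing the two `sed` expressions would double-escape the backslashes introduced for the quotes.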

### 5. Making HTTP Requests

**With curl and temp files:**
```sh
temp_response=$(mktemp)
cleanup() { rm -f "$temp_response"; }
trap cleanup EXIT

http_code=$(curl -X POST \
    -H "Content-Type: application/json" \
    ${api_token:+-H "Authorization: Bearer ${api_token}"} \
    -d "$request_body" \
    -s \
    -w "%{http_code}" \
    -o "$temp_response" \
    --max-time 60 \
    "${api_url}/api/v1/endpoint" 2>/dev/null || echo "000")

if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
    cat "$temp_response"
    exit 0
else
    echo "ERROR: API call failed (HTTP $http_code)" >&2
    exit 1
fi
```
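
When a temp file is inconvenient, the status code can instead be split off the end of the combined output, since `%{http_code}` is always exactly three characters. A sketch with a canned string standing in for real `curl -s -w "%{http_code}"` output (no `-o`):

```shell
#!/bin/sh
# Split "body + 3-digit status" using only POSIX parameter expansion.
# $raw stands in for the captured curl output.
raw='{"ok":true}200'

body=${raw%???}              # strip the trailing 3 characters
http_code=${raw#"$body"}     # what was stripped: the status itself

echo "code=$http_code"
echo "body=$body"
```

This only works if the body itself is not empty-or-ambiguous in a way that matters to you; the temp-file approach above is the more robust default.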

### 6. Extracting JSON Fields (simple cases)

**Extract field value:**
```sh
case "$response" in
    *'"field":'*)
        value=$(printf '%s' "$response" | sed -n 's/.*"field":[[:space:]]*"\([^"]*\)".*/\1/p')
        ;;
esac
```

**Note:** For complex JSON, consider having the API return the exact format needed. (`[[:space:]]` is used instead of the GNU-only `\s` so the pattern also works with busybox sed on Alpine.)
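
For example, against a small canned response the pattern pulls the field out directly:

```shell
#!/bin/sh
# Extract "field" from a flat JSON response without jq.
# The $response value is illustrative.
response='{"status": "ok", "field": "hello", "count": 3}'

value=""
case "$response" in
    *'"field":'*)
        value=$(printf '%s' "$response" | sed -n 's/.*"field":[[:space:]]*"\([^"]*\)".*/\1/p')
        ;;
esac

echo "field=$value"
```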

## Anti-Patterns (DO NOT DO)

❌ **Using jq:**
```sh
value=$(echo "$json" | jq -r '.field')  # NO!
```

❌ **Using bash-specific features:**
```sh
#!/bin/bash              # NO! Use #!/bin/sh
[[ "$var" == "value" ]]  # NO! Use [ "$var" = "value" ]
```

❌ **Reading JSON directly from stdin:**
```yaml
parameter_format: json  # NO! Use dotenv
```

❌ **Using Python/Node.js in core pack:**
```yaml
runner_type: python  # NO! Use shell for core pack
```

## Testing Checklist

- [ ] Script has `#!/bin/sh` shebang
- [ ] Script is executable (`chmod +x`)
- [ ] All parameters have defaults or validation
- [ ] Boolean values are normalized
- [ ] Numeric values are validated
- [ ] Required parameters are checked
- [ ] Error messages go to stderr (`>&2`)
- [ ] Successful output goes to stdout
- [ ] Temp files are cleaned up (trap handler)
- [ ] YAML has `parameter_format: dotenv`
- [ ] YAML has `runner_type: shell`
- [ ] No `jq`, `yq`, or bash-isms used
- [ ] Works on Alpine Linux (minimal environment)

## Examples from Core Pack

### Simple Action (echo.sh)
- Minimal parameter parsing
- Single string parameter
- Text output

### Complex Action (http_request.sh)
- Multiple parameters (headers, query params)
- HTTP client implementation
- JSON output construction
- Error handling

### API Wrapper (register_packs.sh)
- JSON request body construction
- API authentication
- Response parsing
- Structured error messages

## DOTENV Format Specification

**Format:** Each parameter on a new line as `key=value`

**Example:**
```
param1="string value"
param2=42
bool_param=true
---ATTUNE_PARAMS_END---
```

**Key Rules:**
- Parameters end with `---ATTUNE_PARAMS_END---` delimiter
- Values may be quoted (single or double quotes)
- Empty lines are skipped
- No multiline values (use base64 if needed)
- Array/object parameters passed as JSON strings
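
The multiline rule in practice: the caller base64-encodes the value so it travels on a single line, and the action decodes it (the `notes` variable is illustrative):

```shell
#!/bin/sh
# Multiline values are not allowed in dotenv, so they are base64-encoded
# into a single line by the caller and decoded inside the action.
multiline='line one
line two'

encoded=$(printf '%s' "$multiline" | base64 | tr -d '\n')   # one line on the wire
notes=$(printf '%s' "$encoded" | base64 -d)                 # decode in the action

echo "$notes"
```

`tr -d '\n'` strips the line wrapping some `base64` implementations add; `base64 -d` is supported by both GNU coreutils and busybox.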

## When to Use This Pattern

✅ **Use DOTENV shell pattern for:**
- Core pack actions
- Simple utility actions
- Actions that need maximum portability
- Actions that run in minimal containers
- Actions that don't need complex JSON parsing

❌ **Consider other runtimes if you need:**
- Complex JSON manipulation
- External libraries (AWS SDK, etc.)
- Advanced string processing
- Parallel processing
- Language-specific features

## Further Reading

- `packs/core/actions/echo.sh` - Simplest example
- `packs/core/actions/http_request.sh` - Complex example
- `packs/core/actions/register_packs.sh` - API wrapper example
- `docs/pack-structure.md` - Pack development guide
204
docs/QUICKREF-execution-state-ownership.md
Normal file
@@ -0,0 +1,204 @@
# Quick Reference: Execution State Ownership

**Last Updated**: 2026-02-09

## Ownership Model at a Glance

```
┌───────────────────────────────┬──────────────────────────┐
│         EXECUTOR OWNS         │       WORKER OWNS        │
│  Requested                    │  Running                 │
│  Scheduling                   │  Completed               │
│  Scheduled                    │  Failed                  │
│  (+ pre-handoff Cancelled)    │  (+ post-handoff         │
│                               │   Cancelled/Timeout/     │
│                               │   Abandoned)             │
└───────────────────────────────┴──────────────────────────┘
                │                          │
                └──────── HANDOFF ─────────┘
                 execution.scheduled PUBLISHED
```

## Who Updates the Database?

### Executor Updates (Pre-Handoff Only)
- ✅ Creates execution record
- ✅ Updates status: `Requested` → `Scheduling` → `Scheduled`
- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
- ✅ Handles cancellations/failures BEFORE handoff (worker never notified)
- ❌ NEVER updates after `execution.scheduled` is published

### Worker Updates (Post-Handoff Only)
- ✅ Receives `execution.scheduled` message (takes ownership)
- ✅ Updates status: `Scheduled` → `Running`
- ✅ Updates status: `Running` → `Completed`/`Failed`/`Cancelled`/etc.
- ✅ Handles cancellations/failures AFTER handoff
- ✅ Updates result data
- ✅ Writes for every status change after receiving handoff

## Who Publishes Messages?

### Executor Publishes
- `enforcement.created` (from rules)
- `execution.requested` (to scheduler)
- `execution.scheduled` (to worker) **← HANDOFF MESSAGE - OWNERSHIP TRANSFER**

### Worker Publishes
- `execution.status_changed` (for each status change after handoff)
- `execution.completed` (when done)

### Executor Receives (But Doesn't Update DB Post-Handoff)
- `execution.status_changed` → triggers orchestration logic (read-only)
- `execution.completed` → releases queue slots

## Code Locations

### Executor Updates DB
```rust
// crates/executor/src/scheduler.rs
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;
```

### Worker Updates DB
```rust
// crates/worker/src/executor.rs
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
// ...
ExecutionRepository::update(&self.pool, execution_id, input).await?;
```

### Executor Orchestrates (Read-Only)
```rust
// crates/executor/src/execution_manager.rs
async fn process_status_change(...) -> Result<()> {
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    // NO UPDATE - just orchestration logic
    Self::handle_completion(pool, publisher, &execution).await?;
}
```

## Decision Tree: Should I Update the DB?

```
Are you in the Executor?
├─ Have you published execution.scheduled for this execution?
│  ├─ NO → Update DB (you own it)
│  │      └─ Includes: Requested/Scheduling/Scheduled/pre-handoff Cancelled
│  └─ YES → Don't update DB (worker owns it now)
│          └─ Just orchestrate (trigger workflows, etc)
│
Are you in the Worker?
├─ Have you received execution.scheduled for this execution?
│  ├─ YES → Update DB for ALL status changes (you own it)
│  │       └─ Includes: Running/Completed/Failed/post-handoff Cancelled/etc.
│  └─ NO → Don't touch this execution (doesn't exist for you yet)
```

## Common Patterns

### ✅ DO: Worker Updates After Handoff
```rust
// Worker receives execution.scheduled
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
self.publish_status_update(execution_id, ExecutionStatus::Running).await?;
```

### ✅ DO: Executor Orchestrates Without DB Write
```rust
// Executor receives execution.status_changed
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
if status == ExecutionStatus::Completed {
    Self::trigger_child_executions(pool, publisher, &execution).await?;
}
```

### ❌ DON'T: Executor Updates After Handoff
```rust
// Executor receives execution.status_changed
execution.status = status;
ExecutionRepository::update(pool, execution.id, execution).await?; // ❌ WRONG!
```

### ❌ DON'T: Worker Updates Before Handoff
```rust
// Worker updates execution it hasn't received via execution.scheduled
ExecutionRepository::update(&self.pool, execution_id, input).await?; // ❌ WRONG!
```

### ✅ DO: Executor Handles Pre-Handoff Cancellation
```rust
// User cancels execution before it's scheduled to a worker.
// Execution is still in Requested/Scheduling state.
execution.status = ExecutionStatus::Cancelled;
ExecutionRepository::update(pool, execution_id, execution).await?; // ✅ CORRECT!
// Worker never receives execution.scheduled, never knows execution existed
```

### ✅ DO: Worker Handles Post-Handoff Cancellation
```rust
// Worker received execution.scheduled, now owns the execution.
// User cancels execution while it's running.
execution.status = ExecutionStatus::Cancelled;
ExecutionRepository::update(&self.pool, execution_id, execution).await?; // ✅ CORRECT!
self.publish_status_update(execution_id, ExecutionStatus::Cancelled).await?;
```

## Handoff Checklist

When an execution is scheduled:

**Executor Must**:
- [x] Update status to `Scheduled`
- [x] Write to database
- [x] Publish `execution.scheduled` message **← HANDOFF OCCURS HERE**
- [x] Stop updating this execution (ownership transferred)
- [x] Continue to handle orchestration (read-only)

**Worker Must**:
- [x] Receive `execution.scheduled` message **← OWNERSHIP RECEIVED**
- [x] Take ownership of execution state
- [x] Update DB for all future status changes
- [x] Handle any cancellations/failures after this point
- [x] Publish status notifications

**Important**: If an execution is cancelled BEFORE the executor publishes `execution.scheduled`, the executor updates the status to `Cancelled` and the worker never learns the execution existed.

## Benefits Summary

| Aspect | Benefit |
|--------|---------|
| **Race Conditions** | Eliminated - only one owner per stage |
| **DB Writes** | Reduced by ~50% - no duplicates |
| **Code Clarity** | Clear boundaries - easy to reason about |
| **Message Traffic** | Reduced - no duplicate completions |
| **Idempotency** | Safe to receive duplicate messages |

## Troubleshooting

### Execution Stuck in "Scheduled"
**Problem**: Worker not updating status to Running
**Check**: Was `execution.scheduled` published? Did the worker receive it? Is the worker healthy?

### Workflow Children Not Triggering
**Problem**: Orchestration not running
**Check**: Did the worker publish `execution.status_changed`? Is the message queue healthy?

### Duplicate Status Updates
**Problem**: Both services updating the DB
**Check**: The executor must NOT update after publishing `execution.scheduled`

### Execution Cancelled But Status Not Updated
**Problem**: Cancellation not reflected in the database
**Check**: Was it cancelled before or after handoff?
**Fix**: If before handoff → executor updates; if after handoff → worker updates

### Queue Warnings
**Problem**: Duplicate completion notifications
**Check**: Only the worker should publish `execution.completed`

## See Also

- **Full Architecture Doc**: `docs/ARCHITECTURE-execution-state-ownership.md`
- **Bug Fix Visualization**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
- **Work Summary**: `work-summary/2026-02-09-execution-state-ownership.md`
460
docs/QUICKREF-phase3-retry-health.md
Normal file
@@ -0,0 +1,460 @@
# Quick Reference: Phase 3 - Intelligent Retry & Worker Health

## Overview

Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.

**Key Features:**
- **Automatic Retry:** Failed executions automatically retry with exponential backoff
- **Health-Aware Scheduling:** Prefer healthy workers with low queue depth
- **Per-Action Configuration:** Custom timeouts and retry limits per action
- **Failure Classification:** Distinguish retriable vs non-retriable failures

## Quick Start

### Enable Retry for an Action

```yaml
# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120  # Custom timeout (overrides global 5 min)
max_retries: 3        # Retry up to 3 times on failure
parameters:
  url:
    type: string
    required: true
```

### Database Migration

```bash
# Apply Phase 3 schema changes
sqlx migrate run

# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql
```

### Check Worker Health

```bash
# View healthy workers
psql -c "SELECT * FROM healthy_workers;"

# Check specific worker health
psql -c "
SELECT
    name,
    capabilities->'health'->>'status' as health_status,
    capabilities->'health'->>'queue_depth' as queue_depth,
    capabilities->'health'->>'consecutive_failures' as failures
FROM worker
WHERE id = 1;
"
```

## Retry Behavior

### Retriable Failures

Executions are automatically retried for:
- ✓ Worker unavailable (`worker_unavailable`)
- ✓ Queue timeout/TTL expired (`queue_timeout`)
- ✓ Worker heartbeat stale (`worker_heartbeat_stale`)
- ✓ Transient errors (`transient_error`)
- ✓ Manual retry requested (`manual_retry`)

### Non-Retriable Failures

These failures are NOT retried:
- ✗ Validation errors
- ✗ Permission denied
- ✗ Action not found
- ✗ Invalid parameters
- ✗ Explicit action failure

### Retry Backoff

**Strategy:** Exponential backoff with jitter

```
Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300 seconds)
```

**Jitter:** ±20% randomization to avoid thundering herd
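
The schedule above can be sketched directly (jitter omitted so the output is deterministic; this is an illustration of the formula, not the executor's Rust implementation):

```shell
#!/bin/sh
# Exponential backoff schedule: min(base * 2^attempt, max).
base=1
max=300

attempt=0
while [ "$attempt" -le 10 ]; do
    backoff=$(( base << attempt ))               # base * 2^attempt
    [ "$backoff" -gt "$max" ] && backoff=$max    # cap at 5 minutes
    echo "attempt $attempt: ${backoff}s"
    attempt=$(( attempt + 1 ))
done
```

From attempt 9 onward the cap dominates and every delay is 300s.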

### Retry Configuration

```rust
// Default retry configuration
RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,    // 5 minutes max
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,       // 20% jitter
}
```

## Worker Health

### Health States

**Healthy:**
- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%

**Degraded:**
- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives tasks but deprioritized

**Unhealthy:**
- Heartbeat > 30 seconds old
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new tasks

### Health Metrics

Workers self-report health in capabilities:

```json
{
  "runtimes": ["shell", "python"],
  "health": {
    "status": "healthy",
    "last_check": "2026-02-09T12:00:00Z",
    "consecutive_failures": 0,
    "total_executions": 1000,
    "failed_executions": 20,
    "average_execution_time_ms": 1500,
    "queue_depth": 5
  }
}
```

### Worker Selection

**Selection Priority:**
1. Healthy workers (queue depth ascending)
2. Degraded workers (queue depth ascending)
3. Skip unhealthy workers

**Example:**
```
Worker A: Healthy,   queue=5   ← Selected first
Worker B: Healthy,   queue=20  ← Selected second
Worker C: Degraded,  queue=10  ← Selected third
Worker D: Unhealthy, queue=0   ← Never selected
```
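
The thresholds above condense into a small classifier. A sketch mirroring the documented limits (not the executor's actual Rust code; heartbeat staleness is checked separately, and failure rate is passed as a whole percentage):

```shell
#!/bin/sh
# Classify a worker from the documented thresholds:
#   unhealthy: failures >= 10, queue >= 100, or rate >= 70%
#   degraded:  failures >= 3,  queue >= 50,  or rate >= 30%
classify() {
    failures=$1; queue=$2; rate_pct=$3
    if [ "$failures" -ge 10 ] || [ "$queue" -ge 100 ] || [ "$rate_pct" -ge 70 ]; then
        echo unhealthy
    elif [ "$failures" -ge 3 ] || [ "$queue" -ge 50 ] || [ "$rate_pct" -ge 30 ]; then
        echo degraded
    else
        echo healthy
    fi
}

classify 0 5 2     # low on every axis
classify 4 10 5    # consecutive failures push it to degraded
classify 0 120 5   # queue depth alone makes it unhealthy
```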

## Database Schema

### Execution Retry Fields

```sql
-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)
```

### Action Configuration Fields

```sql
-- Added to action table
timeout_seconds INTEGER,        -- Per-action timeout override
max_retries INTEGER DEFAULT 0   -- Per-action retry limit
```

### Helper Functions

```sql
-- Check if execution can be retried
SELECT is_execution_retriable(123);

-- Get worker queue depth
SELECT get_worker_queue_depth(1);
```

### Views

```sql
-- Get all healthy workers
SELECT * FROM healthy_workers;
```

## Practical Examples

### Example 1: View Retry Chain

```sql
-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
    SELECT id, retry_count, retry_reason, original_execution, status
    FROM execution
    WHERE id = 100

    UNION ALL

    SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
    FROM execution e
    JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;
```

### Example 2: Analyze Retry Success Rate

```sql
-- Success rate of retries by reason
SELECT
    config->>'retry_reason' as reason,
    COUNT(*) as total_retries,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) as succeeded,
    ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) as success_rate
FROM execution
WHERE retry_count > 0
GROUP BY config->>'retry_reason'
ORDER BY total_retries DESC;
```

### Example 3: Find Workers by Health

```sql
-- Workers sorted by health and load
SELECT
    w.name,
    w.status,
    (w.capabilities->'health'->>'status')::TEXT as health,
    (w.capabilities->'health'->>'queue_depth')::INTEGER as queue,
    (w.capabilities->'health'->>'consecutive_failures')::INTEGER as failures,
    w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY
    CASE (w.capabilities->'health'->>'status')::TEXT
        WHEN 'healthy' THEN 1
        WHEN 'degraded' THEN 2
        WHEN 'unhealthy' THEN 3
        ELSE 4
    END,
    (w.capabilities->'health'->>'queue_depth')::INTEGER;
```

### Example 4: Manual Retry via API

```bash
# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "retry test"},
    "config": {
      "retry_of": 123,
      "retry_count": 1,
      "max_retries": 3,
      "retry_reason": "manual_retry",
      "original_execution": 123
    }
  }'
```

## Monitoring

### Key Metrics

**Retry Metrics:**
- Retry rate: % of executions that retry
- Retry success rate: % of retries that succeed
- Average retries per execution
- Retry reason distribution

**Health Metrics:**
- Healthy worker count
- Degraded worker count
- Unhealthy worker count
- Average queue depth per worker
- Average failure rate per worker

### SQL Queries

```sql
-- Retry rate over last hour
SELECT
    COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) as original_executions,
    COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) as retry_executions,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) /
          COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 2) as retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';

-- Worker health distribution
SELECT
    COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') as health_status,
    COUNT(*) as worker_count,
    AVG((capabilities->'health'->>'queue_depth')::INTEGER) as avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;
```

## Configuration

### Retry Configuration

```rust
// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,
});
```

### Health Probe Configuration

```rust
// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,
    failure_rate_unhealthy: 0.7,
});
```

## Troubleshooting

### High Retry Rate

**Symptoms:** Many executions retrying repeatedly

**Causes:**
- Workers unstable or frequently restarting
- Network issues causing transient failures
- Actions not idempotent (retry makes things worse)

**Resolution:**
1. Check worker stability: `docker compose ps`
2. Review action idempotency
3. Adjust `max_retries` if retries are unhelpful
4. Investigate root cause of failures

### Retries Not Triggering

**Symptoms:** Failed executions not retrying despite `max_retries > 0`

**Causes:**
- Action doesn't have `max_retries` set
- Failure is non-retriable (validation error, etc.)
- Global retry disabled

**Resolution:**
1. Check action configuration: `SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';`
2. Check failure message for retriable patterns
3. Verify retry is enabled in executor config

### Workers Marked Unhealthy

**Symptoms:** Workers not receiving tasks

**Causes:**
- High queue depth (overloaded)
- Consecutive failures exceed threshold
- Heartbeat stale

**Resolution:**
1. Check worker logs: `docker compose logs -f worker-shell`
2. Verify heartbeat: `SELECT name, last_heartbeat FROM worker;`
3. Check queue depth in capabilities
4. Restart worker if stuck: `docker compose restart worker-shell`

### Retry Loops

**Symptoms:** Execution retries forever or retries excessively

**Causes:**
- Bug in retry reason detection
- Action failure always classified as retriable
- `max_retries` not being enforced

**Resolution:**
1. Check the retry chain: see Example 1 above
2. Verify `max_retries`: `SELECT config FROM execution WHERE id = 123;`
3. Fix retry reason classification if incorrect
4. Manually fail the execution if stuck

## Integration with Previous Phases

### Phase 1 + Phase 2 + Phase 3 Together

**Defense in Depth:**
1. **Phase 1 (Timeout Monitor):** Catches stuck SCHEDULED executions (30s-5min)
2. **Phase 2 (Queue TTL/DLQ):** Expires messages in worker queues (5min)
3. **Phase 3 (Intelligent Retry):** Retries retriable failures (1s-5min backoff)

**Failure Flow:**
```
Execution dispatched → Worker unavailable (Phase 2: 5min TTL)
                     → DLQ handler marks FAILED (Phase 2)
                     → Retry manager creates retry (Phase 3)
                     → Retry dispatched with backoff (Phase 3)
                     → Success or exhaust retries
```

**Backup Safety Net:**
If the Phase 3 retry manager fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.

## Best Practices

### Action Design for Retries

1. **Make actions idempotent:** Safe to run multiple times
2. **Set realistic timeouts:** Based on typical execution time
3. **Configure appropriate max_retries:**
   - Network calls: 3-5 retries
   - Database operations: 2-3 retries
   - External APIs: 3 retries
   - Local operations: 0-1 retries

### Worker Health Management

1. **Report queue depth regularly:** Update every heartbeat
2. **Track failure metrics:** Consecutive failures, total/failed counts
3. **Implement graceful degradation:** Continue working when degraded
4. **Fail fast when unhealthy:** Stop accepting work if overloaded

### Monitoring Strategy

1. **Alert on high retry rates:** > 20% of executions retrying
2. **Alert on unhealthy workers:** > 50% of workers unhealthy
3. **Track retry success rate:** Should be > 70%
4. **Monitor queue depths:** Average should stay < 20

## See Also

- **Architecture:** `docs/architecture/worker-availability-handling.md`
- **Phase 1 Guide:** `docs/QUICKREF-worker-availability-phase1.md`
- **Phase 2 Guide:** `docs/QUICKREF-worker-queue-ttl-dlq.md`
- **Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`
227
docs/QUICKREF-worker-heartbeat-monitoring.md
Normal file
@@ -0,0 +1,227 @@
# Quick Reference: Worker Heartbeat Monitoring

**Purpose**: Automatically detect and deactivate workers that have stopped sending heartbeats

## Overview

The executor service includes a background task that monitors worker heartbeats and automatically marks stale workers as inactive. This prevents the scheduler from attempting to assign work to workers that are no longer available.

## How It Works

### Background Monitor Task

- **Location**: `crates/executor/src/service.rs` → `worker_heartbeat_monitor_loop()`
- **Check Interval**: Every 60 seconds
- **Staleness Threshold**: 90 seconds (3x the expected 30-second heartbeat interval)

### Detection Logic

The monitor checks all workers with `status = 'active'`:

1. **No Heartbeat**: Workers with `last_heartbeat = NULL` → marked inactive
2. **Stale Heartbeat**: Workers with heartbeat older than 90 seconds → marked inactive
3. **Fresh Heartbeat**: Workers with heartbeat within 90 seconds → remain active

### Automatic Deactivation

When a stale worker is detected:
- Worker status updated to `inactive` in database
- Warning logged with worker name, ID, and heartbeat age
- Summary logged with count of deactivated workers

## Configuration

### Constants (in scheduler.rs and service.rs)

```rust
DEFAULT_HEARTBEAT_INTERVAL: 30 seconds  // Expected worker heartbeat frequency
HEARTBEAT_STALENESS_MULTIPLIER: 3       // Grace period multiplier
MAX_STALENESS: 90 seconds               // Calculated: 30 * 3
```
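
The staleness rule reduces to a single comparison on heartbeat age. A sketch with fixed epoch-second timestamps (the values are illustrative):

```shell
#!/bin/sh
# A worker is stale when its last heartbeat is older than
# interval * multiplier = 30 * 3 = 90 seconds.
interval=30
multiplier=3
max_staleness=$(( interval * multiplier ))

now=1000
last_heartbeat=880                 # 120 seconds ago
age=$(( now - last_heartbeat ))

if [ "$age" -gt "$max_staleness" ]; then
    echo "stale (${age}s old) - mark inactive"
else
    echo "fresh (${age}s old) - keep active"
fi
```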

### Check Interval

Currently hardcoded to 60 seconds. Configured when spawning the monitor task:

```rust
Self::worker_heartbeat_monitor_loop(worker_pool, 60).await;
```

## Worker Lifecycle

### Normal Operation

```
Worker Starts → Registers → Sends Heartbeats (30s) → Remains Active
```

### Graceful Shutdown

```
Worker Stops → No More Heartbeats → Monitor Detects (60s) → Marked Inactive
```

### Crash/Network Failure

```
Worker Crashes → Heartbeats Stop → Monitor Detects (60s) → Marked Inactive
```
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Check Active Workers
|
||||
|
||||
```sql
|
||||
SELECT name, worker_role, status, last_heartbeat
|
||||
FROM worker
|
||||
WHERE status = 'active'
|
||||
ORDER BY last_heartbeat DESC;
|
||||
```
|
||||
|
||||
### Check Recent Deactivations
|
||||
|
||||
```sql
|
||||
SELECT name, worker_role, status, last_heartbeat, updated
|
||||
FROM worker
|
||||
WHERE status = 'inactive'
|
||||
AND updated > NOW() - INTERVAL '5 minutes'
|
||||
ORDER BY updated DESC;
|
||||
```
|
||||
|
||||
### Count Workers by Status
|
||||
|
||||
```sql
|
||||
SELECT status, COUNT(*)
|
||||
FROM worker
|
||||
GROUP BY status;
|
||||
```
|
||||
|
||||
## Logs

### Monitor Startup

```
INFO: Starting worker heartbeat monitor...
INFO: Worker heartbeat monitor started (check interval: 60s, staleness threshold: 90s)
```

### Worker Deactivation

```
WARN: Worker sensor-77cd23b50478 (ID: 27) heartbeat is stale (1289s old), marking as inactive
INFO: Deactivated 5 worker(s) with stale heartbeats
```

### Error Handling

```
ERROR: Failed to deactivate worker worker-123 (stale heartbeat): <error details>
ERROR: Failed to query active workers for heartbeat check: <error details>
```
## Scheduler Integration

The scheduler already filters out stale workers during worker selection:

```rust
// Filter by heartbeat freshness
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();
```

**Before Heartbeat Monitor**: The scheduler filtered at selection time, but workers stayed "active" in the DB.

**After Heartbeat Monitor**: Workers are marked inactive in the DB, so the scheduler sees accurate state.
## Troubleshooting

### Workers Constantly Becoming Inactive

**Symptoms**: Active workers being marked inactive despite running
**Causes**:
- Worker heartbeat interval > 30 seconds
- Network issues preventing heartbeat messages
- Worker service crash loop

**Solutions**:
1. Check worker logs for heartbeat send attempts
2. Verify RabbitMQ connectivity
3. Check worker configuration for heartbeat interval

### Stale Workers Not Being Deactivated

**Symptoms**: Workers with old heartbeats remain active
**Causes**:
- Executor service not running
- Monitor task crashed

**Solutions**:
1. Check executor service logs
2. Verify the monitor task started: `grep "heartbeat monitor started" executor.log`
3. Restart the executor service

### Too Many Inactive Workers

**Symptoms**: Database has hundreds of inactive workers
**Causes**: Historical workers from development/testing

**Solutions**:
```sql
-- Delete inactive workers older than 7 days
DELETE FROM worker
WHERE status = 'inactive'
  AND updated < NOW() - INTERVAL '7 days';
```
## Best Practices
|
||||
|
||||
### Worker Registration
|
||||
|
||||
Workers should:
|
||||
- Set appropriate unique name (hostname-based)
|
||||
- Send heartbeat every 30 seconds
|
||||
- Handle graceful shutdown (optional: mark self inactive)
|
||||
|
||||
### Database Maintenance
|
||||
|
||||
- Periodically clean up old inactive workers
|
||||
- Monitor worker table growth
|
||||
- Index on `status` and `last_heartbeat` for efficient queries
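
Such an index could look like the following (a sketch: the index name and the `IF NOT EXISTS` guard are assumptions, so adapt it to your migration conventions):

```sql
-- Speeds up the monitor's staleness scan and the status counts above.
CREATE INDEX IF NOT EXISTS idx_worker_status_heartbeat
    ON worker (status, last_heartbeat);
```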

### Monitoring & Alerts

- Track the worker deactivation rate (should be low in production)
- Alert on sudden increases in deactivations (infrastructure issue)
- Monitor active worker count vs. expected

## Related Documentation

- `docs/architecture/worker-service.md` - Worker architecture
- `docs/architecture/executor-service.md` - Executor architecture
- `docs/deployment/ops-runbook-queues.md` - Operational procedures
- `AGENTS.md` - Project rules and conventions

## Implementation Notes

### Why 90 Seconds?

- Workers send a heartbeat every 30 seconds
- The 3x multiplier provides a grace period for:
  - Network latency
  - Brief load spikes
  - Temporary connectivity issues
- Balances responsiveness vs. false positives
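
The threshold arithmetic can be sketched as a pure function (the constant and function names here are illustrative, not taken from the codebase):

```rust
use std::time::Duration;

// Assumed defaults, mirroring the values documented above.
const HEARTBEAT_INTERVAL_SECS: u64 = 30;
const STALENESS_MULTIPLIER: u64 = 3;

/// A heartbeat is fresh while its age is below 30s * 3 = 90s.
fn is_heartbeat_fresh(age: Duration) -> bool {
    age < Duration::from_secs(HEARTBEAT_INTERVAL_SECS * STALENESS_MULTIPLIER)
}
```

One missed heartbeat (age up to roughly 60s) is still tolerated, while the 1289-second-old heartbeat from the deactivation log above is far past the threshold.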

### Why Check Every 60 Seconds?

- Two full heartbeat intervals elapse between checks
- Reduces database query frequency
- Adequate response time (stale workers are removed within about 2.5 minutes at worst)

### Thread Safety

- The monitor runs in a separate tokio task
- Uses the connection pool for database access
- No shared mutable state
- Safe to run multiple executor instances (each monitors independently)
322
docs/QUICKREF-worker-queue-ttl-dlq.md
Normal file
@@ -0,0 +1,322 @@
# Quick Reference: Worker Queue TTL and Dead Letter Queue (Phase 2)

## Overview

Phase 2 implements message TTL on worker queues and dead letter queue processing to automatically fail executions when workers are unavailable.

**Key Concept:** If a worker doesn't process an execution within 5 minutes, the message expires and the execution is automatically marked as FAILED.

## How It Works

```
Execution → Worker Queue (TTL: 5 min) → Worker Processing ✓
                    ↓ (if timeout)
            Dead Letter Exchange
                    ↓
            Dead Letter Queue
                    ↓
            DLQ Handler (in Executor)
                    ↓
            Execution marked FAILED
```

## Configuration

### Default Settings (All Environments)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours DLQ retention
```

### Tuning TTL

**Worker Queue TTL** (`worker_queue_ttl_ms`):
- **Default:** 300000 (5 minutes)
- **Purpose:** How long to wait before declaring a worker unavailable
- **Tuning:** Set to 2-5x your typical execution time
- **Too short:** Slow executions fail prematurely
- **Too long:** Delayed failure detection for unavailable workers

**DLQ Retention** (`dead_letter.ttl_ms`):
- **Default:** 86400000 (24 hours)
- **Purpose:** How long to keep expired messages for debugging
- **Tuning:** Based on your debugging/forensics needs
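
The 2-5x guideline can be written down as a tiny helper (hypothetical; not part of the codebase):

```rust
/// Suggested worker-queue TTL bounds in milliseconds, following the
/// "2-5x typical execution time" guideline above.
fn suggested_ttl_range_ms(typical_execution_ms: u64) -> (u64, u64) {
    (typical_execution_ms * 2, typical_execution_ms * 5)
}
```

For a typical 1-minute execution this yields 120000-300000 ms; the upper bound matches the 5-minute default.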

## Components

### 1. Worker Queue TTL

- Applied to all `worker.{id}.executions` queues
- Configured via the RabbitMQ queue argument `x-message-ttl`
- Messages expire if not consumed within the TTL
- Expired messages are routed to the dead letter exchange

### 2. Dead Letter Exchange (DLX)

- **Name:** `attune.dlx`
- **Type:** `direct`
- Receives all expired messages from worker queues
- Routes them to the dead letter queue

### 3. Dead Letter Queue (DLQ)

- **Name:** `attune.dlx.queue`
- Stores expired messages for processing
- Retains messages for 24 hours (configurable)
- Processed by the dead letter handler

### 4. Dead Letter Handler

- Runs in the executor service
- Consumes messages from the DLQ
- Updates executions to FAILED status
- Provides descriptive error messages

## Monitoring

### Key Metrics

```bash
# Check DLQ depth
rabbitmqadmin list queues name messages | grep attune.dlx.queue

# View DLQ rate
# Watch for sustained DLQ message rate > 10/min

# Check failed executions
curl http://localhost:8080/api/v1/executions?status=failed
```

### Health Checks

**Good:**
- DLQ depth: 0-10
- DLQ rate: < 5 messages/min
- Most executions complete successfully

**Warning:**
- DLQ depth: 10-100
- DLQ rate: 5-20 messages/min
- May indicate worker instability

**Critical:**
- DLQ depth: > 100
- DLQ rate: > 20 messages/min
- Workers likely down or overloaded
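
These thresholds can be folded into a small classifier for alerting (a sketch; the type and function names are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum DlqHealth {
    Good,
    Warning,
    Critical,
}

/// Classify DLQ state using the depth and rate thresholds above.
fn classify_dlq(depth: u64, rate_per_min: u64) -> DlqHealth {
    if depth > 100 || rate_per_min > 20 {
        DlqHealth::Critical
    } else if depth > 10 || rate_per_min >= 5 {
        DlqHealth::Warning
    } else {
        DlqHealth::Good
    }
}
```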

## Troubleshooting

### High DLQ Rate

**Symptoms:** Many executions failing via the DLQ

**Common Causes:**
1. Workers stopped or restarting
2. Workers overloaded (not consuming fast enough)
3. TTL too aggressive for your workload
4. Network connectivity issues

**Resolution:**
```bash
# 1. Check worker status
docker compose ps | grep worker
docker compose logs -f worker-shell

# 2. Verify worker heartbeats
psql -c "SELECT name, status, last_heartbeat FROM worker;"

# 3. Check worker queue depths
rabbitmqadmin list queues name messages | grep "worker\."

# 4. Consider increasing the TTL if executions are legitimately slow
# Edit config and restart executor:
# worker_queue_ttl_ms: 600000  # 10 minutes
```

### DLQ Not Processing

**Symptoms:** DLQ depth increasing, executions stuck

**Common Causes:**
1. Executor service not running
2. DLQ disabled in config
3. Database connection issues

**Resolution:**
```bash
# 1. Verify the executor is running
docker compose ps executor
docker compose logs -f executor | grep "dead letter"

# 2. Check configuration
grep -A 3 "dead_letter:" config.docker.yaml

# 3. Restart the executor if needed
docker compose restart executor
```

### Messages Not Expiring

**Symptoms:** Executions stuck in SCHEDULED, DLQ empty

**Common Causes:**
1. Worker queues not configured with a TTL
2. Worker queues not configured with a DLX
3. Infrastructure setup failed

**Resolution:**
```bash
# 1. Check queue properties
rabbitmqadmin show queue name=worker.1.executions

# Look for:
# - arguments.x-message-ttl: 300000
# - arguments.x-dead-letter-exchange: attune.dlx

# 2. Recreate infrastructure (safe, idempotent)
docker compose restart executor worker-shell
```

## Testing

### Manual Test: Verify TTL Expiration

```bash
# 1. Stop all workers
docker compose stop worker-shell worker-python worker-node

# 2. Create an execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"}
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Check execution status
curl http://localhost:8080/api/v1/executions/{id} | jq '.data.status'
# Should be "failed"

# 5. Check the error message
curl http://localhost:8080/api/v1/executions/{id} | jq '.data.result'
# Should contain "Worker queue TTL expired"

# 6. Verify the DLQ processed it
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```

## Relationship to Phase 1

**Phase 1 (Timeout Monitor):**
- Monitors executions in the SCHEDULED state
- Fails executions after a configured timeout
- Acts as a backup safety net

**Phase 2 (Queue TTL + DLQ):**
- Expires messages at the queue level
- More precise failure detection
- Provides better visibility (DLQ metrics)

**Together:** They provide defense in depth against worker unavailability.

## Common Operations

### View DLQ Messages

```bash
# Get messages from the DLQ (doesn't remove them)
rabbitmqadmin get queue=attune.dlx.queue count=10

# View the x-death header for expiration details
rabbitmqadmin get queue=attune.dlx.queue count=1 --format=long
```

### Manually Purge DLQ

```bash
# Use with caution - removes all messages
rabbitmqadmin purge queue name=attune.dlx.queue
```

### Temporarily Disable DLQ

```yaml
# config.docker.yaml
message_queue:
  rabbitmq:
    dead_letter:
      enabled: false  # Disables DLQ handler
```

**Note:** Messages will still expire but won't be processed.

### Adjust TTL Without Restart

Not possible: queue TTL is set at queue creation time. To change it:

```bash
# 1. Stop all services
docker compose down

# 2. Delete worker queues (forces recreation)
rabbitmqadmin delete queue name=worker.1.executions
# Repeat for all worker queues

# 3. Update config
# Edit worker_queue_ttl_ms

# 4. Restart services (queues recreated with new TTL)
docker compose up -d
```

## Key Files

### Configuration
- `config.docker.yaml` - Production settings
- `config.development.yaml` - Development settings

### Implementation
- `crates/common/src/mq/config.rs` - TTL configuration
- `crates/common/src/mq/connection.rs` - Queue setup with TTL
- `crates/executor/src/dead_letter_handler.rs` - DLQ processing
- `crates/executor/src/service.rs` - DLQ handler integration

### Documentation
- `docs/architecture/worker-queue-ttl-dlq.md` - Full architecture
- `docs/architecture/worker-availability-handling.md` - Phase 1 (backup)

## When to Use

**Enable DLQ (default):**
- Production environments
- Development with multiple workers
- Any environment requiring high reliability

**Disable DLQ:**
- Local development with a single worker
- Testing scenarios where you want manual control
- Debugging worker behavior

## Next Steps (Phase 3)

- **Health probes:** Proactive worker health checking
- **Intelligent retry:** Retry transient failures
- **Per-action TTL:** Custom timeouts per action type
- **DLQ analytics:** Aggregate failure statistics

## See Also

- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
@@ -339,7 +339,7 @@ Understanding the execution lifecycle helps with monitoring and debugging:
```
1. requested → Action execution requested
2. scheduling → Finding available worker
-3. scheduled → Assigned to worker, queued
+3. scheduled → Assigned to worker, queued [HANDOFF TO WORKER]
4. running → Currently executing
5. completed → Finished successfully
   OR
@@ -352,33 +352,78 @@ Understanding the execution lifecycle helps with monitoring and debugging:
   abandoned → Worker lost
```

### State Ownership Model

Execution state is owned by different services at different lifecycle stages:

**Executor Ownership (Pre-Handoff):**
- `requested` → `scheduling` → `scheduled`
- Executor creates and updates execution records
- Executor selects a worker and publishes `execution.scheduled`
- **Handles cancellations/failures BEFORE handoff** (before `execution.scheduled` is published)

**Handoff Point:**
- When the `execution.scheduled` message is **published to the worker**
- Before handoff: Executor owns and updates state
- After handoff: Worker owns and updates state

**Worker Ownership (Post-Handoff):**
- `running` → `completed` / `failed` / `cancelled` / `timeout` / `abandoned`
- Worker updates execution records directly
- Worker publishes status change notifications
- **Handles cancellations/failures AFTER handoff** (after receiving `execution.scheduled`)
- Worker only owns executions it has received

**Orchestration (Read-Only):**
- Executor receives status change notifications for orchestration
- Triggers workflow children, manages parent-child relationships
- Does NOT update execution state after handoff

### State Transitions

**Normal Flow:**
```
-requested → scheduling → scheduled → running → completed
+requested → scheduling → scheduled → [HANDOFF] → running → completed
+└─ Executor Updates ──────────────────────────┘  └─ Worker Updates ─┘
```

**Failure Flow:**
```
-requested → scheduling → scheduled → running → failed
+requested → scheduling → scheduled → [HANDOFF] → running → failed
+└─ Executor Updates ──────────────────────────┘  └─ Worker Updates ─┘
```

-**Cancellation:**
+**Cancellation (depends on handoff):**
```
-(any state) → canceling → cancelled
+Before handoff:
+  requested/scheduling/scheduled → cancelled
+  └─ Executor Updates (worker never notified) ─┘
+
+After handoff:
+  running → canceling → cancelled
+  └─ Worker Updates ──┘
```

**Timeout:**
```
-scheduled/running → timeout
+scheduled/running → [HANDOFF] → timeout
+                                └─ Worker Updates
```

**Abandonment:**
```
-scheduled/running → abandoned
+scheduled/running → [HANDOFF] → abandoned
+                                └─ Worker Updates
```

**Key Points:**
- Only one service updates each execution stage (no race conditions)
- Handoff occurs when `execution.scheduled` is **published**, not just when the status is set to `scheduled`
- If cancelled before handoff: Executor updates (the worker never knows the execution existed)
- If cancelled after handoff: Worker updates (the worker owns the execution)
- The worker is the authoritative source for execution state after receiving `execution.scheduled`
- Status changes are reflected in real time via notifications

---

## Data Fields
@@ -87,32 +87,47 @@ Execution Requested → Scheduler → Worker Selection → Execution Scheduled

### 3. Execution Manager

-**Purpose**: Manages execution lifecycle and status transitions.
+**Purpose**: Orchestrates execution workflows and handles lifecycle events.

**Responsibilities**:
-- Listens for `execution.status.*` messages from workers
-- Updates execution records with status changes
-- Handles execution completion (success, failure, cancellation)
-- Orchestrates workflow executions (parent-child relationships)
-- Publishes completion notifications for downstream consumers
+- **Does NOT update execution state** (worker owns state after scheduling)
+- Handles execution completion orchestration (triggering child executions)
+- Manages workflow executions (parent-child relationships)
+- Coordinates workflow state transitions

**Ownership Model**:
- **Executor owns**: Requested → Scheduling → Scheduled (updates DB)
  - Includes pre-handoff cancellations/failures (before `execution.scheduled` is published)
- **Worker owns**: Running → Completed/Failed/Cancelled (updates DB)
  - Includes post-handoff cancellations/failures (after receiving `execution.scheduled`)
- **Handoff Point**: When the `execution.scheduled` message is **published** to the worker
  - Before publish: Executor owns and updates state
  - After publish: Worker owns and updates state

**Message Flow**:
```
-Worker Status Update → Execution Manager → Database Update → Completion Handler
+Worker Status Update → Execution Manager → Orchestration Logic (Read-Only)
+                                         → Trigger Child Executions
```

**Status Lifecycle**:
```
-Requested → Scheduling → Scheduled → Running → Completed/Failed/Cancelled
-    │
-    └→ Child Executions (workflows)
+Requested → Scheduling → Scheduled → [HANDOFF: execution.scheduled published] → Running → Completed/Failed/Cancelled
+    │                        │                                                     │
+    └─ Executor Updates ─────┘                                                     └─ Worker Updates
+       (includes pre-handoff                                                          (includes post-handoff
+        Cancelled)                                                                     Cancelled/Timeout/Abandoned)
+    │
+    └→ Child Executions (workflows)
```

**Key Implementation Details**:
- Parses status strings to typed enums for type safety
- Receives status change notifications for orchestration purposes only
- Does not update execution state after handoff to the worker
- Handles workflow orchestration (parent-child execution chaining)
- Only triggers child executions on successful parent completion
- Publishes completion events for the notification service
- Read-only access to execution records for orchestration logic

## Message Queue Integration

@@ -123,12 +138,14 @@ The Executor consumes and produces several message types:

**Consumed**:
- `enforcement.created` - New enforcement from triggered rules
- `execution.requested` - Execution scheduling requests
-- `execution.status.*` - Status updates from workers
+- `execution.status.changed` - Status change notifications from workers (for orchestration)
+- `execution.completed` - Completion notifications from workers (for queue management)

**Published**:
- `execution.requested` - To scheduler (from enforcement processor)
-- `execution.scheduled` - To workers (from scheduler)
-- `execution.completed` - To notifier (from execution manager)
+- `execution.scheduled` - To workers (from scheduler) **← OWNERSHIP HANDOFF**

**Note**: The executor does NOT publish `execution.completed` messages. This is the worker's responsibility as the authoritative source of execution state after scheduling.

### Message Envelope Structure

@@ -186,11 +203,34 @@ use attune_common::repositories::{
};
```

### Database Update Ownership

**Executor updates execution state** from creation through handoff:
- Creates execution records (`Requested` status)
- Updates status during scheduling (`Scheduling` → `Scheduled`)
- Publishes the `execution.scheduled` message to the worker **← HANDOFF POINT**
- **Handles cancellations/failures BEFORE handoff** (before the message is published)
  - Example: A user cancels an execution while it is queued by a concurrency policy
  - The executor updates it to `Cancelled`; the worker never receives the message

**Worker updates execution state** after receiving handoff:
- Receives the `execution.scheduled` message (takes ownership)
- Updates status when the execution starts (`Running`)
- Updates status when the execution completes (`Completed`, `Failed`, etc.)
- **Handles cancellations/failures AFTER handoff** (after receiving the message)
- Updates result data and artifacts
- Worker only owns executions it has received

**Executor reads execution state** for orchestration after handoff:
- Receives status change notifications from workers
- Reads execution records to trigger workflow children
- Does NOT update execution state after publishing `execution.scheduled`

### Transaction Support

Future implementations will use database transactions for multi-step operations:
- Creating execution + publishing message (atomic)
- Status update + completion handling (atomic)
- Enforcement processing + execution creation (atomic)

## Configuration

557
docs/architecture/worker-availability-handling.md
Normal file
@@ -0,0 +1,557 @@
# Worker Availability Handling

**Status**: Implementation Gap Identified
**Priority**: High
**Date**: 2026-02-09

## Problem Statement

When workers are stopped or become unavailable, the executor continues attempting to schedule executions to them, resulting in:

1. **Stuck executions**: Executions remain in `SCHEDULING` or `SCHEDULED` status indefinitely
2. **Queue buildup**: Messages accumulate in worker-specific RabbitMQ queues
3. **No failure notification**: Users don't know their executions are stuck
4. **Resource waste**: System resources are consumed by queued messages and database records

## Current Architecture

### Heartbeat Mechanism

Workers send heartbeat updates to the database periodically (default: every 30 seconds).

```rust
// From crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    // Worker is fresh if heartbeat < 90 seconds old
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
    );
    // ...
}
```

### Scheduling Flow

```
Execution Created (REQUESTED)
        ↓
Scheduler receives message
        ↓
Find compatible worker with fresh heartbeat
        ↓
Update execution to SCHEDULED
        ↓
Publish message to worker-specific queue
        ↓
Worker consumes and executes
```

### Failure Points

1. **Worker stops after heartbeat**: Worker has a fresh heartbeat but is actually down
2. **Worker crashes**: No graceful shutdown; the heartbeat appears fresh temporarily
3. **Network partition**: Worker is isolated but appears healthy
4. **Queue accumulation**: Messages sit in worker-specific queues indefinitely

## Current Mitigations (Insufficient)

### 1. Heartbeat Staleness Check

```rust
fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
    // Filter by active workers
    let active_workers: Vec<_> = workers
        .into_iter()
        .filter(|w| w.status == WorkerStatus::Active)
        .collect();

    // Filter by heartbeat freshness
    let fresh_workers: Vec<_> = active_workers
        .into_iter()
        .filter(|w| is_worker_heartbeat_fresh(w))
        .collect();

    if fresh_workers.is_empty() {
        return Err(anyhow!("No workers with fresh heartbeats"));
    }

    // Select first available worker
    Ok(fresh_workers.into_iter().next().unwrap())
}
```

**Gap**: Workers can stop within the 90-second staleness window.

### 2. Message Requeue on Error

```rust
// From crates/common/src/mq/consumer.rs
match handler(envelope.clone()).await {
    Err(e) => {
        let requeue = e.is_retriable();
        channel.basic_nack(delivery_tag, BasicNackOptions {
            requeue,
            multiple: false,
        }).await?;
    }
}
```

**Gap**: Only requeues on retriable errors (connection/timeout), not worker unavailability.

### 3. Message TTL Configuration

```rust
// From crates/common/src/config.rs
pub struct MessageQueueConfig {
    #[serde(default = "default_message_ttl")]
    pub message_ttl: u64,
}

fn default_message_ttl() -> u64 {
    3600 // 1 hour
}
```

**Gap**: The TTL is not currently applied to worker queues, and 1 hour is too long.
## Proposed Solutions

### Solution 1: Execution Timeout Mechanism (HIGH PRIORITY)

Add a background task that monitors scheduled executions and fails them if they don't start within a timeout.

**Implementation:**

```rust
// crates/executor/src/execution_timeout_monitor.rs

pub struct ExecutionTimeoutMonitor {
    pool: PgPool,
    publisher: Arc<Publisher>,
    check_interval: Duration,
    scheduled_timeout: Duration,
}

impl ExecutionTimeoutMonitor {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_stale_executions().await {
                error!("Error checking stale executions: {}", e);
            }
        }
    }

    async fn check_stale_executions(&self) -> Result<()> {
        let cutoff = Utc::now() - chrono::Duration::from_std(self.scheduled_timeout)?;

        // Find executions stuck in SCHEDULED status
        let stale_executions = sqlx::query_as::<_, Execution>(
            "SELECT * FROM execution
             WHERE status = 'scheduled'
               AND updated < $1",
        )
        .bind(cutoff)
        .fetch_all(&self.pool)
        .await?;

        for execution in stale_executions {
            warn!(
                "Execution {} has been scheduled for too long, marking as failed",
                execution.id
            );

            self.fail_execution(
                execution.id,
                "Execution timeout: worker did not pick up task within timeout",
            )
            .await?;
        }

        Ok(())
    }

    async fn fail_execution(&self, execution_id: i64, reason: &str) -> Result<()> {
        // Update execution status
        sqlx::query(
            "UPDATE execution
             SET status = 'failed',
                 result = $2,
                 updated = NOW()
             WHERE id = $1",
        )
        .bind(execution_id)
        .bind(serde_json::json!({
            "error": reason,
            "failed_by": "execution_timeout_monitor"
        }))
        .execute(&self.pool)
        .await?;

        // Publish completion notification
        let payload = ExecutionCompletedPayload {
            execution_id,
            status: ExecutionStatus::Failed,
            result: Some(serde_json::json!({"error": reason})),
        };

        self.publisher
            .publish_envelope(
                MessageType::ExecutionCompleted,
                payload,
                "attune.executions",
            )
            .await?;

        Ok(())
    }
}
```

**Configuration:**

```yaml
# config.yaml
executor:
  scheduled_timeout: 300      # 5 minutes (fail if not running within 5 min)
  timeout_check_interval: 60  # Check every minute
```
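
With these defaults, the worst case between an execution going stale and the monitor failing it is the full timeout plus up to one check interval (a sketch of the arithmetic; the function name is illustrative):

```rust
/// Worst-case detection latency in seconds: an execution can sit stale for
/// the whole scheduled timeout, then wait up to one check interval more.
fn worst_case_detection_secs(scheduled_timeout: u64, check_interval: u64) -> u64 {
    scheduled_timeout + check_interval
}
```

With the configuration above that is 300 + 60 = 360 seconds.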

### Solution 2: Worker Queue TTL and DLQ (MEDIUM PRIORITY)

Apply a message TTL to worker-specific queues, with a dead letter exchange.

**Implementation:**

```rust
// When declaring worker-specific queues
let mut queue_args = FieldTable::default();

// Set message TTL (5 minutes)
queue_args.insert(
    "x-message-ttl".into(),
    AMQPValue::LongInt(300_000), // 5 minutes in milliseconds
);

// Set dead letter exchange
queue_args.insert(
    "x-dead-letter-exchange".into(),
    AMQPValue::LongString("attune.executions.dlx".into()),
);

channel.queue_declare(
    &format!("attune.execution.worker.{}", worker_id),
    QueueDeclareOptions {
        durable: true,
        ..Default::default()
    },
    queue_args,
).await?;
```

**Dead Letter Handler:**

```rust
// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: PgPool,
    consumer: Arc<Consumer>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<()> {
        self.consumer
            .consume_with_handler(|envelope: MessageEnvelope<ExecutionScheduledPayload>| {
                let pool = self.pool.clone();

                async move {
                    warn!(
                        "Received dead letter for execution {}",
                        envelope.payload.execution_id
                    );

                    // Mark execution as failed
                    sqlx::query(
                        "UPDATE execution
                         SET status = 'failed',
                             result = $2,
                             updated = NOW()
                         WHERE id = $1 AND status = 'scheduled'",
                    )
                    .bind(envelope.payload.execution_id)
                    .bind(serde_json::json!({
                        "error": "Message expired in worker queue (worker unavailable)",
                        "failed_by": "dead_letter_handler"
                    }))
                    .execute(&pool)
                    .await?;

                    Ok(())
                }
            })
            .await
    }
}
```
### Solution 3: Worker Health Probes (LOW PRIORITY)

Add active health checking instead of relying solely on heartbeats.

**Implementation:**

```rust
// crates/executor/src/worker_health_checker.rs

pub struct WorkerHealthChecker {
    pool: PgPool,
    check_interval: Duration,
}

impl WorkerHealthChecker {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_worker_health().await {
                error!("Error checking worker health: {}", e);
            }
        }
    }

    async fn check_worker_health(&self) -> Result<()> {
        let workers = WorkerRepository::find_action_workers(&self.pool).await?;

        for worker in workers {
            // Skip if heartbeat is very stale (worker is definitely down)
            if !is_heartbeat_recent(&worker) {
                continue;
            }

            // Attempt health check
            match self.ping_worker(&worker).await {
                Ok(true) => {
                    // Worker is healthy, ensure status is Active
                    if worker.status != Some(WorkerStatus::Active) {
                        self.update_worker_status(worker.id, WorkerStatus::Active).await?;
                    }
                }
                Ok(false) | Err(_) => {
                    // Worker is unhealthy, mark as inactive
                    warn!("Worker {} failed health check", worker.name);
                    self.update_worker_status(worker.id, WorkerStatus::Inactive).await?;
                }
            }
        }

        Ok(())
    }

    async fn ping_worker(&self, worker: &Worker) -> Result<bool> {
        // TODO: Implement health endpoint on worker
        // For now, check if worker's queue is being consumed
        Ok(true)
    }
}
```

### Solution 4: Graceful Worker Shutdown (MEDIUM PRIORITY)

Ensure workers mark themselves as inactive before shutdown.

**Implementation:**

```rust
// In worker service shutdown handler
impl WorkerService {
    pub async fn shutdown(&self) -> Result<()> {
        info!("Worker shutting down gracefully...");

        // Mark worker as inactive
        sqlx::query(
            "UPDATE worker SET status = 'inactive', updated = NOW() WHERE id = $1"
        )
        .bind(self.worker_id)
        .execute(&self.pool)
        .await?;

        // Stop accepting new tasks
        self.stop_consuming().await?;

        // Wait for in-flight tasks to complete (with timeout)
        let timeout = Duration::from_secs(30);
        tokio::time::timeout(timeout, self.wait_for_completion()).await?;

        info!("Worker shutdown complete");
        Ok(())
    }
}
```

**Docker Signal Handling:**

```yaml
# docker-compose.yaml
services:
  worker-shell:
    stop_grace_period: 45s  # Give worker time to finish tasks
```

## Implementation Priority

### Phase 1: Immediate (Week 1)
1. **Execution Timeout Monitor** - Prevents stuck executions
2. **Graceful Shutdown** - Marks workers inactive on stop

### Phase 2: Short-term (Week 2)
3. **Worker Queue TTL + DLQ** - Prevents message buildup
4. **Dead Letter Handler** - Fails expired executions

### Phase 3: Long-term (Month 1)
5. **Worker Health Probes** - Active availability verification
6. **Retry Logic** - Reschedule to different worker on failure

## Configuration

### Recommended Timeouts

```yaml
executor:
  # How long an execution can stay SCHEDULED before failing
  scheduled_timeout: 300  # 5 minutes

  # How often to check for stale executions
  timeout_check_interval: 60  # 1 minute

  # Message TTL in worker queues
  worker_queue_ttl: 300  # 5 minutes (match scheduled_timeout)

  # Worker health check interval
  health_check_interval: 30  # 30 seconds

worker:
  # How often to send heartbeats
  heartbeat_interval: 10  # 10 seconds (more frequent)

  # Grace period for shutdown
  shutdown_timeout: 30  # 30 seconds
```

### Staleness Calculation

```
Heartbeat Staleness Threshold = heartbeat_interval * 3
                              = 10 * 3 = 30 seconds

This means:
- Worker sends heartbeat every 10s
- If heartbeat is > 30s old, worker is considered stale
- Reduces the window where a stopped worker appears healthy from 90s to 30s
```

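The staleness rule above can be captured in a small helper; a sketch using plain `std::time` (the name `is_heartbeat_recent` matches the call in the health-checker example, but this signature is an assumption):

```rust
use std::time::{Duration, SystemTime};

/// Staleness threshold = heartbeat_interval * 3 (per the calculation above).
const STALENESS_MULTIPLIER: u32 = 3;

/// Hypothetical helper: is the last heartbeat recent enough for the
/// worker to be considered alive?
fn is_heartbeat_recent(last_heartbeat: SystemTime, heartbeat_interval: Duration) -> bool {
    let threshold = heartbeat_interval * STALENESS_MULTIPLIER; // 10s * 3 = 30s
    match SystemTime::now().duration_since(last_heartbeat) {
        Ok(age) => age <= threshold,
        // A heartbeat timestamped in the future (clock skew) counts as recent.
        Err(_) => true,
    }
}
```

With a 10s interval, a heartbeat 15s old passes the check and one 45s old does not.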
## Monitoring and Observability

### Metrics to Track

1. **Execution timeout rate**: Number of executions failed due to timeout
2. **Worker downtime**: Time between last heartbeat and status change
3. **Dead letter queue depth**: Number of expired messages
4. **Average scheduling latency**: Time from REQUESTED to RUNNING

### Alerts

```yaml
alerts:
  - name: high_execution_timeout_rate
    condition: execution_timeouts > 10 per minute
    severity: warning

  - name: no_active_workers
    condition: active_workers == 0
    severity: critical

  - name: dlq_buildup
    condition: dlq_depth > 100
    severity: warning

  - name: stale_executions
    condition: scheduled_executions_older_than_5min > 0
    severity: warning
```

## Testing

### Test Scenarios

1. **Worker stops mid-execution**: Should timeout and fail
2. **Worker never picks up task**: Should timeout after 5 minutes
3. **All workers down**: Should immediately fail with "no workers available"
4. **Worker stops gracefully**: Should mark inactive and not receive new tasks
5. **Message expires in queue**: Should be moved to DLQ and the execution failed

### Integration Test Example

```rust
#[tokio::test]
async fn test_execution_timeout_on_worker_down() {
    let pool = setup_test_db().await;
    let mq = setup_test_mq().await;

    // Create worker and execution
    let worker = create_test_worker(&pool).await;
    let execution = create_test_execution(&pool).await;

    // Schedule execution to worker
    schedule_execution(&pool, &mq, execution.id, worker.id).await;

    // Stop worker (simulate crash - no graceful shutdown)
    stop_worker(worker.id).await;

    // Wait for timeout
    tokio::time::sleep(Duration::from_secs(310)).await;

    // Verify execution is marked as failed
    let execution = get_execution(&pool, execution.id).await;
    assert_eq!(execution.status, ExecutionStatus::Failed);
    assert!(execution.result.unwrap()["error"]
        .as_str()
        .unwrap()
        .contains("timeout"));
}
```

## Migration Path

### Step 1: Add Monitoring (No Breaking Changes)
- Deploy execution timeout monitor
- Monitor logs for timeout events
- Tune timeout values based on actual workload

### Step 2: Add DLQ (Requires Queue Reconfiguration)
- Create dead letter exchange
- Update queue declarations with TTL and DLX
- Deploy dead letter handler
- Monitor DLQ depth

### Step 3: Graceful Shutdown (Worker Update)
- Add shutdown handler to worker
- Update Docker Compose stop_grace_period
- Test worker restarts

### Step 4: Health Probes (Future Enhancement)
- Add health endpoint to worker
- Deploy health checker service
- Transition from heartbeat-only to active probing

## Related Documentation

- [Queue Architecture](./queue-architecture.md)
- [Worker Service](./worker-service.md)
- [Executor Service](./executor-service.md)
- [RabbitMQ Queues Quick Reference](../docs/QUICKREF-rabbitmq-queues.md)

---

`docs/architecture/worker-queue-ttl-dlq.md` (new file, 493 lines)

# Worker Queue TTL and Dead Letter Queue (Phase 2)

## Overview

Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.

## Architecture

### Message Flow

```
┌─────────────┐
│  Executor   │
│  Scheduler  │
└──────┬──────┘
       │ Publishes ExecutionRequested
       │ routing_key: execution.dispatch.worker.{id}
       │
       ▼
┌──────────────────────────────────┐
│  worker.{id}.executions queue    │
│                                  │
│  Properties:                     │
│  - x-message-ttl: 300000ms (5m)  │
│  - x-dead-letter-exchange: dlx   │
└──────┬───────────────────┬───────┘
       │                   │
       │ Worker consumes   │ TTL expires
       │ (normal flow)     │ (worker unavailable)
       │                   │
       ▼                   ▼
┌──────────────┐   ┌──────────────────┐
│   Worker     │   │   attune.dlx     │
│   Service    │   │  (Dead Letter    │
│              │   │   Exchange)      │
└──────────────┘   └────────┬─────────┘
                            │
                            │ Routes to DLQ
                            │
                            ▼
                   ┌──────────────────────┐
                   │  attune.dlx.queue    │
                   │  (Dead Letter Queue) │
                   └────────┬─────────────┘
                            │
                            │ Consumes
                            │
                            ▼
                   ┌──────────────────────┐
                   │ Dead Letter Handler  │
                   │   (in Executor)      │
                   │                      │
                   │ - Identifies exec    │
                   │ - Marks as FAILED    │
                   │ - Logs failure       │
                   └──────────────────────┘
```

### Components

#### 1. Worker Queue TTL

**Configuration:**
- Default: 5 minutes (300,000 milliseconds)
- Configurable via `rabbitmq.worker_queue_ttl_ms`

**Implementation:**
- Applied during queue declaration in `Connection::setup_worker_infrastructure()`
- Uses RabbitMQ's `x-message-ttl` queue argument
- Only applies to worker-specific queues (`worker.{id}.executions`)

**Behavior:**
- When a message remains in the queue longer than the TTL, RabbitMQ automatically moves it to the configured dead letter exchange
- Original message properties and headers are preserved
- An `x-death` header is added with expiration details

#### 2. Dead Letter Exchange (DLX)

**Configuration:**
- Exchange name: `attune.dlx`
- Type: `direct`
- Durable: `true`

**Setup:**
- Created in `Connection::setup_common_infrastructure()`
- Bound to dead letter queue with routing key `#` (all messages)
- Shared across all services

#### 3. Dead Letter Queue

**Configuration:**
- Queue name: `attune.dlx.queue`
- Durable: `true`
- TTL: 24 hours (configurable via `rabbitmq.dead_letter.ttl_ms`)

**Properties:**
- Retains messages for debugging and analysis
- Messages auto-expire after the retention period
- No DLX on the DLQ itself (prevents infinite loops)

#### 4. Dead Letter Handler

**Location:** `crates/executor/src/dead_letter_handler.rs`

**Responsibilities:**
1. Consume messages from `attune.dlx.queue`
2. Deserialize the message envelope
3. Extract the execution ID from the payload
4. Verify the execution is in a non-terminal state
5. Update the execution to FAILED status
6. Add descriptive error information
7. Acknowledge the message (remove it from the DLQ)

**Error Handling:**
- Invalid messages: Acknowledged and discarded
- Missing executions: Acknowledged (already processed)
- Terminal-state executions: Acknowledged (no action needed)
- Database errors: Nacked with requeue (retry later)

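The acknowledgement policy in this list is effectively a decision table; a minimal sketch, where `DlqOutcome` and `AckDecision` are hypothetical names introduced for illustration:

```rust
/// Possible outcomes when processing a dead-lettered message (hypothetical names).
#[derive(Debug, PartialEq)]
enum DlqOutcome {
    Failed,           // execution transitioned to FAILED
    InvalidMessage,   // could not deserialize the envelope
    MissingExecution, // execution no longer exists
    AlreadyTerminal,  // execution already finished
    DatabaseError,    // transient infrastructure failure
}

#[derive(Debug, PartialEq)]
enum AckDecision {
    Ack,         // remove the message from the DLQ
    NackRequeue, // put it back and retry later
}

/// Only transient database errors are requeued; every other outcome is final.
fn decide(outcome: &DlqOutcome) -> AckDecision {
    match outcome {
        DlqOutcome::DatabaseError => AckDecision::NackRequeue,
        _ => AckDecision::Ack,
    }
}
```

Keeping the requeue case narrow matters: requeueing a message the handler can never process would loop forever.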
## Configuration

### RabbitMQ Configuration Structure

```yaml
message_queue:
  rabbitmq:
    # Worker queue TTL - how long messages wait before DLX
    worker_queue_ttl_ms: 300000  # 5 minutes (default)

    # Dead letter configuration
    dead_letter:
      enabled: true          # Enable DLQ system
      exchange: attune.dlx   # DLX name
      ttl_ms: 86400000       # DLQ retention (24 hours)
```

### Environment-Specific Settings

#### Development (`config.development.yaml`)
```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours
```

#### Production (`config.docker.yaml`)
```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours
```

### Tuning Guidelines

**Worker Queue TTL (`worker_queue_ttl_ms`):**
- **Too short:** Legitimate slow workers may have executions failed prematurely
- **Too long:** Unavailable workers cause delayed failure detection
- **Recommendation:** 2-5x typical execution time, minimum 2 minutes
- **Default (5 min):** Good balance for most workloads

**DLQ Retention (`dead_letter.ttl_ms`):**
- Purpose: Debugging and forensics
- **Too short:** May lose data before analysis
- **Too long:** Accumulates stale data
- **Recommendation:** 24-48 hours in production
- **Default (24 hours):** Adequate for most troubleshooting

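The TTL recommendation above can be expressed as a helper; a sketch that picks the 3x midpoint of the 2-5x range and applies the 2-minute floor (the function name is illustrative, not part of the codebase):

```rust
use std::time::Duration;

/// Minimum worker queue TTL recommended by the tuning guidelines (2 minutes).
const MIN_WORKER_QUEUE_TTL: Duration = Duration::from_secs(120);

/// Recommended TTL: a multiple of typical execution time (2-5x per the
/// guidelines; 3x chosen here), never below the 2-minute floor.
fn recommended_worker_queue_ttl(typical_execution: Duration) -> Duration {
    (typical_execution * 3).max(MIN_WORKER_QUEUE_TTL)
}
```

For a 100-second typical execution this yields the 300-second (5-minute) default; for very short actions the floor keeps slow-but-healthy workers from being failed prematurely.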
## Code Structure

### Queue Declaration with TTL

```rust
// crates/common/src/mq/connection.rs

pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<u64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();

    // Configure DLX
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );

    // Configure TTL if specified (i64 value, hence LongLongInt)
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            AMQPValue::LongLongInt(ttl as i64),
        );
    }

    // Declare queue with arguments
    channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}
```

### Dead Letter Handler

```rust
// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<Mutex<bool>>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        let pool = self.pool.clone();
        self.consumer.consume_with_handler(move |envelope| {
            let pool = pool.clone();
            async move {
                match envelope.message_type {
                    MessageType::ExecutionRequested => {
                        handle_execution_requested(&pool, &envelope).await
                    }
                    _ => {
                        // Unexpected message type - acknowledge and discard
                        Ok(())
                    }
                }
            }
        }).await
    }
}

async fn handle_execution_requested(
    pool: &PgPool,
    envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
    // Extract execution ID
    let execution_id = envelope.payload.get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or_else(|| /* error */)?;

    // Fetch current state
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Only fail if in non-terminal state
    if !execution.status.is_terminal() {
        ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
            status: Some(ExecutionStatus::Failed),
            result: Some(json!({
                "error": "Worker queue TTL expired",
                "message": "Worker did not process execution within configured TTL",
            })),
            ended: Some(Some(Utc::now())),
            ..Default::default()
        }).await?;
    }

    Ok(())
}
```

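The handler's guard relies on `execution.status.is_terminal()`; a sketch of such a predicate, with the variant set taken from the lifecycle described in this commit (the exact enum definition in the codebase may differ):

```rust
#[derive(Debug, Clone, Copy)]
enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
    Timeout,
}

impl ExecutionStatus {
    /// Terminal states never transition again, so the DLQ handler
    /// must not overwrite them with Failed.
    fn is_terminal(self) -> bool {
        matches!(
            self,
            ExecutionStatus::Completed
                | ExecutionStatus::Failed
                | ExecutionStatus::Cancelled
                | ExecutionStatus::Timeout
        )
    }
}
```

This check is what makes DLQ processing idempotent: a late-arriving dead letter for an execution the worker already finished is simply acknowledged.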
## Integration with Executor Service

The dead letter handler is started automatically by the executor service if the DLQ is enabled:

```rust
// crates/executor/src/service.rs

pub async fn start(&self) -> Result<()> {
    // ... other components ...

    // Start dead letter handler (if enabled)
    if self.inner.mq_config.rabbitmq.dead_letter.enabled {
        let dlq_name = format!("{}.queue",
            self.inner.mq_config.rabbitmq.dead_letter.exchange);
        let dlq_consumer = Consumer::new(
            &self.inner.mq_connection,
            create_dlq_consumer_config(&dlq_name, "executor.dlq"),
        ).await?;

        let dlq_handler = Arc::new(
            DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
        );

        handles.push(tokio::spawn(async move {
            dlq_handler.start().await
        }));
    }

    // ... wait for completion ...
}
```

## Operational Considerations

### Monitoring

**Key Metrics:**
- DLQ message rate (messages/sec entering the DLQ)
- DLQ queue depth (current messages in the DLQ)
- DLQ processing latency (time from DLX to handler)
- Failed execution count (executions failed via the DLQ)

**Alerting Thresholds:**
- DLQ rate > 10/min: Workers may be unhealthy or the TTL too aggressive
- DLQ depth > 100: Handler may be falling behind
- High failure rate: Systematic worker availability issues

### RabbitMQ Management

**View DLQ:**
```bash
# List messages in DLQ
rabbitmqadmin list queues name messages

# Get DLQ details
rabbitmqadmin show queue name=attune.dlx.queue

# Purge DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue
```

**View Dead Letters:**
```bash
# Get a message from the DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1

# Check message death history:
# look for the x-death header in the message properties
```

### Troubleshooting

#### High DLQ Rate

**Symptoms:** Many executions failing via the DLQ

**Causes:**
1. Workers down or restarting frequently
2. Worker queue TTL too aggressive
3. Worker overloaded (not consuming fast enough)
4. Network issues between executor and workers

**Resolution:**
1. Check worker health and logs
2. Verify worker heartbeats in the database
3. Consider increasing `worker_queue_ttl_ms`
4. Scale the worker fleet if overloaded

#### DLQ Handler Not Processing

**Symptoms:** DLQ depth increasing, executions stuck

**Causes:**
1. Executor service not running
2. DLQ disabled in configuration
3. Database connection issues
4. Handler crashed or deadlocked

**Resolution:**
1. Check executor service logs
2. Verify `dead_letter.enabled = true`
3. Check database connectivity
4. Restart the executor service if needed

#### Messages Not Reaching DLQ

**Symptoms:** Executions stuck, DLQ empty

**Causes:**
1. Worker queues not configured with a DLX
2. DLX exchange not created
3. DLQ not bound to the DLX
4. TTL not configured on worker queues

**Resolution:**
1. Restart services to recreate the infrastructure
2. Verify RabbitMQ configuration
3. Check queue properties in the RabbitMQ management UI

## Testing

### Unit Tests

```rust
#[tokio::test]
async fn test_expired_execution_handling() {
    let pool = setup_test_db().await;

    // Create execution in SCHEDULED state
    let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;

    // Simulate DLQ message
    let envelope = MessageEnvelope::new(
        MessageType::ExecutionRequested,
        json!({ "execution_id": execution.id }),
    );

    // Process message
    handle_execution_requested(&pool, &envelope).await.unwrap();

    // Verify execution failed
    let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
    assert_eq!(updated.status, ExecutionStatus::Failed);
    assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}
```

### Integration Tests

```bash
# 1. Start all services
docker compose up -d

# 2. Create an execution targeting a stopped worker
#    (worker_id 999 does not exist)
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Verify the execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"

# 5. Check the DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```

## Relationship to Other Phases

### Phase 1 (Completed)
- Execution timeout monitor: Handles executions stuck in SCHEDULED
- Graceful shutdown: Prevents new tasks from being sent to stopping workers
- Reduced heartbeat interval: Faster stale-worker detection

**Interaction:** The Phase 1 timeout monitor acts as a backstop if DLQ processing fails

### Phase 2 (Current)
- Worker queue TTL: Automatic message expiration
- Dead letter queue: Captures expired messages
- Dead letter handler: Processes and fails expired executions

**Benefit:** More precise failure detection at the message queue level

### Phase 3 (Planned)
- Health probes: Proactive worker health checking
- Intelligent retry: Retry transient failures
- Load balancing: Distribute work across healthy workers

**Integration:** Phase 3 will use Phase 2 DLQ data to inform routing decisions

## Benefits

1. **Automatic Failure Detection:** No manual intervention needed for unavailable workers
2. **Precise Timing:** TTL provides an exact failure window (vs the polling-based Phase 1 monitor)
3. **Resource Efficiency:** Prevents message accumulation in worker queues
4. **Debugging Support:** DLQ retains messages for forensic analysis
5. **Graceful Degradation:** System continues functioning even with worker failures

## Limitations

1. **TTL Precision:** RabbitMQ TTL is approximate, not guaranteed to the millisecond
2. **Race Conditions:** A worker may start processing just as the TTL expires (rare)
3. **DLQ Capacity:** Very high failure rates may overwhelm the DLQ
4. **No Retry Logic:** Phase 2 always fails; Phase 3 will add intelligent retry

## Future Enhancements (Phase 3)

- **Conditional Retry:** Retry messages based on failure reason
- **Priority DLQ:** Prioritize critical execution failures
- **DLQ Analytics:** Aggregate statistics on failure patterns
- **Auto-scaling:** Scale workers based on DLQ rate
- **Custom TTL:** Per-action or per-execution TTL configuration

## References

- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`

---

### 4. Action Executor

**Purpose**: Orchestrate the complete execution flow for an action and own execution state after handoff.

**Execution Flow**:
```
1. Receive execution.scheduled message from executor
2. Load execution record from database
3. Update status to Running (owns state after handoff)
4. Load action definition by reference
5. Prepare execution context (parameters, env vars, timeout)
6. Select and execute in appropriate runtime
7. Capture results (stdout, stderr, return value)
8. Store artifacts (logs, results)
9. Update execution status (Completed/Failed) in database
10. Publish status change notifications
11. Publish completion notification for queue management
```

**Ownership Model**:
- **Worker owns execution state** after receiving `execution.scheduled`
- **Authoritative source** for all status updates: Running, Completed, Failed, Cancelled, etc.
- **Updates database directly** for all state changes
- **Publishes notifications** for orchestration and monitoring

**Responsibilities**:
- Coordinate execution lifecycle
- Load action and execution data from database
- **Update execution state in database** (after handoff from executor)
- Prepare execution context with parameters and environment
- Execute action via runtime registry
- Handle success and failure cases
- Store execution artifacts
- Publish status change notifications

**Key Implementation Details**:
- Parameters merged: action defaults + execution overrides

[...]

See `docs/secrets-management.md` for comprehensive documentation.

- Register worker in database
- Start heartbeat manager
- Consume execution messages from worker-specific queue
- **Own execution state** after receiving scheduled executions
- **Update execution status in database** (Running, Completed, Failed, etc.)
- Publish execution status change notifications
- Publish execution completion notifications
- Handle graceful shutdown

**Message Flow**:

[...]

### Error Propagation

- Runtime errors captured in `ExecutionResult.error`
- **Worker updates** execution status to Failed in database (owns state)
- Error published in status change notification message
- Error published in completion notification message
- Artifacts still stored for failed executions
- Logs preserved for debugging

---

`docs/examples/history-page-url-examples.md` (new file, 227 lines)

# History Page URL Query Parameter Examples

This document provides practical examples of using URL query parameters to deep-link to filtered views in the Attune web UI history pages.

## Executions Page Examples

### Basic Filtering

**Filter by action:**
```
http://localhost:3000/executions?action_ref=core.echo
```
Shows all executions of the `core.echo` action.

**Filter by rule:**
```
http://localhost:3000/executions?rule_ref=core.on_timer
```
Shows all executions triggered by the `core.on_timer` rule.

**Filter by status:**
```
http://localhost:3000/executions?status=failed
```
Shows all failed executions.

**Filter by pack:**
```
http://localhost:3000/executions?pack_name=core
```
Shows all executions from the `core` pack.

### Combined Filters

**Rule + Status:**
```
http://localhost:3000/executions?rule_ref=core.on_timer&status=completed
```
Shows completed executions from a specific rule.

**Action + Pack:**
```
http://localhost:3000/executions?action_ref=core.echo&pack_name=core
```
Shows executions of a specific action in a pack (useful when multiple packs have similarly named actions).

**Multiple Filters:**
```
http://localhost:3000/executions?pack_name=core&status=running&trigger_ref=core.webhook
```
Shows currently running executions from the core pack triggered by webhooks.

### Troubleshooting Scenarios

**Find all failed executions for an action:**
```
http://localhost:3000/executions?action_ref=mypack.problematic_action&status=failed
```

**Check running executions for a specific executor:**
```
http://localhost:3000/executions?executor=1&status=running
```

**View all webhook-triggered executions:**
```
http://localhost:3000/executions?trigger_ref=core.webhook
```

## Events Page Examples

### Basic Filtering

**Filter by trigger:**
```
http://localhost:3000/events?trigger_ref=core.webhook
```
Shows all webhook events.

**Timer events:**
```
http://localhost:3000/events?trigger_ref=core.timer
```
Shows all timer-based events.

**Custom trigger:**
```
http://localhost:3000/events?trigger_ref=mypack.custom_trigger
```
Shows events from a custom trigger.

## Enforcements Page Examples

### Basic Filtering

**Filter by rule:**
```
http://localhost:3000/enforcements?rule_ref=core.on_timer
```
Shows all enforcements (rule activations) for a specific rule.

**Filter by trigger:**
```
http://localhost:3000/enforcements?trigger_ref=core.webhook
```
Shows all enforcements triggered by webhook events.

**Filter by event:**
```
http://localhost:3000/enforcements?event=123
```
Shows the enforcement created by a specific event (useful for tracing the event → enforcement → execution flow).

**Filter by status:**
```
http://localhost:3000/enforcements?status=processed
```
Shows processed enforcements.

### Combined Filters

**Rule + Status:**
```
http://localhost:3000/enforcements?rule_ref=core.on_timer&status=processed
```
Shows successfully processed enforcements for a specific rule.

**Trigger + Event:**
```
http://localhost:3000/enforcements?trigger_ref=core.webhook&event=456
```
Shows enforcements from a specific webhook event.

## Practical Use Cases
|
||||
|
||||
### Debugging a Rule
|
||||
|
||||
1. **Check the event was created:**
|
||||
```
|
||||
http://localhost:3000/events?trigger_ref=core.timer
|
||||
```
|
||||
|
||||
2. **Check the enforcement was created:**
|
||||
```
|
||||
http://localhost:3000/enforcements?rule_ref=core.on_timer
|
||||
```
|
||||
|
||||
3. **Check the execution was triggered:**
|
||||
```
|
||||
http://localhost:3000/executions?rule_ref=core.on_timer
|
||||
```
|
||||
|
||||
### Monitoring Action Performance

**See all executions of an action:**

```
http://localhost:3000/executions?action_ref=core.http_request
```

**See failures:**

```
http://localhost:3000/executions?action_ref=core.http_request&status=failed
```

**See currently running executions:**

```
http://localhost:3000/executions?action_ref=core.http_request&status=running
```
### Auditing Webhook Activity

1. **View all webhook events:**

   ```
   http://localhost:3000/events?trigger_ref=core.webhook
   ```

2. **View enforcements from webhooks:**

   ```
   http://localhost:3000/enforcements?trigger_ref=core.webhook
   ```

3. **View executions triggered by webhooks:**

   ```
   http://localhost:3000/executions?trigger_ref=core.webhook
   ```
### Sharing Views with Team Members

**Share failed executions for investigation:**

```
http://localhost:3000/executions?action_ref=mypack.critical_action&status=failed
```

**Share rule activity for review:**

```
http://localhost:3000/enforcements?rule_ref=mypack.important_rule&status=processed
```
## Tips and Notes

1. **URL Encoding**: If your pack, action, rule, or trigger names contain special characters, the browser URL-encodes them automatically.

2. **Case Sensitivity**: Parameter names and values are case-sensitive. Use lowercase for status values (e.g., `status=failed`, not `status=Failed`).

3. **Invalid Values**: Invalid parameter values are silently ignored, and the filter defaults to empty (showing all results).

4. **Bookmarking**: Save frequently used URLs as browser bookmarks for quick access to common filtered views.

5. **Browser History**: The URL doesn't change as you modify filters in the UI, so the browser's back button won't undo filter changes within a page.

6. **Multiple Status Filters**: While the UI allows selecting multiple statuses, only one status can be specified via URL parameter. Use the UI to select additional statuses after the page loads.
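Building these filter URLs programmatically follows the same encoding rules as the browser (tip 1 above). A minimal Python sketch, where `history_url` is a hypothetical helper (not part of Attune) and the host matches the examples in this document; note `urlencode` encodes spaces as `+` rather than `%20`, which servers treat equivalently in query strings:

```python
from urllib.parse import urlencode

def history_url(page: str, **filters: str) -> str:
    """Build a filtered history-page URL, URL-encoding parameter values."""
    base = f"http://localhost:3000/{page}"
    return f"{base}?{urlencode(filters)}" if filters else base

# Plain refs pass through unchanged:
print(history_url("enforcements", rule_ref="core.on_timer", status="processed"))
# Special characters are percent-encoded, as the browser would do:
print(history_url("executions", action_ref="mypack.do it&more"))
```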
## Parameter Reference Quick Table

| Page | Parameter | Example Value |
|------|-----------|---------------|
| Executions | `action_ref` | `core.echo` |
| Executions | `rule_ref` | `core.on_timer` |
| Executions | `trigger_ref` | `core.webhook` |
| Executions | `pack_name` | `core` |
| Executions | `executor` | `1` |
| Executions | `status` | `failed`, `running`, `completed` |
| Events | `trigger_ref` | `core.webhook` |
| Enforcements | `rule_ref` | `core.on_timer` |
| Enforcements | `trigger_ref` | `core.webhook` |
| Enforcements | `event` | `123` |
| Enforcements | `status` | `processed`, `created`, `disabled` |
365
docs/parameters/dotenv-parameter-format.md
Normal file
@@ -0,0 +1,365 @@
# DOTENV Parameter Format

## Overview

The DOTENV parameter format is used to pass action parameters securely via stdin in a shell-compatible format. It is particularly useful for shell scripts that need to parse parameters without relying on external tools like `jq`.

## Format Specification

### Basic Format

Parameters are formatted as `key='value'` pairs, one per line:

```bash
url='https://example.com'
method='GET'
timeout='30'
verify_ssl='true'
```
### Nested Object Flattening

Nested JSON objects are automatically flattened using dot notation. This allows shell scripts to parse complex parameter structures without a JSON parser.

**Input JSON:**
```json
{
  "url": "https://example.com",
  "headers": {
    "Content-Type": "application/json",
    "Authorization": "Bearer token123"
  },
  "query_params": {
    "page": "1",
    "size": "10"
  }
}
```

**Output DOTENV:**
```bash
headers.Authorization='Bearer token123'
headers.Content-Type='application/json'
query_params.page='1'
query_params.size='10'
url='https://example.com'
```
### Empty Objects

Empty objects (`{}`) are omitted from the output entirely. They do not produce any dotenv entries.

**Input:**
```json
{
  "url": "https://example.com",
  "headers": {},
  "query_params": {}
}
```

**Output:**
```bash
url='https://example.com'
```
### Arrays

Arrays are serialized as JSON strings:

**Input:**
```json
{
  "tags": ["web", "api", "production"]
}
```

**Output:**
```bash
tags='["web","api","production"]'
```
### Special Characters

Single quotes in values are escaped using the shell-safe `'\''` pattern:

**Input:**
```json
{
  "message": "It's working!"
}
```

**Output:**
```bash
message='It'\''s working!'
```
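Taken together, the rules above (dot-notation flattening, omitted empty objects, JSON-serialized arrays, `'\''` escaping, sorted output) can be condensed into a short sketch. This is illustrative Python only — the actual implementation is the worker's Rust code noted under Implementation Details:

```python
import json

def to_dotenv(params: dict, prefix: str = "") -> list[str]:
    """Flatten a parameter dict into sorted key='value' dotenv lines."""
    lines = []
    for key, value in params.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects flatten with dot notation; empty ones emit nothing.
            lines.extend(to_dotenv(value, prefix=f"{full_key}."))
        else:
            if isinstance(value, list):
                # Arrays are serialized as compact JSON strings.
                value = json.dumps(value, separators=(",", ":"))
            # Escape single quotes with the shell-safe '\'' pattern.
            escaped = str(value).replace("'", "'\\''")
            lines.append(f"{full_key}='{escaped}'")
    return sorted(lines)

params = {
    "url": "https://example.com",
    "headers": {"Content-Type": "application/json"},
    "query_params": {},
    "tags": ["web", "api"],
}
print("\n".join(to_dotenv(params)))
```

Running this against the inputs above reproduces the documented outputs, which makes it a handy reference when writing a parser for the format.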
## Shell Script Parsing

### Basic Parameter Parsing

```bash
#!/bin/sh

# Read DOTENV-formatted parameters from stdin
while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;
    esac
    [ -z "$line" ] && continue

    key="${line%%=*}"
    value="${line#*=}"

    # Remove surrounding quotes
    case "$value" in
        \"*\") value="${value#\"}"; value="${value%\"}" ;;
        \'*\') value="${value#\'}"; value="${value%\'}" ;;
    esac

    # Process parameters
    case "$key" in
        url) url="$value" ;;
        method) method="$value" ;;
        timeout) timeout="$value" ;;
    esac
done
```
### Parsing Nested Objects

For flattened nested objects, use pattern matching on the key prefix:

```bash
# Create temporary files for nested data
headers_file=$(mktemp)
query_params_file=$(mktemp)

while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;
    esac
    [ -z "$line" ] && continue

    key="${line%%=*}"
    value="${line#*=}"

    # Remove surrounding quotes
    case "$value" in
        \'*\') value="${value#\'}"; value="${value%\'}" ;;
    esac

    # Process parameters
    case "$key" in
        url) url="$value" ;;
        method) method="$value" ;;
        headers.*)
            # Extract nested key (e.g., "Content-Type" from "headers.Content-Type")
            nested_key="${key#headers.}"
            printf '%s: %s\n' "$nested_key" "$value" >> "$headers_file"
            ;;
        query_params.*)
            nested_key="${key#query_params.}"
            printf '%s=%s\n' "$nested_key" "$value" >> "$query_params_file"
            ;;
    esac
done

# Use the parsed data. Build the argument list with `set --` rather than
# string concatenation, so header values containing spaces survive word splitting.
if [ -s "$headers_file" ]; then
    while IFS= read -r header; do
        set -- "$@" -H "$header"
    done < "$headers_file"
fi
# ...later: curl "$@" "$url"
```
## Configuration

### Action YAML Configuration

Specify the DOTENV format in your action YAML:

```yaml
ref: mypack.myaction
entry_point: myaction.sh
parameter_delivery: stdin
parameter_format: dotenv  # Use dotenv format
output_format: json
```

### Supported Formats

- `dotenv` - Shell-friendly `key='value'` format with nested object flattening
- `json` - Standard JSON format
- `yaml` - YAML format

### Supported Delivery Methods

- `stdin` - Parameters passed via stdin (recommended for security)
- `file` - Parameters written to a temporary file
## Security Considerations

### Why DOTENV + STDIN?

This combination provides several security benefits:

1. **No process list exposure**: Parameters don't appear in `ps aux` output
2. **No shell escaping issues**: Values are properly quoted
3. **Secret protection**: Sensitive values are passed via stdin, not environment variables
4. **No external dependencies**: Pure POSIX shell parsing without `jq` or other tools

### Secret Handling

Secrets are passed separately via stdin, after the parameters. They are never included in environment variables or parameter files.

```bash
# Parameters are sent first
url='https://api.example.com'
---ATTUNE_PARAMS_END---
# Then secrets (as JSON)
{"api_key":"secret123","password":"hunter2"}
```
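The receiving side of this stream layout — dotenv lines up to the delimiter, then a secrets JSON document — can be exercised end to end. A Python sketch for illustration (the shell examples earlier show the same parsing in POSIX sh; `read_stream` and its naive unquoting are assumptions of this sketch, not Attune API):

```python
import json
from io import StringIO

DELIM = "---ATTUNE_PARAMS_END---"

def read_stream(stream) -> tuple[dict, dict]:
    """Split a stdin-style stream into dotenv parameters and trailing secrets JSON."""
    params = {}
    for line in stream:
        line = line.rstrip("\n")
        if DELIM in line:
            break
        if not line:
            continue
        key, _, value = line.partition("=")
        params[key] = value.strip("'")  # naive unquoting; enough for this sketch
    rest = stream.read().strip()       # everything after the delimiter
    secrets = json.loads(rest) if rest else {}
    return params, secrets

stdin = StringIO(
    "url='https://api.example.com'\n"
    "---ATTUNE_PARAMS_END---\n"
    '{"api_key":"secret123","password":"hunter2"}\n'
)
params, secrets = read_stream(stdin)
print(params, secrets)
```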
## Examples

### Example 1: HTTP Request Action

**Action Configuration:**
```yaml
ref: core.http_request
parameter_delivery: stdin
parameter_format: dotenv
```

**Execution Parameters:**
```json
{
  "url": "https://api.example.com/users",
  "method": "POST",
  "headers": {
    "Content-Type": "application/json",
    "User-Agent": "Attune/1.0"
  },
  "query_params": {
    "page": "1",
    "limit": "10"
  }
}
```

**Stdin Input:**
```bash
headers.Content-Type='application/json'
headers.User-Agent='Attune/1.0'
method='POST'
query_params.limit='10'
query_params.page='1'
url='https://api.example.com/users'
---ATTUNE_PARAMS_END---
```
### Example 2: Simple Shell Action

**Action Configuration:**
```yaml
ref: mypack.greet
parameter_delivery: stdin
parameter_format: dotenv
```

**Execution Parameters:**
```json
{
  "name": "Alice",
  "greeting": "Hello"
}
```

**Stdin Input:**
```bash
greeting='Hello'
name='Alice'
---ATTUNE_PARAMS_END---
```
## Troubleshooting

### Issue: Parameters Not Received

**Symptom:** Action receives empty or incorrect parameter values.

**Solution:** Ensure you read until the `---ATTUNE_PARAMS_END---` delimiter:

```bash
while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;  # Important!
    esac
    # ... parse line
done
```
### Issue: Nested Objects Not Parsed

**Symptom:** Headers or query params are not being set correctly.

**Solution:** Use pattern matching to detect dotted keys:

```bash
case "$key" in
    headers.*)
        nested_key="${key#headers.}"
        # Process the nested key
        ;;
esac
```
### Issue: Special Characters Corrupted

**Symptom:** Values containing single quotes are malformed.

**Solution:** The worker automatically escapes single quotes using `'\''`. Make sure you strip the surrounding quotes correctly:

```bash
# Remove surrounding quotes (handles escaped quotes correctly)
case "$value" in
    \'*\') value="${value#\'}"; value="${value%\'}" ;;
esac
```
## Best Practices

1. **Always read until the delimiter**: Don't stop reading stdin early
2. **Handle empty objects**: Check that temp files are non-empty before processing
3. **Use temporary files**: For nested objects, write to temp files for easier processing
4. **Validate required parameters**: Check that required values are present
5. **Clean up temp files**: Use `trap` to ensure cleanup on exit

```bash
#!/bin/sh
set -e

# Set up cleanup; single quotes defer expansion until the trap fires,
# and the inner double quotes keep paths with spaces intact
headers_file=$(mktemp)
trap 'rm -f "$headers_file"' EXIT

# Parse parameters...
```
## Implementation Details

The parameter flattening is implemented in `crates/worker/src/runtime/parameter_passing.rs`:

- Nested objects are recursively flattened with dot notation
- Empty objects produce no output entries
- Arrays are JSON-serialized as strings
- Output is sorted alphabetically for consistency
- Single quotes are escaped using the shell-safe `'\''` pattern

## See Also

- [Action Parameter Schema](../packs/pack-structure.md#parameters)
- [Secrets Management](../authentication/secrets-management.md)
- [Shell Runtime](../architecture/worker-service.md#shell-runtime)
130
docs/web-ui/history-page-query-params.md
Normal file
@@ -0,0 +1,130 @@
# History Page URL Query Parameters

This document describes the URL query parameters supported by the history pages (Executions, Events, Enforcements) in the Attune web UI.

## Overview

All history pages support deep linking via URL query parameters. When navigating to a history page with query parameters, the page automatically initializes its filters with the provided values.
## Executions Page

**Path**: `/executions`

### Supported Query Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `action_ref` | Filter by action reference | `?action_ref=core.echo` |
| `rule_ref` | Filter by rule reference | `?rule_ref=core.on_timer` |
| `trigger_ref` | Filter by trigger reference | `?trigger_ref=core.webhook` |
| `pack_name` | Filter by pack name | `?pack_name=core` |
| `executor` | Filter by executor ID | `?executor=1` |
| `status` | Filter by execution status | `?status=running` |

### Valid Status Values

- `requested`
- `scheduling`
- `scheduled`
- `running`
- `completed`
- `failed`
- `canceling`
- `cancelled`
- `timeout`
- `abandoned`

### Examples

```
# Filter by action
http://localhost:3000/executions?action_ref=core.echo

# Filter by rule and status
http://localhost:3000/executions?rule_ref=core.on_timer&status=completed

# Multiple filters
http://localhost:3000/executions?pack_name=core&status=running&action_ref=core.echo
```
## Events Page

**Path**: `/events`

### Supported Query Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `trigger_ref` | Filter by trigger reference | `?trigger_ref=core.webhook` |

### Examples

```
# Filter by trigger
http://localhost:3000/events?trigger_ref=core.webhook

# Filter by timer trigger
http://localhost:3000/events?trigger_ref=core.timer
```
## Enforcements Page

**Path**: `/enforcements`

### Supported Query Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `rule_ref` | Filter by rule reference | `?rule_ref=core.on_timer` |
| `trigger_ref` | Filter by trigger reference | `?trigger_ref=core.webhook` |
| `event` | Filter by event ID | `?event=123` |
| `status` | Filter by enforcement status | `?status=processed` |

### Valid Status Values

- `created`
- `processed`
- `disabled`

### Examples

```
# Filter by rule
http://localhost:3000/enforcements?rule_ref=core.on_timer

# Filter by event
http://localhost:3000/enforcements?event=123

# Multiple filters
http://localhost:3000/enforcements?rule_ref=core.on_timer&status=processed
```
## Usage Patterns

### Deep Linking from Detail Pages

When viewing a specific execution, event, or enforcement detail page, you can click on related entities (actions, rules, triggers) to navigate to the history page with the appropriate filter pre-applied.

### Sharing Filtered Views

You can share URLs with query parameters to point others at specific filtered data sets:

```
# Share a view of all failed executions for a specific action
http://localhost:3000/executions?action_ref=core.http_request&status=failed

# Share enforcements for a specific rule
http://localhost:3000/enforcements?rule_ref=my_pack.important_rule
```

### Bookmarking

Save frequently used filter combinations as browser bookmarks for quick access.
## Implementation Notes

- Query parameters are read on page load and initialize the filter state
- Changing filters in the UI does **not** update the URL (stateless filtering)
- Multiple query parameters can be combined
- Invalid parameter values are ignored (filters default to empty)
- Parameter names match the API field names for consistency
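The load-time behavior in these notes can be sketched for the executions page. Illustrative Python only — the web UI does this client-side, and `initial_filters` is a hypothetical helper; the parameter names and status set come from the tables above:

```python
from urllib.parse import urlsplit, parse_qs

EXECUTION_STATUSES = {
    "requested", "scheduling", "scheduled", "running", "completed",
    "failed", "canceling", "cancelled", "timeout", "abandoned",
}

def initial_filters(url: str) -> dict:
    """Derive the initial executions-page filter state from a URL."""
    query = parse_qs(urlsplit(url).query)
    filters = {}
    for key in ("action_ref", "rule_ref", "trigger_ref", "pack_name", "executor", "status"):
        value = query.get(key, [""])[0]  # only one value per parameter is honored
        if key == "status" and value not in EXECUTION_STATUSES:
            value = ""  # invalid values are silently ignored
        if value:
            filters[key] = value
    return filters

print(initial_filters("http://localhost:3000/executions?action_ref=core.echo&status=Bogus"))
```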