more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions


@@ -87,32 +87,47 @@ Execution Requested → Scheduler → Worker Selection → Execution Scheduled
### 3. Execution Manager
**Purpose**: Orchestrates execution workflows and handles lifecycle events.
**Responsibilities**:
- Listens for `execution.status.changed` notifications from workers
- **Does NOT update execution state** (worker owns state after scheduling)
- Handles execution completion orchestration (triggering child executions)
- Manages workflow executions (parent-child relationships)
- Coordinates workflow state transitions
**Ownership Model**:
- **Executor owns**: Requested → Scheduling → Scheduled (updates DB)
- Includes pre-handoff cancellations/failures (before `execution.scheduled` is published)
- **Worker owns**: Running → Completed/Failed/Cancelled (updates DB)
- Includes post-handoff cancellations/failures (after receiving `execution.scheduled`)
- **Handoff Point**: When `execution.scheduled` message is **published** to worker
- Before publish: Executor owns and updates state
- After publish: Worker owns and updates state
**Message Flow**:
```
Worker Status Update → Execution Manager → Orchestration Logic (Read-Only)
                                         → Trigger Child Executions
```
**Status Lifecycle**:
```
Requested → Scheduling → Scheduled → [HANDOFF: execution.scheduled published] → Running → Completed/Failed/Cancelled
└───────── Executor Updates ───────┘                                            └───────── Worker Updates ──────────┘
   (includes pre-handoff Cancelled)                                                (includes post-handoff
   └→ Child Executions (workflows)                                                  Cancelled/Timeout/Abandoned)
```
**Key Implementation Details**:
- Parses status strings to typed enums for type safety
- Receives status change notifications for orchestration purposes only
- Does not update execution state after handoff to worker
- Handles workflow orchestration (parent-child execution chaining)
- Only triggers child executions on successful parent completion
- Read-only access to execution records for orchestration logic
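The first detail above — parsing status strings into typed enums — can be sketched with `FromStr`; the enum variants and lowercase wire format here are assumptions for illustration, not the actual crate definitions:

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl FromStr for ExecutionStatus {
    type Err = String;

    // Map the lowercase wire/database representation to the typed enum,
    // rejecting anything unknown instead of silently passing strings around.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "requested" => Ok(Self::Requested),
            "scheduling" => Ok(Self::Scheduling),
            "scheduled" => Ok(Self::Scheduled),
            "running" => Ok(Self::Running),
            "completed" => Ok(Self::Completed),
            "failed" => Ok(Self::Failed),
            "cancelled" => Ok(Self::Cancelled),
            other => Err(format!("unknown execution status: {other}")),
        }
    }
}

fn main() {
    assert_eq!("running".parse::<ExecutionStatus>(), Ok(ExecutionStatus::Running));
    assert!("bogus".parse::<ExecutionStatus>().is_err());
}
```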
## Message Queue Integration
@@ -123,12 +138,14 @@ The Executor consumes and produces several message types:
**Consumed**:
- `enforcement.created` - New enforcement from triggered rules
- `execution.requested` - Execution scheduling requests
- `execution.status.changed` - Status change notifications from workers (for orchestration)
- `execution.completed` - Completion notifications from workers (for queue management)
**Published**:
- `execution.requested` - To scheduler (from enforcement processor)
- `execution.scheduled` - To workers (from scheduler) **← OWNERSHIP HANDOFF**
**Note**: The executor does NOT publish `execution.completed` messages. This is the worker's responsibility as the authoritative source of execution state after scheduling.
### Message Envelope Structure
@@ -186,11 +203,34 @@ use attune_common::repositories::{
};
```
### Database Update Ownership
**Executor updates execution state** from creation through handoff:
- Creates execution records (`Requested` status)
- Updates status during scheduling (`Scheduling` → `Scheduled`)
- Publishes `execution.scheduled` message to worker **← HANDOFF POINT**
- **Handles cancellations/failures BEFORE handoff** (before message is published)
- Example: User cancels execution while queued by concurrency policy
- Executor updates to `Cancelled`, worker never receives message
**Worker updates execution state** after receiving handoff:
- Receives `execution.scheduled` message (takes ownership)
- Updates status when execution starts (`Running`)
- Updates status when execution completes (`Completed`, `Failed`, etc.)
- **Handles cancellations/failures AFTER handoff** (after receiving message)
- Updates result data and artifacts
- Worker only owns executions it has received
**Executor reads execution state** for orchestration after handoff:
- Receives status change notifications from workers
- Reads execution records to trigger workflow children
- Does NOT update execution state after publishing `execution.scheduled`
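A minimal sketch of how this ownership boundary could be enforced at the SQL level; the function name and the exact status guard are hypothetical, using the lowercase status values seen elsewhere in this document:

```rust
/// Hypothetical guard for an executor-side cancellation: the UPDATE only
/// matches while the executor still owns the row (pre-handoff statuses),
/// so a zero-row update means the worker has already taken ownership.
fn executor_cancel_sql() -> &'static str {
    "UPDATE execution \
     SET status = 'cancelled', updated = NOW() \
     WHERE id = $1 AND status IN ('requested', 'scheduling')"
}

fn main() {
    let sql = executor_cancel_sql();
    // The guard lists only pre-handoff statuses, never worker-owned ones.
    assert!(sql.contains("status IN ('requested', 'scheduling')"));
    assert!(!sql.contains("running"));
}
```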
### Transaction Support
Future implementations will use database transactions for multi-step operations:
- Creating execution + publishing message (atomic)
- Status update + completion handling (atomic)
- Enforcement processing + execution creation (atomic)
## Configuration


@@ -0,0 +1,557 @@
# Worker Availability Handling
**Status**: Implementation Gap Identified
**Priority**: High
**Date**: 2026-02-09
## Problem Statement
When workers are stopped or become unavailable, the executor continues attempting to schedule executions to them, resulting in:
1. **Stuck executions**: Executions remain in `SCHEDULING` or `SCHEDULED` status indefinitely
2. **Queue buildup**: Messages accumulate in worker-specific RabbitMQ queues
3. **No failure notification**: Users don't know their executions are stuck
4. **Resource waste**: System resources consumed by queued messages and database records
## Current Architecture
### Heartbeat Mechanism
Workers send heartbeat updates to the database periodically (default: 30 seconds).
```rust
// From crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    // Worker is fresh if heartbeat < 90 seconds old
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
    );
    // ...
}
```
### Scheduling Flow
```
Execution Created (REQUESTED)
        ↓
Scheduler receives message
        ↓
Find compatible worker with fresh heartbeat
        ↓
Update execution to SCHEDULED
        ↓
Publish message to worker-specific queue
        ↓
Worker consumes and executes
```
### Failure Points
1. **Worker stops after heartbeat**: Worker has fresh heartbeat but is actually down
2. **Worker crashes**: No graceful shutdown, heartbeat appears fresh temporarily
3. **Network partition**: Worker isolated but appears healthy
4. **Queue accumulation**: Messages sit in worker-specific queues indefinitely
## Current Mitigations (Insufficient)
### 1. Heartbeat Staleness Check
```rust
fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
    // Filter by active workers
    let active_workers: Vec<_> = workers
        .into_iter()
        .filter(|w| w.status == WorkerStatus::Active)
        .collect();

    // Filter by heartbeat freshness
    let fresh_workers: Vec<_> = active_workers
        .into_iter()
        .filter(|w| is_worker_heartbeat_fresh(w))
        .collect();

    if fresh_workers.is_empty() {
        return Err(anyhow!("No workers with fresh heartbeats"));
    }

    // Select first available worker
    Ok(fresh_workers.into_iter().next().unwrap())
}
```
**Gap**: Workers can stop within the 90-second staleness window.
### 2. Message Requeue on Error
```rust
// From crates/common/src/mq/consumer.rs
match handler(envelope.clone()).await {
    Ok(_) => {
        // Success path: acknowledge so the message is removed
        channel.basic_ack(delivery_tag, BasicAckOptions::default()).await?;
    }
    Err(e) => {
        let requeue = e.is_retriable();
        channel.basic_nack(delivery_tag, BasicNackOptions {
            requeue,
            multiple: false,
        }).await?;
    }
}
```
**Gap**: Only requeues on retriable errors (connection/timeout), not worker unavailability.
### 3. Message TTL Configuration
```rust
// From crates/common/src/config.rs
pub struct MessageQueueConfig {
    #[serde(default = "default_message_ttl")]
    pub message_ttl: u64,
}

fn default_message_ttl() -> u64 {
    3600 // 1 hour
}
```
**Gap**: TTL not currently applied to worker queues, and 1 hour is too long.
## Proposed Solutions
### Solution 1: Execution Timeout Mechanism (HIGH PRIORITY)
Add a background task that monitors scheduled executions and fails them if they don't start within a timeout.
**Implementation:**
```rust
// crates/executor/src/execution_timeout_monitor.rs
pub struct ExecutionTimeoutMonitor {
    pool: PgPool,
    publisher: Arc<Publisher>,
    check_interval: Duration,
    scheduled_timeout: Duration,
}

impl ExecutionTimeoutMonitor {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);
        loop {
            interval.tick().await;
            if let Err(e) = self.check_stale_executions().await {
                error!("Error checking stale executions: {}", e);
            }
        }
    }

    async fn check_stale_executions(&self) -> Result<()> {
        let cutoff = Utc::now() - chrono::Duration::from_std(self.scheduled_timeout)?;

        // Find executions stuck in SCHEDULED status
        let stale_executions = sqlx::query_as::<_, Execution>(
            "SELECT * FROM execution
             WHERE status = 'scheduled'
               AND updated < $1"
        )
        .bind(cutoff)
        .fetch_all(&self.pool)
        .await?;

        for execution in stale_executions {
            warn!(
                "Execution {} has been scheduled for too long, marking as failed",
                execution.id
            );
            self.fail_execution(
                execution.id,
                "Execution timeout: worker did not pick up task within timeout"
            ).await?;
        }
        Ok(())
    }

    async fn fail_execution(&self, execution_id: i64, reason: &str) -> Result<()> {
        // Update execution status
        sqlx::query(
            "UPDATE execution
             SET status = 'failed',
                 result = $2,
                 updated = NOW()
             WHERE id = $1"
        )
        .bind(execution_id)
        .bind(serde_json::json!({
            "error": reason,
            "failed_by": "execution_timeout_monitor"
        }))
        .execute(&self.pool)
        .await?;

        // Publish completion notification
        let payload = ExecutionCompletedPayload {
            execution_id,
            status: ExecutionStatus::Failed,
            result: Some(serde_json::json!({"error": reason})),
        };
        self.publisher
            .publish_envelope(
                MessageType::ExecutionCompleted,
                payload,
                "attune.executions",
            )
            .await?;
        Ok(())
    }
}
```
**Configuration:**
```yaml
# config.yaml
executor:
  scheduled_timeout: 300       # 5 minutes (fail if not running within 5 min)
  timeout_check_interval: 60   # Check every minute
```
### Solution 2: Worker Queue TTL and DLQ (MEDIUM PRIORITY)
Apply message TTL to worker-specific queues with dead letter exchange.
**Implementation:**
```rust
// When declaring worker-specific queues
let mut queue_args = FieldTable::default();

// Set message TTL (5 minutes)
queue_args.insert(
    "x-message-ttl".into(),
    AMQPValue::LongInt(300_000) // 5 minutes in milliseconds
);

// Set dead letter exchange
queue_args.insert(
    "x-dead-letter-exchange".into(),
    AMQPValue::LongString("attune.executions.dlx".into())
);

channel.queue_declare(
    &format!("attune.execution.worker.{}", worker_id),
    QueueDeclareOptions {
        durable: true,
        ..Default::default()
    },
    queue_args,
).await?;
```
**Dead Letter Handler:**
```rust
// crates/executor/src/dead_letter_handler.rs
pub struct DeadLetterHandler {
    pool: PgPool,
    consumer: Arc<Consumer>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<()> {
        // Clone the pool before the closure so the handler owns its handle
        let pool = self.pool.clone();
        self.consumer
            .consume_with_handler(move |envelope: MessageEnvelope<ExecutionScheduledPayload>| {
                let pool = pool.clone();
                async move {
                    warn!("Received dead letter for execution {}", envelope.payload.execution_id);

                    // Mark execution as failed
                    sqlx::query(
                        "UPDATE execution
                         SET status = 'failed',
                             result = $2,
                             updated = NOW()
                         WHERE id = $1 AND status = 'scheduled'"
                    )
                    .bind(envelope.payload.execution_id)
                    .bind(serde_json::json!({
                        "error": "Message expired in worker queue (worker unavailable)",
                        "failed_by": "dead_letter_handler"
                    }))
                    .execute(&pool)
                    .await?;
                    Ok(())
                }
            })
            .await
    }
}
```
### Solution 3: Worker Health Probes (LOW PRIORITY)
Add active health checking instead of relying solely on heartbeats.
**Implementation:**
```rust
// crates/executor/src/worker_health_checker.rs
pub struct WorkerHealthChecker {
    pool: PgPool,
    check_interval: Duration,
}

impl WorkerHealthChecker {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);
        loop {
            interval.tick().await;
            if let Err(e) = self.check_worker_health().await {
                error!("Error checking worker health: {}", e);
            }
        }
    }

    async fn check_worker_health(&self) -> Result<()> {
        let workers = WorkerRepository::find_action_workers(&self.pool).await?;
        for worker in workers {
            // Skip if heartbeat is very stale (worker is definitely down)
            if !is_heartbeat_recent(&worker) {
                continue;
            }

            // Attempt health check
            match self.ping_worker(&worker).await {
                Ok(true) => {
                    // Worker is healthy, ensure status is Active
                    if worker.status != Some(WorkerStatus::Active) {
                        self.update_worker_status(worker.id, WorkerStatus::Active).await?;
                    }
                }
                Ok(false) | Err(_) => {
                    // Worker is unhealthy, mark as inactive
                    warn!("Worker {} failed health check", worker.name);
                    self.update_worker_status(worker.id, WorkerStatus::Inactive).await?;
                }
            }
        }
        Ok(())
    }

    async fn ping_worker(&self, worker: &Worker) -> Result<bool> {
        // TODO: Implement health endpoint on worker
        // For now, check if worker's queue is being consumed
        Ok(true)
    }
}
```
### Solution 4: Graceful Worker Shutdown (MEDIUM PRIORITY)
Ensure workers mark themselves as inactive before shutdown.
**Implementation:**
```rust
// In worker service shutdown handler
impl WorkerService {
    pub async fn shutdown(&self) -> Result<()> {
        info!("Worker shutting down gracefully...");

        // Mark worker as inactive
        sqlx::query(
            "UPDATE worker SET status = 'inactive', updated = NOW() WHERE id = $1"
        )
        .bind(self.worker_id)
        .execute(&self.pool)
        .await?;

        // Stop accepting new tasks
        self.stop_consuming().await?;

        // Wait for in-flight tasks to complete (with timeout)
        let timeout = Duration::from_secs(30);
        tokio::time::timeout(timeout, self.wait_for_completion()).await?;

        info!("Worker shutdown complete");
        Ok(())
    }
}
```
**Docker Signal Handling:**
```yaml
# docker-compose.yaml
services:
  worker-shell:
    stop_grace_period: 45s  # Give worker time to finish tasks
```
## Implementation Priority
### Phase 1: Immediate (Week 1)
1. **Execution Timeout Monitor** - Prevents stuck executions
2. **Graceful Shutdown** - Marks workers inactive on stop
### Phase 2: Short-term (Week 2)
3. **Worker Queue TTL + DLQ** - Prevents message buildup
4. **Dead Letter Handler** - Fails expired executions
### Phase 3: Long-term (Month 1)
5. **Worker Health Probes** - Active availability verification
6. **Retry Logic** - Reschedule to different worker on failure
## Configuration
### Recommended Timeouts
```yaml
executor:
  # How long an execution can stay SCHEDULED before failing
  scheduled_timeout: 300        # 5 minutes

  # How often to check for stale executions
  timeout_check_interval: 60    # 1 minute

  # Message TTL in worker queues
  worker_queue_ttl: 300         # 5 minutes (match scheduled_timeout)

  # Worker health check interval
  health_check_interval: 30     # 30 seconds

worker:
  # How often to send heartbeats
  heartbeat_interval: 10        # 10 seconds (more frequent)

  # Grace period for shutdown
  shutdown_timeout: 30          # 30 seconds
```
### Staleness Calculation
```
Heartbeat Staleness Threshold = heartbeat_interval * 3
= 10 * 3 = 30 seconds
This means:
- Worker sends heartbeat every 10s
- If heartbeat is > 30s old, worker is considered stale
- Reduces window where stopped worker appears healthy from 90s to 30s
```
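A small sketch confirming the arithmetic above, using the proposed 10-second interval (not the current 30-second default):

```rust
use std::time::Duration;

// Proposed values from the configuration section above.
const HEARTBEAT_INTERVAL_SECS: u64 = 10;
const STALENESS_MULTIPLIER: u64 = 3;

/// Heartbeat staleness threshold = heartbeat_interval * 3.
fn staleness_threshold() -> Duration {
    Duration::from_secs(HEARTBEAT_INTERVAL_SECS * STALENESS_MULTIPLIER)
}

/// A worker whose last heartbeat is older than the threshold is stale.
fn is_stale(heartbeat_age: Duration) -> bool {
    heartbeat_age > staleness_threshold()
}

fn main() {
    // 10s interval * 3 = 30s window (down from 90s with the old defaults).
    assert_eq!(staleness_threshold(), Duration::from_secs(30));
    assert!(!is_stale(Duration::from_secs(25))); // within window → fresh
    assert!(is_stale(Duration::from_secs(31)));  // past window → stale
}
```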
## Monitoring and Observability
### Metrics to Track
1. **Execution timeout rate**: Number of executions failed due to timeout
2. **Worker downtime**: Time between last heartbeat and status change
3. **Dead letter queue depth**: Number of expired messages
4. **Average scheduling latency**: Time from REQUESTED to RUNNING
### Alerts
```yaml
alerts:
  - name: high_execution_timeout_rate
    condition: execution_timeouts > 10 per minute
    severity: warning

  - name: no_active_workers
    condition: active_workers == 0
    severity: critical

  - name: dlq_buildup
    condition: dlq_depth > 100
    severity: warning

  - name: stale_executions
    condition: scheduled_executions_older_than_5min > 0
    severity: warning
```
## Testing
### Test Scenarios
1. **Worker stops mid-execution**: Should timeout and fail
2. **Worker never picks up task**: Should timeout after 5 minutes
3. **All workers down**: Should immediately fail with "no workers available"
4. **Worker stops gracefully**: Should mark inactive and not receive new tasks
5. **Message expires in queue**: Should be moved to DLQ and execution failed
### Integration Test Example
```rust
#[tokio::test]
async fn test_execution_timeout_on_worker_down() {
    let pool = setup_test_db().await;
    let mq = setup_test_mq().await;

    // Create worker and execution
    let worker = create_test_worker(&pool).await;
    let execution = create_test_execution(&pool).await;

    // Schedule execution to worker
    schedule_execution(&pool, &mq, execution.id, worker.id).await;

    // Stop worker (simulate crash - no graceful shutdown)
    stop_worker(worker.id).await;

    // Wait for timeout
    tokio::time::sleep(Duration::from_secs(310)).await;

    // Verify execution is marked as failed
    let execution = get_execution(&pool, execution.id).await;
    assert_eq!(execution.status, ExecutionStatus::Failed);
    assert!(execution.result.unwrap()["error"]
        .as_str()
        .unwrap()
        .contains("timeout"));
}
```
## Migration Path
### Step 1: Add Monitoring (No Breaking Changes)
- Deploy execution timeout monitor
- Monitor logs for timeout events
- Tune timeout values based on actual workload
### Step 2: Add DLQ (Requires Queue Reconfiguration)
- Create dead letter exchange
- Update queue declarations with TTL and DLX
- Deploy dead letter handler
- Monitor DLQ depth
### Step 3: Graceful Shutdown (Worker Update)
- Add shutdown handler to worker
- Update Docker Compose stop_grace_period
- Test worker restarts
### Step 4: Health Probes (Future Enhancement)
- Add health endpoint to worker
- Deploy health checker service
- Transition from heartbeat-only to active probing
## Related Documentation
- [Queue Architecture](./queue-architecture.md)
- [Worker Service](./worker-service.md)
- [Executor Service](./executor-service.md)
- [RabbitMQ Queues Quick Reference](../docs/QUICKREF-rabbitmq-queues.md)


@@ -0,0 +1,493 @@
# Worker Queue TTL and Dead Letter Queue (Phase 2)
## Overview
Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.
## Architecture
### Message Flow
```
┌─────────────┐
│  Executor   │
│  Scheduler  │
└──────┬──────┘
       │ Publishes ExecutionRequested
       │ routing_key: execution.dispatch.worker.{id}
       ▼
┌──────────────────────────────────┐
│  worker.{id}.executions queue    │
│                                  │
│  Properties:                     │
│  - x-message-ttl: 300000ms (5m)  │
│  - x-dead-letter-exchange: dlx   │
└──────┬───────────────────┬───────┘
       │                   │
       │ Worker consumes   │ TTL expires
       │ (normal flow)     │ (worker unavailable)
       │                   │
       ▼                   ▼
┌──────────────┐   ┌──────────────────┐
│   Worker     │   │   attune.dlx     │
│   Service    │   │  (Dead Letter    │
│              │   │   Exchange)      │
└──────────────┘   └────────┬─────────┘
                            │ Routes to DLQ
                            ▼
                   ┌──────────────────────┐
                   │   attune.dlx.queue   │
                   │  (Dead Letter Queue) │
                   └────────┬─────────────┘
                            │ Consumes
                            ▼
                   ┌──────────────────────┐
                   │  Dead Letter Handler │
                   │  (in Executor)       │
                   │                      │
                   │  - Identifies exec   │
                   │  - Marks as FAILED   │
                   │  - Logs failure      │
                   └──────────────────────┘
```
### Components
#### 1. Worker Queue TTL
**Configuration:**
- Default: 5 minutes (300,000 milliseconds)
- Configurable via `rabbitmq.worker_queue_ttl_ms`
**Implementation:**
- Applied during queue declaration in `Connection::setup_worker_infrastructure()`
- Uses RabbitMQ's `x-message-ttl` queue argument
- Only applies to worker-specific queues (`worker.{id}.executions`)
**Behavior:**
- When a message remains in the queue longer than TTL
- RabbitMQ automatically moves it to the configured dead letter exchange
- Original message properties and headers are preserved
- Includes `x-death` header with expiration details
#### 2. Dead Letter Exchange (DLX)
**Configuration:**
- Exchange name: `attune.dlx`
- Type: `direct`
- Durable: `true`
**Setup:**
- Created in `Connection::setup_common_infrastructure()`
- Bound to dead letter queue with routing key `#` (all messages)
- Shared across all services
#### 3. Dead Letter Queue
**Configuration:**
- Queue name: `attune.dlx.queue`
- Durable: `true`
- TTL: 24 hours (configurable via `rabbitmq.dead_letter.ttl_ms`)
**Properties:**
- Retains messages for debugging and analysis
- Messages auto-expire after retention period
- No DLX on the DLQ itself (prevents infinite loops)
#### 4. Dead Letter Handler
**Location:** `crates/executor/src/dead_letter_handler.rs`
**Responsibilities:**
1. Consume messages from `attune.dlx.queue`
2. Deserialize message envelope
3. Extract execution ID from payload
4. Verify execution is in non-terminal state
5. Update execution to FAILED status
6. Add descriptive error information
7. Acknowledge message (remove from DLQ)
**Error Handling:**
- Invalid messages: Acknowledged and discarded
- Missing executions: Acknowledged (already processed)
- Terminal state executions: Acknowledged (no action needed)
- Database errors: Nacked with requeue (retry later)
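The acknowledge/discard rules above amount to a small decision table, sketched here with illustrative type names (not the actual crate types):

```rust
#[derive(Debug, PartialEq)]
enum DlqOutcome {
    Ack,         // remove the message from the DLQ
    NackRequeue, // leave it for a later retry
}

enum DlqError {
    InvalidMessage,   // could not deserialize the envelope
    ExecutionMissing, // execution not found (already processed)
    AlreadyTerminal,  // execution already Completed/Failed/Cancelled
    Database,         // transient database failure
}

/// Map a handling result to the queue acknowledgement described above:
/// only transient database errors are worth retrying.
fn acknowledge(result: Result<(), DlqError>) -> DlqOutcome {
    match result {
        Ok(()) => DlqOutcome::Ack,
        Err(DlqError::InvalidMessage)
        | Err(DlqError::ExecutionMissing)
        | Err(DlqError::AlreadyTerminal) => DlqOutcome::Ack, // discard
        Err(DlqError::Database) => DlqOutcome::NackRequeue,  // retry later
    }
}

fn main() {
    assert_eq!(acknowledge(Ok(())), DlqOutcome::Ack);
    assert_eq!(acknowledge(Err(DlqError::AlreadyTerminal)), DlqOutcome::Ack);
    assert_eq!(acknowledge(Err(DlqError::Database)), DlqOutcome::NackRequeue);
}
```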
## Configuration
### RabbitMQ Configuration Structure
```yaml
message_queue:
  rabbitmq:
    # Worker queue TTL - how long messages wait before DLX
    worker_queue_ttl_ms: 300000   # 5 minutes (default)

    # Dead letter configuration
    dead_letter:
      enabled: true               # Enable DLQ system
      exchange: attune.dlx        # DLX name
      ttl_ms: 86400000            # DLQ retention (24 hours)
```
### Environment-Specific Settings
#### Development (`config.development.yaml`)
```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000   # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000            # 24 hours
```
#### Production (`config.docker.yaml`)
```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000   # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000            # 24 hours
```
### Tuning Guidelines
**Worker Queue TTL (`worker_queue_ttl_ms`):**
- **Too short:** Legitimate slow workers may have executions failed prematurely
- **Too long:** Unavailable workers cause delayed failure detection
- **Recommendation:** 2-5x typical execution time, minimum 2 minutes
- **Default (5 min):** Good balance for most workloads
**DLQ Retention (`dead_letter.ttl_ms`):**
- Purpose: Debugging and forensics
- **Too short:** May lose data before analysis
- **Too long:** Accumulates stale data
- **Recommendation:** 24-48 hours in production
- **Default (24 hours):** Adequate for most troubleshooting
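The worker-queue TTL rule can be sketched as a helper; the 3x multiplier is one point inside the recommended 2-5x band, chosen here as an assumption:

```rust
/// Sketch of the tuning rule above: some multiple (here 3x) of the typical
/// execution time, clamped to the 2-minute minimum, returned in milliseconds.
fn recommended_worker_queue_ttl_ms(typical_execution_secs: u64) -> u64 {
    let ttl_secs = (typical_execution_secs * 3).max(120);
    ttl_secs * 1000
}

fn main() {
    // A ~100s action lands at 300000 ms, matching the 5-minute default.
    assert_eq!(recommended_worker_queue_ttl_ms(100), 300_000);
    // Very short actions are clamped to the 2-minute floor.
    assert_eq!(recommended_worker_queue_ttl_ms(10), 120_000);
}
```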
## Code Structure
### Queue Declaration with TTL
```rust
// crates/common/src/mq/connection.rs
pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<u64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();

    // Configure DLX
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );

    // Configure TTL if specified (AMQP long-int is a 32-bit value)
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            AMQPValue::LongInt(ttl as i32),
        );
    }

    // Declare queue with arguments
    channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}
```
### Dead Letter Handler
```rust
// crates/executor/src/dead_letter_handler.rs
pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<Mutex<bool>>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        // Clone the pool handle before the closure so it can be moved in
        let pool = self.pool.clone();
        self.consumer
            .consume_with_handler(move |envelope: MessageEnvelope<Value>| {
                let pool = pool.clone();
                async move {
                    match envelope.message_type {
                        MessageType::ExecutionRequested => {
                            handle_execution_requested(&pool, &envelope).await
                        }
                        _ => {
                            // Unexpected message type - acknowledge and discard
                            Ok(())
                        }
                    }
                }
            })
            .await
    }
}

async fn handle_execution_requested(
    pool: &PgPool,
    envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
    // Extract execution ID
    let execution_id = envelope.payload.get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or_else(|| /* error */)?;

    // Fetch current state
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Only fail if in non-terminal state
    if !execution.status.is_terminal() {
        ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
            status: Some(ExecutionStatus::Failed),
            result: Some(json!({
                "error": "Worker queue TTL expired",
                "message": "Worker did not process execution within configured TTL",
            })),
            ended: Some(Some(Utc::now())),
            ..Default::default()
        }).await?;
    }
    Ok(())
}
```
## Integration with Executor Service
The dead letter handler is started automatically by the executor service if DLQ is enabled:
```rust
// crates/executor/src/service.rs
pub async fn start(&self) -> Result<()> {
    // ... other components ...

    // Start dead letter handler (if enabled)
    if self.inner.mq_config.rabbitmq.dead_letter.enabled {
        let dlq_name = format!("{}.queue",
            self.inner.mq_config.rabbitmq.dead_letter.exchange);
        let dlq_consumer = Consumer::new(
            &self.inner.mq_connection,
            create_dlq_consumer_config(&dlq_name, "executor.dlq"),
        ).await?;
        let dlq_handler = Arc::new(
            DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
        );
        handles.push(tokio::spawn(async move {
            dlq_handler.start().await
        }));
    }

    // ... wait for completion ...
}
```
## Operational Considerations
### Monitoring
**Key Metrics:**
- DLQ message rate (messages/sec entering DLQ)
- DLQ queue depth (current messages in DLQ)
- DLQ processing latency (time from DLX to handler)
- Failed execution count (executions failed via DLQ)
**Alerting Thresholds:**
- DLQ rate > 10/min: Workers may be unhealthy or TTL too aggressive
- DLQ depth > 100: Handler may be falling behind
- High failure rate: Systematic worker availability issues
### RabbitMQ Management
**View DLQ:**
```bash
# List messages in DLQ
rabbitmqadmin list queues name messages
# Get DLQ details
rabbitmqadmin show queue name=attune.dlx.queue
# Purge DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue
```
**View Dead Letters:**
```bash
# Get message from DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1
# Check message death history
# Look for x-death header in message properties
```
### Troubleshooting
#### High DLQ Rate
**Symptoms:** Many executions failing via DLQ
**Causes:**
1. Workers down or restarting frequently
2. Worker queue TTL too aggressive
3. Worker overloaded (not consuming fast enough)
4. Network issues between executor and workers
**Resolution:**
1. Check worker health and logs
2. Verify worker heartbeats in database
3. Consider increasing `worker_queue_ttl_ms`
4. Scale worker fleet if overloaded
#### DLQ Handler Not Processing
**Symptoms:** DLQ depth increasing, executions stuck
**Causes:**
1. Executor service not running
2. DLQ disabled in configuration
3. Database connection issues
4. Handler crashed or deadlocked
**Resolution:**
1. Check executor service logs
2. Verify `dead_letter.enabled = true`
3. Check database connectivity
4. Restart executor service if needed
#### Messages Not Reaching DLQ
**Symptoms:** Executions stuck, DLQ empty
**Causes:**
1. Worker queues not configured with DLX
2. DLX exchange not created
3. DLQ not bound to DLX
4. TTL not configured on worker queues
**Resolution:**
1. Restart services to recreate infrastructure
2. Verify RabbitMQ configuration
3. Check queue properties in RabbitMQ management UI
## Testing
### Unit Tests
```rust
#[tokio::test]
async fn test_expired_execution_handling() {
    let pool = setup_test_db().await;

    // Create execution in SCHEDULED state
    let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;

    // Simulate DLQ message
    let envelope = MessageEnvelope::new(
        MessageType::ExecutionRequested,
        json!({ "execution_id": execution.id }),
    );

    // Process message
    handle_execution_requested(&pool, &envelope).await.unwrap();

    // Verify execution failed
    let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
    assert_eq!(updated.status, ExecutionStatus::Failed);
    assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}
```
### Integration Tests
```bash
# 1. Start all services
docker compose up -d
# 2. Create execution targeting a stopped worker (worker_id 999 is non-existent)
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'
# 3. Wait for TTL expiration (5+ minutes)
sleep 330
# 4. Verify execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"
# 5. Check DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```
## Relationship to Other Phases
### Phase 1 (Completed)
- Execution timeout monitor: Handles executions stuck in SCHEDULED
- Graceful shutdown: Prevents new tasks from being routed to stopping workers
- Reduced heartbeat: Faster stale worker detection
**Interaction:** Phase 1 timeout monitor acts as a backstop if DLQ processing fails
### Phase 2 (Current)
- Worker queue TTL: Automatic message expiration
- Dead letter queue: Capture expired messages
- Dead letter handler: Process and fail expired executions
**Benefit:** More precise failure detection at the message queue level
### Phase 3 (Planned)
- Health probes: Proactive worker health checking
- Intelligent retry: Retry transient failures
- Load balancing: Distribute work across healthy workers
**Integration:** Phase 3 will use Phase 2 DLQ data to inform routing decisions
## Benefits
1. **Automatic Failure Detection:** No manual intervention needed for unavailable workers
2. **Precise Timing:** TTL provides exact failure window (vs polling-based Phase 1)
3. **Resource Efficiency:** Prevents message accumulation in worker queues
4. **Debugging Support:** DLQ retains messages for forensic analysis
5. **Graceful Degradation:** System continues functioning even with worker failures
## Limitations
1. **TTL Precision:** RabbitMQ TTL is approximate, not guaranteed to the millisecond
2. **Race Conditions:** Worker may start processing just as TTL expires (rare)
3. **DLQ Capacity:** Very high failure rates may overwhelm DLQ
4. **No Retry Logic:** Phase 2 always fails; Phase 3 will add intelligent retry
## Future Enhancements (Phase 3)
- **Conditional Retry:** Retry messages based on failure reason
- **Priority DLQ:** Prioritize critical execution failures
- **DLQ Analytics:** Aggregate statistics on failure patterns
- **Auto-scaling:** Scale workers based on DLQ rate
- **Custom TTL:** Per-action or per-execution TTL configuration
## References
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`


@@ -131,28 +131,38 @@ echo "Hello, $PARAM_NAME!"
### 4. Action Executor
**Purpose**: Orchestrate the complete execution flow for an action and own execution state after handoff.
**Execution Flow**:
```
1. Receive execution.scheduled message from executor
2. Load execution record from database
3. Update status to Running (owns state after handoff)
4. Load action definition by reference
5. Prepare execution context (parameters, env vars, timeout)
6. Select and execute in appropriate runtime
7. Capture results (stdout, stderr, return value)
8. Store artifacts (logs, results)
9. Update execution status (Completed/Failed) in database
10. Publish status change notifications
11. Publish completion notification for queue management
```
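The numbered steps above can be condensed into a sketch with in-memory stand-ins; none of these names are the real worker APIs, and the runtime call is a placeholder:

```rust
#[derive(Debug, PartialEq)]
enum Status { Scheduled, Running, Completed, Failed }

struct Execution {
    status: Status,
    result: Option<String>,
}

// Stand-in for the runtime registry executing the action (step 6).
fn run_action(params: &str) -> Result<String, String> {
    Ok(format!("echo: {params}"))
}

/// Steps 1-11 condensed: take ownership, run, record the outcome.
/// Real code would persist each transition and publish notifications.
fn handle_scheduled(mut exec: Execution, params: &str) -> Execution {
    exec.status = Status::Running; // step 3: worker owns state after handoff
    match run_action(params) {     // steps 5-6: execute in the runtime
        Ok(out) => {               // steps 7-9: capture results, persist
            exec.result = Some(out);
            exec.status = Status::Completed;
        }
        Err(err) => {
            exec.result = Some(err);
            exec.status = Status::Failed;
        }
    }
    exec // steps 10-11: notifications would be published here
}

fn main() {
    let exec = Execution { status: Status::Scheduled, result: None };
    let done = handle_scheduled(exec, "hello");
    assert_eq!(done.status, Status::Completed);
    assert_eq!(done.result, Some("echo: hello".to_string()));
}
```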
**Ownership Model**:
- **Worker owns execution state** after receiving `execution.scheduled`
- **Authoritative source** for all status updates: Running, Completed, Failed, Cancelled, etc.
- **Updates database directly** for all state changes
- **Publishes notifications** for orchestration and monitoring
**Responsibilities**:
- Coordinate execution lifecycle
- Load action and execution data from database
- **Update execution state in database** (after handoff from executor)
- Prepare execution context with parameters and environment
- Execute action via runtime registry
- Handle success and failure cases
- Store execution artifacts
- Publish status change notifications
**Key Implementation Details**:
- Parameters merged: action defaults + execution overrides
@@ -246,7 +256,10 @@ See `docs/secrets-management.md` for comprehensive documentation.
- Register worker in database
- Start heartbeat manager
- Consume execution messages from worker-specific queue
- **Own execution state** after receiving scheduled executions
- **Update execution status in database** (Running, Completed, Failed, etc.)
- Publish execution status change notifications
- Publish execution completion notifications
- Handle graceful shutdown
**Message Flow**:
@@ -407,8 +420,9 @@ pub struct ExecutionResult {
### Error Propagation
- Runtime errors captured in `ExecutionResult.error`
- **Worker updates** execution status to Failed in database (owns state)
- Error published in status change notification message
- Error published in completion notification message
- Artifacts still stored for failed executions
- Logs preserved for debugging