more internal polish, resilient workers
@@ -87,32 +87,47 @@ Execution Requested → Scheduler → Worker Selection → Execution Scheduled

### 3. Execution Manager

**Purpose**: Orchestrates execution workflows and handles lifecycle events.

**Responsibilities**:

- Listens for `execution.status.*` messages from workers
- **Does NOT update execution state** (worker owns state after scheduling)
- Handles execution completion orchestration (triggering child executions)
- Manages workflow executions (parent-child relationships)
- Coordinates workflow state transitions

**Ownership Model**:

- **Executor owns**: Requested → Scheduling → Scheduled (updates DB)
  - Includes pre-handoff cancellations/failures (before `execution.scheduled` is published)
- **Worker owns**: Running → Completed/Failed/Cancelled (updates DB)
  - Includes post-handoff cancellations/failures (after receiving `execution.scheduled`)
- **Handoff Point**: When `execution.scheduled` message is **published** to worker
  - Before publish: Executor owns and updates state
  - After publish: Worker owns and updates state
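The ownership rule above can be sketched as a pure function over the status and the handoff flag; the type and function names here are illustrative, not the crate's actual API:

```rust
// Hypothetical sketch of the ownership model; names are illustrative only.
#[derive(Debug, PartialEq)]
enum ExecutionStatus { Requested, Scheduling, Scheduled, Running, Completed, Failed, Cancelled }

#[derive(Debug, PartialEq)]
enum StateOwner { Executor, Worker }

/// `handoff_done` is true once `execution.scheduled` has been published.
fn state_owner(status: &ExecutionStatus, handoff_done: bool) -> StateOwner {
    use ExecutionStatus::*;
    match status {
        // Before publish, the executor owns all pre-handoff statuses,
        // including pre-handoff cancellations/failures.
        Requested | Scheduling | Scheduled if !handoff_done => StateOwner::Executor,
        // After publish, the worker is the authoritative writer.
        _ => StateOwner::Worker,
    }
}

fn main() {
    // Pre-handoff cancellation: executor still owns the record.
    assert_eq!(state_owner(&ExecutionStatus::Scheduled, false), StateOwner::Executor);
    // After publish, even a still-Scheduled execution belongs to the worker.
    assert_eq!(state_owner(&ExecutionStatus::Scheduled, true), StateOwner::Worker);
    assert_eq!(state_owner(&ExecutionStatus::Running, true), StateOwner::Worker);
}
```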
**Message Flow**:

```
Worker Status Update → Execution Manager → Orchestration Logic (Read-Only)
                                         → Trigger Child Executions
```

**Status Lifecycle**:

```
Requested → Scheduling → Scheduled → [HANDOFF: execution.scheduled published] → Running → Completed/Failed/Cancelled
    │                        │                                                     │
    └─── Executor Updates ───┘                                                     └─ Worker Updates
         (includes pre-handoff                                                        (includes post-handoff
          Cancelled)                                                                   Cancelled/Timeout/Abandoned)
    │
    └→ Child Executions (workflows)
```
**Key Implementation Details**:

- Parses status strings to typed enums for type safety
- Receives status change notifications for orchestration purposes only
- Does not update execution state after handoff to worker
- Handles workflow orchestration (parent-child execution chaining)
- Only triggers child executions on successful parent completion
- Read-only access to execution records for orchestration logic
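The first bullet ("parses status strings to typed enums") might look like the following; the variants follow the lifecycle above, but the exact names and string forms used in the crate may differ:

```rust
use std::str::FromStr;

// Illustrative sketch only; the real enum lives in the common crate.
#[derive(Debug, PartialEq)]
enum ExecutionStatus { Requested, Scheduling, Scheduled, Running, Completed, Failed, Cancelled }

impl FromStr for ExecutionStatus {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "requested" => Ok(Self::Requested),
            "scheduling" => Ok(Self::Scheduling),
            "scheduled" => Ok(Self::Scheduled),
            "running" => Ok(Self::Running),
            "completed" => Ok(Self::Completed),
            "failed" => Ok(Self::Failed),
            "cancelled" => Ok(Self::Cancelled),
            // Reject unknown strings instead of silently defaulting.
            other => Err(format!("unknown execution status: {other}")),
        }
    }
}

fn main() {
    assert_eq!("running".parse::<ExecutionStatus>(), Ok(ExecutionStatus::Running));
    assert!("bogus".parse::<ExecutionStatus>().is_err());
}
```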
## Message Queue Integration
@@ -123,12 +138,14 @@ The Executor consumes and produces several message types:

**Consumed**:

- `enforcement.created` - New enforcement from triggered rules
- `execution.requested` - Execution scheduling requests
- `execution.status.changed` - Status change notifications from workers (for orchestration)
- `execution.completed` - Completion notifications from workers (for queue management)

**Published**:

- `execution.requested` - To scheduler (from enforcement processor)
- `execution.scheduled` - To workers (from scheduler) **← OWNERSHIP HANDOFF**

**Note**: The executor does NOT publish `execution.completed` messages. This is the worker's responsibility as the authoritative source of execution state after scheduling.
### Message Envelope Structure
@@ -186,11 +203,34 @@ use attune_common::repositories::{
};
```
### Database Update Ownership

**Executor updates execution state** from creation through handoff:

- Creates execution records (`Requested` status)
- Updates status during scheduling (`Scheduling` → `Scheduled`)
- Publishes `execution.scheduled` message to worker **← HANDOFF POINT**
- **Handles cancellations/failures BEFORE handoff** (before message is published)
  - Example: User cancels execution while queued by concurrency policy
  - Executor updates to `Cancelled`, worker never receives message

**Worker updates execution state** after receiving handoff:

- Receives `execution.scheduled` message (takes ownership)
- Updates status when execution starts (`Running`)
- Updates status when execution completes (`Completed`, `Failed`, etc.)
- **Handles cancellations/failures AFTER handoff** (after receiving message)
- Updates result data and artifacts
- Worker only owns executions it has received

**Executor reads execution state** for orchestration after handoff:

- Receives status change notifications from workers
- Reads execution records to trigger workflow children
- Does NOT update execution state after publishing `execution.scheduled`

### Transaction Support

Future implementations will use database transactions for multi-step operations:

- Creating execution + publishing message (atomic)
- Status update + completion handling (atomic)
- Enforcement processing + execution creation (atomic)

## Configuration
557
docs/architecture/worker-availability-handling.md
Normal file
@@ -0,0 +1,557 @@

# Worker Availability Handling

**Status**: Implementation Gap Identified
**Priority**: High
**Date**: 2026-02-09

## Problem Statement

When workers are stopped or become unavailable, the executor continues attempting to schedule executions to them, resulting in:

1. **Stuck executions**: Executions remain in `SCHEDULING` or `SCHEDULED` status indefinitely
2. **Queue buildup**: Messages accumulate in worker-specific RabbitMQ queues
3. **No failure notification**: Users don't know their executions are stuck
4. **Resource waste**: System resources consumed by queued messages and database records
## Current Architecture

### Heartbeat Mechanism

Workers send heartbeat updates to the database periodically (default: 30 seconds).

```rust
// From crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    // Worker is fresh if heartbeat < 90 seconds old
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
    );
    // ...
}
```
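The elided body of `is_worker_heartbeat_fresh` can be sketched with the standard library alone. The `last_heartbeat` field name is an assumption here, and the real implementation likely compares database timestamps rather than `SystemTime`:

```rust
use std::time::{Duration, SystemTime};

const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

// Hypothetical minimal worker record; the real `Worker` type is not shown here.
struct Worker {
    last_heartbeat: Option<SystemTime>,
}

fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    let max_age =
        Duration::from_secs(DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER);
    match worker.last_heartbeat {
        // Fresh when the heartbeat is younger than 90 seconds.
        Some(ts) => SystemTime::now()
            .duration_since(ts)
            .map(|age| age < max_age)
            // A heartbeat "from the future" (clock skew) counts as fresh.
            .unwrap_or(true),
        // A worker that has never heartbeated is never fresh.
        None => false,
    }
}

fn main() {
    let fresh = Worker { last_heartbeat: Some(SystemTime::now()) };
    let stale = Worker { last_heartbeat: Some(SystemTime::now() - Duration::from_secs(120)) };
    assert!(is_worker_heartbeat_fresh(&fresh));
    assert!(!is_worker_heartbeat_fresh(&stale));
}
```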
### Scheduling Flow

```
Execution Created (REQUESTED)
        ↓
Scheduler receives message
        ↓
Find compatible worker with fresh heartbeat
        ↓
Update execution to SCHEDULED
        ↓
Publish message to worker-specific queue
        ↓
Worker consumes and executes
```

### Failure Points

1. **Worker stops after heartbeat**: Worker has fresh heartbeat but is actually down
2. **Worker crashes**: No graceful shutdown, heartbeat appears fresh temporarily
3. **Network partition**: Worker isolated but appears healthy
4. **Queue accumulation**: Messages sit in worker-specific queues indefinitely
## Current Mitigations (Insufficient)

### 1. Heartbeat Staleness Check

```rust
fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
    // Filter by active workers
    let active_workers: Vec<_> = workers
        .into_iter()
        .filter(|w| w.status == WorkerStatus::Active)
        .collect();

    // Filter by heartbeat freshness
    let fresh_workers: Vec<_> = active_workers
        .into_iter()
        .filter(|w| is_worker_heartbeat_fresh(w))
        .collect();

    if fresh_workers.is_empty() {
        return Err(anyhow!("No workers with fresh heartbeats"));
    }

    // Select first available worker
    Ok(fresh_workers.into_iter().next().unwrap())
}
```

**Gap**: Workers can stop within the 90-second staleness window.
### 2. Message Requeue on Error

```rust
// From crates/common/src/mq/consumer.rs
match handler(envelope.clone()).await {
    Ok(()) => {
        channel.basic_ack(delivery_tag, BasicAckOptions::default()).await?;
    }
    Err(e) => {
        let requeue = e.is_retriable();
        channel.basic_nack(delivery_tag, BasicNackOptions {
            requeue,
            multiple: false,
        }).await?;
    }
}
```

**Gap**: Only requeues on retriable errors (connection/timeout), not worker unavailability.

### 3. Message TTL Configuration

```rust
// From crates/common/src/config.rs
pub struct MessageQueueConfig {
    #[serde(default = "default_message_ttl")]
    pub message_ttl: u64,
}

fn default_message_ttl() -> u64 {
    3600 // 1 hour
}
```

**Gap**: TTL is not currently applied to worker queues, and 1 hour is too long.
## Proposed Solutions

### Solution 1: Execution Timeout Mechanism (HIGH PRIORITY)

Add a background task that monitors scheduled executions and fails them if they don't start within a timeout.

**Implementation:**
```rust
// crates/executor/src/execution_timeout_monitor.rs

pub struct ExecutionTimeoutMonitor {
    pool: PgPool,
    publisher: Arc<Publisher>,
    check_interval: Duration,
    scheduled_timeout: Duration,
}

impl ExecutionTimeoutMonitor {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_stale_executions().await {
                error!("Error checking stale executions: {}", e);
            }
        }
    }

    async fn check_stale_executions(&self) -> Result<()> {
        let cutoff = Utc::now() - chrono::Duration::from_std(self.scheduled_timeout)?;

        // Find executions stuck in SCHEDULED status
        let stale_executions = sqlx::query_as::<_, Execution>(
            "SELECT * FROM execution
             WHERE status = 'scheduled'
               AND updated < $1"
        )
        .bind(cutoff)
        .fetch_all(&self.pool)
        .await?;

        for execution in stale_executions {
            warn!(
                "Execution {} has been scheduled for too long, marking as failed",
                execution.id
            );

            self.fail_execution(
                execution.id,
                "Execution timeout: worker did not pick up task within timeout"
            ).await?;
        }

        Ok(())
    }

    async fn fail_execution(&self, execution_id: i64, reason: &str) -> Result<()> {
        // Update execution status
        sqlx::query(
            "UPDATE execution
             SET status = 'failed',
                 result = $2,
                 updated = NOW()
             WHERE id = $1"
        )
        .bind(execution_id)
        .bind(serde_json::json!({
            "error": reason,
            "failed_by": "execution_timeout_monitor"
        }))
        .execute(&self.pool)
        .await?;

        // Publish completion notification
        let payload = ExecutionCompletedPayload {
            execution_id,
            status: ExecutionStatus::Failed,
            result: Some(serde_json::json!({"error": reason})),
        };

        self.publisher
            .publish_envelope(
                MessageType::ExecutionCompleted,
                payload,
                "attune.executions",
            )
            .await?;

        Ok(())
    }
}
```
**Configuration:**

```yaml
# config.yaml
executor:
  scheduled_timeout: 300      # 5 minutes (fail if not running within 5 min)
  timeout_check_interval: 60  # Check every minute
```
### Solution 2: Worker Queue TTL and DLQ (MEDIUM PRIORITY)

Apply message TTL to worker-specific queues with a dead letter exchange.

**Implementation:**
```rust
// When declaring worker-specific queues
let mut queue_args = FieldTable::default();

// Set message TTL (5 minutes)
queue_args.insert(
    "x-message-ttl".into(),
    AMQPValue::LongInt(300_000) // 5 minutes in milliseconds
);

// Set dead letter exchange
queue_args.insert(
    "x-dead-letter-exchange".into(),
    AMQPValue::LongString("attune.executions.dlx".into())
);

channel.queue_declare(
    &format!("attune.execution.worker.{}", worker_id),
    QueueDeclareOptions {
        durable: true,
        ..Default::default()
    },
    queue_args,
).await?;
```
**Dead Letter Handler:**

```rust
// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: PgPool,
    consumer: Arc<Consumer>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<()> {
        self.consumer
            .consume_with_handler(|envelope: MessageEnvelope<ExecutionScheduledPayload>| {
                let pool = self.pool.clone();

                async move {
                    warn!("Received dead letter for execution {}", envelope.payload.execution_id);

                    // Mark execution as failed
                    sqlx::query(
                        "UPDATE execution
                         SET status = 'failed',
                             result = $2,
                             updated = NOW()
                         WHERE id = $1 AND status = 'scheduled'"
                    )
                    .bind(envelope.payload.execution_id)
                    .bind(serde_json::json!({
                        "error": "Message expired in worker queue (worker unavailable)",
                        "failed_by": "dead_letter_handler"
                    }))
                    .execute(&pool)
                    .await?;

                    Ok(())
                }
            })
            .await
    }
}
```
### Solution 3: Worker Health Probes (LOW PRIORITY)

Add active health checking instead of relying solely on heartbeats.

**Implementation:**
```rust
// crates/executor/src/worker_health_checker.rs

pub struct WorkerHealthChecker {
    pool: PgPool,
    check_interval: Duration,
}

impl WorkerHealthChecker {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_worker_health().await {
                error!("Error checking worker health: {}", e);
            }
        }
    }

    async fn check_worker_health(&self) -> Result<()> {
        let workers = WorkerRepository::find_action_workers(&self.pool).await?;

        for worker in workers {
            // Skip if heartbeat is very stale (worker is definitely down)
            if !is_heartbeat_recent(&worker) {
                continue;
            }

            // Attempt health check
            match self.ping_worker(&worker).await {
                Ok(true) => {
                    // Worker is healthy, ensure status is Active
                    if worker.status != Some(WorkerStatus::Active) {
                        self.update_worker_status(worker.id, WorkerStatus::Active).await?;
                    }
                }
                Ok(false) | Err(_) => {
                    // Worker is unhealthy, mark as inactive
                    warn!("Worker {} failed health check", worker.name);
                    self.update_worker_status(worker.id, WorkerStatus::Inactive).await?;
                }
            }
        }

        Ok(())
    }

    async fn ping_worker(&self, worker: &Worker) -> Result<bool> {
        // TODO: Implement health endpoint on worker
        // For now, check if worker's queue is being consumed
        Ok(true)
    }
}
```
### Solution 4: Graceful Worker Shutdown (MEDIUM PRIORITY)

Ensure workers mark themselves as inactive before shutdown.

**Implementation:**
```rust
// In worker service shutdown handler
impl WorkerService {
    pub async fn shutdown(&self) -> Result<()> {
        info!("Worker shutting down gracefully...");

        // Mark worker as inactive
        sqlx::query(
            "UPDATE worker SET status = 'inactive', updated = NOW() WHERE id = $1"
        )
        .bind(self.worker_id)
        .execute(&self.pool)
        .await?;

        // Stop accepting new tasks
        self.stop_consuming().await?;

        // Wait for in-flight tasks to complete (with timeout)
        let timeout = Duration::from_secs(30);
        tokio::time::timeout(timeout, self.wait_for_completion()).await?;

        info!("Worker shutdown complete");
        Ok(())
    }
}
```
**Docker Signal Handling:**

```yaml
# docker-compose.yaml
services:
  worker-shell:
    stop_grace_period: 45s  # Give worker time to finish tasks
```
## Implementation Priority

### Phase 1: Immediate (Week 1)

1. **Execution Timeout Monitor** - Prevents stuck executions
2. **Graceful Shutdown** - Marks workers inactive on stop

### Phase 2: Short-term (Week 2)

3. **Worker Queue TTL + DLQ** - Prevents message buildup
4. **Dead Letter Handler** - Fails expired executions

### Phase 3: Long-term (Month 1)

5. **Worker Health Probes** - Active availability verification
6. **Retry Logic** - Reschedule to different worker on failure
## Configuration

### Recommended Timeouts

```yaml
executor:
  # How long an execution can stay SCHEDULED before failing
  scheduled_timeout: 300        # 5 minutes

  # How often to check for stale executions
  timeout_check_interval: 60    # 1 minute

  # Message TTL in worker queues
  worker_queue_ttl: 300         # 5 minutes (match scheduled_timeout)

  # Worker health check interval
  health_check_interval: 30     # 30 seconds

worker:
  # How often to send heartbeats
  heartbeat_interval: 10        # 10 seconds (more frequent)

  # Grace period for shutdown
  shutdown_timeout: 30          # 30 seconds
```
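The comments in this config imply a few cross-field invariants (the queue TTL matching `scheduled_timeout`, the staleness window fitting inside it, the check interval being shorter than the timeout). A hypothetical startup-time sanity check, with field names mirroring the YAML keys rather than the real config structs:

```rust
// Hypothetical validation sketch; the actual config types may differ.
struct TimeoutConfig {
    scheduled_timeout: u64,      // seconds
    timeout_check_interval: u64, // seconds
    worker_queue_ttl: u64,       // seconds
    heartbeat_interval: u64,     // seconds
}

fn validate(cfg: &TimeoutConfig) -> Result<(), String> {
    // The queue TTL should match scheduled_timeout, per the comment above.
    if cfg.worker_queue_ttl != cfg.scheduled_timeout {
        return Err("worker_queue_ttl should match scheduled_timeout".into());
    }
    // Checking less often than the timeout itself would delay detection.
    if cfg.timeout_check_interval >= cfg.scheduled_timeout {
        return Err("timeout_check_interval should be shorter than scheduled_timeout".into());
    }
    // Staleness threshold is heartbeat_interval * 3; it must fit well inside
    // the scheduling timeout or stale workers can still be selected.
    if cfg.heartbeat_interval * 3 >= cfg.scheduled_timeout {
        return Err("heartbeat staleness window should be well below scheduled_timeout".into());
    }
    Ok(())
}

fn main() {
    // The recommended values above pass the check.
    let cfg = TimeoutConfig {
        scheduled_timeout: 300,
        timeout_check_interval: 60,
        worker_queue_ttl: 300,
        heartbeat_interval: 10,
    };
    assert!(validate(&cfg).is_ok());
}
```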
### Staleness Calculation

```
Heartbeat Staleness Threshold = heartbeat_interval * 3
                              = 10 * 3 = 30 seconds

This means:
- Worker sends heartbeat every 10s
- If heartbeat is > 30s old, worker is considered stale
- Reduces window where stopped worker appears healthy from 90s to 30s
```
## Monitoring and Observability

### Metrics to Track

1. **Execution timeout rate**: Number of executions failed due to timeout
2. **Worker downtime**: Time between last heartbeat and status change
3. **Dead letter queue depth**: Number of expired messages
4. **Average scheduling latency**: Time from REQUESTED to RUNNING
### Alerts

```yaml
alerts:
  - name: high_execution_timeout_rate
    condition: execution_timeouts > 10 per minute
    severity: warning

  - name: no_active_workers
    condition: active_workers == 0
    severity: critical

  - name: dlq_buildup
    condition: dlq_depth > 100
    severity: warning

  - name: stale_executions
    condition: scheduled_executions_older_than_5min > 0
    severity: warning
```
## Testing

### Test Scenarios

1. **Worker stops mid-execution**: Should timeout and fail
2. **Worker never picks up task**: Should timeout after 5 minutes
3. **All workers down**: Should immediately fail with "no workers available"
4. **Worker stops gracefully**: Should mark inactive and not receive new tasks
5. **Message expires in queue**: Should be moved to DLQ and execution failed

### Integration Test Example
```rust
#[tokio::test]
async fn test_execution_timeout_on_worker_down() {
    let pool = setup_test_db().await;
    let mq = setup_test_mq().await;

    // Create worker and execution
    let worker = create_test_worker(&pool).await;
    let execution = create_test_execution(&pool).await;

    // Schedule execution to worker
    schedule_execution(&pool, &mq, execution.id, worker.id).await;

    // Stop worker (simulate crash - no graceful shutdown)
    stop_worker(worker.id).await;

    // Wait for timeout
    tokio::time::sleep(Duration::from_secs(310)).await;

    // Verify execution is marked as failed
    let execution = get_execution(&pool, execution.id).await;
    assert_eq!(execution.status, ExecutionStatus::Failed);
    assert!(execution.result.unwrap()["error"]
        .as_str()
        .unwrap()
        .contains("timeout"));
}
```
## Migration Path

### Step 1: Add Monitoring (No Breaking Changes)

- Deploy execution timeout monitor
- Monitor logs for timeout events
- Tune timeout values based on actual workload

### Step 2: Add DLQ (Requires Queue Reconfiguration)

- Create dead letter exchange
- Update queue declarations with TTL and DLX
- Deploy dead letter handler
- Monitor DLQ depth

### Step 3: Graceful Shutdown (Worker Update)

- Add shutdown handler to worker
- Update Docker Compose stop_grace_period
- Test worker restarts

### Step 4: Health Probes (Future Enhancement)

- Add health endpoint to worker
- Deploy health checker service
- Transition from heartbeat-only to active probing
## Related Documentation

- [Queue Architecture](./queue-architecture.md)
- [Worker Service](./worker-service.md)
- [Executor Service](./executor-service.md)
- [RabbitMQ Queues Quick Reference](../docs/QUICKREF-rabbitmq-queues.md)
493
docs/architecture/worker-queue-ttl-dlq.md
Normal file
@@ -0,0 +1,493 @@

# Worker Queue TTL and Dead Letter Queue (Phase 2)

## Overview

Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.
## Architecture

### Message Flow

```
┌─────────────┐
│  Executor   │
│  Scheduler  │
└──────┬──────┘
       │ Publishes ExecutionRequested
       │ routing_key: execution.dispatch.worker.{id}
       │
       ▼
┌──────────────────────────────────┐
│  worker.{id}.executions queue    │
│                                  │
│  Properties:                     │
│  - x-message-ttl: 300000ms (5m)  │
│  - x-dead-letter-exchange: dlx   │
└──────┬───────────────────┬───────┘
       │                   │
       │ Worker consumes   │ TTL expires
       │ (normal flow)     │ (worker unavailable)
       │                   │
       ▼                   ▼
┌──────────────┐   ┌──────────────────┐
│   Worker     │   │   attune.dlx     │
│   Service    │   │  (Dead Letter    │
│              │   │   Exchange)      │
└──────────────┘   └────────┬─────────┘
                            │
                            │ Routes to DLQ
                            │
                            ▼
                   ┌──────────────────────┐
                   │  attune.dlx.queue    │
                   │  (Dead Letter Queue) │
                   └────────┬─────────────┘
                            │
                            │ Consumes
                            │
                            ▼
                   ┌──────────────────────┐
                   │ Dead Letter Handler  │
                   │   (in Executor)      │
                   │                      │
                   │ - Identifies exec    │
                   │ - Marks as FAILED    │
                   │ - Logs failure       │
                   └──────────────────────┘
```
### Components

#### 1. Worker Queue TTL

**Configuration:**

- Default: 5 minutes (300,000 milliseconds)
- Configurable via `rabbitmq.worker_queue_ttl_ms`

**Implementation:**

- Applied during queue declaration in `Connection::setup_worker_infrastructure()`
- Uses RabbitMQ's `x-message-ttl` queue argument
- Only applies to worker-specific queues (`worker.{id}.executions`)

**Behavior:**

- When a message remains in the queue longer than TTL, RabbitMQ automatically moves it to the configured dead letter exchange
- Original message properties and headers are preserved
- Includes `x-death` header with expiration details
#### 2. Dead Letter Exchange (DLX)

**Configuration:**

- Exchange name: `attune.dlx`
- Type: `direct`
- Durable: `true`

**Setup:**

- Created in `Connection::setup_common_infrastructure()`
- Bound to dead letter queue with routing key `#` (all messages)
- Shared across all services
#### 3. Dead Letter Queue

**Configuration:**

- Queue name: `attune.dlx.queue`
- Durable: `true`
- TTL: 24 hours (configurable via `rabbitmq.dead_letter.ttl_ms`)

**Properties:**

- Retains messages for debugging and analysis
- Messages auto-expire after retention period
- No DLX on the DLQ itself (prevents infinite loops)
#### 4. Dead Letter Handler

**Location:** `crates/executor/src/dead_letter_handler.rs`

**Responsibilities:**

1. Consume messages from `attune.dlx.queue`
2. Deserialize message envelope
3. Extract execution ID from payload
4. Verify execution is in non-terminal state
5. Update execution to FAILED status
6. Add descriptive error information
7. Acknowledge message (remove from DLQ)

**Error Handling:**

- Invalid messages: Acknowledged and discarded
- Missing executions: Acknowledged (already processed)
- Terminal state executions: Acknowledged (no action needed)
- Database errors: Nacked with requeue (retry later)
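The ack/requeue table above can be captured in a small decision function; `DlqError` and `Decision` are illustrative names, not the crate's actual types:

```rust
// Sketch of the dead letter handler's error-handling policy above.
#[derive(Debug, PartialEq)]
enum DlqError { InvalidMessage, ExecutionMissing, AlreadyTerminal, Database }

#[derive(Debug, PartialEq)]
enum Decision { Ack, NackRequeue }

fn decide(err: &DlqError) -> Decision {
    match err {
        // Nothing useful can come from retrying these: drop the message.
        DlqError::InvalidMessage
        | DlqError::ExecutionMissing
        | DlqError::AlreadyTerminal => Decision::Ack,
        // Transient infrastructure failure: requeue and retry later.
        DlqError::Database => Decision::NackRequeue,
    }
}

fn main() {
    assert_eq!(decide(&DlqError::InvalidMessage), Decision::Ack);
    assert_eq!(decide(&DlqError::AlreadyTerminal), Decision::Ack);
    assert_eq!(decide(&DlqError::Database), Decision::NackRequeue);
}
```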
## Configuration

### RabbitMQ Configuration Structure

```yaml
message_queue:
  rabbitmq:
    # Worker queue TTL - how long messages wait before DLX
    worker_queue_ttl_ms: 300000   # 5 minutes (default)

    # Dead letter configuration
    dead_letter:
      enabled: true               # Enable DLQ system
      exchange: attune.dlx        # DLX name
      ttl_ms: 86400000            # DLQ retention (24 hours)
```
### Environment-Specific Settings

#### Development (`config.development.yaml`)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000   # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000            # 24 hours
```

#### Production (`config.docker.yaml`)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000   # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000            # 24 hours
```
### Tuning Guidelines

**Worker Queue TTL (`worker_queue_ttl_ms`):**

- **Too short:** Legitimate slow workers may have executions failed prematurely
- **Too long:** Unavailable workers cause delayed failure detection
- **Recommendation:** 2-5x typical execution time, minimum 2 minutes
- **Default (5 min):** Good balance for most workloads

**DLQ Retention (`dead_letter.ttl_ms`):**

- Purpose: Debugging and forensics
- **Too short:** May lose data before analysis
- **Too long:** Accumulates stale data
- **Recommendation:** 24-48 hours in production
- **Default (24 hours):** Adequate for most troubleshooting
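The TTL recommendation ("2-5x typical execution time, minimum 2 minutes") can be expressed as a small helper. This is illustrative arithmetic, not project code:

```rust
// Hypothetical helper for the tuning guideline above.
const MIN_TTL_MS: u64 = 2 * 60 * 1000; // 2-minute floor

/// Returns a (low, high) recommended TTL range in milliseconds
/// for a given typical execution time.
fn recommended_ttl_ms(typical_execution_ms: u64) -> (u64, u64) {
    let low = (typical_execution_ms * 2).max(MIN_TTL_MS);
    let high = (typical_execution_ms * 5).max(MIN_TTL_MS);
    (low, high)
}

fn main() {
    // A 90-second typical execution gives a 180s..450s range; the
    // 5-minute default (300,000 ms) falls inside it.
    let (low, high) = recommended_ttl_ms(90_000);
    assert_eq!((low, high), (180_000, 450_000));
    assert!(low <= 300_000 && 300_000 <= high);

    // Very short executions are clamped to the 2-minute floor.
    assert_eq!(recommended_ttl_ms(10_000), (120_000, 120_000));
}
```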
## Code Structure

### Queue Declaration with TTL

```rust
// crates/common/src/mq/connection.rs

pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<u64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();

    // Configure DLX
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );

    // Configure TTL if specified
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            AMQPValue::LongInt(ttl as i64),
        );
    }

    // Declare queue with arguments
    channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}
```
### Dead Letter Handler

```rust
// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<Mutex<bool>>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        let pool = self.pool.clone();
        self.consumer.consume_with_handler(move |envelope| {
            let pool = pool.clone();
            async move {
                match envelope.message_type {
                    MessageType::ExecutionRequested => {
                        handle_execution_requested(&pool, &envelope).await
                    }
                    _ => {
                        // Unexpected message type - acknowledge and discard
                        Ok(())
                    }
                }
            }
        }).await
    }
}

async fn handle_execution_requested(
    pool: &PgPool,
    envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
    // Extract execution ID
    let execution_id = envelope.payload.get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or_else(|| /* error */)?;

    // Fetch current state
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Only fail if in non-terminal state
    if !execution.status.is_terminal() {
        ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
            status: Some(ExecutionStatus::Failed),
            result: Some(json!({
                "error": "Worker queue TTL expired",
                "message": "Worker did not process execution within configured TTL",
            })),
            ended: Some(Some(Utc::now())),
            ..Default::default()
        }).await?;
    }

    Ok(())
}
```
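The handler relies on `execution.status.is_terminal()`. A minimal sketch of that predicate, assuming the status set used elsewhere in this document:

```rust
// Illustrative sketch; the real enum and predicate live in the common crate.
#[derive(Debug, PartialEq)]
enum ExecutionStatus { Requested, Scheduling, Scheduled, Running, Completed, Failed, Cancelled }

impl ExecutionStatus {
    /// Terminal states must never be overwritten by the DLQ handler.
    fn is_terminal(&self) -> bool {
        matches!(self, Self::Completed | Self::Failed | Self::Cancelled)
    }
}

fn main() {
    // A scheduled execution that expired in the queue should be failed,
    // but one that already finished must be left alone.
    assert!(!ExecutionStatus::Scheduled.is_terminal());
    assert!(ExecutionStatus::Completed.is_terminal());
}
```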
## Integration with Executor Service
|
||||
|
||||
The dead letter handler is started automatically by the executor service if DLQ is enabled:
|
||||
|
||||
```rust
|
||||
// crates/executor/src/service.rs
|
||||
|
||||
pub async fn start(&self) -> Result<()> {
|
||||
// ... other components ...
|
||||
|
||||
// Start dead letter handler (if enabled)
|
||||
if self.inner.mq_config.rabbitmq.dead_letter.enabled {
|
||||
let dlq_name = format!("{}.queue",
|
||||
self.inner.mq_config.rabbitmq.dead_letter.exchange);
|
||||
let dlq_consumer = Consumer::new(
|
||||
&self.inner.mq_connection,
|
||||
create_dlq_consumer_config(&dlq_name, "executor.dlq"),
|
||||
).await?;
|
||||
|
||||
let dlq_handler = Arc::new(
|
||||
DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
|
||||
);
|
||||
|
||||
handles.push(tokio::spawn(async move {
|
||||
dlq_handler.start().await
|
||||
}));
|
||||
}
|
||||
|
||||
// ... wait for completion ...
|
||||
}
|
||||
```
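
The `create_dlq_consumer_config` helper is not shown above; a plausible sketch, assuming a plain consumer-config struct (the field names here are illustrative, not the MQ crate's actual API):

```rust
// Illustrative only: the real ConsumerConfig type lives in the MQ crate
// and may carry more fields (ack mode, exclusivity, etc.).
pub struct ConsumerConfig {
    pub queue: String,
    pub consumer_tag: String,
    pub prefetch_count: u16,
}

pub fn create_dlq_consumer_config(queue: &str, consumer_tag: &str) -> ConsumerConfig {
    ConsumerConfig {
        queue: queue.to_string(),
        consumer_tag: consumer_tag.to_string(),
        // A low prefetch is reasonable here: DLQ traffic should be rare
        // and each message triggers a database write.
        prefetch_count: 1,
    }
}
```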

## Operational Considerations

### Monitoring

**Key Metrics:**
- DLQ message rate (messages/sec entering the DLQ)
- DLQ queue depth (current messages in the DLQ)
- DLQ processing latency (time from DLX routing to handler)
- Failed execution count (executions failed via DLQ)

**Alerting Thresholds:**
- DLQ rate > 10/min: Workers may be unhealthy or the TTL too aggressive
- DLQ depth > 100: Handler may be falling behind
- High failure rate: Systematic worker availability issues

### RabbitMQ Management

**View DLQ:**
```bash
# List queues and their message counts
rabbitmqadmin list queues name messages

# Get DLQ details
rabbitmqadmin list queues name messages consumers | grep attune.dlx.queue

# Purge the DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue
```

**View Dead Letters:**
```bash
# Get a message from the DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1

# Check message death history:
# look for the x-death header in the message properties
```

### Troubleshooting

#### High DLQ Rate

**Symptoms:** Many executions failing via the DLQ

**Causes:**
1. Workers down or restarting frequently
2. Worker queue TTL too aggressive
3. Workers overloaded (not consuming fast enough)
4. Network issues between executor and workers

**Resolution:**
1. Check worker health and logs
2. Verify worker heartbeats in the database
3. Consider increasing `worker_queue_ttl_ms`
4. Scale the worker fleet if overloaded
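
For reference, the relevant knobs likely sit together in the MQ configuration. The exact file layout is an assumption; the setting names follow those referenced in this document:

```toml
# Hypothetical layout; section structure may differ in the real config.
[rabbitmq]
# Raise this if healthy workers legitimately need longer to pick up work.
worker_queue_ttl_ms = 300_000  # 5 minutes

[rabbitmq.dead_letter]
enabled = true
exchange = "attune.dlx"
```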

#### DLQ Handler Not Processing

**Symptoms:** DLQ depth increasing, executions stuck

**Causes:**
1. Executor service not running
2. DLQ disabled in configuration
3. Database connection issues
4. Handler crashed or deadlocked

**Resolution:**
1. Check executor service logs
2. Verify `dead_letter.enabled = true`
3. Check database connectivity
4. Restart the executor service if needed

#### Messages Not Reaching DLQ

**Symptoms:** Executions stuck, DLQ empty

**Causes:**
1. Worker queues not configured with a DLX
2. DLX exchange not created
3. DLQ not bound to the DLX
4. TTL not configured on worker queues

**Resolution:**
1. Restart services to recreate the queue infrastructure
2. Verify the RabbitMQ configuration
3. Check queue properties in the RabbitMQ management UI

## Testing

### Unit Tests

```rust
#[tokio::test]
async fn test_expired_execution_handling() {
    let pool = setup_test_db().await;

    // Create an execution in the Scheduled state
    let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;

    // Simulate a DLQ message
    let envelope = MessageEnvelope::new(
        MessageType::ExecutionRequested,
        json!({ "execution_id": execution.id }),
    );

    // Process the message
    handle_execution_requested(&pool, &envelope).await.unwrap();

    // Verify the execution failed
    let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
    assert_eq!(updated.status, ExecutionStatus::Failed);
    assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}
```

### Integration Tests

```bash
# 1. Start all services
docker compose up -d

# 2. Create an execution targeting a stopped worker
#    (worker_id 999 does not exist)
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Verify the execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"

# 5. Check that the DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```

## Relationship to Other Phases

### Phase 1 (Completed)
- Execution timeout monitor: Handles executions stuck in SCHEDULED
- Graceful shutdown: Prevents new tasks from being dispatched to stopping workers
- Reduced heartbeat interval: Faster stale-worker detection

**Interaction:** The Phase 1 timeout monitor acts as a backstop if DLQ processing fails

### Phase 2 (Current)
- Worker queue TTL: Automatic message expiration
- Dead letter queue: Captures expired messages
- Dead letter handler: Processes and fails expired executions

**Benefit:** More precise failure detection at the message-queue level

### Phase 3 (Planned)
- Health probes: Proactive worker health checking
- Intelligent retry: Retries transient failures
- Load balancing: Distributes work across healthy workers

**Integration:** Phase 3 will use Phase 2 DLQ data to inform routing decisions

## Benefits

1. **Automatic Failure Detection:** No manual intervention needed for unavailable workers
2. **Precise Timing:** TTL provides an exact failure window (vs. the polling-based Phase 1 monitor)
3. **Resource Efficiency:** Prevents message accumulation in worker queues
4. **Debugging Support:** The DLQ retains messages for forensic analysis
5. **Graceful Degradation:** The system continues functioning even with worker failures

## Limitations

1. **TTL Precision:** RabbitMQ TTL is approximate, not guaranteed to the millisecond
2. **Race Conditions:** A worker may start processing just as the TTL expires (rare; the handler's terminal-state check prevents overwriting a completed execution)
3. **DLQ Capacity:** Very high failure rates may overwhelm the DLQ
4. **No Retry Logic:** Phase 2 always fails the execution; Phase 3 will add intelligent retry

## Future Enhancements (Phase 3)

- **Conditional Retry:** Retry messages based on failure reason
- **Priority DLQ:** Prioritize critical execution failures
- **DLQ Analytics:** Aggregate statistics on failure patterns
- **Auto-scaling:** Scale workers based on DLQ rate
- **Custom TTL:** Per-action or per-execution TTL configuration

## References

- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`

@@ -131,28 +131,38 @@ echo "Hello, $PARAM_NAME!"

### 4. Action Executor

**Purpose**: Orchestrate the complete execution flow for an action.
**Purpose**: Orchestrate the complete execution flow for an action and own execution state after handoff.

**Execution Flow**:
```
1. Load execution record from database
2. Update status to Running
3. Load action definition by reference
4. Prepare execution context (parameters, env vars, timeout)
5. Select and execute in appropriate runtime
6. Capture results (stdout, stderr, return value)
7. Store artifacts (logs, results)
8. Update execution status (Succeeded/Failed)
9. Publish status update messages
1. Receive execution.scheduled message from executor
2. Load execution record from database
3. Update status to Running (owns state after handoff)
4. Load action definition by reference
5. Prepare execution context (parameters, env vars, timeout)
6. Select and execute in appropriate runtime
7. Capture results (stdout, stderr, return value)
8. Store artifacts (logs, results)
9. Update execution status (Completed/Failed) in database
10. Publish status change notifications
11. Publish completion notification for queue management
```
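
The context-preparation step exposes parameters to the action as environment variables — the `$PARAM_NAME` convention seen in the shell action example earlier. A sketch of that mapping; the uppercased `PARAM_` prefix is an assumption inferred from that example:

```rust
use std::collections::HashMap;

/// Expose execution parameters as environment variables using the
/// PARAM_<NAME> convention (prefix and casing are assumptions based on
/// the `$PARAM_NAME` shell example).
fn parameters_to_env(params: &HashMap<String, String>) -> HashMap<String, String> {
    params
        .iter()
        .map(|(key, value)| (format!("PARAM_{}", key.to_uppercase()), value.clone()))
        .collect()
}
```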

**Ownership Model**:
- **Worker owns execution state** after receiving `execution.scheduled`
- **Authoritative source** for all status updates: Running, Completed, Failed, Cancelled, etc.
- **Updates the database directly** for all state changes
- **Publishes notifications** for orchestration and monitoring

**Responsibilities**:
- Coordinate the execution lifecycle
- Load action and execution data from the database
- **Update execution state in the database** (after handoff from the executor)
- Prepare the execution context with parameters and environment
- Execute the action via the runtime registry
- Handle success and failure cases
- Store execution artifacts
- Publish status change notifications

**Key Implementation Details**:
- Parameters merged: action defaults + execution overrides
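
The merge rule can be sketched with plain maps — the real code merges JSON objects, but the precedence is the same: execution overrides win over action defaults.

```rust
use std::collections::HashMap;

/// Merge action-level default parameters with execution-level overrides.
/// On a key collision the execution override wins.
fn merge_parameters(
    defaults: &HashMap<String, String>,
    overrides: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = defaults.clone();
    for (key, value) in overrides {
        merged.insert(key.clone(), value.clone());
    }
    merged
}
```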

@@ -246,7 +256,10 @@ See `docs/secrets-management.md` for comprehensive documentation.

- Register worker in database
- Start heartbeat manager
- Consume execution messages from worker-specific queue
- Publish execution status updates
- **Own execution state** after receiving scheduled executions
- **Update execution status in database** (Running, Completed, Failed, etc.)
- Publish execution status change notifications
- Publish execution completion notifications
- Handle graceful shutdown

**Message Flow**:

@@ -407,8 +420,9 @@ pub struct ExecutionResult {

### Error Propagation

- Runtime errors captured in `ExecutionResult.error`
- Execution status updated to Failed in database
- Error published in status update message
- **Worker updates** execution status to Failed in database (owns state)
- Error published in status change notification message
- Error published in completion notification message
- Artifacts still stored for failed executions
- Logs preserved for debugging
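
The first two bullets amount to a pure mapping from a finished run to its final status. A sketch under stated assumptions — the field names here are illustrative, not necessarily those of the real `ExecutionResult`:

```rust
// Illustrative only: the actual ExecutionResult struct may differ.
pub struct ExecutionResult {
    pub exit_code: i32,
    pub error: Option<String>,
}

pub enum ExecutionStatus {
    Completed,
    Failed,
}

/// A run is Failed if the runtime reported an error or a non-zero exit
/// code; otherwise it is Completed. The worker writes this status to the
/// database, then publishes the notifications listed above.
pub fn final_status(result: &ExecutionResult) -> ExecutionStatus {
    if result.error.is_some() || result.exit_code != 0 {
        ExecutionStatus::Failed
    } else {
        ExecutionStatus::Completed
    }
}
```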