attune/docs/architecture/worker-availability-handling.md

Worker Availability Handling

Status: Implementation Gap Identified
Priority: High
Date: 2026-02-09

Problem Statement

When workers are stopped or become unavailable, the executor continues attempting to schedule executions to them, resulting in:

  1. Stuck executions: Executions remain in SCHEDULING or SCHEDULED status indefinitely
  2. Queue buildup: Messages accumulate in worker-specific RabbitMQ queues
  3. No failure notification: Users don't know their executions are stuck
  4. Resource waste: System resources consumed by queued messages and database records

Current Architecture

Heartbeat Mechanism

Workers send heartbeat updates to the database periodically (default: 30 seconds).

// From crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    // Worker is fresh if heartbeat < 90 seconds old
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
    );
    // ...
}
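The excerpt above elides the actual comparison. As a minimal std-only sketch of that logic, assuming the worker's last heartbeat is available as a `SystemTime` (the real `Worker` model may store a `chrono` timestamp instead):

```rust
use std::time::{Duration, SystemTime};

const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

/// Returns true if the last heartbeat is younger than interval * multiplier
/// (90 seconds with the current defaults).
fn is_heartbeat_fresh(last_heartbeat: SystemTime, now: SystemTime) -> bool {
    let max_age =
        Duration::from_secs(DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER);
    match now.duration_since(last_heartbeat) {
        Ok(age) => age < max_age,
        // A heartbeat timestamped in the future (clock skew) counts as fresh
        Err(_) => true,
    }
}

fn main() {
    let now = SystemTime::now();
    assert!(is_heartbeat_fresh(now - Duration::from_secs(30), now));
    assert!(!is_heartbeat_fresh(now - Duration::from_secs(120), now));
}
```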

Scheduling Flow

Execution Created (REQUESTED)
    ↓
Scheduler receives message
    ↓
Find compatible worker with fresh heartbeat
    ↓
Update execution to SCHEDULED
    ↓
Publish message to worker-specific queue
    ↓
Worker consumes and executes
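The flow above implies a small execution state machine. A sketch of the statuses named in this document and the transitions the flow allows (the real enum in the codebase may carry more variants):

```rust
/// Execution lifecycle statuses named in this document
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Running,
    Completed,
    Failed,
}

/// Transitions implied by the scheduling flow; failure is reachable from
/// any non-terminal state (e.g. a worker going away)
fn can_transition(from: ExecutionStatus, to: ExecutionStatus) -> bool {
    use ExecutionStatus::*;
    matches!(
        (from, to),
        (Requested, Scheduling)
            | (Scheduling, Scheduled)
            | (Scheduled, Running)
            | (Running, Completed)
            | (Requested | Scheduling | Scheduled | Running, Failed)
    )
}

fn main() {
    assert!(can_transition(ExecutionStatus::Scheduled, ExecutionStatus::Running));
    assert!(can_transition(ExecutionStatus::Scheduled, ExecutionStatus::Failed));
    assert!(!can_transition(ExecutionStatus::Failed, ExecutionStatus::Running));
}
```

The stuck-execution problem described below is precisely a `Scheduled` state with no worker left to drive the `Scheduled → Running` transition.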

Failure Points

  1. Worker stops after heartbeat: Worker has fresh heartbeat but is actually down
  2. Worker crashes: No graceful shutdown, heartbeat appears fresh temporarily
  3. Network partition: Worker isolated but appears healthy
  4. Queue accumulation: Messages sit in worker-specific queues indefinitely

Current Mitigations (Insufficient)

1. Heartbeat Staleness Check

async fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
    // Fetch candidate workers (action-compatibility filtering elided here)
    let workers = WorkerRepository::find_action_workers(pool).await?;

    // Filter by active workers
    let active_workers: Vec<_> = workers
        .into_iter()
        .filter(|w| w.status == WorkerStatus::Active)
        .collect();

    // Filter by heartbeat freshness
    let fresh_workers: Vec<_> = active_workers
        .into_iter()
        .filter(|w| is_worker_heartbeat_fresh(w))
        .collect();

    if fresh_workers.is_empty() {
        return Err(anyhow!("No workers with fresh heartbeats"));
    }

    // Select first available worker (non-empty was checked above)
    Ok(fresh_workers.into_iter().next().expect("fresh_workers is non-empty"))
}

Gap: Workers can stop within the 90-second staleness window.

2. Message Requeue on Error

// From crates/common/src/mq/consumer.rs
match handler(envelope.clone()).await {
    Ok(()) => {
        channel.basic_ack(delivery_tag, BasicAckOptions::default()).await?;
    }
    Err(e) => {
        // Requeue only errors classified as transient (connection/timeout)
        let requeue = e.is_retriable();
        channel.basic_nack(delivery_tag, BasicNackOptions {
            requeue,
            multiple: false,
        }).await?;
    }
}

Gap: Only requeues on retriable errors (connection/timeout), not worker unavailability.
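A sketch of the kind of classification `is_retriable()` performs, with hypothetical variant names (the real error type in crates/common may differ):

```rust
/// Hypothetical consumer-side error kinds, for illustration only
#[derive(Debug)]
enum HandlerError {
    /// Broker/database connection dropped: safe to requeue
    Connection(String),
    /// Operation timed out: safe to requeue
    Timeout(String),
    /// Payload failed validation: requeueing would loop forever
    InvalidPayload(String),
    /// Target worker is gone: today this is NOT treated as retriable,
    /// which is exactly the gap described above
    WorkerUnavailable(String),
}

impl HandlerError {
    fn is_retriable(&self) -> bool {
        matches!(self, HandlerError::Connection(_) | HandlerError::Timeout(_))
    }
}

fn main() {
    assert!(HandlerError::Timeout("rpc call timed out".into()).is_retriable());
    assert!(!HandlerError::InvalidPayload("bad json".into()).is_retriable());
    assert!(!HandlerError::WorkerUnavailable("worker-3 gone".into()).is_retriable());
}
```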

3. Message TTL Configuration

// From crates/common/src/config.rs
pub struct MessageQueueConfig {
    #[serde(default = "default_message_ttl")]
    pub message_ttl: u64,
}

fn default_message_ttl() -> u64 {
    3600 // 1 hour
}

Gap: TTL not currently applied to worker queues, and 1 hour is too long.

Proposed Solutions

Solution 1: Execution Timeout Mechanism (HIGH PRIORITY)

Add a background task that monitors scheduled executions and fails them if they don't start within a timeout.

Implementation:

// crates/executor/src/execution_timeout_monitor.rs

pub struct ExecutionTimeoutMonitor {
    pool: PgPool,
    publisher: Arc<Publisher>,
    check_interval: Duration,
    scheduled_timeout: Duration,
}

impl ExecutionTimeoutMonitor {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_stale_executions().await {
                error!("Error checking stale executions: {}", e);
            }
        }
    }

    async fn check_stale_executions(&self) -> Result<()> {
        let cutoff = Utc::now() - chrono::Duration::from_std(self.scheduled_timeout)?;

        // Find executions stuck in SCHEDULED status
        let stale_executions = sqlx::query_as::<_, Execution>(
            "SELECT * FROM execution 
             WHERE status = 'scheduled' 
             AND updated < $1"
        )
        .bind(cutoff)
        .fetch_all(&self.pool)
        .await?;

        for execution in stale_executions {
            warn!(
                "Execution {} has been scheduled for too long, marking as failed",
                execution.id
            );

            self.fail_execution(
                execution.id,
                "Execution timeout: worker did not pick up task within timeout"
            ).await?;
        }

        Ok(())
    }

    async fn fail_execution(&self, execution_id: i64, reason: &str) -> Result<()> {
        // Update execution status
        sqlx::query(
            "UPDATE execution 
             SET status = 'failed', 
                 result = $2,
                 updated = NOW() 
             WHERE id = $1"
        )
        .bind(execution_id)
        .bind(serde_json::json!({
            "error": reason,
            "failed_by": "execution_timeout_monitor"
        }))
        .execute(&self.pool)
        .await?;

        // Publish completion notification
        let payload = ExecutionCompletedPayload {
            execution_id,
            status: ExecutionStatus::Failed,
            result: Some(serde_json::json!({"error": reason})),
        };

        self.publisher
            .publish_envelope(
                MessageType::ExecutionCompleted,
                payload,
                "attune.executions",
            )
            .await?;

        Ok(())
    }
}

Configuration:

# config.yaml
executor:
  scheduled_timeout: 300  # 5 minutes (fail if not running within 5 min)
  timeout_check_interval: 60  # Check every minute
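The heart of `check_stale_executions` is a cutoff comparison. A std-only sketch of that predicate against an in-memory list, decoupled from sqlx (timestamps shown as `SystemTime`; the real code uses chrono and Postgres):

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, PartialEq)]
enum Status {
    Scheduled,
    Running,
}

struct Execution {
    id: i64,
    status: Status,
    updated: SystemTime,
}

/// Ids of executions stuck in SCHEDULED longer than `scheduled_timeout`:
/// the same predicate the proposed SQL expresses with
/// `status = 'scheduled' AND updated < $1`
fn stale_execution_ids(
    executions: &[Execution],
    now: SystemTime,
    scheduled_timeout: Duration,
) -> Vec<i64> {
    let cutoff = now - scheduled_timeout;
    executions
        .iter()
        .filter(|e| e.status == Status::Scheduled && e.updated < cutoff)
        .map(|e| e.id)
        .collect()
}

fn main() {
    let now = SystemTime::now();
    let executions = vec![
        Execution { id: 1, status: Status::Scheduled, updated: now - Duration::from_secs(600) },
        Execution { id: 2, status: Status::Scheduled, updated: now - Duration::from_secs(60) },
        Execution { id: 3, status: Status::Running, updated: now - Duration::from_secs(600) },
    ];
    // With the proposed 5-minute timeout, only execution 1 is stale
    assert_eq!(stale_execution_ids(&executions, now, Duration::from_secs(300)), vec![1]);
}
```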

Solution 2: Worker Queue TTL and DLQ (MEDIUM PRIORITY)

Apply message TTL to worker-specific queues with dead letter exchange.

Implementation:

// When declaring worker-specific queues
let mut queue_args = FieldTable::default();

// Set message TTL (5 minutes)
queue_args.insert(
    "x-message-ttl".into(),
    AMQPValue::LongInt(300_000) // 5 minutes in milliseconds
);

// Set dead letter exchange
queue_args.insert(
    "x-dead-letter-exchange".into(),
    AMQPValue::LongString("attune.executions.dlx".into())
);

channel.queue_declare(
    &format!("attune.execution.worker.{}", worker_id),
    QueueDeclareOptions {
        durable: true,
        ..Default::default()
    },
    queue_args,
).await?;
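Note the units: the configuration expresses TTLs in seconds while `x-message-ttl` takes milliseconds, and (to my understanding) lapin's `AMQPValue::LongInt` carries an `i32`, so the conversion is worth a checked helper. A sketch:

```rust
/// Converts a TTL configured in seconds into the millisecond value that
/// AMQP's x-message-ttl expects, rejecting values that would overflow the
/// i32 assumed to back AMQPValue::LongInt
fn ttl_secs_to_amqp_ms(ttl_secs: u64) -> Option<i32> {
    ttl_secs
        .checked_mul(1000)
        .and_then(|ms| i32::try_from(ms).ok())
}

fn main() {
    // 300 s (the proposed worker_queue_ttl) -> 300_000 ms
    assert_eq!(ttl_secs_to_amqp_ms(300), Some(300_000));
    // A TTL whose millisecond value overflows i32 is rejected
    assert_eq!(ttl_secs_to_amqp_ms(u64::MAX), None);
}
```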

Dead Letter Handler:

// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: PgPool,
    consumer: Arc<Consumer>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<()> {
        self.consumer
            .consume_with_handler(|envelope: MessageEnvelope<ExecutionScheduledPayload>| {
                let pool = self.pool.clone();
                
                async move {
                    warn!("Received dead letter for execution {}", envelope.payload.execution_id);
                    
                    // Mark execution as failed
                    sqlx::query(
                        "UPDATE execution 
                         SET status = 'failed', 
                             result = $2,
                             updated = NOW() 
                         WHERE id = $1 AND status = 'scheduled'"
                    )
                    .bind(envelope.payload.execution_id)
                    .bind(serde_json::json!({
                        "error": "Message expired in worker queue (worker unavailable)",
                        "failed_by": "dead_letter_handler"
                    }))
                    .execute(&pool)
                    .await?;
                    
                    Ok(())
                }
            })
            .await
    }
}

Solution 3: Worker Health Probes (LOW PRIORITY)

Add active health checking instead of relying solely on heartbeats.

Implementation:

// crates/executor/src/worker_health_checker.rs

pub struct WorkerHealthChecker {
    pool: PgPool,
    check_interval: Duration,
}

impl WorkerHealthChecker {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_worker_health().await {
                error!("Error checking worker health: {}", e);
            }
        }
    }

    async fn check_worker_health(&self) -> Result<()> {
        let workers = WorkerRepository::find_action_workers(&self.pool).await?;

        for worker in workers {
            // Skip if heartbeat is very stale (worker is definitely down)
            if !is_heartbeat_recent(&worker) {
                continue;
            }

            // Attempt health check
            match self.ping_worker(&worker).await {
                Ok(true) => {
                    // Worker is healthy, ensure status is Active
                    if worker.status != Some(WorkerStatus::Active) {
                        self.update_worker_status(worker.id, WorkerStatus::Active).await?;
                    }
                }
                Ok(false) | Err(_) => {
                    // Worker is unhealthy, mark as inactive
                    warn!("Worker {} failed health check", worker.name);
                    self.update_worker_status(worker.id, WorkerStatus::Inactive).await?;
                }
            }
        }

        Ok(())
    }

    async fn ping_worker(&self, _worker: &Worker) -> Result<bool> {
        // TODO: Implement a health endpoint on the worker. An interim check
        // could ask RabbitMQ whether the worker's queue has an active consumer.
        // Until then, optimistically report healthy.
        Ok(true)
    }
}

Solution 4: Graceful Worker Shutdown (MEDIUM PRIORITY)

Ensure workers mark themselves as inactive before shutdown.

Implementation:

// In worker service shutdown handler
impl WorkerService {
    pub async fn shutdown(&self) -> Result<()> {
        info!("Worker shutting down gracefully...");

        // Mark worker as inactive
        sqlx::query(
            "UPDATE worker SET status = 'inactive', updated = NOW() WHERE id = $1"
        )
        .bind(self.worker_id)
        .execute(&self.pool)
        .await?;

        // Stop accepting new tasks
        self.stop_consuming().await?;

        // Wait for in-flight tasks to complete (with timeout); an expired
        // grace period should not abort the rest of the shutdown
        let timeout = Duration::from_secs(30);
        match tokio::time::timeout(timeout, self.wait_for_completion()).await {
            Ok(result) => result?,
            Err(_) => warn!("Shutdown grace period elapsed; abandoning in-flight tasks"),
        }

        info!("Worker shutdown complete");
        Ok(())
    }
}
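The drain-with-timeout step can be sketched with std channels: each in-flight task holds a clone of a `Sender`, and shutdown waits until every clone is dropped or the deadline passes. A hypothetical std-only analogue of the tokio code above:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Waits until all senders (one per in-flight task) are dropped, or the
/// deadline passes. Returns true if everything drained in time.
fn drain_in_flight(rx: mpsc::Receiver<()>, deadline: Duration) -> bool {
    matches!(rx.recv_timeout(deadline), Err(mpsc::RecvTimeoutError::Disconnected))
}

fn main() {
    let (tx, rx) = mpsc::channel::<()>();

    // Simulated in-flight tasks, each holding a sender clone
    for task in 0..3u64 {
        let guard = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50 * (task + 1)));
            drop(guard); // task complete: release its sender
        });
    }
    drop(tx); // the shutdown path keeps no sender of its own

    assert!(drain_in_flight(rx, Duration::from_secs(5)));
    println!("all in-flight tasks drained");
}
```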

Docker Signal Handling:

# docker-compose.yaml
services:
  worker-shell:
    stop_grace_period: 45s  # Give worker time to finish tasks

Implementation Priority

Phase 1: Immediate (Week 1)

  1. Execution Timeout Monitor - Prevents stuck executions
  2. Graceful Shutdown - Marks workers inactive on stop

Phase 2: Short-term (Week 2)

  1. Worker Queue TTL + DLQ - Prevents message buildup
  2. Dead Letter Handler - Fails expired executions

Phase 3: Long-term (Month 1)

  1. Worker Health Probes - Active availability verification
  2. Retry Logic - Reschedule to different worker on failure

Configuration

executor:
  # How long an execution can stay SCHEDULED before failing
  scheduled_timeout: 300  # 5 minutes

  # How often to check for stale executions
  timeout_check_interval: 60  # 1 minute

  # Message TTL in worker queues
  worker_queue_ttl: 300  # 5 minutes (match scheduled_timeout)

  # Worker health check interval
  health_check_interval: 30  # 30 seconds

worker:
  # How often to send heartbeats
  heartbeat_interval: 10  # 10 seconds (more frequent)

  # Grace period for shutdown
  shutdown_timeout: 30  # 30 seconds

Staleness Calculation

Heartbeat Staleness Threshold = heartbeat_interval * 3
                               = 10 * 3 = 30 seconds

This means:
- Worker sends heartbeat every 10s
- If heartbeat is > 30s old, worker is considered stale
- Reduces window where stopped worker appears healthy from 90s to 30s
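The same arithmetic as a small helper, showing both the current and proposed windows:

```rust
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

/// Staleness threshold derived from the heartbeat interval, as described
/// above: interval * multiplier
fn staleness_threshold_secs(heartbeat_interval_secs: u64) -> u64 {
    heartbeat_interval_secs * HEARTBEAT_STALENESS_MULTIPLIER
}

fn main() {
    // Current default: 30 s heartbeats -> 90 s window
    assert_eq!(staleness_threshold_secs(30), 90);
    // Proposed: 10 s heartbeats -> 30 s window
    assert_eq!(staleness_threshold_secs(10), 30);
}
```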

Monitoring and Observability

Metrics to Track

  1. Execution timeout rate: Number of executions failed due to timeout
  2. Worker downtime: Time between last heartbeat and status change
  3. Dead letter queue depth: Number of expired messages
  4. Average scheduling latency: Time from REQUESTED to RUNNING

Alerts

alerts:
  - name: high_execution_timeout_rate
    condition: execution_timeouts > 10 per minute
    severity: warning

  - name: no_active_workers
    condition: active_workers == 0
    severity: critical

  - name: dlq_buildup
    condition: dlq_depth > 100
    severity: warning

  - name: stale_executions
    condition: scheduled_executions_older_than_5min > 0
    severity: warning

Testing

Test Scenarios

  1. Worker stops mid-execution: Should timeout and fail
  2. Worker never picks up task: Should timeout after 5 minutes
  3. All workers down: Should immediately fail with "no workers available"
  4. Worker stops gracefully: Should mark inactive and not receive new tasks
  5. Message expires in queue: Should be moved to DLQ and execution failed

Integration Test Example

#[tokio::test]
async fn test_execution_timeout_on_worker_down() {
    let pool = setup_test_db().await;
    let mq = setup_test_mq().await;

    // Create worker and execution
    let worker = create_test_worker(&pool).await;
    let execution = create_test_execution(&pool).await;

    // Schedule execution to worker
    schedule_execution(&pool, &mq, execution.id, worker.id).await;

    // Stop worker (simulate crash - no graceful shutdown)
    stop_worker(worker.id).await;

    // Run the timeout monitor with a short timeout so the test does not
    // have to sleep through the production 5-minute window
    let monitor = ExecutionTimeoutMonitor {
        pool: pool.clone(),
        publisher: mq.publisher.clone(),
        check_interval: Duration::from_secs(1),
        scheduled_timeout: Duration::from_secs(2),
    };
    tokio::spawn(async move { monitor.start().await });

    // Wait past scheduled_timeout plus at least one check interval
    tokio::time::sleep(Duration::from_secs(4)).await;

    // Verify execution is marked as failed
    let execution = get_execution(&pool, execution.id).await;
    assert_eq!(execution.status, ExecutionStatus::Failed);
    assert!(execution.result.unwrap()["error"]
        .as_str()
        .unwrap()
        .contains("timeout"));
}

Migration Path

Step 1: Add Monitoring (No Breaking Changes)

  • Deploy execution timeout monitor
  • Monitor logs for timeout events
  • Tune timeout values based on actual workload

Step 2: Add DLQ (Requires Queue Reconfiguration)

  • Create dead letter exchange
  • Update queue declarations with TTL and DLX
  • Deploy dead letter handler
  • Monitor DLQ depth

Step 3: Graceful Shutdown (Worker Update)

  • Add shutdown handler to worker
  • Update Docker Compose stop_grace_period
  • Test worker restarts

Step 4: Health Probes (Future Enhancement)

  • Add health endpoint to worker
  • Deploy health checker service
  • Transition from heartbeat-only to active probing