attune/docs/QUICKREF-phase3-retry-health.md

Quick Reference: Phase 3 - Intelligent Retry & Worker Health

Overview

Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.

Key Features:

  • Automatic Retry: Failed executions automatically retry with exponential backoff
  • Health-Aware Scheduling: Prefer healthy workers with low queue depth
  • Per-Action Configuration: Custom timeouts and retry limits per action
  • Failure Classification: Distinguish retriable vs non-retriable failures

Quick Start

Enable Retry for an Action

# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120      # Custom timeout (overrides global 5 min)
max_retries: 3            # Retry up to 3 times on failure
parameters:
  url:
    type: string
    required: true

Database Migration

# Apply Phase 3 schema changes
sqlx migrate run

# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql

Check Worker Health

# View healthy workers
psql -c "SELECT * FROM healthy_workers;"

# Check specific worker health
psql -c "
SELECT 
    name,
    capabilities->'health'->>'status' as health_status,
    capabilities->'health'->>'queue_depth' as queue_depth,
    capabilities->'health'->>'consecutive_failures' as failures
FROM worker 
WHERE id = 1;
"

Retry Behavior

Retriable Failures

Executions are automatically retried for:

  • ✓ Worker unavailable (worker_unavailable)
  • ✓ Queue timeout/TTL expired (queue_timeout)
  • ✓ Worker heartbeat stale (worker_heartbeat_stale)
  • ✓ Transient errors (transient_error)
  • ✓ Manual retry requested (manual_retry)

Non-Retriable Failures

These failures are NOT retried:

  • ✗ Validation errors
  • ✗ Permission denied
  • ✗ Action not found
  • ✗ Invalid parameters
  • ✗ Explicit action failure

Retry Backoff

Strategy: Exponential backoff with jitter

Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300 seconds)

Jitter: ±20% randomization to avoid thundering herd
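
The schedule above (before jitter) can be sketched as a small function; `backoff_secs` is an illustrative name, not part of the actual RetryManager API, and the ±20% jitter step is omitted here since it needs a random source:

```rust
// Illustrative sketch of the backoff formula: min(base * multiplier^N, max).
// Not the real RetryManager internals.
fn backoff_secs(base: f64, multiplier: f64, max: f64, attempt: u32) -> f64 {
    (base * multiplier.powi(attempt as i32)).min(max)
}

fn main() {
    // Attempts 0..=3 produce 1, 2, 4, 8 seconds before jitter is applied.
    for n in 0..4u32 {
        println!("attempt {}: {}s", n, backoff_secs(1.0, 2.0, 300.0, n));
    }
    // Beyond roughly attempt 8 the delay is capped at max_backoff_secs.
    assert_eq!(backoff_secs(1.0, 2.0, 300.0, 20), 300.0);
}
```

In the real system each computed delay would then be multiplied by a random factor in [0.8, 1.2] to apply the 20% jitter.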

Retry Configuration

// Default retry configuration
RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,       // 5 minutes max
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,          // 20% jitter
}

Worker Health

Health States

Healthy:

  • Heartbeat < 30 seconds old
  • Consecutive failures < 3
  • Queue depth < 50
  • Failure rate < 30%

Degraded:

  • Consecutive failures: 3-9
  • Queue depth: 50-99
  • Failure rate: 30-69%
  • Still receives tasks but deprioritized

Unhealthy:

  • Heartbeat > 30 seconds old
  • Consecutive failures ≥ 10
  • Queue depth ≥ 100
  • Failure rate ≥ 70%
  • Does NOT receive new tasks
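
As a minimal sketch, the thresholds above can be expressed as a classification function. Names and signature are illustrative; the actual WorkerHealthProbe logic may differ:

```rust
// Hypothetical classifier matching the documented thresholds.
#[derive(Debug, PartialEq)]
enum Health { Healthy, Degraded, Unhealthy }

fn classify(heartbeat_age_secs: u64, consecutive_failures: u32,
            queue_depth: u32, failure_rate: f64) -> Health {
    // Any unhealthy threshold trips the worker into Unhealthy.
    if heartbeat_age_secs > 30 || consecutive_failures >= 10
        || queue_depth >= 100 || failure_rate >= 0.7 {
        return Health::Unhealthy;
    }
    // Otherwise any degraded threshold marks it Degraded (still schedulable).
    if consecutive_failures >= 3 || queue_depth >= 50 || failure_rate >= 0.3 {
        return Health::Degraded;
    }
    Health::Healthy
}

fn main() {
    assert_eq!(classify(5, 0, 5, 0.02), Health::Healthy);
    assert_eq!(classify(5, 4, 10, 0.10), Health::Degraded);
    assert_eq!(classify(45, 0, 0, 0.0), Health::Unhealthy);
}
```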

Health Metrics

Workers self-report health in capabilities:

{
  "runtimes": ["shell", "python"],
  "health": {
    "status": "healthy",
    "last_check": "2026-02-09T12:00:00Z",
    "consecutive_failures": 0,
    "total_executions": 1000,
    "failed_executions": 20,
    "average_execution_time_ms": 1500,
    "queue_depth": 5
  }
}

Worker Selection

Selection Priority:

  1. Healthy workers (queue depth ascending)
  2. Degraded workers (queue depth ascending)
  3. Skip unhealthy workers

Example:

Worker A: Healthy, queue=5    ← Selected first
Worker B: Healthy, queue=20   ← Selected second
Worker C: Degraded, queue=10  ← Selected third
Worker D: Unhealthy, queue=0  ← Never selected
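
A hedged sketch of this ordering, assuming health ranks Healthy < Degraded < Unhealthy; the types and names are illustrative, not the real scheduler's:

```rust
// Illustrative worker-selection order: drop unhealthy workers, then sort
// by (health rank, queue depth ascending).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Health { Healthy, Degraded, Unhealthy } // derive order = declaration order

#[derive(Debug)]
struct Worker { name: &'static str, health: Health, queue_depth: u32 }

fn select_order(mut workers: Vec<Worker>) -> Vec<&'static str> {
    workers.retain(|w| w.health != Health::Unhealthy); // never selected
    workers.sort_by_key(|w| (w.health, w.queue_depth));
    workers.iter().map(|w| w.name).collect()
}

fn main() {
    let order = select_order(vec![
        Worker { name: "B", health: Health::Healthy, queue_depth: 20 },
        Worker { name: "D", health: Health::Unhealthy, queue_depth: 0 },
        Worker { name: "C", health: Health::Degraded, queue_depth: 10 },
        Worker { name: "A", health: Health::Healthy, queue_depth: 5 },
    ]);
    assert_eq!(order, vec!["A", "B", "C"]); // matches the example above
}
```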

Database Schema

Execution Retry Fields

-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)

Action Configuration Fields

-- Added to action table
timeout_seconds INTEGER,          -- Per-action timeout override
max_retries INTEGER DEFAULT 0     -- Per-action retry limit

Helper Functions

-- Check if execution can be retried
SELECT is_execution_retriable(123);

-- Get worker queue depth
SELECT get_worker_queue_depth(1);

Views

-- Get all healthy workers
SELECT * FROM healthy_workers;

Practical Examples

Example 1: View Retry Chain

-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
    SELECT id, retry_count, retry_reason, original_execution, status
    FROM execution
    WHERE id = 100
    
    UNION ALL
    
    SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
    FROM execution e
    JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;

Example 2: Analyze Retry Success Rate

-- Success rate of retries by reason (uses the retry_reason column added in Phase 3)
SELECT 
    retry_reason as reason,
    COUNT(*) as total_retries,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) as succeeded,
    ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) as success_rate
FROM execution
WHERE retry_count > 0
GROUP BY retry_reason
ORDER BY total_retries DESC;

Example 3: Find Workers by Health

-- Workers sorted by health and load
SELECT 
    w.name,
    w.status,
    (w.capabilities->'health'->>'status')::TEXT as health,
    (w.capabilities->'health'->>'queue_depth')::INTEGER as queue,
    (w.capabilities->'health'->>'consecutive_failures')::INTEGER as failures,
    w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY 
    CASE (w.capabilities->'health'->>'status')::TEXT
        WHEN 'healthy' THEN 1
        WHEN 'degraded' THEN 2
        WHEN 'unhealthy' THEN 3
        ELSE 4
    END,
    (w.capabilities->'health'->>'queue_depth')::INTEGER;

Example 4: Manual Retry via API

# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "retry test"},
    "config": {
      "retry_of": 123,
      "retry_count": 1,
      "max_retries": 3,
      "retry_reason": "manual_retry",
      "original_execution": 123
    }
  }'

Monitoring

Key Metrics

Retry Metrics:

  • Retry rate: % of executions that retry
  • Retry success rate: % of retries that succeed
  • Average retries per execution
  • Retry reason distribution

Health Metrics:

  • Healthy worker count
  • Degraded worker count
  • Unhealthy worker count
  • Average queue depth per worker
  • Average failure rate per worker

SQL Queries

-- Retry rate over last hour
SELECT 
    COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) as original_executions,
    COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) as retry_executions,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) / 
          NULLIF(COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 0), 2) as retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';

-- Worker health distribution
SELECT 
    COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') as health_status,
    COUNT(*) as worker_count,
    AVG((capabilities->'health'->>'queue_depth')::INTEGER) as avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;

Configuration

Retry Configuration

// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,
});

Health Probe Configuration

// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,
    failure_rate_unhealthy: 0.7,
});

Troubleshooting

High Retry Rate

Symptoms: Many executions retrying repeatedly

Causes:

  • Workers unstable or frequently restarting
  • Network issues causing transient failures
  • Actions not idempotent (retry makes things worse)

Resolution:

  1. Check worker stability: docker compose ps
  2. Review action idempotency
  3. Adjust max_retries if retries are unhelpful
  4. Investigate root cause of failures

Retries Not Triggering

Symptoms: Failed executions not retrying despite max_retries > 0

Causes:

  • Action doesn't have max_retries set
  • Failure is non-retriable (validation error, etc.)
  • Global retry disabled

Resolution:

  1. Check action configuration: SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';
  2. Check failure message for retriable patterns
  3. Verify retry enabled in executor config

Workers Marked Unhealthy

Symptoms: Workers not receiving tasks

Causes:

  • High queue depth (overloaded)
  • Consecutive failures exceed threshold
  • Heartbeat stale

Resolution:

  1. Check worker logs: docker compose logs -f worker-shell
  2. Verify heartbeat: SELECT name, last_heartbeat FROM worker;
  3. Check queue depth in capabilities
  4. Restart worker if stuck: docker compose restart worker-shell

Retry Loops

Symptoms: Executions retry indefinitely or far more times than expected

Causes:

  • Bug in retry reason detection
  • Action failure always classified as retriable
  • max_retries not being enforced

Resolution:

  1. Check retry chain: See Example 1 above
  2. Verify max_retries: SELECT config FROM execution WHERE id = 123;
  3. Fix retry reason classification if incorrect
  4. Manually fail execution if stuck

Integration with Previous Phases

Phase 1 + Phase 2 + Phase 3 Together

Defense in Depth:

  1. Phase 1 (Timeout Monitor): Catches stuck SCHEDULED executions (30s-5min)
  2. Phase 2 (Queue TTL/DLQ): Expires messages in worker queues (5min)
  3. Phase 3 (Intelligent Retry): Retries retriable failures (1s-5min backoff)

Failure Flow:

Execution dispatched → Worker unavailable (Phase 2: 5min TTL)
    → DLQ handler marks FAILED (Phase 2)
    → Retry manager creates retry (Phase 3)
    → Retry dispatched with backoff (Phase 3)
    → Success or exhaust retries

Backup Safety Net: If Phase 3 fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.

Best Practices

Action Design for Retries

  1. Make actions idempotent: Safe to run multiple times
  2. Set realistic timeouts: Based on typical execution time
  3. Configure appropriate max_retries:
    • Network calls: 3-5 retries
    • Database operations: 2-3 retries
    • External APIs: 3 retries
    • Local operations: 0-1 retries
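
As an illustrative sketch (pack, file, and parameter names are hypothetical), an external-API action following these guidelines might be configured as:

```yaml
# packs/mypack/actions/fetch-orders.yaml  (illustrative names)
name: fetch_orders            # external API call: idempotent read, safe to retry
runtime: python
entrypoint: actions/fetch_orders.py
timeout_seconds: 60           # typical call finishes well under a minute
max_retries: 3                # external APIs: 3 retries per the guidance above
parameters:
  endpoint:
    type: string
    required: true
```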

Worker Health Management

  1. Report queue depth regularly: Update every heartbeat
  2. Track failure metrics: Consecutive failures, total/failed counts
  3. Implement graceful degradation: Continue working when degraded
  4. Fail fast when unhealthy: Stop accepting work if overloaded

Monitoring Strategy

  1. Alert on high retry rates: > 20% of executions retrying
  2. Alert on unhealthy workers: > 50% workers unhealthy
  3. Track retry success rate: Should be > 70%
  4. Monitor queue depths: Average should stay < 20

See Also

  • Architecture: docs/architecture/worker-availability-handling.md
  • Phase 1 Guide: docs/QUICKREF-worker-availability-phase1.md
  • Phase 2 Guide: docs/QUICKREF-worker-queue-ttl-dlq.md
  • Migration: migrations/20260209000000_phase3_retry_and_health.sql