attune/docs/QUICKREF-phase3-retry-health.md

Quick Reference: Phase 3 - Intelligent Retry & Worker Health

Overview

Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.

Key Features:

  • Automatic Retry: Failed executions automatically retry with exponential backoff
  • Health-Aware Scheduling: Prefer healthy workers with low queue depth
  • Per-Action Configuration: Custom timeouts and retry limits per action
  • Failure Classification: Distinguish retriable vs non-retriable failures

Quick Start

Enable Retry for an Action

# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120      # Custom timeout (overrides global 5 min)
max_retries: 3            # Retry up to 3 times on failure
parameters:
  url:
    type: string
    required: true

Database Migration

# Apply Phase 3 schema changes
sqlx migrate run

# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql

Check Worker Health

# View healthy workers
psql -c "SELECT * FROM healthy_workers;"

# Check specific worker health
psql -c "
SELECT 
    name,
    capabilities->'health'->>'status' as health_status,
    capabilities->'health'->>'queue_depth' as queue_depth,
    capabilities->'health'->>'consecutive_failures' as failures
FROM worker 
WHERE id = 1;
"

Retry Behavior

Retriable Failures

Executions are automatically retried for:

  • ✓ Worker unavailable (worker_unavailable)
  • ✓ Queue timeout/TTL expired (queue_timeout)
  • ✓ Worker heartbeat stale (worker_heartbeat_stale)
  • ✓ Transient errors (transient_error)
  • ✓ Manual retry requested (manual_retry)

Non-Retriable Failures

These failures are NOT retried:

  • ✗ Validation errors
  • ✗ Permission denied
  • ✗ Action not found
  • ✗ Invalid parameters
  • ✗ Explicit action failure

Retry Backoff

Strategy: Exponential backoff with jitter

Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300 seconds)

Jitter: ±20% randomization to avoid thundering herd
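
The schedule above (before jitter) can be sketched as a small function; `backoff_secs` is an illustrative name, not part of the actual RetryManager API, and the ±20% jitter step is omitted here since it needs a random source:

```rust
// Illustrative sketch of the backoff formula: min(base * multiplier^N, max).
// Not the real RetryManager internals.
fn backoff_secs(base: f64, multiplier: f64, max: f64, attempt: u32) -> f64 {
    (base * multiplier.powi(attempt as i32)).min(max)
}

fn main() {
    // Attempts 0..=3 produce 1, 2, 4, 8 seconds before jitter is applied.
    for n in 0..4u32 {
        println!("attempt {}: {}s", n, backoff_secs(1.0, 2.0, 300.0, n));
    }
    // Beyond roughly attempt 8 the delay is capped at max_backoff_secs.
    assert_eq!(backoff_secs(1.0, 2.0, 300.0, 20), 300.0);
}
```

In the real system each computed delay would then be multiplied by a random factor in [0.8, 1.2] to apply the 20% jitter.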

Retry Configuration

// Default retry configuration
RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,       // 5 minutes max
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,          // 20% jitter
}

Worker Health

Health States

Healthy:

  • Heartbeat < 30 seconds old
  • Consecutive failures < 3
  • Queue depth < 50
  • Failure rate < 30%

Degraded:

  • Consecutive failures: 3-9
  • Queue depth: 50-99
  • Failure rate: 30-69%
  • Still receives tasks but deprioritized

Unhealthy:

  • Heartbeat > 30 seconds old
  • Consecutive failures ≥ 10
  • Queue depth ≥ 100
  • Failure rate ≥ 70%
  • Does NOT receive new tasks
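
As a minimal sketch, the thresholds above can be expressed as a classification function. Names and signature are illustrative; the actual WorkerHealthProbe logic may differ:

```rust
// Hypothetical classifier matching the documented thresholds.
#[derive(Debug, PartialEq)]
enum Health { Healthy, Degraded, Unhealthy }

fn classify(heartbeat_age_secs: u64, consecutive_failures: u32,
            queue_depth: u32, failure_rate: f64) -> Health {
    // Any unhealthy threshold trips the worker into Unhealthy.
    if heartbeat_age_secs > 30 || consecutive_failures >= 10
        || queue_depth >= 100 || failure_rate >= 0.7 {
        return Health::Unhealthy;
    }
    // Otherwise any degraded threshold marks it Degraded (still schedulable).
    if consecutive_failures >= 3 || queue_depth >= 50 || failure_rate >= 0.3 {
        return Health::Degraded;
    }
    Health::Healthy
}

fn main() {
    assert_eq!(classify(5, 0, 5, 0.02), Health::Healthy);
    assert_eq!(classify(5, 4, 10, 0.10), Health::Degraded);
    assert_eq!(classify(45, 0, 0, 0.0), Health::Unhealthy);
}
```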

Health Metrics

Workers self-report health in capabilities:

{
  "runtimes": ["shell", "python"],
  "health": {
    "status": "healthy",
    "last_check": "2026-02-09T12:00:00Z",
    "consecutive_failures": 0,
    "total_executions": 1000,
    "failed_executions": 20,
    "average_execution_time_ms": 1500,
    "queue_depth": 5
  }
}

Worker Selection

Selection Priority:

  1. Healthy workers (queue depth ascending)
  2. Degraded workers (queue depth ascending)
  3. Skip unhealthy workers

Example:

Worker A: Healthy, queue=5    ← Selected first
Worker B: Healthy, queue=20   ← Selected second
Worker C: Degraded, queue=10  ← Selected third
Worker D: Unhealthy, queue=0  ← Never selected
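
A hedged sketch of this ordering, assuming health ranks Healthy < Degraded < Unhealthy; the types and names are illustrative, not the real scheduler's:

```rust
// Illustrative worker-selection order: drop unhealthy workers, then sort
// by (health rank, queue depth ascending).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Health { Healthy, Degraded, Unhealthy } // derive order = declaration order

#[derive(Debug)]
struct Worker { name: &'static str, health: Health, queue_depth: u32 }

fn select_order(mut workers: Vec<Worker>) -> Vec<&'static str> {
    workers.retain(|w| w.health != Health::Unhealthy); // never selected
    workers.sort_by_key(|w| (w.health, w.queue_depth));
    workers.iter().map(|w| w.name).collect()
}

fn main() {
    let order = select_order(vec![
        Worker { name: "B", health: Health::Healthy, queue_depth: 20 },
        Worker { name: "D", health: Health::Unhealthy, queue_depth: 0 },
        Worker { name: "C", health: Health::Degraded, queue_depth: 10 },
        Worker { name: "A", health: Health::Healthy, queue_depth: 5 },
    ]);
    assert_eq!(order, vec!["A", "B", "C"]); // matches the example above
}
```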

Database Schema

Execution Retry Fields

-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)

Action Configuration Fields

-- Added to action table
timeout_seconds INTEGER,          -- Per-action timeout override
max_retries INTEGER DEFAULT 0     -- Per-action retry limit

Helper Functions

-- Check if execution can be retried
SELECT is_execution_retriable(123);

-- Get worker queue depth
SELECT get_worker_queue_depth(1);

Views

-- Get all healthy workers
SELECT * FROM healthy_workers;

Practical Examples

Example 1: View Retry Chain

-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
    SELECT id, retry_count, retry_reason, original_execution, status
    FROM execution
    WHERE id = 100
    
    UNION ALL
    
    SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
    FROM execution e
    JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;

Example 2: Analyze Retry Success Rate

-- Success rate of retries by reason (uses the retry_reason column added in Phase 3)
SELECT 
    retry_reason as reason,
    COUNT(*) as total_retries,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) as succeeded,
    ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) as success_rate
FROM execution
WHERE retry_count > 0
GROUP BY retry_reason
ORDER BY total_retries DESC;

Example 3: Find Workers by Health

-- Workers sorted by health and load
SELECT 
    w.name,
    w.status,
    (w.capabilities->'health'->>'status')::TEXT as health,
    (w.capabilities->'health'->>'queue_depth')::INTEGER as queue,
    (w.capabilities->'health'->>'consecutive_failures')::INTEGER as failures,
    w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY 
    CASE (w.capabilities->'health'->>'status')::TEXT
        WHEN 'healthy' THEN 1
        WHEN 'degraded' THEN 2
        WHEN 'unhealthy' THEN 3
        ELSE 4
    END,
    (w.capabilities->'health'->>'queue_depth')::INTEGER;

Example 4: Manual Retry via API

# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "retry test"},
    "config": {
      "retry_of": 123,
      "retry_count": 1,
      "max_retries": 3,
      "retry_reason": "manual_retry",
      "original_execution": 123
    }
  }'

Monitoring

Key Metrics

Retry Metrics:

  • Retry rate: % of executions that retry
  • Retry success rate: % of retries that succeed
  • Average retries per execution
  • Retry reason distribution

Health Metrics:

  • Healthy worker count
  • Degraded worker count
  • Unhealthy worker count
  • Average queue depth per worker
  • Average failure rate per worker

SQL Queries

-- Retry rate over last hour
SELECT 
    COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) as original_executions,
    COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) as retry_executions,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) / 
          NULLIF(COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 0), 2) as retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';

-- Worker health distribution
SELECT 
    COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') as health_status,
    COUNT(*) as worker_count,
    AVG((capabilities->'health'->>'queue_depth')::INTEGER) as avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;

Configuration

Retry Configuration

// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,
});

Health Probe Configuration

// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,
    failure_rate_unhealthy: 0.7,
});

Troubleshooting

High Retry Rate

Symptoms: Many executions retrying repeatedly

Causes:

  • Workers unstable or frequently restarting
  • Network issues causing transient failures
  • Actions not idempotent (retry makes things worse)

Resolution:

  1. Check worker stability: docker compose ps
  2. Review action idempotency
  3. Adjust max_retries if retries are unhelpful
  4. Investigate root cause of failures

Retries Not Triggering

Symptoms: Failed executions not retrying despite max_retries > 0

Causes:

  • Action doesn't have max_retries set
  • Failure is non-retriable (validation error, etc.)
  • Global retry disabled

Resolution:

  1. Check action configuration: SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';
  2. Check failure message for retriable patterns
  3. Verify retry enabled in executor config

Workers Marked Unhealthy

Symptoms: Workers not receiving tasks

Causes:

  • High queue depth (overloaded)
  • Consecutive failures exceed threshold
  • Heartbeat stale

Resolution:

  1. Check worker logs: docker compose logs -f worker-shell
  2. Verify heartbeat: SELECT name, last_heartbeat FROM worker;
  3. Check queue depth in capabilities
  4. Restart worker if stuck: docker compose restart worker-shell

Retry Loops

Symptoms: Executions retry indefinitely or far more times than expected

Causes:

  • Bug in retry reason detection
  • Action failure always classified as retriable
  • max_retries not being enforced

Resolution:

  1. Check retry chain: See Example 1 above
  2. Verify max_retries: SELECT config FROM execution WHERE id = 123;
  3. Fix retry reason classification if incorrect
  4. Manually fail execution if stuck

Integration with Previous Phases

Phase 1 + Phase 2 + Phase 3 Together

Defense in Depth:

  1. Phase 1 (Timeout Monitor): Catches stuck SCHEDULED executions (30s-5min)
  2. Phase 2 (Queue TTL/DLQ): Expires messages in worker queues (5min)
  3. Phase 3 (Intelligent Retry): Retries retriable failures (1s-5min backoff)

Failure Flow:

Execution dispatched → Worker unavailable (Phase 2: 5min TTL)
    → DLQ handler marks FAILED (Phase 2)
    → Retry manager creates retry (Phase 3)
    → Retry dispatched with backoff (Phase 3)
    → Success or exhaust retries

Backup Safety Net: If Phase 3 fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.

Best Practices

Action Design for Retries

  1. Make actions idempotent: Safe to run multiple times
  2. Set realistic timeouts: Based on typical execution time
  3. Configure appropriate max_retries:
    • Network calls: 3-5 retries
    • Database operations: 2-3 retries
    • External APIs: 3 retries
    • Local operations: 0-1 retries
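
As an illustrative sketch (pack, file, and parameter names are hypothetical), an external-API action following these guidelines might be configured as:

```yaml
# packs/mypack/actions/fetch-orders.yaml  (illustrative names)
name: fetch_orders            # external API call: idempotent read, safe to retry
runtime: python
entrypoint: actions/fetch_orders.py
timeout_seconds: 60           # typical call finishes well under a minute
max_retries: 3                # external APIs: 3 retries per the guidance above
parameters:
  endpoint:
    type: string
    required: true
```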

Worker Health Management

  1. Report queue depth regularly: Update every heartbeat
  2. Track failure metrics: Consecutive failures, total/failed counts
  3. Implement graceful degradation: Continue working when degraded
  4. Fail fast when unhealthy: Stop accepting work if overloaded

Monitoring Strategy

  1. Alert on high retry rates: > 20% of executions retrying
  2. Alert on unhealthy workers: > 50% workers unhealthy
  3. Track retry success rate: Should be > 70%
  4. Monitor queue depths: Average should stay < 20

See Also

  • Architecture: docs/architecture/worker-availability-handling.md
  • Phase 1 Guide: docs/QUICKREF-worker-availability-phase1.md
  • Phase 2 Guide: docs/QUICKREF-worker-queue-ttl-dlq.md
  • Migration: migrations/20260209000000_phase3_retry_and_health.sql