Quick Reference: Phase 3 - Intelligent Retry & Worker Health
Overview
Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.
Key Features:
- Automatic Retry: Failed executions automatically retry with exponential backoff
- Health-Aware Scheduling: Prefer healthy workers with low queue depth
- Per-Action Configuration: Custom timeouts and retry limits per action
- Failure Classification: Distinguish retriable vs non-retriable failures
Quick Start
Enable Retry for an Action
# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120   # Custom timeout (overrides global 5 min)
max_retries: 3         # Retry up to 3 times on failure
parameters:
  url:
    type: string
    required: true
Database Migration
# Apply Phase 3 schema changes
sqlx migrate run
# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql
Check Worker Health
# View healthy workers
psql -c "SELECT * FROM healthy_workers;"
# Check specific worker health
psql -c "
SELECT
name,
capabilities->'health'->>'status' as health_status,
capabilities->'health'->>'queue_depth' as queue_depth,
capabilities->'health'->>'consecutive_failures' as failures
FROM worker
WHERE id = 1;
"
Retry Behavior
Retriable Failures
Executions are automatically retried for:
- ✓ Worker unavailable (worker_unavailable)
- ✓ Queue timeout/TTL expired (queue_timeout)
- ✓ Worker heartbeat stale (worker_heartbeat_stale)
- ✓ Transient errors (transient_error)
- ✓ Manual retry requested (manual_retry)
Non-Retriable Failures
These failures are NOT retried:
- ✗ Validation errors
- ✗ Permission denied
- ✗ Action not found
- ✗ Invalid parameters
- ✗ Explicit action failure
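The split above boils down to a reason-code lookup plus a retry budget. The following sketch mirrors that logic; the reason codes come from the lists above, but the function itself is illustrative, not the executor's actual implementation:

```python
# Reason codes taken from the retriable list above; should_retry is a
# hypothetical sketch of the classification, not the real executor code.
RETRIABLE_REASONS = {
    "worker_unavailable",
    "queue_timeout",
    "worker_heartbeat_stale",
    "transient_error",
    "manual_retry",
}

def should_retry(reason: str, retry_count: int, max_retries: int) -> bool:
    """Retry only retriable reasons, and only while attempts remain."""
    return reason in RETRIABLE_REASONS and retry_count < max_retries
```

Validation errors, permission failures, and other non-retriable reasons simply never appear in the set, so they fall through to a permanent FAILED state.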
Retry Backoff
Strategy: Exponential backoff with jitter
Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300 seconds)
Jitter: ±20% randomization to avoid thundering herd
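The schedule above reduces to min(base * 2^N, 300) scaled by a ±20% jitter factor. A minimal sketch of that formula (names and signature are illustrative, not the RetryManager internals):

```python
import random

def backoff_secs(attempt: int, base: float = 1.0, multiplier: float = 2.0,
                 max_secs: float = 300.0, jitter: float = 0.2,
                 rng=random.random) -> float:
    """Exponential backoff capped at max_secs, with +/- jitter randomization."""
    raw = min(base * multiplier ** attempt, max_secs)   # exponential growth, capped
    factor = 1.0 - jitter + 2.0 * jitter * rng()        # uniform in [1-jitter, 1+jitter)
    return raw * factor
```

With the defaults this yields roughly 1 s, 2 s, 4 s, 8 s, ... capped at 300 s, each sample scattered by up to 20% in either direction so simultaneous failures don't all retry at the same instant.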
Retry Configuration
// Default retry configuration
RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,   // 5 minutes max
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,      // 20% jitter
}
Worker Health
Health States
Healthy:
- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%
Degraded:
- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives tasks but deprioritized
Unhealthy:
- Heartbeat > 30 seconds old
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new tasks
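The three states follow from a threshold check where the unhealthy criteria take precedence. A sketch using the thresholds listed above (function and parameter names are illustrative):

```python
def classify_health(heartbeat_age_secs: float, consecutive_failures: int,
                    queue_depth: int, failure_rate: float) -> str:
    """Classify a worker using the documented thresholds; unhealthy wins ties."""
    if (heartbeat_age_secs > 30 or consecutive_failures >= 10
            or queue_depth >= 100 or failure_rate >= 0.7):
        return "unhealthy"
    if consecutive_failures >= 3 or queue_depth >= 50 or failure_rate >= 0.3:
        return "degraded"
    return "healthy"
```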
Health Metrics
Workers self-report health in capabilities:
{
  "runtimes": ["shell", "python"],
  "health": {
    "status": "healthy",
    "last_check": "2026-02-09T12:00:00Z",
    "consecutive_failures": 0,
    "total_executions": 1000,
    "failed_executions": 20,
    "average_execution_time_ms": 1500,
    "queue_depth": 5
  }
}
Worker Selection
Selection Priority:
- Healthy workers (queue depth ascending)
- Degraded workers (queue depth ascending)
- Skip unhealthy workers
Example:
Worker A: Healthy, queue=5 ← Selected first
Worker B: Healthy, queue=20 ← Selected second
Worker C: Degraded, queue=10 ← Selected third
Worker D: Unhealthy, queue=0 ← Never selected
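The priority order amounts to a two-part sort key: health tier first, queue depth second, with unhealthy workers excluded entirely. A sketch of that selection (the tuple data shape is illustrative):

```python
HEALTH_RANK = {"healthy": 0, "degraded": 1}  # unhealthy workers are excluded

def pick_worker(workers):
    """workers: iterable of (name, health, queue_depth); returns best or None."""
    candidates = [w for w in workers if w[1] in HEALTH_RANK]
    if not candidates:
        return None
    return min(candidates, key=lambda w: (HEALTH_RANK[w[1]], w[2]))
```

Applied to the example above, workers A, B, C are chosen in that order as each becomes unavailable, and D is never selected despite its empty queue.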
Database Schema
Execution Retry Fields
-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)
Action Configuration Fields
-- Added to action table
timeout_seconds INTEGER, -- Per-action timeout override
max_retries INTEGER DEFAULT 0 -- Per-action retry limit
Helper Functions
-- Check if execution can be retried
SELECT is_execution_retriable(123);
-- Get worker queue depth
SELECT get_worker_queue_depth(1);
Views
-- Get all healthy workers
SELECT * FROM healthy_workers;
Practical Examples
Example 1: View Retry Chain
-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
SELECT id, retry_count, retry_reason, original_execution, status
FROM execution
WHERE id = 100
UNION ALL
SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
FROM execution e
JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;
Example 2: Analyze Retry Success Rate
-- Success rate of retries by reason
SELECT
config->>'retry_reason' as reason,
COUNT(*) as total_retries,
COUNT(CASE WHEN status = 'completed' THEN 1 END) as succeeded,
ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) as success_rate
FROM execution
WHERE retry_count > 0
GROUP BY config->>'retry_reason'
ORDER BY total_retries DESC;
Example 3: Find Workers by Health
-- Workers sorted by health and load
SELECT
w.name,
w.status,
(w.capabilities->'health'->>'status')::TEXT as health,
(w.capabilities->'health'->>'queue_depth')::INTEGER as queue,
(w.capabilities->'health'->>'consecutive_failures')::INTEGER as failures,
w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY
CASE (w.capabilities->'health'->>'status')::TEXT
WHEN 'healthy' THEN 1
WHEN 'degraded' THEN 2
WHEN 'unhealthy' THEN 3
ELSE 4
END,
(w.capabilities->'health'->>'queue_depth')::INTEGER;
Example 4: Manual Retry via API
# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"action_ref": "core.echo",
"parameters": {"message": "retry test"},
"config": {
"retry_of": 123,
"retry_count": 1,
"max_retries": 3,
"retry_reason": "manual_retry",
"original_execution": 123
}
}'
Monitoring
Key Metrics
Retry Metrics:
- Retry rate: % of executions that retry
- Retry success rate: % of retries that succeed
- Average retries per execution
- Retry reason distribution
Health Metrics:
- Healthy worker count
- Degraded worker count
- Unhealthy worker count
- Average queue depth per worker
- Average failure rate per worker
SQL Queries
-- Retry rate over last hour
SELECT
COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) as original_executions,
COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) as retry_executions,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) /
COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 2) as retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';
-- Worker health distribution
SELECT
COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') as health_status,
COUNT(*) as worker_count,
AVG((capabilities->'health'->>'queue_depth')::INTEGER) as avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;
Configuration
Retry Configuration
// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,
});
Health Probe Configuration
// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,
    failure_rate_unhealthy: 0.7,
});
Troubleshooting
High Retry Rate
Symptoms: Many executions retrying repeatedly
Causes:
- Workers unstable or frequently restarting
- Network issues causing transient failures
- Actions not idempotent (retry makes things worse)
Resolution:
- Check worker stability: docker compose ps
- Review action idempotency
- Adjust max_retries if retries are unhelpful
- Investigate root cause of failures
Retries Not Triggering
Symptoms: Failed executions not retrying despite max_retries > 0
Causes:
- Action doesn't have max_retries set
- Failure is non-retriable (validation error, etc.)
- Global retry disabled
Resolution:
- Check action configuration: SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';
- Check failure message for retriable patterns
- Verify retry enabled in executor config
Workers Marked Unhealthy
Symptoms: Workers not receiving tasks
Causes:
- High queue depth (overloaded)
- Consecutive failures exceed threshold
- Heartbeat stale
Resolution:
- Check worker logs: docker compose logs -f worker-shell
- Verify heartbeat: SELECT name, last_heartbeat FROM worker;
- Check queue depth in capabilities
- Restart worker if stuck: docker compose restart worker-shell
Retry Loops
Symptoms: Execution retries forever or excessive retries
Causes:
- Bug in retry reason detection
- Action failure always classified as retriable
- max_retries not being enforced
Resolution:
- Check retry chain: See Example 1 above
- Verify max_retries: SELECT config FROM execution WHERE id = 123;
- Fix retry reason classification if incorrect
- Manually fail execution if stuck
Integration with Previous Phases
Phase 1 + Phase 2 + Phase 3 Together
Defense in Depth:
- Phase 1 (Timeout Monitor): Catches stuck SCHEDULED executions (30s-5min)
- Phase 2 (Queue TTL/DLQ): Expires messages in worker queues (5min)
- Phase 3 (Intelligent Retry): Retries retriable failures (1s-5min backoff)
Failure Flow:
Execution dispatched → Worker unavailable (Phase 2: 5min TTL)
→ DLQ handler marks FAILED (Phase 2)
→ Retry manager creates retry (Phase 3)
→ Retry dispatched with backoff (Phase 3)
→ Success or exhaust retries
Backup Safety Net: If the Phase 3 retry manager fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.
Best Practices
Action Design for Retries
- Make actions idempotent: Safe to run multiple times
- Set realistic timeouts: Based on typical execution time
- Configure appropriate max_retries:
- Network calls: 3-5 retries
- Database operations: 2-3 retries
- External APIs: 3 retries
- Local operations: 0-1 retries
Worker Health Management
- Report queue depth regularly: Update every heartbeat
- Track failure metrics: Consecutive failures, total/failed counts
- Implement graceful degradation: Continue working when degraded
- Fail fast when unhealthy: Stop accepting work if overloaded
Monitoring Strategy
- Alert on high retry rates: > 20% of executions retrying
- Alert on unhealthy workers: > 50% workers unhealthy
- Track retry success rate: Should be > 70%
- Monitor queue depths: Average should stay < 20
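The alert thresholds above can be wired into a simple periodic check. The sketch below encodes those rules; the function and alert names are hypothetical, not part of any shipped monitoring API:

```python
def monitoring_alerts(retry_rate, retry_success_rate, unhealthy_fraction, avg_queue_depth):
    """Return the names of any alert thresholds (from the list above) that fired."""
    alerts = []
    if retry_rate > 0.20:                # > 20% of executions retrying
        alerts.append("high_retry_rate")
    if unhealthy_fraction > 0.50:        # > 50% of workers unhealthy
        alerts.append("many_unhealthy_workers")
    if retry_success_rate < 0.70:        # retry success should stay > 70%
        alerts.append("low_retry_success")
    if avg_queue_depth >= 20:            # average queue depth should stay < 20
        alerts.append("high_queue_depth")
    return alerts
```

The inputs could be fed from the SQL queries in the Monitoring section above, on whatever cadence your alerting pipeline supports.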
See Also
- Architecture: docs/architecture/worker-availability-handling.md
- Phase 1 Guide: docs/QUICKREF-worker-availability-phase1.md
- Phase 2 Guide: docs/QUICKREF-worker-queue-ttl-dlq.md
- Migration: migrations/20260209000000_phase3_retry_and_health.sql