# Quick Reference: Phase 3 - Intelligent Retry & Worker Health
## Overview

Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.

**Key Features:**

- **Automatic Retry:** Failed executions automatically retry with exponential backoff
- **Health-Aware Scheduling:** Prefer healthy workers with low queue depth
- **Per-Action Configuration:** Custom timeouts and retry limits per action
- **Failure Classification:** Distinguish retriable vs non-retriable failures
## Quick Start
### Enable Retry for an Action

```yaml
# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120  # Custom timeout (overrides the global 5 min default)
max_retries: 3        # Retry up to 3 times on failure
parameters:
  url:
    type: string
    required: true
```
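The timeout override resolves per action; a minimal sketch of the rule, with a hypothetical `effective_timeout_secs` helper (the 300-second constant reflects the global 5-minute default mentioned above):

```rust
// Hypothetical helper illustrating the override rule:
// a per-action timeout_seconds wins; otherwise the global default applies.
fn effective_timeout_secs(action_timeout: Option<u64>) -> u64 {
    const GLOBAL_DEFAULT_SECS: u64 = 300; // global 5-minute default
    action_timeout.unwrap_or(GLOBAL_DEFAULT_SECS)
}

fn main() {
    assert_eq!(effective_timeout_secs(Some(120)), 120); // flaky_api_call above
    assert_eq!(effective_timeout_secs(None), 300);      // action with no override
}
```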
### Database Migration

```bash
# Apply Phase 3 schema changes
sqlx migrate run

# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql
```
### Check Worker Health

```bash
# View healthy workers
psql -c "SELECT * FROM healthy_workers;"

# Check a specific worker's health
psql -c "
SELECT
    name,
    capabilities->'health'->>'status' AS health_status,
    capabilities->'health'->>'queue_depth' AS queue_depth,
    capabilities->'health'->>'consecutive_failures' AS failures
FROM worker
WHERE id = 1;
"
```
## Retry Behavior

### Retriable Failures

Executions are automatically retried for:

- ✓ Worker unavailable (`worker_unavailable`)
- ✓ Queue timeout/TTL expired (`queue_timeout`)
- ✓ Worker heartbeat stale (`worker_heartbeat_stale`)
- ✓ Transient errors (`transient_error`)
- ✓ Manual retry requested (`manual_retry`)
### Non-Retriable Failures

These failures are NOT retried:

- ✗ Validation errors
- ✗ Permission denied
- ✗ Action not found
- ✗ Invalid parameters
- ✗ Explicit action failure
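The split between the two lists can be sketched as a simple classifier over the reason strings (the function name is illustrative, not the actual implementation):

```rust
// Sketch of the classification implied by the retriable/non-retriable lists.
fn is_retriable(reason: &str) -> bool {
    matches!(
        reason,
        "worker_unavailable"
            | "queue_timeout"
            | "worker_heartbeat_stale"
            | "transient_error"
            | "manual_retry"
    )
}

fn main() {
    assert!(is_retriable("queue_timeout"));
    // Anything outside the retriable set (validation errors, permission
    // denied, etc.) is never requeued.
    assert!(!is_retriable("validation_error"));
}
```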
### Retry Backoff

**Strategy:** Exponential backoff with jitter

```
Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300) seconds
```

**Jitter:** ±20% randomization to avoid a thundering herd
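A minimal sketch of this schedule, assuming the default parameters (base 1 s, multiplier 2, 300 s cap, ±20% jitter); real code would draw a random delay inside the jitter window rather than return its bounds:

```rust
// Exponential backoff capped at max, per the schedule above.
fn backoff_secs(attempt: u32, base: f64, multiplier: f64, max: f64) -> f64 {
    (base * multiplier.powi(attempt as i32)).min(max)
}

// Bounds of the ±jitter_factor window around a delay.
fn jitter_window(delay: f64, jitter_factor: f64) -> (f64, f64) {
    (delay * (1.0 - jitter_factor), delay * (1.0 + jitter_factor))
}

fn main() {
    assert_eq!(backoff_secs(0, 1.0, 2.0, 300.0), 1.0);    // ~1 s
    assert_eq!(backoff_secs(3, 1.0, 2.0, 300.0), 8.0);    // ~8 s
    assert_eq!(backoff_secs(20, 1.0, 2.0, 300.0), 300.0); // capped at 5 min
    assert_eq!(jitter_window(8.0, 0.2), (6.4, 9.6));      // ±20% around 8 s
}
```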
### Retry Configuration

```rust
// Default retry configuration
RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,   // 5 minutes max
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,      // 20% jitter
}
```
## Worker Health

### Health States

**Healthy:**

- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%

**Degraded:**

- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives tasks but deprioritized

**Unhealthy:**

- Heartbeat > 30 seconds old
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new tasks
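These thresholds can be sketched as a classifier (hypothetical; the real probe also weighs heartbeat age, omitted here for brevity):

```rust
#[derive(Debug, PartialEq)]
enum Health { Healthy, Degraded, Unhealthy }

// Worst-signal-wins: any unhealthy threshold forces Unhealthy,
// any degraded threshold forces at least Degraded.
fn classify(consecutive_failures: u32, queue_depth: u32, failure_rate: f64) -> Health {
    if consecutive_failures >= 10 || queue_depth >= 100 || failure_rate >= 0.7 {
        Health::Unhealthy
    } else if consecutive_failures >= 3 || queue_depth >= 50 || failure_rate >= 0.3 {
        Health::Degraded
    } else {
        Health::Healthy
    }
}

fn main() {
    assert_eq!(classify(0, 5, 0.02), Health::Healthy);
    assert_eq!(classify(4, 10, 0.1), Health::Degraded);   // failures in the 3-9 band
    assert_eq!(classify(0, 120, 0.1), Health::Unhealthy); // queue depth ≥ 100
}
```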
### Health Metrics

Workers self-report health in capabilities:

```json
{
  "runtimes": ["shell", "python"],
  "health": {
    "status": "healthy",
    "last_check": "2026-02-09T12:00:00Z",
    "consecutive_failures": 0,
    "total_executions": 1000,
    "failed_executions": 20,
    "average_execution_time_ms": 1500,
    "queue_depth": 5
  }
}
```
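An illustrative in-memory counterpart of this payload, showing how the failure rate is derived from the counters (the type is hypothetical; field names mirror the JSON):

```rust
#[allow(dead_code)]
struct WorkerHealthReport {
    consecutive_failures: u32,
    total_executions: u64,
    failed_executions: u64,
    queue_depth: u32,
}

impl WorkerHealthReport {
    /// Failure rate derived from the counters; 0.0 when nothing has run yet.
    fn failure_rate(&self) -> f64 {
        if self.total_executions == 0 {
            0.0
        } else {
            self.failed_executions as f64 / self.total_executions as f64
        }
    }
}

fn main() {
    // Values from the example payload: 20 failures out of 1000 executions.
    let report = WorkerHealthReport {
        consecutive_failures: 0,
        total_executions: 1000,
        failed_executions: 20,
        queue_depth: 5,
    };
    assert_eq!(report.failure_rate(), 0.02); // well under the 30% degraded threshold
}
```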
### Worker Selection

**Selection Priority:**

1. Healthy workers (queue depth ascending)
2. Degraded workers (queue depth ascending)
3. Skip unhealthy workers

**Example:**

```
Worker A: Healthy,   queue=5  ← Selected first
Worker B: Healthy,   queue=20 ← Selected second
Worker C: Degraded,  queue=10 ← Selected third
Worker D: Unhealthy, queue=0  ← Never selected
```
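A sketch of the ordering rule, assuming each worker is reduced to a (name, health rank, queue depth) tuple (names and the helper are illustrative only):

```rust
// Filter out unhealthy workers, then sort by (health rank, queue depth).
// Health rank: 1 = healthy, 2 = degraded, 3 = unhealthy.
fn select_order(mut workers: Vec<(&'static str, u8, u32)>) -> Vec<&'static str> {
    workers.retain(|&(_, rank, _)| rank < 3); // unhealthy workers never selected
    workers.sort_by_key(|&(_, rank, queue)| (rank, queue));
    workers.into_iter().map(|(name, _, _)| name).collect()
}

fn main() {
    // The example above: A and B healthy, C degraded, D unhealthy.
    let order = select_order(vec![
        ("A", 1, 5),
        ("B", 1, 20),
        ("C", 2, 10),
        ("D", 3, 0),
    ]);
    assert_eq!(order, vec!["A", "B", "C"]);
}
```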
## Database Schema

### Execution Retry Fields

```sql
-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)
```
### Action Configuration Fields

```sql
-- Added to action table
timeout_seconds INTEGER,       -- Per-action timeout override
max_retries INTEGER DEFAULT 0  -- Per-action retry limit
```
### Helper Functions

```sql
-- Check if execution can be retried
SELECT is_execution_retriable(123);

-- Get worker queue depth
SELECT get_worker_queue_depth(1);
```

### Views

```sql
-- Get all healthy workers
SELECT * FROM healthy_workers;
```
## Practical Examples

### Example 1: View Retry Chain

```sql
-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
    SELECT id, retry_count, retry_reason, original_execution, status
    FROM execution
    WHERE id = 100

    UNION ALL

    SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
    FROM execution e
    JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;
```
### Example 2: Analyze Retry Success Rate

```sql
-- Success rate of retries by reason
SELECT
    config->>'retry_reason' AS reason,
    COUNT(*) AS total_retries,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) AS succeeded,
    ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) AS success_rate
FROM execution
WHERE retry_count > 0
GROUP BY config->>'retry_reason'
ORDER BY total_retries DESC;
```
### Example 3: Find Workers by Health

```sql
-- Workers sorted by health and load
SELECT
    w.name,
    w.status,
    (w.capabilities->'health'->>'status')::TEXT AS health,
    (w.capabilities->'health'->>'queue_depth')::INTEGER AS queue,
    (w.capabilities->'health'->>'consecutive_failures')::INTEGER AS failures,
    w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY
    CASE (w.capabilities->'health'->>'status')::TEXT
        WHEN 'healthy' THEN 1
        WHEN 'degraded' THEN 2
        WHEN 'unhealthy' THEN 3
        ELSE 4
    END,
    (w.capabilities->'health'->>'queue_depth')::INTEGER;
```
### Example 4: Manual Retry via API

```bash
# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "retry test"},
    "config": {
      "retry_of": 123,
      "retry_count": 1,
      "max_retries": 3,
      "retry_reason": "manual_retry",
      "original_execution": 123
    }
  }'
```
## Monitoring

### Key Metrics

**Retry Metrics:**

- Retry rate: % of executions that retry
- Retry success rate: % of retries that succeed
- Average retries per execution
- Retry reason distribution

**Health Metrics:**

- Healthy worker count
- Degraded worker count
- Unhealthy worker count
- Average queue depth per worker
- Average failure rate per worker
### SQL Queries

```sql
-- Retry rate over the last hour
-- (NULLIF guards against division by zero when there are no originals)
SELECT
    COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) AS original_executions,
    COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) AS retry_executions,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) /
          NULLIF(COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 0), 2) AS retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';

-- Worker health distribution
SELECT
    COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') AS health_status,
    COUNT(*) AS worker_count,
    AVG((capabilities->'health'->>'queue_depth')::INTEGER) AS avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;
```
## Configuration

### Retry Configuration

```rust
// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,
});
```
### Health Probe Configuration

```rust
// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,
    failure_rate_unhealthy: 0.7,
});
```
## Troubleshooting

### High Retry Rate

**Symptoms:** Many executions retrying repeatedly

**Causes:**

- Workers unstable or frequently restarting
- Network issues causing transient failures
- Actions not idempotent (retry makes things worse)

**Resolution:**

1. Check worker stability: `docker compose ps`
2. Review action idempotency
3. Adjust `max_retries` if retries are unhelpful
4. Investigate the root cause of the failures
### Retries Not Triggering

**Symptoms:** Failed executions not retrying despite `max_retries` > 0

**Causes:**

- Action doesn't have `max_retries` set
- Failure is non-retriable (validation error, etc.)
- Retry disabled globally

**Resolution:**

1. Check the action configuration: `SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';`
2. Check the failure message against the retriable patterns
3. Verify retry is enabled in the executor config
### Workers Marked Unhealthy

**Symptoms:** Workers not receiving tasks

**Causes:**

- High queue depth (overloaded)
- Consecutive failures exceed threshold
- Heartbeat stale

**Resolution:**

1. Check worker logs: `docker compose logs -f worker-shell`
2. Verify heartbeat: `SELECT name, last_heartbeat FROM worker;`
3. Check queue depth in capabilities
4. Restart worker if stuck: `docker compose restart worker-shell`
### Retry Loops

**Symptoms:** Executions retry forever or far more than expected

**Causes:**

- Bug in retry reason detection
- Action failures always classified as retriable
- `max_retries` not being enforced

**Resolution:**

1. Check the retry chain (see Example 1 above)
2. Verify `max_retries`: `SELECT config FROM execution WHERE id = 123;`
3. Fix the retry reason classification if it is incorrect
4. Manually fail the execution if it is stuck
## Integration with Previous Phases

### Phase 1 + Phase 2 + Phase 3 Together

**Defense in Depth:**

1. **Phase 1 (Timeout Monitor):** Catches stuck SCHEDULED executions (30 s to 5 min)
2. **Phase 2 (Queue TTL/DLQ):** Expires messages stuck in worker queues (5 min TTL)
3. **Phase 3 (Intelligent Retry):** Retries retriable failures (1 s to 5 min backoff)

**Failure Flow:**

```
Execution dispatched → Worker unavailable (Phase 2: 5 min TTL)
                     → DLQ handler marks FAILED (Phase 2)
                     → Retry manager creates retry (Phase 3)
                     → Retry dispatched with backoff (Phase 3)
                     → Success, or retries exhausted
```

**Backup Safety Net:**

If the Phase 3 retry manager fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.
## Best Practices

### Action Design for Retries

1. **Make actions idempotent:** Safe to run multiple times
2. **Set realistic timeouts:** Based on typical execution time
3. **Configure appropriate `max_retries`:**
   - Network calls: 3-5 retries
   - Database operations: 2-3 retries
   - External APIs: 3 retries
   - Local operations: 0-1 retries
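One common way to get idempotency is a deduplication key: re-running the same logical operation becomes a no-op. A minimal in-memory sketch (all names hypothetical; a real action would persist the key, e.g. behind a unique database constraint):

```rust
use std::collections::HashSet;

// Tracks idempotency keys that have already been applied.
struct PaymentProcessor {
    seen: HashSet<String>,
}

impl PaymentProcessor {
    /// Returns true if the operation took effect, false if it was a duplicate.
    fn apply(&mut self, idempotency_key: &str) -> bool {
        // HashSet::insert returns false when the key was already present.
        self.seen.insert(idempotency_key.to_string())
    }
}

fn main() {
    let mut p = PaymentProcessor { seen: HashSet::new() };
    assert!(p.apply("exec-123"));  // first attempt takes effect
    assert!(!p.apply("exec-123")); // retry of the same execution is a safe no-op
}
```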
### Worker Health Management

1. **Report queue depth regularly:** Update it on every heartbeat
2. **Track failure metrics:** Consecutive failures, total/failed counts
3. **Implement graceful degradation:** Continue working when degraded
4. **Fail fast when unhealthy:** Stop accepting work if overloaded
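The failure-metric bookkeeping in point 2 can be sketched like this (illustrative type; the key detail is that `consecutive_failures` resets on success while the totals only grow):

```rust
#[derive(Default)]
struct FailureTracker {
    consecutive_failures: u32,
    total_executions: u64,
    failed_executions: u64,
}

impl FailureTracker {
    // Record one finished execution and update all three counters.
    fn record(&mut self, succeeded: bool) {
        self.total_executions += 1;
        if succeeded {
            self.consecutive_failures = 0; // a success breaks the streak
        } else {
            self.consecutive_failures += 1;
            self.failed_executions += 1;
        }
    }
}

fn main() {
    let mut t = FailureTracker::default();
    t.record(false);
    t.record(false);
    t.record(true); // success resets the consecutive counter
    assert_eq!(t.consecutive_failures, 0);
    assert_eq!(t.failed_executions, 2);
    assert_eq!(t.total_executions, 3);
}
```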
### Monitoring Strategy

1. **Alert on high retry rates:** > 20% of executions retrying
2. **Alert on unhealthy workers:** > 50% of workers unhealthy
3. **Track retry success rate:** Should be > 70%
4. **Monitor queue depths:** Average should stay < 20
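These thresholds can be encoded as a simple check (hypothetical helper; a real monitor would feed it from the SQL queries in the Monitoring section):

```rust
// Returns a description of every threshold from the list above that is breached.
fn alerts(
    retry_rate: f64,          // fraction of executions that are retries
    unhealthy_fraction: f64,  // fraction of workers marked unhealthy
    retry_success_rate: f64,  // fraction of retries that succeed
    avg_queue_depth: f64,
) -> Vec<&'static str> {
    let mut out = Vec::new();
    if retry_rate > 0.20 { out.push("high retry rate"); }
    if unhealthy_fraction > 0.50 { out.push("too many unhealthy workers"); }
    if retry_success_rate < 0.70 { out.push("low retry success rate"); }
    if avg_queue_depth >= 20.0 { out.push("queue depth too high"); }
    out
}

fn main() {
    assert!(alerts(0.05, 0.0, 0.9, 4.0).is_empty()); // all within bounds
    assert_eq!(alerts(0.35, 0.0, 0.9, 4.0), vec!["high retry rate"]);
}
```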
## See Also

- **Architecture:** `docs/architecture/worker-availability-handling.md`
- **Phase 1 Guide:** `docs/QUICKREF-worker-availability-phase1.md`
- **Phase 2 Guide:** `docs/QUICKREF-worker-queue-ttl-dlq.md`
- **Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`