# Quick Reference: Phase 3 - Intelligent Retry & Worker Health

## Overview

Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.

**Key Features:**

- **Automatic Retry:** Failed executions automatically retry with exponential backoff
- **Health-Aware Scheduling:** Prefer healthy workers with low queue depth
- **Per-Action Configuration:** Custom timeouts and retry limits per action
- **Failure Classification:** Distinguish retriable vs non-retriable failures
## Quick Start

### Enable Retry for an Action

```yaml
# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120  # Custom timeout (overrides global 5 min)
max_retries: 3        # Retry up to 3 times on failure
parameters:
  url:
    type: string
    required: true
```
### Database Migration

```bash
# Apply Phase 3 schema changes
sqlx migrate run

# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql
```
### Check Worker Health

```bash
# View healthy workers
psql -c "SELECT * FROM healthy_workers;"

# Check specific worker health
psql -c "
SELECT
    name,
    capabilities->'health'->>'status' as health_status,
    capabilities->'health'->>'queue_depth' as queue_depth,
    capabilities->'health'->>'consecutive_failures' as failures
FROM worker
WHERE id = 1;
"
```
## Retry Behavior

### Retriable Failures

Executions are automatically retried for:

- ✓ Worker unavailable (`worker_unavailable`)
- ✓ Queue timeout/TTL expired (`queue_timeout`)
- ✓ Worker heartbeat stale (`worker_heartbeat_stale`)
- ✓ Transient errors (`transient_error`)
- ✓ Manual retry requested (`manual_retry`)

### Non-Retriable Failures

These failures are NOT retried:

- ✗ Validation errors
- ✗ Permission denied
- ✗ Action not found
- ✗ Invalid parameters
- ✗ Explicit action failure
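The split above can be mirrored in a small classifier. The following is a minimal sketch; the `FailureKind` enum and `is_retriable` helper are illustrative assumptions, not the project's actual API, though the variants mirror the documented retry reasons:

```rust
// Hypothetical sketch: enum and helper names are assumptions, not the
// project's actual types. Variants mirror the documented failure reasons.
#[derive(Debug, PartialEq)]
enum FailureKind {
    // Retriable
    WorkerUnavailable,
    QueueTimeout,
    WorkerHeartbeatStale,
    TransientError,
    ManualRetry,
    // Non-retriable
    ValidationError,
    PermissionDenied,
    ActionNotFound,
    InvalidParameters,
    ActionFailed,
}

fn is_retriable(kind: &FailureKind) -> bool {
    matches!(
        kind,
        FailureKind::WorkerUnavailable
            | FailureKind::QueueTimeout
            | FailureKind::WorkerHeartbeatStale
            | FailureKind::TransientError
            | FailureKind::ManualRetry
    )
}

fn main() {
    // A transient queue timeout is retried; a validation error is not.
    assert!(is_retriable(&FailureKind::QueueTimeout));
    assert!(!is_retriable(&FailureKind::ValidationError));
    println!("classification ok");
}
```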
### Retry Backoff

**Strategy:** Exponential backoff with jitter

```
Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300 seconds)
```

**Jitter:** ±20% randomization to avoid thundering herd
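The schedule above can be computed directly. A minimal sketch, assuming hypothetical helper names (`backoff_secs`, `with_jitter`) rather than the project's actual functions:

```rust
// Illustrative sketch; function names are assumptions, not the project's API.

/// Raw exponential backoff: min(base * multiplier^attempt, max).
fn backoff_secs(attempt: u32, base: f64, multiplier: f64, max: f64) -> f64 {
    (base * multiplier.powi(attempt as i32)).min(max)
}

/// Apply +/- jitter_factor randomization; `unit_sample` is a uniform
/// draw from [0, 1), e.g. from the `rand` crate in real code.
fn with_jitter(delay: f64, jitter_factor: f64, unit_sample: f64) -> f64 {
    delay * (1.0 - jitter_factor + 2.0 * jitter_factor * unit_sample)
}

fn main() {
    // Matches the documented schedule: 1s, 2s, 4s, 8s, then capped at 300s.
    for attempt in [0, 1, 2, 3, 20] {
        println!("attempt {attempt}: {}s", backoff_secs(attempt, 1.0, 2.0, 300.0));
    }
}
```

With `jitter_factor = 0.2`, a 10-second delay lands somewhere in the 8-12 second band, which spreads retries out instead of releasing them in lockstep.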
### Retry Configuration

```rust
// Default retry configuration
RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,   // 5 minutes max
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,      // 20% jitter
}
```
## Worker Health

### Health States

**Healthy:**

- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%

**Degraded:**

- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives tasks but is deprioritized

**Unhealthy:**

- Heartbeat > 30 seconds old
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new tasks
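The thresholds above fold into a single classifier. A minimal sketch with assumed names (`HealthSnapshot`, `classify`), not the project's actual implementation:

```rust
// Illustrative sketch; type and function names are assumptions.
#[derive(Debug, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

struct HealthSnapshot {
    heartbeat_age_secs: u64,
    consecutive_failures: u32,
    queue_depth: u32,
    failure_rate: f64, // failed / total executions, in [0, 1]
}

fn classify(h: &HealthSnapshot) -> HealthStatus {
    // Check the most severe state first so a single breached
    // unhealthy threshold wins over merely-degraded metrics.
    if h.heartbeat_age_secs > 30
        || h.consecutive_failures >= 10
        || h.queue_depth >= 100
        || h.failure_rate >= 0.7
    {
        HealthStatus::Unhealthy
    } else if h.consecutive_failures >= 3 || h.queue_depth >= 50 || h.failure_rate >= 0.3 {
        HealthStatus::Degraded
    } else {
        HealthStatus::Healthy
    }
}

fn main() {
    let w = HealthSnapshot {
        heartbeat_age_secs: 5,
        consecutive_failures: 0,
        queue_depth: 5,
        failure_rate: 0.02,
    };
    assert_eq!(classify(&w), HealthStatus::Healthy);
    println!("worker is {:?}", classify(&w));
}
```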
### Health Metrics

Workers self-report health in capabilities:

```json
{
  "runtimes": ["shell", "python"],
  "health": {
    "status": "healthy",
    "last_check": "2026-02-09T12:00:00Z",
    "consecutive_failures": 0,
    "total_executions": 1000,
    "failed_executions": 20,
    "average_execution_time_ms": 1500,
    "queue_depth": 5
  }
}
```
### Worker Selection

**Selection Priority:**

1. Healthy workers (queue depth ascending)
2. Degraded workers (queue depth ascending)
3. Skip unhealthy workers

**Example:**

```
Worker A: Healthy, queue=5    ← Selected first
Worker B: Healthy, queue=20   ← Selected second
Worker C: Degraded, queue=10  ← Selected third
Worker D: Unhealthy, queue=0  ← Never selected
```
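The ordering in the example is a filter-then-sort. A minimal sketch with assumed names (`Health`, `selection_order`), not the scheduler's actual code:

```rust
// Illustrative sketch; names are assumptions, not the project's API.
#[derive(Clone, Copy, PartialEq)]
enum Health {
    Healthy,
    Degraded,
    Unhealthy,
}

fn rank(h: Health) -> u8 {
    match h {
        Health::Healthy => 1,
        Health::Degraded => 2,
        Health::Unhealthy => 3,
    }
}

/// Returns worker names in dispatch-preference order:
/// healthy before degraded, lower queue depth first,
/// unhealthy workers excluded entirely.
fn selection_order(workers: &[(&str, Health, u32)]) -> Vec<String> {
    let mut candidates: Vec<&(&str, Health, u32)> = workers
        .iter()
        .filter(|(_, h, _)| *h != Health::Unhealthy)
        .collect();
    candidates.sort_by_key(|(_, h, q)| (rank(*h), *q));
    candidates.iter().map(|(n, _, _)| n.to_string()).collect()
}

fn main() {
    let workers = [
        ("Worker A", Health::Healthy, 5),
        ("Worker B", Health::Healthy, 20),
        ("Worker C", Health::Degraded, 10),
        ("Worker D", Health::Unhealthy, 0),
    ];
    // Matches the example above: A, B, C; D is never selected.
    assert_eq!(selection_order(&workers), vec!["Worker A", "Worker B", "Worker C"]);
    println!("selection ok");
}
```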
## Database Schema

### Execution Retry Fields

```sql
-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)
```

### Action Configuration Fields

```sql
-- Added to action table
timeout_seconds INTEGER,       -- Per-action timeout override
max_retries INTEGER DEFAULT 0  -- Per-action retry limit
```

### Helper Functions

```sql
-- Check if execution can be retried
SELECT is_execution_retriable(123);

-- Get worker queue depth
SELECT get_worker_queue_depth(1);
```

### Views

```sql
-- Get all healthy workers
SELECT * FROM healthy_workers;
```
## Practical Examples

### Example 1: View Retry Chain

```sql
-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
    SELECT id, retry_count, retry_reason, original_execution, status
    FROM execution
    WHERE id = 100

    UNION ALL

    SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
    FROM execution e
    JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;
```
### Example 2: Analyze Retry Success Rate

```sql
-- Success rate of retries by reason
SELECT
    config->>'retry_reason' as reason,
    COUNT(*) as total_retries,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) as succeeded,
    ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) as success_rate
FROM execution
WHERE retry_count > 0
GROUP BY config->>'retry_reason'
ORDER BY total_retries DESC;
```
### Example 3: Find Workers by Health

```sql
-- Workers sorted by health and load
SELECT
    w.name,
    w.status,
    (w.capabilities->'health'->>'status')::TEXT as health,
    (w.capabilities->'health'->>'queue_depth')::INTEGER as queue,
    (w.capabilities->'health'->>'consecutive_failures')::INTEGER as failures,
    w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY
    CASE (w.capabilities->'health'->>'status')::TEXT
        WHEN 'healthy' THEN 1
        WHEN 'degraded' THEN 2
        WHEN 'unhealthy' THEN 3
        ELSE 4
    END,
    (w.capabilities->'health'->>'queue_depth')::INTEGER;
```
### Example 4: Manual Retry via API

```bash
# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "retry test"},
    "config": {
      "retry_of": 123,
      "retry_count": 1,
      "max_retries": 3,
      "retry_reason": "manual_retry",
      "original_execution": 123
    }
  }'
```
## Monitoring

### Key Metrics

**Retry Metrics:**

- Retry rate: % of executions that retry
- Retry success rate: % of retries that succeed
- Average retries per execution
- Retry reason distribution

**Health Metrics:**

- Healthy worker count
- Degraded worker count
- Unhealthy worker count
- Average queue depth per worker
- Average failure rate per worker
### SQL Queries

```sql
-- Retry rate over last hour
SELECT
    COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) as original_executions,
    COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) as retry_executions,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) /
          COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 2) as retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';

-- Worker health distribution
SELECT
    COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') as health_status,
    COUNT(*) as worker_count,
    AVG((capabilities->'health'->>'queue_depth')::INTEGER) as avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;
```
## Configuration

### Retry Configuration

```rust
// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,
});
```

### Health Probe Configuration

```rust
// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,
    failure_rate_unhealthy: 0.7,
});
```
## Troubleshooting

### High Retry Rate

**Symptoms:** Many executions retrying repeatedly

**Causes:**

- Workers unstable or frequently restarting
- Network issues causing transient failures
- Actions not idempotent (retries make things worse)

**Resolution:**

1. Check worker stability: `docker compose ps`
2. Review action idempotency
3. Lower `max_retries` if retries are unhelpful
4. Investigate the root cause of the failures

### Retries Not Triggering

**Symptoms:** Failed executions not retrying despite `max_retries` > 0

**Causes:**

- Action doesn't have `max_retries` set
- Failure is non-retriable (validation error, etc.)
- Retry disabled globally

**Resolution:**

1. Check the action configuration: `SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';`
2. Check the failure message for retriable patterns
3. Verify retry is enabled in the executor config

### Workers Marked Unhealthy

**Symptoms:** Workers not receiving tasks

**Causes:**

- High queue depth (overloaded)
- Consecutive failures exceed the threshold
- Stale heartbeat

**Resolution:**

1. Check worker logs: `docker compose logs -f worker-shell`
2. Verify heartbeat: `SELECT name, last_heartbeat FROM worker;`
3. Check queue depth in capabilities
4. Restart the worker if stuck: `docker compose restart worker-shell`

### Retry Loops

**Symptoms:** Execution retries forever or retries excessively

**Causes:**

- Bug in retry reason detection
- Action failure always classified as retriable
- `max_retries` not being enforced

**Resolution:**

1. Inspect the retry chain (see Example 1 above)
2. Verify `max_retries`: `SELECT config FROM execution WHERE id = 123;`
3. Fix retry reason classification if incorrect
4. Manually fail the execution if it is stuck
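The `max_retries` enforcement that prevents these loops is a small guard. A minimal sketch (the function name is an assumption, not the project's actual code):

```rust
// Illustrative sketch; the name is an assumption.
/// A retry is created only when the failure is retriable AND the
/// attempt budget is not yet exhausted; this invariant is what
/// stops retry loops.
fn should_retry(retry_count: u32, max_retries: u32, retriable: bool) -> bool {
    retriable && retry_count < max_retries
}

fn main() {
    // With max_retries = 3, attempts 0..=2 may retry; attempt 3 may not.
    assert!(should_retry(2, 3, true));
    assert!(!should_retry(3, 3, true));
    // Non-retriable failures never retry, regardless of remaining budget.
    assert!(!should_retry(0, 3, false));
    println!("guard ok");
}
```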
## Integration with Previous Phases

### Phase 1 + Phase 2 + Phase 3 Together

**Defense in Depth:**

1. **Phase 1 (Timeout Monitor):** Catches stuck SCHEDULED executions (30s-5min)
2. **Phase 2 (Queue TTL/DLQ):** Expires messages in worker queues (5min)
3. **Phase 3 (Intelligent Retry):** Retries retriable failures (1s-5min backoff)

**Failure Flow:**

```
Execution dispatched → Worker unavailable (Phase 2: 5min TTL)
                     → DLQ handler marks FAILED (Phase 2)
                     → Retry manager creates retry (Phase 3)
                     → Retry dispatched with backoff (Phase 3)
                     → Success or retries exhausted
```

**Backup Safety Net:**

If the Phase 3 retry manager fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.
## Best Practices

### Action Design for Retries

1. **Make actions idempotent:** Safe to run multiple times
2. **Set realistic timeouts:** Based on typical execution time
3. **Configure appropriate `max_retries`:**
   - Network calls: 3-5 retries
   - Database operations: 2-3 retries
   - External APIs: 3 retries
   - Local operations: 0-1 retries

### Worker Health Management

1. **Report queue depth regularly:** Update on every heartbeat
2. **Track failure metrics:** Consecutive failures, total/failed counts
3. **Implement graceful degradation:** Continue working when degraded
4. **Fail fast when unhealthy:** Stop accepting work if overloaded

### Monitoring Strategy

1. **Alert on high retry rates:** > 20% of executions retrying
2. **Alert on unhealthy workers:** > 50% of workers unhealthy
3. **Track retry success rate:** Should be > 70%
4. **Monitor queue depths:** Average should stay < 20
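These thresholds can be checked mechanically. A minimal sketch with assumed names (`FleetStats`, `alerts`); the fractions mirror the percentages above:

```rust
// Illustrative sketch; names and struct layout are assumptions.
struct FleetStats {
    retry_rate: f64,          // fraction of executions that retried
    unhealthy_fraction: f64,  // fraction of workers marked unhealthy
    retry_success_rate: f64,  // fraction of retries that succeeded
    avg_queue_depth: f64,
}

fn alerts(s: &FleetStats) -> Vec<&'static str> {
    let mut fired = Vec::new();
    if s.retry_rate > 0.20 {
        fired.push("high retry rate");
    }
    if s.unhealthy_fraction > 0.50 {
        fired.push("too many unhealthy workers");
    }
    if s.retry_success_rate < 0.70 {
        fired.push("low retry success rate");
    }
    if s.avg_queue_depth >= 20.0 {
        fired.push("high average queue depth");
    }
    fired
}

fn main() {
    let healthy_fleet = FleetStats {
        retry_rate: 0.05,
        unhealthy_fraction: 0.0,
        retry_success_rate: 0.9,
        avg_queue_depth: 4.0,
    };
    assert!(alerts(&healthy_fleet).is_empty());
    println!("no alerts");
}
```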
## See Also

- **Architecture:** `docs/architecture/worker-availability-handling.md`
- **Phase 1 Guide:** `docs/QUICKREF-worker-availability-phase1.md`
- **Phase 2 Guide:** `docs/QUICKREF-worker-queue-ttl-dlq.md`
- **Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`