more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions

# Quick Reference: Phase 3 - Intelligent Retry & Worker Health
## Overview
Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.
**Key Features:**
- **Automatic Retry:** Failed executions automatically retry with exponential backoff
- **Health-Aware Scheduling:** Prefer healthy workers with low queue depth
- **Per-Action Configuration:** Custom timeouts and retry limits per action
- **Failure Classification:** Distinguish retriable vs non-retriable failures
## Quick Start
### Enable Retry for an Action
```yaml
# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120 # Custom timeout (overrides global 5 min)
max_retries: 3 # Retry up to 3 times on failure
parameters:
url:
type: string
required: true
```
### Database Migration
```bash
# Apply Phase 3 schema changes
sqlx migrate run
# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql
```
### Check Worker Health
```bash
# View healthy workers
psql -c "SELECT * FROM healthy_workers;"
# Check specific worker health
psql -c "
SELECT
name,
capabilities->'health'->>'status' as health_status,
capabilities->'health'->>'queue_depth' as queue_depth,
capabilities->'health'->>'consecutive_failures' as failures
FROM worker
WHERE id = 1;
"
```
## Retry Behavior
### Retriable Failures
Executions are automatically retried for:
- ✓ Worker unavailable (`worker_unavailable`)
- ✓ Queue timeout/TTL expired (`queue_timeout`)
- ✓ Worker heartbeat stale (`worker_heartbeat_stale`)
- ✓ Transient errors (`transient_error`)
- ✓ Manual retry requested (`manual_retry`)
### Non-Retriable Failures
These failures are NOT retried:
- ✗ Validation errors
- ✗ Permission denied
- ✗ Action not found
- ✗ Invalid parameters
- ✗ Explicit action failure
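The split above can be sketched as a pure classifier. This is an illustrative helper, not the executor's real types; the enum variants simply mirror the reason strings listed above:
```rust
// Hypothetical sketch: classify failure reasons as retriable or not,
// mirroring the lists above. Not the project's actual types.
#[derive(Debug, PartialEq)]
enum FailureKind {
    WorkerUnavailable,
    QueueTimeout,
    WorkerHeartbeatStale,
    TransientError,
    ManualRetry,
    ValidationError,
    PermissionDenied,
    ActionNotFound,
    InvalidParameters,
    ActionFailed,
}

fn is_retriable(kind: &FailureKind) -> bool {
    matches!(
        kind,
        FailureKind::WorkerUnavailable
            | FailureKind::QueueTimeout
            | FailureKind::WorkerHeartbeatStale
            | FailureKind::TransientError
            | FailureKind::ManualRetry
    )
}

fn main() {
    assert!(is_retriable(&FailureKind::QueueTimeout));
    assert!(!is_retriable(&FailureKind::ValidationError));
}
```
Keeping the classification in one exhaustive function makes it easy to audit which failures can loop (see the Retry Loops troubleshooting section).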
### Retry Backoff
**Strategy:** Exponential backoff with jitter
```
Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300 seconds)
```
**Jitter:** ±20% randomization to avoid thundering herd
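The formula above can be sketched as a deterministic function (hypothetical helper; the real implementation draws the jitter randomly per retry, so it is passed in here to keep the function testable):
```rust
// Sketch of min(base * multiplier^attempt, max) with jitter applied.
// `jitter` would be sampled uniformly from [-0.2, 0.2] in practice.
fn backoff_secs(attempt: u32, base: f64, max: f64, multiplier: f64, jitter: f64) -> f64 {
    let raw = base * multiplier.powi(attempt as i32);
    raw.min(max) * (1.0 + jitter)
}

fn main() {
    // Defaults from RetryConfig below: base 1s, cap 300s, multiplier 2.0.
    assert_eq!(backoff_secs(0, 1.0, 300.0, 2.0, 0.0), 1.0);
    assert_eq!(backoff_secs(3, 1.0, 300.0, 2.0, 0.0), 8.0);
    assert_eq!(backoff_secs(20, 1.0, 300.0, 2.0, 0.0), 300.0); // capped at max
}
```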
### Retry Configuration
```rust
// Default retry configuration
RetryConfig {
enabled: true,
base_backoff_secs: 1,
max_backoff_secs: 300, // 5 minutes max
backoff_multiplier: 2.0,
jitter_factor: 0.2, // 20% jitter
}
```
## Worker Health
### Health States
**Healthy:**
- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%
**Degraded:**
- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives tasks but deprioritized
**Unhealthy:**
- Heartbeat > 30 seconds old
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new tasks
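The three states above reduce to threshold checks. A minimal sketch, assuming the default thresholds shown later in the Health Probe Configuration section (the real logic lives in `WorkerHealthProbe`):
```rust
// Hypothetical classifier mirroring the thresholds above:
// unhealthy checks first, then degraded, else healthy.
#[derive(Debug, PartialEq)]
enum Health {
    Healthy,
    Degraded,
    Unhealthy,
}

fn classify(heartbeat_age_secs: u64, consecutive_failures: u32,
            queue_depth: u32, failure_rate: f64) -> Health {
    if heartbeat_age_secs > 30
        || consecutive_failures >= 10
        || queue_depth >= 100
        || failure_rate >= 0.7
    {
        Health::Unhealthy
    } else if consecutive_failures >= 3 || queue_depth >= 50 || failure_rate >= 0.3 {
        Health::Degraded
    } else {
        Health::Healthy
    }
}

fn main() {
    assert_eq!(classify(5, 0, 5, 0.02), Health::Healthy);
    assert_eq!(classify(5, 4, 10, 0.1), Health::Degraded);
    assert_eq!(classify(45, 0, 0, 0.0), Health::Unhealthy); // stale heartbeat
}
```
Checking the unhealthy conditions first means a single hard failure (e.g. a stale heartbeat) overrides otherwise-good metrics.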
### Health Metrics
Workers self-report health in capabilities:
```json
{
"runtimes": ["shell", "python"],
"health": {
"status": "healthy",
"last_check": "2026-02-09T12:00:00Z",
"consecutive_failures": 0,
"total_executions": 1000,
"failed_executions": 20,
"average_execution_time_ms": 1500,
"queue_depth": 5
}
}
```
### Worker Selection
**Selection Priority:**
1. Healthy workers (queue depth ascending)
2. Degraded workers (queue depth ascending)
3. Skip unhealthy workers
**Example:**
```
Worker A: Healthy, queue=5 ← Selected first
Worker B: Healthy, queue=20 ← Selected second
Worker C: Degraded, queue=10 ← Selected third
Worker D: Unhealthy, queue=0 ← Never selected
```
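The ordering above can be sketched as a filter-then-sort over an in-memory worker list (illustrative types only; the real scheduler queries the `healthy_workers` view):
```rust
// Hypothetical sketch of the selection priority: drop unhealthy workers,
// rank healthy before degraded, break ties by ascending queue depth.
#[derive(Debug, PartialEq)]
enum Health {
    Healthy,
    Degraded,
    Unhealthy,
}

struct Worker {
    name: &'static str,
    health: Health,
    queue_depth: u32,
}

fn select_order(mut workers: Vec<Worker>) -> Vec<&'static str> {
    workers.retain(|w| w.health != Health::Unhealthy);
    workers.sort_by_key(|w| {
        let rank = match w.health {
            Health::Healthy => 0,
            _ => 1,
        };
        (rank, w.queue_depth)
    });
    workers.into_iter().map(|w| w.name).collect()
}

fn main() {
    let ws = vec![
        Worker { name: "A", health: Health::Healthy, queue_depth: 5 },
        Worker { name: "B", health: Health::Healthy, queue_depth: 20 },
        Worker { name: "C", health: Health::Degraded, queue_depth: 10 },
        Worker { name: "D", health: Health::Unhealthy, queue_depth: 0 },
    ];
    assert_eq!(select_order(ws), vec!["A", "B", "C"]);
}
```
Note that worker D is skipped even though its queue is empty: health rank dominates queue depth.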
## Database Schema
### Execution Retry Fields
```sql
-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)
```
### Action Configuration Fields
```sql
-- Added to action table
timeout_seconds INTEGER, -- Per-action timeout override
max_retries INTEGER DEFAULT 0 -- Per-action retry limit
```
### Helper Functions
```sql
-- Check if execution can be retried
SELECT is_execution_retriable(123);
-- Get worker queue depth
SELECT get_worker_queue_depth(1);
```
### Views
```sql
-- Get all healthy workers
SELECT * FROM healthy_workers;
```
## Practical Examples
### Example 1: View Retry Chain
```sql
-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
SELECT id, retry_count, retry_reason, original_execution, status
FROM execution
WHERE id = 100
UNION ALL
SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
FROM execution e
JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;
```
### Example 2: Analyze Retry Success Rate
```sql
-- Success rate of retries by reason
SELECT
config->>'retry_reason' as reason,
COUNT(*) as total_retries,
COUNT(CASE WHEN status = 'completed' THEN 1 END) as succeeded,
ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) as success_rate
FROM execution
WHERE retry_count > 0
GROUP BY config->>'retry_reason'
ORDER BY total_retries DESC;
```
### Example 3: Find Workers by Health
```sql
-- Workers sorted by health and load
SELECT
w.name,
w.status,
(w.capabilities->'health'->>'status')::TEXT as health,
(w.capabilities->'health'->>'queue_depth')::INTEGER as queue,
(w.capabilities->'health'->>'consecutive_failures')::INTEGER as failures,
w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY
CASE (w.capabilities->'health'->>'status')::TEXT
WHEN 'healthy' THEN 1
WHEN 'degraded' THEN 2
WHEN 'unhealthy' THEN 3
ELSE 4
END,
(w.capabilities->'health'->>'queue_depth')::INTEGER;
```
### Example 4: Manual Retry via API
```bash
# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"action_ref": "core.echo",
"parameters": {"message": "retry test"},
"config": {
"retry_of": 123,
"retry_count": 1,
"max_retries": 3,
"retry_reason": "manual_retry",
"original_execution": 123
}
}'
```
## Monitoring
### Key Metrics
**Retry Metrics:**
- Retry rate: % of executions that retry
- Retry success rate: % of retries that succeed
- Average retries per execution
- Retry reason distribution
**Health Metrics:**
- Healthy worker count
- Degraded worker count
- Unhealthy worker count
- Average queue depth per worker
- Average failure rate per worker
### SQL Queries
```sql
-- Retry rate over last hour
SELECT
COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) as original_executions,
COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) as retry_executions,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) /
        NULLIF(COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 0), 2) as retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';
-- Worker health distribution
SELECT
COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') as health_status,
COUNT(*) as worker_count,
AVG((capabilities->'health'->>'queue_depth')::INTEGER) as avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;
```
## Configuration
### Retry Configuration
```rust
// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
enabled: true,
base_backoff_secs: 1,
max_backoff_secs: 300,
backoff_multiplier: 2.0,
jitter_factor: 0.2,
});
```
### Health Probe Configuration
```rust
// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
enabled: true,
heartbeat_max_age_secs: 30,
degraded_threshold: 3,
unhealthy_threshold: 10,
queue_depth_degraded: 50,
queue_depth_unhealthy: 100,
failure_rate_degraded: 0.3,
failure_rate_unhealthy: 0.7,
});
```
## Troubleshooting
### High Retry Rate
**Symptoms:** Many executions retrying repeatedly
**Causes:**
- Workers unstable or frequently restarting
- Network issues causing transient failures
- Actions not idempotent (retry makes things worse)
**Resolution:**
1. Check worker stability: `docker compose ps`
2. Review action idempotency
3. Adjust `max_retries` if retries are unhelpful
4. Investigate root cause of failures
### Retries Not Triggering
**Symptoms:** Failed executions not retrying despite max_retries > 0
**Causes:**
- Action doesn't have `max_retries` set
- Failure is non-retriable (validation error, etc.)
- Global retry disabled
**Resolution:**
1. Check action configuration: `SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';`
2. Check failure message for retriable patterns
3. Verify retry enabled in executor config
### Workers Marked Unhealthy
**Symptoms:** Workers not receiving tasks
**Causes:**
- High queue depth (overloaded)
- Consecutive failures exceed threshold
- Heartbeat stale
**Resolution:**
1. Check worker logs: `docker compose logs -f worker-shell`
2. Verify heartbeat: `SELECT name, last_heartbeat FROM worker;`
3. Check queue depth in capabilities
4. Restart worker if stuck: `docker compose restart worker-shell`
### Retry Loops
**Symptoms:** Execution retries forever or excessive retries
**Causes:**
- Bug in retry reason detection
- Action failure always classified as retriable
- max_retries not being enforced
**Resolution:**
1. Check retry chain: See Example 1 above
2. Verify max_retries: `SELECT config FROM execution WHERE id = 123;`
3. Fix retry reason classification if incorrect
4. Manually fail execution if stuck
## Integration with Previous Phases
### Phase 1 + Phase 2 + Phase 3 Together
**Defense in Depth:**
1. **Phase 1 (Timeout Monitor):** Catches stuck SCHEDULED executions (30s-5min)
2. **Phase 2 (Queue TTL/DLQ):** Expires messages in worker queues (5min)
3. **Phase 3 (Intelligent Retry):** Retries retriable failures (1s-5min backoff)
**Failure Flow:**
```
Execution dispatched → Worker unavailable (Phase 2: 5min TTL)
→ DLQ handler marks FAILED (Phase 2)
→ Retry manager creates retry (Phase 3)
→ Retry dispatched with backoff (Phase 3)
→ Success or exhaust retries
```
**Backup Safety Net:**
If the Phase 3 retry manager fails to create a retry, the Phase 1 timeout monitor will still catch the stuck execution.
## Best Practices
### Action Design for Retries
1. **Make actions idempotent:** Safe to run multiple times
2. **Set realistic timeouts:** Based on typical execution time
3. **Configure appropriate max_retries:**
- Network calls: 3-5 retries
- Database operations: 2-3 retries
- External APIs: 3 retries
- Local operations: 0-1 retries
### Worker Health Management
1. **Report queue depth regularly:** Update every heartbeat
2. **Track failure metrics:** Consecutive failures, total/failed counts
3. **Implement graceful degradation:** Continue working when degraded
4. **Fail fast when unhealthy:** Stop accepting work if overloaded
### Monitoring Strategy
1. **Alert on high retry rates:** > 20% of executions retrying
2. **Alert on unhealthy workers:** > 50% workers unhealthy
3. **Track retry success rate:** Should be > 70%
4. **Monitor queue depths:** Average should stay < 20
## See Also
- **Architecture:** `docs/architecture/worker-availability-handling.md`
- **Phase 1 Guide:** `docs/QUICKREF-worker-availability-phase1.md`
- **Phase 2 Guide:** `docs/QUICKREF-worker-queue-ttl-dlq.md`
- **Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`