# Quick Reference: Phase 3 - Intelligent Retry & Worker Health

## Overview

Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.

**Key Features:**

- **Automatic Retry:** Failed executions automatically retry with exponential backoff
- **Health-Aware Scheduling:** Prefer healthy workers with low queue depth
- **Per-Action Configuration:** Custom timeouts and retry limits per action
- **Failure Classification:** Distinguish retriable vs non-retriable failures
## Quick Start

### Enable Retry for an Action

```yaml
# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120  # Custom timeout (overrides global 5 min)
max_retries: 3        # Retry up to 3 times on failure
parameters:
  url:
    type: string
    required: true
```
### Database Migration

```bash
# Apply Phase 3 schema changes
sqlx migrate run

# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql
```
### Check Worker Health

```bash
# View healthy workers
psql -c "SELECT * FROM healthy_workers;"

# Check specific worker health
psql -c "
SELECT
    name,
    capabilities->'health'->>'status' as health_status,
    capabilities->'health'->>'queue_depth' as queue_depth,
    capabilities->'health'->>'consecutive_failures' as failures
FROM worker
WHERE id = 1;
"
```
## Retry Behavior

### Retriable Failures

Executions are automatically retried for:

- ✓ Worker unavailable (`worker_unavailable`)
- ✓ Queue timeout/TTL expired (`queue_timeout`)
- ✓ Worker heartbeat stale (`worker_heartbeat_stale`)
- ✓ Transient errors (`transient_error`)
- ✓ Manual retry requested (`manual_retry`)

### Non-Retriable Failures

These failures are NOT retried:

- ✗ Validation errors
- ✗ Permission denied
- ✗ Action not found
- ✗ Invalid parameters
- ✗ Explicit action failure
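The split above can be mirrored in a small classifier. The following is a minimal sketch; the `FailureKind` enum and `is_retriable` helper are illustrative assumptions, not the project's actual API, though the variants mirror the documented retry reasons:

```rust
// Hypothetical sketch: enum and helper names are assumptions, not the
// project's actual types. Variants mirror the documented failure reasons.
#[derive(Debug, PartialEq)]
enum FailureKind {
    // Retriable
    WorkerUnavailable,
    QueueTimeout,
    WorkerHeartbeatStale,
    TransientError,
    ManualRetry,
    // Non-retriable
    ValidationError,
    PermissionDenied,
    ActionNotFound,
    InvalidParameters,
    ActionFailed,
}

fn is_retriable(kind: &FailureKind) -> bool {
    matches!(
        kind,
        FailureKind::WorkerUnavailable
            | FailureKind::QueueTimeout
            | FailureKind::WorkerHeartbeatStale
            | FailureKind::TransientError
            | FailureKind::ManualRetry
    )
}

fn main() {
    // A transient queue timeout is retried; a validation error is not.
    assert!(is_retriable(&FailureKind::QueueTimeout));
    assert!(!is_retriable(&FailureKind::ValidationError));
    println!("classification ok");
}
```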
### Retry Backoff

**Strategy:** Exponential backoff with jitter

```
Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300 seconds)
```

**Jitter:** ±20% randomization to avoid thundering herd
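The schedule above can be computed directly. A minimal sketch, assuming hypothetical helper names (`backoff_secs`, `with_jitter`) rather than the project's actual functions:

```rust
// Illustrative sketch; function names are assumptions, not the project's API.

/// Raw exponential backoff: min(base * multiplier^attempt, max).
fn backoff_secs(attempt: u32, base: f64, multiplier: f64, max: f64) -> f64 {
    (base * multiplier.powi(attempt as i32)).min(max)
}

/// Apply +/- jitter_factor randomization; `unit_sample` is a uniform
/// draw from [0, 1), e.g. from the `rand` crate in real code.
fn with_jitter(delay: f64, jitter_factor: f64, unit_sample: f64) -> f64 {
    delay * (1.0 - jitter_factor + 2.0 * jitter_factor * unit_sample)
}

fn main() {
    // Matches the documented schedule: 1s, 2s, 4s, 8s, then capped at 300s.
    for attempt in [0, 1, 2, 3, 20] {
        println!("attempt {attempt}: {}s", backoff_secs(attempt, 1.0, 2.0, 300.0));
    }
}
```

With `jitter_factor = 0.2`, a 10-second delay lands somewhere in the 8-12 second band, which spreads retries out instead of releasing them in lockstep.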
### Retry Configuration

```rust
// Default retry configuration
RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,   // 5 minutes max
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,      // 20% jitter
}
```
## Worker Health

### Health States

**Healthy:**

- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%

**Degraded:**

- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives tasks but is deprioritized

**Unhealthy:**

- Heartbeat > 30 seconds old
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new tasks
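The thresholds above fold into a single classifier. A minimal sketch with assumed names (`HealthSnapshot`, `classify`), not the project's actual implementation:

```rust
// Illustrative sketch; type and function names are assumptions.
#[derive(Debug, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

struct HealthSnapshot {
    heartbeat_age_secs: u64,
    consecutive_failures: u32,
    queue_depth: u32,
    failure_rate: f64, // failed / total executions, in [0, 1]
}

fn classify(h: &HealthSnapshot) -> HealthStatus {
    // Check the most severe state first so a single breached
    // unhealthy threshold wins over merely-degraded metrics.
    if h.heartbeat_age_secs > 30
        || h.consecutive_failures >= 10
        || h.queue_depth >= 100
        || h.failure_rate >= 0.7
    {
        HealthStatus::Unhealthy
    } else if h.consecutive_failures >= 3 || h.queue_depth >= 50 || h.failure_rate >= 0.3 {
        HealthStatus::Degraded
    } else {
        HealthStatus::Healthy
    }
}

fn main() {
    let w = HealthSnapshot {
        heartbeat_age_secs: 5,
        consecutive_failures: 0,
        queue_depth: 5,
        failure_rate: 0.02,
    };
    assert_eq!(classify(&w), HealthStatus::Healthy);
    println!("worker is {:?}", classify(&w));
}
```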
### Health Metrics

Workers self-report health in capabilities:

```json
{
  "runtimes": ["shell", "python"],
  "health": {
    "status": "healthy",
    "last_check": "2026-02-09T12:00:00Z",
    "consecutive_failures": 0,
    "total_executions": 1000,
    "failed_executions": 20,
    "average_execution_time_ms": 1500,
    "queue_depth": 5
  }
}
```
### Worker Selection

**Selection Priority:**

1. Healthy workers (queue depth ascending)
2. Degraded workers (queue depth ascending)
3. Skip unhealthy workers

**Example:**

```
Worker A: Healthy, queue=5    ← Selected first
Worker B: Healthy, queue=20   ← Selected second
Worker C: Degraded, queue=10  ← Selected third
Worker D: Unhealthy, queue=0  ← Never selected
```
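The ordering in the example is a filter-then-sort. A minimal sketch with assumed names (`Health`, `selection_order`), not the scheduler's actual code:

```rust
// Illustrative sketch; names are assumptions, not the project's API.
#[derive(Clone, Copy, PartialEq)]
enum Health {
    Healthy,
    Degraded,
    Unhealthy,
}

fn rank(h: Health) -> u8 {
    match h {
        Health::Healthy => 1,
        Health::Degraded => 2,
        Health::Unhealthy => 3,
    }
}

/// Returns worker names in dispatch-preference order:
/// healthy before degraded, lower queue depth first,
/// unhealthy workers excluded entirely.
fn selection_order(workers: &[(&str, Health, u32)]) -> Vec<String> {
    let mut candidates: Vec<&(&str, Health, u32)> = workers
        .iter()
        .filter(|(_, h, _)| *h != Health::Unhealthy)
        .collect();
    candidates.sort_by_key(|(_, h, q)| (rank(*h), *q));
    candidates.iter().map(|(n, _, _)| n.to_string()).collect()
}

fn main() {
    let workers = [
        ("Worker A", Health::Healthy, 5),
        ("Worker B", Health::Healthy, 20),
        ("Worker C", Health::Degraded, 10),
        ("Worker D", Health::Unhealthy, 0),
    ];
    // Matches the example above: A, B, C; D is never selected.
    assert_eq!(selection_order(&workers), vec!["Worker A", "Worker B", "Worker C"]);
    println!("selection ok");
}
```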
## Database Schema

### Execution Retry Fields

```sql
-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)
```

### Action Configuration Fields

```sql
-- Added to action table
timeout_seconds INTEGER,       -- Per-action timeout override
max_retries INTEGER DEFAULT 0  -- Per-action retry limit
```

### Helper Functions

```sql
-- Check if execution can be retried
SELECT is_execution_retriable(123);

-- Get worker queue depth
SELECT get_worker_queue_depth(1);
```

### Views

```sql
-- Get all healthy workers
SELECT * FROM healthy_workers;
```
## Practical Examples

### Example 1: View Retry Chain

```sql
-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
    SELECT id, retry_count, retry_reason, original_execution, status
    FROM execution
    WHERE id = 100

    UNION ALL

    SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
    FROM execution e
    JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;
```
### Example 2: Analyze Retry Success Rate

```sql
-- Success rate of retries by reason
SELECT
    config->>'retry_reason' as reason,
    COUNT(*) as total_retries,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) as succeeded,
    ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) as success_rate
FROM execution
WHERE retry_count > 0
GROUP BY config->>'retry_reason'
ORDER BY total_retries DESC;
```
### Example 3: Find Workers by Health

```sql
-- Workers sorted by health and load
SELECT
    w.name,
    w.status,
    (w.capabilities->'health'->>'status')::TEXT as health,
    (w.capabilities->'health'->>'queue_depth')::INTEGER as queue,
    (w.capabilities->'health'->>'consecutive_failures')::INTEGER as failures,
    w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY
    CASE (w.capabilities->'health'->>'status')::TEXT
        WHEN 'healthy' THEN 1
        WHEN 'degraded' THEN 2
        WHEN 'unhealthy' THEN 3
        ELSE 4
    END,
    (w.capabilities->'health'->>'queue_depth')::INTEGER;
```
### Example 4: Manual Retry via API

```bash
# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "retry test"},
    "config": {
      "retry_of": 123,
      "retry_count": 1,
      "max_retries": 3,
      "retry_reason": "manual_retry",
      "original_execution": 123
    }
  }'
```
## Monitoring

### Key Metrics

**Retry Metrics:**

- Retry rate: % of executions that retry
- Retry success rate: % of retries that succeed
- Average retries per execution
- Retry reason distribution

**Health Metrics:**

- Healthy worker count
- Degraded worker count
- Unhealthy worker count
- Average queue depth per worker
- Average failure rate per worker
### SQL Queries

```sql
-- Retry rate over last hour
SELECT
    COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) as original_executions,
    COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) as retry_executions,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) /
          COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 2) as retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';

-- Worker health distribution
SELECT
    COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') as health_status,
    COUNT(*) as worker_count,
    AVG((capabilities->'health'->>'queue_depth')::INTEGER) as avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;
```
## Configuration

### Retry Configuration

```rust
// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,
});
```

### Health Probe Configuration

```rust
// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,
    failure_rate_unhealthy: 0.7,
});
```
## Troubleshooting

### High Retry Rate

**Symptoms:** Many executions retrying repeatedly

**Causes:**

- Workers unstable or frequently restarting
- Network issues causing transient failures
- Actions not idempotent (retries make things worse)

**Resolution:**

1. Check worker stability: `docker compose ps`
2. Review action idempotency
3. Lower `max_retries` if retries are unhelpful
4. Investigate the root cause of the failures

### Retries Not Triggering

**Symptoms:** Failed executions not retrying despite `max_retries` > 0

**Causes:**

- Action doesn't have `max_retries` set
- Failure is non-retriable (validation error, etc.)
- Retry disabled globally

**Resolution:**

1. Check the action configuration: `SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';`
2. Check the failure message for retriable patterns
3. Verify retry is enabled in the executor config

### Workers Marked Unhealthy

**Symptoms:** Workers not receiving tasks

**Causes:**

- High queue depth (overloaded)
- Consecutive failures exceed the threshold
- Stale heartbeat

**Resolution:**

1. Check worker logs: `docker compose logs -f worker-shell`
2. Verify heartbeat: `SELECT name, last_heartbeat FROM worker;`
3. Check queue depth in capabilities
4. Restart the worker if stuck: `docker compose restart worker-shell`

### Retry Loops

**Symptoms:** Execution retries forever or retries excessively

**Causes:**

- Bug in retry reason detection
- Action failure always classified as retriable
- `max_retries` not being enforced

**Resolution:**

1. Inspect the retry chain (see Example 1 above)
2. Verify `max_retries`: `SELECT config FROM execution WHERE id = 123;`
3. Fix retry reason classification if incorrect
4. Manually fail the execution if it is stuck
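The `max_retries` enforcement that prevents these loops is a small guard. A minimal sketch (the function name is an assumption, not the project's actual code):

```rust
// Illustrative sketch; the name is an assumption.
/// A retry is created only when the failure is retriable AND the
/// attempt budget is not yet exhausted; this invariant is what
/// stops retry loops.
fn should_retry(retry_count: u32, max_retries: u32, retriable: bool) -> bool {
    retriable && retry_count < max_retries
}

fn main() {
    // With max_retries = 3, attempts 0..=2 may retry; attempt 3 may not.
    assert!(should_retry(2, 3, true));
    assert!(!should_retry(3, 3, true));
    // Non-retriable failures never retry, regardless of remaining budget.
    assert!(!should_retry(0, 3, false));
    println!("guard ok");
}
```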
## Integration with Previous Phases

### Phase 1 + Phase 2 + Phase 3 Together

**Defense in Depth:**

1. **Phase 1 (Timeout Monitor):** Catches stuck SCHEDULED executions (30s-5min)
2. **Phase 2 (Queue TTL/DLQ):** Expires messages in worker queues (5min)
3. **Phase 3 (Intelligent Retry):** Retries retriable failures (1s-5min backoff)

**Failure Flow:**

```
Execution dispatched → Worker unavailable (Phase 2: 5min TTL)
                     → DLQ handler marks FAILED (Phase 2)
                     → Retry manager creates retry (Phase 3)
                     → Retry dispatched with backoff (Phase 3)
                     → Success or retries exhausted
```

**Backup Safety Net:**

If the Phase 3 retry manager fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.
## Best Practices

### Action Design for Retries

1. **Make actions idempotent:** Safe to run multiple times
2. **Set realistic timeouts:** Based on typical execution time
3. **Configure appropriate `max_retries`:**
   - Network calls: 3-5 retries
   - Database operations: 2-3 retries
   - External APIs: 3 retries
   - Local operations: 0-1 retries

### Worker Health Management

1. **Report queue depth regularly:** Update on every heartbeat
2. **Track failure metrics:** Consecutive failures, total/failed counts
3. **Implement graceful degradation:** Continue working when degraded
4. **Fail fast when unhealthy:** Stop accepting work if overloaded

### Monitoring Strategy

1. **Alert on high retry rates:** > 20% of executions retrying
2. **Alert on unhealthy workers:** > 50% of workers unhealthy
3. **Track retry success rate:** Should be > 70%
4. **Monitor queue depths:** Average should stay < 20
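These thresholds can be checked mechanically. A minimal sketch with assumed names (`FleetStats`, `alerts`); the fractions mirror the percentages above:

```rust
// Illustrative sketch; names and struct layout are assumptions.
struct FleetStats {
    retry_rate: f64,          // fraction of executions that retried
    unhealthy_fraction: f64,  // fraction of workers marked unhealthy
    retry_success_rate: f64,  // fraction of retries that succeeded
    avg_queue_depth: f64,
}

fn alerts(s: &FleetStats) -> Vec<&'static str> {
    let mut fired = Vec::new();
    if s.retry_rate > 0.20 {
        fired.push("high retry rate");
    }
    if s.unhealthy_fraction > 0.50 {
        fired.push("too many unhealthy workers");
    }
    if s.retry_success_rate < 0.70 {
        fired.push("low retry success rate");
    }
    if s.avg_queue_depth >= 20.0 {
        fired.push("high average queue depth");
    }
    fired
}

fn main() {
    let healthy_fleet = FleetStats {
        retry_rate: 0.05,
        unhealthy_fraction: 0.0,
        retry_success_rate: 0.9,
        avg_queue_depth: 4.0,
    };
    assert!(alerts(&healthy_fleet).is_empty());
    println!("no alerts");
}
```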
## See Also

- **Architecture:** `docs/architecture/worker-availability-handling.md`
- **Phase 1 Guide:** `docs/QUICKREF-worker-availability-phase1.md`
- **Phase 2 Guide:** `docs/QUICKREF-worker-queue-ttl-dlq.md`
- **Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`