# Quick Reference: Phase 3 - Intelligent Retry & Worker Health
## Overview

Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.

**Key Features:**

- **Automatic Retry:** Failed executions automatically retry with exponential backoff
- **Health-Aware Scheduling:** Prefer healthy workers with low queue depth
- **Per-Action Configuration:** Custom timeouts and retry limits per action
- **Failure Classification:** Distinguish retriable vs non-retriable failures
## Quick Start
### Enable Retry for an Action

```yaml
# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120  # Custom timeout (overrides the global 5 min default)
max_retries: 3        # Retry up to 3 times on failure
parameters:
  url:
    type: string
    required: true
```
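The timeout override resolves per action; a minimal sketch of the rule, with a hypothetical `effective_timeout_secs` helper (the 300-second constant reflects the global 5-minute default mentioned above):

```rust
// Hypothetical helper illustrating the override rule:
// a per-action timeout_seconds wins; otherwise the global default applies.
fn effective_timeout_secs(action_timeout: Option<u64>) -> u64 {
    const GLOBAL_DEFAULT_SECS: u64 = 300; // global 5-minute default
    action_timeout.unwrap_or(GLOBAL_DEFAULT_SECS)
}

fn main() {
    assert_eq!(effective_timeout_secs(Some(120)), 120); // flaky_api_call above
    assert_eq!(effective_timeout_secs(None), 300);      // action with no override
}
```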
### Database Migration

```bash
# Apply Phase 3 schema changes
sqlx migrate run

# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql
```
### Check Worker Health

```bash
# View healthy workers
psql -c "SELECT * FROM healthy_workers;"

# Check a specific worker's health
psql -c "
SELECT
    name,
    capabilities->'health'->>'status' AS health_status,
    capabilities->'health'->>'queue_depth' AS queue_depth,
    capabilities->'health'->>'consecutive_failures' AS failures
FROM worker
WHERE id = 1;
"
```
## Retry Behavior

### Retriable Failures

Executions are automatically retried for:

- ✓ Worker unavailable (`worker_unavailable`)
- ✓ Queue timeout/TTL expired (`queue_timeout`)
- ✓ Worker heartbeat stale (`worker_heartbeat_stale`)
- ✓ Transient errors (`transient_error`)
- ✓ Manual retry requested (`manual_retry`)
### Non-Retriable Failures

These failures are NOT retried:

- ✗ Validation errors
- ✗ Permission denied
- ✗ Action not found
- ✗ Invalid parameters
- ✗ Explicit action failure
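The split between the two lists can be sketched as a simple classifier over the reason strings (the function name is illustrative, not the actual implementation):

```rust
// Sketch of the classification implied by the retriable/non-retriable lists.
fn is_retriable(reason: &str) -> bool {
    matches!(
        reason,
        "worker_unavailable"
            | "queue_timeout"
            | "worker_heartbeat_stale"
            | "transient_error"
            | "manual_retry"
    )
}

fn main() {
    assert!(is_retriable("queue_timeout"));
    // Anything outside the retriable set (validation errors, permission
    // denied, etc.) is never requeued.
    assert!(!is_retriable("validation_error"));
}
```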
### Retry Backoff

**Strategy:** Exponential backoff with jitter

```
Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300) seconds
```

**Jitter:** ±20% randomization to avoid a thundering herd
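A minimal sketch of this schedule, assuming the default parameters (base 1 s, multiplier 2, 300 s cap, ±20% jitter); real code would draw a random delay inside the jitter window rather than return its bounds:

```rust
// Exponential backoff capped at max, per the schedule above.
fn backoff_secs(attempt: u32, base: f64, multiplier: f64, max: f64) -> f64 {
    (base * multiplier.powi(attempt as i32)).min(max)
}

// Bounds of the ±jitter_factor window around a delay.
fn jitter_window(delay: f64, jitter_factor: f64) -> (f64, f64) {
    (delay * (1.0 - jitter_factor), delay * (1.0 + jitter_factor))
}

fn main() {
    assert_eq!(backoff_secs(0, 1.0, 2.0, 300.0), 1.0);    // ~1 s
    assert_eq!(backoff_secs(3, 1.0, 2.0, 300.0), 8.0);    // ~8 s
    assert_eq!(backoff_secs(20, 1.0, 2.0, 300.0), 300.0); // capped at 5 min
    assert_eq!(jitter_window(8.0, 0.2), (6.4, 9.6));      // ±20% around 8 s
}
```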
### Retry Configuration

```rust
// Default retry configuration
RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,   // 5 minutes max
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,      // 20% jitter
}
```
## Worker Health

### Health States

**Healthy:**

- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%

**Degraded:**

- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives tasks but deprioritized

**Unhealthy:**

- Heartbeat > 30 seconds old
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new tasks
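These thresholds can be sketched as a classifier (hypothetical; the real probe also weighs heartbeat age, omitted here for brevity):

```rust
#[derive(Debug, PartialEq)]
enum Health { Healthy, Degraded, Unhealthy }

// Worst-signal-wins: any unhealthy threshold forces Unhealthy,
// any degraded threshold forces at least Degraded.
fn classify(consecutive_failures: u32, queue_depth: u32, failure_rate: f64) -> Health {
    if consecutive_failures >= 10 || queue_depth >= 100 || failure_rate >= 0.7 {
        Health::Unhealthy
    } else if consecutive_failures >= 3 || queue_depth >= 50 || failure_rate >= 0.3 {
        Health::Degraded
    } else {
        Health::Healthy
    }
}

fn main() {
    assert_eq!(classify(0, 5, 0.02), Health::Healthy);
    assert_eq!(classify(4, 10, 0.1), Health::Degraded);   // failures in the 3-9 band
    assert_eq!(classify(0, 120, 0.1), Health::Unhealthy); // queue depth ≥ 100
}
```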
### Health Metrics

Workers self-report health in capabilities:

```json
{
  "runtimes": ["shell", "python"],
  "health": {
    "status": "healthy",
    "last_check": "2026-02-09T12:00:00Z",
    "consecutive_failures": 0,
    "total_executions": 1000,
    "failed_executions": 20,
    "average_execution_time_ms": 1500,
    "queue_depth": 5
  }
}
```
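An illustrative in-memory counterpart of this payload, showing how the failure rate is derived from the counters (the type is hypothetical; field names mirror the JSON):

```rust
#[allow(dead_code)]
struct WorkerHealthReport {
    consecutive_failures: u32,
    total_executions: u64,
    failed_executions: u64,
    queue_depth: u32,
}

impl WorkerHealthReport {
    /// Failure rate derived from the counters; 0.0 when nothing has run yet.
    fn failure_rate(&self) -> f64 {
        if self.total_executions == 0 {
            0.0
        } else {
            self.failed_executions as f64 / self.total_executions as f64
        }
    }
}

fn main() {
    // Values from the example payload: 20 failures out of 1000 executions.
    let report = WorkerHealthReport {
        consecutive_failures: 0,
        total_executions: 1000,
        failed_executions: 20,
        queue_depth: 5,
    };
    assert_eq!(report.failure_rate(), 0.02); // well under the 30% degraded threshold
}
```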
### Worker Selection

**Selection Priority:**

1. Healthy workers (queue depth ascending)
2. Degraded workers (queue depth ascending)
3. Skip unhealthy workers

**Example:**

```
Worker A: Healthy,   queue=5  ← Selected first
Worker B: Healthy,   queue=20 ← Selected second
Worker C: Degraded,  queue=10 ← Selected third
Worker D: Unhealthy, queue=0  ← Never selected
```
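A sketch of the ordering rule, assuming each worker is reduced to a (name, health rank, queue depth) tuple (names and the helper are illustrative only):

```rust
// Filter out unhealthy workers, then sort by (health rank, queue depth).
// Health rank: 1 = healthy, 2 = degraded, 3 = unhealthy.
fn select_order(mut workers: Vec<(&'static str, u8, u32)>) -> Vec<&'static str> {
    workers.retain(|&(_, rank, _)| rank < 3); // unhealthy workers never selected
    workers.sort_by_key(|&(_, rank, queue)| (rank, queue));
    workers.into_iter().map(|(name, _, _)| name).collect()
}

fn main() {
    // The example above: A and B healthy, C degraded, D unhealthy.
    let order = select_order(vec![
        ("A", 1, 5),
        ("B", 1, 20),
        ("C", 2, 10),
        ("D", 3, 0),
    ]);
    assert_eq!(order, vec!["A", "B", "C"]);
}
```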
## Database Schema

### Execution Retry Fields

```sql
-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)
```
### Action Configuration Fields

```sql
-- Added to action table
timeout_seconds INTEGER,       -- Per-action timeout override
max_retries INTEGER DEFAULT 0  -- Per-action retry limit
```
### Helper Functions

```sql
-- Check if execution can be retried
SELECT is_execution_retriable(123);

-- Get worker queue depth
SELECT get_worker_queue_depth(1);
```

### Views

```sql
-- Get all healthy workers
SELECT * FROM healthy_workers;
```
## Practical Examples

### Example 1: View Retry Chain

```sql
-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
    SELECT id, retry_count, retry_reason, original_execution, status
    FROM execution
    WHERE id = 100

    UNION ALL

    SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
    FROM execution e
    JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;
```
### Example 2: Analyze Retry Success Rate

```sql
-- Success rate of retries by reason
SELECT
    config->>'retry_reason' AS reason,
    COUNT(*) AS total_retries,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) AS succeeded,
    ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) AS success_rate
FROM execution
WHERE retry_count > 0
GROUP BY config->>'retry_reason'
ORDER BY total_retries DESC;
```
### Example 3: Find Workers by Health

```sql
-- Workers sorted by health and load
SELECT
    w.name,
    w.status,
    (w.capabilities->'health'->>'status')::TEXT AS health,
    (w.capabilities->'health'->>'queue_depth')::INTEGER AS queue,
    (w.capabilities->'health'->>'consecutive_failures')::INTEGER AS failures,
    w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY
    CASE (w.capabilities->'health'->>'status')::TEXT
        WHEN 'healthy' THEN 1
        WHEN 'degraded' THEN 2
        WHEN 'unhealthy' THEN 3
        ELSE 4
    END,
    (w.capabilities->'health'->>'queue_depth')::INTEGER;
```
### Example 4: Manual Retry via API

```bash
# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "retry test"},
    "config": {
      "retry_of": 123,
      "retry_count": 1,
      "max_retries": 3,
      "retry_reason": "manual_retry",
      "original_execution": 123
    }
  }'
```
## Monitoring

### Key Metrics

**Retry Metrics:**

- Retry rate: % of executions that retry
- Retry success rate: % of retries that succeed
- Average retries per execution
- Retry reason distribution

**Health Metrics:**

- Healthy worker count
- Degraded worker count
- Unhealthy worker count
- Average queue depth per worker
- Average failure rate per worker
### SQL Queries

```sql
-- Retry rate over the last hour
-- (NULLIF guards against division by zero when there are no originals)
SELECT
    COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) AS original_executions,
    COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) AS retry_executions,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) /
          NULLIF(COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 0), 2) AS retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';

-- Worker health distribution
SELECT
    COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') AS health_status,
    COUNT(*) AS worker_count,
    AVG((capabilities->'health'->>'queue_depth')::INTEGER) AS avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;
```
## Configuration

### Retry Configuration

```rust
// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,
});
```
### Health Probe Configuration

```rust
// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,
    failure_rate_unhealthy: 0.7,
});
```
## Troubleshooting

### High Retry Rate

**Symptoms:** Many executions retrying repeatedly

**Causes:**

- Workers unstable or frequently restarting
- Network issues causing transient failures
- Actions not idempotent (retry makes things worse)

**Resolution:**

1. Check worker stability: `docker compose ps`
2. Review action idempotency
3. Adjust `max_retries` if retries are unhelpful
4. Investigate the root cause of the failures
### Retries Not Triggering

**Symptoms:** Failed executions not retrying despite `max_retries` > 0

**Causes:**

- Action doesn't have `max_retries` set
- Failure is non-retriable (validation error, etc.)
- Retry disabled globally

**Resolution:**

1. Check the action configuration: `SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';`
2. Check the failure message against the retriable patterns
3. Verify retry is enabled in the executor config
### Workers Marked Unhealthy

**Symptoms:** Workers not receiving tasks

**Causes:**

- High queue depth (overloaded)
- Consecutive failures exceed threshold
- Heartbeat stale

**Resolution:**

1. Check worker logs: `docker compose logs -f worker-shell`
2. Verify heartbeat: `SELECT name, last_heartbeat FROM worker;`
3. Check queue depth in capabilities
4. Restart worker if stuck: `docker compose restart worker-shell`
### Retry Loops

**Symptoms:** Executions retry forever or far more than expected

**Causes:**

- Bug in retry reason detection
- Action failures always classified as retriable
- `max_retries` not being enforced

**Resolution:**

1. Check the retry chain (see Example 1 above)
2. Verify `max_retries`: `SELECT config FROM execution WHERE id = 123;`
3. Fix the retry reason classification if it is incorrect
4. Manually fail the execution if it is stuck
## Integration with Previous Phases

### Phase 1 + Phase 2 + Phase 3 Together

**Defense in Depth:**

1. **Phase 1 (Timeout Monitor):** Catches stuck SCHEDULED executions (30 s to 5 min)
2. **Phase 2 (Queue TTL/DLQ):** Expires messages stuck in worker queues (5 min TTL)
3. **Phase 3 (Intelligent Retry):** Retries retriable failures (1 s to 5 min backoff)

**Failure Flow:**

```
Execution dispatched → Worker unavailable (Phase 2: 5 min TTL)
                     → DLQ handler marks FAILED (Phase 2)
                     → Retry manager creates retry (Phase 3)
                     → Retry dispatched with backoff (Phase 3)
                     → Success, or retries exhausted
```

**Backup Safety Net:**

If the Phase 3 retry manager fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.
## Best Practices

### Action Design for Retries

1. **Make actions idempotent:** Safe to run multiple times
2. **Set realistic timeouts:** Based on typical execution time
3. **Configure appropriate `max_retries`:**
   - Network calls: 3-5 retries
   - Database operations: 2-3 retries
   - External APIs: 3 retries
   - Local operations: 0-1 retries
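One common way to get idempotency is a deduplication key: re-running the same logical operation becomes a no-op. A minimal in-memory sketch (all names hypothetical; a real action would persist the key, e.g. behind a unique database constraint):

```rust
use std::collections::HashSet;

// Tracks idempotency keys that have already been applied.
struct PaymentProcessor {
    seen: HashSet<String>,
}

impl PaymentProcessor {
    /// Returns true if the operation took effect, false if it was a duplicate.
    fn apply(&mut self, idempotency_key: &str) -> bool {
        // HashSet::insert returns false when the key was already present.
        self.seen.insert(idempotency_key.to_string())
    }
}

fn main() {
    let mut p = PaymentProcessor { seen: HashSet::new() };
    assert!(p.apply("exec-123"));  // first attempt takes effect
    assert!(!p.apply("exec-123")); // retry of the same execution is a safe no-op
}
```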
### Worker Health Management

1. **Report queue depth regularly:** Update it on every heartbeat
2. **Track failure metrics:** Consecutive failures, total/failed counts
3. **Implement graceful degradation:** Continue working when degraded
4. **Fail fast when unhealthy:** Stop accepting work if overloaded
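The failure-metric bookkeeping in point 2 can be sketched like this (illustrative type; the key detail is that `consecutive_failures` resets on success while the totals only grow):

```rust
#[derive(Default)]
struct FailureTracker {
    consecutive_failures: u32,
    total_executions: u64,
    failed_executions: u64,
}

impl FailureTracker {
    // Record one finished execution and update all three counters.
    fn record(&mut self, succeeded: bool) {
        self.total_executions += 1;
        if succeeded {
            self.consecutive_failures = 0; // a success breaks the streak
        } else {
            self.consecutive_failures += 1;
            self.failed_executions += 1;
        }
    }
}

fn main() {
    let mut t = FailureTracker::default();
    t.record(false);
    t.record(false);
    t.record(true); // success resets the consecutive counter
    assert_eq!(t.consecutive_failures, 0);
    assert_eq!(t.failed_executions, 2);
    assert_eq!(t.total_executions, 3);
}
```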
### Monitoring Strategy

1. **Alert on high retry rates:** > 20% of executions retrying
2. **Alert on unhealthy workers:** > 50% of workers unhealthy
3. **Track retry success rate:** Should be > 70%
4. **Monitor queue depths:** Average should stay < 20
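These thresholds can be encoded as a simple check (hypothetical helper; a real monitor would feed it from the SQL queries in the Monitoring section):

```rust
// Returns a description of every threshold from the list above that is breached.
fn alerts(
    retry_rate: f64,          // fraction of executions that are retries
    unhealthy_fraction: f64,  // fraction of workers marked unhealthy
    retry_success_rate: f64,  // fraction of retries that succeed
    avg_queue_depth: f64,
) -> Vec<&'static str> {
    let mut out = Vec::new();
    if retry_rate > 0.20 { out.push("high retry rate"); }
    if unhealthy_fraction > 0.50 { out.push("too many unhealthy workers"); }
    if retry_success_rate < 0.70 { out.push("low retry success rate"); }
    if avg_queue_depth >= 20.0 { out.push("queue depth too high"); }
    out
}

fn main() {
    assert!(alerts(0.05, 0.0, 0.9, 4.0).is_empty()); // all within bounds
    assert_eq!(alerts(0.35, 0.0, 0.9, 4.0), vec!["high retry rate"]);
}
```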
## See Also

- **Architecture:** `docs/architecture/worker-availability-handling.md`
- **Phase 1 Guide:** `docs/QUICKREF-worker-availability-phase1.md`
- **Phase 2 Guide:** `docs/QUICKREF-worker-queue-ttl-dlq.md`
- **Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`