more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions


@@ -0,0 +1,448 @@
# Work Summary: Phase 3 - Intelligent Retry & Worker Health
**Date:** 2026-02-09
**Author:** AI Assistant
**Phase:** Worker Availability Handling - Phase 3
## Overview
Implemented Phase 3 of worker availability handling: intelligent retry logic and proactive worker health monitoring. This enables automatic recovery from transient failures and health-aware worker selection for optimal execution scheduling.
## Motivation
Phases 1 and 2 provided robust failure detection and handling:
- **Phase 1:** Timeout monitor catches stuck executions
- **Phase 2:** Queue TTL and DLQ handle unavailable workers
Phase 3 completes the reliability story by:
1. **Automatic Recovery:** Retry transient failures without manual intervention
2. **Intelligent Classification:** Distinguish retriable vs non-retriable failures
3. **Optimal Scheduling:** Select healthy workers with low queue depth
4. **Per-Action Configuration:** Custom timeouts and retry limits per action
## Changes Made
### 1. Database Schema Enhancement
**New Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`
**Execution Retry Tracking:**
- `retry_count` - Current retry attempt (0 = original, 1 = first retry, etc.)
- `max_retries` - Maximum retry attempts (copied from action config)
- `retry_reason` - Reason for retry (worker_unavailable, queue_timeout, etc.)
- `original_execution` - ID of original execution (forms retry chain)
**Action Configuration:**
- `timeout_seconds` - Per-action timeout override (NULL = use global TTL)
- `max_retries` - Maximum retry attempts for this action (default: 0)
**Worker Health Tracking:**
- Health metrics stored in `capabilities.health` JSONB object
- Fields: status, last_check, consecutive_failures, queue_depth, etc.
**Database Objects:**
- `healthy_workers` view - Active workers with fresh heartbeat and healthy status
- `get_worker_queue_depth()` function - Extract queue depth from worker metadata
- `is_execution_retriable()` function - Check if execution can be retried
- Indexes for retry queries and health-based worker selection
### 2. Retry Manager Module
**New File:** `crates/executor/src/retry_manager.rs` (487 lines)
**Components:**
- `RetryManager` - Core retry orchestration
- `RetryConfig` - Retry behavior configuration
- `RetryReason` - Enumeration of retry reasons
- `RetryAnalysis` - Result of retry eligibility analysis
**Key Features:**
- **Failure Classification:** Detects retriable vs non-retriable failures from error messages
- **Exponential Backoff:** Configurable base, multiplier, and max backoff (default: 1s, 2x, 300s max)
- **Jitter:** Random variance (±20%) to prevent thundering herd
- **Retry Chain Tracking:** Links retries to original execution via metadata
- **Exhaustion Handling:** Stops retrying when max_retries reached
**Retriable Failure Patterns:**
- Worker queue TTL expired
- Worker unavailable
- Timeout/timed out
- Heartbeat stale
- Transient/temporary errors
- Connection refused/reset
**Non-Retriable Failures:**
- Validation errors
- Permission denied
- Action not found
- Invalid parameters
- Unknown/unclassified errors (conservative approach)
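The classification above can be sketched as a substring match over the error message. This is a simplified, illustrative sketch (`is_retriable` is a hypothetical name, not the module's actual API); the real pattern list in `retry_manager.rs` may differ.

```rust
/// Sketch of failure classification by error-message substring matching.
/// Mirrors the retriable/non-retriable categories listed above.
fn is_retriable(error: &str) -> bool {
    let msg = error.to_lowercase();
    const RETRIABLE: &[&str] = &[
        "queue ttl expired",
        "worker unavailable",
        "timed out",
        "timeout",
        "heartbeat stale",
        "transient",
        "temporary",
        "connection refused",
        "connection reset",
    ];
    // Anything unmatched (including unknown errors) is treated as
    // non-retriable -- the conservative approach described above.
    RETRIABLE.iter().any(|p| msg.contains(p))
}

fn main() {
    assert!(is_retriable("Worker unavailable: no heartbeat"));
    assert!(!is_retriable("Validation error: missing field"));
}
```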
### 3. Worker Health Probe Module
**New File:** `crates/executor/src/worker_health.rs` (464 lines)
**Components:**
- `WorkerHealthProbe` - Health monitoring and evaluation
- `HealthProbeConfig` - Health check configuration
- `HealthStatus` - Health state enum (Healthy, Degraded, Unhealthy)
- `HealthMetrics` - Worker health metrics structure
**Health States:**
**Healthy:**
- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%
**Degraded:**
- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives work but is deprioritized
**Unhealthy:**
- Heartbeat > 30 seconds stale
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new executions
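The three states and thresholds above can be condensed into a single evaluation function. This is an illustrative sketch using the default thresholds; the struct and function names are assumptions, not the module's actual API.

```rust
/// Health states as described above.
#[derive(Debug, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Metrics extracted from the worker's `capabilities.health` JSONB.
struct HealthMetrics {
    heartbeat_age_secs: u64,
    consecutive_failures: u32,
    queue_depth: u32,
    failure_rate: f64,
}

fn evaluate(m: &HealthMetrics) -> HealthStatus {
    // Any unhealthy threshold crossed -> Unhealthy (receives no new work).
    if m.heartbeat_age_secs > 30
        || m.consecutive_failures >= 10
        || m.queue_depth >= 100
        || m.failure_rate >= 0.7
    {
        HealthStatus::Unhealthy
    // Any degraded threshold crossed -> Degraded (deprioritized).
    } else if m.consecutive_failures >= 3 || m.queue_depth >= 50 || m.failure_rate >= 0.3 {
        HealthStatus::Degraded
    } else {
        HealthStatus::Healthy
    }
}

fn main() {
    let m = HealthMetrics {
        heartbeat_age_secs: 5,
        consecutive_failures: 0,
        queue_depth: 3,
        failure_rate: 0.0,
    };
    assert_eq!(evaluate(&m), HealthStatus::Healthy);
}
```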
**Features:**
- **Proactive Health Checks:** Evaluate worker health before scheduling
- **Health-Aware Selection:** Sort workers by health status and queue depth
- **Runtime Filtering:** Select best worker for specific runtime
- **Metrics Extraction:** Parse health data from worker capabilities JSONB
### 4. Module Integration
**Updated Files:**
- `crates/executor/src/lib.rs` - Export retry and health modules
- `crates/executor/src/main.rs` - Declare modules
- `crates/executor/Cargo.toml` - Add `rand` dependency for jitter
**Public API Exports:**
```rust
pub use retry_manager::{RetryAnalysis, RetryConfig, RetryManager, RetryReason};
pub use worker_health::{HealthMetrics, HealthProbeConfig, HealthStatus, WorkerHealthProbe};
```
### 5. Documentation
**Quick Reference Guide:** `docs/QUICKREF-phase3-retry-health.md` (460 lines)
- Retry behavior and configuration
- Worker health states and metrics
- Database schema reference
- Practical SQL examples
- Monitoring queries
- Troubleshooting guides
- Integration with Phases 1 & 2
## Technical Details
### Retry Flow
```
Execution fails → Retry Manager analyzes failure
Is failure retriable?
↓ Yes
Check retry count < max_retries?
↓ Yes
Calculate exponential backoff with jitter
Create retry execution with metadata:
- retry_count++
- original_execution
- retry_reason
- retry_at timestamp
Schedule retry after backoff delay
Success or exhaust retries
```
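The two eligibility checks in the flow above (is the failure retriable, and is the retry budget exhausted?) reduce to a small decision function. A minimal sketch with illustrative names, not the module's actual `RetryAnalysis` type:

```rust
/// Outcome of the retry-eligibility check in the flow above.
#[derive(Debug, PartialEq)]
enum Decision {
    /// Create a retry execution with an incremented retry_count.
    Retry { next_retry_count: u32 },
    /// Retriable, but max_retries has been reached.
    Exhausted,
    /// Non-retriable failure; fail permanently.
    NotRetriable,
}

fn decide(retriable: bool, retry_count: u32, max_retries: u32) -> Decision {
    if !retriable {
        Decision::NotRetriable
    } else if retry_count >= max_retries {
        Decision::Exhausted
    } else {
        Decision::Retry { next_retry_count: retry_count + 1 }
    }
}

fn main() {
    assert_eq!(decide(true, 0, 3), Decision::Retry { next_retry_count: 1 });
}
```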
### Worker Selection Flow
```
Get runtime requirement → Health Probe queries all workers
Filter by:
1. Active status
2. Fresh heartbeat
3. Runtime support
Sort by:
1. Health status (healthy > degraded > unhealthy)
2. Queue depth (ascending)
Return best worker or None
```
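The sort step above can be expressed as a single composite sort key: health status first, queue depth second. A sketch under illustrative type names (the probe's real API may differ); note the enum's variant order makes `Healthy` sort first when `Ord` is derived.

```rust
/// Variant order gives Healthy < Degraded < Unhealthy under derived Ord,
/// so healthy workers sort to the front.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Health {
    Healthy,
    Degraded,
    Unhealthy,
}

#[derive(Debug, Clone)]
struct Candidate {
    id: &'static str,
    health: Health,
    queue_depth: u32,
}

/// Assumes candidates were already filtered by active status,
/// fresh heartbeat, and runtime support (steps 1-3 above).
fn select_best(mut workers: Vec<Candidate>) -> Option<Candidate> {
    workers.sort_by_key(|w| (w.health.clone(), w.queue_depth));
    workers.into_iter().next()
}

fn main() {
    assert!(select_best(vec![]).is_none());
}
```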
### Backoff Calculation
```
backoff = base_secs * (multiplier ^ retry_count)
backoff = min(backoff, max_backoff_secs)
jitter = random(1 - jitter_factor, 1 + jitter_factor)
final_backoff = backoff * jitter
```
**Example:**
- Attempt 0: ~1s (0.8-1.2s with 20% jitter)
- Attempt 1: ~2s (1.6-2.4s)
- Attempt 2: ~4s (3.2-4.8s)
- Attempt 3: ~8s (6.4-9.6s)
- Attempt N: min(base * 2^N, 300s) with jitter
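The formula above can be sketched in Rust. The jitter multiplier is passed in as a parameter to keep the core calculation deterministic; the actual module draws it from `rand` in the range `[1 - jitter_factor, 1 + jitter_factor]`, and this signature is illustrative rather than the module's real API.

```rust
/// Exponential backoff with cap and externally supplied jitter,
/// matching the formula above:
///   backoff = min(base * multiplier^retry_count, max) * jitter
fn backoff_secs(
    base_secs: f64,
    multiplier: f64,
    max_backoff_secs: f64,
    retry_count: u32,
    jitter_multiplier: f64, // e.g. drawn from [0.8, 1.2] for 20% jitter
) -> f64 {
    let raw = base_secs * multiplier.powi(retry_count as i32);
    raw.min(max_backoff_secs) * jitter_multiplier
}

fn main() {
    // Attempt 3 with no jitter: 1 * 2^3 = 8 seconds.
    assert_eq!(backoff_secs(1.0, 2.0, 300.0, 3, 1.0), 8.0);
}
```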
## Configuration
### Retry Manager
```rust
RetryConfig {
enabled: true, // Enable automatic retries
base_backoff_secs: 1, // Initial backoff
max_backoff_secs: 300, // 5 minutes maximum
backoff_multiplier: 2.0, // Exponential growth
jitter_factor: 0.2, // 20% randomization
}
```
### Health Probe
```rust
HealthProbeConfig {
enabled: true,
heartbeat_max_age_secs: 30,
degraded_threshold: 3, // Consecutive failures
unhealthy_threshold: 10,
queue_depth_degraded: 50,
queue_depth_unhealthy: 100,
failure_rate_degraded: 0.3, // 30%
failure_rate_unhealthy: 0.7, // 70%
}
```
### Per-Action Configuration
```yaml
# packs/mypack/actions/api-call.yaml
name: external_api_call
runtime: python
entrypoint: actions/api.py
timeout_seconds: 120 # 2 minutes (overrides global 5 min TTL)
max_retries: 3 # Retry up to 3 times on failure
```
## Testing
### Compilation
- ✅ All crates compile cleanly with zero warnings
- ✅ Added `rand` dependency for jitter calculation
- ✅ All public API methods properly documented
### Database Migration
- ✅ SQLx compatible migration file
- ✅ Adds all necessary columns, indexes, views, functions
- ✅ Includes comprehensive comments
- ✅ Backward compatible (nullable fields)
### Unit Tests
- ✅ Retry reason detection from error messages
- ✅ Retriable error pattern matching
- ✅ Backoff calculation (exponential with jitter)
- ✅ Health status extraction from worker capabilities
- ✅ Configuration defaults
## Integration Status
### Complete
- ✅ Database schema
- ✅ Retry manager module with full logic
- ✅ Worker health probe module
- ✅ Module exports and integration
- ✅ Comprehensive documentation
### Pending (Future Integration)
- ⏳ Wire retry manager into completion listener
- ⏳ Wire health probe into scheduler
- ⏳ Add retry API endpoints
- ⏳ Update worker to report health metrics
- ⏳ Add retry/health UI components
**Note:** Phase 3 provides the foundation and API. Full integration will occur in subsequent work as the system is tested and refined.
## Benefits
### Automatic Recovery
- **Transient Failures:** Retry worker unavailability, timeouts, network issues
- **No Manual Intervention:** System self-heals from temporary problems
- **Exponential Backoff:** Avoids overwhelming struggling resources
- **Jitter:** Prevents thundering herd problem
### Intelligent Scheduling
- **Health-Aware:** Avoid unhealthy workers proactively
- **Load Balancing:** Prefer workers with lower queue depth
- **Runtime Matching:** Only select workers supporting required runtime
- **Graceful Degradation:** Degraded workers still used if necessary
### Operational Visibility
- **Retry Metrics:** Track retry rates, reasons, success rates
- **Health Metrics:** Monitor worker health distribution
- **Failure Classification:** Understand why executions fail
- **Retry Chains:** Trace execution attempts through retries
### Flexibility
- **Per-Action Config:** Custom timeouts and retry limits per action
- **Global Config:** Override retry/health settings for entire system
- **Tunable Thresholds:** Adjust health and retry parameters
- **Extensible:** Easy to add new retry reasons or health factors
## Relationship to Previous Phases
### Defense in Depth
**Phase 1 (Timeout Monitor):**
- Monitors database for stuck SCHEDULED executions
- Fails executions after timeout (default: 5 minutes)
- Acts as backstop for all phases
**Phase 2 (Queue TTL + DLQ):**
- Expires messages in worker queues (default: 5 minutes)
- Routes expired messages to DLQ
- DLQ handler marks executions as FAILED
**Phase 3 (Intelligent Retry + Health):**
- Analyzes failures and retries if retriable
- Exponential backoff prevents immediate re-failure
- Health-aware selection avoids problematic workers
### Failure Flow Integration
```
Execution scheduled → Sent to worker queue (Phase 2 TTL active)
Worker unavailable → Message expires (5 min)
DLQ handler fails execution (Phase 2)
Retry manager detects retriable failure (Phase 3)
Create retry with backoff (Phase 3)
Health probe selects healthy worker (Phase 3)
Retry succeeds or exhausts attempts
If stuck, Phase 1 timeout monitor catches it (safety net)
```
### Complementary Mechanisms
- **Phase 1:** Polling-based safety net (catches anything missed)
- **Phase 2:** Message-level expiration (precise timing)
- **Phase 3:** Active recovery (automatic retry) + Prevention (health checks)
Together, these layers provide complete reliability from failure detection through automatic recovery.
## Known Limitations
1. **Not Fully Integrated:** Modules are standalone, not yet wired into executor/worker
2. **No Worker Health Reporting:** Workers don't yet update health metrics
3. **No Retry API:** Manual retry requires direct execution creation
4. **No UI Components:** Web UI doesn't display retry chains or health
5. **No Per-Action TTL:** Worker queue TTL still global (schema supports it)
## Files Modified/Created
### New Files (4)
- `migrations/20260209000000_phase3_retry_and_health.sql` (127 lines)
- `crates/executor/src/retry_manager.rs` (487 lines)
- `crates/executor/src/worker_health.rs` (464 lines)
- `docs/QUICKREF-phase3-retry-health.md` (460 lines)
### Modified Files (4)
- `crates/executor/src/lib.rs` (+4 lines)
- `crates/executor/src/main.rs` (+2 lines)
- `crates/executor/Cargo.toml` (+1 line)
- `work-summary/2026-02-09-phase3-retry-health.md` (this document)
### Total Changes
- **New Files:** 4
- **Modified Files:** 4
- **Lines Added:** ~1,550
- **Lines Removed:** ~0
## Deployment Notes
1. **Database Migration Required:** Run `sqlx migrate run` before deploying
2. **No Breaking Changes:** All new fields are nullable or have defaults
3. **Backward Compatible:** Existing executions work without retry metadata
4. **No Configuration Required:** Sensible defaults for all settings
5. **Incremental Adoption:** Retry/health features can be enabled per-action
## Next Steps
### Immediate (Complete Phase 3 Integration)
1. **Wire Retry Manager:** Integrate into completion listener to create retries
2. **Wire Health Probe:** Integrate into scheduler for worker selection
3. **Worker Health Reporting:** Update workers to report health metrics
4. **Add API Endpoints:** `/api/v1/executions/{id}/retry` endpoint
5. **Testing:** End-to-end tests with retry scenarios
### Short Term (Enhance Phase 3)
6. **Retry UI:** Display retry chains and status in web UI
7. **Health Dashboard:** Visualize worker health distribution
8. **Per-Action TTL:** Use action.timeout_seconds for custom queue TTL
9. **Retry Policies:** Allow pack-level retry configuration
10. **Health Probes:** Active HTTP health checks to workers
### Long Term (Advanced Features)
11. **Circuit Breakers:** Automatically disable failing actions
12. **Retry Quotas:** Limit total retries per time window
13. **Smart Routing:** Affinity-based worker selection
14. **Predictive Health:** ML-based health prediction
15. **Auto-scaling:** Scale workers based on queue depth and health
## Monitoring Recommendations
### Key Metrics to Track
- **Retry Rate:** % of executions that retry
- **Retry Success Rate:** % of retries that eventually succeed
- **Retry Reason Distribution:** Which failures are most common
- **Worker Health Distribution:** Healthy/degraded/unhealthy counts
- **Average Queue Depth:** Per-worker queue occupancy
- **Health-Driven Routing:** % of executions using health-aware selection
### Alert Thresholds
- **Warning:** Retry rate > 20%, unhealthy workers > 30%
- **Critical:** Retry rate > 50%, unhealthy workers > 70%
### SQL Monitoring Queries
See `docs/QUICKREF-phase3-retry-health.md` for comprehensive monitoring queries including:
- Retry rate over time
- Retry success rate by reason
- Worker health distribution
- Queue depth analysis
- Retry chain tracing
## References
- **Phase 1 Summary:** `work-summary/2026-02-09-worker-availability-phase1.md`
- **Phase 2 Summary:** `work-summary/2026-02-09-worker-queue-ttl-phase2.md`
- **Quick Reference:** `docs/QUICKREF-phase3-retry-health.md`
- **Architecture:** `docs/architecture/worker-availability-handling.md`
## Conclusion
Phase 3 provides the foundation for intelligent retry logic and health-aware worker selection. The modules are fully implemented with comprehensive error handling, configuration options, and documentation. While not yet fully integrated into the executor/worker services, the groundwork is complete and ready for incremental integration and testing.
Together with Phases 1 and 2, the Attune platform now has a complete three-layer reliability system:
1. **Detection** (Phase 1): Timeout monitor catches stuck executions
2. **Handling** (Phase 2): Queue TTL and DLQ fail unavailable workers
3. **Recovery** (Phase 3): Intelligent retry and health-aware scheduling
This defense-in-depth approach ensures executions are resilient to transient failures while maintaining system stability and performance. 🚀