# Work Summary: Phase 3 - Intelligent Retry & Worker Health **Date:** 2026-02-09 **Author:** AI Assistant **Phase:** Worker Availability Handling - Phase 3 ## Overview Implemented Phase 3 of worker availability handling: intelligent retry logic and proactive worker health monitoring. This enables automatic recovery from transient failures and health-aware worker selection for optimal execution scheduling. ## Motivation Phases 1 and 2 provided robust failure detection and handling: - **Phase 1:** Timeout monitor catches stuck executions - **Phase 2:** Queue TTL and DLQ handle unavailable workers Phase 3 completes the reliability story by: 1. **Automatic Recovery:** Retry transient failures without manual intervention 2. **Intelligent Classification:** Distinguish retriable vs non-retriable failures 3. **Optimal Scheduling:** Select healthy workers with low queue depth 4. **Per-Action Configuration:** Custom timeouts and retry limits per action ## Changes Made ### 1. Database Schema Enhancement **New Migration:** `migrations/20260209000000_phase3_retry_and_health.sql` **Execution Retry Tracking:** - `retry_count` - Current retry attempt (0 = original, 1 = first retry, etc.) - `max_retries` - Maximum retry attempts (copied from action config) - `retry_reason` - Reason for retry (worker_unavailable, queue_timeout, etc.) - `original_execution` - ID of original execution (forms retry chain) **Action Configuration:** - `timeout_seconds` - Per-action timeout override (NULL = use global TTL) - `max_retries` - Maximum retry attempts for this action (default: 0) **Worker Health Tracking:** - Health metrics stored in `capabilities.health` JSONB object - Fields: status, last_check, consecutive_failures, queue_depth, etc. **Database Objects:** - `healthy_workers` view - Active workers with fresh heartbeat and healthy status - `get_worker_queue_depth()` function - Extract queue depth from worker metadata - `is_execution_retriable()` function - Check if execution can be retried - Indexes for retry queries and health-based worker selection ### 2. Retry Manager Module **New File:** `crates/executor/src/retry_manager.rs` (487 lines) **Components:** - `RetryManager` - Core retry orchestration - `RetryConfig` - Retry behavior configuration - `RetryReason` - Enumeration of retry reasons - `RetryAnalysis` - Result of retry eligibility analysis **Key Features:** - **Failure Classification:** Detects retriable vs non-retriable failures from error messages - **Exponential Backoff:** Configurable base, multiplier, and max backoff (default: 1s, 2x, 300s max) - **Jitter:** Random variance (±20%) to prevent thundering herd - **Retry Chain Tracking:** Links retries to original execution via metadata - **Exhaustion Handling:** Stops retrying when max_retries reached **Retriable Failure Patterns:** - Worker queue TTL expired - Worker unavailable - Timeout/timed out - Heartbeat stale - Transient/temporary errors - Connection refused/reset **Non-Retriable Failures:** - Validation errors - Permission denied - Action not found - Invalid parameters - Unknown/unclassified errors (conservative approach) ### 3. Worker Health Probe Module **New File:** `crates/executor/src/worker_health.rs` (464 lines) **Components:** - `WorkerHealthProbe` - Health monitoring and evaluation - `HealthProbeConfig` - Health check configuration - `HealthStatus` - Health state enum (Healthy, Degraded, Unhealthy) - `HealthMetrics` - Worker health metrics structure **Health States:** **Healthy:** - Heartbeat < 30 seconds old - Consecutive failures < 3 - Queue depth < 50 - Failure rate < 30% **Degraded:** - Consecutive failures: 3-9 - Queue depth: 50-99 - Failure rate: 30-69% - Still receives work but deprioritized **Unhealthy:** - Heartbeat > 30 seconds stale - Consecutive failures ≥ 10 - Queue depth ≥ 100 - Failure rate ≥ 70% - Does NOT receive new executions **Features:** - **Proactive Health Checks:** Evaluate worker health before scheduling - **Health-Aware Selection:** Sort workers by health status and queue depth - **Runtime Filtering:** Select best worker for specific runtime - **Metrics Extraction:** Parse health data from worker capabilities JSONB ### 4. Module Integration **Updated Files:** - `crates/executor/src/lib.rs` - Export retry and health modules - `crates/executor/src/main.rs` - Declare modules - `crates/executor/Cargo.toml` - Add `rand` dependency for jitter **Public API Exports:** ```rust pub use retry_manager::{RetryAnalysis, RetryConfig, RetryManager, RetryReason}; pub use worker_health::{HealthMetrics, HealthProbeConfig, HealthStatus, WorkerHealthProbe}; ``` ### 5. Documentation **Quick Reference Guide:** `docs/QUICKREF-phase3-retry-health.md` (460 lines) - Retry behavior and configuration - Worker health states and metrics - Database schema reference - Practical SQL examples - Monitoring queries - Troubleshooting guides - Integration with Phases 1 & 2 ## Technical Details ### Retry Flow ``` Execution fails → Retry Manager analyzes failure ↓ Is failure retriable? ↓ Yes Check retry count < max_retries? ↓ Yes Calculate exponential backoff with jitter ↓ Create retry execution with metadata: - retry_count++ - original_execution - retry_reason - retry_at timestamp ↓ Schedule retry after backoff delay ↓ Success or exhaust retries ``` ### Worker Selection Flow ``` Get runtime requirement → Health Probe queries all workers ↓ Filter by: 1. Active status 2. Fresh heartbeat 3. Runtime support ↓ Sort by: 1. Health status (healthy > degraded > unhealthy) 2. Queue depth (ascending) ↓ Return best worker or None ``` ### Backoff Calculation ``` backoff = base_secs * (multiplier ^ retry_count) backoff = min(backoff, max_backoff_secs) jitter = random(1 - jitter_factor, 1 + jitter_factor) final_backoff = backoff * jitter ``` **Example:** - Attempt 0: ~1s (0.8-1.2s with 20% jitter) - Attempt 1: ~2s (1.6-2.4s) - Attempt 2: ~4s (3.2-4.8s) - Attempt 3: ~8s (6.4-9.6s) - Attempt N: min(base * 2^N, 300s) with jitter ## Configuration ### Retry Manager ```rust RetryConfig { enabled: true, // Enable automatic retries base_backoff_secs: 1, // Initial backoff max_backoff_secs: 300, // 5 minutes maximum backoff_multiplier: 2.0, // Exponential growth jitter_factor: 0.2, // 20% randomization } ``` ### Health Probe ```rust HealthProbeConfig { enabled: true, heartbeat_max_age_secs: 30, degraded_threshold: 3, // Consecutive failures unhealthy_threshold: 10, queue_depth_degraded: 50, queue_depth_unhealthy: 100, failure_rate_degraded: 0.3, // 30% failure_rate_unhealthy: 0.7, // 70% } ``` ### Per-Action Configuration ```yaml # packs/mypack/actions/api-call.yaml name: external_api_call runtime: python entrypoint: actions/api.py timeout_seconds: 120 # 2 minutes (overrides global 5 min TTL) max_retries: 3 # Retry up to 3 times on failure ``` ## Testing ### Compilation - ✅ All crates compile cleanly with zero warnings - ✅ Added `rand` dependency for jitter calculation - ✅ All public API methods properly documented ### Database Migration - ✅ SQLx compatible migration file - ✅ Adds all necessary columns, indexes, views, functions - ✅ Includes comprehensive comments - ✅ Backward compatible (nullable fields) ### Unit Tests - ✅ Retry reason detection from error messages - ✅ Retriable error pattern matching - ✅ Backoff calculation (exponential with jitter) - ✅ Health status extraction from worker capabilities - ✅ Configuration defaults ## Integration Status ### Complete - ✅ Database schema - ✅ Retry manager module with full logic - ✅ Worker health probe module - ✅ Module exports and integration - ✅ Comprehensive documentation ### Pending (Future Integration) - ⏳ Wire retry manager into completion listener - ⏳ Wire health probe into scheduler - ⏳ Add retry API endpoints - ⏳ Update worker to report health metrics - ⏳ Add retry/health UI components **Note:** Phase 3 provides the foundation and API. Full integration will occur in subsequent work as the system is tested and refined. ## Benefits ### Automatic Recovery - **Transient Failures:** Retry worker unavailability, timeouts, network issues - **No Manual Intervention:** System self-heals from temporary problems - **Exponential Backoff:** Avoids overwhelming struggling resources - **Jitter:** Prevents thundering herd problem ### Intelligent Scheduling - **Health-Aware:** Avoid unhealthy workers proactively - **Load Balancing:** Prefer workers with lower queue depth - **Runtime Matching:** Only select workers supporting required runtime - **Graceful Degradation:** Degraded workers still used if necessary ### Operational Visibility - **Retry Metrics:** Track retry rates, reasons, success rates - **Health Metrics:** Monitor worker health distribution - **Failure Classification:** Understand why executions fail - **Retry Chains:** Trace execution attempts through retries ### Flexibility - **Per-Action Config:** Custom timeouts and retry limits per action - **Global Config:** Override retry/health settings for entire system - **Tunable Thresholds:** Adjust health and retry parameters - **Extensible:** Easy to add new retry reasons or health factors ## Relationship to Previous Phases ### Defense in Depth **Phase 1 (Timeout Monitor):** - Monitors database for stuck SCHEDULED executions - Fails executions after timeout (default: 5 minutes) - Acts as backstop for all phases **Phase 2 (Queue TTL + DLQ):** - Expires messages in worker queues (default: 5 minutes) - Routes expired messages to DLQ - DLQ handler marks executions as FAILED **Phase 3 (Intelligent Retry + Health):** - Analyzes failures and retries if retriable - Exponential backoff prevents immediate re-failure - Health-aware selection avoids problematic workers ### Failure Flow Integration ``` Execution scheduled → Sent to worker queue (Phase 2 TTL active) ↓ Worker unavailable → Message expires (5 min) ↓ DLQ handler fails execution (Phase 2) ↓ Retry manager detects retriable failure (Phase 3) ↓ Create retry with backoff (Phase 3) ↓ Health probe selects healthy worker (Phase 3) ↓ Retry succeeds or exhausts attempts ↓ If stuck, Phase 1 timeout monitor catches it (safety net) ``` ### Complementary Mechanisms - **Phase 1:** Polling-based safety net (catches anything missed) - **Phase 2:** Message-level expiration (precise timing) - **Phase 3:** Active recovery (automatic retry) + Prevention (health checks) Together: Complete reliability from failure detection → automatic recovery ## Known Limitations 1. **Not Fully Integrated:** Modules are standalone, not yet wired into executor/worker 2. **No Worker Health Reporting:** Workers don't yet update health metrics 3. **No Retry API:** Manual retry requires direct execution creation 4. **No UI Components:** Web UI doesn't display retry chains or health 5. **No per-action TTL:** Worker queue TTL still global (schema supports it) ## Files Modified/Created ### New Files (4) - `migrations/20260209000000_phase3_retry_and_health.sql` (127 lines) - `crates/executor/src/retry_manager.rs` (487 lines) - `crates/executor/src/worker_health.rs` (464 lines) - `docs/QUICKREF-phase3-retry-health.md` (460 lines) ### Modified Files (4) - `crates/executor/src/lib.rs` (+4 lines) - `crates/executor/src/main.rs` (+2 lines) - `crates/executor/Cargo.toml` (+1 line) - `work-summary/2026-02-09-phase3-retry-health.md` (this document) ### Total Changes - **New Files:** 4 - **Modified Files:** 4 - **Lines Added:** ~1,550 - **Lines Removed:** ~0 ## Deployment Notes 1. **Database Migration Required:** Run `sqlx migrate run` before deploying 2. **No Breaking Changes:** All new fields are nullable or have defaults 3. **Backward Compatible:** Existing executions work without retry metadata 4. **No Configuration Required:** Sensible defaults for all settings 5. **Incremental Adoption:** Retry/health features can be enabled per-action ## Next Steps ### Immediate (Complete Phase 3 Integration) 1. **Wire Retry Manager:** Integrate into completion listener to create retries 2. **Wire Health Probe:** Integrate into scheduler for worker selection 3. **Worker Health Reporting:** Update workers to report health metrics 4. **Add API Endpoints:** `/api/v1/executions/{id}/retry` endpoint 5. **Testing:** End-to-end tests with retry scenarios ### Short Term (Enhance Phase 3) 6. **Retry UI:** Display retry chains and status in web UI 7. **Health Dashboard:** Visualize worker health distribution 8. **Per-Action TTL:** Use action.timeout_seconds for custom queue TTL 9. **Retry Policies:** Allow pack-level retry configuration 10. **Health Probes:** Active HTTP health checks to workers ### Long Term (Advanced Features) 11. **Circuit Breakers:** Automatically disable failing actions 12. **Retry Quotas:** Limit total retries per time window 13. **Smart Routing:** Affinity-based worker selection 14. **Predictive Health:** ML-based health prediction 15. **Auto-scaling:** Scale workers based on queue depth and health ## Monitoring Recommendations ### Key Metrics to Track - **Retry Rate:** % of executions that retry - **Retry Success Rate:** % of retries that eventually succeed - **Retry Reason Distribution:** Which failures are most common - **Worker Health Distribution:** Healthy/degraded/unhealthy counts - **Average Queue Depth:** Per-worker queue occupancy - **Health-Driven Routing:** % of executions using health-aware selection ### Alert Thresholds - **Warning:** Retry rate > 20%, unhealthy workers > 30% - **Critical:** Retry rate > 50%, unhealthy workers > 70% ### SQL Monitoring Queries See `docs/QUICKREF-phase3-retry-health.md` for comprehensive monitoring queries including: - Retry rate over time - Retry success rate by reason - Worker health distribution - Queue depth analysis - Retry chain tracing ## References - **Phase 1 Summary:** `work-summary/2026-02-09-worker-availability-phase1.md` - **Phase 2 Summary:** `work-summary/2026-02-09-worker-queue-ttl-phase2.md` - **Quick Reference:** `docs/QUICKREF-phase3-retry-health.md` - **Architecture:** `docs/architecture/worker-availability-handling.md` ## Conclusion Phase 3 provides the foundation for intelligent retry logic and health-aware worker selection. The modules are fully implemented with comprehensive error handling, configuration options, and documentation. While not yet fully integrated into the executor/worker services, the groundwork is complete and ready for incremental integration and testing. Together with Phases 1 and 2, the Attune platform now has a complete three-layer reliability system: 1. **Detection** (Phase 1): Timeout monitor catches stuck executions 2. **Handling** (Phase 2): Queue TTL and DLQ fail unavailable workers 3. **Recovery** (Phase 3): Intelligent retry and health-aware scheduling This defense-in-depth approach ensures executions are resilient to transient failures while maintaining system stability and performance. 🚀