15 KiB
Work Summary: Phase 3 - Intelligent Retry & Worker Health
Date: 2026-02-09
Author: AI Assistant
Phase: Worker Availability Handling - Phase 3
Overview
Implemented Phase 3 of worker availability handling: intelligent retry logic and proactive worker health monitoring. This enables automatic recovery from transient failures and health-aware worker selection for optimal execution scheduling.
Motivation
Phases 1 and 2 provided robust failure detection and handling:
- Phase 1: Timeout monitor catches stuck executions
- Phase 2: Queue TTL and DLQ handle unavailable workers
Phase 3 completes the reliability story by:
- Automatic Recovery: Retry transient failures without manual intervention
- Intelligent Classification: Distinguish retriable vs non-retriable failures
- Optimal Scheduling: Select healthy workers with low queue depth
- Per-Action Configuration: Custom timeouts and retry limits per action
Changes Made
1. Database Schema Enhancement
New Migration: migrations/20260209000000_phase3_retry_and_health.sql
Execution Retry Tracking:
retry_count- Current retry attempt (0 = original, 1 = first retry, etc.)max_retries- Maximum retry attempts (copied from action config)retry_reason- Reason for retry (worker_unavailable, queue_timeout, etc.)original_execution- ID of original execution (forms retry chain)
Action Configuration:
timeout_seconds- Per-action timeout override (NULL = use global TTL)max_retries- Maximum retry attempts for this action (default: 0)
Worker Health Tracking:
- Health metrics stored in
capabilities.healthJSONB object - Fields: status, last_check, consecutive_failures, queue_depth, etc.
Database Objects:
healthy_workersview - Active workers with fresh heartbeat and healthy statusget_worker_queue_depth()function - Extract queue depth from worker metadatais_execution_retriable()function - Check if execution can be retried- Indexes for retry queries and health-based worker selection
2. Retry Manager Module
New File: crates/executor/src/retry_manager.rs (487 lines)
Components:
RetryManager- Core retry orchestrationRetryConfig- Retry behavior configurationRetryReason- Enumeration of retry reasonsRetryAnalysis- Result of retry eligibility analysis
Key Features:
- Failure Classification: Detects retriable vs non-retriable failures from error messages
- Exponential Backoff: Configurable base, multiplier, and max backoff (default: 1s, 2x, 300s max)
- Jitter: Random variance (±20%) to prevent thundering herd
- Retry Chain Tracking: Links retries to original execution via metadata
- Exhaustion Handling: Stops retrying when max_retries reached
Retriable Failure Patterns:
- Worker queue TTL expired
- Worker unavailable
- Timeout/timed out
- Heartbeat stale
- Transient/temporary errors
- Connection refused/reset
Non-Retriable Failures:
- Validation errors
- Permission denied
- Action not found
- Invalid parameters
- Unknown/unclassified errors (conservative approach)
3. Worker Health Probe Module
New File: crates/executor/src/worker_health.rs (464 lines)
Components:
WorkerHealthProbe- Health monitoring and evaluationHealthProbeConfig- Health check configurationHealthStatus- Health state enum (Healthy, Degraded, Unhealthy)HealthMetrics- Worker health metrics structure
Health States:
Healthy:
- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%
Degraded:
- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives work but deprioritized
Unhealthy:
- Heartbeat > 30 seconds stale
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new executions
Features:
- Proactive Health Checks: Evaluate worker health before scheduling
- Health-Aware Selection: Sort workers by health status and queue depth
- Runtime Filtering: Select best worker for specific runtime
- Metrics Extraction: Parse health data from worker capabilities JSONB
4. Module Integration
Updated Files:
crates/executor/src/lib.rs- Export retry and health modulescrates/executor/src/main.rs- Declare modulescrates/executor/Cargo.toml- Addranddependency for jitter
Public API Exports:
pub use retry_manager::{RetryAnalysis, RetryConfig, RetryManager, RetryReason};
pub use worker_health::{HealthMetrics, HealthProbeConfig, HealthStatus, WorkerHealthProbe};
5. Documentation
Quick Reference Guide: docs/QUICKREF-phase3-retry-health.md (460 lines)
- Retry behavior and configuration
- Worker health states and metrics
- Database schema reference
- Practical SQL examples
- Monitoring queries
- Troubleshooting guides
- Integration with Phases 1 & 2
Technical Details
Retry Flow
Execution fails → Retry Manager analyzes failure
↓
Is failure retriable?
↓ Yes
Check retry count < max_retries?
↓ Yes
Calculate exponential backoff with jitter
↓
Create retry execution with metadata:
- retry_count++
- original_execution
- retry_reason
- retry_at timestamp
↓
Schedule retry after backoff delay
↓
Success or exhaust retries
Worker Selection Flow
Get runtime requirement → Health Probe queries all workers
↓
Filter by:
1. Active status
2. Fresh heartbeat
3. Runtime support
↓
Sort by:
1. Health status (healthy > degraded > unhealthy)
2. Queue depth (ascending)
↓
Return best worker or None
Backoff Calculation
backoff = base_secs * (multiplier ^ retry_count)
backoff = min(backoff, max_backoff_secs)
jitter = random(1 - jitter_factor, 1 + jitter_factor)
final_backoff = backoff * jitter
Example:
- Attempt 0: ~1s (0.8-1.2s with 20% jitter)
- Attempt 1: ~2s (1.6-2.4s)
- Attempt 2: ~4s (3.2-4.8s)
- Attempt 3: ~8s (6.4-9.6s)
- Attempt N: min(base * 2^N, 300s) with jitter
Configuration
Retry Manager
RetryConfig {
enabled: true, // Enable automatic retries
base_backoff_secs: 1, // Initial backoff
max_backoff_secs: 300, // 5 minutes maximum
backoff_multiplier: 2.0, // Exponential growth
jitter_factor: 0.2, // 20% randomization
}
Health Probe
HealthProbeConfig {
enabled: true,
heartbeat_max_age_secs: 30,
degraded_threshold: 3, // Consecutive failures
unhealthy_threshold: 10,
queue_depth_degraded: 50,
queue_depth_unhealthy: 100,
failure_rate_degraded: 0.3, // 30%
failure_rate_unhealthy: 0.7, // 70%
}
Per-Action Configuration
# packs/mypack/actions/api-call.yaml
name: external_api_call
runtime: python
entrypoint: actions/api.py
timeout_seconds: 120 # 2 minutes (overrides global 5 min TTL)
max_retries: 3 # Retry up to 3 times on failure
Testing
Compilation
- ✅ All crates compile cleanly with zero warnings
- ✅ Added
randdependency for jitter calculation - ✅ All public API methods properly documented
Database Migration
- ✅ SQLx compatible migration file
- ✅ Adds all necessary columns, indexes, views, functions
- ✅ Includes comprehensive comments
- ✅ Backward compatible (nullable fields)
Unit Tests
- ✅ Retry reason detection from error messages
- ✅ Retriable error pattern matching
- ✅ Backoff calculation (exponential with jitter)
- ✅ Health status extraction from worker capabilities
- ✅ Configuration defaults
Integration Status
Complete
- ✅ Database schema
- ✅ Retry manager module with full logic
- ✅ Worker health probe module
- ✅ Module exports and integration
- ✅ Comprehensive documentation
Pending (Future Integration)
- ⏳ Wire retry manager into completion listener
- ⏳ Wire health probe into scheduler
- ⏳ Add retry API endpoints
- ⏳ Update worker to report health metrics
- ⏳ Add retry/health UI components
Note: Phase 3 provides the foundation and API. Full integration will occur in subsequent work as the system is tested and refined.
Benefits
Automatic Recovery
- Transient Failures: Retry worker unavailability, timeouts, network issues
- No Manual Intervention: System self-heals from temporary problems
- Exponential Backoff: Avoids overwhelming struggling resources
- Jitter: Prevents thundering herd problem
Intelligent Scheduling
- Health-Aware: Avoid unhealthy workers proactively
- Load Balancing: Prefer workers with lower queue depth
- Runtime Matching: Only select workers supporting required runtime
- Graceful Degradation: Degraded workers still used if necessary
Operational Visibility
- Retry Metrics: Track retry rates, reasons, success rates
- Health Metrics: Monitor worker health distribution
- Failure Classification: Understand why executions fail
- Retry Chains: Trace execution attempts through retries
Flexibility
- Per-Action Config: Custom timeouts and retry limits per action
- Global Config: Override retry/health settings for entire system
- Tunable Thresholds: Adjust health and retry parameters
- Extensible: Easy to add new retry reasons or health factors
Relationship to Previous Phases
Defense in Depth
Phase 1 (Timeout Monitor):
- Monitors database for stuck SCHEDULED executions
- Fails executions after timeout (default: 5 minutes)
- Acts as backstop for all phases
Phase 2 (Queue TTL + DLQ):
- Expires messages in worker queues (default: 5 minutes)
- Routes expired messages to DLQ
- DLQ handler marks executions as FAILED
Phase 3 (Intelligent Retry + Health):
- Analyzes failures and retries if retriable
- Exponential backoff prevents immediate re-failure
- Health-aware selection avoids problematic workers
Failure Flow Integration
Execution scheduled → Sent to worker queue (Phase 2 TTL active)
↓
Worker unavailable → Message expires (5 min)
↓
DLQ handler fails execution (Phase 2)
↓
Retry manager detects retriable failure (Phase 3)
↓
Create retry with backoff (Phase 3)
↓
Health probe selects healthy worker (Phase 3)
↓
Retry succeeds or exhausts attempts
↓
If stuck, Phase 1 timeout monitor catches it (safety net)
Complementary Mechanisms
- Phase 1: Polling-based safety net (catches anything missed)
- Phase 2: Message-level expiration (precise timing)
- Phase 3: Active recovery (automatic retry) + Prevention (health checks)
Together: Complete reliability from failure detection → automatic recovery
Known Limitations
- Not Fully Integrated: Modules are standalone, not yet wired into executor/worker
- No Worker Health Reporting: Workers don't yet update health metrics
- No Retry API: Manual retry requires direct execution creation
- No UI Components: Web UI doesn't display retry chains or health
- No per-action TTL: Worker queue TTL still global (schema supports it)
Files Modified/Created
New Files (4)
migrations/20260209000000_phase3_retry_and_health.sql(127 lines)crates/executor/src/retry_manager.rs(487 lines)crates/executor/src/worker_health.rs(464 lines)docs/QUICKREF-phase3-retry-health.md(460 lines)
Modified Files (4)
crates/executor/src/lib.rs(+4 lines)crates/executor/src/main.rs(+2 lines)crates/executor/Cargo.toml(+1 line)work-summary/2026-02-09-phase3-retry-health.md(this document)
Total Changes
- New Files: 4
- Modified Files: 4
- Lines Added: ~1,550
- Lines Removed: ~0
Deployment Notes
- Database Migration Required: Run
sqlx migrate runbefore deploying - No Breaking Changes: All new fields are nullable or have defaults
- Backward Compatible: Existing executions work without retry metadata
- No Configuration Required: Sensible defaults for all settings
- Incremental Adoption: Retry/health features can be enabled per-action
Next Steps
Immediate (Complete Phase 3 Integration)
- Wire Retry Manager: Integrate into completion listener to create retries
- Wire Health Probe: Integrate into scheduler for worker selection
- Worker Health Reporting: Update workers to report health metrics
- Add API Endpoints:
/api/v1/executions/{id}/retryendpoint - Testing: End-to-end tests with retry scenarios
Short Term (Enhance Phase 3)
- Retry UI: Display retry chains and status in web UI
- Health Dashboard: Visualize worker health distribution
- Per-Action TTL: Use action.timeout_seconds for custom queue TTL
- Retry Policies: Allow pack-level retry configuration
- Health Probes: Active HTTP health checks to workers
Long Term (Advanced Features)
- Circuit Breakers: Automatically disable failing actions
- Retry Quotas: Limit total retries per time window
- Smart Routing: Affinity-based worker selection
- Predictive Health: ML-based health prediction
- Auto-scaling: Scale workers based on queue depth and health
Monitoring Recommendations
Key Metrics to Track
- Retry Rate: % of executions that retry
- Retry Success Rate: % of retries that eventually succeed
- Retry Reason Distribution: Which failures are most common
- Worker Health Distribution: Healthy/degraded/unhealthy counts
- Average Queue Depth: Per-worker queue occupancy
- Health-Driven Routing: % of executions using health-aware selection
Alert Thresholds
- Warning: Retry rate > 20%, unhealthy workers > 30%
- Critical: Retry rate > 50%, unhealthy workers > 70%
SQL Monitoring Queries
See docs/QUICKREF-phase3-retry-health.md for comprehensive monitoring queries including:
- Retry rate over time
- Retry success rate by reason
- Worker health distribution
- Queue depth analysis
- Retry chain tracing
References
- Phase 1 Summary:
work-summary/2026-02-09-worker-availability-phase1.md - Phase 2 Summary:
work-summary/2026-02-09-worker-queue-ttl-phase2.md - Quick Reference:
docs/QUICKREF-phase3-retry-health.md - Architecture:
docs/architecture/worker-availability-handling.md
Conclusion
Phase 3 provides the foundation for intelligent retry logic and health-aware worker selection. The modules are fully implemented with comprehensive error handling, configuration options, and documentation. While not yet fully integrated into the executor/worker services, the groundwork is complete and ready for incremental integration and testing.
Together with Phases 1 and 2, the Attune platform now has a complete three-layer reliability system:
- Detection (Phase 1): Timeout monitor catches stuck executions
- Handling (Phase 2): Queue TTL and DLQ fail unavailable workers
- Recovery (Phase 3): Intelligent retry and health-aware scheduling
This defense-in-depth approach ensures executions are resilient to transient failures while maintaining system stability and performance. 🚀