more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions


@@ -0,0 +1,448 @@
# Work Summary: Phase 3 - Intelligent Retry & Worker Health
**Date:** 2026-02-09
**Author:** AI Assistant
**Phase:** Worker Availability Handling - Phase 3
## Overview
Implemented Phase 3 of worker availability handling: intelligent retry logic and proactive worker health monitoring. This enables automatic recovery from transient failures and health-aware worker selection for optimal execution scheduling.
## Motivation
Phases 1 and 2 provided robust failure detection and handling:
- **Phase 1:** Timeout monitor catches stuck executions
- **Phase 2:** Queue TTL and DLQ handle unavailable workers
Phase 3 completes the reliability story by:
1. **Automatic Recovery:** Retry transient failures without manual intervention
2. **Intelligent Classification:** Distinguish retriable vs non-retriable failures
3. **Optimal Scheduling:** Select healthy workers with low queue depth
4. **Per-Action Configuration:** Custom timeouts and retry limits per action
## Changes Made
### 1. Database Schema Enhancement
**New Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`
**Execution Retry Tracking:**
- `retry_count` - Current retry attempt (0 = original, 1 = first retry, etc.)
- `max_retries` - Maximum retry attempts (copied from action config)
- `retry_reason` - Reason for retry (worker_unavailable, queue_timeout, etc.)
- `original_execution` - ID of original execution (forms retry chain)
**Action Configuration:**
- `timeout_seconds` - Per-action timeout override (NULL = use global TTL)
- `max_retries` - Maximum retry attempts for this action (default: 0)
**Worker Health Tracking:**
- Health metrics stored in `capabilities.health` JSONB object
- Fields: status, last_check, consecutive_failures, queue_depth, etc.
**Database Objects:**
- `healthy_workers` view - Active workers with fresh heartbeat and healthy status
- `get_worker_queue_depth()` function - Extract queue depth from worker metadata
- `is_execution_retriable()` function - Check if execution can be retried
- Indexes for retry queries and health-based worker selection
### 2. Retry Manager Module
**New File:** `crates/executor/src/retry_manager.rs` (487 lines)
**Components:**
- `RetryManager` - Core retry orchestration
- `RetryConfig` - Retry behavior configuration
- `RetryReason` - Enumeration of retry reasons
- `RetryAnalysis` - Result of retry eligibility analysis
**Key Features:**
- **Failure Classification:** Detects retriable vs non-retriable failures from error messages
- **Exponential Backoff:** Configurable base, multiplier, and max backoff (default: 1s, 2x, 300s max)
- **Jitter:** Random variance (±20%) to prevent thundering herd
- **Retry Chain Tracking:** Links retries to original execution via metadata
- **Exhaustion Handling:** Stops retrying when max_retries reached
**Retriable Failure Patterns:**
- Worker queue TTL expired
- Worker unavailable
- Timeout/timed out
- Heartbeat stale
- Transient/temporary errors
- Connection refused/reset
**Non-Retriable Failures:**
- Validation errors
- Permission denied
- Action not found
- Invalid parameters
- Unknown/unclassified errors (conservative approach)
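The classification above can be sketched as a substring match over the error message. This is a simplified, illustrative sketch (`is_retriable` is a hypothetical name, not the module's actual API); the real pattern list in `retry_manager.rs` may differ.

```rust
/// Sketch of failure classification by error-message substring matching.
/// Mirrors the retriable/non-retriable categories listed above.
fn is_retriable(error: &str) -> bool {
    let msg = error.to_lowercase();
    const RETRIABLE: &[&str] = &[
        "queue ttl expired",
        "worker unavailable",
        "timed out",
        "timeout",
        "heartbeat stale",
        "transient",
        "temporary",
        "connection refused",
        "connection reset",
    ];
    // Anything unmatched (including unknown errors) is treated as
    // non-retriable -- the conservative approach described above.
    RETRIABLE.iter().any(|p| msg.contains(p))
}

fn main() {
    assert!(is_retriable("Worker unavailable: no heartbeat"));
    assert!(!is_retriable("Validation error: missing field"));
}
```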
### 3. Worker Health Probe Module
**New File:** `crates/executor/src/worker_health.rs` (464 lines)
**Components:**
- `WorkerHealthProbe` - Health monitoring and evaluation
- `HealthProbeConfig` - Health check configuration
- `HealthStatus` - Health state enum (Healthy, Degraded, Unhealthy)
- `HealthMetrics` - Worker health metrics structure
**Health States:**
**Healthy:**
- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%
**Degraded:**
- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives work but is deprioritized
**Unhealthy:**
- Heartbeat > 30 seconds stale
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new executions
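The three states and thresholds above can be condensed into a single evaluation function. This is an illustrative sketch using the default thresholds; the struct and function names are assumptions, not the module's actual API.

```rust
/// Health states as described above.
#[derive(Debug, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Metrics extracted from the worker's `capabilities.health` JSONB.
struct HealthMetrics {
    heartbeat_age_secs: u64,
    consecutive_failures: u32,
    queue_depth: u32,
    failure_rate: f64,
}

fn evaluate(m: &HealthMetrics) -> HealthStatus {
    // Any unhealthy threshold crossed -> Unhealthy (receives no new work).
    if m.heartbeat_age_secs > 30
        || m.consecutive_failures >= 10
        || m.queue_depth >= 100
        || m.failure_rate >= 0.7
    {
        HealthStatus::Unhealthy
    // Any degraded threshold crossed -> Degraded (deprioritized).
    } else if m.consecutive_failures >= 3 || m.queue_depth >= 50 || m.failure_rate >= 0.3 {
        HealthStatus::Degraded
    } else {
        HealthStatus::Healthy
    }
}

fn main() {
    let m = HealthMetrics {
        heartbeat_age_secs: 5,
        consecutive_failures: 0,
        queue_depth: 3,
        failure_rate: 0.0,
    };
    assert_eq!(evaluate(&m), HealthStatus::Healthy);
}
```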
**Features:**
- **Proactive Health Checks:** Evaluate worker health before scheduling
- **Health-Aware Selection:** Sort workers by health status and queue depth
- **Runtime Filtering:** Select best worker for specific runtime
- **Metrics Extraction:** Parse health data from worker capabilities JSONB
### 4. Module Integration
**Updated Files:**
- `crates/executor/src/lib.rs` - Export retry and health modules
- `crates/executor/src/main.rs` - Declare modules
- `crates/executor/Cargo.toml` - Add `rand` dependency for jitter
**Public API Exports:**
```rust
pub use retry_manager::{RetryAnalysis, RetryConfig, RetryManager, RetryReason};
pub use worker_health::{HealthMetrics, HealthProbeConfig, HealthStatus, WorkerHealthProbe};
```
### 5. Documentation
**Quick Reference Guide:** `docs/QUICKREF-phase3-retry-health.md` (460 lines)
- Retry behavior and configuration
- Worker health states and metrics
- Database schema reference
- Practical SQL examples
- Monitoring queries
- Troubleshooting guides
- Integration with Phases 1 & 2
## Technical Details
### Retry Flow
```
Execution fails → Retry Manager analyzes failure
Is failure retriable?
↓ Yes
Check retry count < max_retries?
↓ Yes
Calculate exponential backoff with jitter
Create retry execution with metadata:
- retry_count++
- original_execution
- retry_reason
- retry_at timestamp
Schedule retry after backoff delay
Success or exhaust retries
```
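The two eligibility checks in the flow above (is the failure retriable, and is the retry budget exhausted?) reduce to a small decision function. A minimal sketch with illustrative names, not the module's actual `RetryAnalysis` type:

```rust
/// Outcome of the retry-eligibility check in the flow above.
#[derive(Debug, PartialEq)]
enum Decision {
    /// Create a retry execution with an incremented retry_count.
    Retry { next_retry_count: u32 },
    /// Retriable, but max_retries has been reached.
    Exhausted,
    /// Non-retriable failure; fail permanently.
    NotRetriable,
}

fn decide(retriable: bool, retry_count: u32, max_retries: u32) -> Decision {
    if !retriable {
        Decision::NotRetriable
    } else if retry_count >= max_retries {
        Decision::Exhausted
    } else {
        Decision::Retry { next_retry_count: retry_count + 1 }
    }
}

fn main() {
    assert_eq!(decide(true, 0, 3), Decision::Retry { next_retry_count: 1 });
}
```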
### Worker Selection Flow
```
Get runtime requirement → Health Probe queries all workers
Filter by:
1. Active status
2. Fresh heartbeat
3. Runtime support
Sort by:
1. Health status (healthy > degraded > unhealthy)
2. Queue depth (ascending)
Return best worker or None
```
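The sort step above can be expressed as a single composite sort key: health status first, queue depth second. A sketch under illustrative type names (the probe's real API may differ); note the enum's variant order makes `Healthy` sort first when `Ord` is derived.

```rust
/// Variant order gives Healthy < Degraded < Unhealthy under derived Ord,
/// so healthy workers sort to the front.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Health {
    Healthy,
    Degraded,
    Unhealthy,
}

#[derive(Debug, Clone)]
struct Candidate {
    id: &'static str,
    health: Health,
    queue_depth: u32,
}

/// Assumes candidates were already filtered by active status,
/// fresh heartbeat, and runtime support (steps 1-3 above).
fn select_best(mut workers: Vec<Candidate>) -> Option<Candidate> {
    workers.sort_by_key(|w| (w.health.clone(), w.queue_depth));
    workers.into_iter().next()
}

fn main() {
    assert!(select_best(vec![]).is_none());
}
```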
### Backoff Calculation
```
backoff = base_secs * (multiplier ^ retry_count)
backoff = min(backoff, max_backoff_secs)
jitter = random(1 - jitter_factor, 1 + jitter_factor)
final_backoff = backoff * jitter
```
**Example:**
- Attempt 0: ~1s (0.8-1.2s with 20% jitter)
- Attempt 1: ~2s (1.6-2.4s)
- Attempt 2: ~4s (3.2-4.8s)
- Attempt 3: ~8s (6.4-9.6s)
- Attempt N: min(base * 2^N, 300s) with jitter
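The formula above can be sketched in Rust. The jitter multiplier is passed in as a parameter to keep the core calculation deterministic; the actual module draws it from `rand` in the range `[1 - jitter_factor, 1 + jitter_factor]`, and this signature is illustrative rather than the module's real API.

```rust
/// Exponential backoff with cap and externally supplied jitter,
/// matching the formula above:
///   backoff = min(base * multiplier^retry_count, max) * jitter
fn backoff_secs(
    base_secs: f64,
    multiplier: f64,
    max_backoff_secs: f64,
    retry_count: u32,
    jitter_multiplier: f64, // e.g. drawn from [0.8, 1.2] for 20% jitter
) -> f64 {
    let raw = base_secs * multiplier.powi(retry_count as i32);
    raw.min(max_backoff_secs) * jitter_multiplier
}

fn main() {
    // Attempt 3 with no jitter: 1 * 2^3 = 8 seconds.
    assert_eq!(backoff_secs(1.0, 2.0, 300.0, 3, 1.0), 8.0);
}
```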
## Configuration
### Retry Manager
```rust
RetryConfig {
enabled: true, // Enable automatic retries
base_backoff_secs: 1, // Initial backoff
max_backoff_secs: 300, // 5 minutes maximum
backoff_multiplier: 2.0, // Exponential growth
jitter_factor: 0.2, // 20% randomization
}
```
### Health Probe
```rust
HealthProbeConfig {
enabled: true,
heartbeat_max_age_secs: 30,
degraded_threshold: 3, // Consecutive failures
unhealthy_threshold: 10,
queue_depth_degraded: 50,
queue_depth_unhealthy: 100,
failure_rate_degraded: 0.3, // 30%
failure_rate_unhealthy: 0.7, // 70%
}
```
### Per-Action Configuration
```yaml
# packs/mypack/actions/api-call.yaml
name: external_api_call
runtime: python
entrypoint: actions/api.py
timeout_seconds: 120 # 2 minutes (overrides global 5 min TTL)
max_retries: 3 # Retry up to 3 times on failure
```
## Testing
### Compilation
- ✅ All crates compile cleanly with zero warnings
- ✅ Added `rand` dependency for jitter calculation
- ✅ All public API methods properly documented
### Database Migration
- ✅ SQLx compatible migration file
- ✅ Adds all necessary columns, indexes, views, functions
- ✅ Includes comprehensive comments
- ✅ Backward compatible (nullable fields)
### Unit Tests
- ✅ Retry reason detection from error messages
- ✅ Retriable error pattern matching
- ✅ Backoff calculation (exponential with jitter)
- ✅ Health status extraction from worker capabilities
- ✅ Configuration defaults
## Integration Status
### Complete
- ✅ Database schema
- ✅ Retry manager module with full logic
- ✅ Worker health probe module
- ✅ Module exports and integration
- ✅ Comprehensive documentation
### Pending (Future Integration)
- ⏳ Wire retry manager into completion listener
- ⏳ Wire health probe into scheduler
- ⏳ Add retry API endpoints
- ⏳ Update worker to report health metrics
- ⏳ Add retry/health UI components
**Note:** Phase 3 provides the foundation and API. Full integration will occur in subsequent work as the system is tested and refined.
## Benefits
### Automatic Recovery
- **Transient Failures:** Retry worker unavailability, timeouts, network issues
- **No Manual Intervention:** System self-heals from temporary problems
- **Exponential Backoff:** Avoids overwhelming struggling resources
- **Jitter:** Prevents thundering herd problem
### Intelligent Scheduling
- **Health-Aware:** Avoid unhealthy workers proactively
- **Load Balancing:** Prefer workers with lower queue depth
- **Runtime Matching:** Only select workers supporting required runtime
- **Graceful Degradation:** Degraded workers still used if necessary
### Operational Visibility
- **Retry Metrics:** Track retry rates, reasons, success rates
- **Health Metrics:** Monitor worker health distribution
- **Failure Classification:** Understand why executions fail
- **Retry Chains:** Trace execution attempts through retries
### Flexibility
- **Per-Action Config:** Custom timeouts and retry limits per action
- **Global Config:** Override retry/health settings for entire system
- **Tunable Thresholds:** Adjust health and retry parameters
- **Extensible:** Easy to add new retry reasons or health factors
## Relationship to Previous Phases
### Defense in Depth
**Phase 1 (Timeout Monitor):**
- Monitors database for stuck SCHEDULED executions
- Fails executions after timeout (default: 5 minutes)
- Acts as backstop for all phases
**Phase 2 (Queue TTL + DLQ):**
- Expires messages in worker queues (default: 5 minutes)
- Routes expired messages to DLQ
- DLQ handler marks executions as FAILED
**Phase 3 (Intelligent Retry + Health):**
- Analyzes failures and retries if retriable
- Exponential backoff prevents immediate re-failure
- Health-aware selection avoids problematic workers
### Failure Flow Integration
```
Execution scheduled → Sent to worker queue (Phase 2 TTL active)
Worker unavailable → Message expires (5 min)
DLQ handler fails execution (Phase 2)
Retry manager detects retriable failure (Phase 3)
Create retry with backoff (Phase 3)
Health probe selects healthy worker (Phase 3)
Retry succeeds or exhausts attempts
If stuck, Phase 1 timeout monitor catches it (safety net)
```
### Complementary Mechanisms
- **Phase 1:** Polling-based safety net (catches anything missed)
- **Phase 2:** Message-level expiration (precise timing)
- **Phase 3:** Active recovery (automatic retry) + Prevention (health checks)
Together, these layers provide complete reliability from failure detection through automatic recovery.
## Known Limitations
1. **Not Fully Integrated:** Modules are standalone, not yet wired into executor/worker
2. **No Worker Health Reporting:** Workers don't yet update health metrics
3. **No Retry API:** Manual retry requires direct execution creation
4. **No UI Components:** Web UI doesn't display retry chains or health
5. **No Per-Action TTL:** Worker queue TTL still global (schema supports it)
## Files Modified/Created
### New Files (4)
- `migrations/20260209000000_phase3_retry_and_health.sql` (127 lines)
- `crates/executor/src/retry_manager.rs` (487 lines)
- `crates/executor/src/worker_health.rs` (464 lines)
- `docs/QUICKREF-phase3-retry-health.md` (460 lines)
### Modified Files (4)
- `crates/executor/src/lib.rs` (+4 lines)
- `crates/executor/src/main.rs` (+2 lines)
- `crates/executor/Cargo.toml` (+1 line)
- `work-summary/2026-02-09-phase3-retry-health.md` (this document)
### Total Changes
- **New Files:** 4
- **Modified Files:** 4
- **Lines Added:** ~1,550
- **Lines Removed:** ~0
## Deployment Notes
1. **Database Migration Required:** Run `sqlx migrate run` before deploying
2. **No Breaking Changes:** All new fields are nullable or have defaults
3. **Backward Compatible:** Existing executions work without retry metadata
4. **No Configuration Required:** Sensible defaults for all settings
5. **Incremental Adoption:** Retry/health features can be enabled per-action
## Next Steps
### Immediate (Complete Phase 3 Integration)
1. **Wire Retry Manager:** Integrate into completion listener to create retries
2. **Wire Health Probe:** Integrate into scheduler for worker selection
3. **Worker Health Reporting:** Update workers to report health metrics
4. **Add API Endpoints:** `/api/v1/executions/{id}/retry` endpoint
5. **Testing:** End-to-end tests with retry scenarios
### Short Term (Enhance Phase 3)
6. **Retry UI:** Display retry chains and status in web UI
7. **Health Dashboard:** Visualize worker health distribution
8. **Per-Action TTL:** Use action.timeout_seconds for custom queue TTL
9. **Retry Policies:** Allow pack-level retry configuration
10. **Health Probes:** Active HTTP health checks to workers
### Long Term (Advanced Features)
11. **Circuit Breakers:** Automatically disable failing actions
12. **Retry Quotas:** Limit total retries per time window
13. **Smart Routing:** Affinity-based worker selection
14. **Predictive Health:** ML-based health prediction
15. **Auto-scaling:** Scale workers based on queue depth and health
## Monitoring Recommendations
### Key Metrics to Track
- **Retry Rate:** % of executions that retry
- **Retry Success Rate:** % of retries that eventually succeed
- **Retry Reason Distribution:** Which failures are most common
- **Worker Health Distribution:** Healthy/degraded/unhealthy counts
- **Average Queue Depth:** Per-worker queue occupancy
- **Health-Driven Routing:** % of executions using health-aware selection
### Alert Thresholds
- **Warning:** Retry rate > 20%, unhealthy workers > 30%
- **Critical:** Retry rate > 50%, unhealthy workers > 70%
### SQL Monitoring Queries
See `docs/QUICKREF-phase3-retry-health.md` for comprehensive monitoring queries including:
- Retry rate over time
- Retry success rate by reason
- Worker health distribution
- Queue depth analysis
- Retry chain tracing
## References
- **Phase 1 Summary:** `work-summary/2026-02-09-worker-availability-phase1.md`
- **Phase 2 Summary:** `work-summary/2026-02-09-worker-queue-ttl-phase2.md`
- **Quick Reference:** `docs/QUICKREF-phase3-retry-health.md`
- **Architecture:** `docs/architecture/worker-availability-handling.md`
## Conclusion
Phase 3 provides the foundation for intelligent retry logic and health-aware worker selection. The modules are fully implemented with comprehensive error handling, configuration options, and documentation. While not yet fully integrated into the executor/worker services, the groundwork is complete and ready for incremental integration and testing.
Together with Phases 1 and 2, the Attune platform now has a complete three-layer reliability system:
1. **Detection** (Phase 1): Timeout monitor catches stuck executions
2. **Handling** (Phase 2): Queue TTL and DLQ fail unavailable workers
3. **Recovery** (Phase 3): Intelligent retry and health-aware scheduling
This defense-in-depth approach ensures executions are resilient to transient failures while maintaining system stability and performance. 🚀