Files
attune/work-summary/2026-02-09-phase3-retry-health.md

15 KiB

Work Summary: Phase 3 - Intelligent Retry & Worker Health

Date: 2026-02-09
Author: AI Assistant
Phase: Worker Availability Handling - Phase 3

Overview

Implemented Phase 3 of worker availability handling: intelligent retry logic and proactive worker health monitoring. This enables automatic recovery from transient failures and health-aware worker selection for optimal execution scheduling.

Motivation

Phases 1 and 2 provided robust failure detection and handling:

  • Phase 1: Timeout monitor catches stuck executions
  • Phase 2: Queue TTL and DLQ handle unavailable workers

Phase 3 completes the reliability story by:

  1. Automatic Recovery: Retry transient failures without manual intervention
  2. Intelligent Classification: Distinguish retriable vs non-retriable failures
  3. Optimal Scheduling: Select healthy workers with low queue depth
  4. Per-Action Configuration: Custom timeouts and retry limits per action

Changes Made

1. Database Schema Enhancement

New Migration: migrations/20260209000000_phase3_retry_and_health.sql

Execution Retry Tracking:

  • retry_count - Current retry attempt (0 = original, 1 = first retry, etc.)
  • max_retries - Maximum retry attempts (copied from action config)
  • retry_reason - Reason for retry (worker_unavailable, queue_timeout, etc.)
  • original_execution - ID of original execution (forms retry chain)

Action Configuration:

  • timeout_seconds - Per-action timeout override (NULL = use global TTL)
  • max_retries - Maximum retry attempts for this action (default: 0)

Worker Health Tracking:

  • Health metrics stored in capabilities.health JSONB object
  • Fields: status, last_check, consecutive_failures, queue_depth, etc.

Database Objects:

  • healthy_workers view - Active workers with fresh heartbeat and healthy status
  • get_worker_queue_depth() function - Extract queue depth from worker metadata
  • is_execution_retriable() function - Check if execution can be retried
  • Indexes for retry queries and health-based worker selection

2. Retry Manager Module

New File: crates/executor/src/retry_manager.rs (487 lines)

Components:

  • RetryManager - Core retry orchestration
  • RetryConfig - Retry behavior configuration
  • RetryReason - Enumeration of retry reasons
  • RetryAnalysis - Result of retry eligibility analysis

Key Features:

  • Failure Classification: Detects retriable vs non-retriable failures from error messages
  • Exponential Backoff: Configurable base, multiplier, and max backoff (default: 1s, 2x, 300s max)
  • Jitter: Random variance (±20%) to prevent thundering herd
  • Retry Chain Tracking: Links retries to original execution via metadata
  • Exhaustion Handling: Stops retrying when max_retries reached

Retriable Failure Patterns:

  • Worker queue TTL expired
  • Worker unavailable
  • Timeout/timed out
  • Heartbeat stale
  • Transient/temporary errors
  • Connection refused/reset

Non-Retriable Failures:

  • Validation errors
  • Permission denied
  • Action not found
  • Invalid parameters
  • Unknown/unclassified errors (conservative approach)

3. Worker Health Probe Module

New File: crates/executor/src/worker_health.rs (464 lines)

Components:

  • WorkerHealthProbe - Health monitoring and evaluation
  • HealthProbeConfig - Health check configuration
  • HealthStatus - Health state enum (Healthy, Degraded, Unhealthy)
  • HealthMetrics - Worker health metrics structure

Health States:

Healthy:

  • Heartbeat < 30 seconds old
  • Consecutive failures < 3
  • Queue depth < 50
  • Failure rate < 30%

Degraded:

  • Consecutive failures: 3-9
  • Queue depth: 50-99
  • Failure rate: 30-69%
  • Still receives work but deprioritized

Unhealthy:

  • Heartbeat > 30 seconds stale
  • Consecutive failures ≥ 10
  • Queue depth ≥ 100
  • Failure rate ≥ 70%
  • Does NOT receive new executions

Features:

  • Proactive Health Checks: Evaluate worker health before scheduling
  • Health-Aware Selection: Sort workers by health status and queue depth
  • Runtime Filtering: Select best worker for specific runtime
  • Metrics Extraction: Parse health data from worker capabilities JSONB

4. Module Integration

Updated Files:

  • crates/executor/src/lib.rs - Export retry and health modules
  • crates/executor/src/main.rs - Declare modules
  • crates/executor/Cargo.toml - Add rand dependency for jitter

Public API Exports:

pub use retry_manager::{RetryAnalysis, RetryConfig, RetryManager, RetryReason};
pub use worker_health::{HealthMetrics, HealthProbeConfig, HealthStatus, WorkerHealthProbe};

5. Documentation

Quick Reference Guide: docs/QUICKREF-phase3-retry-health.md (460 lines)

  • Retry behavior and configuration
  • Worker health states and metrics
  • Database schema reference
  • Practical SQL examples
  • Monitoring queries
  • Troubleshooting guides
  • Integration with Phases 1 & 2

Technical Details

Retry Flow

Execution fails → Retry Manager analyzes failure
    ↓
Is failure retriable?
    ↓ Yes
Check retry count < max_retries?
    ↓ Yes
Calculate exponential backoff with jitter
    ↓
Create retry execution with metadata:
  - retry_count++
  - original_execution
  - retry_reason
  - retry_at timestamp
    ↓
Schedule retry after backoff delay
    ↓
Success or exhaust retries

Worker Selection Flow

Get runtime requirement → Health Probe queries all workers
    ↓
Filter by:
  1. Active status
  2. Fresh heartbeat
  3. Runtime support
    ↓
Sort by:
  1. Health status (healthy > degraded > unhealthy)
  2. Queue depth (ascending)
    ↓
Return best worker or None

Backoff Calculation

backoff = base_secs * (multiplier ^ retry_count)
backoff = min(backoff, max_backoff_secs)
jitter = random(1 - jitter_factor, 1 + jitter_factor)
final_backoff = backoff * jitter

Example:

  • Attempt 0: ~1s (0.8-1.2s with 20% jitter)
  • Attempt 1: ~2s (1.6-2.4s)
  • Attempt 2: ~4s (3.2-4.8s)
  • Attempt 3: ~8s (6.4-9.6s)
  • Attempt N: min(base * 2^N, 300s) with jitter

Configuration

Retry Manager

RetryConfig {
    enabled: true,                    // Enable automatic retries
    base_backoff_secs: 1,             // Initial backoff
    max_backoff_secs: 300,            // 5 minutes maximum
    backoff_multiplier: 2.0,          // Exponential growth
    jitter_factor: 0.2,               // 20% randomization
}

Health Probe

HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,            // Consecutive failures
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,       // 30%
    failure_rate_unhealthy: 0.7,      // 70%
}

Per-Action Configuration

# packs/mypack/actions/api-call.yaml
name: external_api_call
runtime: python
entrypoint: actions/api.py
timeout_seconds: 120        # 2 minutes (overrides global 5 min TTL)
max_retries: 3              # Retry up to 3 times on failure

Testing

Compilation

  • All crates compile cleanly with zero warnings
  • Added rand dependency for jitter calculation
  • All public API methods properly documented

Database Migration

  • SQLx compatible migration file
  • Adds all necessary columns, indexes, views, functions
  • Includes comprehensive comments
  • Backward compatible (nullable fields)

Unit Tests

  • Retry reason detection from error messages
  • Retriable error pattern matching
  • Backoff calculation (exponential with jitter)
  • Health status extraction from worker capabilities
  • Configuration defaults

Integration Status

Complete

  • Database schema
  • Retry manager module with full logic
  • Worker health probe module
  • Module exports and integration
  • Comprehensive documentation

Pending (Future Integration)

  • Wire retry manager into completion listener
  • Wire health probe into scheduler
  • Add retry API endpoints
  • Update worker to report health metrics
  • Add retry/health UI components

Note: Phase 3 provides the foundation and API. Full integration will occur in subsequent work as the system is tested and refined.

Benefits

Automatic Recovery

  • Transient Failures: Retry worker unavailability, timeouts, network issues
  • No Manual Intervention: System self-heals from temporary problems
  • Exponential Backoff: Avoids overwhelming struggling resources
  • Jitter: Prevents thundering herd problem

Intelligent Scheduling

  • Health-Aware: Avoid unhealthy workers proactively
  • Load Balancing: Prefer workers with lower queue depth
  • Runtime Matching: Only select workers supporting required runtime
  • Graceful Degradation: Degraded workers still used if necessary

Operational Visibility

  • Retry Metrics: Track retry rates, reasons, success rates
  • Health Metrics: Monitor worker health distribution
  • Failure Classification: Understand why executions fail
  • Retry Chains: Trace execution attempts through retries

Flexibility

  • Per-Action Config: Custom timeouts and retry limits per action
  • Global Config: Override retry/health settings for entire system
  • Tunable Thresholds: Adjust health and retry parameters
  • Extensible: Easy to add new retry reasons or health factors

Relationship to Previous Phases

Defense in Depth

Phase 1 (Timeout Monitor):

  • Monitors database for stuck SCHEDULED executions
  • Fails executions after timeout (default: 5 minutes)
  • Acts as backstop for all phases

Phase 2 (Queue TTL + DLQ):

  • Expires messages in worker queues (default: 5 minutes)
  • Routes expired messages to DLQ
  • DLQ handler marks executions as FAILED

Phase 3 (Intelligent Retry + Health):

  • Analyzes failures and retries if retriable
  • Exponential backoff prevents immediate re-failure
  • Health-aware selection avoids problematic workers

Failure Flow Integration

Execution scheduled → Sent to worker queue (Phase 2 TTL active)
    ↓
Worker unavailable → Message expires (5 min)
    ↓
DLQ handler fails execution (Phase 2)
    ↓
Retry manager detects retriable failure (Phase 3)
    ↓
Create retry with backoff (Phase 3)
    ↓
Health probe selects healthy worker (Phase 3)
    ↓
Retry succeeds or exhausts attempts
    ↓
If stuck, Phase 1 timeout monitor catches it (safety net)

Complementary Mechanisms

  • Phase 1: Polling-based safety net (catches anything missed)
  • Phase 2: Message-level expiration (precise timing)
  • Phase 3: Active recovery (automatic retry) + Prevention (health checks)

Together: Complete reliability from failure detection → automatic recovery

Known Limitations

  1. Not Fully Integrated: Modules are standalone, not yet wired into executor/worker
  2. No Worker Health Reporting: Workers don't yet update health metrics
  3. No Retry API: Manual retry requires direct execution creation
  4. No UI Components: Web UI doesn't display retry chains or health
  5. No per-action TTL: Worker queue TTL still global (schema supports it)

Files Modified/Created

New Files (4)

  • migrations/20260209000000_phase3_retry_and_health.sql (127 lines)
  • crates/executor/src/retry_manager.rs (487 lines)
  • crates/executor/src/worker_health.rs (464 lines)
  • docs/QUICKREF-phase3-retry-health.md (460 lines)

Modified Files (4)

  • crates/executor/src/lib.rs (+4 lines)
  • crates/executor/src/main.rs (+2 lines)
  • crates/executor/Cargo.toml (+1 line)
  • work-summary/2026-02-09-phase3-retry-health.md (this document)

Total Changes

  • New Files: 4
  • Modified Files: 4
  • Lines Added: ~1,550
  • Lines Removed: ~0

Deployment Notes

  1. Database Migration Required: Run sqlx migrate run before deploying
  2. No Breaking Changes: All new fields are nullable or have defaults
  3. Backward Compatible: Existing executions work without retry metadata
  4. No Configuration Required: Sensible defaults for all settings
  5. Incremental Adoption: Retry/health features can be enabled per-action

Next Steps

Immediate (Complete Phase 3 Integration)

  1. Wire Retry Manager: Integrate into completion listener to create retries
  2. Wire Health Probe: Integrate into scheduler for worker selection
  3. Worker Health Reporting: Update workers to report health metrics
  4. Add API Endpoints: /api/v1/executions/{id}/retry endpoint
  5. Testing: End-to-end tests with retry scenarios

Short Term (Enhance Phase 3)

  1. Retry UI: Display retry chains and status in web UI
  2. Health Dashboard: Visualize worker health distribution
  3. Per-Action TTL: Use action.timeout_seconds for custom queue TTL
  4. Retry Policies: Allow pack-level retry configuration
  5. Health Probes: Active HTTP health checks to workers

Long Term (Advanced Features)

  1. Circuit Breakers: Automatically disable failing actions
  2. Retry Quotas: Limit total retries per time window
  3. Smart Routing: Affinity-based worker selection
  4. Predictive Health: ML-based health prediction
  5. Auto-scaling: Scale workers based on queue depth and health

Monitoring Recommendations

Key Metrics to Track

  • Retry Rate: % of executions that retry
  • Retry Success Rate: % of retries that eventually succeed
  • Retry Reason Distribution: Which failures are most common
  • Worker Health Distribution: Healthy/degraded/unhealthy counts
  • Average Queue Depth: Per-worker queue occupancy
  • Health-Driven Routing: % of executions using health-aware selection

Alert Thresholds

  • Warning: Retry rate > 20%, unhealthy workers > 30%
  • Critical: Retry rate > 50%, unhealthy workers > 70%

SQL Monitoring Queries

See docs/QUICKREF-phase3-retry-health.md for comprehensive monitoring queries including:

  • Retry rate over time
  • Retry success rate by reason
  • Worker health distribution
  • Queue depth analysis
  • Retry chain tracing

References

  • Phase 1 Summary: work-summary/2026-02-09-worker-availability-phase1.md
  • Phase 2 Summary: work-summary/2026-02-09-worker-queue-ttl-phase2.md
  • Quick Reference: docs/QUICKREF-phase3-retry-health.md
  • Architecture: docs/architecture/worker-availability-handling.md

Conclusion

Phase 3 provides the foundation for intelligent retry logic and health-aware worker selection. The modules are fully implemented with comprehensive error handling, configuration options, and documentation. While not yet fully integrated into the executor/worker services, the groundwork is complete and ready for incremental integration and testing.

Together with Phases 1 and 2, the Attune platform now has a complete three-layer reliability system:

  1. Detection (Phase 1): Timeout monitor catches stuck executions
  2. Handling (Phase 2): Queue TTL and DLQ fail unavailable workers
  3. Recovery (Phase 3): Intelligent retry and health-aware scheduling

This defense-in-depth approach ensures executions are resilient to transient failures while maintaining system stability and performance. 🚀