# Worker Availability Handling - Gap Analysis

**Date**: 2026-02-09
**Status**: Investigation Complete - Implementation Pending
**Priority**: High
**Impact**: Operational Reliability

## Issue Reported

User reported that when workers are brought down (e.g., `docker compose down worker-shell`), the executor continues attempting to send executions to the unavailable workers, resulting in stuck executions that never complete or fail.

## Investigation Summary

Investigated the executor's worker selection and scheduling logic to understand how worker availability is determined and what happens when workers become unavailable.

### Current Architecture

**Heartbeat-Based Availability:**
- Workers send heartbeats to database every 30 seconds (configurable)
- Scheduler filters workers based on heartbeat freshness
- Workers are considered "stale" if heartbeat is older than 90 seconds (3x heartbeat interval)
- Only workers with fresh heartbeats are eligible for scheduling

**Scheduling Flow:**
```
Execution (REQUESTED)
  → Scheduler finds worker with fresh heartbeat
  → Execution status updated to SCHEDULED
  → Message published to worker-specific queue
  → Worker consumes and executes
```

### Root Causes Identified

1. **Heartbeat Staleness Window**: Workers can stop within the 90-second staleness window and still appear "available"
   - Worker sends heartbeat at T=0
   - Worker stops at T=30
   - Scheduler can still select this worker until T=90
   - 60-second window where dead worker appears healthy

2. **No Execution Timeout**: Once scheduled, executions have no timeout mechanism
   - Execution remains in SCHEDULED status indefinitely
   - No background process monitors scheduled executions
   - No automatic failure after reasonable time period

3. **Message Queue Accumulation**: Messages sit in worker-specific queues forever
   - Worker-specific queues: `attune.execution.worker.{worker_id}`
   - No TTL configured on these queues
   - No dead letter queue (DLQ) for expired messages
   - Messages never expire even if worker is permanently down

4. **No Graceful Shutdown**: Workers don't update their status when stopping
   - Docker SIGTERM signal not handled
   - Worker status remains "active" in database
   - No notification that worker is shutting down

5. **Retry Logic Issues**: Failed scheduling doesn't trigger meaningful retries
   - Scheduler returns error if no workers available
   - Error triggers message requeue (via nack)
   - But if worker WAS available during scheduling, message is successfully published
   - No mechanism to detect that worker never picked up the message

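The window arithmetic in cause 1 can be sanity-checked in a couple of lines. A std-only sketch (the constant values — 30-second interval, 3x multiplier — are taken from this document; the function names are illustrative):

```rust
// Staleness threshold: how old a heartbeat may be before the scheduler
// considers the worker stale (interval * multiplier).
fn staleness_threshold_secs(heartbeat_interval: u64, multiplier: u64) -> u64 {
    heartbeat_interval * multiplier
}

// Worst-case window (seconds) during which a dead worker can still be
// selected, assuming it dies just before its next heartbeat was due:
// threshold minus one interval.
fn dead_but_selectable_secs(heartbeat_interval: u64, multiplier: u64) -> u64 {
    staleness_threshold_secs(heartbeat_interval, multiplier) - heartbeat_interval
}
```

The same formula explains the later claim that dropping the interval to 10 seconds shrinks the window from 60 to 20 seconds.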
### Code Locations

**Heartbeat Check:**
```rust
// crates/executor/src/scheduler.rs:226-241
fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
    ); // 30 * 3 = 90 seconds

    // `age` (now minus the worker's last heartbeat) is computed above, elided here
    let is_fresh = age.to_std().unwrap_or(Duration::MAX) <= max_age;
    // ...
}
```

**Worker Selection:**
```rust
// crates/executor/src/scheduler.rs:171-246
async fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
    // 1. Find action workers
    // 2. Filter by runtime compatibility
    // 3. Filter by active status
    // 4. Filter by heartbeat freshness ← Gap: 90s window
    // 5. Select first available (no load balancing)
}
```

**Message Queue Consumer:**
```rust
// crates/common/src/mq/consumer.rs:150-175
match handler(envelope.clone()).await {
    Err(e) => {
        let requeue = e.is_retriable(); // Only retries connection errors
        channel.basic_nack(delivery_tag, BasicNackOptions { requeue, ..Default::default() })
    }
    // Ok arm (ack) elided
}
```

## Impact Analysis

### User Experience
- **Stuck executions**: Appear to be running but never complete
- **No feedback**: Users don't know execution failed until they check manually
- **Confusion**: Status shows SCHEDULED but nothing happens
- **Lost work**: Executions that could have been routed to healthy workers are stuck

### System Impact
- **Queue buildup**: Messages accumulate in unavailable worker queues
- **Database pollution**: SCHEDULED executions remain in database indefinitely
- **Resource waste**: Memory and disk consumed by stuck state
- **Monitoring gaps**: No clear way to detect this condition

### Severity
**HIGH** - This affects core functionality (execution reliability) and user trust in the system. In production, this would result in:
- Failed automations with no notification
- Debugging difficulties (why didn't my rule execute?)
- Potential data loss (execution intended to process event is lost)

## Proposed Solutions

Comprehensive solution document created at: `docs/architecture/worker-availability-handling.md`

### Phase 1: Immediate Fixes (HIGH PRIORITY)

#### 1. Execution Timeout Monitor
**Purpose**: Fail executions that remain SCHEDULED too long

**Implementation:**
- Background task in executor service
- Checks every 60 seconds for stale scheduled executions
- Fails executions older than 5 minutes
- Updates status to FAILED with descriptive error
- Publishes ExecutionCompleted notification

**Impact**: Prevents indefinitely stuck executions
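The monitor's core check is a simple age filter. A std-only sketch (in the real service this would be a SQL query against the executions table; the struct and field names here are illustrative, not the actual schema):

```rust
use std::time::{Duration, Instant};

// Illustrative in-memory stand-in for rows in the executions table.
struct ScheduledExecution {
    id: u64,
    scheduled_at: Instant,
}

// Returns the ids of executions that have sat in SCHEDULED longer than
// `timeout` and should be failed with a descriptive error. The background
// task would run this every `timeout_check_interval` seconds.
fn find_timed_out(
    execs: &[ScheduledExecution],
    now: Instant,
    timeout: Duration,
) -> Vec<u64> {
    execs
        .iter()
        .filter(|e| now.duration_since(e.scheduled_at) > timeout)
        .map(|e| e.id)
        .collect()
}
```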

#### 2. Graceful Worker Shutdown
**Purpose**: Mark workers inactive before they stop

**Implementation:**
- Add SIGTERM handler to worker service
- Update worker status to INACTIVE in database
- Stop consuming from queue
- Wait for in-flight tasks to complete (30s timeout)
- Then exit

**Impact**: Reduces window where dead worker appears available

### Phase 2: Medium-Term Improvements (MEDIUM PRIORITY)

#### 3. Worker Queue TTL + Dead Letter Queue
**Purpose**: Expire messages that sit too long in worker queues

**Implementation:**
- Configure `x-message-ttl: 300000` (5 minutes) on worker queues
- Configure `x-dead-letter-exchange` to route expired messages
- Create DLQ exchange and queue
- Add dead letter handler to fail executions from DLQ

**Impact**: Prevents message queue buildup
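One way these settings might be carried in configuration (a sketch: the `mq.worker_queue_arguments` key and the `attune.dlx` exchange name are hypothetical, though `x-message-ttl` and `x-dead-letter-exchange` are standard RabbitMQ queue arguments):

```yaml
mq:
  worker_queue_arguments:
    x-message-ttl: 300000              # expire messages after 5 minutes
    x-dead-letter-exchange: attune.dlx # route expired messages to the DLQ exchange
```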

#### 4. Reduced Heartbeat Interval
**Purpose**: Detect unavailable workers faster

**Configuration Changes:**
```yaml
worker:
  heartbeat_interval: 10 # Down from 30 seconds

executor:
  # Staleness = 10 * 3 = 30 seconds (down from 90s)
```

**Impact**: 60-second window reduced to 20 seconds

### Phase 3: Long-Term Enhancements (LOW PRIORITY)

#### 5. Active Health Probes
**Purpose**: Verify worker availability beyond heartbeats

**Implementation:**
- Add health endpoint to worker service
- Background health checker in executor
- Pings workers periodically
- Marks workers INACTIVE if unresponsive

**Impact**: More reliable availability detection

#### 6. Intelligent Retry with Worker Affinity
**Purpose**: Reschedule failed executions to different workers

**Implementation:**
- Track which worker was assigned to execution
- On timeout, reschedule to different worker
- Implement exponential backoff
- Maximum retry limit

**Impact**: Better fault tolerance
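The exponential backoff in item 6 could be as simple as a capped doubling schedule. A sketch (base delay, cap, and retry limit are not specified in this document, so the values below are placeholders):

```rust
use std::time::Duration;

// Capped exponential backoff: base * 2^attempt, saturating at `max`.
// `attempt` is 0-based, so the first retry waits `base`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}
```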

## Recommended Immediate Actions

1. **Deploy Execution Timeout Monitor** (Week 1)
   - Add timeout check to executor service
   - Configure 5-minute timeout for SCHEDULED executions
   - Monitor timeout rate to tune values

2. **Add Graceful Shutdown to Workers** (Week 1)
   - Implement SIGTERM handler
   - Update Docker Compose `stop_grace_period: 45s`
   - Test worker restart scenarios

3. **Reduce Heartbeat Interval** (Week 1)
   - Update config: `worker.heartbeat_interval: 10`
   - Reduces staleness window from 90s to 30s
   - Low-risk configuration change

4. **Document Known Limitation** (Week 1)
   - Add operational notes about worker restart behavior
   - Document expected timeout duration
   - Provide troubleshooting guide
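The Docker Compose change in action 2 is a one-line addition per worker service (`stop_grace_period` is standard Compose syntax; the `worker-shell` service name is taken from the examples in this document):

```yaml
services:
  worker-shell:
    stop_grace_period: 45s # SIGTERM handling + 30s in-flight drain, before SIGKILL
```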

## Testing Strategy

### Manual Testing
1. Start system with worker running
2. Create execution
3. Immediately stop worker: `docker compose stop worker-shell`
4. Observe execution status over 5 minutes
5. Verify execution fails with timeout error
6. Verify notification sent to user

### Integration Tests
```rust
#[tokio::test]
async fn test_execution_timeout_on_worker_unavailable() {
    // 1. Create worker and start heartbeat
    // 2. Schedule execution
    // 3. Stop worker (no graceful shutdown)
    // 4. Wait > timeout duration
    // 5. Assert execution status = FAILED
    // 6. Assert error message contains "timeout"
}

#[tokio::test]
async fn test_graceful_worker_shutdown() {
    // 1. Create worker with active execution
    // 2. Send SIGTERM
    // 3. Verify worker status → INACTIVE
    // 4. Verify existing execution completes
    // 5. Verify new executions not scheduled to this worker
}
```

### Load Testing
- Test with multiple workers
- Stop workers randomly during execution
- Verify executions redistribute to healthy workers
- Measure timeout detection latency

## Metrics to Monitor Post-Deployment

1. **Execution Timeout Rate**: Track how often executions time out
2. **Timeout Latency**: Time from worker stop to execution failure
3. **Queue Depth**: Monitor worker-specific queue lengths
4. **Heartbeat Gaps**: Track time between last heartbeat and status change
5. **Worker Restart Impact**: Measure execution disruption during restarts

## Configuration Recommendations

### Development
```yaml
executor:
  scheduled_timeout: 120 # 2 minutes (faster feedback)
  timeout_check_interval: 30 # Check every 30 seconds

worker:
  heartbeat_interval: 10
  shutdown_timeout: 15
```

### Production
```yaml
executor:
  scheduled_timeout: 300 # 5 minutes
  timeout_check_interval: 60 # Check every minute

worker:
  heartbeat_interval: 10
  shutdown_timeout: 30
```

## Related Work

This investigation complements:
- **2026-02-09 DOTENV Parameter Flattening**: Fixes action execution parameters
- **2026-02-09 URL Query Parameter Support**: Improves web UI filtering
- **Worker Heartbeat Monitoring**: Existing heartbeat mechanism (needs enhancement)

Together, these improvements address both execution correctness (parameter passing) and execution reliability (worker availability).

## Documentation Created

1. `docs/architecture/worker-availability-handling.md` - Comprehensive solution guide
   - Problem statement and current architecture
   - Detailed solutions with code examples
   - Implementation priorities and phases
   - Configuration recommendations
   - Testing strategies
   - Migration path

## Next Steps

1. **Review solutions document** with team
2. **Prioritize implementation** based on urgency and resources
3. **Create implementation tickets** for each solution
4. **Schedule deployment** of Phase 1 fixes
5. **Establish monitoring** for new metrics
6. **Document operational procedures** for worker management

## Conclusion

The executor lacks robust handling for worker unavailability, relying solely on heartbeat staleness checks with a wide time window. Multiple complementary solutions are needed:

- **Short-term**: Timeout monitor + graceful shutdown (prevents indefinite stuck state)
- **Medium-term**: Queue TTL + DLQ (prevents message buildup)
- **Long-term**: Health probes + retry logic (improves reliability)

**Priority**: Phase 1 solutions should be implemented immediately as they address critical operational gaps that affect system reliability and user experience.