more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions

# Quick Reference: Worker Heartbeat Monitoring
**Purpose**: Automatically detect and deactivate workers that have stopped sending heartbeats
## Overview
The executor service includes a background task that monitors worker heartbeats and automatically marks stale workers as inactive. This prevents the scheduler from attempting to assign work to workers that are no longer available.
## How It Works
### Background Monitor Task
- **Location**: `crates/executor/src/service.rs`, `worker_heartbeat_monitor_loop()`
- **Check Interval**: Every 60 seconds
- **Staleness Threshold**: 90 seconds (3x the expected 30-second heartbeat interval)
### Detection Logic
The monitor checks all workers with `status = 'active'`:
1. **No Heartbeat**: Workers with `last_heartbeat = NULL` → marked inactive
2. **Stale Heartbeat**: Workers with heartbeat older than 90 seconds → marked inactive
3. **Fresh Heartbeat**: Workers with heartbeat within 90 seconds → remain active
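The three rules above can be captured in a single predicate. A minimal sketch (function name and plain-integer timestamps are illustrative, not the service's actual code):

```rust
/// Returns true when a worker should be marked inactive: either it has
/// never sent a heartbeat, or its last one is older than `max_staleness`.
fn is_heartbeat_stale(last_heartbeat: Option<u64>, now: u64, max_staleness: u64) -> bool {
    match last_heartbeat {
        None => true, // rule 1: no heartbeat recorded at all
        // rules 2 and 3: compare heartbeat age against the 90 s threshold
        Some(ts) => now.saturating_sub(ts) > max_staleness,
    }
}

fn main() {
    let now = 1_000;
    assert!(is_heartbeat_stale(None, now, 90));             // never reported
    assert!(is_heartbeat_stale(Some(now - 120), now, 90));  // 120 s old: stale
    assert!(!is_heartbeat_stale(Some(now - 30), now, 90));  // 30 s old: fresh
}
```

Note that a heartbeat exactly 90 seconds old still counts as fresh; only strictly older heartbeats trip the threshold.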
### Automatic Deactivation
When a stale worker is detected:
- Worker status updated to `inactive` in database
- Warning logged with worker name, ID, and heartbeat age
- Summary logged with count of deactivated workers
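The deactivation amounts to a single conditional update. A sketch of the equivalent SQL (the service performs this through its own data layer, so the exact statement is an assumption):

```sql
-- Illustrative only: deactivate active workers with a missing or stale heartbeat
UPDATE worker
SET status = 'inactive', updated = NOW()
WHERE status = 'active'
  AND (last_heartbeat IS NULL
       OR last_heartbeat < NOW() - INTERVAL '90 seconds');
```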
## Configuration
### Constants (in scheduler.rs and service.rs)
```rust
/// Expected worker heartbeat frequency, in seconds.
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
/// Grace-period multiplier applied to the heartbeat interval.
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;
/// Maximum allowed heartbeat age: 30 * 3 = 90 seconds.
const MAX_STALENESS: u64 = DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER;
```
### Check Interval
Currently hardcoded to 60 seconds. Configured when spawning the monitor task:
```rust
Self::worker_heartbeat_monitor_loop(worker_pool, 60).await;
```
## Worker Lifecycle
### Normal Operation
```
Worker Starts → Registers → Sends Heartbeats (30s) → Remains Active
```
### Graceful Shutdown
```
Worker Stops → No More Heartbeats → Monitor Detects (60s) → Marked Inactive
```
### Crash/Network Failure
```
Worker Crashes → Heartbeats Stop → Monitor Detects (60s) → Marked Inactive
```
## Monitoring
### Check Active Workers
```sql
SELECT name, worker_role, status, last_heartbeat
FROM worker
WHERE status = 'active'
ORDER BY last_heartbeat DESC;
```
### Check Recent Deactivations
```sql
SELECT name, worker_role, status, last_heartbeat, updated
FROM worker
WHERE status = 'inactive'
AND updated > NOW() - INTERVAL '5 minutes'
ORDER BY updated DESC;
```
### Count Workers by Status
```sql
SELECT status, COUNT(*)
FROM worker
GROUP BY status;
```
## Logs
### Monitor Startup
```
INFO: Starting worker heartbeat monitor...
INFO: Worker heartbeat monitor started (check interval: 60s, staleness threshold: 90s)
```
### Worker Deactivation
```
WARN: Worker sensor-77cd23b50478 (ID: 27) heartbeat is stale (1289s old), marking as inactive
INFO: Deactivated 5 worker(s) with stale heartbeats
```
### Error Handling
```
ERROR: Failed to deactivate worker worker-123 (stale heartbeat): <error details>
ERROR: Failed to query active workers for heartbeat check: <error details>
```
## Scheduler Integration
The scheduler already filters out stale workers during worker selection:
```rust
// Filter by heartbeat freshness
let fresh_workers: Vec<_> = active_workers
.into_iter()
.filter(|w| Self::is_worker_heartbeat_fresh(w))
.collect();
```
- **Before the heartbeat monitor**: the scheduler filtered at selection time, but workers stayed `active` in the database
- **After the heartbeat monitor**: workers are marked `inactive` in the database, so the scheduler sees accurate state
## Troubleshooting
### Workers Constantly Becoming Inactive
**Symptoms**: Active workers being marked inactive despite running
**Causes**:
- Worker heartbeat interval > 30 seconds
- Network issues preventing heartbeat messages
- Worker service crash loop
**Solutions**:
1. Check worker logs for heartbeat send attempts
2. Verify RabbitMQ connectivity
3. Check worker configuration for heartbeat interval
### Stale Workers Not Being Deactivated
**Symptoms**: Workers with old heartbeats remain active
**Causes**:
- Executor service not running
- Monitor task crashed
**Solutions**:
1. Check executor service logs
2. Verify monitor task started: `grep "heartbeat monitor started" executor.log`
3. Restart executor service
### Too Many Inactive Workers
**Symptoms**: Database has hundreds of inactive workers
**Causes**: Historical workers from development/testing
**Solutions**:
```sql
-- Delete inactive workers older than 7 days
DELETE FROM worker
WHERE status = 'inactive'
AND updated < NOW() - INTERVAL '7 days';
```
## Best Practices
### Worker Registration
Workers should:
- Set appropriate unique name (hostname-based)
- Send heartbeat every 30 seconds
- Handle graceful shutdown (optional: mark self inactive)
### Database Maintenance
- Periodically clean up old inactive workers
- Monitor worker table growth
- Index on `status` and `last_heartbeat` for efficient queries
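The suggested index might look like the following (index name and column order are illustrative; adapt to your schema and migration tooling):

```sql
-- Hypothetical composite index supporting the monitor's active-worker scans
CREATE INDEX idx_worker_status_heartbeat ON worker (status, last_heartbeat);
```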
### Monitoring & Alerts
- Track worker deactivation rate (should be low in production)
- Alert on sudden increase in deactivations (infrastructure issue)
- Monitor active worker count vs. expected
## Related Documentation
- `docs/architecture/worker-service.md` - Worker architecture
- `docs/architecture/executor-service.md` - Executor architecture
- `docs/deployment/ops-runbook-queues.md` - Operational procedures
- `AGENTS.md` - Project rules and conventions
## Implementation Notes
### Why 90 Seconds?
- Worker sends heartbeat every 30 seconds
- 3x multiplier provides grace period for:
- Network latency
- Brief load spikes
- Temporary connectivity issues
- Balances responsiveness vs. false positives
### Why Check Every 60 Seconds?
- Each 60-second check spans two 30-second heartbeat intervals
- Reduces database query frequency
- Adequate response time (a stale worker is deactivated within 90-150 seconds of its last heartbeat)
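The two tuning choices combine into a bounded detection latency: the staleness window plus, at worst, one full check interval. A small sketch of the arithmetic (function name is illustrative):

```rust
/// Worst-case time from a worker's last heartbeat to its deactivation:
/// the staleness window plus one full monitor check interval.
fn worst_case_detection_secs(heartbeat: u64, multiplier: u64, check_interval: u64) -> u64 {
    heartbeat * multiplier + check_interval
}

fn main() {
    // Documented defaults: 30 s heartbeats, 3x grace, 60 s checks.
    let worst = worst_case_detection_secs(30, 3, 60);
    println!("stale worker deactivated within {worst} s"); // 150 s = 2.5 min
}
```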
### Thread Safety
- Monitor runs in separate tokio task
- Uses connection pool for database access
- No shared mutable state
- Safe to run multiple executor instances (each monitors independently)