# Quick Reference: Worker Heartbeat Monitoring

**Purpose**: Automatically detect and deactivate workers that have stopped sending heartbeats

## Overview

The executor service includes a background task that monitors worker heartbeats and automatically marks stale workers as inactive. This prevents the scheduler from attempting to assign work to workers that are no longer available.
## How It Works

### Background Monitor Task

- **Location**: `crates/executor/src/service.rs` → `worker_heartbeat_monitor_loop()`
- **Check Interval**: Every 60 seconds
- **Staleness Threshold**: 90 seconds (3x the expected 30-second heartbeat interval)
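
The overall shape of the task can be sketched as a bounded, synchronous model (the real `worker_heartbeat_monitor_loop()` runs as an unbounded async tokio task, and the per-tick check queries the database; the names and signature below are illustrative only):

```rust
use std::time::Duration;

// Simplified model of the monitor loop: run a check on every tick.
// In the service, `run_check` queries active workers and deactivates stale ones,
// and the loop never terminates; `ticks` is only here to make the sketch finite.
fn monitor_loop(check_interval: Duration, ticks: u32, mut run_check: impl FnMut()) {
    for _ in 0..ticks {
        run_check();
        std::thread::sleep(check_interval);
    }
}
```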

### Detection Logic

The monitor checks all workers with `status = 'active'`:

1. **No Heartbeat**: Workers with `last_heartbeat = NULL` → marked inactive
2. **Stale Heartbeat**: Workers with heartbeat older than 90 seconds → marked inactive
3. **Fresh Heartbeat**: Workers with heartbeat within 90 seconds → remain active
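
The three cases reduce to a single predicate. A minimal sketch (illustrative names, not the actual service code; `heartbeat_age_secs` is `None` when `last_heartbeat` is NULL):

```rust
// Decide whether an active worker should be marked inactive.
fn should_deactivate(heartbeat_age_secs: Option<u64>, max_staleness_secs: u64) -> bool {
    match heartbeat_age_secs {
        None => true,                          // case 1: no heartbeat ever recorded
        Some(age) => age > max_staleness_secs, // case 2 (stale) vs. case 3 (fresh)
    }
}
```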

### Automatic Deactivation

When a stale worker is detected:
- Worker status updated to `inactive` in database
- Warning logged with worker name, ID, and heartbeat age
- Summary logged with count of deactivated workers
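
The deactivation pass can be modeled in memory as follows (a sketch only: the real task updates rows in the database and emits the log lines described in the Logs section; the types and field names here are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum Status { Active, Inactive }

struct Worker {
    name: String,
    heartbeat_age_secs: Option<u64>, // None models last_heartbeat = NULL
    status: Status,
}

// Mark stale active workers inactive; return how many were deactivated,
// matching the summary log line ("Deactivated N worker(s)...").
fn deactivate_stale(workers: &mut [Worker], max_staleness_secs: u64) -> usize {
    let mut deactivated = 0;
    for w in workers.iter_mut() {
        if w.status != Status::Active {
            continue; // only active workers are checked
        }
        let stale = match w.heartbeat_age_secs {
            None => true,
            Some(age) => age > max_staleness_secs,
        };
        if stale {
            w.status = Status::Inactive; // a WARN with name/ID/age would be logged here
            deactivated += 1;
        }
    }
    deactivated
}
```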

## Configuration

### Constants (in scheduler.rs and service.rs)

```rust
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30; // seconds; expected worker heartbeat frequency
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3; // grace period multiplier
const MAX_STALENESS: u64 = DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER; // 90 seconds
```

### Check Interval

The check interval is currently hardcoded to 60 seconds and is set when the monitor task is spawned:

```rust
Self::worker_heartbeat_monitor_loop(worker_pool, 60).await;
```

## Worker Lifecycle

### Normal Operation

```
Worker Starts → Registers → Sends Heartbeats (30s) → Remains Active
```

### Graceful Shutdown

```
Worker Stops → No More Heartbeats → Monitor Detects (60s) → Marked Inactive
```

### Crash/Network Failure

```
Worker Crashes → Heartbeats Stop → Monitor Detects (60s) → Marked Inactive
```

## Monitoring

### Check Active Workers

```sql
SELECT name, worker_role, status, last_heartbeat
FROM worker
WHERE status = 'active'
ORDER BY last_heartbeat DESC;
```

### Check Recent Deactivations

```sql
SELECT name, worker_role, status, last_heartbeat, updated
FROM worker
WHERE status = 'inactive'
  AND updated > NOW() - INTERVAL '5 minutes'
ORDER BY updated DESC;
```

### Count Workers by Status

```sql
SELECT status, COUNT(*)
FROM worker
GROUP BY status;
```

## Logs

### Monitor Startup

```
INFO: Starting worker heartbeat monitor...
INFO: Worker heartbeat monitor started (check interval: 60s, staleness threshold: 90s)
```

### Worker Deactivation

```
WARN: Worker sensor-77cd23b50478 (ID: 27) heartbeat is stale (1289s old), marking as inactive
INFO: Deactivated 5 worker(s) with stale heartbeats
```

### Error Handling

```
ERROR: Failed to deactivate worker worker-123 (stale heartbeat): <error details>
ERROR: Failed to query active workers for heartbeat check: <error details>
```

## Scheduler Integration

The scheduler already filters out stale workers during worker selection:

```rust
// Filter by heartbeat freshness
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();
```

- **Before the heartbeat monitor**: the scheduler filtered at selection time, but workers stayed `active` in the database
- **After the heartbeat monitor**: workers are marked inactive in the database, so the scheduler sees accurate state
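
A self-contained version of that filter, with `is_worker_heartbeat_fresh` modeled as a simple age comparison against the 90-second threshold (illustrative, not the actual implementation):

```rust
struct Worker {
    heartbeat_age_secs: Option<u64>, // None models last_heartbeat = NULL
}

// A worker is fresh only if it has a heartbeat within the staleness threshold.
fn is_fresh(w: &Worker, max_staleness_secs: u64) -> bool {
    matches!(w.heartbeat_age_secs, Some(age) if age <= max_staleness_secs)
}

// Keep only fresh workers, mirroring the scheduler's selection-time filter.
fn filter_fresh(active_workers: Vec<Worker>, max_staleness_secs: u64) -> Vec<Worker> {
    active_workers
        .into_iter()
        .filter(|w| is_fresh(w, max_staleness_secs))
        .collect()
}
```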

## Troubleshooting

### Workers Constantly Becoming Inactive

**Symptoms**: Active workers being marked inactive despite running

**Causes**:
- Worker heartbeat interval > 30 seconds
- Network issues preventing heartbeat messages
- Worker service crash loop

**Solutions**:
1. Check worker logs for heartbeat send attempts
2. Verify RabbitMQ connectivity
3. Check worker configuration for heartbeat interval

### Stale Workers Not Being Deactivated

**Symptoms**: Workers with old heartbeats remain active

**Causes**:
- Executor service not running
- Monitor task crashed

**Solutions**:
1. Check executor service logs
2. Verify the monitor task started: `grep "heartbeat monitor started" executor.log`
3. Restart the executor service

### Too Many Inactive Workers

**Symptoms**: Database has hundreds of inactive workers

**Causes**: Historical workers from development/testing

**Solutions**:

```sql
-- Delete inactive workers older than 7 days
DELETE FROM worker
WHERE status = 'inactive'
  AND updated < NOW() - INTERVAL '7 days';
```

## Best Practices

### Worker Registration

Workers should:
- Set an appropriate unique name (hostname-based)
- Send a heartbeat every 30 seconds
- Handle graceful shutdown (optional: mark self inactive)
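
A shutdown-aware heartbeat loop might look like the following sketch, using an in-process channel as a stand-in for the real transport (the service uses RabbitMQ; all names here are illustrative):

```rust
use std::sync::mpsc;
use std::time::Duration;

// Send a heartbeat every `interval`, stopping cleanly when a shutdown
// signal arrives or either channel is closed.
fn heartbeat_loop(
    tx: mpsc::Sender<String>,
    name: &str,
    interval: Duration,
    shutdown: mpsc::Receiver<()>,
) {
    loop {
        if tx.send(format!("heartbeat:{name}")).is_err() {
            break; // receiver gone; nothing left to report to
        }
        // Sleep for one interval, but wake immediately on shutdown.
        match shutdown.recv_timeout(interval) {
            Ok(()) | Err(mpsc::RecvTimeoutError::Disconnected) => break,
            Err(mpsc::RecvTimeoutError::Timeout) => {} // keep heartbeating
        }
    }
}
```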

### Database Maintenance

- Periodically clean up old inactive workers
- Monitor worker table growth
- Index on `status` and `last_heartbeat` for efficient queries

### Monitoring & Alerts

- Track worker deactivation rate (should be low in production)
- Alert on sudden increase in deactivations (infrastructure issue)
- Monitor active worker count vs. expected

## Related Documentation

- `docs/architecture/worker-service.md` - Worker architecture
- `docs/architecture/executor-service.md` - Executor architecture
- `docs/deployment/ops-runbook-queues.md` - Operational procedures
- `AGENTS.md` - Project rules and conventions

## Implementation Notes

### Why 90 Seconds?

- Worker sends heartbeat every 30 seconds
- 3x multiplier provides grace period for:
  - Network latency
  - Brief load spikes
  - Temporary connectivity issues
- Balances responsiveness vs. false positives

### Why Check Every 60 Seconds?

- Allows 1.5 heartbeat intervals between checks
- Reduces database query frequency
- Adequate response time (a stale worker is deactivated within ~2.5 minutes of its last heartbeat)
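
Putting the two constants together gives the detection window (an illustrative helper, not code from the service): a worker becomes stale 90 seconds after its last heartbeat and is caught by the next check at most 60 seconds later, i.e. between 90 and 150 seconds after the last heartbeat.

```rust
// Bounds on the time from a worker's last heartbeat to its deactivation:
// best case, the check fires just as the heartbeat goes stale; worst case,
// it fires a full check interval later.
fn detection_window_secs(staleness: u64, check_interval: u64) -> (u64, u64) {
    (staleness, staleness + check_interval)
}
```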

### Thread Safety

- Monitor runs in separate tokio task
- Uses connection pool for database access
- No shared mutable state
- Safe to run multiple executor instances (each monitors independently)