Quick Reference: Worker Heartbeat Monitoring
Purpose: Automatically detect and deactivate workers that have stopped sending heartbeats
Overview
The executor service includes a background task that monitors worker heartbeats and automatically marks stale workers as inactive. This prevents the scheduler from attempting to assign work to workers that are no longer available.
How It Works
Background Monitor Task
- Location: crates/executor/src/service.rs → worker_heartbeat_monitor_loop()
- Check Interval: every 60 seconds
- Staleness Threshold: 90 seconds (3x the expected 30-second heartbeat interval)
Detection Logic
The monitor checks all workers with status = 'active':
- No Heartbeat: Workers with last_heartbeat = NULL → marked inactive
- Stale Heartbeat: Workers with a heartbeat older than 90 seconds → marked inactive
- Fresh Heartbeat: Workers with heartbeat within 90 seconds → remain active
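The detection rule above can be sketched as a small predicate. This is an illustrative sketch, not the actual code in service.rs; `is_stale` and the constant name are assumptions:

```rust
use std::time::{Duration, SystemTime};

// Staleness threshold from the configuration below: 30 s heartbeat interval × 3.
const MAX_STALENESS: Duration = Duration::from_secs(90);

/// Illustrative predicate: should this worker be marked inactive?
/// `last_heartbeat` is None when the worker has never sent a heartbeat.
fn is_stale(last_heartbeat: Option<SystemTime>, now: SystemTime) -> bool {
    match last_heartbeat {
        // Never sent a heartbeat -> deactivate.
        None => true,
        // Older than the threshold -> deactivate. A heartbeat from the
        // "future" (clock skew) yields Err and is treated as fresh.
        Some(ts) => now
            .duration_since(ts)
            .map(|age| age > MAX_STALENESS)
            .unwrap_or(false),
    }
}
```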
Automatic Deactivation
When a stale worker is detected:
- Worker status updated to inactive in the database
- Warning logged with worker name, ID, and heartbeat age
- Summary logged with count of deactivated workers
Configuration
Constants (in scheduler.rs and service.rs)
DEFAULT_HEARTBEAT_INTERVAL: 30 seconds // Expected worker heartbeat frequency
HEARTBEAT_STALENESS_MULTIPLIER: 3 // Grace period multiplier
MAX_STALENESS: 90 seconds // Calculated: 30 * 3
Check Interval
Currently hardcoded to 60 seconds. Configured when spawning the monitor task:
Self::worker_heartbeat_monitor_loop(worker_pool, 60).await;
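A simplified, synchronous sketch of one pass of the monitor over an in-memory worker list. The real task runs as a tokio task and reads/writes the database; the `Worker` struct and `run_check_pass` here are illustrative stand-ins:

```rust
use std::time::{Duration, SystemTime};

const MAX_STALENESS: Duration = Duration::from_secs(90);

// Illustrative in-memory stand-in for a row of the worker table.
struct Worker {
    name: String,
    last_heartbeat: Option<SystemTime>,
    active: bool,
}

/// One pass of the monitor: mark stale active workers inactive and
/// return how many were deactivated (the count that gets logged).
fn run_check_pass(workers: &mut [Worker], now: SystemTime) -> usize {
    let mut deactivated = 0;
    for w in workers.iter_mut().filter(|w| w.active) {
        let stale = match w.last_heartbeat {
            None => true,
            Some(ts) => now
                .duration_since(ts)
                .map(|age| age > MAX_STALENESS)
                .unwrap_or(false),
        };
        if stale {
            w.active = false;
            deactivated += 1;
            eprintln!("WARN: Worker {} heartbeat is stale, marking as inactive", w.name);
        }
    }
    if deactivated > 0 {
        eprintln!("INFO: Deactivated {} worker(s) with stale heartbeats", deactivated);
    }
    deactivated
}
```

The real loop would run such a pass every 60 seconds, sleeping between iterations.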
Worker Lifecycle
Normal Operation
Worker Starts → Registers → Sends Heartbeats (30s) → Remains Active
Graceful Shutdown
Worker Stops → No More Heartbeats → Monitor Detects (60s) → Marked Inactive
Crash/Network Failure
Worker Crashes → Heartbeats Stop → Monitor Detects (60s) → Marked Inactive
Monitoring
Check Active Workers
SELECT name, worker_role, status, last_heartbeat
FROM worker
WHERE status = 'active'
ORDER BY last_heartbeat DESC;
Check Recent Deactivations
SELECT name, worker_role, status, last_heartbeat, updated
FROM worker
WHERE status = 'inactive'
AND updated > NOW() - INTERVAL '5 minutes'
ORDER BY updated DESC;
Count Workers by Status
SELECT status, COUNT(*)
FROM worker
GROUP BY status;
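Assuming PostgreSQL (the INTERVAL syntax above suggests it), the heartbeat age of each active worker can be inspected directly; this helper query is illustrative:

```sql
-- Heartbeat age in seconds per active worker; workers that have never
-- sent a heartbeat (NULL) sort first.
SELECT name,
       EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) AS heartbeat_age_s
FROM worker
WHERE status = 'active'
ORDER BY heartbeat_age_s DESC NULLS FIRST;
```

Any row with heartbeat_age_s above 90 is a candidate for deactivation on the monitor's next pass.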
Logs
Monitor Startup
INFO: Starting worker heartbeat monitor...
INFO: Worker heartbeat monitor started (check interval: 60s, staleness threshold: 90s)
Worker Deactivation
WARN: Worker sensor-77cd23b50478 (ID: 27) heartbeat is stale (1289s old), marking as inactive
INFO: Deactivated 5 worker(s) with stale heartbeats
Error Handling
ERROR: Failed to deactivate worker worker-123 (stale heartbeat): <error details>
ERROR: Failed to query active workers for heartbeat check: <error details>
Scheduler Integration
The scheduler already filters out stale workers during worker selection:
// Filter by heartbeat freshness
let fresh_workers: Vec<_> = active_workers
.into_iter()
.filter(|w| Self::is_worker_heartbeat_fresh(w))
.collect();
Before the heartbeat monitor: the scheduler filtered at selection time, but workers stayed "active" in the DB.
After the heartbeat monitor: workers are marked inactive in the DB, so the scheduler sees accurate state.
Troubleshooting
Workers Constantly Becoming Inactive
Symptoms: Active workers are marked inactive despite running.
Causes:
- Worker heartbeat interval > 30 seconds
- Network issues preventing heartbeat messages
- Worker service crash loop
Solutions:
- Check worker logs for heartbeat send attempts
- Verify RabbitMQ connectivity
- Check worker configuration for heartbeat interval
Stale Workers Not Being Deactivated
Symptoms: Workers with old heartbeats remain active.
Causes:
- Executor service not running
- Monitor task crashed
Solutions:
- Check executor service logs
- Verify the monitor task started: grep "heartbeat monitor started" executor.log
- Restart the executor service
Too Many Inactive Workers
Symptoms: Database has hundreds of inactive workers.
Causes: Historical workers from development/testing.
Solutions:
-- Delete inactive workers older than 7 days
DELETE FROM worker
WHERE status = 'inactive'
AND updated < NOW() - INTERVAL '7 days';
Best Practices
Worker Registration
Workers should:
- Set appropriate unique name (hostname-based)
- Send heartbeat every 30 seconds
- Handle graceful shutdown (optional: mark self inactive)
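A minimal blocking sketch of the worker-side heartbeat send loop. The real worker presumably runs this as an async task publishing over RabbitMQ; `heartbeat_loop`, `send`, and `stop` are illustrative:

```rust
use std::thread;
use std::time::Duration;

// Expected heartbeat frequency from the configuration section above.
const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(30);

/// Illustrative blocking loop: call `send` once per interval until `stop`
/// reports shutdown. `send` would publish a heartbeat message so the
/// executor can update the worker's last_heartbeat column.
fn heartbeat_loop(mut send: impl FnMut(), mut stop: impl FnMut() -> bool, interval: Duration) {
    while !stop() {
        send();
        thread::sleep(interval);
    }
}
```

Production usage would pass HEARTBEAT_INTERVAL (30 s) and, for graceful shutdown, have `stop` flip when the worker receives a termination signal.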
Database Maintenance
- Periodically clean up old inactive workers
- Monitor worker table growth
- Index on status and last_heartbeat for efficient queries
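The suggested index might look like the following; the column names are taken from the queries in this document, so verify them against the actual schema:

```sql
-- Composite index covering the monitor's "active workers" scan and
-- the heartbeat-ordering queries above.
CREATE INDEX IF NOT EXISTS idx_worker_status_heartbeat
    ON worker (status, last_heartbeat);
```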
Monitoring & Alerts
- Track worker deactivation rate (should be low in production)
- Alert on sudden increase in deactivations (infrastructure issue)
- Monitor active worker count vs. expected
Related Documentation
- docs/architecture/worker-service.md - Worker architecture
- docs/architecture/executor-service.md - Executor architecture
- docs/deployment/ops-runbook-queues.md - Operational procedures
- AGENTS.md - Project rules and conventions
Implementation Notes
Why 90 Seconds?
- Worker sends heartbeat every 30 seconds
- 3x multiplier provides grace period for:
- Network latency
- Brief load spikes
- Temporary connectivity issues
- Balances responsiveness vs. false positives
Why Check Every 60 Seconds?
- Lets two 30-second heartbeat intervals elapse between checks
- Reduces database query frequency
- Adequate response time: a stale worker is deactivated at most ~150 seconds after its last heartbeat (90 s staleness threshold plus up to 60 s until the next check)
Thread Safety
- Monitor runs in separate tokio task
- Uses connection pool for database access
- No shared mutable state
- Safe to run multiple executor instances (each monitors independently)