attune/docs/QUICKREF-worker-heartbeat-monitoring.md

Quick Reference: Worker Heartbeat Monitoring

Purpose: Automatically detect and deactivate workers that have stopped sending heartbeats

Overview

The executor service includes a background task that monitors worker heartbeats and automatically marks stale workers as inactive. This prevents the scheduler from attempting to assign work to workers that are no longer available.

How It Works

Background Monitor Task

  • Location: crates/executor/src/service.rs, worker_heartbeat_monitor_loop()
  • Check Interval: Every 60 seconds
  • Staleness Threshold: 90 seconds (3x the expected 30-second heartbeat interval)

Detection Logic

The monitor checks all workers with status = 'active':

  1. No Heartbeat: Workers with last_heartbeat = NULL → marked inactive
  2. Stale Heartbeat: Workers with heartbeat older than 90 seconds → marked inactive
  3. Fresh Heartbeat: Workers with heartbeat within 90 seconds → remain active
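
The three rules above can be sketched in Rust. This is a minimal, std-only illustration: the `Worker` struct and field names here are stand-ins, not the executor crate's actual types, and `Option<Instant>` models the nullable `last_heartbeat` column.

```rust
use std::time::{Duration, Instant};

/// Illustrative stand-in for a worker row; the real type lives in the executor crate.
struct Worker {
    last_heartbeat: Option<Instant>, // NULL in the DB maps to None
}

/// 30 s heartbeat interval * 3 grace multiplier = 90 s.
const MAX_STALENESS: Duration = Duration::from_secs(90);

/// Returns true when the worker should be marked inactive.
fn is_stale(worker: &Worker, now: Instant) -> bool {
    match worker.last_heartbeat {
        None => true, // rule 1: never sent a heartbeat
        // rule 2: heartbeat older than 90 s; rule 3 (fresh) is the false branch
        Some(t) => now.duration_since(t) > MAX_STALENESS,
    }
}

fn main() {
    let now = Instant::now();
    let fresh = Worker { last_heartbeat: Some(now) };
    let silent = Worker { last_heartbeat: None };
    println!("fresh stale?  {}", is_stale(&fresh, now));
    println!("silent stale? {}", is_stale(&silent, now));
}
```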

Automatic Deactivation

When a stale worker is detected:

  • Worker status updated to inactive in database
  • Warning logged with worker name, ID, and heartbeat age
  • Summary logged with count of deactivated workers

Configuration

Constants (in scheduler.rs and service.rs)

DEFAULT_HEARTBEAT_INTERVAL: 30 seconds      // Expected worker heartbeat frequency
HEARTBEAT_STALENESS_MULTIPLIER: 3          // Grace period multiplier
MAX_STALENESS: 90 seconds                   // Calculated: 30 * 3
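
The derived threshold can be expressed directly in Rust. This sketch mirrors the constant names listed above but is not necessarily the crate's actual definitions:

```rust
/// Expected worker heartbeat frequency, in seconds.
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
/// Grace-period multiplier applied before a heartbeat counts as stale.
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;
/// Maximum allowed heartbeat age: 30 * 3 = 90 seconds.
const MAX_STALENESS: u64 = DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER;

fn main() {
    println!("staleness threshold: {MAX_STALENESS}s");
}
```

Deriving the threshold from the interval keeps the two values in sync if the heartbeat frequency ever changes.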

Check Interval

Currently hardcoded to 60 seconds. Configured when spawning the monitor task:

Self::worker_heartbeat_monitor_loop(worker_pool, 60).await;
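
One pass of the monitor can be sketched without the async machinery. In the real service the loop runs in a tokio task and issues database updates; here an in-memory slice and an illustrative `Status` enum stand in, and the function returns the deactivation count that feeds the "Deactivated N worker(s)" summary log:

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Status { Active, Inactive }

/// Illustrative stand-in for a worker row.
struct Worker {
    name: String,
    status: Status,
    last_heartbeat: Option<Instant>,
}

const MAX_STALENESS: Duration = Duration::from_secs(90);

/// One monitoring pass: marks stale active workers inactive, returns how many changed.
fn deactivate_stale_workers(workers: &mut [Worker], now: Instant) -> usize {
    let mut deactivated = 0;
    for w in workers.iter_mut().filter(|w| w.status == Status::Active) {
        let stale = match w.last_heartbeat {
            None => true, // never sent a heartbeat
            Some(t) => now.duration_since(t) > MAX_STALENESS,
        };
        if stale {
            eprintln!("WARN: worker {} heartbeat is stale, marking as inactive", w.name);
            w.status = Status::Inactive;
            deactivated += 1;
        }
    }
    deactivated
}

fn main() {
    let now = Instant::now();
    let mut workers = vec![
        Worker { name: "fresh".into(), status: Status::Active, last_heartbeat: Some(now) },
        Worker { name: "silent".into(), status: Status::Active, last_heartbeat: None },
    ];
    let n = deactivate_stale_workers(&mut workers, now);
    println!("Deactivated {n} worker(s) with stale heartbeats");
}
```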

Worker Lifecycle

Normal Operation

Worker Starts → Registers → Sends Heartbeats (30s) → Remains Active

Graceful Shutdown

Worker Stops → No More Heartbeats → Monitor Detects (60s) → Marked Inactive

Crash/Network Failure

Worker Crashes → Heartbeats Stop → Monitor Detects (60s) → Marked Inactive

Monitoring

Check Active Workers

SELECT name, worker_role, status, last_heartbeat 
FROM worker 
WHERE status = 'active' 
ORDER BY last_heartbeat DESC;

Check Recent Deactivations

SELECT name, worker_role, status, last_heartbeat, updated
FROM worker 
WHERE status = 'inactive' 
  AND updated > NOW() - INTERVAL '5 minutes'
ORDER BY updated DESC;

Count Workers by Status

SELECT status, COUNT(*) 
FROM worker 
GROUP BY status;

Logs

Monitor Startup

INFO: Starting worker heartbeat monitor...
INFO: Worker heartbeat monitor started (check interval: 60s, staleness threshold: 90s)

Worker Deactivation

WARN: Worker sensor-77cd23b50478 (ID: 27) heartbeat is stale (1289s old), marking as inactive
INFO: Deactivated 5 worker(s) with stale heartbeats

Error Handling

ERROR: Failed to deactivate worker worker-123 (stale heartbeat): <error details>
ERROR: Failed to query active workers for heartbeat check: <error details>

Scheduler Integration

The scheduler already filters out stale workers during worker selection:

// Filter by heartbeat freshness
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();

Before Heartbeat Monitor: Scheduler filtered at selection time, but workers stayed "active" in the DB
After Heartbeat Monitor: Workers are marked inactive in the DB, so the scheduler sees accurate state

Troubleshooting

Workers Constantly Becoming Inactive

Symptoms: Active workers being marked inactive despite running

Causes:

  • Worker heartbeat interval > 30 seconds
  • Network issues preventing heartbeat messages
  • Worker service crash loop

Solutions:

  1. Check worker logs for heartbeat send attempts
  2. Verify RabbitMQ connectivity
  3. Check worker configuration for heartbeat interval

Stale Workers Not Being Deactivated

Symptoms: Workers with old heartbeats remain active

Causes:

  • Executor service not running
  • Monitor task crashed

Solutions:

  1. Check executor service logs
  2. Verify monitor task started: grep "heartbeat monitor started" executor.log
  3. Restart executor service

Too Many Inactive Workers

Symptoms: Database has hundreds of inactive workers

Causes: Historical workers from development/testing

Solutions:

-- Delete inactive workers older than 7 days
DELETE FROM worker 
WHERE status = 'inactive' 
  AND updated < NOW() - INTERVAL '7 days';

Best Practices

Worker Registration

Workers should:

  • Set a unique, hostname-based name
  • Send heartbeat every 30 seconds
  • Handle graceful shutdown (optional: mark self inactive)

Database Maintenance

  • Periodically clean up old inactive workers
  • Monitor worker table growth
  • Index on status and last_heartbeat for efficient queries

Monitoring & Alerts

  • Track worker deactivation rate (should be low in production)
  • Alert on sudden increase in deactivations (infrastructure issue)
  • Monitor active worker count vs. expected

Related Documentation

  • docs/architecture/worker-service.md - Worker architecture
  • docs/architecture/executor-service.md - Executor architecture
  • docs/deployment/ops-runbook-queues.md - Operational procedures
  • AGENTS.md - Project rules and conventions

Implementation Notes

Why 90 Seconds?

  • Worker sends heartbeat every 30 seconds
  • 3x multiplier provides grace period for:
    • Network latency
    • Brief load spikes
    • Temporary connectivity issues
  • Balances responsiveness vs. false positives

Why Check Every 60 Seconds?

  • Allows 1.5 heartbeat intervals between checks
  • Reduces database query frequency
  • Adequate response time (a stale worker is deactivated within ~2.5 minutes of its last heartbeat: the 90 s staleness threshold plus up to 60 s until the next check)
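
The worst-case detection latency follows directly from the two constants (times in seconds; a quick arithmetic check, not code from the crate):

```rust
fn main() {
    let staleness_threshold = 90; // seconds a heartbeat may age before counting as stale
    let check_interval = 60;      // seconds between monitor passes
    // A worker that stops heartbeating is detected at most one full check
    // interval after its heartbeat crosses the staleness threshold.
    let worst_case = staleness_threshold + check_interval;
    println!("worst-case detection latency: {worst_case}s");
}
```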

Thread Safety

  • Monitor runs in separate tokio task
  • Uses connection pool for database access
  • No shared mutable state
  • Safe to run multiple executor instances (each monitors independently)