# Quick Reference: Worker Heartbeat Monitoring

**Purpose**: Automatically detect and deactivate workers that have stopped sending heartbeats.

## Overview

The executor service includes a background task that monitors worker heartbeats and automatically marks stale workers as inactive. This prevents the scheduler from attempting to assign work to workers that are no longer available.

## How It Works

### Background Monitor Task

- **Location**: `crates/executor/src/service.rs` → `worker_heartbeat_monitor_loop()`
- **Check Interval**: Every 60 seconds
- **Staleness Threshold**: 90 seconds (3x the expected 30-second heartbeat interval)

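The loop's shape is tick, scan, deactivate. Below is a minimal sketch assuming tokio and sqlx against Postgres; the function signature and the single-`UPDATE` query are illustrative rather than the actual implementation, which logs each stale worker individually (see the Logs section):

```rust
use std::time::Duration;

use sqlx::PgPool;

/// Minimal sketch of a heartbeat monitor loop (illustrative, not the real code).
async fn worker_heartbeat_monitor_loop(pool: PgPool, check_interval_secs: u64) {
    let mut ticker = tokio::time::interval(Duration::from_secs(check_interval_secs));
    loop {
        // Completes immediately on the first call, then once per interval.
        ticker.tick().await;
        // One idempotent statement: deactivate every active worker whose
        // heartbeat is missing or older than the 90-second threshold.
        let result = sqlx::query(
            "UPDATE worker SET status = 'inactive'
             WHERE status = 'active'
               AND (last_heartbeat IS NULL
                    OR last_heartbeat < NOW() - INTERVAL '90 seconds')",
        )
        .execute(&pool)
        .await;

        match result {
            Ok(res) if res.rows_affected() > 0 => {
                tracing::info!("Deactivated {} worker(s) with stale heartbeats", res.rows_affected());
            }
            Ok(_) => {}
            Err(e) => {
                tracing::error!("Failed to query active workers for heartbeat check: {e}");
            }
        }
    }
}
```
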
### Detection Logic

The monitor checks all workers with `status = 'active'` and applies three rules (a sketch of the equivalent per-worker predicate follows the list):

1. **No Heartbeat**: Workers with `last_heartbeat = NULL` → marked inactive
2. **Stale Heartbeat**: Workers with heartbeat older than 90 seconds → marked inactive
3. **Fresh Heartbeat**: Workers with heartbeat within 90 seconds → remain active

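Viewed per worker, the rule reduces to a small predicate. Here is a sketch assuming chrono timestamps; the function name is illustrative (essentially the inverse of the scheduler's freshness check):

```rust
use chrono::{DateTime, Duration, Utc};

/// Staleness threshold: 30-second heartbeat interval x 3 grace multiplier.
const MAX_STALENESS_SECS: i64 = 90;

/// True when a worker should be marked inactive: it has never sent a
/// heartbeat, or its last heartbeat is older than the threshold.
fn is_heartbeat_stale(last_heartbeat: Option<DateTime<Utc>>, now: DateTime<Utc>) -> bool {
    match last_heartbeat {
        None => true,
        Some(ts) => now - ts > Duration::seconds(MAX_STALENESS_SECS),
    }
}
```
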
### Automatic Deactivation

When a stale worker is detected:

- Worker status updated to `inactive` in the database
- Warning logged with worker name, ID, and heartbeat age
- Summary logged with the count of deactivated workers

## Configuration

### Constants (in `scheduler.rs` and `service.rs`)

```rust
// Values as documented; the exact declarations in the codebase may differ.
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30; // seconds; expected worker heartbeat frequency
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3; // grace period multiplier
const MAX_STALENESS: u64 = DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER; // 90 seconds
```

### Check Interval

The check interval is currently hardcoded to 60 seconds and is passed in when the monitor task is spawned:

```rust
Self::worker_heartbeat_monitor_loop(worker_pool, 60).await;
```
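
That call presumably runs inside a detached background task so the periodic check never blocks the executor's other work. A sketch of such a spawn site, reusing the illustrative free function from the loop sketch above:

```rust
// Illustrative spawn site: run the monitor as a detached background task.
let monitor_pool = worker_pool.clone();
tokio::spawn(async move {
    worker_heartbeat_monitor_loop(monitor_pool, 60).await;
});
```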

## Worker Lifecycle

### Normal Operation

```
Worker Starts → Registers → Sends Heartbeats (30s) → Remains Active
```

### Graceful Shutdown

```
Worker Stops → No More Heartbeats → Monitor Detects (within 90-150s) → Marked Inactive
```

### Crash/Network Failure

```
Worker Crashes → Heartbeats Stop → Monitor Detects (within 90-150s) → Marked Inactive
```

## Monitoring

### Check Active Workers

```sql
SELECT name, worker_role, status, last_heartbeat
FROM worker
WHERE status = 'active'
ORDER BY last_heartbeat DESC;
```

### Check Recent Deactivations

```sql
SELECT name, worker_role, status, last_heartbeat, updated
FROM worker
WHERE status = 'inactive'
  AND updated > NOW() - INTERVAL '5 minutes'
ORDER BY updated DESC;
```

### Count Workers by Status

```sql
SELECT status, COUNT(*)
FROM worker
GROUP BY status;
```

## Logs

### Monitor Startup

```
INFO: Starting worker heartbeat monitor...
INFO: Worker heartbeat monitor started (check interval: 60s, staleness threshold: 90s)
```

### Worker Deactivation

```
WARN: Worker sensor-77cd23b50478 (ID: 27) heartbeat is stale (1289s old), marking as inactive
INFO: Deactivated 5 worker(s) with stale heartbeats
```

### Error Handling

```
ERROR: Failed to deactivate worker worker-123 (stale heartbeat): <error details>
ERROR: Failed to query active workers for heartbeat check: <error details>
```

## Scheduler Integration

The scheduler already filters out stale workers during worker selection:

```rust
// Filter by heartbeat freshness
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();
```

**Before Heartbeat Monitor**: Scheduler filtered at selection time, but stale workers stayed `active` in the database

**After Heartbeat Monitor**: Stale workers are marked `inactive` in the database, so the scheduler sees accurate state

## Troubleshooting

### Workers Constantly Becoming Inactive

**Symptoms**: Active workers are marked inactive despite still running

**Causes**:

- Worker heartbeat interval > 30 seconds
- Network issues preventing heartbeat messages
- Worker service crash loop

**Solutions**:

1. Check worker logs for heartbeat send attempts
2. Verify RabbitMQ connectivity
3. Check worker configuration for heartbeat interval

### Stale Workers Not Being Deactivated

**Symptoms**: Workers with old heartbeats remain active

**Causes**:

- Executor service not running
- Monitor task crashed

**Solutions**:

1. Check executor service logs
2. Verify monitor task started: `grep "heartbeat monitor started" executor.log`
3. Restart executor service

### Too Many Inactive Workers

**Symptoms**: Database has hundreds of inactive workers

**Causes**: Historical workers from development/testing

**Solutions**:

```sql
-- Delete inactive workers older than 7 days
DELETE FROM worker
WHERE status = 'inactive'
  AND updated < NOW() - INTERVAL '7 days';
```

## Best Practices

### Worker Registration

Workers should (see the heartbeat-loop sketch below):

- Set appropriate unique name (hostname-based)
- Send heartbeat every 30 seconds
- Handle graceful shutdown (optional: mark self inactive)

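A sketch of the worker-side loop, with the transport abstracted behind a callback since heartbeats travel over RabbitMQ in this system (the function and its signature are illustrative):

```rust
use std::future::Future;
use std::time::Duration;

/// Illustrative worker-side heartbeat loop: call `send` every 30 seconds and
/// keep going on failure; the monitor deactivates workers that stay silent.
async fn heartbeat_loop<F, Fut, E>(mut send: F)
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<(), E>>,
    E: std::fmt::Display,
{
    let mut ticker = tokio::time::interval(Duration::from_secs(30));
    loop {
        ticker.tick().await;
        if let Err(e) = send().await {
            tracing::warn!("Failed to send heartbeat: {e}");
        }
    }
}
```

A caller would pass a closure that publishes the heartbeat message (here, to RabbitMQ) and, on graceful shutdown, could additionally mark itself inactive.
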
### Database Maintenance

- Periodically clean up old inactive workers (see the `DELETE` query under Troubleshooting)
- Monitor worker table growth
- Index on `status` and `last_heartbeat` for efficient queries

### Monitoring & Alerts

- Track worker deactivation rate (should be low in production)
- Alert on sudden increase in deactivations (infrastructure issue)
- Monitor active worker count vs. expected

## Related Documentation

- `docs/architecture/worker-service.md` - Worker architecture
- `docs/architecture/executor-service.md` - Executor architecture
- `docs/deployment/ops-runbook-queues.md` - Operational procedures
- `AGENTS.md` - Project rules and conventions

## Implementation Notes

### Why 90 Seconds?

- Worker sends heartbeat every 30 seconds
- 3x multiplier provides grace period for:
  - Network latency
  - Brief load spikes
  - Temporary connectivity issues
- Balances responsiveness vs. false positives

### Why Check Every 60 Seconds?

- Two full heartbeat intervals (2 x 30s) elapse between checks
- Reduces database query frequency
- Adequate response time: a stale worker is deactivated at most ~150 seconds after its last heartbeat (the 90-second threshold plus up to 60 seconds until the next check)

### Thread Safety

- Monitor runs in a separate tokio task
- Uses the connection pool for database access
- No shared mutable state
- Safe to run multiple executor instances (each monitors independently; re-marking an already-inactive worker is a harmless no-op)