attune/docs/QUICKREF-worker-lifecycle-heartbeat.md

# Quick Reference: Worker Lifecycle & Heartbeat Validation

**Last Updated:** 2026-02-04
**Status:** Production Ready

## Overview

Workers deregister on graceful shutdown, and the executor validates heartbeat freshness before scheduling, so executions are not queued to workers that have stopped or crashed.

## Worker Lifecycle

### Startup
1. Load configuration
2. Connect to database and message queue
3. Detect runtime capabilities
4. Register in database (status = `Active`)
5. Start heartbeat loop
6. Start consuming execution messages

### Normal Operation
- **Heartbeat:** Updates `worker.last_heartbeat` every 30 seconds (default)
- **Status:** Remains `Active`
- **Executions:** Processes messages from worker-specific queue

### Shutdown (Graceful)
1. Receive SIGINT or SIGTERM signal
2. Stop heartbeat loop
3. Mark worker as `Inactive` in database
4. Exit cleanly
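
The shutdown sequence above can be sketched with the standard library alone. This is a minimal illustration, not the worker's actual implementation: `WorkerRow`, `HeartbeatLoop`, `start_heartbeat`, and `shutdown` are hypothetical names, the database row is replaced by an in-memory struct, and signal reception (step 1) is left out because `std` has no signal API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum WorkerStatus {
    Active,
    Inactive,
}

/// Hypothetical in-memory stand-in for the worker's database row.
#[derive(Debug)]
struct WorkerRow {
    status: WorkerStatus,
    last_heartbeat: Option<Instant>,
}

/// Handle to the background heartbeat thread (hypothetical type).
struct HeartbeatLoop {
    stop: Arc<AtomicBool>,
    handle: JoinHandle<()>,
}

/// Spawn a thread that refreshes `last_heartbeat` once per `interval`.
fn start_heartbeat(row: Arc<Mutex<WorkerRow>>, interval: Duration) -> HeartbeatLoop {
    let stop = Arc::new(AtomicBool::new(false));
    let stop_flag = Arc::clone(&stop);
    let handle = thread::spawn(move || {
        while !stop_flag.load(Ordering::SeqCst) {
            row.lock().unwrap().last_heartbeat = Some(Instant::now());
            // park_timeout lets shutdown wake the thread early via unpark().
            thread::park_timeout(interval);
        }
    });
    HeartbeatLoop { stop, handle }
}

/// Steps 2 and 3 of graceful shutdown: stop the heartbeat, then mark inactive.
fn shutdown(heartbeat: HeartbeatLoop, row: &Arc<Mutex<WorkerRow>>) {
    heartbeat.stop.store(true, Ordering::SeqCst);
    heartbeat.handle.thread().unpark();
    heartbeat.handle.join().expect("heartbeat thread panicked");
    row.lock().unwrap().status = WorkerStatus::Inactive;
}
```

Joining the heartbeat thread before changing the status guarantees that no heartbeat is written after the worker is marked inactive. In a real worker, `shutdown` would be called from whatever SIGINT/SIGTERM handling the service uses.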

### Shutdown (Crash/Kill)
- Worker does not deregister
- Status remains `Active` in database
- Heartbeat stops updating
- **Executor treats the worker as stale once its last heartbeat is 90 seconds old** (checked on the next scheduling attempt)

## Heartbeat Validation

### Configuration
```yaml
worker:
  heartbeat_interval: 30  # seconds (default)
```

### Staleness Threshold
- **Formula:** `heartbeat_interval * 3` (90 seconds with the default 30-second interval)
- **Rationale:** Tolerates 2 missed heartbeats plus a buffer for delayed updates
- **Detection:** Executor checks on every scheduling attempt

### Worker States

| Last Heartbeat Age | Status | Schedulable |
|-------------------|--------|-------------|
| < 90 seconds      | Fresh  | ✅ Yes      |
| ≥ 90 seconds      | Stale  | ❌ No       |
| None/NULL         | Stale  | ❌ No       |
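
The threshold and the table above can be expressed as a small freshness check. This is a sketch under assumptions: the two constants mirror those listed under "Relevant Constants", while `max_heartbeat_age` and `is_heartbeat_fresh` are hypothetical helper names, and the heartbeat age is passed in as an `Option<Duration>` (`None` standing for a NULL `last_heartbeat`).

```rust
use std::time::Duration;

// Values mirror the constants listed under "Relevant Constants".
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

/// Hypothetical helper: maximum heartbeat age before a worker counts as stale.
fn max_heartbeat_age(heartbeat_interval_secs: u64) -> Duration {
    Duration::from_secs(heartbeat_interval_secs * HEARTBEAT_STALENESS_MULTIPLIER)
}

/// Hypothetical helper: `age` is `None` when `last_heartbeat` is NULL.
fn is_heartbeat_fresh(age: Option<Duration>, heartbeat_interval_secs: u64) -> bool {
    match age {
        // Strictly below the threshold is fresh; exactly 90 s is already stale.
        Some(age) => age < max_heartbeat_age(heartbeat_interval_secs),
        // A worker that never recorded a heartbeat is treated as stale.
        None => false,
    }
}
```

With the default interval, `max_heartbeat_age(DEFAULT_HEARTBEAT_INTERVAL)` is 90 seconds, so an age of 89 seconds is schedulable and an age of 90 seconds is not.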

## Executor Scheduling Flow

```
Execution Requested
    ↓
Find Action Workers
    ↓
Filter by Runtime Compatibility
    ↓
Filter by Active Status
    ↓
Filter by Heartbeat Freshness ← NEW
    ↓
Select Best Worker
    ↓
Queue to Worker
```
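
The flow above amounts to a chain of filters followed by a selection. The sketch below is illustrative only: the `Worker` struct, its fields, and `select_worker` are hypothetical, the input slice is assumed to already contain only action workers, and "best" is assumed here to mean the most recent heartbeat, which the real scheduler may define differently.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
enum WorkerStatus {
    Active,
    Inactive,
}

/// Hypothetical view of a worker row as seen by the scheduler.
#[derive(Debug, Clone)]
struct Worker {
    id: u32,
    runtimes: Vec<String>,
    status: WorkerStatus,
    /// Time since `last_heartbeat`; `None` when the column is NULL.
    heartbeat_age: Option<Duration>,
}

// 30 s interval * 3, as described under "Staleness Threshold".
const MAX_HEARTBEAT_AGE: Duration = Duration::from_secs(90);

/// Apply the filters in order, then pick one worker.
fn select_worker<'a>(workers: &'a [Worker], runtime: &str) -> Option<&'a Worker> {
    workers
        .iter()
        // Filter by runtime compatibility.
        .filter(|w| w.runtimes.iter().any(|r| r == runtime))
        // Filter by active status.
        .filter(|w| w.status == WorkerStatus::Active)
        // Filter by heartbeat freshness (NULL heartbeats are rejected).
        .filter(|w| matches!(w.heartbeat_age, Some(age) if age < MAX_HEARTBEAT_AGE))
        // Assumed selection rule: the freshest heartbeat wins.
        .min_by_key(|w| w.heartbeat_age)
}
```

A return value of `None` corresponds to the case where no worker survives the filters, which is when the executor reports that no workers with fresh heartbeats are available.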

## Signal Handling

### Supported Signals
- **SIGINT** (Ctrl+C) - Graceful shutdown
- **SIGTERM** (docker stop, k8s termination) - Graceful shutdown
- **SIGKILL** (force kill) - Cannot be caught by the worker, so no cleanup is possible

### Docker Example
```bash
# Graceful shutdown (SIGTERM, then SIGKILL after the default 10s grace period)
docker compose stop worker-shell

# Force kill (SIGKILL, immediate)
docker compose kill worker-shell
```

### Kubernetes Example
```yaml
spec:
  terminationGracePeriodSeconds: 30  # Time for graceful shutdown
```

## Monitoring & Debugging

### Check Worker Status
```sql
SELECT id, name, status, last_heartbeat,
       EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) as seconds_ago
FROM worker
WHERE worker_role = 'action'
ORDER BY last_heartbeat DESC;
```

### Identify Stale Workers
```sql
SELECT id, name, status,
       EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) as seconds_ago
FROM worker
WHERE worker_role = 'action'
  AND status = 'active'
  AND (last_heartbeat IS NULL OR last_heartbeat < NOW() - INTERVAL '90 seconds');
```

### View Worker Logs
```bash
# Docker Compose
docker compose logs -f worker-shell

# Look for:
# - "Worker registered with ID: X"
# - "Heartbeat sent successfully" (debug level)
# - "Received SIGTERM signal"
# - "Deregistering worker ID: X"
```

### View Executor Logs
```bash
docker compose logs -f executor

# Look for:
# - "Worker X heartbeat is stale: last seen N seconds ago"
# - "No workers with fresh heartbeats available"
```

## Common Issues

### Issue: "No workers with fresh heartbeats available"

**Causes:**
1. All workers crashed/terminated
2. Workers paused/frozen
3. Network partition between workers and database
4. Database connection issues

**Solutions:**
1. Check if workers are running: `docker compose ps`
2. Restart workers: `docker compose restart worker-shell`
3. Check worker logs for errors
4. Verify database connectivity

### Issue: Worker not deregistering on shutdown

**Causes:**
1. SIGKILL used instead of SIGTERM
2. Grace period too short
3. Database connection lost before deregister

**Solutions:**
1. Use `docker compose stop`, not `docker compose kill`
2. Increase grace period: `docker compose stop -t 30 worker-shell`
3. Check network connectivity

### Issue: Worker stuck in Active status after crash

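**Behavior:** Expected. The executor treats the worker as stale once its heartbeat is 90 seconds old and stops scheduling to it, even though the status still reads `Active`.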

**Manual Cleanup (if needed):**
```sql
UPDATE worker
SET status = 'inactive'
WHERE status = 'active'
  AND last_heartbeat < NOW() - INTERVAL '5 minutes';
```

## Testing

### Test Graceful Shutdown
```bash
# Start worker
docker compose up -d worker-shell

# Wait for registration
sleep 5

# Check status (should be 'active')
docker compose exec postgres psql -U attune -c \
  "SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"

# Graceful shutdown
docker compose stop worker-shell

# Check status (should be 'inactive')
docker compose exec postgres psql -U attune -c \
  "SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"
```

### Test Heartbeat Validation
```bash
# Pause worker (simulate freeze)
docker compose pause worker-shell

# Wait for staleness (90+ seconds)
sleep 100

# Try to schedule an execution via the API or CLI (should fail)
attune execution create --action core.echo --param message="test"

# Should see: "No workers with fresh heartbeats available"

# Resume the worker afterwards
docker compose unpause worker-shell
```

## Configuration Reference

### Worker Config
```yaml
worker:
  name: "worker-01"
  heartbeat_interval: 30      # Heartbeat update frequency (seconds)
  max_concurrent_tasks: 10    # Concurrent execution limit
  task_timeout: 300           # Per-task timeout (seconds)
```

### Relevant Constants
```rust
// crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;
// Max age = 90 seconds
```

## Best Practices

1. **Use Graceful Shutdown:** Always use SIGTERM, not SIGKILL
2. **Monitor Heartbeats:** Alert when workers go stale
3. **Set Grace Periods:** Allow 10-30s for worker shutdown in production
4. **Health Checks:** Implement liveness probes in Kubernetes
5. **Auto-Restart:** Configure restart policies for crashed workers

## Related Documentation

- `work-summary/2026-02-worker-graceful-shutdown-heartbeat-validation.md` - Implementation details
- `docs/architecture/worker-service.md` - Worker architecture
- `docs/architecture/executor-service.md` - Executor architecture
- `AGENTS.md` - Project conventions

## Future Enhancements

- [ ] Configurable staleness multiplier
- [ ] Active health probing
- [ ] Graceful work completion before shutdown
- [ ] Worker reconnection logic
- [ ] Load-based worker selection