# Quick Reference: Worker Lifecycle & Heartbeat Validation
**Last Updated:** 2026-02-04
**Status:** Production Ready
## Overview
Workers deregister on graceful shutdown, and the executor validates heartbeats before scheduling, so executions are queued only to workers that are still alive.
## Worker Lifecycle
### Startup
1. Load configuration
2. Connect to database and message queue
3. Detect runtime capabilities
4. Register in database (status = `Active`)
5. Start heartbeat loop
6. Start consuming execution messages
### Normal Operation
- **Heartbeat:** Updates `worker.last_heartbeat` every 30 seconds (default)
- **Status:** Remains `Active`
- **Executions:** Processes messages from worker-specific queue
### Shutdown (Graceful)
1. Receive SIGINT or SIGTERM signal
2. Stop heartbeat loop
3. Mark worker as `Inactive` in database
4. Exit cleanly
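The startup and graceful-shutdown steps above can be sketched with the standard library alone. This is a minimal illustration, not the worker's actual implementation: `WorkerRecord` and `run_until_shutdown` are hypothetical names, the in-memory record stands in for the database row, and the channel stands in for SIGINT/SIGTERM delivery (a real worker would have its signal handler send on it).

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Hypothetical stand-in for the worker's row in the database.
#[derive(Debug, Default)]
struct WorkerRecord {
    active: bool,
    heartbeats_sent: u32,
}

/// Registers the worker, runs the heartbeat loop until a shutdown message
/// arrives (or the sender is dropped), then marks the worker inactive.
fn run_until_shutdown(
    record: Arc<Mutex<WorkerRecord>>,
    shutdown: mpsc::Receiver<()>,
    heartbeat_interval: Duration,
) {
    // Registration: status = Active.
    record.lock().unwrap().active = true;

    loop {
        // Waiting on the shutdown channel doubles as the heartbeat timer.
        match shutdown.recv_timeout(heartbeat_interval) {
            // No shutdown request within one interval: send a heartbeat.
            Err(RecvTimeoutError::Timeout) => {
                record.lock().unwrap().heartbeats_sent += 1;
            }
            // Shutdown requested: stop the heartbeat loop.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }

    // Deregistration: status = Inactive, after which the process can exit.
    record.lock().unwrap().active = false;
}
```

Because the loop blocks on the channel rather than sleeping, a shutdown request interrupts the wait immediately instead of after the current heartbeat interval elapses.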
### Shutdown (Crash/Kill)
- Worker does not deregister
- Status remains `Active` in database
- Heartbeat stops updating
- **Executor detects as stale after 90 seconds**
## Heartbeat Validation
### Configuration
```yaml
worker:
  heartbeat_interval: 30 # seconds (default)
```
### Staleness Threshold
- **Formula:** `heartbeat_interval * 3` (90 seconds with the default 30-second interval)
- **Rationale:** Tolerates two missed heartbeats, with the third interval as a buffer
- **Detection:** Executor checks on every scheduling attempt
### Worker States
| Last Heartbeat Age | Status | Schedulable |
|-------------------|--------|-------------|
| < 90 seconds | Fresh | ✅ Yes |
| ≥ 90 seconds | Stale | ❌ No |
| None/NULL | Stale | ❌ No |
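The threshold and the table above reduce to a small freshness check. The sketch below is an assumption about how such a check could look, not the executor's actual code: `staleness_threshold` and `is_heartbeat_fresh` are hypothetical names, and the multiplier is typed `u32` here only because `Duration` multiplication takes a `u32`.

```rust
use std::time::{Duration, SystemTime};

/// Multiplier applied to the heartbeat interval (3 in this document).
const HEARTBEAT_STALENESS_MULTIPLIER: u32 = 3;

/// Maximum heartbeat age before a worker counts as stale.
fn staleness_threshold(heartbeat_interval: Duration) -> Duration {
    heartbeat_interval * HEARTBEAT_STALENESS_MULTIPLIER
}

/// Returns true when the last heartbeat is recent enough to schedule on.
/// `None` models a NULL `last_heartbeat` and is always stale.
fn is_heartbeat_fresh(
    last_heartbeat: Option<SystemTime>,
    now: SystemTime,
    heartbeat_interval: Duration,
) -> bool {
    match last_heartbeat {
        None => false,
        Some(seen) => {
            // A heartbeat timestamped in the future (clock skew) is treated as age zero.
            let age = now.duration_since(seen).unwrap_or(Duration::ZERO);
            age < staleness_threshold(heartbeat_interval)
        }
    }
}
```

With the default 30-second interval, a heartbeat that is 89 seconds old is fresh, while one that is exactly 90 seconds old is stale, matching the `<` and `≥` rows of the table.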
## Executor Scheduling Flow
```
Execution Requested
        ↓
Find Action Workers
        ↓
Filter by Runtime Compatibility
        ↓
Filter by Active Status
        ↓
Filter by Heartbeat Freshness ← NEW
        ↓
Select Best Worker
        ↓
Queue to Worker
```
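The filtering stages of this flow can be expressed as an iterator chain. This is a sketch under stated assumptions: the `Worker` struct, its fields, and `select_worker` are hypothetical, and because the document does not say how the "best" worker is chosen, the example picks the most recent heartbeat purely for illustration.

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, Clone, PartialEq)]
enum WorkerStatus {
    Active,
    Inactive,
}

/// Hypothetical in-memory view of a row in the `worker` table.
#[derive(Debug, Clone)]
struct Worker {
    id: u64,
    status: WorkerStatus,
    runtimes: Vec<String>,
    last_heartbeat: Option<SystemTime>,
}

/// Picks a worker for an execution that needs `runtime`,
/// or `None` if no candidate survives the filters.
fn select_worker<'a>(
    workers: &'a [Worker],
    runtime: &str,
    now: SystemTime,
    max_age: Duration,
) -> Option<&'a Worker> {
    workers
        .iter()
        // Filter by runtime compatibility.
        .filter(|w| w.runtimes.iter().any(|r| r == runtime))
        // Filter by active status.
        .filter(|w| w.status == WorkerStatus::Active)
        // Filter by heartbeat freshness; a missing heartbeat is stale.
        .filter(|w| match w.last_heartbeat {
            Some(seen) => now.duration_since(seen).unwrap_or(Duration::ZERO) < max_age,
            None => false,
        })
        // Assumed selection rule: the most recent heartbeat wins.
        .max_by_key(|w| w.last_heartbeat)
}
```

A `None` result corresponds to the case where the executor has no schedulable worker, for example when every active worker has a stale heartbeat.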
## Signal Handling
### Supported Signals
- **SIGINT** (Ctrl+C) - Graceful shutdown
- **SIGTERM** (docker stop, k8s termination) - Graceful shutdown
- **SIGKILL** (force kill) - No cleanup possible
### Docker Example
```bash
# Graceful shutdown (10s grace period)
docker compose stop worker-shell
# Force kill (immediate)
docker compose kill worker-shell
```
### Kubernetes Example
```yaml
spec:
  terminationGracePeriodSeconds: 30 # Time for graceful shutdown
```
## Monitoring & Debugging
### Check Worker Status
```sql
SELECT id, name, status, last_heartbeat,
       EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) AS seconds_ago
FROM worker
WHERE worker_role = 'action'
ORDER BY last_heartbeat DESC;
```
### Identify Stale Workers
```sql
SELECT id, name, status,
       EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) AS seconds_ago
FROM worker
WHERE worker_role = 'action'
  AND status = 'active'
  AND (last_heartbeat IS NULL OR last_heartbeat < NOW() - INTERVAL '90 seconds');
```
### View Worker Logs
```bash
# Docker Compose
docker compose logs -f worker-shell
# Look for:
# - "Worker registered with ID: X"
# - "Heartbeat sent successfully" (debug level)
# - "Received SIGTERM signal"
# - "Deregistering worker ID: X"
```
### View Executor Logs
```bash
docker compose logs -f executor
# Look for:
# - "Worker X heartbeat is stale: last seen N seconds ago"
# - "No workers with fresh heartbeats available"
```
## Common Issues
### Issue: "No workers with fresh heartbeats available"
**Causes:**
1. All workers crashed/terminated
2. Workers paused/frozen
3. Network partition between workers and database
4. Database connection issues
**Solutions:**
1. Check if workers are running: `docker compose ps`
2. Restart workers: `docker compose restart worker-shell`
3. Check worker logs for errors
4. Verify database connectivity
### Issue: Worker not deregistering on shutdown
**Causes:**
1. SIGKILL used instead of SIGTERM
2. Grace period too short
3. Database connection lost before deregister
**Solutions:**
1. Use `docker compose stop` not `docker compose kill`
2. Increase grace period: `docker compose down -t 30`
3. Check network connectivity
### Issue: Worker stuck in Active status after crash
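**Behavior:** Expected. The executor treats the worker as stale once its last heartbeat is 90 seconds old and stops scheduling to it, even though the status still reads `Active`.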
**Manual Cleanup (if needed):**
```sql
UPDATE worker
SET status = 'inactive'
WHERE status = 'active'
  AND last_heartbeat < NOW() - INTERVAL '5 minutes';
```
## Testing
### Test Graceful Shutdown
```bash
# Start worker
docker compose up -d worker-shell
# Wait for registration
sleep 5
# Check status (should be 'active')
docker compose exec postgres psql -U attune -c \
  "SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"
# Graceful shutdown
docker compose stop worker-shell
# Check status (should be 'inactive')
docker compose exec postgres psql -U attune -c \
  "SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"
```
### Test Heartbeat Validation
```bash
# Pause worker (simulate freeze)
docker compose pause worker-shell
# Wait for staleness (90+ seconds)
sleep 100
# Try to schedule execution (should fail)
# Use API or CLI to trigger execution
attune execution create --action core.echo --param message="test"
# Should see: "No workers with fresh heartbeats available"
```
## Configuration Reference
### Worker Config
```yaml
worker:
  name: "worker-01"
  heartbeat_interval: 30 # Heartbeat update frequency (seconds)
  max_concurrent_tasks: 10 # Concurrent execution limit
  task_timeout: 300 # Per-task timeout (seconds)
```
### Relevant Constants
```rust
// crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;
// Max age = 90 seconds
```
## Best Practices
1. **Use Graceful Shutdown:** Always use SIGTERM, not SIGKILL
2. **Monitor Heartbeats:** Alert when workers go stale
3. **Set Grace Periods:** Allow 10-30s for worker shutdown in production
4. **Health Checks:** Implement liveness probes in Kubernetes
5. **Auto-Restart:** Configure restart policies for crashed workers
## Related Documentation
- `work-summary/2026-02-worker-graceful-shutdown-heartbeat-validation.md` - Implementation details
- `docs/architecture/worker-service.md` - Worker architecture
- `docs/architecture/executor-service.md` - Executor architecture
- `AGENTS.md` - Project conventions
## Future Enhancements
- [ ] Configurable staleness multiplier
- [ ] Active health probing
- [ ] Graceful work completion before shutdown
- [ ] Worker reconnection logic
- [ ] Load-based worker selection