Quick Reference: Worker Lifecycle & Heartbeat Validation
Last Updated: 2026-02-04
Status: Production Ready
Overview
Workers deregister themselves on graceful shutdown, and the executor validates heartbeat freshness before scheduling, so executions are only routed to workers that are still alive.
Worker Lifecycle
Startup
- Load configuration
- Connect to database and message queue
- Detect runtime capabilities
- Register in database (status = Active)
- Start heartbeat loop
- Start consuming execution messages
Normal Operation
- Heartbeat: Updates worker.last_heartbeat every 30 seconds (default)
- Status: Remains Active
- Executions: Processes messages from worker-specific queue
Shutdown (Graceful)
- Receive SIGINT or SIGTERM signal
- Stop heartbeat loop
- Mark worker as Inactive in database
- Exit cleanly
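The graceful-shutdown sequence above can be sketched in Rust with only the standard library. This is a minimal illustration, not the worker's actual implementation: `WorkerRecord` and `spawn_worker` are hypothetical names, an in-memory struct stands in for the database row, and a channel message stands in for SIGINT/SIGTERM (catching real signals needs an OS-level or third-party API that is omitted here).

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the worker's database row.
#[derive(Debug, Default)]
struct WorkerRecord {
    active: bool,
    heartbeats_sent: u32,
}

/// Register the worker, then heartbeat every `interval` until a shutdown
/// message arrives; on shutdown, stop heartbeating and mark the worker inactive.
fn spawn_worker(
    record: Arc<Mutex<WorkerRecord>>,
    interval: Duration,
) -> (Sender<()>, JoinHandle<()>) {
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        // Register (status = Active).
        record.lock().unwrap().active = true;
        loop {
            match shutdown_rx.recv_timeout(interval) {
                // No shutdown request within one interval: send a heartbeat.
                Err(RecvTimeoutError::Timeout) => {
                    record.lock().unwrap().heartbeats_sent += 1;
                }
                // Shutdown requested (or the sender was dropped): stop the loop.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        // Deregister (status = Inactive) before exiting.
        record.lock().unwrap().active = false;
    });
    (shutdown_tx, handle)
}
```

A caller would keep the returned sender, send `()` on it when a termination signal arrives, and join the thread before exiting. Waiting on the channel with a timeout lets the loop react to shutdown immediately instead of sleeping through a full interval.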
Shutdown (Crash/Kill)
- Worker does not deregister
- Status remains Active in database
- Heartbeat stops updating
- Executor treats the worker as stale once 90 seconds have passed since its last heartbeat
Heartbeat Validation
Configuration
worker:
heartbeat_interval: 30 # seconds (default)
Staleness Threshold
- Formula: heartbeat_interval * 3 = 90 seconds
- Rationale: Allows 2 missed heartbeats + buffer
- Detection: Executor checks on every scheduling attempt
Worker States
| Last Heartbeat Age | Status | Schedulable |
|---|---|---|
| < 90 seconds | Fresh | ✅ Yes |
| ≥ 90 seconds | Stale | ❌ No |
| None/NULL | Stale | ❌ No |
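The threshold formula and the table above can be expressed as a small classification function. The sketch below is illustrative rather than the executor's actual code: the two constants mirror the values quoted in this document, while `HeartbeatState`, `max_heartbeat_age`, and `classify` are hypothetical names.

```rust
use std::time::Duration;

// Values as quoted in this document.
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

/// Hypothetical name: freshness of a worker's last heartbeat.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum HeartbeatState {
    Fresh,
    Stale,
}

/// Maximum heartbeat age before a worker counts as stale.
fn max_heartbeat_age(heartbeat_interval_secs: u64) -> Duration {
    Duration::from_secs(heartbeat_interval_secs * HEARTBEAT_STALENESS_MULTIPLIER)
}

/// Classify a worker by the age of its last heartbeat.
/// `None` models a NULL `last_heartbeat` column and is always stale.
fn classify(age: Option<Duration>, max_age: Duration) -> HeartbeatState {
    match age {
        Some(a) if a < max_age => HeartbeatState::Fresh,
        _ => HeartbeatState::Stale,
    }
}
```

With the default interval, `max_heartbeat_age(DEFAULT_HEARTBEAT_INTERVAL)` is 90 seconds. An age of exactly 90 seconds is classified as stale, matching the `≥ 90 seconds` row.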
Executor Scheduling Flow
Execution Requested
↓
Find Action Workers
↓
Filter by Runtime Compatibility
↓
Filter by Active Status
↓
Filter by Heartbeat Freshness ← NEW
↓
Select Best Worker
↓
Queue to Worker
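The filtering steps in this flow can be sketched as an iterator chain. The code is a hedged illustration, not the executor's implementation: the `Worker` struct and `select_worker` function are hypothetical, the input slice is assumed to already contain only action workers, and "best" is assumed here to mean the most recent heartbeat, since this document does not state the real selection criterion.

```rust
use std::time::Duration;

/// Hypothetical in-memory view of a worker row; field names are illustrative.
#[derive(Debug, Clone)]
struct Worker {
    id: u32,
    runtimes: Vec<String>,
    active: bool,
    /// Age of the last heartbeat; `None` models a NULL `last_heartbeat`.
    heartbeat_age: Option<Duration>,
}

/// Apply the filters in the order shown above and return the chosen worker's ID.
/// Assumption: the "best" worker is the one with the most recent heartbeat.
fn select_worker(workers: &[Worker], runtime: &str, max_age: Duration) -> Option<u32> {
    workers
        .iter()
        // Filter by runtime compatibility.
        .filter(|w| w.runtimes.iter().any(|r| r == runtime))
        // Filter by Active status.
        .filter(|w| w.active)
        // Filter by heartbeat freshness; workers with no heartbeat are dropped.
        .filter_map(|w| {
            w.heartbeat_age
                .filter(|age| *age < max_age)
                .map(|age| (age, w.id))
        })
        // Select the worker whose heartbeat is youngest.
        .min_by_key(|(age, _)| *age)
        .map(|(_, id)| id)
}
```

A `None` result corresponds to the case where no worker passes every filter, which is when the executor would report that no workers with fresh heartbeats are available.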
Signal Handling
Supported Signals
- SIGINT (Ctrl+C) - Graceful shutdown
- SIGTERM (docker stop, k8s termination) - Graceful shutdown
- SIGKILL (force kill) - No cleanup possible
Docker Example
# Graceful shutdown (10s grace period)
docker compose stop worker-shell
# Force kill (immediate)
docker compose kill worker-shell
Kubernetes Example
spec:
terminationGracePeriodSeconds: 30 # Time for graceful shutdown
Monitoring & Debugging
Check Worker Status
SELECT id, name, status, last_heartbeat,
EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) as seconds_ago
FROM worker
WHERE worker_role = 'action'
ORDER BY last_heartbeat DESC;
Identify Stale Workers
SELECT id, name, status,
EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) as seconds_ago
FROM worker
WHERE worker_role = 'action'
AND status = 'active'
AND (last_heartbeat IS NULL OR last_heartbeat < NOW() - INTERVAL '90 seconds');
View Worker Logs
# Docker Compose
docker compose logs -f worker-shell
# Look for:
# - "Worker registered with ID: X"
# - "Heartbeat sent successfully" (debug level)
# - "Received SIGTERM signal"
# - "Deregistering worker ID: X"
View Executor Logs
docker compose logs -f executor
# Look for:
# - "Worker X heartbeat is stale: last seen N seconds ago"
# - "No workers with fresh heartbeats available"
Common Issues
Issue: "No workers with fresh heartbeats available"
Causes:
- All workers crashed/terminated
- Workers paused/frozen
- Network partition between workers and database
- Database connection issues
Solutions:
- Check if workers are running: docker compose ps
- Restart workers: docker compose restart worker-shell
- Check worker logs for errors
- Verify database connectivity
Issue: Worker not deregistering on shutdown
Causes:
- SIGKILL used instead of SIGTERM
- Grace period too short
- Database connection lost before deregister
Solutions:
- Use docker compose stop, not docker compose kill
- Increase grace period: docker compose down -t 30
- Check network connectivity
Issue: Worker stuck in Active status after crash
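Behavior: Expected. The executor treats the worker as stale once its last heartbeat is 90 seconds old and stops scheduling to it, even though the status still reads Active.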
Manual Cleanup (if needed):
UPDATE worker
SET status = 'inactive'
WHERE last_heartbeat < NOW() - INTERVAL '5 minutes';
Testing
Test Graceful Shutdown
# Start worker
docker compose up -d worker-shell
# Wait for registration
sleep 5
# Check status (should be 'active')
docker compose exec postgres psql -U attune -c \
"SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"
# Graceful shutdown
docker compose stop worker-shell
# Check status (should be 'inactive')
docker compose exec postgres psql -U attune -c \
"SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"
Test Heartbeat Validation
# Pause worker (simulate freeze)
docker compose pause worker-shell
# Wait for staleness (90+ seconds)
sleep 100
# Try to schedule execution (should fail)
# Use API or CLI to trigger execution
attune execution create --action core.echo --param message="test"
# Should see: "No workers with fresh heartbeats available"
Configuration Reference
Worker Config
worker:
name: "worker-01"
heartbeat_interval: 30 # Heartbeat update frequency (seconds)
max_concurrent_tasks: 10 # Concurrent execution limit
task_timeout: 300 # Per-task timeout (seconds)
Relevant Constants
// crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;
// Max age = 90 seconds
Best Practices
- Use Graceful Shutdown: Always use SIGTERM, not SIGKILL
- Monitor Heartbeats: Alert when workers go stale
- Set Grace Periods: Allow 10-30s for worker shutdown in production
- Health Checks: Implement liveness probes in Kubernetes
- Auto-Restart: Configure restart policies for crashed workers
Related Documentation
- work-summary/2026-02-worker-graceful-shutdown-heartbeat-validation.md - Implementation details
- docs/architecture/worker-service.md - Worker architecture
- docs/architecture/executor-service.md - Executor architecture
- AGENTS.md - Project conventions
Future Enhancements
- Configurable staleness multiplier
- Active health probing
- Graceful work completion before shutdown
- Worker reconnection logic
- Load-based worker selection