attune-system/attune

David Culbreth a74e13fa0b working out the worker/execution interface

2026-02-08 12:55:33 -06:00


Quick Reference: Worker Lifecycle & Heartbeat Validation

Last Updated: 2026-02-04
Status: Production Ready

Overview

Workers deregister on graceful shutdown and report periodic heartbeats. The executor validates those heartbeats so that executions are scheduled only on workers that are still alive.

Worker Lifecycle

Startup

Load configuration
Connect to database and message queue
Detect runtime capabilities
Register in database (status = Active)
Start heartbeat loop
Start consuming execution messages

Normal Operation

Heartbeat: Updates worker.last_heartbeat every 30 seconds (default)
Status: Remains Active
Executions: Processes messages from worker-specific queue
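
As an illustration of the heartbeat behaviour described above, here is a minimal Python sketch of a periodic heartbeat loop. It is not the project's implementation: `run_heartbeat_loop` and the `record_heartbeat` callback (which would update worker.last_heartbeat in a real worker) are hypothetical names chosen for this example.

```python
import threading
from datetime import datetime, timezone


def run_heartbeat_loop(record_heartbeat, stop_event, interval_seconds=30.0):
    """Call record_heartbeat(now) once per interval until stop_event is set.

    record_heartbeat is an assumed callback; in a real worker it would
    write the timestamp to worker.last_heartbeat.
    Returns the number of heartbeats that were sent.
    """
    beats = 0
    while not stop_event.is_set():
        record_heartbeat(datetime.now(timezone.utc))
        beats += 1
        # wait() returns as soon as stop_event is set, so a shutdown
        # request does not have to sit out the rest of the interval.
        stop_event.wait(interval_seconds)
    return beats
```

Waiting on an event instead of sleeping lets a shutdown request interrupt the loop immediately, which matters when the process has only a short grace period.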

Shutdown (Graceful)

Receive SIGINT or SIGTERM signal
Stop heartbeat loop
Mark worker as Inactive in database
Exit cleanly
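
The ordering of the graceful shutdown steps above can be sketched as follows. This is a hedged illustration only: `WorkerProcess`, `set_status`, and the lowercase `"inactive"` status value are assumptions made for the example, not names confirmed by the source.

```python
import threading


class WorkerProcess:
    """Hypothetical stand-in for a worker; not the project's actual type."""

    def __init__(self, worker_id, set_status):
        self.worker_id = worker_id
        self.set_status = set_status  # assumed callback that persists the status
        self.heartbeat_stop = threading.Event()
        self.steps = []

    def graceful_shutdown(self):
        """Run after SIGINT/SIGTERM has been received by the caller."""
        # Stop the heartbeat loop first so no heartbeat is written
        # after the worker has been marked inactive.
        self.heartbeat_stop.set()
        self.steps.append("heartbeat_stopped")
        # Mark the worker inactive so the executor stops selecting it.
        self.set_status(self.worker_id, "inactive")
        self.steps.append("marked_inactive")
        # Return exit code 0 to signal a clean exit to the caller.
        return 0
```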

Shutdown (Crash/Kill)

Worker does not deregister
Status remains Active in database
Heartbeat stops updating
Executor treats the worker as stale once its last heartbeat is 90 seconds old (3 × the default heartbeat_interval)

Heartbeat Validation

Configuration

worker:
  heartbeat_interval: 30  # seconds (default)

Staleness Threshold

Formula: heartbeat_interval * 3 (90 seconds with the default 30-second interval)
Rationale: Tolerates two missed heartbeats (60 seconds) plus one interval of buffer
Detection: Executor checks heartbeat age on every scheduling attempt

Worker States

Last Heartbeat Age    Status    Schedulable
< 90 seconds          Fresh     ✅ Yes
≥ 90 seconds          Stale     ❌ No
None/NULL             Stale     ❌ No
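
The threshold formula and the state table above can be expressed as a small Python check. This is a sketch under stated assumptions: the function names and the `STALENESS_MULTIPLIER` constant are hypothetical, and the boundary follows the table (an age of exactly 90 seconds counts as stale).

```python
from datetime import datetime, timedelta, timezone

STALENESS_MULTIPLIER = 3  # from the formula: heartbeat_interval * 3


def staleness_threshold(heartbeat_interval_seconds=30):
    """Return the maximum heartbeat age, exclusive, for a fresh worker."""
    return timedelta(seconds=heartbeat_interval_seconds * STALENESS_MULTIPLIER)


def is_stale(last_heartbeat, now, heartbeat_interval_seconds=30):
    """Classify a worker per the table: NULL or age >= threshold is stale."""
    if last_heartbeat is None:
        return True  # a worker that never reported is not schedulable
    age = now - last_heartbeat
    return age >= staleness_threshold(heartbeat_interval_seconds)
```

For example, with the default interval a heartbeat that is 89 seconds old is fresh, while one that is 90 seconds old is stale.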

Executor Scheduling Flow

Execution Requested
    ↓
Find Action Workers
    ↓
Filter by Runtime Compatibility
    ↓
Filter by Active Status
    ↓
Filter by Heartbeat Freshness ← NEW
    ↓
Select Best Worker
    ↓
Queue to Worker
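
The flow above is a chain of filters followed by a selection step. The Python sketch below mirrors that chain; `WorkerRecord`, `select_worker`, and the field names are hypothetical. The source does not say how "Select Best Worker" ranks candidates, so picking the most recent heartbeat is a placeholder assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class WorkerRecord:
    """Hypothetical in-memory view of a row in the worker table."""
    id: int
    role: str
    runtimes: frozenset
    status: str
    last_heartbeat: Optional[datetime]


def select_worker(workers, runtime, now, max_age=timedelta(seconds=90)):
    """Apply the scheduling filters in order; return a worker or None."""
    # Find action workers
    candidates = [w for w in workers if w.role == "action"]
    # Filter by runtime compatibility
    candidates = [w for w in candidates if runtime in w.runtimes]
    # Filter by active status
    candidates = [w for w in candidates if w.status == "active"]
    # Filter by heartbeat freshness (NULL heartbeats are treated as stale)
    candidates = [
        w for w in candidates
        if w.last_heartbeat is not None and now - w.last_heartbeat < max_age
    ]
    if not candidates:
        return None  # corresponds to "No workers with fresh heartbeats available"
    # Placeholder ranking (assumption): prefer the most recent heartbeat.
    return max(candidates, key=lambda w: w.last_heartbeat)
```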

Signal Handling

Supported Signals

SIGINT (Ctrl+C) - Graceful shutdown
SIGTERM (docker stop, k8s termination) - Graceful shutdown
SIGKILL (force kill) - Cannot be caught, so no cleanup is possible; the worker stays Active until its heartbeat goes stale
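
To illustrate how SIGINT and SIGTERM can both be routed to the same shutdown path, here is a minimal Python sketch using the standard `signal` module. The `install_signal_handlers` helper is a hypothetical name, and the worker's real handler may be structured differently.

```python
import signal
import threading


def install_signal_handlers(shutdown_event):
    """Set shutdown_event when SIGINT or SIGTERM arrives.

    Must be called from the main thread, which is the only thread
    allowed to register signal handlers in Python.
    """
    def _handler(signum, frame):
        # Keep the handler tiny: only flag the request. The main loop
        # performs the actual cleanup after it sees the event.
        shutdown_event.set()

    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _handler)
    # SIGKILL is deliberately absent: the operating system does not
    # allow a process to catch or ignore it.
```

A main loop would then block on `shutdown_event.wait()` and run its cleanup once the event is set.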

Docker Example

# Graceful shutdown: sends SIGTERM, then SIGKILL after the default 10s timeout
docker compose stop worker-shell

# Force kill: sends SIGKILL immediately, so the worker cannot deregister
docker compose kill worker-shell

Kubernetes Example

spec:
  terminationGracePeriodSeconds: 30  # Time for graceful shutdown

Monitoring & Debugging

Check Worker Status

SELECT id, name, status, last_heartbeat,
       EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) AS seconds_ago
FROM worker
WHERE worker_role = 'action'
-- NULLS LAST keeps workers that never sent a heartbeat at the bottom
ORDER BY last_heartbeat DESC NULLS LAST;

Identify Stale Workers

SELECT id, name, status,
       EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) AS seconds_ago
FROM worker
WHERE worker_role = 'action'
  AND status = 'active'
  -- <= matches the state table: an age of exactly 90 seconds is stale
  AND (last_heartbeat IS NULL OR last_heartbeat <= NOW() - INTERVAL '90 seconds');

View Worker Logs

# Docker Compose
docker compose logs -f worker-shell

# Look for:
# - "Worker registered with ID: X"
# - "Heartbeat sent successfully" (debug level)
# - "Received SIGTERM signal"
# - "Deregistering worker ID: X"

View Executor Logs

docker compose logs -f executor

# Look for:
# - "Worker X heartbeat is stale: last seen N seconds ago"
# - "No workers with fresh heartbeats available"

Common Issues

Issue: "No workers with fresh heartbeats available"

Causes:

All workers crashed/terminated
Workers paused/frozen
Network partition between workers and database
Database connection issues

Solutions:

Check if workers are running: docker compose ps
Restart workers: docker compose restart worker-shell
Check worker logs for errors
Verify database connectivity

Issue: Worker not deregistering on shutdown