working out the worker/execution interface
This commit is contained in:
256
docs/QUICKREF-worker-lifecycle-heartbeat.md
Normal file
256
docs/QUICKREF-worker-lifecycle-heartbeat.md
Normal file
@@ -0,0 +1,256 @@
|
||||
# Quick Reference: Worker Lifecycle & Heartbeat Validation
|
||||
|
||||
**Last Updated:** 2026-02-04
|
||||
**Status:** Production Ready
|
||||
|
||||
## Overview
|
||||
|
||||
Workers use graceful shutdown and heartbeat validation to ensure reliable execution scheduling.
|
||||
|
||||
## Worker Lifecycle
|
||||
|
||||
### Startup
|
||||
1. Load configuration
|
||||
2. Connect to database and message queue
|
||||
3. Detect runtime capabilities
|
||||
4. Register in database (status = `Active`)
|
||||
5. Start heartbeat loop
|
||||
6. Start consuming execution messages
|
||||
|
||||
### Normal Operation
|
||||
- **Heartbeat:** Updates `worker.last_heartbeat` every 30 seconds (default)
|
||||
- **Status:** Remains `Active`
|
||||
- **Executions:** Processes messages from worker-specific queue
|
||||
|
||||
### Shutdown (Graceful)
|
||||
1. Receive SIGINT or SIGTERM signal
|
||||
2. Stop heartbeat loop
|
||||
3. Mark worker as `Inactive` in database
|
||||
4. Exit cleanly
|
||||
|
||||
### Shutdown (Crash/Kill)
|
||||
- Worker does not deregister
|
||||
- Status remains `Active` in database
|
||||
- Heartbeat stops updating
|
||||
- **Executor detects as stale after 90 seconds**
|
||||
|
||||
## Heartbeat Validation
|
||||
|
||||
### Configuration
|
||||
```yaml
|
||||
worker:
|
||||
heartbeat_interval: 30 # seconds (default)
|
||||
```
|
||||
|
||||
### Staleness Threshold
|
||||
- **Formula:** `heartbeat_interval * 3 = 90 seconds`
|
||||
- **Rationale:** Allows 2 missed heartbeats + buffer
|
||||
- **Detection:** Executor checks on every scheduling attempt
|
||||
|
||||
### Worker States
|
||||
|
||||
| Last Heartbeat Age | Status | Schedulable |
|
||||
|-------------------|--------|-------------|
|
||||
| < 90 seconds | Fresh | ✅ Yes |
|
||||
| ≥ 90 seconds | Stale | ❌ No |
|
||||
| None/NULL | Stale | ❌ No |
|
||||
|
||||
## Executor Scheduling Flow
|
||||
|
||||
```
|
||||
Execution Requested
|
||||
↓
|
||||
Find Action Workers
|
||||
↓
|
||||
Filter by Runtime Compatibility
|
||||
↓
|
||||
Filter by Active Status
|
||||
↓
|
||||
Filter by Heartbeat Freshness ← NEW
|
||||
↓
|
||||
Select Best Worker
|
||||
↓
|
||||
Queue to Worker
|
||||
```
|
||||
|
||||
## Signal Handling
|
||||
|
||||
### Supported Signals
|
||||
- **SIGINT** (Ctrl+C) - Graceful shutdown
|
||||
- **SIGTERM** (docker stop, k8s termination) - Graceful shutdown
|
||||
- **SIGKILL** (force kill) - No cleanup possible
|
||||
|
||||
### Docker Example
|
||||
```bash
|
||||
# Graceful shutdown (10s grace period)
|
||||
docker compose stop worker-shell
|
||||
|
||||
# Force kill (immediate)
|
||||
docker compose kill worker-shell
|
||||
```
|
||||
|
||||
### Kubernetes Example
|
||||
```yaml
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 30 # Time for graceful shutdown
|
||||
```
|
||||
|
||||
## Monitoring & Debugging
|
||||
|
||||
### Check Worker Status
|
||||
```sql
|
||||
SELECT id, name, status, last_heartbeat,
|
||||
EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) as seconds_ago
|
||||
FROM worker
|
||||
WHERE worker_role = 'action'
|
||||
ORDER BY last_heartbeat DESC;
|
||||
```
|
||||
|
||||
### Identify Stale Workers
|
||||
```sql
|
||||
SELECT id, name, status,
|
||||
EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) as seconds_ago
|
||||
FROM worker
|
||||
WHERE worker_role = 'action'
|
||||
AND status = 'active'
|
||||
AND (last_heartbeat IS NULL OR last_heartbeat < NOW() - INTERVAL '90 seconds');
|
||||
```
|
||||
|
||||
### View Worker Logs
|
||||
```bash
|
||||
# Docker Compose
|
||||
docker compose logs -f worker-shell
|
||||
|
||||
# Look for:
|
||||
# - "Worker registered with ID: X"
|
||||
# - "Heartbeat sent successfully" (debug level)
|
||||
# - "Received SIGTERM signal"
|
||||
# - "Deregistering worker ID: X"
|
||||
```
|
||||
|
||||
### View Executor Logs
|
||||
```bash
|
||||
docker compose logs -f executor
|
||||
|
||||
# Look for:
|
||||
# - "Worker X heartbeat is stale: last seen N seconds ago"
|
||||
# - "No workers with fresh heartbeats available"
|
||||
```
|
||||
|
||||
## Common Issues
|
||||
|
||||
### Issue: "No workers with fresh heartbeats available"
|
||||
|
||||
**Causes:**
|
||||
1. All workers crashed/terminated
|
||||
2. Workers paused/frozen
|
||||
3. Network partition between workers and database
|
||||
4. Database connection issues
|
||||
|
||||
**Solutions:**
|
||||
1. Check if workers are running: `docker compose ps`
|
||||
2. Restart workers: `docker compose restart worker-shell`
|
||||
3. Check worker logs for errors
|
||||
4. Verify database connectivity
|
||||
|
||||
### Issue: Worker not deregistering on shutdown
|
||||
|
||||
**Causes:**
|
||||
1. SIGKILL used instead of SIGTERM
|
||||
2. Grace period too short
|
||||
3. Database connection lost before deregister
|
||||
|
||||
**Solutions:**
|
||||
1. Use `docker compose stop` not `docker compose kill`
|
||||
2. Increase grace period: `docker compose down -t 30`
|
||||
3. Check network connectivity
|
||||
|
||||
### Issue: Worker stuck in Active status after crash
|
||||
|
||||
**Behavior:** Normal - executor will detect as stale after 90s
|
||||
|
||||
**Manual Cleanup (if needed):**
|
||||
```sql
|
||||
UPDATE worker
|
||||
SET status = 'inactive'
|
||||
WHERE last_heartbeat < NOW() - INTERVAL '5 minutes';
|
||||
```
|
||||
|
||||
## Testing
|
||||
|
||||
### Test Graceful Shutdown
|
||||
```bash
|
||||
# Start worker
|
||||
docker compose up -d worker-shell
|
||||
|
||||
# Wait for registration
|
||||
sleep 5
|
||||
|
||||
# Check status (should be 'active')
|
||||
docker compose exec postgres psql -U attune -c \
|
||||
"SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"
|
||||
|
||||
# Graceful shutdown
|
||||
docker compose stop worker-shell
|
||||
|
||||
# Check status (should be 'inactive')
|
||||
docker compose exec postgres psql -U attune -c \
|
||||
"SELECT name, status FROM worker WHERE name LIKE 'worker-shell%';"
|
||||
```
|
||||
|
||||
### Test Heartbeat Validation
|
||||
```bash
|
||||
# Pause worker (simulate freeze)
|
||||
docker compose pause worker-shell
|
||||
|
||||
# Wait for staleness (90+ seconds)
|
||||
sleep 100
|
||||
|
||||
# Try to schedule execution (should fail)
|
||||
# Use API or CLI to trigger execution
|
||||
attune execution create --action core.echo --param message="test"
|
||||
|
||||
# Should see: "No workers with fresh heartbeats available"
|
||||
```
|
||||
|
||||
## Configuration Reference
|
||||
|
||||
### Worker Config
|
||||
```yaml
|
||||
worker:
|
||||
name: "worker-01"
|
||||
heartbeat_interval: 30 # Heartbeat update frequency (seconds)
|
||||
max_concurrent_tasks: 10 # Concurrent execution limit
|
||||
task_timeout: 300 # Per-task timeout (seconds)
|
||||
```
|
||||
|
||||
### Relevant Constants
|
||||
```rust
|
||||
// crates/executor/src/scheduler.rs
|
||||
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
|
||||
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;
|
||||
// Max age = 90 seconds
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use Graceful Shutdown:** Always use SIGTERM, not SIGKILL
|
||||
2. **Monitor Heartbeats:** Alert when workers go stale
|
||||
3. **Set Grace Periods:** Allow 10-30s for worker shutdown in production
|
||||
4. **Health Checks:** Implement liveness probes in Kubernetes
|
||||
5. **Auto-Restart:** Configure restart policies for crashed workers
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- `work-summary/2026-02-worker-graceful-shutdown-heartbeat-validation.md` - Implementation details
|
||||
- `docs/architecture/worker-service.md` - Worker architecture
|
||||
- `docs/architecture/executor-service.md` - Executor architecture
|
||||
- `AGENTS.md` - Project conventions
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
- [ ] Configurable staleness multiplier
|
||||
- [ ] Active health probing
|
||||
- [ ] Graceful work completion before shutdown
|
||||
- [ ] Worker reconnection logic
|
||||
- [ ] Load-based worker selection
|
||||
Reference in New Issue
Block a user