Operational Runbook: Queue Management
Service: Attune Executor
Component: Execution Queue Manager
Audience: Operations, SRE, DevOps
Last Updated: 2025-01-27
Table of Contents
- Overview
- Quick Reference
- Monitoring
- Common Issues
- Troubleshooting Procedures
- Maintenance Tasks
- Emergency Procedures
- Capacity Planning
Overview
The Attune Executor service manages per-action FIFO execution queues to ensure fair, ordered processing when policy limits (concurrency, rate limits) are enforced. This runbook covers operational procedures for monitoring and managing these queues.
Key Concepts
- Queue: Per-action FIFO buffer of waiting executions
- Active Count: Number of currently running executions for an action
- Max Concurrent: Policy-enforced limit on parallel executions
- Queue Length: Number of executions waiting in queue
- FIFO: First-In-First-Out ordering guarantee
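The interplay of these concepts can be sketched as a toy admission rule. This is illustrative only — the real ExecutionQueueManager is in-memory inside the executor process, and the function names here are hypothetical:

```shell
#!/usr/bin/env bash
# Toy model of queue admission under a concurrency limit (illustrative only).
max_concurrent=2   # Max Concurrent: policy-enforced parallel limit
active=0           # Active Count: currently running executions
queue=()           # Queue: FIFO buffer of waiting execution ids

submit() {
  if [ "$active" -lt "$max_concurrent" ]; then
    active=$((active + 1))
    echo "run $1"
  else
    queue+=("$1")
    echo "enqueue $1 (queue_length=${#queue[@]})"
  fi
}

complete() {
  active=$((active - 1))
  if [ "${#queue[@]}" -gt 0 ]; then
    next=${queue[0]}
    queue=("${queue[@]:1}")   # pop the head: preserves FIFO order
    active=$((active + 1))
    echo "dequeue $next"
  fi
}

submit e1; submit e2; submit e3   # e3 exceeds the limit and waits
complete                          # a slot frees; e3 is dequeued in order
```

The key property to notice is that a completion always releases the oldest waiting execution, which is the FIFO guarantee this runbook's "FIFO Order Violation" issue is about.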
System Components
- ExecutionQueueManager: Core queue management (in-memory)
- CompletionListener: Processes worker completion messages
- QueueStatsRepository: Persists statistics to database
- API Endpoint: /api/v1/actions/:ref/queue-stats
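As a quick illustration of working with the endpoint's response, the queue depth can be pulled out with jq. The payload below is a mocked sample (the .data.queue_length path matches the jq filters used later in this runbook, but verify field names against your deployment):

```shell
# Mock a response so the jq filter can be tried offline.
cat <<'EOF' > /tmp/queue-stats-sample.json
{"data": {"queue_length": 42, "active_count": 5, "max_concurrent": 5}}
EOF

# Against a live executor this would instead be:
#   curl -s http://localhost:8080/api/v1/actions/core.http.get/queue-stats
jq -r '.data | "queued=\(.queue_length) active=\(.active_count)/\(.max_concurrent)"' \
  /tmp/queue-stats-sample.json
```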
Quick Reference
Health Check Commands
# Check executor service status
systemctl status attune-executor
# Check active queues
curl -s http://localhost:8080/api/v1/actions/core.http.get/queue-stats | jq
# Database query for all active queues
psql -U attune -d attune -c "
SELECT a.ref, qs.queue_length, qs.active_count, qs.max_concurrent,
qs.oldest_enqueued_at, qs.last_updated
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE queue_length > 0 OR active_count > 0
ORDER BY queue_length DESC;
"
# Check executor logs for queue issues
journalctl -u attune-executor -n 100 --no-pager | grep -i queue
Emergency Actions
# Restart executor (clears in-memory queues)
sudo systemctl restart attune-executor
# Restart all workers (forces completion messages)
sudo systemctl restart attune-worker@*
# Clear stale queue stats (older than 1 hour, inactive)
psql -U attune -d attune -c "
DELETE FROM attune.queue_stats
WHERE last_updated < NOW() - INTERVAL '1 hour'
AND queue_length = 0
AND active_count = 0;
"
Monitoring
Key Metrics to Track
| Metric | Threshold | Action |
|---|---|---|
| Queue Length | > 100 | Investigate load |
| Queue Length | > 500 | Add workers |
| Queue Length | > 1000 | Emergency response |
| Oldest Enqueued | > 10 min | Check workers |
| Oldest Enqueued | > 30 min | Critical issue |
| Active < Max Concurrent while queue > 0 | Any | Workers stuck |
| Last Updated | > 10 min | Executor issue |
Monitoring Queries
Active Queues Overview
SELECT
a.ref AS action,
qs.queue_length,
qs.active_count,
qs.max_concurrent,
ROUND(EXTRACT(EPOCH FROM (NOW() - qs.oldest_enqueued_at)) / 60, 1) AS wait_minutes,
ROUND(qs.total_completed::float / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct,
qs.last_updated
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE queue_length > 0 OR active_count > 0
ORDER BY queue_length DESC;
Top Actions by Throughput
SELECT
a.ref AS action,
qs.total_enqueued,
qs.total_completed,
qs.total_enqueued - qs.total_completed AS pending,
ROUND(qs.total_completed::float / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE qs.total_enqueued > 0
ORDER BY qs.total_enqueued DESC
LIMIT 20;
Stuck Queues (Not Progressing)
SELECT
a.ref AS action,
qs.queue_length,
qs.active_count,
ROUND(EXTRACT(EPOCH FROM (NOW() - qs.last_updated)) / 60, 1) AS stale_minutes,
qs.oldest_enqueued_at
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE (queue_length > 0 OR active_count > 0)
AND last_updated < NOW() - INTERVAL '10 minutes'
ORDER BY stale_minutes DESC;
Queue Growth Rate
-- Create a monitoring table for snapshots
CREATE TABLE IF NOT EXISTS attune.queue_snapshots (
snapshot_time TIMESTAMPTZ DEFAULT NOW(),
action_id BIGINT,
queue_length INT,
active_count INT,
total_enqueued BIGINT
);
-- Take snapshot (run every 5 minutes)
INSERT INTO attune.queue_snapshots (action_id, queue_length, active_count, total_enqueued)
SELECT action_id, queue_length, active_count, total_enqueued
FROM attune.queue_stats
WHERE queue_length > 0 OR active_count > 0;
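One way to run the snapshot every 5 minutes is a cron entry (the path, user, and use of cron at all are assumptions; any scheduler works):

```
# /etc/cron.d/attune-queue-snapshots — hypothetical entry
*/5 * * * * attune psql -U attune -d attune -c "INSERT INTO attune.queue_snapshots (action_id, queue_length, active_count, total_enqueued) SELECT action_id, queue_length, active_count, total_enqueued FROM attune.queue_stats WHERE queue_length > 0 OR active_count > 0;"
```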
-- Analyze growth rate
SELECT
a.ref AS action,
s1.queue_length AS queue_now,
s2.queue_length AS queue_5min_ago,
s1.queue_length - s2.queue_length AS growth,
s1.total_enqueued - s2.total_enqueued AS new_requests
FROM attune.queue_snapshots s1
JOIN attune.queue_snapshots s2 ON s2.action_id = s1.action_id
JOIN attune.action a ON a.id = s1.action_id
WHERE s1.snapshot_time >= NOW() - INTERVAL '1 minute'
AND s2.snapshot_time >= NOW() - INTERVAL '6 minutes'
AND s2.snapshot_time < NOW() - INTERVAL '4 minutes'
ORDER BY growth DESC;
Alerting Rules
Prometheus/Grafana Alerts (if metrics exported):
groups:
- name: attune_queues
interval: 30s
rules:
- alert: HighQueueDepth
expr: attune_queue_length > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Queue depth high for {{ $labels.action }}"
description: "Queue has {{ $value }} waiting executions"
- alert: CriticalQueueDepth
expr: attune_queue_length > 500
for: 2m
labels:
severity: critical
annotations:
summary: "Critical queue depth for {{ $labels.action }}"
description: "Queue has {{ $value }} waiting executions - add workers"
- alert: StuckQueue
expr: attune_queue_last_updated < time() - 600
for: 5m
labels:
severity: critical
annotations:
summary: "Queue not progressing for {{ $labels.action }}"
description: "Queue hasn't updated in 10+ minutes"
- alert: OldestExecutionAging
expr: attune_queue_oldest_age_seconds > 1800
for: 5m
labels:
severity: warning
annotations:
summary: "Execution waiting 30+ minutes for {{ $labels.action }}"
Nagios/Icinga Check:
#!/bin/bash
# /usr/lib/nagios/plugins/check_attune_queues.sh
WARN_THRESHOLD=${1:-100}
CRIT_THRESHOLD=${2:-500}
MAX_QUEUE=$(psql -U attune -d attune -t -c "
SELECT COALESCE(MAX(queue_length), 0) FROM attune.queue_stats;
")
if [ "$MAX_QUEUE" -ge "$CRIT_THRESHOLD" ]; then
echo "CRITICAL: Max queue depth $MAX_QUEUE >= $CRIT_THRESHOLD"
exit 2
elif [ "$MAX_QUEUE" -ge "$WARN_THRESHOLD" ]; then
echo "WARNING: Max queue depth $MAX_QUEUE >= $WARN_THRESHOLD"
exit 1
else
echo "OK: Max queue depth $MAX_QUEUE"
exit 0
fi
Common Issues
Issue 1: Queue Growing Continuously
Symptoms:
- Queue length increases over time
- Never decreases even when workers are idle
- oldest_enqueued_at gets older
Common Causes:
- Workers not processing fast enough
- Too many incoming requests
- Concurrency limit too low
- Worker crashes/restarts
Quick Diagnosis:
# Check worker status
systemctl status attune-worker@*
# Check worker resource usage
ps aux | grep attune-worker
top -p $(pgrep -d',' attune-worker)
# Check recent completions
psql -U attune -d attune -c "
SELECT COUNT(*), status
FROM attune.execution
WHERE updated > NOW() - INTERVAL '5 minutes'
GROUP BY status;
"
Resolution: See Troubleshooting: Growing Queue
Issue 2: Queue Not Progressing
Symptoms:
- Queue length stays constant
- last_updated timestamp doesn't change
- Active executions showing but not completing
Common Causes:
- Workers crashed/hung
- CompletionListener not running
- Message queue connection lost
- Database connection issue
Quick Diagnosis:
# Check executor process
ps aux | grep attune-executor
journalctl -u attune-executor -n 50 --no-pager
# Check message queue
rabbitmqctl list_queues name messages | grep execution.completed
# Check for stuck executions
psql -U attune -d attune -c "
SELECT id, action, status, created, updated
FROM attune.execution
WHERE status = 'running'
AND updated < NOW() - INTERVAL '10 minutes'
ORDER BY created DESC
LIMIT 10;
"
Resolution: See Troubleshooting: Stuck Queue
Issue 3: Queue Full Errors
Symptoms:
- API returns "Queue full (max length: 10000)" errors
- New executions rejected
- Users report action failures
Common Causes:
- Sudden traffic spike
- Worker capacity exhausted
- max_queue_length too low
- Slow action execution
Quick Diagnosis:
# Check current queue stats
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq
# Check configuration
grep -A5 "queue:" /etc/attune/config.yaml
# Check worker count
systemctl list-units attune-worker@* | grep running
Resolution: See Troubleshooting: Queue Full
Issue 4: FIFO Order Violation
Symptoms:
- Executions complete out of order
- Later requests finish before earlier ones
- Workflow dependencies break
Severity: CRITICAL - This indicates a bug
Immediate Action:
- Capture executor logs immediately
- Document the violation with timestamps
- Restart executor service
- File critical bug report
Data to Collect:
# Capture logs
journalctl -u attune-executor --since "10 minutes ago" > /tmp/executor-fifo-violation.log
# Capture database state
psql -U attune -d attune -c "
SELECT id, action, status, created, updated
FROM attune.execution
WHERE action = <affected_action_id>
AND created > NOW() - INTERVAL '1 hour'
ORDER BY created;
" > /tmp/execution-order.txt
# Capture queue stats
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq > /tmp/queue-stats.json
Troubleshooting Procedures
Growing Queue
Procedure:
1. Assess Severity
# Get current queue depth
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'
2. Check Worker Health
# Active workers
systemctl list-units attune-worker@* | grep running | wc -l
# Worker resource usage
ps aux | grep attune-worker | awk '{print $3, $4, $11}'
# Recent worker errors
journalctl -u attune-worker@* -n 100 --no-pager | grep -i error
3. Check Completion Rate
SELECT
  COUNT(*) FILTER (WHERE created > NOW() - INTERVAL '5 minutes') AS recent_created,
  COUNT(*) FILTER (WHERE updated > NOW() - INTERVAL '5 minutes'
                     AND status IN ('succeeded', 'failed')) AS recent_completed
FROM attune.execution
WHERE action = <action_id>;
4. Solutions (in order of preference):
a. Scale Workers (if completion rate too low):
# Add more worker instances
sudo systemctl start attune-worker@2
sudo systemctl start attune-worker@3
b. Increase Concurrency (if safe):
# In config.yaml or via API
policies:
  actions:
    affected.action:
      concurrency_limit: 10  # Increase from 5
c. Rate Limit at API (if traffic spike):
# In API config
rate_limits:
  global:
    max_requests_per_minute: 1000
d. Temporary Queue Increase (emergency only):
executor:
  queue:
    max_queue_length: 20000  # Increase from 10000
Then restart executor:
sudo systemctl restart attune-executor
5. Monitor Results
watch -n 5 "curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'"
Stuck Queue
Procedure:
1. Identify Stuck Executions
SELECT id, status, created, updated,
       EXTRACT(EPOCH FROM (NOW() - updated)) / 60 AS stuck_minutes
FROM attune.execution
WHERE action = <action_id>
  AND status IN ('running', 'requested')
  AND updated < NOW() - INTERVAL '10 minutes'
ORDER BY created;
2. Check Worker Status
# Are workers running?
systemctl status attune-worker@*
# Are workers processing?
tail -f /var/log/attune/worker.log | grep execution_id
3. Check Message Queue
# Completion messages backing up?
rabbitmqctl list_queues name messages | grep execution.completed
# Connection issues?
rabbitmqctl list_connections
4. Check CompletionListener
# Is listener running?
journalctl -u attune-executor -n 100 --no-pager | grep CompletionListener
# Recent completions processed?
journalctl -u attune-executor -n 100 --no-pager | grep "notify_completion"
5. Solutions:
a. Restart Stuck Workers:
# Graceful restart
sudo systemctl restart attune-worker@1
b. Restart Executor (if CompletionListener stuck):
sudo systemctl restart attune-executor
c. Force Complete Stuck Executions (emergency):
-- CAUTION: Only for truly stuck executions
UPDATE attune.execution
SET status = 'failed',
    result = '{"error": "Execution stuck, manually failed by operator"}',
    updated = NOW()
WHERE id IN (<stuck_execution_ids>);
d. Purge and Restart (nuclear option):
# Stop services
sudo systemctl stop attune-executor
sudo systemctl stop attune-worker@*
# Clear message queues
rabbitmqctl purge_queue execution.requested
rabbitmqctl purge_queue execution.completed
# Restart services
sudo systemctl start attune-executor
sudo systemctl start attune-worker@1
Queue Full
Procedure:
1. Immediate Mitigation (choose one):
a. Temporarily Increase Limit:
# config.yaml
executor:
  queue:
    max_queue_length: 20000
sudo systemctl restart attune-executor
b. Add Workers:
sudo systemctl start attune-worker@{2..5}
c. Increase Concurrency:
policies:
  actions:
    affected.action:
      concurrency_limit: 20  # Increase
2. Analyze Root Cause
# Traffic pattern
psql -U attune -d attune -c "
SELECT DATE_TRUNC('minute', created) AS minute, COUNT(*)
FROM attune.execution
WHERE action = <action_id>
  AND created > NOW() - INTERVAL '1 hour'
GROUP BY minute
ORDER BY minute DESC;
"
# Action performance
psql -U attune -d attune -c "
SELECT AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration_seconds
FROM attune.execution
WHERE action = <action_id>
  AND status = 'succeeded'
  AND created > NOW() - INTERVAL '1 hour';
"
3. Long-term Solution:
- Traffic spike: Add API rate limiting
- Slow action: Optimize action code
- Under-capacity: Permanently scale workers
- Configuration: Adjust concurrency limits
Maintenance Tasks
Daily
#!/bin/bash
# daily-queue-check.sh
echo "=== Active Queues ==="
psql -U attune -d attune -c "
SELECT a.ref, qs.queue_length, qs.active_count
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE queue_length > 0 OR active_count > 0;
"
echo "=== Stuck Queues ==="
psql -U attune -d attune -c "
SELECT a.ref, qs.queue_length,
ROUND(EXTRACT(EPOCH FROM (NOW() - qs.last_updated)) / 60, 1) AS stale_minutes
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE (queue_length > 0 OR active_count > 0)
AND last_updated < NOW() - INTERVAL '30 minutes';
"
echo "=== Top Actions by Volume ==="
psql -U attune -d attune -c "
SELECT a.ref, qs.total_enqueued, qs.total_completed
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
ORDER BY qs.total_enqueued DESC
LIMIT 10;
"
Weekly
#!/bin/bash
# weekly-queue-maintenance.sh
echo "=== Cleaning Stale Queue Stats ==="
psql -U attune -d attune -c "
DELETE FROM attune.queue_stats
WHERE last_updated < NOW() - INTERVAL '7 days'
AND queue_length = 0
AND active_count = 0;
"
echo "=== Queue Snapshots Cleanup ==="
psql -U attune -d attune -c "
DELETE FROM attune.queue_snapshots
WHERE snapshot_time < NOW() - INTERVAL '30 days';
"
echo "=== Executor Log Rotation ==="
journalctl --vacuum-time=30d -u attune-executor
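If the daily and weekly scripts above are installed, they can be scheduled via cron (paths, user, and schedule below are assumptions; adjust to your environment):

```
# /etc/cron.d/attune-queue-maintenance — hypothetical entries
0 8 * * *  root  /usr/local/bin/daily-queue-check.sh        >> /var/log/attune/queue-check.log 2>&1
0 3 * * 0  root  /usr/local/bin/weekly-queue-maintenance.sh >> /var/log/attune/queue-maint.log 2>&1
```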
Monthly
- Review queue capacity trends
- Analyze high-volume actions
- Plan scaling based on growth
- Update alert thresholds
- Review and test runbook procedures
Emergency Procedures
Emergency: System-Wide Queue Overload
Symptoms:
- Multiple actions with critical queue depths
- System-wide performance degradation
- API response times degraded
Procedure:
1. Enable Emergency Mode:
# config.yaml
executor:
  emergency_mode: true  # Relaxes limits
  queue:
    max_queue_length: 50000
2. Scale Workers Aggressively:
for i in {1..10}; do
  sudo systemctl start attune-worker@$i
done
3. Temporarily Disable Non-Critical Actions:
-- Disable low-priority actions
UPDATE attune.action
SET enabled = false
WHERE priority < 5
   OR tags @> '["low-priority"]';
4. Enable API Rate Limiting:
api:
  rate_limits:
    global:
      enabled: true
      max_requests_per_minute: 500
5. Monitor Recovery:
watch -n 10 "psql -U attune -d attune -t -c 'SELECT SUM(queue_length) FROM attune.queue_stats;'"
6. Post-Incident:
- Document what happened
- Analyze root cause
- Update capacity plan
- Restore normal configuration
Emergency: Executor Crash Loop
Symptoms:
- Executor service repeatedly crashes
- Queues not progressing
- High memory usage before crash
Procedure:
1. Capture Crash Logs:
journalctl -u attune-executor --since "30 minutes ago" > /tmp/executor-crash.log
dmesg | tail -100 > /tmp/dmesg-crash.log
2. Check for Memory Issues:
# Check OOM kills
grep -i "out of memory" /var/log/syslog
grep -i "killed process" /var/log/kern.log
3. Emergency Restart with Limited Queues:
# config.yaml
executor:
  queue:
    max_queue_length: 1000  # Reduce drastically
    enable_metrics: false   # Reduce overhead
4. Start in Safe Mode:
sudo systemctl start attune-executor
# Monitor memory
watch -n 1 "ps aux | grep attune-executor | grep -v grep"
5. If Still Crashing:
# Disable queue persistence temporarily
# In code or via feature flag
export ATTUNE__EXECUTOR__QUEUE__ENABLE_METRICS=false
sudo systemctl restart attune-executor
6. Escalate:
- Contact development team
- Provide crash logs and memory dumps
- Consider rolling back to previous version
Capacity Planning
Calculating Required Capacity
Formula:
Required Workers = (Peak Requests/Hour × Avg Duration) / 3600 / Concurrency Limit
Example:
- Peak: 10,000 requests/hour
- Avg Duration: 5 seconds
- Concurrency: 10 per worker
Workers = (10,000 × 5) / 3600 / 10 = 1.4 → 2 workers minimum
Add 50% buffer → 3 workers recommended
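The worked example above can be reproduced with a small awk one-liner (the 50% buffer and ceiling rounding mirror the recommendation; variable names are illustrative):

```shell
#!/usr/bin/env bash
# Worked example of the sizing formula, using the numbers from this runbook.
peak_per_hour=10000   # Peak requests/hour
avg_duration_s=5      # Average execution duration (seconds)
concurrency=10        # Concurrency limit per worker

result=$(awk -v r="$peak_per_hour" -v d="$avg_duration_s" -v c="$concurrency" 'BEGIN {
  exact = r * d / 3600 / c                               # fractional workers needed
  min = (exact == int(exact)) ? exact : int(exact) + 1   # ceiling: minimum workers
  rec = int(min * 1.5 + 0.999)                           # +50% buffer, rounded up
  printf "exact=%.2f min=%d recommended=%d", exact, min, rec
}')
echo "$result"
# exact=1.39 min=2 recommended=3
```

Swap in your own peak rate, duration, and per-worker concurrency to size other actions.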
Growth Planning
Monitor monthly trends:
SELECT
DATE_TRUNC('day', created) AS day,
COUNT(*) AS executions,
AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration
FROM attune.execution
WHERE created > NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day;
Capacity Recommendations
| Queue Depth | Worker Count | Action |
|---|---|---|
| < 10 | Current | Maintain |
| 10-50 | +25% | Plan scale-up |
| 50-100 | +50% | Scale soon |
| 100+ | +100% | Scale now |
Version: 1.0
Maintained By: SRE Team
Last Updated: 2025-01-27