# Operational Runbook: Queue Management

**Service**: Attune Executor
**Component**: Execution Queue Manager
**Audience**: Operations, SRE, DevOps
**Last Updated**: 2025-01-27

---

## Table of Contents

1. [Overview](#overview)
2. [Quick Reference](#quick-reference)
3. [Monitoring](#monitoring)
4. [Common Issues](#common-issues)
5. [Troubleshooting Procedures](#troubleshooting-procedures)
6. [Maintenance Tasks](#maintenance-tasks)
7. [Emergency Procedures](#emergency-procedures)
8. [Capacity Planning](#capacity-planning)

---

## Overview

The Attune Executor service manages per-action FIFO execution queues to ensure fair, ordered processing when policy limits (concurrency, rate limits) are enforced. This runbook covers operational procedures for monitoring and managing these queues.

### Key Concepts

- **Queue**: Per-action FIFO buffer of waiting executions
- **Active Count**: Number of currently running executions for an action
- **Max Concurrent**: Policy-enforced limit on parallel executions
- **Queue Length**: Number of executions waiting in the queue
- **FIFO**: First-In-First-Out ordering guarantee

### System Components

- **ExecutionQueueManager**: Core queue management (in-memory)
- **CompletionListener**: Processes worker completion messages
- **QueueStatsRepository**: Persists statistics to the database
- **API Endpoint**: `/api/v1/actions/:ref/queue-stats`

---

## Quick Reference

### Health Check Commands

```bash
# Check executor service status
systemctl status attune-executor

# Check active queues
curl -s http://localhost:8080/api/v1/actions/core.http.get/queue-stats | jq

# Database query for all active queues
psql -U attune -d attune -c "
  SELECT a.ref, qs.queue_length, qs.active_count, qs.max_concurrent,
         qs.oldest_enqueued_at, qs.last_updated
  FROM attune.queue_stats qs
  JOIN attune.action a ON a.id = qs.action_id
  WHERE queue_length > 0 OR active_count > 0
  ORDER BY queue_length DESC;
"

# Check executor logs for queue issues
journalctl -u attune-executor -n 100 --no-pager | grep -i queue
```

### Emergency Actions

```bash
# Restart executor (clears in-memory queues)
sudo systemctl restart attune-executor

# Restart all workers (forces completion messages)
sudo systemctl restart attune-worker@*

# Clear stale queue stats (older than 1 hour, inactive)
psql -U attune -d attune -c "
  DELETE FROM attune.queue_stats
  WHERE last_updated < NOW() - INTERVAL '1 hour'
    AND queue_length = 0 AND active_count = 0;
"
```

---

## Monitoring

### Key Metrics to Track

| Metric | Threshold | Action |
|--------|-----------|--------|
| Queue Length | > 100 | Investigate load |
| Queue Length | > 500 | Add workers |
| Queue Length | > 1000 | Emergency response |
| Oldest Enqueued | > 10 min | Check workers |
| Oldest Enqueued | > 30 min | Critical issue |
| Active < Max Concurrent | Any | Workers stuck |
| Last Updated | > 10 min | Executor issue |

### Monitoring Queries

#### Active Queues Overview

```sql
SELECT
  a.ref AS action,
  qs.queue_length,
  qs.active_count,
  qs.max_concurrent,
  ROUND((EXTRACT(EPOCH FROM (NOW() - qs.oldest_enqueued_at)) / 60)::numeric, 1) AS wait_minutes,
  ROUND(qs.total_completed::numeric / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct,
  qs.last_updated
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE queue_length > 0 OR active_count > 0
ORDER BY queue_length DESC;
```

#### Top Actions by Throughput

```sql
SELECT
  a.ref AS action,
  qs.total_enqueued,
  qs.total_completed,
  qs.total_enqueued - qs.total_completed AS pending,
  ROUND(qs.total_completed::numeric / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE qs.total_enqueued > 0
ORDER BY qs.total_enqueued DESC
LIMIT 20;
```

#### Stuck Queues (Not Progressing)

```sql
SELECT
  a.ref AS action,
  qs.queue_length,
  qs.active_count,
  ROUND((EXTRACT(EPOCH FROM (NOW() - qs.last_updated)) / 60)::numeric, 1) AS stale_minutes,
  qs.oldest_enqueued_at
FROM attune.queue_stats qs
JOIN attune.action a ON a.id =
qs.action_id
WHERE (queue_length > 0 OR active_count > 0)
  AND last_updated < NOW() - INTERVAL '10 minutes'
ORDER BY stale_minutes DESC;
```

#### Queue Growth Rate

```sql
-- Create a monitoring table for snapshots
CREATE TABLE IF NOT EXISTS attune.queue_snapshots (
  snapshot_time TIMESTAMPTZ DEFAULT NOW(),
  action_id BIGINT,
  queue_length INT,
  active_count INT,
  total_enqueued BIGINT
);

-- Take a snapshot (run every 5 minutes)
INSERT INTO attune.queue_snapshots (action_id, queue_length, active_count, total_enqueued)
SELECT action_id, queue_length, active_count, total_enqueued
FROM attune.queue_stats
WHERE queue_length > 0 OR active_count > 0;

-- Analyze growth rate (current snapshot vs. the one from ~5 minutes ago)
SELECT
  a.ref AS action,
  s1.queue_length AS queue_now,
  s2.queue_length AS queue_5min_ago,
  s1.queue_length - s2.queue_length AS growth,
  s1.total_enqueued - s2.total_enqueued AS new_requests
FROM attune.queue_snapshots s1
JOIN attune.queue_snapshots s2 ON s2.action_id = s1.action_id
JOIN attune.action a ON a.id = s1.action_id
WHERE s1.snapshot_time >= NOW() - INTERVAL '1 minute'
  AND s2.snapshot_time >= NOW() - INTERVAL '6 minutes'
  AND s2.snapshot_time < NOW() - INTERVAL '4 minutes'
ORDER BY growth DESC;
```

### Alerting Rules

**Prometheus/Grafana Alerts** (if metrics are exported):

```yaml
groups:
  - name: attune_queues
    interval: 30s
    rules:
      - alert: HighQueueDepth
        expr: attune_queue_length > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue depth high for {{ $labels.action }}"
          description: "Queue has {{ $value }} waiting executions"

      - alert: CriticalQueueDepth
        expr: attune_queue_length > 500
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical queue depth for {{ $labels.action }}"
          description: "Queue has {{ $value }} waiting executions - add workers"

      - alert: StuckQueue
        expr: attune_queue_last_updated < time() - 600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Queue not progressing for {{ $labels.action }}"
          description: "Queue hasn't updated in 10+ minutes"

      - alert: OldestExecutionAging
        expr: attune_queue_oldest_age_seconds > 1800
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Execution waiting 30+ minutes for {{ $labels.action }}"
```

**Nagios/Icinga Check**:

```bash
#!/bin/bash
# /usr/lib/nagios/plugins/check_attune_queues.sh

WARN_THRESHOLD=${1:-100}
CRIT_THRESHOLD=${2:-500}

# -t -A: tuples only, unaligned output, so the value has no padding whitespace
MAX_QUEUE=$(psql -U attune -d attune -t -A -c "
  SELECT COALESCE(MAX(queue_length), 0) FROM attune.queue_stats;
")

if [ "$MAX_QUEUE" -ge "$CRIT_THRESHOLD" ]; then
  echo "CRITICAL: Max queue depth $MAX_QUEUE >= $CRIT_THRESHOLD"
  exit 2
elif [ "$MAX_QUEUE" -ge "$WARN_THRESHOLD" ]; then
  echo "WARNING: Max queue depth $MAX_QUEUE >= $WARN_THRESHOLD"
  exit 1
else
  echo "OK: Max queue depth $MAX_QUEUE"
  exit 0
fi
```

---

## Common Issues

### Issue 1: Queue Growing Continuously

**Symptoms:**

- Queue length increases over time
- Never decreases, even when workers are idle
- `oldest_enqueued_at` gets older

**Common Causes:**

1. Workers not processing fast enough
2. Too many incoming requests
3. Concurrency limit too low
4. Worker crashes/restarts

**Quick Diagnosis:**

```bash
# Check worker status
systemctl status attune-worker@*

# Check worker resource usage
ps aux | grep attune-worker
top -p $(pgrep -d',' attune-worker)

# Check recent completions
psql -U attune -d attune -c "
  SELECT COUNT(*), status
  FROM attune.execution
  WHERE updated > NOW() - INTERVAL '5 minutes'
  GROUP BY status;
"
```

**Resolution:** See [Troubleshooting: Growing Queue](#growing-queue)

---

### Issue 2: Queue Not Progressing

**Symptoms:**

- Queue length stays constant
- `last_updated` timestamp doesn't change
- Active executions showing but not completing

**Common Causes:**

1. Workers crashed/hung
2. CompletionListener not running
3. Message queue connection lost
4.
Database connection issue

**Quick Diagnosis:**

```bash
# Check executor process
ps aux | grep attune-executor
journalctl -u attune-executor -n 50 --no-pager

# Check message queue
rabbitmqctl list_queues name messages | grep execution.completed

# Check for stuck executions
psql -U attune -d attune -c "
  SELECT id, action, status, created, updated
  FROM attune.execution
  WHERE status = 'running'
    AND updated < NOW() - INTERVAL '10 minutes'
  ORDER BY created DESC
  LIMIT 10;
"
```

**Resolution:** See [Troubleshooting: Stuck Queue](#stuck-queue)

---

### Issue 3: Queue Full Errors

**Symptoms:**

- API returns `Queue full (max length: 10000)` errors
- New executions rejected
- Users report action failures

**Common Causes:**

1. Sudden traffic spike
2. Worker capacity exhausted
3. `max_queue_length` too low
4. Slow action execution

**Quick Diagnosis:**

```bash
# Check current queue stats
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq

# Check configuration
grep -A5 "queue:" /etc/attune/config.yaml

# Check worker count
systemctl list-units attune-worker@* | grep running
```

**Resolution:** See [Troubleshooting: Queue Full](#queue-full)

---

### Issue 4: FIFO Order Violation

**Symptoms:**

- Executions complete out of order
- Later requests finish before earlier ones
- Workflow dependencies break

**Severity:** CRITICAL - this indicates a bug

**Immediate Action:**

1. Capture executor logs immediately
2. Document the violation with timestamps
3. Restart the executor service
4. File a critical bug report

**Data to Collect:**

```bash
# Capture logs
journalctl -u attune-executor --since "10 minutes ago" > /tmp/executor-fifo-violation.log

# Capture database state (replace AFFECTED_ACTION with the action's ref)
psql -U attune -d attune -c "
  SELECT id, action, status, created, updated
  FROM attune.execution
  WHERE action = 'AFFECTED_ACTION'
    AND created > NOW() - INTERVAL '1 hour'
  ORDER BY created;
" > /tmp/execution-order.txt

# Capture queue stats
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq > /tmp/queue-stats.json
```

---

## Troubleshooting Procedures

### Growing Queue

**Procedure:**

1. **Assess Severity**

   ```bash
   # Get current queue depth
   curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'
   ```

2. **Check Worker Health**

   ```bash
   # Active workers
   systemctl list-units attune-worker@* | grep running | wc -l

   # Worker resource usage
   ps aux | grep attune-worker | awk '{print $3, $4, $11}'

   # Recent worker errors
   journalctl -u 'attune-worker@*' -n 100 --no-pager | grep -i error
   ```

3. **Check Completion Rate**

   ```sql
   SELECT
     COUNT(*) FILTER (WHERE created > NOW() - INTERVAL '5 minutes') AS recent_created,
     COUNT(*) FILTER (WHERE updated > NOW() - INTERVAL '5 minutes'
                        AND status IN ('succeeded', 'failed')) AS recent_completed
   FROM attune.execution
   WHERE action = 'AFFECTED_ACTION';
   ```

4. **Solutions (in order of preference)**:

   a. **Scale Workers** (if the completion rate is too low):

   ```bash
   # Add more worker instances
   sudo systemctl start attune-worker@2
   sudo systemctl start attune-worker@3
   ```

   b. **Increase Concurrency** (if safe):

   ```yaml
   # In config.yaml or via API
   policies:
     actions:
       affected.action:
         concurrency_limit: 10  # Increase from 5
   ```

   c. **Rate Limit at the API** (if traffic spike):

   ```yaml
   # In API config
   rate_limits:
     global:
       max_requests_per_minute: 1000
   ```

   d. **Temporary Queue Increase** (emergency only):

   ```yaml
   executor:
     queue:
       max_queue_length: 20000  # Increase from 10000
   ```

   Then restart the executor: `sudo systemctl restart attune-executor`

5.
**Monitor Results**

   ```bash
   watch -n 5 "curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'"
   ```

---

### Stuck Queue

**Procedure:**

1. **Identify Stuck Executions**

   ```sql
   SELECT id, status, created, updated,
          EXTRACT(EPOCH FROM (NOW() - updated)) / 60 AS stuck_minutes
   FROM attune.execution
   WHERE action = 'AFFECTED_ACTION'
     AND status IN ('running', 'requested')
     AND updated < NOW() - INTERVAL '10 minutes'
   ORDER BY created;
   ```

2. **Check Worker Status**

   ```bash
   # Are workers running?
   systemctl status attune-worker@*

   # Are workers processing?
   tail -f /var/log/attune/worker.log | grep execution_id
   ```

3. **Check the Message Queue**

   ```bash
   # Completion messages backing up?
   rabbitmqctl list_queues name messages | grep execution.completed

   # Connection issues?
   rabbitmqctl list_connections
   ```

4. **Check the CompletionListener**

   ```bash
   # Is the listener running?
   journalctl -u attune-executor -n 100 --no-pager | grep CompletionListener

   # Recent completions processed?
   journalctl -u attune-executor -n 100 --no-pager | grep "notify_completion"
   ```

5. **Solutions**:

   a. **Restart Stuck Workers**:

   ```bash
   # Graceful restart
   sudo systemctl restart attune-worker@1
   ```

   b. **Restart the Executor** (if the CompletionListener is stuck):

   ```bash
   sudo systemctl restart attune-executor
   ```

   c. **Force-Complete Stuck Executions** (emergency):

   ```sql
   -- CAUTION: Only for truly stuck executions
   UPDATE attune.execution
   SET status = 'failed',
       result = '{"error": "Execution stuck, manually failed by operator"}',
       updated = NOW()
   WHERE id IN (/* stuck execution IDs from step 1 */);
   ```

   d. **Purge and Restart** (nuclear option):

   ```bash
   # Stop services
   sudo systemctl stop attune-executor
   sudo systemctl stop attune-worker@*

   # Clear message queues
   rabbitmqctl purge_queue execution.requested
   rabbitmqctl purge_queue execution.completed

   # Restart services
   sudo systemctl start attune-executor
   sudo systemctl start attune-worker@1
   ```

---

### Queue Full

**Procedure:**

1. **Immediate Mitigation** (choose one):

   a. **Temporarily Increase the Limit**:

   ```yaml
   # config.yaml
   executor:
     queue:
       max_queue_length: 20000
   ```

   ```bash
   sudo systemctl restart attune-executor
   ```

   b. **Add Workers**:

   ```bash
   sudo systemctl start attune-worker@{2..5}
   ```

   c. **Increase Concurrency**:

   ```yaml
   policies:
     actions:
       affected.action:
         concurrency_limit: 20  # Increase
   ```

2. **Analyze the Root Cause**

   ```bash
   # Traffic pattern (replace AFFECTED_ACTION with the action's ref)
   psql -U attune -d attune -c "
     SELECT DATE_TRUNC('minute', created) AS minute, COUNT(*)
     FROM attune.execution
     WHERE action = 'AFFECTED_ACTION'
       AND created > NOW() - INTERVAL '1 hour'
     GROUP BY minute
     ORDER BY minute DESC;
   "

   # Action performance
   psql -U attune -d attune -c "
     SELECT AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration_seconds
     FROM attune.execution
     WHERE action = 'AFFECTED_ACTION'
       AND status = 'succeeded'
       AND created > NOW() - INTERVAL '1 hour';
   "
   ```

3. **Long-term Solutions**:

   - **Traffic spike**: Add API rate limiting
   - **Slow action**: Optimize the action code
   - **Under-capacity**: Permanently scale workers
   - **Configuration**: Adjust concurrency limits

---

## Maintenance Tasks

### Daily

```bash
#!/bin/bash
# daily-queue-check.sh

echo "=== Active Queues ==="
psql -U attune -d attune -c "
  SELECT a.ref, qs.queue_length, qs.active_count
  FROM attune.queue_stats qs
  JOIN attune.action a ON a.id = qs.action_id
  WHERE queue_length > 0 OR active_count > 0;
"

echo "=== Stuck Queues ==="
psql -U attune -d attune -c "
  SELECT a.ref, qs.queue_length,
         ROUND((EXTRACT(EPOCH FROM (NOW() - qs.last_updated)) / 60)::numeric, 1) AS stale_minutes
  FROM attune.queue_stats qs
  JOIN attune.action a ON a.id = qs.action_id
  WHERE (queue_length > 0 OR active_count > 0)
    AND last_updated < NOW() - INTERVAL '30 minutes';
"

echo "=== Top Actions by Volume ==="
psql -U attune -d attune -c "
  SELECT a.ref, qs.total_enqueued, qs.total_completed
  FROM attune.queue_stats qs
  JOIN attune.action a ON a.id = qs.action_id
  ORDER BY qs.total_enqueued DESC
  LIMIT 10;
"
```

### Weekly

```bash
#!/bin/bash
# weekly-queue-maintenance.sh

echo "=== Cleaning Stale Queue
Stats ==="
psql -U attune -d attune -c "
  DELETE FROM attune.queue_stats
  WHERE last_updated < NOW() - INTERVAL '7 days'
    AND queue_length = 0 AND active_count = 0;
"

echo "=== Queue Snapshots Cleanup ==="
psql -U attune -d attune -c "
  DELETE FROM attune.queue_snapshots
  WHERE snapshot_time < NOW() - INTERVAL '30 days';
"

echo "=== Executor Log Rotation ==="
journalctl --vacuum-time=30d -u attune-executor
```

### Monthly

- Review queue capacity trends
- Analyze high-volume actions
- Plan scaling based on growth
- Update alert thresholds
- Review and test runbook procedures

---

## Emergency Procedures

### Emergency: System-Wide Queue Overload

**Symptoms:**

- Multiple actions with critical queue depths
- System-wide performance degradation
- Degraded API response times

**Procedure:**

1. **Enable Emergency Mode**:

   ```yaml
   # config.yaml
   executor:
     emergency_mode: true  # Relaxes limits
     queue:
       max_queue_length: 50000
   ```

2. **Scale Workers Aggressively**:

   ```bash
   for i in {1..10}; do
     sudo systemctl start attune-worker@$i
   done
   ```

3. **Temporarily Disable Non-Critical Actions**:

   ```sql
   -- Disable low-priority actions
   UPDATE attune.action
   SET enabled = false
   WHERE priority < 5 OR tags @> '["low-priority"]';
   ```

4. **Enable API Rate Limiting**:

   ```yaml
   api:
     rate_limits:
       global:
         enabled: true
         max_requests_per_minute: 500
   ```

5. **Monitor Recovery**:

   ```bash
   watch -n 10 "psql -U attune -d attune -t -c 'SELECT SUM(queue_length) FROM attune.queue_stats;'"
   ```

6. **Post-Incident**:

   - Document what happened
   - Analyze the root cause
   - Update the capacity plan
   - Restore the normal configuration

---

### Emergency: Executor Crash Loop

**Symptoms:**

- Executor service repeatedly crashes
- Queues not progressing
- High memory usage before each crash

**Procedure:**

1. **Capture Crash Logs**:

   ```bash
   journalctl -u attune-executor --since "30 minutes ago" > /tmp/executor-crash.log
   dmesg | tail -100 > /tmp/dmesg-crash.log
   ```

2. **Check for Memory Issues**:

   ```bash
   # Check for OOM kills
   grep -i "out of memory" /var/log/syslog
   grep -i "killed process" /var/log/kern.log
   ```

3. **Emergency Restart with Limited Queues**:

   ```yaml
   # config.yaml
   executor:
     queue:
       max_queue_length: 1000   # Reduce drastically
       enable_metrics: false    # Reduce overhead
   ```

4. **Start in Safe Mode**:

   ```bash
   sudo systemctl start attune-executor

   # Monitor memory
   watch -n 1 "ps aux | grep attune-executor | grep -v grep"
   ```

5. **If It Is Still Crashing**:

   ```bash
   # Disable queue metrics temporarily via a feature flag. Note: a variable
   # exported in your shell does not reach a systemd-managed service; set it
   # as an Environment= line in a unit override instead:
   sudo systemctl edit attune-executor
   # add: Environment=ATTUNE__EXECUTOR__QUEUE__ENABLE_METRICS=false
   sudo systemctl restart attune-executor
   ```

6. **Escalate**:

   - Contact the development team
   - Provide crash logs and memory dumps
   - Consider rolling back to the previous version

---

## Capacity Planning

### Calculating Required Capacity

**Formula**:

```
Required Workers = (Peak Requests/Hour × Avg Duration in Seconds) / 3600 / Concurrency Limit
```

**Example**:

- Peak: 10,000 requests/hour
- Avg duration: 5 seconds
- Concurrency: 10 per worker

```
Workers = (10,000 × 5) / 3600 / 10 = 1.4 → 2 workers minimum
Add a 50% buffer → 3 workers recommended
```

### Growth Planning

Monitor monthly trends:

```sql
SELECT
  DATE_TRUNC('day', created) AS day,
  COUNT(*) AS executions,
  AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration
FROM attune.execution
WHERE created > NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day;
```

### Capacity Recommendations

| Queue Depth | Worker Count | Action |
|-------------|--------------|--------|
| < 10 | Current | Maintain |
| 10-50 | +25% | Plan scale-up |
| 50-100 | +50% | Scale soon |
| 100+ | +100% | Scale now |

---

## Related Documentation

- [Queue Architecture](./queue-architecture.md)
- [Executor Service](./executor-service.md)
- [Worker Service](./worker-service.md)
- [API: Actions - Queue Stats](./api-actions.md#get-queue-statistics)

---

**Version**: 1.0
**Maintained By**: SRE Team
**Last Updated**: 2025-01-27
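
---

## Appendix: Capacity Formula as a Script

The capacity formula in [Capacity Planning](#capacity-planning) is easy to get wrong under pressure, so a minimal sketch is included here. The function name, parameter names, and the 50% default buffer are illustrative conventions, not part of any Attune tooling.

```python
import math

def required_workers(peak_requests_per_hour: int,
                     avg_duration_seconds: float,
                     concurrency_per_worker: int,
                     buffer: float = 0.5) -> int:
    """Apply the runbook formula:
    Workers = (Peak Requests/Hour x Avg Duration) / 3600 / Concurrency Limit,
    then add the safety buffer and round up to a whole worker."""
    base = (peak_requests_per_hour * avg_duration_seconds) / 3600 / concurrency_per_worker
    return math.ceil(base * (1 + buffer))

# Worked example from the runbook: 10,000 req/hr, 5 s average, concurrency 10
print(required_workers(10_000, 5, 10, buffer=0.0))  # 2 workers minimum
print(required_workers(10_000, 5, 10))              # 3 workers with 50% buffer
```

Plugging in the peak numbers from the Growth Planning query keeps the monthly capacity review consistent with the formula above.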