attune/docs/deployment/ops-runbook-queues.md

Operational Runbook: Queue Management

Service: Attune Executor
Component: Execution Queue Manager
Audience: Operations, SRE, DevOps
Last Updated: 2025-01-27


Table of Contents

  1. Overview
  2. Quick Reference
  3. Monitoring
  4. Common Issues
  5. Troubleshooting Procedures
  6. Maintenance Tasks
  7. Emergency Procedures
  8. Capacity Planning

Overview

The Attune Executor service manages per-action FIFO execution queues to ensure fair, ordered processing when policy limits (concurrency, rate limits) are enforced. This runbook covers operational procedures for monitoring and managing these queues.

Key Concepts

  • Queue: Per-action FIFO buffer of waiting executions
  • Active Count: Number of currently running executions for an action
  • Max Concurrent: Policy-enforced limit on parallel executions
  • Queue Length: Number of executions waiting in queue
  • FIFO: First-In-First-Out ordering guarantee
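The relationship between these counters can be sketched as a toy model (illustrative only; the class and method names `ActionQueue`, `enqueue`, and `complete` are assumptions for this runbook, not the actual ExecutionQueueManager API):

```python
from collections import deque


class ActionQueue:
    """Toy per-action FIFO queue with a concurrency cap (sketch, not real API)."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.active = set()      # currently running execution ids (Active Count)
        self.waiting = deque()   # FIFO buffer of waiting ids (Queue Length)

    def enqueue(self, execution_id):
        # Start immediately if a concurrency slot is free; otherwise wait in FIFO order.
        if len(self.active) < self.max_concurrent:
            self.active.add(execution_id)
            return "running"
        self.waiting.append(execution_id)
        return "queued"

    def complete(self, execution_id):
        # Free the slot and promote the oldest waiting execution, if any.
        self.active.discard(execution_id)
        if self.waiting:
            promoted = self.waiting.popleft()
            self.active.add(promoted)
            return promoted
        return None

    @property
    def queue_length(self):
        return len(self.waiting)
```

With `max_concurrent = 2`, a third `enqueue` waits; completing a running execution promotes the oldest waiting one, which is the FIFO guarantee the real manager enforces.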

System Components

  • ExecutionQueueManager: Core queue management (in-memory)
  • CompletionListener: Processes worker completion messages
  • QueueStatsRepository: Persists statistics to database
  • API Endpoint: /api/v1/actions/:ref/queue-stats

Quick Reference

Health Check Commands

# Check executor service status
systemctl status attune-executor

# Check active queues
curl -s http://localhost:8080/api/v1/actions/core.http.get/queue-stats | jq

# Database query for all active queues
psql -U attune -d attune -c "
  SELECT a.ref, qs.queue_length, qs.active_count, qs.max_concurrent,
         qs.oldest_enqueued_at, qs.last_updated
  FROM attune.queue_stats qs
  JOIN attune.action a ON a.id = qs.action_id
  WHERE queue_length > 0 OR active_count > 0
  ORDER BY queue_length DESC;
"

# Check executor logs for queue issues
journalctl -u attune-executor -n 100 --no-pager | grep -i queue

Emergency Actions

# Restart executor (clears in-memory queues)
sudo systemctl restart attune-executor

# Restart all workers (forces completion messages)
sudo systemctl restart attune-worker@*

# Clear stale queue stats (older than 1 hour, inactive)
psql -U attune -d attune -c "
  DELETE FROM attune.queue_stats
  WHERE last_updated < NOW() - INTERVAL '1 hour'
    AND queue_length = 0
    AND active_count = 0;
"

Monitoring

Key Metrics to Track

Metric                                   Threshold   Action
Queue Length                             > 100       Investigate load
Queue Length                             > 500       Add workers
Queue Length                             > 1000      Emergency response
Oldest Enqueued                          > 10 min    Check workers
Oldest Enqueued                          > 30 min    Critical issue
Active < Max Concurrent (queue > 0)      Any         Workers stuck
Last Updated                             > 10 min    Executor issue
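The thresholds above can be encoded in a small triage helper (a sketch; the function name and argument names are assumptions, not part of Attune):

```python
def queue_actions(queue_length, active_count, max_concurrent,
                  oldest_wait_min, stale_min):
    """Return the recommended actions from the threshold table (sketch)."""
    actions = []
    # Queue depth bands: take the most severe matching action only.
    if queue_length > 1000:
        actions.append("Emergency response")
    elif queue_length > 500:
        actions.append("Add workers")
    elif queue_length > 100:
        actions.append("Investigate load")
    # Age of the oldest waiting execution.
    if oldest_wait_min > 30:
        actions.append("Critical issue")
    elif oldest_wait_min > 10:
        actions.append("Check workers")
    # Queue non-empty but concurrency slots unused: workers likely stuck.
    if queue_length > 0 and active_count < max_concurrent:
        actions.append("Workers stuck")
    # Stats not updating: executor-side problem.
    if stale_min > 10:
        actions.append("Executor issue")
    return actions
```

For example, a queue of 600 with a 12-minute-old head returns `["Add workers", "Check workers"]`.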

Monitoring Queries

Active Queues Overview

SELECT 
  a.ref AS action,
  qs.queue_length,
  qs.active_count,
  qs.max_concurrent,
  ROUND(EXTRACT(EPOCH FROM (NOW() - qs.oldest_enqueued_at))::numeric / 60, 1) AS wait_minutes,
  ROUND(qs.total_completed::numeric / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct,
  qs.last_updated
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE queue_length > 0 OR active_count > 0
ORDER BY queue_length DESC;

Top Actions by Throughput

SELECT 
  a.ref AS action,
  qs.total_enqueued,
  qs.total_completed,
  qs.total_enqueued - qs.total_completed AS pending,
  ROUND(qs.total_completed::numeric / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE qs.total_enqueued > 0
ORDER BY qs.total_enqueued DESC
LIMIT 20;

Stuck Queues (Not Progressing)

SELECT 
  a.ref AS action,
  qs.queue_length,
  qs.active_count,
  ROUND(EXTRACT(EPOCH FROM (NOW() - qs.last_updated))::numeric / 60, 1) AS stale_minutes,
  qs.oldest_enqueued_at
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE (queue_length > 0 OR active_count > 0)
  AND last_updated < NOW() - INTERVAL '10 minutes'
ORDER BY stale_minutes DESC;

Queue Growth Rate

-- Create a monitoring table for snapshots
CREATE TABLE IF NOT EXISTS attune.queue_snapshots (
  snapshot_time TIMESTAMPTZ DEFAULT NOW(),
  action_id BIGINT,
  queue_length INT,
  active_count INT,
  total_enqueued BIGINT
);

-- Take snapshot (run every 5 minutes)
INSERT INTO attune.queue_snapshots (action_id, queue_length, active_count, total_enqueued)
SELECT action_id, queue_length, active_count, total_enqueued
FROM attune.queue_stats
WHERE queue_length > 0 OR active_count > 0;

-- Analyze growth rate
SELECT 
  a.ref AS action,
  s1.queue_length AS queue_now,
  s2.queue_length AS queue_5min_ago,
  s1.queue_length - s2.queue_length AS growth,
  s1.total_enqueued - s2.total_enqueued AS new_requests
FROM attune.queue_snapshots s1
JOIN attune.queue_snapshots s2 ON s2.action_id = s1.action_id
JOIN attune.action a ON a.id = s1.action_id
WHERE s1.snapshot_time >= NOW() - INTERVAL '1 minute'
  AND s2.snapshot_time >= NOW() - INTERVAL '6 minutes'
  AND s2.snapshot_time < NOW() - INTERVAL '4 minutes'
ORDER BY growth DESC;

Alerting Rules

Prometheus/Grafana Alerts (if metrics exported):

groups:
  - name: attune_queues
    interval: 30s
    rules:
      - alert: HighQueueDepth
        expr: attune_queue_length > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue depth high for {{ $labels.action }}"
          description: "Queue has {{ $value }} waiting executions"

      - alert: CriticalQueueDepth
        expr: attune_queue_length > 500
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical queue depth for {{ $labels.action }}"
          description: "Queue has {{ $value }} waiting executions - add workers"

      - alert: StuckQueue
        expr: attune_queue_last_updated < time() - 600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Queue not progressing for {{ $labels.action }}"
          description: "Queue hasn't updated in 10+ minutes"

      - alert: OldestExecutionAging
        expr: attune_queue_oldest_age_seconds > 1800
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Execution waiting 30+ minutes for {{ $labels.action }}"

Nagios/Icinga Check:

#!/bin/bash
# /usr/lib/nagios/plugins/check_attune_queues.sh

WARN_THRESHOLD=${1:-100}
CRIT_THRESHOLD=${2:-500}

MAX_QUEUE=$(psql -U attune -d attune -Atc "
  SELECT COALESCE(MAX(queue_length), 0) FROM attune.queue_stats;
")

if [ "$MAX_QUEUE" -ge "$CRIT_THRESHOLD" ]; then
  echo "CRITICAL: Max queue depth $MAX_QUEUE >= $CRIT_THRESHOLD"
  exit 2
elif [ "$MAX_QUEUE" -ge "$WARN_THRESHOLD" ]; then
  echo "WARNING: Max queue depth $MAX_QUEUE >= $WARN_THRESHOLD"
  exit 1
else
  echo "OK: Max queue depth $MAX_QUEUE"
  exit 0
fi

Common Issues

Issue 1: Queue Growing Continuously

Symptoms:

  • Queue length increases over time
  • Never decreases even when workers are idle
  • oldest_enqueued_at gets older

Common Causes:

  1. Workers not processing fast enough
  2. Too many incoming requests
  3. Concurrency limit too low
  4. Worker crashes/restarts

Quick Diagnosis:

# Check worker status
systemctl status attune-worker@*

# Check worker resource usage
ps aux | grep attune-worker
top -p $(pgrep -d',' attune-worker)

# Check recent completions
psql -U attune -d attune -c "
  SELECT COUNT(*), status
  FROM attune.execution
  WHERE updated > NOW() - INTERVAL '5 minutes'
  GROUP BY status;
"

Resolution: See Troubleshooting: Growing Queue


Issue 2: Queue Not Progressing

Symptoms:

  • Queue length stays constant
  • last_updated timestamp doesn't change
  • Active executions showing but not completing

Common Causes:

  1. Workers crashed/hung
  2. CompletionListener not running
  3. Message queue connection lost
  4. Database connection issue

Quick Diagnosis:

# Check executor process
ps aux | grep attune-executor
journalctl -u attune-executor -n 50 --no-pager

# Check message queue
rabbitmqctl list_queues name messages | grep execution.completed

# Check for stuck executions
psql -U attune -d attune -c "
  SELECT id, action, status, created, updated
  FROM attune.execution
  WHERE status = 'running'
    AND updated < NOW() - INTERVAL '10 minutes'
  ORDER BY created DESC
  LIMIT 10;
"

Resolution: See Troubleshooting: Stuck Queue


Issue 3: Queue Full Errors

Symptoms:

  • API returns Queue full (max length: 10000) errors
  • New executions rejected
  • Users report action failures

Common Causes:

  1. Sudden traffic spike
  2. Worker capacity exhausted
  3. max_queue_length too low
  4. Slow action execution

Quick Diagnosis:

# Check current queue stats
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq

# Check configuration
grep -A5 "queue:" /etc/attune/config.yaml

# Check worker count
systemctl list-units attune-worker@* | grep running

Resolution: See Troubleshooting: Queue Full


Issue 4: FIFO Order Violation

Symptoms:

  • Executions complete out of order
  • Later requests finish before earlier ones
  • Workflow dependencies break

Severity: CRITICAL - This indicates a bug

Immediate Action:

  1. Capture executor logs immediately
  2. Document the violation with timestamps
  3. Restart executor service
  4. File critical bug report

Data to Collect:

# Capture logs
journalctl -u attune-executor --since "10 minutes ago" > /tmp/executor-fifo-violation.log

# Capture database state
psql -U attune -d attune -c "
  SELECT id, action, status, created, updated
  FROM attune.execution
  WHERE action = <affected_action_id>
    AND created > NOW() - INTERVAL '1 hour'
  ORDER BY created;
" > /tmp/execution-order.txt

# Capture queue stats
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq > /tmp/queue-stats.json
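When analyzing the collected execution order, note that FIFO guarantees executions start in enqueue order; with max concurrent greater than 1, completion order may legitimately differ. A minimal checker, assuming rows are exported as `(id, enqueued_at, started_at)` tuples (these field names are illustrative, not actual schema columns):

```python
def fifo_start_violations(rows):
    """Return (earlier_id, later_id) pairs where a later-enqueued execution
    started before an earlier one. rows: iterable of (id, enqueued_at,
    started_at) with comparable timestamps. O(n^2) pairwise scan; fine for
    the one-hour export above."""
    by_enqueue = sorted(rows, key=lambda r: r[1])
    violations = []
    for i, earlier in enumerate(by_enqueue):
        for later in by_enqueue[i + 1:]:
            if later[2] < earlier[2]:
                violations.append((earlier[0], later[0]))
    return violations
```

Any non-empty result supports filing the bug report; include the offending id pairs and the captured logs.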

Troubleshooting Procedures

Growing Queue

Procedure:

  1. Assess Severity

    # Get current queue depth
    curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'
    
  2. Check Worker Health

    # Active workers
    systemctl list-units attune-worker@* | grep running | wc -l
    
    # Worker resource usage
    ps aux | grep attune-worker | awk '{print $3, $4, $11}'
    
    # Recent worker errors
    journalctl -u attune-worker@* -n 100 --no-pager | grep -i error
    
  3. Check Completion Rate

    SELECT 
      COUNT(*) FILTER (WHERE created > NOW() - INTERVAL '5 minutes') AS recent_created,
      COUNT(*) FILTER (WHERE updated > NOW() - INTERVAL '5 minutes' AND status IN ('succeeded', 'failed')) AS recent_completed
    FROM attune.execution
    WHERE action = <action_id>;
    
  4. Solutions (in order of preference):

    a. Scale Workers (if completion rate too low):

    # Add more worker instances
    sudo systemctl start attune-worker@2
    sudo systemctl start attune-worker@3
    

    b. Increase Concurrency (if safe):

    # In config.yaml or via API
    policies:
      actions:
        affected.action:
          concurrency_limit: 10  # Increase from 5
    

    c. Rate Limit at API (if traffic spike):

    # In API config
    rate_limits:
      global:
        max_requests_per_minute: 1000
    

    d. Temporary Queue Increase (emergency only):

    executor:
      queue:
        max_queue_length: 20000  # Increase from 10000
    

    Then restart executor: sudo systemctl restart attune-executor

  5. Monitor Results

    watch -n 5 "curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'"
    

Stuck Queue

Procedure:

  1. Identify Stuck Executions

    SELECT id, status, created, updated, 
           EXTRACT(EPOCH FROM (NOW() - updated)) / 60 AS stuck_minutes
    FROM attune.execution
    WHERE action = <action_id>
      AND status IN ('running', 'requested')
      AND updated < NOW() - INTERVAL '10 minutes'
    ORDER BY created;
    
  2. Check Worker Status

    # Are workers running?
    systemctl status attune-worker@*
    
    # Are workers processing?
    tail -f /var/log/attune/worker.log | grep execution_id
    
  3. Check Message Queue

    # Completion messages backing up?
    rabbitmqctl list_queues name messages | grep execution.completed
    
    # Connection issues?
    rabbitmqctl list_connections
    
  4. Check CompletionListener

    # Is listener running?
    journalctl -u attune-executor -n 100 --no-pager | grep CompletionListener
    
    # Recent completions processed?
    journalctl -u attune-executor -n 100 --no-pager | grep "notify_completion"
    
  5. Solutions:

    a. Restart Stuck Workers:

    # Graceful restart
    sudo systemctl restart attune-worker@1
    

    b. Restart Executor (if CompletionListener stuck):

    sudo systemctl restart attune-executor
    

    c. Force Complete Stuck Executions (emergency):

    -- CAUTION: Only for truly stuck executions
    UPDATE attune.execution
    SET status = 'failed', 
        result = '{"error": "Execution stuck, manually failed by operator"}',
        updated = NOW()
    WHERE id IN (<stuck_execution_ids>);
    

    d. Purge and Restart (nuclear option):

    # Stop services
    sudo systemctl stop attune-executor
    sudo systemctl stop attune-worker@*
    
    # Clear message queues
    rabbitmqctl purge_queue execution.requested
    rabbitmqctl purge_queue execution.completed
    
    # Restart services
    sudo systemctl start attune-executor
    sudo systemctl start attune-worker@1
    

Queue Full

Procedure:

  1. Immediate Mitigation (choose one):

    a. Temporarily Increase Limit:

    # config.yaml
    executor:
      queue:
        max_queue_length: 20000
    
    sudo systemctl restart attune-executor
    

    b. Add Workers:

    sudo systemctl start attune-worker@{2..5}
    

    c. Increase Concurrency:

    policies:
      actions:
        affected.action:
          concurrency_limit: 20  # Increase
    
  2. Analyze Root Cause

    # Traffic pattern
    psql -U attune -d attune -c "
      SELECT DATE_TRUNC('minute', created) AS minute, COUNT(*)
      FROM attune.execution
      WHERE action = <action_id>
        AND created > NOW() - INTERVAL '1 hour'
      GROUP BY minute
      ORDER BY minute DESC;
    "
    
    # Action performance
    psql -U attune -d attune -c "
      SELECT AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration_seconds
      FROM attune.execution
      WHERE action = <action_id>
        AND status = 'succeeded'
        AND created > NOW() - INTERVAL '1 hour';
    "
    
  3. Long-term Solution:

    • Traffic spike: Add API rate limiting
    • Slow action: Optimize action code
    • Under-capacity: Permanently scale workers
    • Configuration: Adjust concurrency limits

Maintenance Tasks

Daily

#!/bin/bash
# daily-queue-check.sh

echo "=== Active Queues ==="
psql -U attune -d attune -c "
  SELECT a.ref, qs.queue_length, qs.active_count
  FROM attune.queue_stats qs
  JOIN attune.action a ON a.id = qs.action_id
  WHERE queue_length > 0 OR active_count > 0;
"

echo "=== Stuck Queues ==="
psql -U attune -d attune -c "
  SELECT a.ref, qs.queue_length, 
         ROUND(EXTRACT(EPOCH FROM (NOW() - qs.last_updated))::numeric / 60, 1) AS stale_minutes
  FROM attune.queue_stats qs
  JOIN attune.action a ON a.id = qs.action_id
  WHERE (queue_length > 0 OR active_count > 0)
    AND last_updated < NOW() - INTERVAL '30 minutes';
"

echo "=== Top Actions by Volume ==="
psql -U attune -d attune -c "
  SELECT a.ref, qs.total_enqueued, qs.total_completed
  FROM attune.queue_stats qs
  JOIN attune.action a ON a.id = qs.action_id
  ORDER BY qs.total_enqueued DESC
  LIMIT 10;
"

Weekly

#!/bin/bash
# weekly-queue-maintenance.sh

echo "=== Cleaning Stale Queue Stats ==="
psql -U attune -d attune -c "
  DELETE FROM attune.queue_stats
  WHERE last_updated < NOW() - INTERVAL '7 days'
    AND queue_length = 0
    AND active_count = 0;
"

echo "=== Queue Snapshots Cleanup ==="
psql -U attune -d attune -c "
  DELETE FROM attune.queue_snapshots
  WHERE snapshot_time < NOW() - INTERVAL '30 days';
"

echo "=== Executor Log Rotation ==="
journalctl --vacuum-time=30d -u attune-executor

Monthly

  • Review queue capacity trends
  • Analyze high-volume actions
  • Plan scaling based on growth
  • Update alert thresholds
  • Review and test runbook procedures

Emergency Procedures

Emergency: System-Wide Queue Overload

Symptoms:

  • Multiple actions with critical queue depths
  • System-wide performance degradation
  • API response times degraded

Procedure:

  1. Enable Emergency Mode:

    # config.yaml
    executor:
      emergency_mode: true  # Relaxes limits
      queue:
        max_queue_length: 50000
    
  2. Scale Workers Aggressively:

    for i in {1..10}; do
      sudo systemctl start attune-worker@$i
    done
    
  3. Temporarily Disable Non-Critical Actions:

    -- Disable low-priority actions
    UPDATE attune.action
    SET enabled = false
    WHERE priority < 5 OR tags @> '["low-priority"]';
    
  4. Enable API Rate Limiting:

    api:
      rate_limits:
        global:
          enabled: true
          max_requests_per_minute: 500
    
  5. Monitor Recovery:

    watch -n 10 "psql -U attune -d attune -t -c 'SELECT SUM(queue_length) FROM attune.queue_stats;'"
    
  6. Post-Incident:

    • Document what happened
    • Analyze root cause
    • Update capacity plan
    • Restore normal configuration

Emergency: Executor Crash Loop

Symptoms:

  • Executor service repeatedly crashes
  • Queues not progressing
  • High memory usage before crash

Procedure:

  1. Capture Crash Logs:

    journalctl -u attune-executor --since "30 minutes ago" > /tmp/executor-crash.log
    dmesg | tail -100 > /tmp/dmesg-crash.log
    
  2. Check for Memory Issues:

    # Check OOM kills
    grep -i "out of memory" /var/log/syslog
    grep -i "killed process" /var/log/kern.log
    
  3. Emergency Restart with Limited Queues:

    # config.yaml
    executor:
      queue:
        max_queue_length: 1000  # Reduce drastically
        enable_metrics: false    # Reduce overhead
    
  4. Start in Safe Mode:

    sudo systemctl start attune-executor
    # Monitor memory
    watch -n 1 "ps aux | grep attune-executor | grep -v grep"
    
  5. If Still Crashing:

    # Disable queue persistence temporarily
    # In code or via feature flag
    export ATTUNE__EXECUTOR__QUEUE__ENABLE_METRICS=false
    sudo systemctl restart attune-executor
    
  6. Escalate:

    • Contact development team
    • Provide crash logs and memory dumps
    • Consider rolling back to previous version

Capacity Planning

Calculating Required Capacity

Formula:

Required Workers = (Peak Requests/Hour × Avg Duration in Seconds) / 3600 / Concurrency per Worker

Example:

  • Peak: 10,000 requests/hour
  • Avg Duration: 5 seconds
  • Concurrency: 10 per worker
Workers = (10,000 × 5) / 3600 / 10 = 1.4 → 2 workers minimum
Add 50% buffer → 3 workers recommended
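The same arithmetic as a reusable helper (the function name and the 50% default buffer are assumptions matching the worked example):

```python
import math


def required_workers(peak_requests_per_hour, avg_duration_s,
                     concurrency_per_worker, buffer=0.5):
    """Return (minimum, recommended) worker counts per the formula above."""
    # Average concurrent executions at peak, divided by per-worker capacity.
    raw = (peak_requests_per_hour * avg_duration_s) / 3600 / concurrency_per_worker
    minimum = math.ceil(raw)
    recommended = math.ceil(raw * (1 + buffer))
    return minimum, recommended
```

`required_workers(10_000, 5, 10)` reproduces the example: 2 workers minimum, 3 recommended.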

Growth Planning

Monitor monthly trends:

SELECT 
  DATE_TRUNC('day', created) AS day,
  COUNT(*) AS executions,
  AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration
FROM attune.execution
WHERE created > NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day;

Capacity Recommendations

Queue Depth   Worker Count   Action
< 10          Current        Maintain
10-50         +25%           Plan scale-up
50-100        +50%           Scale soon
100+          +100%          Scale now
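The table can be applied mechanically (a sketch; rounding up and the inclusive band edges are assumptions):

```python
import math


def scale_recommendation(queue_depth, current_workers):
    """Map sustained queue depth to a worker-count target per the table above."""
    if queue_depth < 10:
        factor, action = 1.0, "Maintain"
    elif queue_depth <= 50:
        factor, action = 1.25, "Plan scale-up"
    elif queue_depth <= 100:
        factor, action = 1.5, "Scale soon"
    else:
        factor, action = 2.0, "Scale now"
    return math.ceil(current_workers * factor), action
```

For instance, a sustained depth of 75 with 4 workers suggests scaling to 6 soon. Use sustained depth (not a momentary spike) as the input.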


Version: 1.0
Maintained By: SRE Team
Last Updated: 2025-01-27