re-uploading work
851
docs/deployment/ops-runbook-queues.md
Normal file
@@ -0,0 +1,851 @@
# Operational Runbook: Queue Management

**Service**: Attune Executor
**Component**: Execution Queue Manager
**Audience**: Operations, SRE, DevOps
**Last Updated**: 2025-01-27

---

## Table of Contents

1. [Overview](#overview)
2. [Quick Reference](#quick-reference)
3. [Monitoring](#monitoring)
4. [Common Issues](#common-issues)
5. [Troubleshooting Procedures](#troubleshooting-procedures)
6. [Maintenance Tasks](#maintenance-tasks)
7. [Emergency Procedures](#emergency-procedures)
8. [Capacity Planning](#capacity-planning)

---

## Overview

The Attune Executor service manages per-action FIFO execution queues to ensure fair, ordered processing when policy limits (concurrency, rate limits) are enforced. This runbook covers operational procedures for monitoring and managing these queues.

### Key Concepts

- **Queue**: Per-action FIFO buffer of waiting executions
- **Active Count**: Number of currently running executions for an action
- **Max Concurrent**: Policy-enforced limit on parallel executions
- **Queue Length**: Number of executions waiting in queue
- **FIFO**: First-In-First-Out ordering guarantee

### System Components

- **ExecutionQueueManager**: Core queue management (in-memory)
- **CompletionListener**: Processes worker completion messages
- **QueueStatsRepository**: Persists statistics to database
- **API Endpoint**: `/api/v1/actions/:ref/queue-stats`

---
|
||||
|
||||
## Quick Reference
|
||||
|
||||
### Health Check Commands
|
||||
|
||||
```bash
|
||||
# Check executor service status
|
||||
systemctl status attune-executor
|
||||
|
||||
# Check active queues
|
||||
curl -s http://localhost:8080/api/v1/actions/core.http.get/queue-stats | jq
|
||||
|
||||
# Database query for all active queues
|
||||
psql -U attune -d attune -c "
|
||||
SELECT a.ref, qs.queue_length, qs.active_count, qs.max_concurrent,
|
||||
qs.oldest_enqueued_at, qs.last_updated
|
||||
FROM attune.queue_stats qs
|
||||
JOIN attune.action a ON a.id = qs.action_id
|
||||
WHERE queue_length > 0 OR active_count > 0
|
||||
ORDER BY queue_length DESC;
|
||||
"
|
||||
|
||||
# Check executor logs for queue issues
|
||||
journalctl -u attune-executor -n 100 --no-pager | grep -i queue
|
||||
```
|
||||
|
||||
### Emergency Actions
|
||||
|
||||
```bash
|
||||
# Restart executor (clears in-memory queues)
|
||||
sudo systemctl restart attune-executor
|
||||
|
||||
# Restart all workers (forces completion messages)
|
||||
sudo systemctl restart attune-worker@*
|
||||
|
||||
# Clear stale queue stats (older than 1 hour, inactive)
|
||||
psql -U attune -d attune -c "
|
||||
DELETE FROM attune.queue_stats
|
||||
WHERE last_updated < NOW() - INTERVAL '1 hour'
|
||||
AND queue_length = 0
|
||||
AND active_count = 0;
|
||||
"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Key Metrics to Track
|
||||
|
||||
| Metric | Threshold | Action |
|--------|-----------|--------|
| Queue Length | > 100 | Investigate load |
| Queue Length | > 500 | Add workers |
| Queue Length | > 1000 | Emergency response |
| Oldest Enqueued | > 10 min | Check workers |
| Oldest Enqueued | > 30 min | Critical issue |
| Active < Max Concurrent (while Queue Length > 0) | Any | Workers stuck |
| Last Updated | > 10 min | Executor issue |
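
The thresholds in the table can be applied mechanically in a check script. The following is a minimal sketch, not part of Attune: the function names and the returned labels are hypothetical, and both functions assume whole-number inputs (queue length as a count, age in minutes).

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers that encode the thresholds from the table.

# Map a queue length to the response level from the table.
classify_queue_length() {
    local length="$1"
    if [ "$length" -gt 1000 ]; then
        echo "emergency"
    elif [ "$length" -gt 500 ]; then
        echo "add-workers"
    elif [ "$length" -gt 100 ]; then
        echo "investigate"
    else
        echo "ok"
    fi
}

# Map the age of the oldest queued execution (whole minutes) to a response level.
classify_oldest_age_minutes() {
    local minutes="$1"
    if [ "$minutes" -gt 30 ]; then
        echo "critical"
    elif [ "$minutes" -gt 10 ]; then
        echo "check-workers"
    else
        echo "ok"
    fi
}
```

For example, `classify_queue_length 750` prints `add-workers`. The comparisons are strict, so a queue of exactly 100 is still reported as `ok`, matching the `>` thresholds in the table.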
|
||||
|
||||
### Monitoring Queries
|
||||
|
||||
#### Active Queues Overview
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
a.ref AS action,
|
||||
qs.queue_length,
|
||||
qs.active_count,
|
||||
qs.max_concurrent,
|
||||
ROUND(EXTRACT(EPOCH FROM (NOW() - qs.oldest_enqueued_at)) / 60, 1) AS wait_minutes,
|
||||
ROUND(qs.total_completed::numeric / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct,
|
||||
qs.last_updated
|
||||
FROM attune.queue_stats qs
|
||||
JOIN attune.action a ON a.id = qs.action_id
|
||||
WHERE queue_length > 0 OR active_count > 0
|
||||
ORDER BY queue_length DESC;
|
||||
```
|
||||
|
||||
#### Top Actions by Throughput
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
a.ref AS action,
|
||||
qs.total_enqueued,
|
||||
qs.total_completed,
|
||||
qs.total_enqueued - qs.total_completed AS pending,
|
||||
ROUND(qs.total_completed::numeric / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct
|
||||
FROM attune.queue_stats qs
|
||||
JOIN attune.action a ON a.id = qs.action_id
|
||||
WHERE qs.total_enqueued > 0
|
||||
ORDER BY qs.total_enqueued DESC
|
||||
LIMIT 20;
|
||||
```
|
||||
|
||||
#### Stuck Queues (Not Progressing)
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
a.ref AS action,
|
||||
qs.queue_length,
|
||||
qs.active_count,
|
||||
ROUND(EXTRACT(EPOCH FROM (NOW() - qs.last_updated)) / 60, 1) AS stale_minutes,
|
||||
qs.oldest_enqueued_at
|
||||
FROM attune.queue_stats qs
|
||||
JOIN attune.action a ON a.id = qs.action_id
|
||||
WHERE (queue_length > 0 OR active_count > 0)
|
||||
AND last_updated < NOW() - INTERVAL '10 minutes'
|
||||
ORDER BY stale_minutes DESC;
|
||||
```
|
||||
|
||||
#### Queue Growth Rate
|
||||
|
||||
```sql
|
||||
-- Create a monitoring table for snapshots
|
||||
CREATE TABLE IF NOT EXISTS attune.queue_snapshots (
|
||||
snapshot_time TIMESTAMPTZ DEFAULT NOW(),
|
||||
action_id BIGINT,
|
||||
queue_length INT,
|
||||
active_count INT,
|
||||
total_enqueued BIGINT
|
||||
);
|
||||
|
||||
-- Take snapshot (run every 5 minutes)
|
||||
INSERT INTO attune.queue_snapshots (action_id, queue_length, active_count, total_enqueued)
|
||||
SELECT action_id, queue_length, active_count, total_enqueued
|
||||
FROM attune.queue_stats
|
||||
WHERE queue_length > 0 OR active_count > 0;
|
||||
|
||||
-- Analyze growth rate
|
||||
SELECT
|
||||
a.ref AS action,
|
||||
s1.queue_length AS queue_now,
|
||||
s2.queue_length AS queue_5min_ago,
|
||||
s1.queue_length - s2.queue_length AS growth,
|
||||
s1.total_enqueued - s2.total_enqueued AS new_requests
|
||||
FROM attune.queue_snapshots s1
|
||||
JOIN attune.queue_snapshots s2 ON s2.action_id = s1.action_id
|
||||
JOIN attune.action a ON a.id = s1.action_id
|
||||
WHERE s1.snapshot_time >= NOW() - INTERVAL '1 minute'
|
||||
AND s2.snapshot_time >= NOW() - INTERVAL '6 minutes'
|
||||
AND s2.snapshot_time < NOW() - INTERVAL '4 minutes'
|
||||
ORDER BY growth DESC;
|
||||
```
|
||||
|
||||
### Alerting Rules
|
||||
|
||||
**Prometheus/Grafana Alerts** (if metrics exported):
|
||||
|
||||
```yaml
|
||||
groups:
|
||||
- name: attune_queues
|
||||
interval: 30s
|
||||
rules:
|
||||
- alert: HighQueueDepth
|
||||
expr: attune_queue_length > 100
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Queue depth high for {{ $labels.action }}"
|
||||
description: "Queue has {{ $value }} waiting executions"
|
||||
|
||||
- alert: CriticalQueueDepth
|
||||
expr: attune_queue_length > 500
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Critical queue depth for {{ $labels.action }}"
|
||||
description: "Queue has {{ $value }} waiting executions - add workers"
|
||||
|
||||
- alert: StuckQueue
|
||||
expr: attune_queue_last_updated < time() - 600
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Queue not progressing for {{ $labels.action }}"
|
||||
description: "Queue hasn't updated in 10+ minutes"
|
||||
|
||||
- alert: OldestExecutionAging
|
||||
expr: attune_queue_oldest_age_seconds > 1800
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Execution waiting 30+ minutes for {{ $labels.action }}"
|
||||
```
|
||||
|
||||
**Nagios/Icinga Check**:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# /usr/lib/nagios/plugins/check_attune_queues.sh
|
||||
|
||||
WARN_THRESHOLD=${1:-100}
|
||||
CRIT_THRESHOLD=${2:-500}
|
||||
|
||||
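MAX_QUEUE=$(psql -U attune -d attune -At -c "
SELECT COALESCE(MAX(queue_length), 0) FROM attune.queue_stats;
")

# A failed or non-numeric query result must not fall through to OK
if ! [[ "$MAX_QUEUE" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN: could not read queue depth from database"
    exit 3
fi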
|
||||
|
||||
if [ "$MAX_QUEUE" -ge "$CRIT_THRESHOLD" ]; then
|
||||
echo "CRITICAL: Max queue depth $MAX_QUEUE >= $CRIT_THRESHOLD"
|
||||
exit 2
|
||||
elif [ "$MAX_QUEUE" -ge "$WARN_THRESHOLD" ]; then
|
||||
echo "WARNING: Max queue depth $MAX_QUEUE >= $WARN_THRESHOLD"
|
||||
exit 1
|
||||
else
|
||||
echo "OK: Max queue depth $MAX_QUEUE"
|
||||
exit 0
|
||||
fi
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Issues
|
||||
|
||||
### Issue 1: Queue Growing Continuously
|
||||
|
||||
**Symptoms:**
|
||||
- Queue length increases over time
|
||||
- Never decreases even when workers are idle
|
||||
- `oldest_enqueued_at` gets older
|
||||
|
||||
**Common Causes:**
|
||||
1. Workers not processing fast enough
|
||||
2. Too many incoming requests
|
||||
3. Concurrency limit too low
|
||||
4. Worker crashes/restarts
|
||||
|
||||
**Quick Diagnosis:**
|
||||
```bash
|
||||
# Check worker status
|
||||
systemctl status attune-worker@*
|
||||
|
||||
# Check worker resource usage
|
||||
ps aux | grep attune-worker
|
||||
top -p $(pgrep -d',' attune-worker)
|
||||
|
||||
# Check recent completions
|
||||
psql -U attune -d attune -c "
|
||||
SELECT COUNT(*), status
|
||||
FROM attune.execution
|
||||
WHERE updated > NOW() - INTERVAL '5 minutes'
|
||||
GROUP BY status;
|
||||
"
|
||||
```
|
||||
|
||||
**Resolution:** See [Troubleshooting: Growing Queue](#growing-queue)
|
||||
|
||||
---
|
||||
|
||||
### Issue 2: Queue Not Progressing
|
||||
|
||||
**Symptoms:**
|
||||
- Queue length stays constant
|
||||
- `last_updated` timestamp doesn't change
|
||||
- Active executions showing but not completing
|
||||
|
||||
**Common Causes:**
|
||||
1. Workers crashed/hung
|
||||
2. CompletionListener not running
|
||||
3. Message queue connection lost
|
||||
4. Database connection issue
|
||||
|
||||
**Quick Diagnosis:**
|
||||
```bash
|
||||
# Check executor process
|
||||
ps aux | grep attune-executor
|
||||
journalctl -u attune-executor -n 50 --no-pager
|
||||
|
||||
# Check message queue
|
||||
rabbitmqctl list_queues name messages | grep execution.completed
|
||||
|
||||
# Check for stuck executions
|
||||
psql -U attune -d attune -c "
|
||||
SELECT id, action, status, created, updated
|
||||
FROM attune.execution
|
||||
WHERE status = 'running'
|
||||
AND updated < NOW() - INTERVAL '10 minutes'
|
||||
ORDER BY created DESC
|
||||
LIMIT 10;
|
||||
"
|
||||
```
|
||||
|
||||
**Resolution:** See [Troubleshooting: Stuck Queue](#stuck-queue)
|
||||
|
||||
---
|
||||
|
||||
### Issue 3: Queue Full Errors
|
||||
|
||||
**Symptoms:**
|
||||
- API returns `Queue full (max length: 10000)` errors
|
||||
- New executions rejected
|
||||
- Users report action failures
|
||||
|
||||
**Common Causes:**
|
||||
1. Sudden traffic spike
|
||||
2. Worker capacity exhausted
|
||||
3. `max_queue_length` too low
|
||||
4. Slow action execution
|
||||
|
||||
**Quick Diagnosis:**
|
||||
```bash
|
||||
# Check current queue stats
|
||||
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq
|
||||
|
||||
# Check configuration
|
||||
grep -A5 "queue:" /etc/attune/config.yaml
|
||||
|
||||
# Check worker count
|
||||
systemctl list-units attune-worker@* | grep running
|
||||
```
|
||||
|
||||
**Resolution:** See [Troubleshooting: Queue Full](#queue-full)
|
||||
|
||||
---
|
||||
|
||||
### Issue 4: FIFO Order Violation
|
||||
|
||||
**Symptoms:**
|
||||
- Executions complete out of order
|
||||
- Later requests finish before earlier ones
|
||||
- Workflow dependencies break
|
||||
|
||||
**Severity:** CRITICAL - This indicates a bug
|
||||
|
||||
**Immediate Action:**
|
||||
1. Capture executor logs immediately
|
||||
2. Document the violation with timestamps
|
||||
3. Restart executor service
|
||||
4. File critical bug report
|
||||
|
||||
**Data to Collect:**
|
||||
```bash
|
||||
# Capture logs
|
||||
journalctl -u attune-executor --since "10 minutes ago" > /tmp/executor-fifo-violation.log
|
||||
|
||||
# Capture database state
|
||||
psql -U attune -d attune -c "
|
||||
SELECT id, action, status, created, updated
|
||||
FROM attune.execution
|
||||
WHERE action = <affected_action_id>
|
||||
AND created > NOW() - INTERVAL '1 hour'
|
||||
ORDER BY created;
|
||||
" > /tmp/execution-order.txt
|
||||
|
||||
# Capture queue stats
|
||||
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq > /tmp/queue-stats.json
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting Procedures
|
||||
|
||||
### Growing Queue
|
||||
|
||||
**Procedure:**
|
||||
|
||||
1. **Assess Severity**
|
||||
```bash
|
||||
# Get current queue depth
|
||||
curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'
|
||||
```
|
||||
|
||||
2. **Check Worker Health**
|
||||
```bash
|
||||
# Active workers
|
||||
systemctl list-units attune-worker@* | grep running | wc -l
|
||||
|
||||
# Worker resource usage
|
||||
ps aux | grep attune-worker | awk '{print $3, $4, $11}'
|
||||
|
||||
# Recent worker errors
|
||||
journalctl -u attune-worker@* -n 100 --no-pager | grep -i error
|
||||
```
|
||||
|
||||
3. **Check Completion Rate**
|
||||
```sql
|
||||
SELECT
|
||||
COUNT(*) FILTER (WHERE created > NOW() - INTERVAL '5 minutes') AS recent_created,
|
||||
COUNT(*) FILTER (WHERE updated > NOW() - INTERVAL '5 minutes' AND status IN ('succeeded', 'failed')) AS recent_completed
|
||||
FROM attune.execution
|
||||
WHERE action = <action_id>;
|
||||
```
|
||||
|
||||
4. **Solutions (in order of preference)**:
|
||||
|
||||
a. **Scale Workers** (if completion rate too low):
|
||||
```bash
|
||||
# Add more worker instances
|
||||
sudo systemctl start attune-worker@2
|
||||
sudo systemctl start attune-worker@3
|
||||
```
|
||||
|
||||
b. **Increase Concurrency** (if safe):
|
||||
```yaml
|
||||
# In config.yaml or via API
|
||||
policies:
|
||||
actions:
|
||||
affected.action:
|
||||
concurrency_limit: 10 # Increase from 5
|
||||
```
|
||||
|
||||
c. **Rate Limit at API** (if traffic spike):
|
||||
```yaml
|
||||
# In API config
|
||||
rate_limits:
|
||||
global:
|
||||
max_requests_per_minute: 1000
|
||||
```
|
||||
|
||||
d. **Temporary Queue Increase** (emergency only):
|
||||
```yaml
|
||||
executor:
|
||||
queue:
|
||||
max_queue_length: 20000 # Increase from 10000
|
||||
```
|
||||
Then restart executor: `sudo systemctl restart attune-executor`
|
||||
|
||||
5. **Monitor Results**
|
||||
```bash
|
||||
watch -n 5 "curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Stuck Queue
|
||||
|
||||
**Procedure:**
|
||||
|
||||
1. **Identify Stuck Executions**
|
||||
```sql
|
||||
SELECT id, status, created, updated,
|
||||
EXTRACT(EPOCH FROM (NOW() - updated)) / 60 AS stuck_minutes
|
||||
FROM attune.execution
|
||||
WHERE action = <action_id>
|
||||
AND status IN ('running', 'requested')
|
||||
AND updated < NOW() - INTERVAL '10 minutes'
|
||||
ORDER BY created;
|
||||
```
|
||||
|
||||
2. **Check Worker Status**
|
||||
```bash
|
||||
# Are workers running?
|
||||
systemctl status attune-worker@*
|
||||
|
||||
# Are workers processing?
|
||||
tail -f /var/log/attune/worker.log | grep execution_id
|
||||
```
|
||||
|
||||
3. **Check Message Queue**
|
||||
```bash
|
||||
# Completion messages backing up?
|
||||
rabbitmqctl list_queues name messages | grep execution.completed
|
||||
|
||||
# Connection issues?
|
||||
rabbitmqctl list_connections
|
||||
```
|
||||
|
||||
4. **Check CompletionListener**
|
||||
```bash
|
||||
# Is listener running?
|
||||
journalctl -u attune-executor -n 100 --no-pager | grep CompletionListener
|
||||
|
||||
# Recent completions processed?
|
||||
journalctl -u attune-executor -n 100 --no-pager | grep "notify_completion"
|
||||
```
|
||||
|
||||
5. **Solutions**:
|
||||
|
||||
a. **Restart Stuck Workers**:
|
||||
```bash
|
||||
# Graceful restart
|
||||
sudo systemctl restart attune-worker@1
|
||||
```
|
||||
|
||||
b. **Restart Executor** (if CompletionListener stuck):
|
||||
```bash
|
||||
sudo systemctl restart attune-executor
|
||||
```
|
||||
|
||||
c. **Force Complete Stuck Executions** (emergency):
|
||||
```sql
|
||||
-- CAUTION: Only for truly stuck executions
|
||||
UPDATE attune.execution
|
||||
SET status = 'failed',
|
||||
result = '{"error": "Execution stuck, manually failed by operator"}',
|
||||
updated = NOW()
|
||||
WHERE id IN (<stuck_execution_ids>);
|
||||
```
|
||||
|
||||
d. **Purge and Restart** (nuclear option):
|
||||
```bash
|
||||
# Stop services
|
||||
sudo systemctl stop attune-executor
|
||||
sudo systemctl stop attune-worker@*
|
||||
|
||||
# Clear message queues
|
||||
rabbitmqctl purge_queue execution.requested
|
||||
rabbitmqctl purge_queue execution.completed
|
||||
|
||||
# Restart services
|
||||
sudo systemctl start attune-executor
|
||||
sudo systemctl start attune-worker@1
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Queue Full
|
||||
|
||||
**Procedure:**
|
||||
|
||||
1. **Immediate Mitigation** (choose one):
|
||||
|
||||
a. **Temporarily Increase Limit**:
|
||||
```yaml
|
||||
# config.yaml
|
||||
executor:
|
||||
queue:
|
||||
max_queue_length: 20000
|
||||
```
|
||||
```bash
|
||||
sudo systemctl restart attune-executor
|
||||
```
|
||||
|
||||
b. **Add Workers**:
|
||||
```bash
|
||||
sudo systemctl start attune-worker@{2..5}
|
||||
```
|
||||
|
||||
c. **Increase Concurrency**:
|
||||
```yaml
|
||||
policies:
|
||||
actions:
|
||||
affected.action:
|
||||
concurrency_limit: 20 # Increase
|
||||
```
|
||||
|
||||
2. **Analyze Root Cause**
|
||||
```bash
|
||||
# Traffic pattern
|
||||
psql -U attune -d attune -c "
|
||||
SELECT DATE_TRUNC('minute', created) AS minute, COUNT(*)
|
||||
FROM attune.execution
|
||||
WHERE action = <action_id>
|
||||
AND created > NOW() - INTERVAL '1 hour'
|
||||
GROUP BY minute
|
||||
ORDER BY minute DESC;
|
||||
"
|
||||
|
||||
# Action performance
|
||||
psql -U attune -d attune -c "
|
||||
SELECT AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration_seconds
|
||||
FROM attune.execution
|
||||
WHERE action = <action_id>
|
||||
AND status = 'succeeded'
|
||||
AND created > NOW() - INTERVAL '1 hour';
|
||||
"
|
||||
```
|
||||
|
||||
3. **Long-term Solution**:
|
||||
|
||||
- **Traffic spike**: Add API rate limiting
|
||||
- **Slow action**: Optimize action code
|
||||
- **Under-capacity**: Permanently scale workers
|
||||
- **Configuration**: Adjust concurrency limits
|
||||
|
||||
---
|
||||
|
||||
## Maintenance Tasks
|
||||
|
||||
### Daily
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# daily-queue-check.sh
|
||||
|
||||
echo "=== Active Queues ==="
|
||||
psql -U attune -d attune -c "
|
||||
SELECT a.ref, qs.queue_length, qs.active_count
|
||||
FROM attune.queue_stats qs
|
||||
JOIN attune.action a ON a.id = qs.action_id
|
||||
WHERE queue_length > 0 OR active_count > 0;
|
||||
"
|
||||
|
||||
echo "=== Stuck Queues ==="
|
||||
psql -U attune -d attune -c "
|
||||
SELECT a.ref, qs.queue_length,
|
||||
ROUND(EXTRACT(EPOCH FROM (NOW() - qs.last_updated)) / 60, 1) AS stale_minutes
|
||||
FROM attune.queue_stats qs
|
||||
JOIN attune.action a ON a.id = qs.action_id
|
||||
WHERE (queue_length > 0 OR active_count > 0)
|
||||
AND last_updated < NOW() - INTERVAL '30 minutes';
|
||||
"
|
||||
|
||||
echo "=== Top Actions by Volume ==="
|
||||
psql -U attune -d attune -c "
|
||||
SELECT a.ref, qs.total_enqueued, qs.total_completed
|
||||
FROM attune.queue_stats qs
|
||||
JOIN attune.action a ON a.id = qs.action_id
|
||||
ORDER BY qs.total_enqueued DESC
|
||||
LIMIT 10;
|
||||
"
|
||||
```
|
||||
|
||||
### Weekly
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# weekly-queue-maintenance.sh
|
||||
|
||||
echo "=== Cleaning Stale Queue Stats ==="
|
||||
psql -U attune -d attune -c "
|
||||
DELETE FROM attune.queue_stats
|
||||
WHERE last_updated < NOW() - INTERVAL '7 days'
|
||||
AND queue_length = 0
|
||||
AND active_count = 0;
|
||||
"
|
||||
|
||||
echo "=== Queue Snapshots Cleanup ==="
|
||||
psql -U attune -d attune -c "
|
||||
DELETE FROM attune.queue_snapshots
|
||||
WHERE snapshot_time < NOW() - INTERVAL '30 days';
|
||||
"
|
||||
|
||||
echo "=== Executor Log Rotation ==="
|
||||
journalctl --vacuum-time=30d -u attune-executor
|
||||
```
|
||||
|
||||
### Monthly
|
||||
|
||||
- Review queue capacity trends
|
||||
- Analyze high-volume actions
|
||||
- Plan scaling based on growth
|
||||
- Update alert thresholds
|
||||
- Review and test runbook procedures
|
||||
|
||||
---
|
||||
|
||||
## Emergency Procedures
|
||||
|
||||
### Emergency: System-Wide Queue Overload
|
||||
|
||||
**Symptoms:**
|
||||
- Multiple actions with critical queue depths
|
||||
- System-wide performance degradation
|
||||
- API response times degraded
|
||||
|
||||
**Procedure:**
|
||||
|
||||
1. **Enable Emergency Mode**:
|
||||
```yaml
|
||||
# config.yaml
|
||||
executor:
|
||||
emergency_mode: true # Relaxes limits
|
||||
queue:
|
||||
max_queue_length: 50000
|
||||
```
|
||||
|
||||
2. **Scale Workers Aggressively**:
|
||||
```bash
|
||||
for i in {1..10}; do
|
||||
sudo systemctl start attune-worker@$i
|
||||
done
|
||||
```
|
||||
|
||||
3. **Temporarily Disable Non-Critical Actions**:
|
||||
```sql
|
||||
-- Disable low-priority actions
|
||||
UPDATE attune.action
|
||||
SET enabled = false
|
||||
WHERE priority < 5 OR tags @> '["low-priority"]';
|
||||
```
|
||||
|
||||
4. **Enable API Rate Limiting**:
|
||||
```yaml
|
||||
api:
|
||||
rate_limits:
|
||||
global:
|
||||
enabled: true
|
||||
max_requests_per_minute: 500
|
||||
```
|
||||
|
||||
5. **Monitor Recovery**:
|
||||
```bash
|
||||
watch -n 10 "psql -U attune -d attune -t -c 'SELECT SUM(queue_length) FROM attune.queue_stats;'"
|
||||
```
|
||||
|
||||
6. **Post-Incident**:
|
||||
- Document what happened
|
||||
- Analyze root cause
|
||||
- Update capacity plan
|
||||
- Restore normal configuration
|
||||
|
||||
---
|
||||
|
||||
### Emergency: Executor Crash Loop
|
||||
|
||||
**Symptoms:**
|
||||
- Executor service repeatedly crashes
|
||||
- Queues not progressing
|
||||
- High memory usage before crash
|
||||
|
||||
**Procedure:**
|
||||
|
||||
1. **Capture Crash Logs**:
|
||||
```bash
|
||||
journalctl -u attune-executor --since "30 minutes ago" > /tmp/executor-crash.log
|
||||
dmesg | tail -100 > /tmp/dmesg-crash.log
|
||||
```
|
||||
|
||||
2. **Check for Memory Issues**:
|
||||
```bash
|
||||
# Check OOM kills
|
||||
grep -i "out of memory" /var/log/syslog
|
||||
grep -i "killed process" /var/log/kern.log
|
||||
```
|
||||
|
||||
3. **Emergency Restart with Limited Queues**:
|
||||
```yaml
|
||||
# config.yaml
|
||||
executor:
|
||||
queue:
|
||||
max_queue_length: 1000 # Reduce drastically
|
||||
enable_metrics: false # Reduce overhead
|
||||
```
|
||||
|
||||
4. **Start in Safe Mode**:
|
||||
```bash
|
||||
sudo systemctl start attune-executor
|
||||
# Monitor memory
|
||||
watch -n 1 "ps aux | grep attune-executor | grep -v grep"
|
||||
```
|
||||
|
||||
5. **If Still Crashing**:
|
||||
```bash
|
||||
# Disable queue persistence temporarily
|
||||
# In code or via feature flag
|
||||
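# A variable exported in this shell is not inherited by the systemd unit;
# set it in the systemd manager environment instead
sudo systemctl set-environment ATTUNE__EXECUTOR__QUEUE__ENABLE_METRICS=false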
|
||||
sudo systemctl restart attune-executor
|
||||
```
|
||||
|
||||
6. **Escalate**:
|
||||
- Contact development team
|
||||
- Provide crash logs and memory dumps
|
||||
- Consider rolling back to previous version
|
||||
|
||||
---
|
||||
|
||||
## Capacity Planning
|
||||
|
||||
### Calculating Required Capacity
|
||||
|
||||
**Formula**:
|
||||
```
|
||||
Required Workers = (Peak Requests/Hour × Avg Duration) / 3600 / Concurrency Limit
|
||||
```
|
||||
|
||||
**Example**:
|
||||
- Peak: 10,000 requests/hour
|
||||
- Avg Duration: 5 seconds
|
||||
- Concurrency: 10 per worker
|
||||
|
||||
```
|
||||
Workers = (10,000 × 5) / 3600 / 10 = 1.4 → 2 workers minimum
|
||||
Add 50% buffer → 3 workers recommended
|
||||
```
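
The formula can be scripted so it is easy to rerun with fresh numbers. This is a minimal sketch under stated assumptions: `required_workers` and `required_workers_buffered` are hypothetical helper names, and the buffer is applied to the fractional result before rounding up, which is one reasonable reading of the example above.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of Attune.

# Usage: required_workers PEAK_REQUESTS_PER_HOUR AVG_DURATION_SECONDS CONCURRENCY_PER_WORKER
required_workers() {
    awk -v peak="$1" -v dur="$2" -v conc="$3" 'BEGIN {
        raw = (peak * dur) / 3600 / conc   # fractional workers needed at peak
        n = int(raw); if (n < raw) n++     # round up to whole workers
        if (n < 1) n = 1                   # never recommend zero workers
        print n
    }'
}

# Usage: required_workers_buffered PEAK AVG_DURATION CONCURRENCY BUFFER_PERCENT
required_workers_buffered() {
    awk -v peak="$1" -v dur="$2" -v conc="$3" -v buf="$4" 'BEGIN {
        raw = (peak * dur) / 3600 / conc * (1 + buf / 100)   # add headroom first
        n = int(raw); if (n < raw) n++                       # then round up
        if (n < 1) n = 1
        print n
    }'
}
```

With the example figures, `required_workers 10000 5 10` prints `2` and `required_workers_buffered 10000 5 10 50` prints `3`, matching the worked example.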
|
||||
|
||||
### Growth Planning
|
||||
|
||||
Monitor monthly trends:
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
DATE_TRUNC('day', created) AS day,
|
||||
COUNT(*) AS executions,
|
||||
AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration
|
||||
FROM attune.execution
|
||||
WHERE created > NOW() - INTERVAL '30 days'
|
||||
GROUP BY day
|
||||
ORDER BY day;
|
||||
```
|
||||
|
||||
### Capacity Recommendations
|
||||
|
||||
| Queue Depth | Worker Count | Action |
|-------------|--------------|--------|
| < 10 | Current | Maintain |
| 10-50 | +25% | Plan scale-up |
| 50-100 | +50% | Scale soon |
| 100+ | +100% | Scale now |
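
The recommendations above can be turned into a lookup for use in reports or dashboards. The sketch below is illustrative only: `recommended_scale_up_pct` is a hypothetical name, and because the table's ranges share their boundary values (50 and 100), it assumes a boundary value belongs to the higher tier.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper, not part of Attune.
# Prints the suggested worker increase (percent) for a sustained queue depth.
# Assumption: boundary values (10, 50, 100) fall into the higher tier.
recommended_scale_up_pct() {
    local depth="$1"
    if [ "$depth" -ge 100 ]; then
        echo 100
    elif [ "$depth" -ge 50 ]; then
        echo 50
    elif [ "$depth" -ge 10 ]; then
        echo 25
    else
        echo 0
    fi
}
```

For example, `recommended_scale_up_pct 75` prints `50`, corresponding to the "Scale soon" row.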
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Queue Architecture](./queue-architecture.md)
|
||||
- [Executor Service](./executor-service.md)
|
||||
- [Worker Service](./worker-service.md)
|
||||
- [API: Actions - Queue Stats](./api-actions.md#get-queue-statistics)
|
||||
|
||||
---
|
||||
|
||||
**Version**: 1.0
|
||||
**Maintained By**: SRE Team
|
||||
**Last Updated**: 2025-01-27
|
||||
447
docs/deployment/production-deployment.md
Normal file
@@ -0,0 +1,447 @@
|
||||
# Production Deployment Guide
|
||||
|
||||
This document provides guidelines and checklists for deploying Attune to production environments.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Pre-Deployment Checklist](#pre-deployment-checklist)
|
||||
- [Database Configuration](#database-configuration)
|
||||
- [Environment Variables](#environment-variables)
|
||||
- [Schema Verification](#schema-verification)
|
||||
- [Security Best Practices](#security-best-practices)
|
||||
- [Deployment Steps](#deployment-steps)
|
||||
- [Post-Deployment Validation](#post-deployment-validation)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
|
||||
## Pre-Deployment Checklist
|
||||
|
||||
Before deploying Attune to production, verify the following:
|
||||
|
||||
- [ ] PostgreSQL 14+ database is provisioned and accessible
|
||||
- [ ] RabbitMQ 3.12+ message queue is configured
|
||||
- [ ] All required environment variables are set (see below)
|
||||
- [ ] Database migrations have been tested in staging
|
||||
- [ ] SSL/TLS certificates are configured for HTTPS
|
||||
- [ ] Log aggregation and monitoring are configured
|
||||
- [ ] Backup and disaster recovery procedures are in place
|
||||
- [ ] Security audit has been completed
|
||||
- [ ] Load balancing and high availability are configured (if applicable)
|
||||
|
||||
---
|
||||
|
||||
## Database Configuration
|
||||
|
||||
### Critical: Schema Configuration
|
||||
|
||||
**Production MUST use the `attune` schema.**
|
||||
|
||||
The schema configuration is set in `config.production.yaml`:
|
||||
|
||||
```yaml
|
||||
database:
|
||||
schema: "attune" # REQUIRED: Do not remove or change
|
||||
```
|
||||
|
||||
### Why This Matters
|
||||
|
||||
- **Test Isolation**: Tests use dynamic schemas (e.g., `test_uuid`) for isolation
|
||||
- **Production Consistency**: All production services must use the same schema
|
||||
- **Migration Safety**: Migrations expect the `attune` schema in production
|
||||
|
||||
### Verification
|
||||
|
||||
You can verify the schema configuration in several ways:
|
||||
|
||||
1. **Check Configuration File**: Ensure `config.production.yaml` has `schema: "attune"`
|
||||
|
||||
2. **Check Environment Variable** (if overriding):
|
||||
```bash
|
||||
echo $ATTUNE__DATABASE__SCHEMA
|
||||
# Should output: attune
|
||||
```
|
||||
|
||||
3. **Check Application Logs** on startup:
|
||||
```
|
||||
INFO Using production schema: attune
|
||||
```
|
||||
|
||||
4. **Query Database**:
|
||||
```sql
|
||||
SELECT current_schema();
|
||||
-- Should return: attune
|
||||
```
|
||||
|
||||
### ⚠️ WARNING
|
||||
|
||||
If the schema is **not** set to `attune` in production, you will see this warning in logs:
|
||||
|
||||
```
|
||||
WARN Using non-standard schema: 'test_xyz'. Production should use 'attune'
|
||||
```
|
||||
|
||||
**If you see this warning in production, STOP and fix the configuration immediately.**
|
||||
|
||||
---
|
||||
|
||||
## Environment Variables
|
||||
|
||||
### Required Variables
|
||||
|
||||
These environment variables **MUST** be set before deploying:
|
||||
|
||||
```bash
|
||||
# Database connection (required)
|
||||
export DATABASE_URL="postgresql://username:password@host:port/database"
|
||||
|
||||
# JWT secret for authentication (required, 64+ characters)
|
||||
# Generate with: openssl rand -base64 64
|
||||
export JWT_SECRET="your-secure-jwt-secret-here"
|
||||
|
||||
# Encryption key for secrets storage (required, 32+ characters)
|
||||
# Generate with: openssl rand -base64 32
|
||||
export ENCRYPTION_KEY="your-secure-encryption-key-here"
|
||||
```
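
A short preflight check can catch missing or undersized secrets before the services start. This is a minimal sketch, assuming only the minimum lengths stated in the comments above; `check_min_length` is a hypothetical helper, not an Attune command.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical preflight helper, not part of Attune.
# Usage: check_min_length NAME VALUE MIN_CHARS
check_min_length() {
    local name="$1" value="$2" min="$3"
    if [ -z "$value" ]; then
        echo "MISSING: $name is not set" >&2
        return 1
    fi
    if [ "${#value}" -lt "$min" ]; then
        echo "TOO SHORT: $name has ${#value} characters, needs at least $min" >&2
        return 1
    fi
    echo "OK: $name"
}

# Example preflight using the minimum lengths stated above:
# check_min_length DATABASE_URL   "${DATABASE_URL:-}"   1
# check_min_length JWT_SECRET     "${JWT_SECRET:-}"     64
# check_min_length ENCRYPTION_KEY "${ENCRYPTION_KEY:-}" 32
```

The function never prints the secret itself, only its name and length, so its output is safe to keep in deployment logs.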
|
||||
|
||||
### Optional Variables
|
||||
|
||||
```bash
|
||||
# Redis (for caching)
|
||||
export REDIS_URL="redis://host:6379"
|
||||
|
||||
# RabbitMQ (for message queue)
|
||||
export RABBITMQ_URL="amqp://user:pass@host:5672/%2f"
|
||||
|
||||
# CORS origins (comma-separated)
|
||||
export ATTUNE__SERVER__CORS_ORIGINS="https://app.example.com,https://www.example.com"
|
||||
|
||||
# Log level override
|
||||
export ATTUNE__LOG__LEVEL="info"
|
||||
|
||||
# Server port override
|
||||
export ATTUNE__SERVER__PORT="8080"
|
||||
|
||||
# Schema override (should always be 'attune' in production)
|
||||
export ATTUNE__DATABASE__SCHEMA="attune"
|
||||
```
|
||||
|
||||
### Environment Variable Format
|
||||
|
||||
Attune uses hierarchical configuration with the prefix `ATTUNE__` and separator `__`:
|
||||
|
||||
- `ATTUNE__DATABASE__URL` → `database.url`
|
||||
- `ATTUNE__SERVER__PORT` → `server.port`
|
||||
- `ATTUNE__LOG__LEVEL` → `log.level`
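
The naming rule can be illustrated with a small conversion function. This is only a sketch of the mapping described above, not Attune's actual configuration loader; `env_to_config_key` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper that mirrors the documented naming rule.
env_to_config_key() {
    local name="$1"
    case "$name" in
        ATTUNE__*) ;;
        *) echo "not an ATTUNE__ variable: $name" >&2; return 1 ;;
    esac
    # Drop the prefix, turn each "__" separator into ".", then lowercase.
    printf '%s\n' "${name#ATTUNE__}" | sed 's/__/./g' | tr '[:upper:]' '[:lower:]'
}
```

For example, `env_to_config_key ATTUNE__SERVER__CORS_ORIGINS` prints `server.cors_origins`: single underscores inside a key name are preserved, and only the double-underscore separator becomes a dot.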
|
||||
|
||||
---
|
||||
|
||||
## Schema Verification
|
||||
|
||||
### Automatic Verification
|
||||
|
||||
Attune includes built-in schema validation:
|
||||
|
||||
1. **Schema Name Validation**: Only alphanumeric characters and underscores are allowed (max 63 characters)
|
||||
2. **SQL Injection Prevention**: Schema names are validated before use
|
||||
3. **Logging**: Production schema usage is logged prominently at startup
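
The same naming rule can be checked in a deployment script before the service starts. The sketch below mirrors the rule as documented above; it is an assumption about the rule's shape, Attune's built-in validator may differ in detail, and `is_valid_schema_name` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the documented rule; Attune's real validator may differ.
is_valid_schema_name() {
    local name="$1"
    # Letters, digits, and underscores only; between 1 and 63 characters.
    [[ "$name" =~ ^[A-Za-z0-9_]{1,63}$ ]]
}

# Example:
# is_valid_schema_name "${ATTUNE__DATABASE__SCHEMA:-attune}" || echo "invalid schema name" >&2
```

Rejecting everything outside this character set means a schema name can never carry quotes, semicolons, or whitespace into a SQL statement.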
|
||||
|
||||
### Manual Verification Script
|
||||
|
||||
Run this verification before deployment:
|
||||
|
||||
```bash
|
||||
# Verify configuration loads correctly
|
||||
cargo run --release --bin attune-api -- --config config.production.yaml --dry-run
|
||||
|
||||
# Check logs for schema confirmation
|
||||
cargo run --release --bin attune-api 2>&1 | grep -i schema
|
||||
```
|
||||
|
||||
Expected output:
|
||||
```
|
||||
INFO Using production schema: attune
|
||||
INFO Connecting to database with max_connections=20, schema=attune
|
||||
```
|
||||
|
||||
### Database Schema Check
|
||||
|
||||
After deployment, verify the schema in the database:
|
||||
|
||||
```bash
|
||||
# Connect to your production database
|
||||
psql $DATABASE_URL
|
||||
|
||||
# Verify schema exists
|
||||
\dn attune
|
||||
|
||||
# Verify search_path includes attune
|
||||
SHOW search_path;
|
||||
|
||||
# Verify tables are in attune schema
|
||||
SELECT schemaname, tablename
|
||||
FROM pg_tables
|
||||
WHERE schemaname = 'attune'
|
||||
ORDER BY tablename;
|
||||
```
|
||||
|
||||
You should see all 17 Attune tables:
|
||||
- `action`
|
||||
- `enforcement`
|
||||
- `event`
|
||||
- `execution`
|
||||
- `execution_log`
|
||||
- `identity`
|
||||
- `inquiry`
|
||||
- `inquiry_response`
|
||||
- `key`
|
||||
- `pack`
|
||||
- `rule`
|
||||
- `rule_enforcement`
|
||||
- `sensor`
|
||||
- `sensor_instance`
|
||||
- `trigger`
|
||||
- `trigger_instance`
|
||||
- `workflow_definition`
|
||||
|
||||
---
|
||||
|
||||
## Security Best Practices
|
||||
|
||||
### Secrets Management
|
||||
|
||||
1. **Never commit secrets to version control**
|
||||
2. **Use environment variables or secret management systems** (e.g., AWS Secrets Manager, HashiCorp Vault)
|
||||
3. **Rotate secrets regularly** (JWT secret, encryption key, database passwords)
|
||||
4. **Use strong, randomly generated secrets** (use provided generation commands)
|
||||
|
||||
### Database Security
|
||||
|
||||
1. **Use dedicated database user** with minimal required permissions
|
||||
2. **Enable SSL/TLS** for database connections
|
||||
3. **Use connection pooling** (configured via `max_connections`)
|
||||
4. **Restrict network access** to database (firewall rules, VPC, etc.)
|
||||
5. **Enable audit logging** for sensitive operations
|
||||
|
||||
### Application Security
|
||||
|
||||
1. **Run as non-root user** in containers/VMs
|
||||
2. **Enable HTTPS** for all API endpoints (use reverse proxy like nginx)
|
||||
3. **Configure CORS properly** (only allow trusted origins)
|
||||
4. **Set up rate limiting** and DDoS protection
|
||||
5. **Enable security headers** (CSP, HSTS, X-Frame-Options, etc.)
|
||||
6. **Keep dependencies updated** (run `cargo audit` regularly)
|
||||
|
||||
---
|
||||
|
||||
## Deployment Steps
|
||||
|
||||
### 1. Prepare Database
|
||||
|
||||
```bash
|
||||
# Create production database (if not exists)
|
||||
createdb -h your-db-host -U your-db-user attune_prod
|
||||
|
||||
# Run migrations
|
||||
export DATABASE_URL="postgresql://user:pass@host:port/attune_prod"
|
||||
export ATTUNE__DATABASE__SCHEMA="attune"
|
||||
sqlx migrate run --source ./migrations
|
||||
```
|
||||
|
||||
### 2. Build Application
|
||||
|
||||
```bash
|
||||
# Build release binary
|
||||
cargo build --release --bin attune-api
|
||||
|
||||
# Or build Docker image
|
||||
docker build -t attune-api:latest -f docker/api.Dockerfile .
|
||||
```
|
||||
|
||||
### 3. Configure Environment
|
||||
|
||||
```bash
|
||||
# Set all required environment variables
|
||||
export DATABASE_URL="postgresql://..."
|
||||
export JWT_SECRET="$(openssl rand -base64 64)"
|
||||
export ENCRYPTION_KEY="$(openssl rand -base64 32)"
|
||||
export ATTUNE__DATABASE__SCHEMA="attune"
|
||||
# ... etc
|
||||
```
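
A preflight check can confirm that nothing was missed before the service starts. This sketch covers only the four variables exported above; the function name `missing_env_vars` is hypothetical, and the list should be extended to match the rest of your configuration.

```bash
# Variable names taken from the export commands above.
REQUIRED_VARS="DATABASE_URL JWT_SECRET ENCRYPTION_KEY ATTUNE__DATABASE__SCHEMA"

# Hypothetical helper: prints the name of every required variable that
# is unset or empty, and returns non-zero if any are missing.
missing_env_vars() {
  missing=0
  for var in $REQUIRED_VARS; do
    # Indirect lookup of the variable named by $var (names are fixed above).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "$var"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   missing_env_vars || { echo "set the variables listed above" >&2; exit 1; }
```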
|
||||
|
||||
### 4. Deploy Services
|
||||
|
||||
```bash
|
||||
# Start API service
|
||||
./target/release/attune-api --config config.production.yaml
|
||||
|
||||
# Or with Docker
|
||||
docker run -d \
|
||||
--name attune-api \
|
||||
-p 8080:8080 \
|
||||
-e DATABASE_URL="$DATABASE_URL" \
|
||||
-e JWT_SECRET="$JWT_SECRET" \
|
||||
-e ENCRYPTION_KEY="$ENCRYPTION_KEY" \
|
||||
-v "$(pwd)/config.production.yaml:/app/config.production.yaml" \
|
||||
attune-api:latest
|
||||
```
|
||||
|
||||
### 5. Load Core Pack
|
||||
|
||||
```bash
|
||||
# Load the core pack (provides essential actions and sensors)
|
||||
./scripts/load-core-pack.sh
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Post-Deployment Validation
|
||||
|
||||
### Health Check
|
||||
|
||||
```bash
|
||||
# Check API health endpoint
|
||||
curl http://your-api-host:8080/health
|
||||
|
||||
# Expected response:
|
||||
# {"status":"ok","timestamp":"2024-01-15T12:00:00Z"}
|
||||
```
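
Right after a deployment the service may need a few seconds to come up, so a single request can fail spuriously. The sketch below assumes the response shape shown above; the function name `health_ok` and the retry budget are hypothetical.

```bash
# Hypothetical helper: succeeds if the JSON body on stdin reports
# "status":"ok", as in the expected response above.
health_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage: poll until healthy (host, port, and retry budget are placeholders).
#   for attempt in 1 2 3 4 5 6 7 8 9 10; do
#     curl -fsS http://your-api-host:8080/health | health_ok && break
#     sleep 3
#   done
```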
|
||||
|
||||
### Schema Validation
|
||||
|
||||
```bash
|
||||
# Check application logs for schema confirmation
|
||||
docker logs attune-api 2>&1 | grep -i schema
|
||||
|
||||
# Expected output:
|
||||
# INFO Using production schema: attune
|
||||
```
|
||||
|
||||
### Functional Tests
|
||||
|
||||
```bash
|
||||
# Test authentication
|
||||
curl -X POST http://your-api-host:8080/auth/login \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"username":"admin","password":"your-password"}'
|
||||
|
||||
# Test pack listing
|
||||
curl http://your-api-host:8080/api/v1/packs \
|
||||
-H "Authorization: Bearer YOUR_TOKEN"
|
||||
|
||||
# Test action execution
|
||||
curl -X POST http://your-api-host:8080/api/v1/executions \
|
||||
-H "Authorization: Bearer YOUR_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"action_ref":"core.echo","parameters":{"message":"Hello"}}'
|
||||
```
|
||||
|
||||
### Monitoring
|
||||
|
||||
Set up monitoring for:
|
||||
|
||||
- **Application health**: `/health` endpoint availability
|
||||
- **Database connections**: Pool size and connection errors
|
||||
- **Error rates**: 4xx and 5xx HTTP responses
|
||||
- **Response times**: P50, P95, P99 latencies
|
||||
- **Resource usage**: CPU, memory, disk, network
|
||||
- **Schema usage**: Verify `attune` schema in logs
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: Wrong Schema in Production
|
||||
|
||||
**Symptoms:**
|
||||
- Log shows: `WARN Using non-standard schema: 'something_else'`
|
||||
- Database queries fail or return no data
|
||||
|
||||
**Solution:**
|
||||
1. Check `config.production.yaml` has `schema: "attune"`
|
||||
2. Check for environment variable override: `echo $ATTUNE__DATABASE__SCHEMA`
|
||||
3. Restart the application after fixing configuration
|
||||
4. Verify logs show: `INFO Using production schema: attune`
|
||||
|
||||
### Issue: Schema Not Found
|
||||
|
||||
**Symptoms:**
|
||||
- Application startup fails with "schema does not exist"
|
||||
- Database queries fail with "schema not found"
|
||||
|
||||
**Solution:**
|
||||
1. Verify schema exists: `psql $DATABASE_URL -c "\dn attune"`
|
||||
2. If missing, run migrations: `sqlx migrate run --source ./migrations`
|
||||
3. Check that the schema creation statement in the first migration file is not commented out
|
||||
|
||||
### Issue: Connection Pool Exhausted
|
||||
|
||||
**Symptoms:**
|
||||
- Timeout errors
|
||||
- "connection pool exhausted" errors
|
||||
- Slow response times
|
||||
|
||||
**Solution:**
|
||||
1. Increase `max_connections` in config
|
||||
2. Check for connection leaks in application logs
|
||||
3. Verify database can handle the connection load
|
||||
4. Consider scaling horizontally (multiple instances)
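
Steps 1, 3, and 4 interact: every instance opens its own pool, so the total is instances times `max_connections`, and that total has to fit within the database's own limit (visible with `SHOW max_connections;` in PostgreSQL). A rough budget check, with a hypothetical function name and example numbers:

```bash
# Hypothetical helper: succeeds if the combined pool size of all
# instances fits within the database's connection limit, leaving some
# connections reserved for admin and maintenance sessions.
# args: instances  max_connections_per_instance  db_max_connections  reserved
pool_budget_ok() {
  [ $(( $1 * $2 )) -le $(( $3 - $4 )) ]
}

# Example: 4 instances x 20 connections against a 100-connection limit
# with 10 reserved: 80 <= 90, so the budget holds.
pool_budget_ok 4 20 100 10 && echo "pool budget OK"
```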
|
||||
|
||||
### Issue: Authentication Fails
|
||||
|
||||
**Symptoms:**
|
||||
- All requests return 401 Unauthorized
|
||||
- Token validation errors in logs
|
||||
|
||||
**Solution:**
|
||||
1. Verify `JWT_SECRET` is set correctly
|
||||
2. Check token expiration times in config
|
||||
3. Ensure clocks are synchronized (NTP)
|
||||
4. Verify `enable_auth: true` in config
|
||||
|
||||
### Issue: Migrations Fail
|
||||
|
||||
**Symptoms:**
|
||||
- `sqlx migrate run` errors
|
||||
- "relation already exists" or "schema already exists"
|
||||
|
||||
**Solution:**
|
||||
1. Check `_sqlx_migrations` table: `SELECT * FROM attune._sqlx_migrations;`
|
||||
2. Verify migrations are in correct order
|
||||
3. For a fresh deployment only, drop and recreate the schema, provided no data needs to be preserved
|
||||
4. Check PostgreSQL version compatibility (requires 14+)
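
The version requirement in step 4 can be checked with PostgreSQL's `server_version_num` setting, which reports the version as a single integer. A sketch, with a hypothetical function name:

```bash
# Hypothetical helper: server_version_num encodes the version as an
# integer (for example 140005 for 14.5), so 14+ means >= 140000.
pg_version_supported() {
  [ "$1" -ge 140000 ]
}

# Usage:
#   ver="$(psql "$DATABASE_URL" -At -c 'SHOW server_version_num')"
#   pg_version_supported "$ver" || echo "PostgreSQL 14+ required, found $ver" >&2
```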
|
||||
|
||||
---
|
||||
|
||||
## Rollback Procedure
|
||||
|
||||
If issues occur after deployment:
|
||||
|
||||
1. **Stop the application**: `systemctl stop attune-api` (or equivalent)
|
||||
2. **Revert to previous version**: Deploy previous known-good version
|
||||
3. **Restore database backup** (if migrations were run):
|
||||
```bash
|
||||
pg_restore --clean --if-exists -d attune_prod backup.dump
|
||||
```
|
||||
4. **Verify old version works**: Run post-deployment validation steps
|
||||
5. **Investigate issue**: Review logs, error messages, configuration changes
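
Step 3 only works if a dump was taken before the migrations ran, and `pg_restore` reads archive formats such as the custom format produced by `pg_dump -Fc`, not plain SQL dumps. A sketch of taking such a backup, where the function name `backup_file_name` and the naming scheme are hypothetical:

```bash
# Hypothetical helper: builds a dump file name from a database name
# and a timestamp, so successive backups do not overwrite each other.
backup_file_name() {
  printf '%s-%s.dump\n' "$1" "$2"
}

# Usage before running migrations (custom format, readable by pg_restore):
#   file="$(backup_file_name attune_prod "$(date -u +%Y%m%dT%H%M%SZ)")"
#   pg_dump -Fc -d attune_prod -f "$file"
```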
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- [Configuration Guide](./configuration.md)
|
||||
- [Schema-Per-Test Architecture](./schema-per-test.md)
|
||||
- [API Documentation](./api-overview.md)
|
||||
- [Security Best Practices](./security.md)
|
||||
- [Monitoring and Observability](./monitoring.md)
|
||||
|
||||
---
|
||||
|
||||
## Support
|
||||
|
||||
For production issues or questions:
|
||||
|
||||
- GitHub Issues: https://github.com/your-org/attune/issues
|
||||
- Documentation: https://docs.attune.example.com
|
||||
- Community: https://community.attune.example.com
|
||||