re-uploading work

2026-02-04 17:46:30 -06:00
commit 3b14c65998
1388 changed files with 381262 additions and 0 deletions
--- a/docs/deployment/ops-runbook-queues.md
+++ b/docs/deployment/ops-runbook-queues.md
@@ -0,0 +1,851 @@
+# Operational Runbook: Queue Management
+
+**Service**: Attune Executor  
+**Component**: Execution Queue Manager  
+**Audience**: Operations, SRE, DevOps  
+**Last Updated**: 2025-01-27
+
+---
+
+## Table of Contents
+
+1. [Overview](#overview)
+2. [Quick Reference](#quick-reference)
+3. [Monitoring](#monitoring)
+4. [Common Issues](#common-issues)
+5. [Troubleshooting Procedures](#troubleshooting-procedures)
+6. [Maintenance Tasks](#maintenance-tasks)
+7. [Emergency Procedures](#emergency-procedures)
+8. [Capacity Planning](#capacity-planning)
+
+---
+
+## Overview
+
+The Attune Executor service manages per-action FIFO execution queues to ensure fair, ordered processing when policy limits (concurrency, rate limits) are enforced. This runbook covers operational procedures for monitoring and managing these queues.
+
+### Key Concepts
+
+- **Queue**: Per-action FIFO buffer of waiting executions
+- **Active Count**: Number of currently running executions for an action
+- **Max Concurrent**: Policy-enforced limit on parallel executions
+- **Queue Length**: Number of executions waiting in queue
+- **FIFO**: First-In-First-Out ordering guarantee
+
+### System Components
+
+- **ExecutionQueueManager**: Core queue management (in-memory)
+- **CompletionListener**: Processes worker completion messages
+- **QueueStatsRepository**: Persists statistics to database
+- **API Endpoint**: `/api/v1/actions/:ref/queue-stats`
+
+---
+
+## Quick Reference
+
+### Health Check Commands
+
+```bash
+# Check executor service status
+systemctl status attune-executor
+
+# Check active queues
+curl -s http://localhost:8080/api/v1/actions/core.http.get/queue-stats | jq
+
+# Database query for all active queues
+psql -U attune -d attune -c "
+  SELECT a.ref, qs.queue_length, qs.active_count, qs.max_concurrent,
+         qs.oldest_enqueued_at, qs.last_updated
+  FROM attune.queue_stats qs
+  JOIN attune.action a ON a.id = qs.action_id
+  WHERE queue_length > 0 OR active_count > 0
+  ORDER BY queue_length DESC;
+"
+
+# Check executor logs for queue issues
+journalctl -u attune-executor -n 100 --no-pager | grep -i queue
+```
+
+### Emergency Actions
+
+```bash
+# Restart executor (clears in-memory queues)
+sudo systemctl restart attune-executor
+
+# Restart all workers (forces completion messages)
+sudo systemctl restart attune-worker@*
+
+# Clear stale queue stats (older than 1 hour, inactive)
+psql -U attune -d attune -c "
+  DELETE FROM attune.queue_stats
+  WHERE last_updated < NOW() - INTERVAL '1 hour'
+    AND queue_length = 0
+    AND active_count = 0;
+"
+```
+
+---
+
+## Monitoring
+
+### Key Metrics to Track
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| Queue Length | > 100 | Investigate load |
+| Queue Length | > 500 | Add workers |
+| Queue Length | > 1000 | Emergency response |
+| Oldest Enqueued | > 10 min | Check workers |
+| Oldest Enqueued | > 30 min | Critical issue |
+| Active < Max Concurrent | Any | Workers stuck |
+| Last Updated | > 10 min | Executor issue |
+
+### Monitoring Queries
+
+#### Active Queues Overview
+
+```sql
+SELECT 
+  a.ref AS action,
+  qs.queue_length,
+  qs.active_count,
+  qs.max_concurrent,
+  ROUND(EXTRACT(EPOCH FROM (NOW() - qs.oldest_enqueued_at)) / 60, 1) AS wait_minutes,
+  ROUND(qs.total_completed::float / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct,
+  qs.last_updated
+FROM attune.queue_stats qs
+JOIN attune.action a ON a.id = qs.action_id
+WHERE queue_length > 0 OR active_count > 0
+ORDER BY queue_length DESC;
+```
+
+#### Top Actions by Throughput
+
+```sql
+SELECT 
+  a.ref AS action,
+  qs.total_enqueued,
+  qs.total_completed,
+  qs.total_enqueued - qs.total_completed AS pending,
+  ROUND(qs.total_completed::float / NULLIF(qs.total_enqueued, 0) * 100, 2) AS completion_pct
+FROM attune.queue_stats qs
+JOIN attune.action a ON a.id = qs.action_id
+WHERE qs.total_enqueued > 0
+ORDER BY qs.total_enqueued DESC
+LIMIT 20;
+```
+
+#### Stuck Queues (Not Progressing)
+
+```sql
+SELECT 
+  a.ref AS action,
+  qs.queue_length,
+  qs.active_count,
+  ROUND(EXTRACT(EPOCH FROM (NOW() - qs.last_updated)) / 60, 1) AS stale_minutes,
+  qs.oldest_enqueued_at
+FROM attune.queue_stats qs
+JOIN attune.action a ON a.id = qs.action_id
+WHERE (queue_length > 0 OR active_count > 0)
+  AND last_updated < NOW() - INTERVAL '10 minutes'
+ORDER BY stale_minutes DESC;
+```
+
+#### Queue Growth Rate
+
+```sql
+-- Create a monitoring table for snapshots
+CREATE TABLE IF NOT EXISTS attune.queue_snapshots (
+  snapshot_time TIMESTAMPTZ DEFAULT NOW(),
+  action_id BIGINT,
+  queue_length INT,
+  active_count INT,
+  total_enqueued BIGINT
+);
+
+-- Take snapshot (run every 5 minutes)
+INSERT INTO attune.queue_snapshots (action_id, queue_length, active_count, total_enqueued)
+SELECT action_id, queue_length, active_count, total_enqueued
+FROM attune.queue_stats
+WHERE queue_length > 0 OR active_count > 0;
+
+-- Analyze growth rate
+SELECT 
+  a.ref AS action,
+  s1.queue_length AS queue_now,
+  s2.queue_length AS queue_5min_ago,
+  s1.queue_length - s2.queue_length AS growth,
+  s1.total_enqueued - s2.total_enqueued AS new_requests
+FROM attune.queue_snapshots s1
+JOIN attune.queue_snapshots s2 ON s2.action_id = s1.action_id
+JOIN attune.action a ON a.id = s1.action_id
+WHERE s1.snapshot_time >= NOW() - INTERVAL '1 minute'
+  AND s2.snapshot_time >= NOW() - INTERVAL '6 minutes'
+  AND s2.snapshot_time < NOW() - INTERVAL '4 minutes'
+ORDER BY growth DESC;
+```
+
+### Alerting Rules
+
+**Prometheus/Grafana Alerts** (if metrics exported):
+
+```yaml
+groups:
+  - name: attune_queues
+    interval: 30s
+    rules:
+      - alert: HighQueueDepth
+        expr: attune_queue_length > 100
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Queue depth high for {{ $labels.action }}"
+          description: "Queue has {{ $value }} waiting executions"
+
+      - alert: CriticalQueueDepth
+        expr: attune_queue_length > 500
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Critical queue depth for {{ $labels.action }}"
+          description: "Queue has {{ $value }} waiting executions - add workers"
+
+      - alert: StuckQueue
+        expr: attune_queue_last_updated < time() - 600
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Queue not progressing for {{ $labels.action }}"
+          description: "Queue hasn't updated in 10+ minutes"
+
+      - alert: OldestExecutionAging
+        expr: attune_queue_oldest_age_seconds > 1800
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Execution waiting 30+ minutes for {{ $labels.action }}"
+```
+
+**Nagios/Icinga Check**:
+
+```bash
+#!/bin/bash
+# /usr/lib/nagios/plugins/check_attune_queues.sh
+
+WARN_THRESHOLD=${1:-100}
+CRIT_THRESHOLD=${2:-500}
+
+MAX_QUEUE=$(psql -U attune -d attune -t -c "
+  SELECT COALESCE(MAX(queue_length), 0) FROM attune.queue_stats;
+")
+
+if [ "$MAX_QUEUE" -ge "$CRIT_THRESHOLD" ]; then
+  echo "CRITICAL: Max queue depth $MAX_QUEUE >= $CRIT_THRESHOLD"
+  exit 2
+elif [ "$MAX_QUEUE" -ge "$WARN_THRESHOLD" ]; then
+  echo "WARNING: Max queue depth $MAX_QUEUE >= $WARN_THRESHOLD"
+  exit 1
+else
+  echo "OK: Max queue depth $MAX_QUEUE"
+  exit 0
+fi
+```
+
+---
+
+## Common Issues
+
+### Issue 1: Queue Growing Continuously
+
+**Symptoms:**
+- Queue length increases over time
+- Never decreases even when workers are idle
+- `oldest_enqueued_at` gets older
+
+**Common Causes:**
+1. Workers not processing fast enough
+2. Too many incoming requests
+3. Concurrency limit too low
+4. Worker crashes/restarts
+
+**Quick Diagnosis:**
+```bash
+# Check worker status
+systemctl status attune-worker@*
+
+# Check worker resource usage
+ps aux | grep attune-worker
+top -p $(pgrep -d',' attune-worker)
+
+# Check recent completions
+psql -U attune -d attune -c "
+  SELECT COUNT(*), status
+  FROM attune.execution
+  WHERE updated > NOW() - INTERVAL '5 minutes'
+  GROUP BY status;
+"
+```
+
+**Resolution:** See [Troubleshooting: Growing Queue](#growing-queue)
+
+---
+
+### Issue 2: Queue Not Progressing
+
+**Symptoms:**
+- Queue length stays constant
+- `last_updated` timestamp doesn't change
+- Active executions showing but not completing
+
+**Common Causes:**
+1. Workers crashed/hung
+2. CompletionListener not running
+3. Message queue connection lost
+4. Database connection issue
+
+**Quick Diagnosis:**
+```bash
+# Check executor process
+ps aux | grep attune-executor
+journalctl -u attune-executor -n 50 --no-pager
+
+# Check message queue
+rabbitmqctl list_queues name messages | grep execution.completed
+
+# Check for stuck executions
+psql -U attune -d attune -c "
+  SELECT id, action, status, created, updated
+  FROM attune.execution
+  WHERE status = 'running'
+    AND updated < NOW() - INTERVAL '10 minutes'
+  ORDER BY created DESC
+  LIMIT 10;
+"
+```
+
+**Resolution:** See [Troubleshooting: Stuck Queue](#stuck-queue)
+
+---
+
+### Issue 3: Queue Full Errors
+
+**Symptoms:**
+- API returns `Queue full (max length: 10000)` errors
+- New executions rejected
+- Users report action failures
+
+**Common Causes:**
+1. Sudden traffic spike
+2. Worker capacity exhausted
+3. `max_queue_length` too low
+4. Slow action execution
+
+**Quick Diagnosis:**
+```bash
+# Check current queue stats
+curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq
+
+# Check configuration
+grep -A5 "queue:" /etc/attune/config.yaml
+
+# Check worker count
+systemctl list-units attune-worker@* | grep running
+```
+
+**Resolution:** See [Troubleshooting: Queue Full](#queue-full)
+
+---
+
+### Issue 4: FIFO Order Violation
+
+**Symptoms:**
+- Executions complete out of order
+- Later requests finish before earlier ones
+- Workflow dependencies break
+
+**Severity:** CRITICAL - This indicates a bug
+
+**Immediate Action:**
+1. Capture executor logs immediately
+2. Document the violation with timestamps
+3. Restart executor service
+4. File critical bug report
+
+**Data to Collect:**
+```bash
+# Capture logs
+journalctl -u attune-executor --since "10 minutes ago" > /tmp/executor-fifo-violation.log
+
+# Capture database state
+psql -U attune -d attune -c "
+  SELECT id, action, status, created, updated
+  FROM attune.execution
+  WHERE action = <affected_action_id>
+    AND created > NOW() - INTERVAL '1 hour'
+  ORDER BY created;
+" > /tmp/execution-order.txt
+
+# Capture queue stats
+curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq > /tmp/queue-stats.json
+```
+
+---
+
+## Troubleshooting Procedures
+
+### Growing Queue
+
+**Procedure:**
+
+1. **Assess Severity**
+   ```bash
+   # Get current queue depth
+   curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'
+   ```
+
+2. **Check Worker Health**
+   ```bash
+   # Active workers
+   systemctl list-units attune-worker@* | grep running | wc -l
+   
+   # Worker resource usage
+   ps aux | grep attune-worker | awk '{print $3, $4, $11}'
+   
+   # Recent worker errors
+   journalctl -u attune-worker@* -n 100 --no-pager | grep -i error
+   ```
+
+3. **Check Completion Rate**
+   ```sql
+   SELECT 
+     COUNT(*) FILTER (WHERE created > NOW() - INTERVAL '5 minutes') AS recent_created,
+     COUNT(*) FILTER (WHERE updated > NOW() - INTERVAL '5 minutes' AND status IN ('succeeded', 'failed')) AS recent_completed
+   FROM attune.execution
+   WHERE action = <action_id>;
+   ```
+
+4. **Solutions (in order of preference)**:
+
+   a. **Scale Workers** (if completion rate too low):
+   ```bash
+   # Add more worker instances
+   sudo systemctl start attune-worker@2
+   sudo systemctl start attune-worker@3
+   ```
+
+   b. **Increase Concurrency** (if safe):
+   ```yaml
+   # In config.yaml or via API
+   policies:
+     actions:
+       affected.action:
+         concurrency_limit: 10  # Increase from 5
+   ```
+
+   c. **Rate Limit at API** (if traffic spike):
+   ```yaml
+   # In API config
+   rate_limits:
+     global:
+       max_requests_per_minute: 1000
+   ```
+
+   d. **Temporary Queue Increase** (emergency only):
+   ```yaml
+   executor:
+     queue:
+       max_queue_length: 20000  # Increase from 10000
+   ```
+   Then restart executor: `sudo systemctl restart attune-executor`
+
+5. **Monitor Results**
+   ```bash
+   watch -n 5 "curl -s http://localhost:8080/api/v1/actions/AFFECTED_ACTION/queue-stats | jq '.data.queue_length'"
+   ```
+
+---
+
+### Stuck Queue
+
+**Procedure:**
+
+1. **Identify Stuck Executions**
+   ```sql
+   SELECT id, status, created, updated, 
+          EXTRACT(EPOCH FROM (NOW() - updated)) / 60 AS stuck_minutes
+   FROM attune.execution
+   WHERE action = <action_id>
+     AND status IN ('running', 'requested')
+     AND updated < NOW() - INTERVAL '10 minutes'
+   ORDER BY created;
+   ```
+
+2. **Check Worker Status**
+   ```bash
+   # Are workers running?
+   systemctl status attune-worker@*
+   
+   # Are workers processing?
+   tail -f /var/log/attune/worker.log | grep execution_id
+   ```
+
+3. **Check Message Queue**
+   ```bash
+   # Completion messages backing up?
+   rabbitmqctl list_queues name messages | grep execution.completed
+   
+   # Connection issues?
+   rabbitmqctl list_connections
+   ```
+
+4. **Check CompletionListener**
+   ```bash
+   # Is listener running?
+   journalctl -u attune-executor -n 100 --no-pager | grep CompletionListener
+   
+   # Recent completions processed?
+   journalctl -u attune-executor -n 100 --no-pager | grep "notify_completion"
+   ```
+
+5. **Solutions**:
+
+   a. **Restart Stuck Workers**:
+   ```bash
+   # Graceful restart
+   sudo systemctl restart attune-worker@1
+   ```
+
+   b. **Restart Executor** (if CompletionListener stuck):
+   ```bash
+   sudo systemctl restart attune-executor
+   ```
+
+   c. **Force Complete Stuck Executions** (emergency):
+   ```sql
+   -- CAUTION: Only for truly stuck executions
+   UPDATE attune.execution
+   SET status = 'failed', 
+       result = '{"error": "Execution stuck, manually failed by operator"}',
+       updated = NOW()
+   WHERE id IN (<stuck_execution_ids>);
+   ```
+
+   d. **Purge and Restart** (nuclear option):
+   ```bash
+   # Stop services
+   sudo systemctl stop attune-executor
+   sudo systemctl stop attune-worker@*
+   
+   # Clear message queues
+   rabbitmqctl purge_queue execution.requested
+   rabbitmqctl purge_queue execution.completed
+   
+   # Restart services
+   sudo systemctl start attune-executor
+   sudo systemctl start attune-worker@1
+   ```
+
+---
+
+### Queue Full
+
+**Procedure:**
+
+1. **Immediate Mitigation** (choose one):
+
+   a. **Temporarily Increase Limit**:
+   ```yaml
+   # config.yaml
+   executor:
+     queue:
+       max_queue_length: 20000
+   ```
+   ```bash
+   sudo systemctl restart attune-executor
+   ```
+
+   b. **Add Workers**:
+   ```bash
+   sudo systemctl start attune-worker@{2..5}
+   ```
+
+   c. **Increase Concurrency**:
+   ```yaml
+   policies:
+     actions:
+       affected.action:
+         concurrency_limit: 20  # Increase
+   ```
+
+2. **Analyze Root Cause**
+   ```bash
+   # Traffic pattern
+   psql -U attune -d attune -c "
+     SELECT DATE_TRUNC('minute', created) AS minute, COUNT(*)
+     FROM attune.execution
+     WHERE action = <action_id>
+       AND created > NOW() - INTERVAL '1 hour'
+     GROUP BY minute
+     ORDER BY minute DESC;
+   "
+   
+   # Action performance
+   psql -U attune -d attune -c "
+     SELECT AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration_seconds
+     FROM attune.execution
+     WHERE action = <action_id>
+       AND status = 'succeeded'
+       AND created > NOW() - INTERVAL '1 hour';
+   "
+   ```
+
+3. **Long-term Solution**:
+
+   - **Traffic spike**: Add API rate limiting
+   - **Slow action**: Optimize action code
+   - **Under-capacity**: Permanently scale workers
+   - **Configuration**: Adjust concurrency limits
+
+---
+
+## Maintenance Tasks
+
+### Daily
+
+```bash
+#!/bin/bash
+# daily-queue-check.sh
+
+echo "=== Active Queues ==="
+psql -U attune -d attune -c "
+  SELECT a.ref, qs.queue_length, qs.active_count
+  FROM attune.queue_stats qs
+  JOIN attune.action a ON a.id = qs.action_id
+  WHERE queue_length > 0 OR active_count > 0;
+"
+
+echo "=== Stuck Queues ==="
+psql -U attune -d attune -c "
+  SELECT a.ref, qs.queue_length, 
+         ROUND(EXTRACT(EPOCH FROM (NOW() - qs.last_updated)) / 60, 1) AS stale_minutes
+  FROM attune.queue_stats qs
+  JOIN attune.action a ON a.id = qs.action_id
+  WHERE (queue_length > 0 OR active_count > 0)
+    AND last_updated < NOW() - INTERVAL '30 minutes';
+"
+
+echo "=== Top Actions by Volume ==="
+psql -U attune -d attune -c "
+  SELECT a.ref, qs.total_enqueued, qs.total_completed
+  FROM attune.queue_stats qs
+  JOIN attune.action a ON a.id = qs.action_id
+  ORDER BY qs.total_enqueued DESC
+  LIMIT 10;
+"
+```
+
+### Weekly
+
+```bash
+#!/bin/bash
+# weekly-queue-maintenance.sh
+
+echo "=== Cleaning Stale Queue Stats ==="
+psql -U attune -d attune -c "
+  DELETE FROM attune.queue_stats
+  WHERE last_updated < NOW() - INTERVAL '7 days'
+    AND queue_length = 0
+    AND active_count = 0;
+"
+
+echo "=== Queue Snapshots Cleanup ==="
+psql -U attune -d attune -c "
+  DELETE FROM attune.queue_snapshots
+  WHERE snapshot_time < NOW() - INTERVAL '30 days';
+"
+
+echo "=== Executor Log Rotation ==="
+journalctl --vacuum-time=30d -u attune-executor
+```
+
+### Monthly
+
+- Review queue capacity trends
+- Analyze high-volume actions
+- Plan scaling based on growth
+- Update alert thresholds
+- Review and test runbook procedures
+
+---
+
+## Emergency Procedures
+
+### Emergency: System-Wide Queue Overload
+
+**Symptoms:**
+- Multiple actions with critical queue depths
+- System-wide performance degradation
+- API response times degraded
+
+**Procedure:**
+
+1. **Enable Emergency Mode**:
+   ```yaml
+   # config.yaml
+   executor:
+     emergency_mode: true  # Relaxes limits
+     queue:
+       max_queue_length: 50000
+   ```
+
+2. **Scale Workers Aggressively**:
+   ```bash
+   for i in {1..10}; do
+     sudo systemctl start attune-worker@$i
+   done
+   ```
+
+3. **Temporarily Disable Non-Critical Actions**:
+   ```sql
+   -- Disable low-priority actions
+   UPDATE attune.action
+   SET enabled = false
+   WHERE priority < 5 OR tags @> '["low-priority"]';
+   ```
+
+4. **Enable API Rate Limiting**:
+   ```yaml
+   api:
+     rate_limits:
+       global:
+         enabled: true
+         max_requests_per_minute: 500
+   ```
+
+5. **Monitor Recovery**:
+   ```bash
+   watch -n 10 "psql -U attune -d attune -t -c 'SELECT SUM(queue_length) FROM attune.queue_stats;'"
+   ```
+
+6. **Post-Incident**:
+   - Document what happened
+   - Analyze root cause
+   - Update capacity plan
+   - Restore normal configuration
+
+---
+
+### Emergency: Executor Crash Loop
+
+**Symptoms:**
+- Executor service repeatedly crashes
+- Queues not progressing
+- High memory usage before crash
+
+**Procedure:**
+
+1. **Capture Crash Logs**:
+   ```bash
+   journalctl -u attune-executor --since "30 minutes ago" > /tmp/executor-crash.log
+   dmesg | tail -100 > /tmp/dmesg-crash.log
+   ```
+
+2. **Check for Memory Issues**:
+   ```bash
+   # Check OOM kills
+   grep -i "out of memory" /var/log/syslog
+   grep -i "killed process" /var/log/kern.log
+   ```
+
+3. **Emergency Restart with Limited Queues**:
+   ```yaml
+   # config.yaml
+   executor:
+     queue:
+       max_queue_length: 1000  # Reduce drastically
+       enable_metrics: false    # Reduce overhead
+   ```
+
+4. **Start in Safe Mode**:
+   ```bash
+   sudo systemctl start attune-executor
+   # Monitor memory
+   watch -n 1 "ps aux | grep attune-executor | grep -v grep"
+   ```
+
+5. **If Still Crashing**:
+   ```bash
+   # Disable queue persistence temporarily
+   # In code or via feature flag
+   export ATTUNE__EXECUTOR__QUEUE__ENABLE_METRICS=false
+   sudo systemctl restart attune-executor
+   ```
+
+6. **Escalate**:
+   - Contact development team
+   - Provide crash logs and memory dumps
+   - Consider rolling back to previous version
+
+---
+
+## Capacity Planning
+
+### Calculating Required Capacity
+
+**Formula**:
+```
+Required Workers = (Peak Requests/Hour × Avg Duration) / 3600 / Concurrency Limit
+```
+
+**Example**:
+- Peak: 10,000 requests/hour
+- Avg Duration: 5 seconds
+- Concurrency: 10 per worker
+
+```
+Workers = (10,000 × 5) / 3600 / 10 = 1.4 → 2 workers minimum
+Add 50% buffer → 3 workers recommended
+```
+
+### Growth Planning
+
+Monitor monthly trends:
+
+```sql
+SELECT 
+  DATE_TRUNC('day', created) AS day,
+  COUNT(*) AS executions,
+  AVG(EXTRACT(EPOCH FROM (updated - created))) AS avg_duration
+FROM attune.execution
+WHERE created > NOW() - INTERVAL '30 days'
+GROUP BY day
+ORDER BY day;
+```
+
+### Capacity Recommendations
+
+| Queue Depth | Worker Count | Action |
+|-------------|--------------|--------|
+| < 10 | Current | Maintain |
+| 10-50 | +25% | Plan scale-up |
+| 50-100 | +50% | Scale soon |
+| 100+ | +100% | Scale now |
+
+---
+
+## Related Documentation
+
+- [Queue Architecture](./queue-architecture.md)
+- [Executor Service](./executor-service.md)
+- [Worker Service](./worker-service.md)
+- [API: Actions - Queue Stats](./api-actions.md#get-queue-statistics)
+
+---
+
+**Version**: 1.0  
+**Maintained By**: SRE Team  
+**Last Updated**: 2025-01-27
--- a/docs/deployment/production-deployment.md
+++ b/docs/deployment/production-deployment.md
@@ -0,0 +1,447 @@
+# Production Deployment Guide
+
+This document provides guidelines and checklists for deploying Attune to production environments.
+
+## Table of Contents
+
+- [Pre-Deployment Checklist](#pre-deployment-checklist)
+- [Database Configuration](#database-configuration)
+- [Environment Variables](#environment-variables)
+- [Schema Verification](#schema-verification)
+- [Security Best Practices](#security-best-practices)
+- [Deployment Steps](#deployment-steps)
+- [Post-Deployment Validation](#post-deployment-validation)
+- [Troubleshooting](#troubleshooting)
+
+---
+
+## Pre-Deployment Checklist
+
+Before deploying Attune to production, verify the following:
+
+- [ ] PostgreSQL 14+ database is provisioned and accessible
+- [ ] RabbitMQ 3.12+ message queue is configured
+- [ ] All required environment variables are set (see below)
+- [ ] Database migrations have been tested in staging
+- [ ] SSL/TLS certificates are configured for HTTPS
+- [ ] Log aggregation and monitoring are configured
+- [ ] Backup and disaster recovery procedures are in place
+- [ ] Security audit has been completed
+- [ ] Load balancing and high availability are configured (if applicable)
+
+---
+
+## Database Configuration
+
+### Critical: Schema Configuration
+
+**Production MUST use the `attune` schema.**
+
+The schema configuration is set in `config.production.yaml`:
+
+```yaml
+database:
+  schema: "attune"  # REQUIRED: Do not remove or change
+```
+
+### Why This Matters
+
+- **Test Isolation**: Tests use dynamic schemas (e.g., `test_uuid`) for isolation
+- **Production Consistency**: All production services must use the same schema
+- **Migration Safety**: Migrations expect the `attune` schema in production
+
+### Verification
+
+You can verify the schema configuration in several ways:
+
+1. **Check Configuration File**: Ensure `config.production.yaml` has `schema: "attune"`
+
+2. **Check Environment Variable** (if overriding):
+   ```bash
+   echo $ATTUNE__DATABASE__SCHEMA
+   # Should output: attune
+   ```
+
+3. **Check Application Logs** on startup:
+   ```
+   INFO Using production schema: attune
+   ```
+
+4. **Query Database**:
+   ```sql
+   SELECT current_schema();
+   -- Should return: attune
+   ```
+
+### ⚠️ WARNING
+
+If the schema is **not** set to `attune` in production, you will see this warning in logs:
+
+```
+WARN Using non-standard schema: 'test_xyz'. Production should use 'attune'
+```
+
+**If you see this warning in production, STOP and fix the configuration immediately.**
+
+---
+
+## Environment Variables
+
+### Required Variables
+
+These environment variables **MUST** be set before deploying:
+
+```bash
+# Database connection (required)
+export DATABASE_URL="postgresql://username:password@host:port/database"
+
+# JWT secret for authentication (required, 64+ characters)
+# Generate with: openssl rand -base64 64
+export JWT_SECRET="your-secure-jwt-secret-here"
+
+# Encryption key for secrets storage (required, 32+ characters)
+# Generate with: openssl rand -base64 32
+export ENCRYPTION_KEY="your-secure-encryption-key-here"
+```
+
+### Optional Variables
+
+```bash
+# Redis (for caching)
+export REDIS_URL="redis://host:6379"
+
+# RabbitMQ (for message queue)
+export RABBITMQ_URL="amqp://user:pass@host:5672/%2f"
+
+# CORS origins (comma-separated)
+export ATTUNE__SERVER__CORS_ORIGINS="https://app.example.com,https://www.example.com"
+
+# Log level override
+export ATTUNE__LOG__LEVEL="info"
+
+# Server port override
+export ATTUNE__SERVER__PORT="8080"
+
+# Schema override (should always be 'attune' in production)
+export ATTUNE__DATABASE__SCHEMA="attune"
+```
+
+### Environment Variable Format
+
+Attune uses hierarchical configuration with the prefix `ATTUNE__` and separator `__`:
+
+- `ATTUNE__DATABASE__URL` → `database.url`
+- `ATTUNE__SERVER__PORT` → `server.port`
+- `ATTUNE__LOG__LEVEL` → `log.level`
+
+---
+
+## Schema Verification
+
+### Automatic Verification
+
+Attune includes built-in schema validation:
+
+1. **Schema Name Validation**: Only alphanumeric and underscores allowed (max 63 chars)
+2. **SQL Injection Prevention**: Schema names are validated before use
+3. **Logging**: Production schema usage is logged prominently at startup
+
+### Manual Verification Script
+
+Run this verification before deployment:
+
+```bash
+# Verify configuration loads correctly
+cargo run --release --bin attune-api -- --config config.production.yaml --dry-run
+
+# Check logs for schema confirmation
+cargo run --release --bin attune-api 2>&1 | grep -i schema
+```
+
+Expected output:
+```
+INFO Using production schema: attune
+INFO Connecting to database with max_connections=20, schema=attune
+```
+
+### Database Schema Check
+
+After deployment, verify the schema in the database:
+
+```bash
+# Connect to your production database
+psql $DATABASE_URL
+
+# Verify schema exists
+\dn attune
+
+# Verify search_path includes attune
+SHOW search_path;
+
+# Verify tables are in attune schema
+SELECT schemaname, tablename 
+FROM pg_tables 
+WHERE schemaname = 'attune' 
+ORDER BY tablename;
+```
+
+You should see all 17 Attune tables:
+- `action`
+- `enforcement`
+- `event`
+- `execution`
+- `execution_log`
+- `identity`
+- `inquiry`
+- `inquiry_response`
+- `key`
+- `pack`
+- `rule`
+- `rule_enforcement`
+- `sensor`
+- `sensor_instance`
+- `trigger`
+- `trigger_instance`
+- `workflow_definition`
+
+---
+
+## Security Best Practices
+
+### Secrets Management
+
+1. **Never commit secrets to version control**
+2. **Use environment variables or secret management systems** (e.g., AWS Secrets Manager, HashiCorp Vault)
+3. **Rotate secrets regularly** (JWT secret, encryption key, database passwords)
+4. **Use strong, randomly generated secrets** (use provided generation commands)
+
+### Database Security
+
+1. **Use dedicated database user** with minimal required permissions
+2. **Enable SSL/TLS** for database connections
+3. **Use connection pooling** (configured via `max_connections`)
+4. **Restrict network access** to database (firewall rules, VPC, etc.)
+5. **Enable audit logging** for sensitive operations
+
+### Application Security
+
+1. **Run as non-root user** in containers/VMs
+2. **Enable HTTPS** for all API endpoints (use reverse proxy like nginx)
+3. **Configure CORS properly** (only allow trusted origins)
+4. **Set up rate limiting** and DDoS protection
+5. **Enable security headers** (CSP, HSTS, X-Frame-Options, etc.)
+6. **Keep dependencies updated** (run `cargo audit` regularly)
+
+---
+
+## Deployment Steps
+
+### 1. Prepare Database
+
+```bash
+# Create production database (if not exists)
+createdb -h your-db-host -U your-db-user attune_prod
+
+# Run migrations
+export DATABASE_URL="postgresql://user:pass@host:port/attune_prod"
+export ATTUNE__DATABASE__SCHEMA="attune"
+sqlx migrate run --source ./migrations
+```
+
+### 2. Build Application
+
+```bash
+# Build release binary
+cargo build --release --bin attune-api
+
+# Or build Docker image
+docker build -t attune-api:latest -f docker/api.Dockerfile .
+```
+
+### 3. Configure Environment
+
+```bash
+# Set all required environment variables
+export DATABASE_URL="postgresql://..."
+export JWT_SECRET="$(openssl rand -base64 64)"
+export ENCRYPTION_KEY="$(openssl rand -base64 32)"
+export ATTUNE__DATABASE__SCHEMA="attune"
+# ... etc
+```
+
+### 4. Deploy Services
+
+```bash
+# Start API service
+./target/release/attune-api --config config.production.yaml
+
+# Or with Docker
+docker run -d \
+  --name attune-api \
+  -p 8080:8080 \
+  -e DATABASE_URL="$DATABASE_URL" \
+  -e JWT_SECRET="$JWT_SECRET" \
+  -e ENCRYPTION_KEY="$ENCRYPTION_KEY" \
+  -v ./config.production.yaml:/app/config.production.yaml \
+  attune-api:latest
+```
+
+### 5. Load Core Pack
+
+```bash
+# Load the core pack (provides essential actions and sensors)
+./scripts/load-core-pack.sh
+```
+
+---
+
+## Post-Deployment Validation
+
+### Health Check
+
+```bash
+# Check API health endpoint
+curl http://your-api-host:8080/health
+
+# Expected response:
+# {"status":"ok","timestamp":"2024-01-15T12:00:00Z"}
+```
+
+### Schema Validation
+
+```bash
+# Check application logs for schema confirmation
+docker logs attune-api 2>&1 | grep -i schema
+
+# Expected output:
+# INFO Using production schema: attune
+```
+
+### Functional Tests
+
+```bash
+# Test authentication
+curl -X POST http://your-api-host:8080/auth/login \
+  -H "Content-Type: application/json" \
+  -d '{"username":"admin","password":"your-password"}'
+
+# Test pack listing
+curl http://your-api-host:8080/api/v1/packs \
+  -H "Authorization: Bearer YOUR_TOKEN"
+
+# Test action execution
+curl -X POST http://your-api-host:8080/api/v1/executions \
+  -H "Authorization: Bearer YOUR_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"action_ref":"core.echo","parameters":{"message":"Hello"}}'
+```
+
+### Monitoring
+
+Set up monitoring for:
+
+- **Application health**: `/health` endpoint availability
+- **Database connections**: Pool size and connection errors
+- **Error rates**: 4xx and 5xx HTTP responses
+- **Response times**: P50, P95, P99 latencies
+- **Resource usage**: CPU, memory, disk, network
+- **Schema usage**: Verify `attune` schema in logs
+
+---
+
+## Troubleshooting
+
+### Issue: Wrong Schema in Production
+
+**Symptoms:**
+- Log shows: `WARN Using non-standard schema: 'something_else'`
+- Database queries fail or return no data
+
+**Solution:**
+1. Check `config.production.yaml` has `schema: "attune"`
+2. Check for environment variable override: `echo $ATTUNE__DATABASE__SCHEMA`
+3. Restart the application after fixing configuration
+4. Verify logs show: `INFO Using production schema: attune`
+
+### Issue: Schema Not Found
+
+**Symptoms:**
+- Application startup fails with "schema does not exist"
+- Database queries fail with "schema not found"
+
+**Solution:**
+1. Verify schema exists: `psql $DATABASE_URL -c "\dn attune"`
+2. If missing, run migrations: `sqlx migrate run --source ./migrations`
+3. Check migration files uncommented schema creation (first migration)
+
+### Issue: Connection Pool Exhausted
+
+**Symptoms:**
+- Timeout errors
+- "connection pool exhausted" errors
+- Slow response times
+
+**Solution:**
+1. Increase `max_connections` in config
+2. Check for connection leaks in application logs
+3. Verify database can handle the connection load
+4. Consider scaling horizontally (multiple instances)
+
+### Issue: Authentication Fails
+
+**Symptoms:**
+- All requests return 401 Unauthorized
+- Token validation errors in logs
+
+**Solution:**
+1. Verify `JWT_SECRET` is set correctly
+2. Check token expiration times in config
+3. Ensure clocks are synchronized (NTP)
+4. Verify `enable_auth: true` in config
+
+### Issue: Migrations Fail
+
+**Symptoms:**
+- `sqlx migrate run` errors
+- "relation already exists" or "schema already exists"
+
+**Solution:**
+1. Check `_sqlx_migrations` table: `SELECT * FROM attune._sqlx_migrations;`
+2. Verify migrations are in correct order
+3. For fresh deployment, drop and recreate schema if safe
+4. Check PostgreSQL version compatibility (requires 14+)
+
+---
+
+## Rollback Procedure
+
+If issues occur after deployment:
+
+1. **Stop the application**: `systemctl stop attune-api` (or equivalent)
+2. **Revert to previous version**: Deploy previous known-good version
+3. **Restore database backup** (if migrations were run):
+   ```bash
+   pg_restore -d attune_prod backup.dump
+   ```
+4. **Verify old version works**: Run post-deployment validation steps
+5. **Investigate issue**: Review logs, error messages, configuration changes
+
+---
+
+## Additional Resources
+
+- [Configuration Guide](./configuration.md)
+- [Schema-Per-Test Architecture](./schema-per-test.md)
+- [API Documentation](./api-overview.md)
+- [Security Best Practices](./security.md)
+- [Monitoring and Observability](./monitoring.md)
+
+---
+
+## Support
+
+For production issues or questions:
+
+- GitHub Issues: https://github.com/your-org/attune/issues
+- Documentation: https://docs.attune.example.com
+- Community: https://community.attune.example.com