Files
attune/docs/MIGRATION-queue-separation-2026-02-03.md
2026-02-04 17:46:30 -06:00

281 lines
7.5 KiB
Markdown

# Migration Guide: Queue Separation Fix (2026-02-03)
**Issue:** Deserialization errors in executor service
**Urgency:** High - Critical bug causing message rejection
**Downtime Required:** Yes (brief - service restart only)
## Overview
This migration separates competing consumers on shared RabbitMQ queues into dedicated queues, fixing deserialization errors:
- `missing field 'inquiry_id'`
- `missing field 'action_id'`
## Changes Summary
### New Queues Created
1. `attune.inquiry.responses.queue` - For inquiry response messages
2. `attune.execution.completed.queue` - For execution completion messages
### Queue Bindings Modified
- `attune.execution.status.queue` - Now only receives `execution.status.changed` messages
- `attune.execution.completed.queue` - Now receives `execution.completed` messages
- `attune.inquiry.responses.queue` - Now receives `inquiry.responded` messages
### Services Affected
- **Executor Service** - Requires restart (consumers reconfigured)
- **Worker Service** - No changes required (publishers work automatically)
- **API Service** - No changes required (publishers work automatically)
## Pre-Migration Checklist
- [ ] Backup current RabbitMQ configuration
- [ ] Note current queue depths in RabbitMQ management UI
- [ ] Verify all services are running and healthy
- [ ] Review recent executor logs for deserialization errors
- [ ] Ensure you have access to restart the executor service
## Migration Steps
### Step 1: Stop the Executor Service
```bash
# Using systemd
sudo systemctl stop attune-executor
# Using docker-compose
docker-compose stop executor
# Or kill the process
pkill -f attune-executor
```
### Step 2: Deploy Updated Code
```bash
# Pull latest code
git pull origin main
# Rebuild executor (and common library)
cd attune
cargo build --release --bin attune-executor
```
### Step 3: Verify RabbitMQ Queue Creation
The new queues will be created automatically when the executor starts, but you can verify the configuration:
```bash
# Check that the code is updated
grep -r "inquiry_responses" crates/common/src/mq/config.rs
grep -r "execution_completed" crates/common/src/mq/config.rs
```
### Step 4: Start the Executor Service
```bash
# Using systemd
sudo systemctl start attune-executor
# Using docker-compose
docker-compose start executor
# Or directly
./target/release/attune-executor --config config.production.yaml
```
### Step 5: Verify Queue Creation in RabbitMQ
Check RabbitMQ Management UI (http://localhost:15672):
**Queues Tab:**
- [ ] `attune.inquiry.responses.queue` exists
- [ ] `attune.execution.completed.queue` exists
- [ ] `attune.execution.status.queue` still exists
**Exchanges Tab → attune.executions → Bindings:**
- [ ] `inquiry.responded``attune.inquiry.responses.queue`
- [ ] `execution.completed``attune.execution.completed.queue`
- [ ] `execution.status.changed``attune.execution.status.queue`
### Step 6: Monitor Executor Logs
```bash
# Watch for successful startup
tail -f /var/log/attune/executor.log
# Or with journalctl
journalctl -u attune-executor -f
# Or with docker
docker logs -f attune-executor
```
**Expected log messages:**
```
INFO Starting Executor Service
INFO Message queue connection established
INFO Queue manager initialized with database persistence
INFO Starting event processor...
INFO Starting completion listener...
INFO Starting enforcement processor...
INFO Starting execution scheduler...
INFO Starting execution manager...
INFO Starting inquiry handler...
INFO Executor Service started successfully
```
### Step 7: Verify No Deserialization Errors
```bash
# Check for the specific errors (should be NONE)
grep "missing field.*inquiry_id" /var/log/attune/executor.log
grep "missing field.*action_id" /var/log/attune/executor.log
grep "Failed to deserialize message" /var/log/attune/executor.log
```
If no output, the fix is working! ✅
### Step 8: Functional Testing
**Test Execution Completion:**
```bash
# Execute a simple action
attune action execute core.echo --param message="test"
# Verify execution completes without errors in logs
```
**Test Inquiry Workflow (if applicable):**
```bash
# Create an action that requests inquiry
# Respond to the inquiry via API
# Verify execution resumes
```
**Test Status Updates:**
```bash
# Execute a longer-running action
# Verify status updates are processed correctly
```
## Rollback Procedure
If issues occur, you can rollback:
### Step 1: Stop Executor
```bash
sudo systemctl stop attune-executor
```
### Step 2: Revert Code
```bash
git revert <commit-hash>
cargo build --release --bin attune-executor
```
### Step 3: Remove New Queues (Optional)
```bash
# Via RabbitMQ Management API
curl -u guest:guest -X DELETE http://localhost:15672/api/queues/%2F/attune.inquiry.responses.queue
curl -u guest:guest -X DELETE http://localhost:15672/api/queues/%2F/attune.execution.completed.queue
```
### Step 4: Restart Executor
```bash
sudo systemctl start attune-executor
```
## Post-Migration Verification
- [ ] Executor service is running and healthy
- [ ] No deserialization errors in logs for 15+ minutes
- [ ] Test executions complete successfully
- [ ] Inquiries (if used) work correctly
- [ ] All three new queue bindings show in RabbitMQ UI
- [ ] Queue message rates look normal
- [ ] No messages in dead letter queues
## Monitoring Points
Watch these metrics for 24 hours post-migration:
1. **Executor Error Rate** - Should drop to near zero
2. **Queue Depths** - Should remain stable/low
3. **Message Delivery Rate** - Should remain consistent
4. **Dead Letter Queue Depth** - Should not increase
## Troubleshooting
### Issue: New queues not created
**Symptoms:** Queues don't appear in RabbitMQ UI
**Solution:**
```bash
# Check executor logs for connection errors
grep "Failed to declare queue" /var/log/attune/executor.log
# Verify RabbitMQ permissions
rabbitmqctl list_user_permissions attune_user
```
### Issue: Still seeing deserialization errors
**Symptoms:** Errors persist after restart
**Solution:**
```bash
# 1. Verify code was rebuilt
attune-executor --version
# 2. Check which queues consumers are using
grep "Starting.*listener" /var/log/attune/executor.log
# 3. Verify bindings in RabbitMQ UI match expected configuration
# 4. Restart ALL services to ensure workers/API use new bindings
sudo systemctl restart attune-worker attune-api attune-executor
```
### Issue: Messages stuck in old queue
**Symptoms:** Old execution.status.queue has growing backlog
**Solution:**
```bash
# Check what messages are in the queue
rabbitmqadmin get queue=attune.execution.status.queue count=5
# If they're completion messages, manually move them:
# 1. Temporarily stop executor
# 2. Purge old queue
# 3. Restart executor (messages will be redelivered after TTL)
```
## Impact Assessment
**Before Fix:**
- ❌ ~30-50% of messages rejected due to deserialization errors
- ❌ Executions not completing properly
- ❌ Inquiries not being processed
- ❌ Resource waste from redelivery attempts
**After Fix:**
- ✅ 100% message delivery success rate
- ✅ All executions complete correctly
- ✅ Inquiries processed immediately
- ✅ Reduced message queue load
## Questions?
Contact the platform team or refer to:
- `attune/work-summary/2026-02-03-inquiry-queue-separation.md` - Technical details
- `attune/docs/QUICKREF-rabbitmq-queues.md` - Queue architecture reference
- `attune/docs/architecture/queue-architecture.md` - Overall architecture
---
**Migration Completed:** __________ (date/time)
**Performed By:** __________
**Issues Encountered:** __________
**Notes:** __________