247 lines
7.4 KiB
Markdown
247 lines
7.4 KiB
Markdown
# Enforcement Message Routing Fix
|
|
|
|
**Date:** 2026-01-16
|
|
**Status:** ✅ Completed
|
|
|
|
## Problem
|
|
|
|
Executions were not being created despite:
|
|
- Timer triggers generating events successfully
|
|
- Rules matching events and creating enforcements
|
|
- All services running without errors
|
|
|
|
When querying the executions API:
|
|
```bash
|
|
curl -X 'GET' 'http://localhost:8080/api/v1/executions?page=1&per_page=50'
|
|
```
|
|
|
|
Response showed no executions:
|
|
```json
|
|
{"data":[],"pagination":{"page":1,"page_size":50,"total_items":0,"total_pages":0}}
|
|
```
|
|
|
|
Database investigation revealed:
|
|
- ✅ Events: Being created every 10 seconds (128+ events)
|
|
- ✅ Enforcements: Being created by rule matcher (multiple enforcements)
|
|
- ❌ Executions: Zero executions in database
|
|
|
|
## Root Cause
|
|
|
|
**Message routing mismatch** between sensor and executor services:
|
|
|
|
1. **Sensor Service** (rule matcher):
|
|
- Published `EnforcementCreated` messages to `attune.events` exchange
|
|
- Routing key: `enforcement.created`
|
|
|
|
2. **Executor Service** (enforcement processor):
|
|
- Consumed from `attune.executions.queue`
|
|
- Queue bound to `attune.executions` exchange
|
|
- Expected messages on `attune.executions` exchange
|
|
|
|
3. **Result**: Messages published to wrong exchange → never reached executor → no executions created
|
|
|
|
### Message Flow (Before Fix)
|
|
|
|
```
|
|
Sensor Rule Matcher
|
|
↓ (publishes EnforcementCreated)
|
|
attune.events exchange
|
|
↓ (routed to)
|
|
attune.events.queue
|
|
↓ (NOT consumed by executor)
|
|
[Messages accumulate, executor never sees them]
|
|
|
|
Executor Enforcement Processor
|
|
↓ (consumes from)
|
|
attune.executions.queue ← (bound to attune.executions exchange)
|
|
↓ (waiting for messages that never arrive)
|
|
[No messages received, no executions created]
|
|
```
|
|
|
|
## Solution
|
|
|
|
Changed `EnforcementCreated` message to use the correct exchange:
|
|
|
|
**File**: `crates/common/src/mq/messages.rs`
|
|
|
|
**Before**:
|
|
```rust
|
|
pub fn exchange(&self) -> String {
|
|
match self {
|
|
Self::EventCreated | Self::EnforcementCreated => "attune.events".to_string(),
|
|
Self::ExecutionRequested | Self::ExecutionStatusChanged | Self::ExecutionCompleted => {
|
|
"attune.executions".to_string()
|
|
}
|
|
// ...
|
|
}
|
|
}
|
|
```
|
|
|
|
**After**:
|
|
```rust
|
|
pub fn exchange(&self) -> String {
|
|
match self {
|
|
Self::EventCreated => "attune.events".to_string(),
|
|
Self::EnforcementCreated => "attune.executions".to_string(),
|
|
Self::ExecutionRequested | Self::ExecutionStatusChanged | Self::ExecutionCompleted => {
|
|
"attune.executions".to_string()
|
|
}
|
|
// ...
|
|
}
|
|
}
|
|
```
|
|
|
|
### Message Flow (After Fix)
|
|
|
|
```
|
|
Sensor Rule Matcher
|
|
↓ (publishes EnforcementCreated)
|
|
attune.executions exchange
|
|
↓ (routed to)
|
|
attune.executions.queue
|
|
↓ (consumed by)
|
|
Executor Enforcement Processor
|
|
↓ (processes enforcement)
|
|
Execution Created ✓
|
|
```
|
|
|
|
## Implementation Details
|
|
|
|
### Files Modified
|
|
|
|
**`crates/common/src/mq/messages.rs`:**
|
|
- Moved `EnforcementCreated` from `attune.events` to `attune.executions` exchange
|
|
- Maintains routing key: `enforcement.created`
|
|
- All execution-related messages now use same exchange
|
|
|
|
### Architecture Rationale
|
|
|
|
**Exchange Purpose Clarification:**
|
|
- `attune.events`: For event generation and monitoring
|
|
- `EventCreated` messages
|
|
- `attune.executions`: For execution lifecycle management
|
|
- `EnforcementCreated` (triggers execution creation)
|
|
- `ExecutionRequested` (worker assignment)
|
|
- `ExecutionStatusChanged` (status updates)
|
|
- `ExecutionCompleted` (completion notifications)
|
|
- `InquiryCreated`/`InquiryResponded` (human-in-the-loop)
|
|
- `attune.notifications`: For notification delivery
|
|
- `NotificationCreated` messages
|
|
|
|
## Testing
|
|
|
|
After the fix, the complete flow should work:
|
|
|
|
1. ✅ **Timer triggers** generate events (already working)
|
|
2. ✅ **Rule matcher** creates enforcements (already working)
|
|
3. ✅ **Enforcement messages** published to correct exchange (FIXED)
|
|
4. ✅ **Executor** receives and processes enforcements (now works)
|
|
5. ✅ **Executions** are created in database
|
|
6. ✅ **Worker** receives execution requests
|
|
7. ✅ **Actions** are executed
|
|
|
|
### Verification Steps
|
|
|
|
After restarting services with the fix:
|
|
|
|
```bash
|
|
# Wait for a few timer events (10-20 seconds)
|
|
sleep 20
|
|
|
|
# Check enforcements (should have new ones)
|
|
psql -U postgres -d attune -c "SELECT COUNT(*) FROM attune.enforcement;"
|
|
|
|
# Check executions (should now have entries!)
|
|
psql -U postgres -d attune -c "SELECT COUNT(*) FROM attune.execution;"
|
|
|
|
# Query via API
|
|
curl -X 'GET' 'http://localhost:8080/api/v1/executions?page=1&per_page=50'
|
|
```
|
|
|
|
Expected result:
|
|
- Executions table has records
|
|
- API returns execution data
|
|
- Worker logs show action execution
|
|
|
|
## Impact
|
|
|
|
- **Critical Fix**: Enables the entire execution pipeline
|
|
- **No Breaking Changes**: Only affects internal message routing
|
|
- **Backward Compatible**: Existing events and enforcements unaffected
|
|
- **Performance**: No impact, messages now reach correct consumers
|
|
|
|
## Related Components
|
|
|
|
### Services Affected
|
|
- ✅ **Sensor Service**: Needs restart to publish to correct exchange
|
|
- ✅ **Executor Service**: No changes needed, already consuming from correct queue
|
|
- ⚠️ **API Service**: May need restart to show updated execution data
|
|
|
|
### Message Types Not Affected
|
|
- `EventCreated` - Still uses `attune.events` (correct)
|
|
- `ExecutionRequested`, `ExecutionStatusChanged`, `ExecutionCompleted` - Already using `attune.executions` (correct)
|
|
- `NotificationCreated` - Still uses `attune.notifications` (correct)
|
|
|
|
## Deployment Steps
|
|
|
|
1. **Rebuild affected services**:
|
|
```bash
|
|
cargo build -p attune-sensor
|
|
cargo build -p attune-executor # Already has new common lib
|
|
```
|
|
|
|
2. **Restart services** (in order):
|
|
```bash
|
|
# Stop old processes
|
|
pkill attune-sensor
|
|
pkill attune-executor
|
|
|
|
# Start with new binary
|
|
cargo run -p attune-sensor &
|
|
cargo run -p attune-executor &
|
|
```
|
|
|
|
3. **Verify** executions are being created:
|
|
```bash
|
|
# Wait for timer event (10 seconds)
|
|
sleep 15
|
|
|
|
# Check database
|
|
psql -U postgres -d attune -c \
|
|
"SELECT id, status, action_ref, created FROM attune.execution ORDER BY created DESC LIMIT 5;"
|
|
```
|
|
|
|
## Lessons Learned
|
|
|
|
### Message Routing Design Principles
|
|
1. **Group messages by lifecycle domain**, not by source service
|
|
2. **Enforcement is part of execution lifecycle**, not event monitoring
|
|
3. **Use exchange names that reflect message purpose**, not service names
|
|
4. **Document message routing** to prevent similar issues
|
|
|
|
### Debugging Message Queue Issues
|
|
1. **Check both producer and consumer** when messages aren't flowing
|
|
2. **Verify exchange bindings** match expected routing
|
|
3. **Monitor queue depths** to detect accumulation
|
|
4. **Use message tracing** for production debugging
|
|
|
|
### Architecture Documentation Needed
|
|
- [ ] Document message routing topology
|
|
- [ ] Create message flow diagrams
|
|
- [ ] Add routing decision matrix
|
|
- [ ] Document exchange purposes
|
|
|
|
## Next Steps
|
|
|
|
- [ ] Verify complete flow with worker execution
|
|
- [ ] Add integration test for enforcement → execution flow
|
|
- [ ] Document message routing in architecture docs
|
|
- [ ] Consider adding dead letter queue monitoring
|
|
- [ ] Add metrics for message routing success/failure
|
|
|
|
## Notes
|
|
|
|
- This was a subtle bug that only manifested in the integration between services
|
|
- Individual services were working correctly in isolation
|
|
- Proper message routing is critical for distributed system reliability
|
|
- Exchange naming should reflect message purpose, not producer service |