re-uploading work
This commit is contained in:
247
work-summary/sessions/2026-01-16-enforcement-routing-fix.md
Normal file
247
work-summary/sessions/2026-01-16-enforcement-routing-fix.md
Normal file
@@ -0,0 +1,247 @@
|
||||
# Enforcement Message Routing Fix
|
||||
|
||||
**Date:** 2026-01-16
|
||||
**Status:** ✅ Completed
|
||||
|
||||
## Problem
|
||||
|
||||
Executions were not being created despite:
|
||||
- Timer triggers generating events successfully
|
||||
- Rules matching events and creating enforcements
|
||||
- All services running without errors
|
||||
|
||||
When querying the executions API:
|
||||
```bash
|
||||
curl -X 'GET' 'http://localhost:8080/api/v1/executions?page=1&per_page=50'
|
||||
```
|
||||
|
||||
Response showed no executions:
|
||||
```json
|
||||
{"data":[],"pagination":{"page":1,"page_size":50,"total_items":0,"total_pages":0}}
|
||||
```
|
||||
|
||||
Database investigation revealed:
|
||||
- ✅ Events: Being created every 10 seconds (128+ events)
|
||||
- ✅ Enforcements: Being created by rule matcher (multiple enforcements)
|
||||
- ❌ Executions: Zero executions in database
|
||||
|
||||
## Root Cause
|
||||
|
||||
**Message routing mismatch** between sensor and executor services:
|
||||
|
||||
1. **Sensor Service** (rule matcher):
|
||||
- Published `EnforcementCreated` messages to `attune.events` exchange
|
||||
- Routing key: `enforcement.created`
|
||||
|
||||
2. **Executor Service** (enforcement processor):
|
||||
- Consumed from `attune.executions.queue`
|
||||
- Queue bound to `attune.executions` exchange
|
||||
- Expected messages on `attune.executions` exchange
|
||||
|
||||
3. **Result**: Messages published to wrong exchange → never reached executor → no executions created
|
||||
|
||||
### Message Flow (Before Fix)
|
||||
|
||||
```
|
||||
Sensor Rule Matcher
|
||||
↓ (publishes EnforcementCreated)
|
||||
attune.events exchange
|
||||
↓ (routed to)
|
||||
attune.events.queue
|
||||
↓ (NOT consumed by executor)
|
||||
[Messages accumulate, executor never sees them]
|
||||
|
||||
Executor Enforcement Processor
|
||||
↓ (consumes from)
|
||||
attune.executions.queue ← (bound to attune.executions exchange)
|
||||
↓ (waiting for messages that never arrive)
|
||||
[No messages received, no executions created]
|
||||
```
|
||||
|
||||
## Solution
|
||||
|
||||
Changed `EnforcementCreated` message to use the correct exchange:
|
||||
|
||||
**File**: `crates/common/src/mq/messages.rs`
|
||||
|
||||
**Before**:
|
||||
```rust
|
||||
pub fn exchange(&self) -> String {
|
||||
match self {
|
||||
Self::EventCreated | Self::EnforcementCreated => "attune.events".to_string(),
|
||||
Self::ExecutionRequested | Self::ExecutionStatusChanged | Self::ExecutionCompleted => {
|
||||
"attune.executions".to_string()
|
||||
}
|
||||
// ...
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**After**:
|
||||
```rust
|
||||
pub fn exchange(&self) -> String {
|
||||
match self {
|
||||
Self::EventCreated => "attune.events".to_string(),
|
||||
Self::EnforcementCreated => "attune.executions".to_string(),
|
||||
Self::ExecutionRequested | Self::ExecutionStatusChanged | Self::ExecutionCompleted => {
|
||||
"attune.executions".to_string()
|
||||
}
|
||||
// ...
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Message Flow (After Fix)
|
||||
|
||||
```
|
||||
Sensor Rule Matcher
|
||||
↓ (publishes EnforcementCreated)
|
||||
attune.executions exchange
|
||||
↓ (routed to)
|
||||
attune.executions.queue
|
||||
↓ (consumed by)
|
||||
Executor Enforcement Processor
|
||||
↓ (processes enforcement)
|
||||
Execution Created ✓
|
||||
```
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### Files Modified
|
||||
|
||||
**`crates/common/src/mq/messages.rs`:**
|
||||
- Moved `EnforcementCreated` from `attune.events` to `attune.executions` exchange
|
||||
- Maintains routing key: `enforcement.created`
|
||||
- All execution-related messages now use same exchange
|
||||
|
||||
### Architecture Rationale
|
||||
|
||||
**Exchange Purpose Clarification:**
|
||||
- `attune.events`: For event generation and monitoring
|
||||
- `EventCreated` messages
|
||||
- `attune.executions`: For execution lifecycle management
|
||||
- `EnforcementCreated` (triggers execution creation)
|
||||
- `ExecutionRequested` (worker assignment)
|
||||
- `ExecutionStatusChanged` (status updates)
|
||||
- `ExecutionCompleted` (completion notifications)
|
||||
- `InquiryCreated`/`InquiryResponded` (human-in-the-loop)
|
||||
- `attune.notifications`: For notification delivery
|
||||
- `NotificationCreated` messages
|
||||
|
||||
## Testing
|
||||
|
||||
After the fix, the complete flow should work:
|
||||
|
||||
1. ✅ **Timer triggers** generate events (already working)
|
||||
2. ✅ **Rule matcher** creates enforcements (already working)
|
||||
3. ✅ **Enforcement messages** published to correct exchange (FIXED)
|
||||
4. ✅ **Executor** receives and processes enforcements (now works)
|
||||
5. ✅ **Executions** are created in database
|
||||
6. ✅ **Worker** receives execution requests
|
||||
7. ✅ **Actions** are executed
|
||||
|
||||
### Verification Steps
|
||||
|
||||
After restarting services with the fix:
|
||||
|
||||
```bash
|
||||
# Wait for a few timer events (10-20 seconds)
|
||||
sleep 20
|
||||
|
||||
# Check enforcements (should have new ones)
|
||||
psql -U postgres -d attune -c "SELECT COUNT(*) FROM attune.enforcement;"
|
||||
|
||||
# Check executions (should now have entries!)
|
||||
psql -U postgres -d attune -c "SELECT COUNT(*) FROM attune.execution;"
|
||||
|
||||
# Query via API
|
||||
curl -X 'GET' 'http://localhost:8080/api/v1/executions?page=1&per_page=50'
|
||||
```
|
||||
|
||||
Expected result:
|
||||
- Executions table has records
|
||||
- API returns execution data
|
||||
- Worker logs show action execution
|
||||
|
||||
## Impact
|
||||
|
||||
- **Critical Fix**: Enables the entire execution pipeline
|
||||
- **No Breaking Changes**: Only affects internal message routing
|
||||
- **Backward Compatible**: Existing events and enforcements unaffected
|
||||
- **Performance**: No impact, messages now reach correct consumers
|
||||
|
||||
## Related Components
|
||||
|
||||
### Services Affected
|
||||
- ✅ **Sensor Service**: Needs restart to publish to correct exchange
|
||||
- ✅ **Executor Service**: No changes needed, already consuming from correct queue
|
||||
- ⚠️ **API Service**: May need restart to show updated execution data
|
||||
|
||||
### Message Types Not Affected
|
||||
- `EventCreated` - Still uses `attune.events` (correct)
|
||||
- `ExecutionRequested`, `ExecutionStatusChanged`, `ExecutionCompleted` - Already using `attune.executions` (correct)
|
||||
- `NotificationCreated` - Still uses `attune.notifications` (correct)
|
||||
|
||||
## Deployment Steps
|
||||
|
||||
1. **Rebuild affected services**:
|
||||
```bash
|
||||
cargo build -p attune-sensor
|
||||
cargo build -p attune-executor # Already has new common lib
|
||||
```
|
||||
|
||||
2. **Restart services** (in order):
|
||||
```bash
|
||||
# Stop old processes
|
||||
pkill attune-sensor
|
||||
pkill attune-executor
|
||||
|
||||
# Start with new binary
|
||||
cargo run -p attune-sensor &
|
||||
cargo run -p attune-executor &
|
||||
```
|
||||
|
||||
3. **Verify** executions are being created:
|
||||
```bash
|
||||
# Wait for timer event (10 seconds)
|
||||
sleep 15
|
||||
|
||||
# Check database
|
||||
psql -U postgres -d attune -c \
|
||||
"SELECT id, status, action_ref, created FROM attune.execution ORDER BY created DESC LIMIT 5;"
|
||||
```
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
### Message Routing Design Principles
|
||||
1. **Group messages by lifecycle domain**, not by source service
|
||||
2. **Enforcement is part of execution lifecycle**, not event monitoring
|
||||
3. **Use exchange names that reflect message purpose**, not service names
|
||||
4. **Document message routing** to prevent similar issues
|
||||
|
||||
### Debugging Message Queue Issues
|
||||
1. **Check both producer and consumer** when messages aren't flowing
|
||||
2. **Verify exchange bindings** match expected routing
|
||||
3. **Monitor queue depths** to detect accumulation
|
||||
4. **Use message tracing** for production debugging
|
||||
|
||||
### Architecture Documentation Needed
|
||||
- [ ] Document message routing topology
|
||||
- [ ] Create message flow diagrams
|
||||
- [ ] Add routing decision matrix
|
||||
- [ ] Document exchange purposes
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [ ] Verify complete flow with worker execution
|
||||
- [ ] Add integration test for enforcement → execution flow
|
||||
- [ ] Document message routing in architecture docs
|
||||
- [ ] Consider adding dead letter queue monitoring
|
||||
- [ ] Add metrics for message routing success/failure
|
||||
|
||||
## Notes
|
||||
|
||||
- This was a subtle bug that only manifested in the integration between services
|
||||
- Individual services were working correctly in isolation
|
||||
- Proper message routing is critical for distributed system reliability
|
||||
- Exchange naming should reflect message purpose, not producer service
|
||||
Reference in New Issue
Block a user