attune/work-summary/sessions/2026-01-16-enforcement-routing-fix.md
2026-02-04 17:46:30 -06:00

Enforcement Message Routing Fix

Date: 2026-01-16
Status: Completed

Problem

Executions were not being created despite:

  • Timer triggers generating events successfully
  • Rules matching events and creating enforcements
  • All services running without errors

When querying the executions API:

curl -X 'GET' 'http://localhost:8080/api/v1/executions?page=1&per_page=50'

Response showed no executions:

{"data":[],"pagination":{"page":1,"page_size":50,"total_items":0,"total_pages":0}}

Database investigation revealed:

  • Events: Being created every 10 seconds (128+ events)
  • Enforcements: Being created by rule matcher (multiple enforcements)
  • Executions: Zero executions in database

Root Cause

Message routing mismatch between sensor and executor services:

  1. Sensor Service (rule matcher):

    • Published EnforcementCreated messages to attune.events exchange
    • Routing key: enforcement.created
  2. Executor Service (enforcement processor):

    • Consumed from attune.executions.queue
    • Queue bound to attune.executions exchange
    • Expected messages on attune.executions exchange
  3. Result: Messages published to wrong exchange → never reached executor → no executions created
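
The mismatch can be reproduced with a toy in-memory model of exchanges and queue bindings. This is only a sketch: `Broker`, `bind`, `publish`, `depth`, and `attune_topology` are hypothetical names rather than Attune or broker APIs, routing keys are ignored for brevity, and the two bindings are taken from the flow described in this note.

```rust
use std::collections::HashMap;

/// Toy broker: each exchange delivers to every queue bound to it.
/// Routing keys are ignored to keep the sketch small.
#[derive(Default)]
struct Broker {
    /// exchange name -> names of queues bound to it
    bindings: HashMap<String, Vec<String>>,
    /// queue name -> messages delivered so far
    queues: HashMap<String, Vec<String>>,
}

impl Broker {
    /// Declare `queue` (if needed) and bind it to `exchange`.
    fn bind(&mut self, exchange: &str, queue: &str) {
        self.bindings
            .entry(exchange.to_string())
            .or_default()
            .push(queue.to_string());
        self.queues.entry(queue.to_string()).or_default();
    }

    /// Deliver `message` to every queue bound to `exchange`.
    /// Returns how many queues received it.
    fn publish(&mut self, exchange: &str, message: &str) -> usize {
        let targets = self.bindings.get(exchange).cloned().unwrap_or_default();
        for queue in &targets {
            self.queues
                .entry(queue.clone())
                .or_default()
                .push(message.to_string());
        }
        targets.len()
    }

    /// Number of messages waiting in `queue`.
    fn depth(&self, queue: &str) -> usize {
        self.queues.get(queue).map_or(0, Vec::len)
    }
}

/// Topology as described in this note (assumed one queue per exchange).
fn attune_topology() -> Broker {
    let mut broker = Broker::default();
    broker.bind("attune.events", "attune.events.queue");
    broker.bind("attune.executions", "attune.executions.queue");
    broker
}
```

Publishing `EnforcementCreated` to `attune.events` in this model raises the depth of `attune.events.queue` and leaves `attune.executions.queue` empty, which matches the symptom: no errors anywhere, and nothing for the executor to consume.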

Message Flow (Before Fix)

Sensor Rule Matcher
    ↓ (publishes EnforcementCreated)
attune.events exchange
    ↓ (routed to)
attune.events.queue
    ↓ (NOT consumed by executor)
[Messages accumulate, executor never sees them]

Executor Enforcement Processor
    ↓ (consumes from)
attune.executions.queue ← (bound to attune.executions exchange)
    ↓ (waiting for messages that never arrive)
[No messages received, no executions created]

Solution

Changed EnforcementCreated message to use the correct exchange:

File: crates/common/src/mq/messages.rs

Before:

pub fn exchange(&self) -> String {
    match self {
        Self::EventCreated | Self::EnforcementCreated => "attune.events".to_string(),
        Self::ExecutionRequested | Self::ExecutionStatusChanged | Self::ExecutionCompleted => {
            "attune.executions".to_string()
        }
        // ...
    }
}

After:

pub fn exchange(&self) -> String {
    match self {
        Self::EventCreated => "attune.events".to_string(),
        Self::EnforcementCreated => "attune.executions".to_string(),
        Self::ExecutionRequested | Self::ExecutionStatusChanged | Self::ExecutionCompleted => {
            "attune.executions".to_string()
        }
        // ...
    }
}
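
A regression check along the following lines might have caught the mismatch earlier. The enum is a cut-down stand-in for the real type in crates/common/src/mq/messages.rs: the name `MessageType`, the payload-free variants, the constant `EXECUTOR_EXCHANGE`, and the helper `reaches_executor` are all assumptions made for illustration.

```rust
/// Cut-down stand-in for the message-type enum in
/// crates/common/src/mq/messages.rs (name and shape assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MessageType {
    EventCreated,
    EnforcementCreated,
    ExecutionRequested,
    ExecutionStatusChanged,
    ExecutionCompleted,
}

impl MessageType {
    /// Exchange each message is published to, mirroring the "After" mapping.
    fn exchange(&self) -> String {
        match self {
            Self::EventCreated => "attune.events".to_string(),
            Self::EnforcementCreated
            | Self::ExecutionRequested
            | Self::ExecutionStatusChanged
            | Self::ExecutionCompleted => "attune.executions".to_string(),
        }
    }
}

/// Exchange the executor's queue is bound to, per this note.
const EXECUTOR_EXCHANGE: &str = "attune.executions";

/// True when a message of this type is published where the executor listens.
fn reaches_executor(msg: MessageType) -> bool {
    msg.exchange() == EXECUTOR_EXCHANGE
}
```

In a unit test, `assert!(reaches_executor(MessageType::EnforcementCreated))` would fail against the "Before" mapping and pass against the "After" mapping.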

Message Flow (After Fix)

Sensor Rule Matcher
    ↓ (publishes EnforcementCreated)
attune.executions exchange
    ↓ (routed to)
attune.executions.queue
    ↓ (consumed by)
Executor Enforcement Processor
    ↓ (processes enforcement)
Execution Created ✓

Implementation Details

Files Modified

crates/common/src/mq/messages.rs:

  • Moved EnforcementCreated from attune.events to attune.executions exchange
  • Maintains routing key: enforcement.created
  • All execution-related messages now use the same exchange

Architecture Rationale

Exchange Purpose Clarification:

  • attune.events: For event generation and monitoring
    • EventCreated messages
  • attune.executions: For execution lifecycle management
    • EnforcementCreated (triggers execution creation)
    • ExecutionRequested (worker assignment)
    • ExecutionStatusChanged (status updates)
    • ExecutionCompleted (completion notifications)
    • InquiryCreated/InquiryResponded (human-in-the-loop)
  • attune.notifications: For notification delivery
    • NotificationCreated messages
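
The grouping above could also be kept as a single lookup table that doubles as documentation. The sketch below transcribes the list into data; `ROUTING_TABLE`, `exchange_for`, and `messages_on` are hypothetical names, and message types are represented as plain strings rather than the real enum.

```rust
/// (message type, exchange) pairs transcribed from the list above.
const ROUTING_TABLE: &[(&str, &str)] = &[
    ("EventCreated", "attune.events"),
    ("EnforcementCreated", "attune.executions"),
    ("ExecutionRequested", "attune.executions"),
    ("ExecutionStatusChanged", "attune.executions"),
    ("ExecutionCompleted", "attune.executions"),
    ("InquiryCreated", "attune.executions"),
    ("InquiryResponded", "attune.executions"),
    ("NotificationCreated", "attune.notifications"),
];

/// Exchange for a message type, or `None` if it is not in the table.
fn exchange_for(message: &str) -> Option<&'static str> {
    ROUTING_TABLE
        .iter()
        .find(|(name, _)| *name == message)
        .map(|(_, exchange)| *exchange)
}

/// All message types carried by `exchange`, in table order.
fn messages_on(exchange: &str) -> Vec<&'static str> {
    ROUTING_TABLE
        .iter()
        .filter(|(_, ex)| *ex == exchange)
        .map(|(name, _)| *name)
        .collect()
}
```

Keeping the mapping in one table makes a misplaced entry visible in review, since each message type appears on exactly one line next to its exchange.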

Testing

After the fix, the complete flow should work:

  1. Timer triggers generate events (already working)
  2. Rule matcher creates enforcements (already working)
  3. Enforcement messages published to correct exchange (FIXED)
  4. Executor receives and processes enforcements (now works)
  5. Executions are created in database
  6. Worker receives execution requests
  7. Actions are executed

Verification Steps

After restarting services with the fix:

# Wait for a few timer events (10-20 seconds)
sleep 20

# Check enforcements (should have new ones)
psql -U postgres -d attune -c "SELECT COUNT(*) FROM attune.enforcement;"

# Check executions (should now have entries!)
psql -U postgres -d attune -c "SELECT COUNT(*) FROM attune.execution;"

# Query via API
curl -X 'GET' 'http://localhost:8080/api/v1/executions?page=1&per_page=50'

Expected result:

  • Executions table has records
  • API returns execution data
  • Worker logs show action execution

Impact

  • Critical Fix: Enables the entire execution pipeline
  • No Breaking Changes: Only affects internal message routing
  • Backward Compatible: Existing events and enforcements unaffected
  • Performance: No impact; messages now reach the correct consumers

Services Affected

  • Sensor Service: Needs restart to publish to correct exchange
  • Executor Service: No changes needed, already consuming from correct queue
  • ⚠️ API Service: May need restart to show updated execution data

Message Types Not Affected

  • EventCreated - Still uses attune.events (correct)
  • ExecutionRequested, ExecutionStatusChanged, ExecutionCompleted - Already using attune.executions (correct)
  • NotificationCreated - Still uses attune.notifications (correct)

Deployment Steps

  1. Rebuild affected services:

    cargo build -p attune-sensor
    cargo build -p attune-executor  # No executor code changes; rebuild picks up the updated common crate
    
  2. Restart services (in order):

    # Stop old processes
    pkill attune-sensor
    pkill attune-executor
    
    # Start with new binary
    cargo run -p attune-sensor &
    cargo run -p attune-executor &
    
  3. Verify executions are being created:

    # Wait for at least one timer event (events fire every 10 seconds)
    sleep 15
    
    # Check database
    psql -U postgres -d attune -c \
      "SELECT id, status, action_ref, created FROM attune.execution ORDER BY created DESC LIMIT 5;"
    

Lessons Learned

Message Routing Design Principles

  1. Group messages by lifecycle domain, not by source service
  2. Enforcement is part of execution lifecycle, not event monitoring
  3. Use exchange names that reflect message purpose, not service names
  4. Document message routing to prevent similar issues

Debugging Message Queue Issues

  1. Check both producer and consumer when messages aren't flowing
  2. Verify exchange bindings match expected routing
  3. Monitor queue depths to detect accumulation
  4. Use message tracing for production debugging
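
The first two checks can be expressed as a small consistency test over the declared topology. This is a sketch under assumptions: `Route` and `misrouted` are hypothetical names, and `bindings` is a hand-written queue-to-exchange map rather than something read from a live broker.

```rust
use std::collections::HashMap;

/// One producer/consumer pairing to verify.
struct Route<'a> {
    message: &'a str,
    /// Exchange the producer publishes to.
    published_to: &'a str,
    /// Queue the intended consumer reads from.
    consumer_queue: &'a str,
}

/// Return the messages whose consumer queue is not bound to the
/// exchange they are published to. `bindings` maps queue -> exchange;
/// a queue missing from the map counts as misrouted.
fn misrouted<'a>(routes: &[Route<'a>], bindings: &HashMap<&str, &str>) -> Vec<&'a str> {
    routes
        .iter()
        .filter(|r| {
            bindings
                .get(r.consumer_queue)
                .map_or(true, |bound| *bound != r.published_to)
        })
        .map(|r| r.message)
        .collect()
}
```

Given the bindings described in this note, a route that publishes `EnforcementCreated` to `attune.events` while the executor reads `attune.executions.queue` is reported as misrouted; switching `published_to` to `attune.executions` clears it.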

Architecture Documentation Needed

  • Document message routing topology
  • Create message flow diagrams
  • Add routing decision matrix
  • Document exchange purposes

Next Steps

  • Verify complete flow with worker execution
  • Add integration test for enforcement → execution flow
  • Document message routing in architecture docs
  • Consider adding dead letter queue monitoring
  • Add metrics for message routing success/failure

Notes

  • This was a subtle bug that only manifested in the integration between services
  • Individual services were working correctly in isolation
  • Proper message routing is critical for distributed system reliability
  • Exchange naming should reflect message purpose, not producer service