attune/work-summary/2026-02-03-inquiry-queue-separation.md
2026-02-04 17:46:30 -06:00

Work Summary: Inquiry Queue Separation Fix

Date: 2026-02-03
Issues:

  • Executor deserialization error: "missing field inquiry_id"
  • Executor deserialization error: "missing field action_id"

Status: Both Fixed

Visual Overview

Before Fix

attune.execution.status.queue
    ├─ Consumer: CompletionListener (expects ExecutionCompletedPayload)
    ├─ Consumer: ExecutionManager (expects ExecutionStatusPayload)
    └─ Consumer: InquiryHandler (expects InquiryRespondedPayload)
    
    Incoming Messages:
    - execution.completed → ExecutionCompletedPayload
    - execution.status.changed → ExecutionStatusChangedPayload
    - inquiry.responded → InquiryRespondedPayload
    
    Problem: Round-robin distribution causes wrong consumer to receive wrong message type!

After Fix

attune.execution.completed.queue
    ├─ Consumer: CompletionListener (expects ExecutionCompletedPayload)
    └─ Message: execution.completed → ExecutionCompletedPayload ✓

attune.execution.status.queue
    ├─ Consumer: ExecutionManager (expects ExecutionStatusPayload)
    └─ Message: execution.status.changed → ExecutionStatusChangedPayload ✓

attune.inquiry.responses.queue
    ├─ Consumer: InquiryHandler (expects InquiryRespondedPayload)
    └─ Message: inquiry.responded → InquiryRespondedPayload ✓
    
    Result: Each queue has ONE consumer expecting ONE message type!

Problem Description

The executor service was logging deserialization errors when processing messages from the execution_status queue:

ERROR ThreadId(13) crates/common/src/mq/consumer.rs:112: Failed to deserialize message: missing field `inquiry_id` at line 1 column 318. Rejecting message.

Root Cause Analysis

The issue was caused by two different consumers listening to the same RabbitMQ queue but expecting different message payload types:

Queue Configuration Issue

The execution_status queue (attune.execution.status.queue) was bound to the attune.executions exchange with routing key "execution.status.changed", but it was receiving messages with two different routing keys:

  1. execution.completed → ExecutionCompletedPayload (published by Worker service)
  2. inquiry.responded → InquiryRespondedPayload (published by API service)

Competing Consumers

Two consumers were configured to read from the same execution_status queue:

  1. CompletionListener (executor.completion tag)

    • Expected: ExecutionCompletedPayload
    • Fields: execution_id, action_id, action_ref, status, result, completed_at
  2. InquiryHandler (executor.inquiry tag)

    • Expected: InquiryRespondedPayload
    • Fields: inquiry_id, execution_id, response, responded_by, responded_at

Message Routing Behavior

RabbitMQ distributes messages to consumers on the same queue using round-robin load balancing. This meant:

  • When an InquiryRespondedPayload was delivered to CompletionListener → deserialization failed (missing action_id)
  • When an ExecutionCompletedPayload was delivered to InquiryHandler → deserialization failed (missing inquiry_id)

The logged error mentioned inquiry_id because InquiryHandler tried to deserialize an execution completion message, which carries no inquiry_id field.
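
The failure mode can be reproduced without a broker. The following is a minimal sketch, not the executor's actual code: it models each payload only by the field names listed above, treats deserialization as "fail on the first required field that is absent", and hands messages to two consumers in strict rotation. The names `Fields`, `Consumer`, `try_deserialize`, and `round_robin` are hypothetical, and a real broker's dispatch order also depends on prefetch and acknowledgement timing.

```rust
/// Field names a payload carries on the wire (a stand-in for the JSON body).
type Fields = &'static [&'static str];

// Field lists taken from the "Competing Consumers" section above.
const EXECUTION_COMPLETED: Fields =
    &["execution_id", "action_id", "action_ref", "status", "result", "completed_at"];
const INQUIRY_RESPONDED: Fields =
    &["inquiry_id", "execution_id", "response", "responded_by", "responded_at"];

/// A consumer is modeled only by its name and the fields it requires.
struct Consumer {
    name: &'static str,
    required: Fields,
}

impl Consumer {
    /// Mimics strict deserialization: fail on the first required field
    /// that the incoming message does not contain.
    fn try_deserialize(&self, message: Fields) -> Result<(), String> {
        match self.required.iter().find(|f| !message.contains(*f)) {
            Some(missing) => Err(format!("missing field `{}`", missing)),
            None => Ok(()),
        }
    }
}

/// Hands message i to consumer (i mod n), the simplest model of a shared
/// queue with competing consumers, and records each outcome.
fn round_robin(
    consumers: &[Consumer],
    messages: &[Fields],
) -> Vec<(&'static str, Result<(), String>)> {
    messages
        .iter()
        .enumerate()
        .map(|(i, msg)| {
            let consumer = &consumers[i % consumers.len()];
            (consumer.name, consumer.try_deserialize(msg))
        })
        .collect()
}
```

With CompletionListener first in the rotation, publishing an inquiry response followed by a completion makes both deliveries fail, while the reverse order succeeds for both. That order dependence is why the errors appeared only intermittently.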

Solution Implemented

1. Created Separate Queue for Inquiry Responses

File: attune/crates/common/src/mq/config.rs

Added two new queue configurations:

pub struct QueuesConfig {
    // ... existing queues ...

    /// Execution completed queue configuration
    pub execution_completed: QueueConfig,

    /// Inquiry responses queue configuration
    pub inquiry_responses: QueueConfig,
}

Default configurations:

execution_completed: QueueConfig {
    name: "attune.execution.completed.queue".to_string(),
    durable: true,
    exclusive: false,
    auto_delete: false,
},
inquiry_responses: QueueConfig {
    name: "attune.inquiry.responses.queue".to_string(),
    durable: true,
    exclusive: false,
    auto_delete: false,
}
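
For readers without the crate at hand, here is a self-contained sketch of how these defaults might fit together. It is an assumption-laden mirror, not the contents of config.rs: the `QueueConfig` fields are copied from the snippet above, `QueuesConfig` is trimmed to the two new queues, and the `durable_shared` helper is hypothetical.

```rust
/// Mirror of the four `QueueConfig` fields shown in the defaults above.
#[derive(Debug, Clone, PartialEq)]
pub struct QueueConfig {
    pub name: String,
    pub durable: bool,
    pub exclusive: bool,
    pub auto_delete: bool,
}

impl QueueConfig {
    /// Hypothetical helper: a queue that survives broker restarts, may be
    /// shared between connections, and is not deleted when consumers leave.
    pub fn durable_shared(name: &str) -> Self {
        Self {
            name: name.to_string(),
            durable: true,
            exclusive: false,
            auto_delete: false,
        }
    }
}

/// Trimmed to the two queues added by this fix; the real struct has more.
#[derive(Debug, Clone, PartialEq)]
pub struct QueuesConfig {
    pub execution_completed: QueueConfig,
    pub inquiry_responses: QueueConfig,
}

impl Default for QueuesConfig {
    fn default() -> Self {
        Self {
            execution_completed: QueueConfig::durable_shared("attune.execution.completed.queue"),
            inquiry_responses: QueueConfig::durable_shared("attune.inquiry.responses.queue"),
        }
    }
}
```

Routing both defaults through one constructor keeps the durability flags identical across the new queues, so a later change to one flag cannot silently diverge between them.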

2. Updated Infrastructure Setup

File: attune/crates/common/src/mq/connection.rs

Added queue declarations and bindings in setup_infrastructure():

// Declare the new queues with DLX support
self.declare_queue_with_dlx(&config.rabbitmq.queues.execution_completed, dlx).await?;
self.declare_queue_with_dlx(&config.rabbitmq.queues.inquiry_responses, dlx).await?;

// Bind execution_status queue to status changed messages for ExecutionManager
self.bind_queue(
    &config.rabbitmq.queues.execution_status.name,
    &config.rabbitmq.exchanges.executions.name,
    "execution.status.changed",
)
.await?;

// Bind execution_completed queue to completed messages for CompletionListener
self.bind_queue(
    &config.rabbitmq.queues.execution_completed.name,
    &config.rabbitmq.exchanges.executions.name,
    "execution.completed",
)
.await?;

// Bind inquiry_responses queue to inquiry responded messages for InquiryHandler
self.bind_queue(
    &config.rabbitmq.queues.inquiry_responses.name,
    &config.rabbitmq.exchanges.executions.name,
    "inquiry.responded",
)
.await?;
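
The effect of these three bindings can be checked with a small routing table. This sketch assumes exact routing-key matching only (topic wildcards are deliberately not modeled), and the `Binding` alias, `BINDINGS` constant, and `route` function are hypothetical names, not part of connection.rs.

```rust
/// One binding: (queue name, exchange name, routing key).
type Binding = (&'static str, &'static str, &'static str);

// The three bindings declared in setup_infrastructure() above.
const BINDINGS: &[Binding] = &[
    ("attune.execution.status.queue", "attune.executions", "execution.status.changed"),
    ("attune.execution.completed.queue", "attune.executions", "execution.completed"),
    ("attune.inquiry.responses.queue", "attune.executions", "inquiry.responded"),
];

/// Returns every queue that would receive a message published to
/// `exchange` with `routing_key`, using exact key comparison.
fn route(bindings: &[Binding], exchange: &str, routing_key: &str) -> Vec<&'static str> {
    bindings
        .iter()
        .filter(|(_, ex, key)| *ex == exchange && *key == routing_key)
        .map(|(queue, _, _)| *queue)
        .collect()
}
```

Each of the three routing keys resolves to exactly one queue, and an unbound key resolves to none. A key that resolved to a queue already serving a different payload type would reintroduce the original bug.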

3. Updated Executor Service Configuration

File: attune/crates/executor/src/service.rs

Changed InquiryHandler and CompletionListener to consume from dedicated queues:

// InquiryHandler - Before:
let inquiry_response_queue = self.inner.mq_config.rabbitmq.queues.execution_status.name.clone();

// InquiryHandler - After:
let inquiry_response_queue = self.inner.mq_config.rabbitmq.queues.inquiry_responses.name.clone();

// CompletionListener - Before:
let execution_completed_queue = self.inner.mq_config.rabbitmq.queues.execution_status.name.clone();

// CompletionListener - After:
let execution_completed_queue = self.inner.mq_config.rabbitmq.queues.execution_completed.name.clone();

Message Flow After Fix

Execution Completion Flow

Worker → publishes ExecutionCompletedPayload
       → routing key: "execution.completed"
       → exchange: "attune.executions"
       → queue: "attune.execution.completed.queue"
       → consumer: CompletionListener
       ✅ Correct payload type received

Execution Status Change Flow

Worker → publishes ExecutionStatusChangedPayload
       → routing key: "execution.status.changed"
       → exchange: "attune.executions"
       → queue: "attune.execution.status.queue"
       → consumer: ExecutionManager
       ✅ Correct payload type received

Inquiry Response Flow

API → publishes InquiryRespondedPayload
    → routing key: "inquiry.responded"
    → exchange: "attune.executions"
    → queue: "attune.inquiry.responses.queue"
    → consumer: InquiryHandler
    ✅ Correct payload type received

Benefits

  1. Type Safety: Each queue receives only one message type, eliminating deserialization errors
  2. Scalability: Can scale CompletionListener, ExecutionManager, and InquiryHandler independently
  3. Maintainability: Clear separation of concerns - each queue has a single purpose
  4. Reliability: No message rejection due to type mismatches
  5. Performance: No wasted processing from consumers receiving wrong message types

Queue Separation Summary

After both fixes, we now have three dedicated queues for execution-related messages:

| Queue | Routing Key | Message Type | Consumer |
| --- | --- | --- | --- |
| attune.execution.status.queue | execution.status.changed | ExecutionStatusChangedPayload | ExecutionManager |
| attune.execution.completed.queue | execution.completed | ExecutionCompletedPayload | CompletionListener |
| attune.inquiry.responses.queue | inquiry.responded | InquiryRespondedPayload | InquiryHandler |

Result: Each queue now has exactly one consumer expecting exactly one message type.
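
The "one queue, one consumer, one message type" invariant is simple enough to check mechanically, for example in a unit test over the configured topology. The sketch below is one possible shape for such a check; the `Row` alias and `shared_queues` function are hypothetical and operate on plain string tuples, not on the real configuration types.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One row of a topology table: (queue, message type, consumer).
type Row = (&'static str, &'static str, &'static str);

/// Returns, in sorted order, the queues that carry more than one message
/// type or feed more than one consumer.
fn shared_queues(rows: &[Row]) -> Vec<&'static str> {
    // queue -> (distinct message types, distinct consumers)
    let mut seen: BTreeMap<&'static str, (BTreeSet<&'static str>, BTreeSet<&'static str>)> =
        BTreeMap::new();
    for &(queue, message, consumer) in rows {
        let entry = seen.entry(queue).or_default();
        entry.0.insert(message);
        entry.1.insert(consumer);
    }
    seen.into_iter()
        .filter(|(_, (messages, consumers))| messages.len() > 1 || consumers.len() > 1)
        .map(|(queue, _)| queue)
        .collect()
}
```

Run against the pre-fix layout, the function reports attune.execution.status.queue; run against the three rows of the summary table, it reports nothing.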

Testing Recommendations

  1. Restart all services to recreate the queue infrastructure with new bindings
  2. Verify queue creation in RabbitMQ management UI:
    • Check that attune.inquiry.responses.queue exists
    • Check that attune.execution.completed.queue exists
    • Verify bindings on attune.executions exchange:
      • inquiry.responded → attune.inquiry.responses.queue
      • execution.completed → attune.execution.completed.queue
      • execution.status.changed → attune.execution.status.queue
  3. Monitor executor logs for absence of deserialization errors (inquiry_id and action_id)
  4. Test inquiry workflow:
    • Create an action that requests inquiry (__inquiry in result)
    • Respond to inquiry via API
    • Verify execution resumes correctly
  5. Test execution completion:
    • Execute a simple action
    • Verify completion notification processed without errors
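
Step 3 can be automated with a small log scan. The sketch below assumes the log line format shown in the Problem Description and nothing more; `missing_fields` is a hypothetical helper, and a different log format would need a different marker string.

```rust
/// Collects the field name from every log line of the form
/// "... Failed to deserialize message: missing field `<name>` ...".
fn missing_fields(log: &str) -> Vec<&str> {
    const MARKER: &str = "Failed to deserialize message: missing field `";
    log.lines()
        .filter_map(|line| {
            // Skip lines without the marker; otherwise jump past it.
            let start = line.find(MARKER)? + MARKER.len();
            let rest = &line[start..];
            // The field name runs up to the closing backtick.
            let end = rest.find('`')?;
            Some(&rest[..end])
        })
        .collect()
}
```

After the fix, running this over a fresh executor log should return an empty list; any inquiry_id or action_id entry means a consumer is still reading from a shared queue.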

Files Modified

  • attune/crates/common/src/mq/config.rs - Added inquiry_responses and execution_completed queues
  • attune/crates/common/src/mq/connection.rs - Added queue declarations and bindings
  • attune/crates/executor/src/service.rs - Updated InquiryHandler and CompletionListener to use new queues

Migration Notes

This is a breaking change for existing deployments:

  1. Two new queues will be created automatically on service startup:
    • attune.inquiry.responses.queue
    • attune.execution.completed.queue
  2. The execution_status queue now has only one binding (execution.status.changed)
  3. Existing messages in queues are unaffected
  4. No database migrations required
  5. Action Required: Restart executor service to apply changes

Additional Notes

  • Original implementation assumed a single queue could handle multiple message types
  • RabbitMQ round-robin distribution caused non-deterministic deserialization failures
  • Errors were intermittent because they depended on which consumer received which message
  • ExecutionManager uses a local payload struct instead of the canonical ExecutionStatusChangedPayload (not critical, but should be unified in the future)

Lessons Learned

  1. One queue, one message type: RabbitMQ queues should have a single message schema
  2. One queue, one consumer: Consumers that expect different payloads on the same queue compete for messages instead of cooperating
  3. Use routing keys effectively: Topic exchanges with specific routing keys provide better message segregation
  4. Consumer tag awareness: Consumer tags don't prevent round-robin distribution within the same queue
  5. Type-safe patterns: Rust's strong typing revealed the issue quickly through deserialization errors
  6. Canonical message types: Use shared message structs from attune_common::mq::messages, not local definitions
  7. Incremental fixes: Sometimes you discover deeper issues while fixing surface-level problems - fix them all at once
  8. Test thoroughly: Restart services and monitor logs to catch related issues before they reach production