re-uploading work
This commit is contained in:
279
work-summary/2026-02-03-inquiry-queue-separation.md
Normal file
279
work-summary/2026-02-03-inquiry-queue-separation.md
Normal file
@@ -0,0 +1,279 @@
|
||||
# Work Summary: Inquiry Queue Separation Fix
|
||||
|
||||
**Date:** 2026-02-03
|
||||
**Issues:**
|
||||
- Executor deserialization error: "missing field `inquiry_id`"
|
||||
- Executor deserialization error: "missing field `action_id`"
|
||||
|
||||
**Status:** ✅ Both Fixed
|
||||
|
||||
## Visual Overview
|
||||
|
||||
### Before Fix ❌
|
||||
```
|
||||
attune.execution.status.queue
|
||||
├─ Consumer: CompletionListener (expects ExecutionCompletedPayload)
|
||||
├─ Consumer: ExecutionManager (expects ExecutionStatusPayload)
|
||||
└─ Consumer: InquiryHandler (expects InquiryRespondedPayload)
|
||||
|
||||
Incoming Messages:
|
||||
- execution.completed → ExecutionCompletedPayload
|
||||
- execution.status.changed → ExecutionStatusChangedPayload
|
||||
- inquiry.responded → InquiryRespondedPayload
|
||||
|
||||
Problem: Round-robin distribution causes wrong consumer to receive wrong message type!
|
||||
```
|
||||
|
||||
### After Fix ✅
|
||||
```
|
||||
attune.execution.completed.queue
|
||||
└─ Consumer: CompletionListener (expects ExecutionCompletedPayload)
|
||||
└─ Message: execution.completed → ExecutionCompletedPayload ✓
|
||||
|
||||
attune.execution.status.queue
|
||||
└─ Consumer: ExecutionManager (expects ExecutionStatusPayload)
|
||||
└─ Message: execution.status.changed → ExecutionStatusChangedPayload ✓
|
||||
|
||||
attune.inquiry.responses.queue
|
||||
└─ Consumer: InquiryHandler (expects InquiryRespondedPayload)
|
||||
└─ Message: inquiry.responded → InquiryRespondedPayload ✓
|
||||
|
||||
Result: Each queue has ONE consumer expecting ONE message type!
|
||||
```
|
||||
|
||||
## Problem Description
|
||||
|
||||
The executor service was logging deserialization errors when processing messages from the `execution_status` queue:
|
||||
|
||||
```
|
||||
ERROR ThreadId(13) crates/common/src/mq/consumer.rs:112: Failed to deserialize message: missing field `inquiry_id` at line 1 column 318. Rejecting message.
|
||||
```
|
||||
|
||||
## Root Cause Analysis
|
||||
|
||||
The issue was caused by **two different consumers listening to the same RabbitMQ queue** but expecting different message payload types:
|
||||
|
||||
### Queue Configuration Issue
|
||||
|
||||
The `execution_status` queue (`attune.execution.status.queue`) was bound to the `attune.executions` exchange with routing key `"execution.status.changed"`, but it was receiving messages with two different routing keys:
|
||||
|
||||
1. **`execution.completed`** → `ExecutionCompletedPayload` (published by Worker service)
|
||||
2. **`inquiry.responded`** → `InquiryRespondedPayload` (published by API service)
|
||||
|
||||
### Competing Consumers
|
||||
|
||||
Two consumers were configured to read from the same `execution_status` queue:
|
||||
|
||||
1. **CompletionListener** (`executor.completion` tag)
|
||||
- Expected: `ExecutionCompletedPayload`
|
||||
- Fields: `execution_id`, `action_id`, `action_ref`, `status`, `result`, `completed_at`
|
||||
|
||||
2. **InquiryHandler** (`executor.inquiry` tag)
|
||||
- Expected: `InquiryRespondedPayload`
|
||||
- Fields: `inquiry_id`, `execution_id`, `response`, `responded_by`, `responded_at`
|
||||
|
||||
### Message Routing Behavior
|
||||
|
||||
RabbitMQ distributes messages to consumers on the same queue using **round-robin load balancing**. This meant:
|
||||
|
||||
- When an `InquiryRespondedPayload` was delivered to `CompletionListener` → **deserialization failed** (missing `inquiry_id`)
|
||||
- When an `ExecutionCompletedPayload` was delivered to `InquiryHandler` → **deserialization failed** (missing `action_id`)
|
||||
|
||||
The error message specifically mentioned `inquiry_id` because `CompletionListener` tried to deserialize an inquiry response message.
|
||||
|
||||
## Solution Implemented
|
||||
|
||||
### 1. Created Separate Queue for Inquiry Responses
|
||||
|
||||
**File:** `attune/crates/common/src/mq/config.rs`
|
||||
|
||||
Added a new queue configuration:
|
||||
|
||||
```rust
|
||||
pub struct QueuesConfig {
|
||||
// ... existing queues ...
|
||||
|
||||
/// Inquiry responses queue configuration
|
||||
pub inquiry_responses: QueueConfig,
|
||||
}
|
||||
```
|
||||
|
||||
Default configurations:
|
||||
```rust
|
||||
execution_completed: QueueConfig {
|
||||
name: "attune.execution.completed.queue".to_string(),
|
||||
durable: true,
|
||||
exclusive: false,
|
||||
auto_delete: false,
|
||||
},
|
||||
inquiry_responses: QueueConfig {
|
||||
name: "attune.inquiry.responses.queue".to_string(),
|
||||
durable: true,
|
||||
exclusive: false,
|
||||
auto_delete: false,
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Updated Infrastructure Setup
|
||||
|
||||
**File:** `attune/crates/common/src/mq/connection.rs`
|
||||
|
||||
Added queue declarations and bindings in `setup_infrastructure()`:
|
||||
|
||||
```rust
|
||||
// Declare the new queues with DLX support
|
||||
self.declare_queue_with_dlx(&config.rabbitmq.queues.execution_completed, dlx).await?;
|
||||
self.declare_queue_with_dlx(&config.rabbitmq.queues.inquiry_responses, dlx).await?;
|
||||
|
||||
// Bind execution_status queue to status changed messages for ExecutionManager
|
||||
self.bind_queue(
|
||||
&config.rabbitmq.queues.execution_status.name,
|
||||
&config.rabbitmq.exchanges.executions.name,
|
||||
"execution.status.changed",
|
||||
)
|
||||
.await?;
|
||||
|
||||
// Bind execution_completed queue to completed messages for CompletionListener
|
||||
self.bind_queue(
|
||||
&config.rabbitmq.queues.execution_completed.name,
|
||||
&config.rabbitmq.exchanges.executions.name,
|
||||
"execution.completed",
|
||||
)
|
||||
.await?;
|
||||
|
||||
// Bind inquiry_responses queue to inquiry responded messages for InquiryHandler
|
||||
self.bind_queue(
|
||||
&config.rabbitmq.queues.inquiry_responses.name,
|
||||
&config.rabbitmq.exchanges.executions.name,
|
||||
"inquiry.responded",
|
||||
)
|
||||
.await?;
|
||||
```
|
||||
|
||||
### 3. Updated Executor Service Configuration
|
||||
|
||||
**File:** `attune/crates/executor/src/service.rs`
|
||||
|
||||
Changed `InquiryHandler` and `CompletionListener` to consume from dedicated queues:
|
||||
|
||||
```rust
|
||||
// InquiryHandler - Before:
|
||||
let inquiry_response_queue = self.inner.mq_config.rabbitmq.queues.execution_status.name.clone();
|
||||
|
||||
// InquiryHandler - After:
|
||||
let inquiry_response_queue = self.inner.mq_config.rabbitmq.queues.inquiry_responses.name.clone();
|
||||
|
||||
// CompletionListener - Before:
|
||||
let execution_completed_queue = self.inner.mq_config.rabbitmq.queues.execution_status.name.clone();
|
||||
|
||||
// CompletionListener - After:
|
||||
let execution_completed_queue = self.inner.mq_config.rabbitmq.queues.execution_completed.name.clone();
|
||||
```
|
||||
|
||||
## Message Flow After Fix
|
||||
|
||||
### Execution Completion Flow
|
||||
```
|
||||
Worker → publishes ExecutionCompletedPayload
|
||||
→ routing key: "execution.completed"
|
||||
→ exchange: "attune.executions"
|
||||
→ queue: "attune.execution.completed.queue"
|
||||
→ consumer: CompletionListener
|
||||
✅ Correct payload type received
|
||||
```
|
||||
|
||||
### Execution Status Change Flow
|
||||
```
|
||||
Worker → publishes ExecutionStatusChangedPayload
|
||||
→ routing key: "execution.status.changed"
|
||||
→ exchange: "attune.executions"
|
||||
→ queue: "attune.execution.status.queue"
|
||||
→ consumer: ExecutionManager
|
||||
✅ Correct payload type received
|
||||
```
|
||||
|
||||
### Inquiry Response Flow
|
||||
```
|
||||
API → publishes InquiryRespondedPayload
|
||||
→ routing key: "inquiry.responded"
|
||||
→ exchange: "attune.executions"
|
||||
→ queue: "attune.inquiry.responses.queue"
|
||||
→ consumer: InquiryHandler
|
||||
✅ Correct payload type received
|
||||
```
|
||||
|
||||
## Benefits
|
||||
|
||||
1. **Type Safety**: Each queue receives only one message type, eliminating deserialization errors
|
||||
2. **Scalability**: Can scale `CompletionListener`, `ExecutionManager`, and `InquiryHandler` independently
|
||||
3. **Maintainability**: Clear separation of concerns - each queue has a single purpose
|
||||
4. **Reliability**: No message rejection due to type mismatches
|
||||
5. **Performance**: No wasted processing from consumers receiving wrong message types
|
||||
|
||||
## Queue Separation Summary
|
||||
|
||||
After both fixes, we now have three dedicated queues for execution-related messages:
|
||||
|
||||
| Queue | Routing Key | Message Type | Consumer |
|
||||
|-------|-------------|--------------|----------|
|
||||
| `attune.execution.status.queue` | `execution.status.changed` | `ExecutionStatusChangedPayload` | ExecutionManager |
|
||||
| `attune.execution.completed.queue` | `execution.completed` | `ExecutionCompletedPayload` | CompletionListener |
|
||||
| `attune.inquiry.responses.queue` | `inquiry.responded` | `InquiryRespondedPayload` | InquiryHandler |
|
||||
|
||||
**Result:** Each queue now has exactly one consumer expecting exactly one message type. ✅
|
||||
|
||||
## Testing Recommendations
|
||||
|
||||
1. **Restart all services** to recreate the queue infrastructure with new bindings
|
||||
2. **Verify queue creation** in RabbitMQ management UI:
|
||||
- Check that `attune.inquiry.responses.queue` exists
|
||||
- Check that `attune.execution.completed.queue` exists
|
||||
- Verify bindings on `attune.executions` exchange:
|
||||
- `inquiry.responded` → `attune.inquiry.responses.queue`
|
||||
- `execution.completed` → `attune.execution.completed.queue`
|
||||
- `execution.status.changed` → `attune.execution.status.queue`
|
||||
3. **Monitor executor logs** for absence of deserialization errors (`inquiry_id` and `action_id`)
|
||||
4. **Test inquiry workflow**:
|
||||
- Create an action that requests inquiry (`__inquiry` in result)
|
||||
- Respond to inquiry via API
|
||||
- Verify execution resumes correctly
|
||||
5. **Test execution completion**:
|
||||
- Execute a simple action
|
||||
- Verify completion notification processed without errors
|
||||
|
||||
|
||||
### Files Modified
|
||||
|
||||
- `attune/crates/common/src/mq/config.rs` - Added `inquiry_responses` and `execution_completed` queues
|
||||
- `attune/crates/common/src/mq/connection.rs` - Added queue declarations and bindings
|
||||
- `attune/crates/executor/src/service.rs` - Updated InquiryHandler and CompletionListener to use new queues
|
||||
|
||||
## Migration Notes
|
||||
|
||||
This is a **breaking change** for existing deployments:
|
||||
|
||||
1. Two new queues will be created automatically on service startup:
|
||||
- `attune.inquiry.responses.queue`
|
||||
- `attune.execution.completed.queue`
|
||||
2. The `execution_status` queue now has **only one binding** (`execution.status.changed`)
|
||||
3. Existing messages in queues are unaffected
|
||||
4. No database migrations required
|
||||
5. **Action Required**: Restart executor service to apply changes
|
||||
|
||||
## Related Issues
|
||||
|
||||
- Original implementation assumed a single queue could handle multiple message types
|
||||
- RabbitMQ round-robin distribution caused non-deterministic deserialization failures
|
||||
- Errors were intermittent because they depended on which consumer received which message
|
||||
- `ExecutionManager` uses local payload struct instead of canonical `ExecutionStatusChangedPayload` (not critical but should be unified in future)
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
1. **One queue, one message type**: RabbitMQ queues should have a single message schema
|
||||
2. **One queue, one consumer**: Multiple consumers on the same queue creates competition, not cooperation
|
||||
3. **Use routing keys effectively**: Topic exchanges with specific routing keys provide better message segregation
|
||||
4. **Consumer tag awareness**: Consumer tags don't prevent round-robin distribution within the same queue
|
||||
5. **Type-safe patterns**: Rust's strong typing revealed the issue quickly through deserialization errors
|
||||
6. **Canonical message types**: Use shared message structs from `attune_common::mq::messages`, not local definitions
|
||||
7. **Incremental fixes**: Sometimes you discover deeper issues while fixing surface-level problems - fix them all at once
|
||||
8. **Test thoroughly**: Restart services and monitor logs to catch related issues before they reach production
|
||||
Reference in New Issue
Block a user