re-uploading work
This commit is contained in:
262
work-summary/sessions/2026-01-16-worker-mq-infrastructure-fix.md
Normal file
262
work-summary/sessions/2026-01-16-worker-mq-infrastructure-fix.md
Normal file
@@ -0,0 +1,262 @@
|
||||
# Worker Service Message Queue Infrastructure Fix
|
||||
|
||||
**Date:** 2026-01-16
|
||||
**Status:** ✅ Completed
|
||||
|
||||
## Problem
|
||||
|
||||
The worker service was failing to start with the following error:
|
||||
|
||||
```
|
||||
ERROR ThreadId(22) io_loop: Channel closed channel=2 method=Close {
|
||||
reply_code: 404,
|
||||
reply_text: ShortString("NOT_FOUND - no queue 'worker.1.executions' in vhost '/'"),
|
||||
class_id: 60,
|
||||
method_id: 20
|
||||
} error=AMQPError {
|
||||
kind: Soft(NOTFOUND),
|
||||
message: ShortString("NOT_FOUND - no queue 'worker.1.executions' in vhost '/'")
|
||||
}
|
||||
```
|
||||
|
||||
This occurred because:
|
||||
1. The worker was trying to consume from a dynamically-named queue `worker.{worker_id}.executions` that didn't exist
|
||||
2. The RabbitMQ infrastructure (exchanges, queues, bindings) was not being set up by the worker
|
||||
3. Worker-specific queues need to be created dynamically after worker registration
|
||||
|
||||
## Root Cause
|
||||
|
||||
The worker service was:
|
||||
- Assuming the RabbitMQ infrastructure already existed
|
||||
- Attempting to consume from a worker-specific queue without creating it first
|
||||
- Not declaring or binding the queue before starting the consumer
|
||||
- Missing infrastructure setup that other services (executor) had
|
||||
|
||||
## Solution
|
||||
|
||||
Updated the worker service to:
|
||||
|
||||
### 1. Set Up Base Infrastructure on Startup
|
||||
Added automatic infrastructure setup similar to the executor service:
|
||||
|
||||
```rust
|
||||
// Setup message queue infrastructure (exchanges, queues, bindings)
|
||||
let mq_config = MqConfig::default();
|
||||
match mq_connection.setup_infrastructure(&mq_config).await {
|
||||
Ok(_) => info!("Message queue infrastructure setup completed"),
|
||||
Err(e) => {
|
||||
warn!(
|
||||
"Failed to setup MQ infrastructure (may already exist): {}",
|
||||
e
|
||||
);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This creates the base infrastructure:
|
||||
- Exchanges: `attune.events`, `attune.executions`, `attune.notifications`
|
||||
- Base queues: `attune.events.queue`, `attune.executions.queue`, `attune.notifications.queue`
|
||||
- Dead letter exchange: `attune.dlx`
|
||||
|
||||
### 2. Create Worker-Specific Queue Dynamically
|
||||
Added dynamic queue creation in `start_execution_consumer()`:
|
||||
|
||||
```rust
|
||||
// Create the worker-specific queue
|
||||
let worker_queue = QueueConfig {
|
||||
name: queue_name.clone(),
|
||||
durable: false, // Worker queues are temporary
|
||||
exclusive: false,
|
||||
auto_delete: true, // Delete when worker disconnects
|
||||
};
|
||||
|
||||
self.mq_connection
|
||||
.declare_queue(&worker_queue)
|
||||
.await?;
|
||||
```
|
||||
|
||||
**Key Design Decisions:**
|
||||
- `durable: false` - Worker queues don't need to survive broker restarts
|
||||
- `auto_delete: true` - Queues are automatically cleaned up when worker disconnects
|
||||
- Queue name format: `worker.{worker_id}.executions`
|
||||
|
||||
### 3. Bind Queue to Exchange with Worker-Specific Routing Key
|
||||
Added proper binding so the scheduler can route executions to specific workers:
|
||||
|
||||
```rust
|
||||
// Bind the queue to the executions exchange with worker-specific routing key
|
||||
self.mq_connection
|
||||
.bind_queue(
|
||||
&queue_name,
|
||||
"attune.executions",
|
||||
&format!("worker.{}", worker_id),
|
||||
)
|
||||
.await?;
|
||||
```
|
||||
|
||||
This allows the executor/scheduler to target specific workers by publishing messages with routing key `worker.{worker_id}`.
|
||||
|
||||
### 4. Added Proper Imports
|
||||
- Added `config::MessageQueueConfig as MqConfig` import
|
||||
- Added `QueueConfig` import for queue configuration
|
||||
- Added `warn` import from `tracing` for setup warnings
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### Files Modified
|
||||
|
||||
**`crates/worker/src/service.rs`:**
|
||||
- Added `MessageQueueConfig` import from `attune_common::mq::config`
|
||||
- Added `QueueConfig` import
|
||||
- Added infrastructure setup in `new()` method
|
||||
- Enhanced `start_execution_consumer()` to:
|
||||
- Create worker-specific queue dynamically
|
||||
- Bind queue to exchange with worker-specific routing key
|
||||
- Add detailed logging for each step
|
||||
- All operations happen after worker registration (when worker_id is known)
|
||||
|
||||
### Infrastructure Created
|
||||
|
||||
#### Base Infrastructure (shared with executor)
|
||||
1. **Exchanges**:
|
||||
- `attune.events` (Topic)
|
||||
- `attune.executions` (Direct)
|
||||
- `attune.notifications` (Fanout)
|
||||
|
||||
2. **Base Queues**:
|
||||
- `attune.events.queue`
|
||||
- `attune.executions.queue`
|
||||
- `attune.notifications.queue`
|
||||
|
||||
3. **Dead Letter Exchange**: `attune.dlx`
|
||||
|
||||
#### Worker-Specific Infrastructure (per worker)
|
||||
1. **Worker Queue**: `worker.{worker_id}.executions`
|
||||
- Non-durable (temporary)
|
||||
- Auto-delete when worker disconnects
|
||||
- Bound to `attune.executions` exchange
|
||||
|
||||
2. **Routing Key**: `worker.{worker_id}`
|
||||
- Allows targeted message delivery to specific workers
|
||||
- Scheduler can route executions based on worker capabilities
|
||||
|
||||
## Testing
|
||||
|
||||
Verified the fix works correctly:
|
||||
|
||||
1. ✅ **Base Infrastructure Setup**: Common exchanges and queues created
|
||||
```
|
||||
INFO Setting up RabbitMQ infrastructure
|
||||
INFO Queue 'attune.events.queue' declared with dead letter exchange 'attune.dlx'
|
||||
INFO Queue 'attune.executions.queue' declared with dead letter exchange 'attune.dlx'
|
||||
INFO Queue 'attune.notifications.queue' declared with dead letter exchange 'attune.dlx'
|
||||
INFO RabbitMQ infrastructure setup complete
|
||||
```
|
||||
|
||||
2. ✅ **Worker Registration**: Worker successfully registers with database
|
||||
```
|
||||
INFO Worker registered with ID: 1
|
||||
```
|
||||
|
||||
3. ✅ **Dynamic Queue Creation**: Worker-specific queue is created
|
||||
```
|
||||
INFO Creating worker-specific queue: worker.1.executions
|
||||
INFO Worker queue created: worker.1.executions
|
||||
```
|
||||
|
||||
4. ✅ **Queue Binding**: Queue bound to exchange with proper routing key
|
||||
```
|
||||
INFO Queue 'worker.1.executions' bound to exchange 'attune.executions' with routing key 'worker.1'
|
||||
INFO Queue bound to exchange with routing key 'worker.1'
|
||||
```
|
||||
|
||||
5. ✅ **Consumer Start**: Worker successfully starts consuming messages
|
||||
```
|
||||
INFO Consumer started for queue: worker.1.executions
|
||||
INFO Message queue consumer initialized
|
||||
```
|
||||
|
||||
6. ✅ **Service Ready**: Worker service fully operational
|
||||
```
|
||||
INFO Worker Service started successfully
|
||||
```
|
||||
|
||||
## Impact
|
||||
|
||||
- **Automated Setup**: Workers create their own queues automatically on startup
|
||||
- **Dynamic Infrastructure**: Queue creation happens after worker registration
|
||||
- **Clean Shutdown**: Auto-delete queues prevent orphaned queues from offline workers
|
||||
- **Targeted Routing**: Scheduler can route executions to specific workers
|
||||
- **Idempotent**: Base infrastructure setup is safe to run multiple times
|
||||
- **Better Logging**: Clear visibility into infrastructure creation steps
|
||||
|
||||
## Architecture Notes
|
||||
|
||||
### Worker Queue Lifecycle
|
||||
1. Worker starts and connects to MQ
|
||||
2. Worker sets up base infrastructure (idempotent)
|
||||
3. Worker registers with database, receives worker_id
|
||||
4. Worker creates queue: `worker.{worker_id}.executions`
|
||||
5. Worker binds queue to `attune.executions` exchange
|
||||
6. Worker starts consuming with tag `worker-{worker_id}`
|
||||
7. On shutdown/disconnect, queue is automatically deleted
|
||||
|
||||
### Message Routing Flow
|
||||
```
|
||||
Scheduler/Executor
|
||||
↓ (publishes to)
|
||||
attune.executions exchange
|
||||
↓ (routes by key: worker.{id})
|
||||
worker.{id}.executions queue
|
||||
↓ (consumed by)
|
||||
Worker Service
|
||||
```
|
||||
|
||||
### Queue Configuration Rationale
|
||||
- **Non-durable**: Worker queues are ephemeral, tied to worker lifetime
|
||||
- **Auto-delete**: Prevents accumulation of queues from dead workers
|
||||
- **Non-exclusive**: Allows monitoring/management tools to inspect queues
|
||||
- **Named by worker_id**: Enables targeted execution assignment
|
||||
|
||||
## Comparison with Other Services
|
||||
|
||||
### Executor Service
|
||||
- Uses persistent queues: `attune.executions.queue`
|
||||
- Multiple competing consumers on same queue
|
||||
- Infrastructure setup in service initialization
|
||||
|
||||
### Worker Service
|
||||
- Uses ephemeral per-worker queues: `worker.{id}.executions`
|
||||
- One consumer per worker queue
|
||||
- Queue creation after worker registration
|
||||
- Auto-cleanup when worker disconnects
|
||||
|
||||
### Sensor Service
|
||||
- Uses `MessageQueue` wrapper (publish-only)
|
||||
- No consumer setup needed
|
||||
- May need similar updates if consuming in future
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [ ] Add health check that verifies worker queue exists and is bound
|
||||
- [ ] Consider adding queue TTL for additional cleanup safety
|
||||
- [ ] Add metrics for worker queue depth and consumer performance
|
||||
- [ ] Document worker routing patterns in architecture docs
|
||||
- [ ] Consider adding worker queue monitoring/alerting
|
||||
- [ ] Test worker failover scenarios (queue cleanup)
|
||||
|
||||
## Related Issues
|
||||
|
||||
This fix follows the same pattern as:
|
||||
- **2026-01-15**: Executor service MQ infrastructure fix
|
||||
- **2026-01-15**: Configuration URL parsing fix
|
||||
|
||||
All services now properly set up their required MQ infrastructure on startup.
|
||||
|
||||
## Notes
|
||||
|
||||
- Worker queues are intentionally temporary and auto-delete
|
||||
- Each worker gets its own queue for targeted execution delivery
|
||||
- The routing key pattern `worker.{id}` allows flexible execution scheduling
|
||||
- Infrastructure setup is idempotent - safe for multiple workers to run simultaneously
|
||||
- Queue creation happens after registration to ensure worker_id is available
|
||||
Reference in New Issue
Block a user