8.1 KiB
Work Summary: Worker Runtime Matching Fix and Routing Key Issue
Date: 2026-01-16
Session Goal: Complete the happy path for timer-driven rule execution with echo action
Summary
Successfully fixed the worker runtime matching logic. The executor now correctly selects workers based on their capabilities.runtimes array instead of the deprecated runtime column. However, discovered a critical message routing issue preventing executions from reaching workers.
Completed Work
1. Worker Runtime Matching Refactor ✅
Problem:
- Executor matched workers by checking
worker.runtimecolumn (which was NULL) - Worker's actual capabilities stored in
capabilities.runtimesJSON array like["python", "shell", "node"] - Actions require specific runtimes (e.g.,
core.echorequiresshellruntime)
Solution:
- Updated
ExecutionScheduler::select_worker()incrates/executor/src/scheduler.rs - Added
worker_supports_runtime()helper function that:- Parses worker's
capabilities.runtimesarray - Performs case-insensitive matching against action's runtime name
- Falls back to deprecated
runtimecolumn for backward compatibility - Logs detailed matching information for debugging
- Parses worker's
Code Changes:
fn worker_supports_runtime(worker: &Worker, runtime_name: &str) -> bool {
if let Some(ref capabilities) = worker.capabilities {
if let Some(runtimes) = capabilities.get("runtimes") {
if let Some(runtime_array) = runtimes.as_array() {
for runtime_value in runtime_array {
if let Some(runtime_str) = runtime_value.as_str() {
if runtime_str.eq_ignore_ascii_case(runtime_name) {
return true;
}
}
}
}
}
}
false
}
2. Message Payload Standardization ✅
Problem:
- Multiple local definitions of
ExecutionRequestedPayloadacross executor modules - Each had different fields causing deserialization failures
- Scheduler, manager, and enforcement processor expected different message formats
Solution:
- Updated scheduler and execution manager to use shared
ExecutionRequestedPayloadfromattune_common::mq - Standardized payload fields:
execution_id: i64action_id: Option<i64>action_ref: Stringparent_id: Option<i64>enforcement_id: Option<i64>config: Option<JsonValue>
Files Modified:
crates/executor/src/scheduler.rscrates/executor/src/execution_manager.rs
3. Worker Message Routing Implementation ✅
Problem:
- Scheduler published messages to exchange without routing key
- Worker-specific queue
worker.1.executionsbound with routing keyworker.1 - Messages weren't reaching worker queue due to missing routing key
Solution:
- Updated
queue_to_worker()to usepublish_envelope_with_routing() - Routing key format:
worker.{worker_id} - Exchange:
attune.executions - Added detailed logging of routing key in publish message
Code:
let routing_key = format!("worker.{}", worker_id);
let exchange = "attune.executions";
publisher
.publish_envelope_with_routing(&envelope, exchange, &routing_key)
.await?;
Critical Issue Discovered 🚨
Message Queue Consumer Architecture Problem
Symptom:
- Executor creates executions successfully (status:
requested) - Enforcement processor publishes
ExecutionRequestedmessages - Scheduler never processes these messages (executions stay in
requestedstatus) - Continuous deserialization errors in logs
Root Cause:
All three executor components consume from the SAME queue (attune.executions.queue) but expect DIFFERENT message payload types:
- Enforcement Processor - expects
EnforcementCreatedPayload - Execution Scheduler - expects
ExecutionRequestedPayload - Execution Manager - expects
ExecutionStatusPayload
What Happens:
- Message arrives in
attune.executions.queue - All three consumers compete to consume it (round-robin)
- Two consumers fail deserialization and reject (nack) the message
- Message goes to dead letter queue (DLQ)
- Correct consumer never gets the message
Log Evidence:
ERROR Failed to deserialize message: missing field `execution_id` at line 1 column 492. Rejecting message.
ERROR Failed to deserialize message: missing field `status` at line 1 column 404. Rejecting message.
ERROR Failed to deserialize message: missing field `rule_ref` at line 1 column 404. Rejecting message.
These errors repeat constantly as different consumers try to process incompatible messages.
Current System Status
Working Components ✅
- Timer sensors fire every 10 seconds
- Events created in database
- Rules match events correctly
- Enforcements created
- Executions created with action_params flowing through
- Worker registered with correct capabilities
- Worker-specific queue bound correctly
Broken Pipeline 🔴
- Enforcement → Execution: Messages published but rejected by consumers
- Scheduler → Worker: Would work IF scheduler received messages
- Worker → Action Execution: Not tested yet
Test Data
- Rule:
core.timer_echo(ID: 2) - Trigger:
core.intervaltimer(ID: 15) - Action:
core.echo(ID: 1, runtime: shell/3) - Worker: ID 1, capabilities:
{"runtimes": ["python", "shell", "node"]} - Recent executions: 46, 47, 48, 49 - all stuck in
requestedstatus
Solutions to Consider
Option 1: Separate Queues (Recommended)
- Create dedicated queues for each message type
attune.enforcements.queue→ Enforcement Processorattune.execution.requests.queue→ Schedulerattune.execution.status.queue→ Manager- Update publishers to route to correct queues
Option 2: Topic-Based Routing
- Include message type in routing key
- Bind consumers to specific message type patterns
- Example:
execution.requested,execution.status,enforcement.created
Option 3: Message Type Pre-Filtering
- Modify Consumer to peek at
message_typefield before deserializing payload - Route to appropriate handler based on type
- More complex, requires consumer interface changes
Option 4: Dead Letter Queue Recovery
- Add DLQ consumer that re-routes messages to correct queues
- Band-aid solution, doesn't fix root cause
Next Steps
-
Immediate Priority: Implement separate queues (Option 1)
- Update
config.yamlwith new queue definitions - Modify enforcement processor to publish to dedicated queue
- Update scheduler and manager to consume from their specific queues
- Test end-to-end flow
- Update
-
Verification Steps:
- Confirm execution moves from
requested→scheduled→running - Verify message reaches worker queue
- Check worker executes shell command
- Validate "hello, world" appears in execution result
- Confirm execution moves from
-
Future Improvements:
- Remove deprecated
worker.runtimecolumn - Implement topic-based routing for better scalability
- Add message type validation at queue level
- Create monitoring for DLQ depth
- Remove deprecated
Files Modified
attune/crates/executor/src/scheduler.rs- Runtime matching + routing keyattune/crates/executor/src/execution_manager.rs- Payload standardizationattune/crates/executor/src/enforcement_processor.rs- Uses shared payload
Testing Notes
Services running:
- Worker: Listening on
worker.1.executionsqueue ✅ - Executor: All three consumers competing on one queue ⚠️
- Sensor: Generating events every 10 seconds ✅
Database state:
- 4 executions reached
scheduledstatus in previous runs - All new executions stuck in
requestedstatus since current run
Conclusion
The worker runtime matching fix is complete and correct. The system can select appropriate workers based on capabilities. However, the message queue architecture has a fundamental flaw where multiple consumers compete for messages they cannot process. This must be resolved before the happy path can be completed.
The fix is well-understood and straightforward: implement separate queues for different message types. This is the standard pattern for message-driven architectures and will eliminate the consumer competition issue.