attune/work-summary/sessions/2026-01-16-runtime-matching-fix.md
2026-02-04 17:46:30 -06:00

Work Summary: Worker Runtime Matching Fix and Routing Key Issue

Date: 2026-01-16
Session Goal: Complete the happy path for timer-driven rule execution with echo action

Summary

Successfully fixed the worker runtime matching logic. The executor now correctly selects workers based on their capabilities.runtimes array instead of the deprecated runtime column. However, we discovered a critical message-routing issue that prevents executions from reaching workers.

Completed Work

1. Worker Runtime Matching Refactor

Problem:

  • Executor matched workers by checking worker.runtime column (which was NULL)
  • Worker's actual capabilities stored in capabilities.runtimes JSON array like ["python", "shell", "node"]
  • Actions require specific runtimes (e.g., core.echo requires shell runtime)

Solution:

  • Updated ExecutionScheduler::select_worker() in crates/executor/src/scheduler.rs
  • Added worker_supports_runtime() helper function that:
    • Parses worker's capabilities.runtimes array
    • Performs case-insensitive matching against action's runtime name
    • Falls back to deprecated runtime column for backward compatibility
    • Logs detailed matching information for debugging

Code Changes:

```rust
fn worker_supports_runtime(worker: &Worker, runtime_name: &str) -> bool {
    // Walk capabilities -> "runtimes" -> array, matching case-insensitively.
    worker
        .capabilities
        .as_ref()
        .and_then(|caps| caps.get("runtimes"))
        .and_then(|runtimes| runtimes.as_array())
        .map_or(false, |runtime_array| {
            runtime_array
                .iter()
                .filter_map(|value| value.as_str())
                .any(|runtime| runtime.eq_ignore_ascii_case(runtime_name))
        })
}
```
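The bullets above also mention a fallback to the deprecated runtime column, which the snippet does not show. A dependency-free sketch of that combined check, using a plain `Vec<String>` and `Option<String>` as stand-ins for the real Worker fields (the struct and field names here are illustrative, not the actual types):

```rust
// Illustrative stand-in for the real Worker type: the JSON capabilities
// array becomes a Vec<String>, and the deprecated column an Option<String>.
struct Worker {
    capabilities_runtimes: Vec<String>, // stand-in for capabilities.runtimes
    runtime: Option<String>,            // deprecated runtime column
}

fn supports_runtime(worker: &Worker, runtime_name: &str) -> bool {
    // Primary check: case-insensitive match against the capabilities array.
    let in_caps = worker
        .capabilities_runtimes
        .iter()
        .any(|r| r.eq_ignore_ascii_case(runtime_name));
    // Backward-compatibility fallback: compare the deprecated column.
    in_caps
        || worker
            .runtime
            .as_deref()
            .map_or(false, |r| r.eq_ignore_ascii_case(runtime_name))
}
```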

2. Message Payload Standardization

Problem:

  • Multiple local definitions of ExecutionRequestedPayload across executor modules
  • Each had different fields causing deserialization failures
  • Scheduler, manager, and enforcement processor expected different message formats

Solution:

  • Updated scheduler and execution manager to use shared ExecutionRequestedPayload from attune_common::mq
  • Standardized payload fields:
    • execution_id: i64
    • action_id: Option<i64>
    • action_ref: String
    • parent_id: Option<i64>
    • enforcement_id: Option<i64>
    • config: Option<JsonValue>
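For reference, the standardized fields can be pictured as a plain struct. In the real code this lives in attune_common::mq and would derive serde's Serialize/Deserialize, with config as a serde_json::Value; the sketch below uses a raw JSON string for config to stay dependency-free:

```rust
// Field-by-field sketch of the shared payload listed above.
// `config` is a serde_json::Value in the real code; a raw JSON string
// stands in for it here so the sketch needs no external crates.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionRequestedPayload {
    execution_id: i64,
    action_id: Option<i64>,
    action_ref: String,
    parent_id: Option<i64>,
    enforcement_id: Option<i64>,
    config: Option<String>,
}
```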

Files Modified:

  • crates/executor/src/scheduler.rs
  • crates/executor/src/execution_manager.rs

3. Worker Message Routing Implementation

Problem:

  • Scheduler published messages to exchange without routing key
  • Worker-specific queue worker.1.executions bound with routing key worker.1
  • Messages weren't reaching worker queue due to missing routing key

Solution:

  • Updated queue_to_worker() to use publish_envelope_with_routing()
  • Routing key format: worker.{worker_id}
  • Exchange: attune.executions
  • Added detailed logging of routing key in publish message

Code:

```rust
let routing_key = format!("worker.{}", worker_id);
let exchange = "attune.executions";

publisher
    .publish_envelope_with_routing(&envelope, exchange, &routing_key)
    .await?;
```

Critical Issue Discovered 🚨

Message Queue Consumer Architecture Problem

Symptom:

  • Executor creates executions successfully (status: requested)
  • Enforcement processor publishes ExecutionRequested messages
  • Scheduler never processes these messages (executions stay in requested status)
  • Continuous deserialization errors in logs

Root Cause: All three executor components consume from the SAME queue (attune.executions.queue) but expect DIFFERENT message payload types:

  1. Enforcement Processor - expects EnforcementCreatedPayload
  2. Execution Scheduler - expects ExecutionRequestedPayload
  3. Execution Manager - expects ExecutionStatusPayload

What Happens:

  1. Message arrives in attune.executions.queue
  2. All three consumers compete to consume it (round-robin)
  3. Two consumers fail deserialization and reject (nack) the message
  4. Message goes to dead letter queue (DLQ)
  5. Correct consumer never gets the message
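The arithmetic of the failure can be sketched with a toy round-robin model (the consumer ordering is illustrative; the message-type names match the three payload kinds above):

```rust
// Toy model of the failure mode: three consumers share one queue via
// round-robin delivery, but each can deserialize only one message type.
// Returns how many deliveries are nacked and dead-lettered.
fn simulate_round_robin(messages: &[&str]) -> usize {
    // Consumer i understands only expected[i].
    let expected = ["enforcement.created", "execution.requested", "execution.status"];
    let mut dead_lettered = 0;
    for (i, msg) in messages.iter().enumerate() {
        // Round-robin delivery: consumer (i % 3) receives the message.
        if expected[i % 3] != *msg {
            dead_lettered += 1; // deserialization fails -> nack -> DLQ
        }
    }
    dead_lettered
}
```

With three competing consumers, roughly two out of every three deliveries land on a consumer that cannot parse them.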

Log Evidence:

ERROR Failed to deserialize message: missing field `execution_id` at line 1 column 492. Rejecting message.
ERROR Failed to deserialize message: missing field `status` at line 1 column 404. Rejecting message.
ERROR Failed to deserialize message: missing field `rule_ref` at line 1 column 404. Rejecting message.

These errors repeat constantly as different consumers try to process incompatible messages.

Current System Status

Working Components

  • Timer sensors fire every 10 seconds
  • Events created in database
  • Rules match events correctly
  • Enforcements created
  • Executions created with action_params flowing through
  • Worker registered with correct capabilities
  • Worker-specific queue bound correctly

Broken Pipeline 🔴

  • Enforcement → Execution: Messages published but rejected by consumers
  • Scheduler → Worker: Would work IF scheduler received messages
  • Worker → Action Execution: Not tested yet

Test Data

  • Rule: core.timer_echo (ID: 2)
  • Trigger: core.intervaltimer (ID: 15)
  • Action: core.echo (ID: 1, runtime: shell/3)
  • Worker: ID 1, capabilities: {"runtimes": ["python", "shell", "node"]}
  • Recent executions: 46, 47, 48, 49 - all stuck in requested status

Solutions to Consider

Option 1: Separate Queues per Message Type

  • Create dedicated queues for each message type:
    • attune.enforcements.queue → Enforcement Processor
    • attune.execution.requests.queue → Scheduler
    • attune.execution.status.queue → Manager
  • Update publishers to route to the correct queues

Option 2: Topic-Based Routing

  • Include message type in routing key
  • Bind consumers to specific message type patterns
  • Example: execution.requested, execution.status, enforcement.created
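A minimal sketch of the wildcard semantics such bindings would rely on. In AMQP topic exchanges a '*' segment matches exactly one word; the '#' (zero-or-more words) wildcard is omitted here to keep the sketch small:

```rust
// Sketch of topic-style binding (Option 2): a '*' segment in the binding
// pattern matches exactly one dot-separated word in the routing key.
fn topic_matches(pattern: &str, routing_key: &str) -> bool {
    let pat: Vec<&str> = pattern.split('.').collect();
    let key: Vec<&str> = routing_key.split('.').collect();
    pat.len() == key.len()
        && pat.iter().zip(key.iter()).all(|(p, k)| *p == "*" || p == k)
}
```

Binding the scheduler to `execution.*` would then deliver `execution.requested` while keeping `enforcement.created` out of its queue.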

Option 3: Message Type Pre-Filtering

  • Modify Consumer to peek at message_type field before deserializing payload
  • Route to appropriate handler based on type
  • More complex, requires consumer interface changes
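A rough sketch of the dispatch step, assuming an envelope type that exposes message_type alongside the raw payload (the Envelope struct and the component labels here are illustrative, not the real consumer interface):

```rust
// Sketch of pre-filtering (Option 3): inspect the envelope's message_type
// before attempting payload deserialization, then hand the raw payload to
// the matching component.
struct Envelope {
    message_type: String,
    payload: String, // raw JSON, deserialized only by the chosen handler
}

fn route(envelope: &Envelope) -> &'static str {
    match envelope.message_type.as_str() {
        "enforcement.created" => "enforcement_processor",
        "execution.requested" => "scheduler",
        "execution.status" => "execution_manager",
        _ => "dead_letter_queue", // unknown types still go to the DLQ
    }
}
```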

Option 4: Dead Letter Queue Recovery

  • Add DLQ consumer that re-routes messages to correct queues
  • Band-aid solution, doesn't fix root cause

Next Steps

  1. Immediate Priority: Implement separate queues (Option 1)

    • Update config.yaml with new queue definitions
    • Modify enforcement processor to publish to dedicated queue
    • Update scheduler and manager to consume from their specific queues
    • Test end-to-end flow
  2. Verification Steps:

    • Confirm execution moves from requested → scheduled → running
    • Verify message reaches worker queue
    • Check worker executes shell command
    • Validate "hello, world" appears in execution result
  3. Future Improvements:

    • Remove deprecated worker.runtime column
    • Implement topic-based routing for better scalability
    • Add message type validation at queue level
    • Create monitoring for DLQ depth

Files Modified

  • attune/crates/executor/src/scheduler.rs - Runtime matching + routing key
  • attune/crates/executor/src/execution_manager.rs - Payload standardization
  • attune/crates/executor/src/enforcement_processor.rs - Uses shared payload

Testing Notes

Services running:

  • Worker: Listening on worker.1.executions queue
  • Executor: All three consumers competing on one queue ⚠️
  • Sensor: Generating events every 10 seconds

Database state:

  • 4 executions reached scheduled status in previous runs
  • All new executions stuck in requested status since current run

Conclusion

The worker runtime matching fix is complete and correct. The system can select appropriate workers based on capabilities. However, the message queue architecture has a fundamental flaw where multiple consumers compete for messages they cannot process. This must be resolved before the happy path can be completed.

The fix is well-understood and straightforward: implement separate queues for different message types. This is the standard pattern for message-driven architectures and will eliminate the consumer competition issue.