attune/work-summary/sessions/2026-01-16-runtime-matching-fix.md
2026-02-04 17:46:30 -06:00

Work Summary: Worker Runtime Matching Fix and Routing Key Issue

Date: 2026-01-16
Session Goal: Complete the happy path for timer-driven rule execution with echo action

Summary

Successfully fixed the worker runtime matching logic. The executor now correctly selects workers based on their capabilities.runtimes array instead of the deprecated runtime column. However, we discovered a critical message-routing issue that prevents executions from reaching workers.

Completed Work

1. Worker Runtime Matching Refactor

Problem:

  • Executor matched workers by checking worker.runtime column (which was NULL)
  • Worker's actual capabilities stored in capabilities.runtimes JSON array like ["python", "shell", "node"]
  • Actions require specific runtimes (e.g., core.echo requires shell runtime)

Solution:

  • Updated ExecutionScheduler::select_worker() in crates/executor/src/scheduler.rs
  • Added worker_supports_runtime() helper function that:
    • Parses worker's capabilities.runtimes array
    • Performs case-insensitive matching against action's runtime name
    • Falls back to deprecated runtime column for backward compatibility
    • Logs detailed matching information for debugging

Code Changes:

```rust
fn worker_supports_runtime(worker: &Worker, runtime_name: &str) -> bool {
    // Walk capabilities -> "runtimes" -> array, matching case-insensitively.
    worker
        .capabilities
        .as_ref()
        .and_then(|caps| caps.get("runtimes"))
        .and_then(|runtimes| runtimes.as_array())
        .map_or(false, |runtime_array| {
            runtime_array
                .iter()
                .filter_map(|value| value.as_str())
                .any(|runtime| runtime.eq_ignore_ascii_case(runtime_name))
        })
}
```
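The bullets above also mention a fallback to the deprecated runtime column, which the snippet does not show. A dependency-free sketch of that combined check, using a plain `Vec<String>` and `Option<String>` as stand-ins for the real Worker fields (the struct and field names here are illustrative, not the actual types):

```rust
// Illustrative stand-in for the real Worker type: the JSON capabilities
// array becomes a Vec<String>, and the deprecated column an Option<String>.
struct Worker {
    capabilities_runtimes: Vec<String>, // stand-in for capabilities.runtimes
    runtime: Option<String>,            // deprecated runtime column
}

fn supports_runtime(worker: &Worker, runtime_name: &str) -> bool {
    // Primary check: case-insensitive match against the capabilities array.
    let in_caps = worker
        .capabilities_runtimes
        .iter()
        .any(|r| r.eq_ignore_ascii_case(runtime_name));
    // Backward-compatibility fallback: compare the deprecated column.
    in_caps
        || worker
            .runtime
            .as_deref()
            .map_or(false, |r| r.eq_ignore_ascii_case(runtime_name))
}
```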

2. Message Payload Standardization

Problem:

  • Multiple local definitions of ExecutionRequestedPayload across executor modules
  • Each had different fields causing deserialization failures
  • Scheduler, manager, and enforcement processor expected different message formats

Solution:

  • Updated scheduler and execution manager to use shared ExecutionRequestedPayload from attune_common::mq
  • Standardized payload fields:
    • execution_id: i64
    • action_id: Option<i64>
    • action_ref: String
    • parent_id: Option<i64>
    • enforcement_id: Option<i64>
    • config: Option<JsonValue>
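For reference, the standardized fields can be pictured as a plain struct. In the real code this lives in attune_common::mq and would derive serde's Serialize/Deserialize, with config as a serde_json::Value; the sketch below uses a raw JSON string for config to stay dependency-free:

```rust
// Field-by-field sketch of the shared payload listed above.
// `config` is a serde_json::Value in the real code; a raw JSON string
// stands in for it here so the sketch needs no external crates.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionRequestedPayload {
    execution_id: i64,
    action_id: Option<i64>,
    action_ref: String,
    parent_id: Option<i64>,
    enforcement_id: Option<i64>,
    config: Option<String>,
}
```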

Files Modified:

  • crates/executor/src/scheduler.rs
  • crates/executor/src/execution_manager.rs

3. Worker Message Routing Implementation

Problem:

  • Scheduler published messages to exchange without routing key
  • Worker-specific queue worker.1.executions bound with routing key worker.1
  • Messages weren't reaching worker queue due to missing routing key

Solution:

  • Updated queue_to_worker() to use publish_envelope_with_routing()
  • Routing key format: worker.{worker_id}
  • Exchange: attune.executions
  • Added detailed logging of routing key in publish message

Code:

```rust
let routing_key = format!("worker.{}", worker_id);
let exchange = "attune.executions";

publisher
    .publish_envelope_with_routing(&envelope, exchange, &routing_key)
    .await?;
```

Critical Issue Discovered 🚨

Message Queue Consumer Architecture Problem

Symptom:

  • Executor creates executions successfully (status: requested)
  • Enforcement processor publishes ExecutionRequested messages
  • Scheduler never processes these messages (executions stay in requested status)
  • Continuous deserialization errors in logs

Root Cause: All three executor components consume from the SAME queue (attune.executions.queue) but expect DIFFERENT message payload types:

  1. Enforcement Processor - expects EnforcementCreatedPayload
  2. Execution Scheduler - expects ExecutionRequestedPayload
  3. Execution Manager - expects ExecutionStatusPayload

What Happens:

  1. Message arrives in attune.executions.queue
  2. All three consumers compete to consume it (round-robin)
  3. Two consumers fail deserialization and reject (nack) the message
  4. Message goes to dead letter queue (DLQ)
  5. Correct consumer never gets the message
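The arithmetic of the failure can be sketched with a toy round-robin model (the consumer ordering is illustrative; the message-type names match the three payload kinds above):

```rust
// Toy model of the failure mode: three consumers share one queue via
// round-robin delivery, but each can deserialize only one message type.
// Returns how many deliveries are nacked and dead-lettered.
fn simulate_round_robin(messages: &[&str]) -> usize {
    // Consumer i understands only expected[i].
    let expected = ["enforcement.created", "execution.requested", "execution.status"];
    let mut dead_lettered = 0;
    for (i, msg) in messages.iter().enumerate() {
        // Round-robin delivery: consumer (i % 3) receives the message.
        if expected[i % 3] != *msg {
            dead_lettered += 1; // deserialization fails -> nack -> DLQ
        }
    }
    dead_lettered
}
```

With three competing consumers, roughly two out of every three deliveries land on a consumer that cannot parse them.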

Log Evidence:

ERROR Failed to deserialize message: missing field `execution_id` at line 1 column 492. Rejecting message.
ERROR Failed to deserialize message: missing field `status` at line 1 column 404. Rejecting message.
ERROR Failed to deserialize message: missing field `rule_ref` at line 1 column 404. Rejecting message.

These errors repeat constantly as different consumers try to process incompatible messages.

Current System Status

Working Components

  • Timer sensors fire every 10 seconds
  • Events created in database
  • Rules match events correctly
  • Enforcements created
  • Executions created with action_params flowing through
  • Worker registered with correct capabilities
  • Worker-specific queue bound correctly

Broken Pipeline 🔴

  • Enforcement → Execution: Messages published but rejected by consumers
  • Scheduler → Worker: Would work IF scheduler received messages
  • Worker → Action Execution: Not tested yet

Test Data

  • Rule: core.timer_echo (ID: 2)
  • Trigger: core.intervaltimer (ID: 15)
  • Action: core.echo (ID: 1, runtime: shell/3)
  • Worker: ID 1, capabilities: {"runtimes": ["python", "shell", "node"]}
  • Recent executions: 46, 47, 48, 49 - all stuck in requested status

Solutions to Consider

Option 1: Separate Queues per Message Type

  • Create dedicated queues for each message type:
    • attune.enforcements.queue → Enforcement Processor
    • attune.execution.requests.queue → Scheduler
    • attune.execution.status.queue → Manager
  • Update publishers to route to the correct queues

Option 2: Topic-Based Routing

  • Include message type in routing key
  • Bind consumers to specific message type patterns
  • Example: execution.requested, execution.status, enforcement.created
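A minimal sketch of the wildcard semantics such bindings would rely on. In AMQP topic exchanges a '*' segment matches exactly one word; the '#' (zero-or-more words) wildcard is omitted here to keep the sketch small:

```rust
// Sketch of topic-style binding (Option 2): a '*' segment in the binding
// pattern matches exactly one dot-separated word in the routing key.
fn topic_matches(pattern: &str, routing_key: &str) -> bool {
    let pat: Vec<&str> = pattern.split('.').collect();
    let key: Vec<&str> = routing_key.split('.').collect();
    pat.len() == key.len()
        && pat.iter().zip(key.iter()).all(|(p, k)| *p == "*" || p == k)
}
```

Binding the scheduler to `execution.*` would then deliver `execution.requested` while keeping `enforcement.created` out of its queue.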

Option 3: Message Type Pre-Filtering

  • Modify Consumer to peek at message_type field before deserializing payload
  • Route to appropriate handler based on type
  • More complex, requires consumer interface changes
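A rough sketch of the dispatch step, assuming an envelope type that exposes message_type alongside the raw payload (the Envelope struct and the component labels here are illustrative, not the real consumer interface):

```rust
// Sketch of pre-filtering (Option 3): inspect the envelope's message_type
// before attempting payload deserialization, then hand the raw payload to
// the matching component.
struct Envelope {
    message_type: String,
    payload: String, // raw JSON, deserialized only by the chosen handler
}

fn route(envelope: &Envelope) -> &'static str {
    match envelope.message_type.as_str() {
        "enforcement.created" => "enforcement_processor",
        "execution.requested" => "scheduler",
        "execution.status" => "execution_manager",
        _ => "dead_letter_queue", // unknown types still go to the DLQ
    }
}
```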

Option 4: Dead Letter Queue Recovery

  • Add DLQ consumer that re-routes messages to correct queues
  • Band-aid solution, doesn't fix root cause

Next Steps

  1. Immediate Priority: Implement separate queues (Option 1)

    • Update config.yaml with new queue definitions
    • Modify enforcement processor to publish to dedicated queue
    • Update scheduler and manager to consume from their specific queues
    • Test end-to-end flow
  2. Verification Steps:

    • Confirm execution moves from requested → scheduled → running
    • Verify message reaches worker queue
    • Check worker executes shell command
    • Validate "hello, world" appears in execution result
  3. Future Improvements:

    • Remove deprecated worker.runtime column
    • Implement topic-based routing for better scalability
    • Add message type validation at queue level
    • Create monitoring for DLQ depth

Files Modified

  • attune/crates/executor/src/scheduler.rs - Runtime matching + routing key
  • attune/crates/executor/src/execution_manager.rs - Payload standardization
  • attune/crates/executor/src/enforcement_processor.rs - Uses shared payload

Testing Notes

Services running:

  • Worker: Listening on worker.1.executions queue
  • Executor: All three consumers competing on one queue ⚠️
  • Sensor: Generating events every 10 seconds

Database state:

  • 4 executions reached scheduled status in previous runs
  • All new executions stuck in requested status since current run

Conclusion

The worker runtime matching fix is complete and correct. The system can select appropriate workers based on capabilities. However, the message queue architecture has a fundamental flaw where multiple consumers compete for messages they cannot process. This must be resolved before the happy path can be completed.

The fix is well-understood and straightforward: implement separate queues for different message types. This is the standard pattern for message-driven architectures and will eliminate the consumer competition issue.