# Work Summary: Critical Fixes for Timer-Driven Execution Pipeline

**Date:** 2026-01-16

**Status:** COMPLETED ✅

---

## Overview

Fixed two P0 blocking issues that prevented the timer-driven automation pipeline from completing end-to-end:

1. **Message Loop Bug** - The execution manager reprocessed completion messages infinitely
2. **Worker Runtime Resolution** - The worker was unable to find the correct runtime for actions

Both issues are now resolved, and executions complete successfully with the correct runtime.

---
## Issue 1: Message Loop in Execution Manager

### Problem

The execution manager entered an infinite loop in which `ExecutionCompleted` messages were routed back to the status queue and reprocessed indefinitely.

**Symptoms:**

```
INFO Processing status change for execution 137: Failed
INFO Updated execution 137 status: Failed -> Failed
INFO Handling completion for execution: 137
INFO Published execution.completed notification for execution: 137
[repeats infinitely]
```

### Root Cause

The execution manager's queue was bound with the wildcard pattern `execution.status.#`, which matched:

- `execution.status.changed` ✅ (intended)
- `execution.completed` ❌ (unintended - caused the loop)

When an execution completed, the manager published an `ExecutionCompleted` message with routing key `execution.completed`. Because of the wildcard binding, this message was routed back to the same queue, causing infinite reprocessing.

### Solution

Changed the queue binding in `crates/common/src/mq/connection.rs` from a wildcard to an exact match:

```rust
// Before
self.bind_queue(
    &config.rabbitmq.queues.execution_status.name,
    &config.rabbitmq.exchanges.executions.name,
    "execution.status.#", // Matches ALL execution.status.* messages
)

// After
self.bind_queue(
    &config.rabbitmq.queues.execution_status.name,
    &config.rabbitmq.exchanges.executions.name,
    "execution.status.changed", // Only matches status change messages
)
```
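Why the exact binding closes the loop can be sketched with a standalone matcher. This is a simplified model of standard AMQP-style `#`/`*` topic semantics (an assumption; the project's actual `bind_queue` behavior may differ), not code from the repository:

```rust
// Illustrative sketch of AMQP-style topic matching: '#' matches zero or
// more dot-separated words, '*' matches exactly one word.
fn topic_matches(binding: &str, routing_key: &str) -> bool {
    fn rec(b: &[&str], k: &[&str]) -> bool {
        match (b.first(), k.first()) {
            (None, None) => true,
            // '#' can consume zero words (skip it) or one word (stay on it)
            (Some(&"#"), _) => rec(&b[1..], k) || (!k.is_empty() && rec(b, &k[1..])),
            (Some(&"*"), Some(_)) => rec(&b[1..], &k[1..]),
            (Some(bw), Some(kw)) if bw == kw => rec(&b[1..], &k[1..]),
            _ => false,
        }
    }
    rec(
        &binding.split('.').collect::<Vec<_>>(),
        &routing_key.split('.').collect::<Vec<_>>(),
    )
}

fn main() {
    // With the exact binding, status changes are still delivered...
    assert!(topic_matches("execution.status.changed", "execution.status.changed"));
    // ...but completion notifications no longer land in the status queue.
    assert!(!topic_matches("execution.status.changed", "execution.completed"));
    println!("exact binding delivers only execution.status.changed");
}
```

The key property is that an exact binding key admits exactly one routing key, so the completion notification published by the manager can never re-enter its own queue.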
### Files Modified

- `crates/common/src/mq/connection.rs` - Line 407

### Result

✅ ExecutionCompleted messages no longer route to status queue
✅ Manager processes each status change exactly once
✅ No more infinite loops

---
## Issue 2: Worker Runtime Resolution Failure

### Problem

The worker received execution messages but failed with:

```
ERROR Runtime not found: No runtime found for action: core.echo
```

Even though:

- The worker had the shell runtime available ✅
- Action `core.echo` requires the shell runtime ✅
- Runtime matching logic existed ✅

### Root Cause

The worker's runtime selection relied on `can_execute()` methods that checked file extensions and action_ref patterns:

- The Python runtime checked for `.py` in the action_ref
- The shell runtime checked for specific patterns
- `core.echo` didn't match any pattern

The action's runtime metadata (stored in the database as `runtime: 3` → shell runtime) was being ignored.

### Solution Architecture

**1. Added a `runtime_name` field to `ExecutionContext`**

```rust
pub struct ExecutionContext {
    // ... existing fields
    pub runtime_name: Option<String>, // NEW: "python", "shell", etc.
}
```

**2. Updated the worker executor to load the runtime from the database**

```rust
// Load runtime information if specified
let runtime_name = if let Some(runtime_id) = action.runtime {
    match sqlx::query_as::<_, RuntimeModel>("SELECT * FROM attune.runtime WHERE id = $1")
        .bind(runtime_id)
        .fetch_optional(&self.pool)
        .await
    {
        Ok(Some(runtime)) => {
            debug!("Loaded runtime '{}' for action '{}'", runtime.name, action.r#ref);
            Some(runtime.name)
        }
        // ... error handling
    }
} else {
    None
};
```

**3. Updated `RuntimeRegistry::get_runtime()` to prefer `runtime_name`**

```rust
pub fn get_runtime(&self, context: &ExecutionContext) -> RuntimeResult<&dyn Runtime> {
    // If runtime_name is specified, use it directly (no guessing!)
    if let Some(ref runtime_name) = context.runtime_name {
        return self
            .runtimes
            .iter()
            .find(|r| r.name() == runtime_name)
            .map(|r| r.as_ref())
            .ok_or_else(|| {
                RuntimeError::RuntimeNotFound(format!(
                    "Runtime '{}' not found (available: {})",
                    runtime_name,
                    self.list_runtimes().join(", ")
                ))
            });
    }

    // Fall back to can_execute() for ad-hoc executions
    self.runtimes
        .iter()
        .find(|r| r.can_execute(context))
        .map(|r| r.as_ref())
        .ok_or_else(|| RuntimeError::RuntimeNotFound(...))
}
```

**4. Updated the worker to register individual runtimes**

```rust
// Register individual runtimes so they can be matched by name
runtime_registry.register(Box::new(PythonRuntime::new()));
runtime_registry.register(Box::new(ShellRuntime::new()));

// Also keep local runtime as fallback
runtime_registry.register(Box::new(LocalRuntime::new()));
```
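The selection flow above can be modeled end-to-end in a few lines. This is an illustrative sketch with simplified hypothetical types (the real trait takes an `ExecutionContext` and returns `RuntimeError`s rather than `Option`), showing name-first lookup with the `can_execute()` fallback:

```rust
// Simplified model of the registry: types are illustrative, not the
// project's actual API.
trait Runtime {
    fn name(&self) -> &str;
    fn can_execute(&self, action_ref: &str) -> bool;
}

struct PythonRuntime;
impl Runtime for PythonRuntime {
    fn name(&self) -> &str { "python" }
    fn can_execute(&self, action_ref: &str) -> bool { action_ref.ends_with(".py") }
}

struct ShellRuntime;
impl Runtime for ShellRuntime {
    fn name(&self) -> &str { "shell" }
    fn can_execute(&self, action_ref: &str) -> bool { action_ref.ends_with(".sh") }
}

struct Registry { runtimes: Vec<Box<dyn Runtime>> }

impl Registry {
    fn get_runtime(&self, runtime_name: Option<&str>, action_ref: &str) -> Option<&dyn Runtime> {
        // 1. Authoritative path: match by name from database metadata.
        if let Some(name) = runtime_name {
            return self.runtimes.iter().find(|r| r.name() == name).map(|r| r.as_ref());
        }
        // 2. Fallback: pattern-based guessing for ad-hoc executions.
        self.runtimes.iter().find(|r| r.can_execute(action_ref)).map(|r| r.as_ref())
    }
}

fn main() {
    let reg = Registry { runtimes: vec![Box::new(PythonRuntime), Box::new(ShellRuntime)] };
    // `core.echo` matches no extension pattern, but the database says "shell".
    assert_eq!(reg.get_runtime(Some("shell"), "core.echo").map(|r| r.name()), Some("shell"));
    // Without metadata, the old heuristic still works for ad-hoc scripts.
    assert_eq!(reg.get_runtime(None, "cleanup.py").map(|r| r.name()), Some("python"));
    // `core.echo` with no metadata reproduces the original failure.
    assert!(reg.get_runtime(None, "core.echo").is_none());
}
```

The design point: database metadata is checked first because it is authoritative, while the heuristic path survives only as a fallback for executions that carry no runtime ID.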
### Files Modified

- `crates/worker/src/runtime/mod.rs` - Added runtime_name field, updated get_runtime()
- `crates/worker/src/executor.rs` - Load runtime from database, populate runtime_name
- `crates/worker/src/service.rs` - Register individual Python/Shell runtimes
- Test files updated to include new field:
  - `crates/worker/src/runtime/python.rs`
  - `crates/worker/src/runtime/shell.rs`
  - `crates/worker/src/runtime/local.rs`

### Result

✅ Worker correctly identifies runtime from database metadata
✅ Runtime selection based on authoritative data, not pattern matching
✅ Backward compatible with can_execute() for ad-hoc executions
✅ Actions execute successfully with correct runtime

---
---

## Verification

### Test Execution (Execution ID 180)

```
2026-01-16T04:01:08.319898Z INFO Starting execution: 180
2026-01-16T04:01:08.326189Z INFO Executing shell action: core.echo (execution_id: 180)
2026-01-16T04:01:08.328574Z INFO Execution 180 succeeded
2026-01-16T04:01:08.330727Z INFO Execution 180 completed successfully in 2ms
```

### Database State

```sql
SELECT id, status FROM attune.execution WHERE id >= 177;

 id  |  status
-----+-----------
 182 | completed
 181 | completed
 180 | completed
 179 | completed
 178 | completed
```

### Runtime Registration

```
2026-01-16T04:00:46.998849Z INFO Registered runtimes: ["python", "shell", "local"]
```

---
## Architecture Improvements

### Before

```
Action (core.echo)
└─> Worker receives execution
    └─> RuntimeRegistry.get_runtime()
        └─> Check can_execute() on each runtime
            └─> PythonRuntime.can_execute() → ❌ (no .py)
            └─> ShellRuntime.can_execute() → ❌ (no pattern match)
                └─> ERROR: Runtime not found
```

### After

```
Action (core.echo, runtime_id: 3)
└─> Worker loads action from database
    └─> Query runtime table: runtime_id=3 → name="shell"
        └─> ExecutionContext { runtime_name: Some("shell"), ... }
            └─> RuntimeRegistry.get_runtime()
                └─> Direct name match: "shell" → ShellRuntime ✅
                    └─> SUCCESS: Execute with shell runtime
```

---

## Impact

### Immediate Benefits

1. **End-to-end timer pipeline works** - Events trigger actions successfully
2. **Reliable runtime selection** - Based on database metadata, not heuristics
3. **No more infinite loops** - Execution manager processes messages correctly
4. **Fast execution** - Shell actions complete in ~2-3ms

### System Status

- ✅ Sensor: Firing timer events every 10 seconds
- ✅ Executor: Processing enforcements, scheduling executions
- ✅ Worker: Executing actions with correct runtime
- ✅ End-to-end: Timer → Event → Rule → Enforcement → Execution → Completion

---
## Known Issues

### Message Re-delivery

Executions are being processed multiple times due to message acknowledgment issues. This is separate from the issues fixed above and does not prevent successful execution.

**Example:**

```
Processing execution.scheduled for execution: 180
Starting execution: 180
Execution 180 completed successfully
Processing execution.scheduled for execution: 180   # Re-delivered
Starting execution: 180
Execution 180 completed successfully
```

**Impact:** Minor - executions still complete successfully, just with redundant processing.

**Next Steps:** Investigate message acknowledgment in the worker consumer to ensure proper ACK/NACK handling.
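Until the ACK path is sorted out, one possible stopgap (a hypothetical mitigation, not currently in the codebase) is a worker-side idempotency guard that remembers already-processed execution IDs and turns re-deliveries into no-ops:

```rust
use std::collections::HashSet;

/// Hypothetical dedup guard: remembers execution IDs already processed so a
/// re-delivered `execution.scheduled` message is skipped (and just ACKed).
/// A production version would bound the set and survive worker restarts.
struct DedupGuard {
    seen: HashSet<i64>,
}

impl DedupGuard {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true only the first time an execution ID is observed;
    /// `HashSet::insert` returns false when the value was already present.
    fn first_delivery(&mut self, execution_id: i64) -> bool {
        self.seen.insert(execution_id)
    }
}

fn main() {
    let mut guard = DedupGuard::new();
    assert!(guard.first_delivery(180));  // first delivery: process it
    assert!(!guard.first_delivery(180)); // re-delivery: skip, just ACK
    println!("execution 180 processed once");
}
```

This only masks the symptom; fixing the ACK/NACK handling in the consumer remains the real next step.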
---

## Testing Recommendations

### Manual Testing

```bash
# 1. Start all services
./target/debug/attune-sensor -c config.development.yaml &
./target/debug/attune-executor -c config.development.yaml &
./target/debug/attune-worker -c config.development.yaml &

# 2. Monitor logs
tail -f /tmp/sensor.log | grep "Timer.*fired"
tail -f /tmp/executor.log | grep "Execution.*scheduled"
tail -f /tmp/worker.log | grep "completed successfully"

# 3. Check database
psql -U postgres -h localhost -d attune \
  -c "SELECT id, status, action_ref FROM attune.execution ORDER BY id DESC LIMIT 10;"
```

### Integration Tests

- Test timer events trigger executions
- Test Python actions (verify python runtime)
- Test shell actions (verify shell runtime)
- Test execution lifecycle transitions
- Test message routing (no loops, no duplicate processing)
---

## Configuration Note

Added message queue configuration to `config.development.yaml`:

```yaml
message_queue:
  url: amqp://guest:guest@localhost:5672
```

This section was missing and caused the sensor/executor/worker to fail on startup in the development environment.

---
## Summary

Both P0 blocking issues are **RESOLVED**:

1. ✅ Message loop fixed via queue binding change
2. ✅ Runtime resolution fixed via database-driven selection

The timer-driven automation pipeline now works end-to-end:

```
Timer (10s) → Event → Rule Match → Enforcement → Execution → Worker (shell) → Completion (2-3ms)
```

**Next Priority:** Address the message re-delivery issue to eliminate redundant processing.