re-uploading work
This commit is contained in:
315
work-summary/sessions/2026-01-16-critical-fixes.md
Normal file
315
work-summary/sessions/2026-01-16-critical-fixes.md
Normal file
@@ -0,0 +1,315 @@
|
||||
# Work Summary: Critical Fixes for Timer-Driven Execution Pipeline
|
||||
**Date:** 2026-01-16
|
||||
**Status:** COMPLETED ✅
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Fixed two P0 blocking issues that prevented the timer-driven automation pipeline from completing end-to-end:
|
||||
1. **Message Loop Bug** - Execution manager processing completion messages infinitely
|
||||
2. **Worker Runtime Resolution** - Worker unable to find correct runtime for actions
|
||||
|
||||
Both issues are now resolved, and executions complete successfully with the correct runtime.
|
||||
|
||||
---
|
||||
|
||||
## Issue 1: Message Loop in Execution Manager
|
||||
|
||||
### Problem
|
||||
Execution manager entered an infinite loop where `ExecutionCompleted` messages were routed back to the status queue and reprocessed indefinitely.
|
||||
|
||||
**Symptoms:**
|
||||
```
|
||||
INFO Processing status change for execution 137: Failed
|
||||
INFO Updated execution 137 status: Failed -> Failed
|
||||
INFO Handling completion for execution: 137
|
||||
INFO Published execution.completed notification for execution: 137
|
||||
[repeats infinitely]
|
||||
```
|
||||
|
||||
### Root Cause
|
||||
The execution manager's queue was bound with wildcard pattern `execution.status.#` which matched:
|
||||
- `execution.status.changed` ✅ (intended)
|
||||
- `execution.completed` ❌ (unintended - caused loop)
|
||||
|
||||
When an execution completed, the manager published an `ExecutionCompleted` message with routing key `execution.completed`. This message was routed back to the same queue due to the wildcard pattern, causing infinite reprocessing.
|
||||
|
||||
### Solution
|
||||
Changed queue binding in `common/src/mq/connection.rs` from wildcard to exact match:
|
||||
```rust
|
||||
// Before
|
||||
self.bind_queue(
|
||||
&config.rabbitmq.queues.execution_status.name,
|
||||
&config.rabbitmq.exchanges.executions.name,
|
||||
"execution.status.#", // Matches ALL execution.status.* messages
|
||||
)
|
||||
|
||||
// After
|
||||
self.bind_queue(
|
||||
&config.rabbitmq.queues.execution_status.name,
|
||||
&config.rabbitmq.exchanges.executions.name,
|
||||
"execution.status.changed", // Only matches status change messages
|
||||
)
|
||||
```
|
||||
|
||||
### Files Modified
|
||||
- `crates/common/src/mq/connection.rs` - Line 407
|
||||
|
||||
### Result
|
||||
✅ ExecutionCompleted messages no longer route to status queue
|
||||
✅ Manager processes each status change exactly once
|
||||
✅ No more infinite loops
|
||||
|
||||
---
|
||||
|
||||
## Issue 2: Worker Runtime Resolution Failure
|
||||
|
||||
### Problem
|
||||
Worker received execution messages but failed with:
|
||||
```
|
||||
ERROR Runtime not found: No runtime found for action: core.echo
|
||||
```
|
||||
|
||||
Even though:
|
||||
- Worker had shell runtime available ✅
|
||||
- Action `core.echo` requires shell runtime ✅
|
||||
- Runtime matching logic existed ✅
|
||||
|
||||
### Root Cause
|
||||
The worker's runtime selection relied on `can_execute()` methods that checked file extensions and action_ref patterns:
|
||||
- Python runtime checked for `.py` in action_ref
|
||||
- Shell runtime checked for specific patterns
|
||||
- `core.echo` didn't match any patterns
|
||||
|
||||
The action's runtime metadata (stored in database as `runtime: 3` → shell runtime) was being ignored.
|
||||
|
||||
### Solution Architecture
|
||||
|
||||
**1. Added `runtime_name` field to `ExecutionContext`**
|
||||
```rust
|
||||
pub struct ExecutionContext {
|
||||
// ... existing fields
|
||||
pub runtime_name: Option<String>, // NEW: "python", "shell", etc.
|
||||
}
|
||||
```
|
||||
|
||||
**2. Updated worker executor to load runtime from database**
|
||||
```rust
|
||||
// Load runtime information if specified
|
||||
let runtime_name = if let Some(runtime_id) = action.runtime {
|
||||
match sqlx::query_as::<_, RuntimeModel>("SELECT * FROM attune.runtime WHERE id = $1")
|
||||
.bind(runtime_id)
|
||||
.fetch_optional(&self.pool)
|
||||
.await
|
||||
{
|
||||
Ok(Some(runtime)) => {
|
||||
debug!("Loaded runtime '{}' for action '{}'", runtime.name, action.r#ref);
|
||||
Some(runtime.name)
|
||||
}
|
||||
// ... error handling
|
||||
}
|
||||
} else {
|
||||
None
|
||||
};
|
||||
```
|
||||
|
||||
**3. Updated `RuntimeRegistry::get_runtime()` to prefer runtime_name**
|
||||
```rust
|
||||
pub fn get_runtime(&self, context: &ExecutionContext) -> RuntimeResult<&dyn Runtime> {
|
||||
// If runtime_name is specified, use it directly (no guessing!)
|
||||
if let Some(ref runtime_name) = context.runtime_name {
|
||||
return self
|
||||
.runtimes
|
||||
.iter()
|
||||
.find(|r| r.name() == runtime_name)
|
||||
.map(|r| r.as_ref())
|
||||
.ok_or_else(|| {
|
||||
RuntimeError::RuntimeNotFound(format!(
|
||||
"Runtime '{}' not found (available: {})",
|
||||
runtime_name,
|
||||
self.list_runtimes().join(", ")
|
||||
))
|
||||
});
|
||||
}
|
||||
|
||||
// Fall back to can_execute() for ad-hoc executions
|
||||
self.runtimes
|
||||
.iter()
|
||||
.find(|r| r.can_execute(context))
|
||||
.map(|r| r.as_ref())
|
||||
.ok_or_else(|| RuntimeError::RuntimeNotFound(...))
|
||||
}
|
||||
```
|
||||
|
||||
**4. Updated worker to register individual runtimes**
|
||||
```rust
|
||||
// Register individual runtimes so they can be matched by name
|
||||
runtime_registry.register(Box::new(PythonRuntime::new()));
|
||||
runtime_registry.register(Box::new(ShellRuntime::new()));
|
||||
|
||||
// Also keep local runtime as fallback
|
||||
runtime_registry.register(Box::new(LocalRuntime::new()));
|
||||
```
|
||||
|
||||
### Files Modified
|
||||
- `crates/worker/src/runtime/mod.rs` - Added runtime_name field, updated get_runtime()
|
||||
- `crates/worker/src/executor.rs` - Load runtime from database, populate runtime_name
|
||||
- `crates/worker/src/service.rs` - Register individual Python/Shell runtimes
|
||||
- Test files updated to include new field:
|
||||
- `crates/worker/src/runtime/python.rs`
|
||||
- `crates/worker/src/runtime/shell.rs`
|
||||
- `crates/worker/src/runtime/local.rs`
|
||||
|
||||
### Result
|
||||
✅ Worker correctly identifies runtime from database metadata
|
||||
✅ Runtime selection based on authoritative data, not pattern matching
|
||||
✅ Backward compatible with can_execute() for ad-hoc executions
|
||||
✅ Actions execute successfully with correct runtime
|
||||
|
||||
---
|
||||
|
||||
## Verification
|
||||
|
||||
### Test Execution (Execution ID 180)
|
||||
```
|
||||
2026-01-16T04:01:08.319898Z INFO Starting execution: 180
|
||||
2026-01-16T04:01:08.326189Z INFO Executing shell action: core.echo (execution_id: 180)
|
||||
2026-01-16T04:01:08.328574Z INFO Execution 180 succeeded
|
||||
2026-01-16T04:01:08.330727Z INFO Execution 180 completed successfully in 2ms
|
||||
```
|
||||
|
||||
### Database State
|
||||
```sql
|
||||
SELECT id, status FROM attune.execution WHERE id >= 177;
|
||||
id | status
|
||||
-----+-----------
|
||||
182 | completed
|
||||
181 | completed
|
||||
180 | completed
|
||||
179 | completed
|
||||
178 | completed
|
||||
```
|
||||
|
||||
### Runtime Registration
|
||||
```
|
||||
2026-01-16T04:00:46.998849Z INFO Registered runtimes: ["python", "shell", "local"]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Architecture Improvements
|
||||
|
||||
### Before
|
||||
```
|
||||
Action (core.echo)
|
||||
└─> Worker receives execution
|
||||
└─> RuntimeRegistry.get_runtime()
|
||||
└─> Check can_execute() on each runtime
|
||||
└─> PythonRuntime.can_execute() → ❌ (no .py)
|
||||
└─> ShellRuntime.can_execute() → ❌ (no pattern match)
|
||||
└─> ERROR: Runtime not found
|
||||
```
|
||||
|
||||
### After
|
||||
```
|
||||
Action (core.echo, runtime_id: 3)
|
||||
└─> Worker loads action from database
|
||||
└─> Query runtime table: runtime_id=3 → name="shell"
|
||||
└─> ExecutionContext { runtime_name: Some("shell"), ... }
|
||||
└─> RuntimeRegistry.get_runtime()
|
||||
└─> Direct name match: "shell" → ShellRuntime ✅
|
||||
└─> SUCCESS: Execute with shell runtime
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
### Immediate Benefits
|
||||
1. **End-to-end timer pipeline works** - Events trigger actions successfully
|
||||
2. **Reliable runtime selection** - Based on database metadata, not heuristics
|
||||
3. **No more infinite loops** - Execution manager processes messages correctly
|
||||
4. **Fast execution** - Shell actions complete in ~2-3ms
|
||||
|
||||
### System Status
|
||||
- ✅ Sensor: Firing timer events every 10 seconds
|
||||
- ✅ Executor: Processing enforcements, scheduling executions
|
||||
- ✅ Worker: Executing actions with correct runtime
|
||||
- ✅ End-to-end: Timer → Event → Rule → Enforcement → Execution → Completion
|
||||
|
||||
---
|
||||
|
||||
## Known Issues
|
||||
|
||||
### Message Re-delivery
|
||||
Executions are being processed multiple times due to message acknowledgment issues. This is a separate concern from the fixed issues and does not prevent successful execution.
|
||||
|
||||
**Example:**
|
||||
```
|
||||
Processing execution.scheduled for execution: 180
|
||||
Starting execution: 180
|
||||
Execution 180 completed successfully
|
||||
Processing execution.scheduled for execution: 180 # Re-delivered
|
||||
Starting execution: 180
|
||||
Execution 180 completed successfully
|
||||
```
|
||||
|
||||
**Impact:** Minor - Executions still complete successfully, just with redundant processing.
|
||||
|
||||
**Next Steps:** Investigate message acknowledgment in worker consumer to ensure proper ACK/NACK handling.
|
||||
|
||||
---
|
||||
|
||||
## Testing Recommendations
|
||||
|
||||
### Manual Testing
|
||||
```bash
|
||||
# 1. Start all services
|
||||
./target/debug/attune-sensor -c config.development.yaml &
|
||||
./target/debug/attune-executor -c config.development.yaml &
|
||||
./target/debug/attune-worker -c config.development.yaml &
|
||||
|
||||
# 2. Monitor logs
|
||||
tail -f /tmp/sensor.log | grep "Timer.*fired"
|
||||
tail -f /tmp/executor.log | grep "Execution.*scheduled"
|
||||
tail -f /tmp/worker.log | grep "completed successfully"
|
||||
|
||||
# 3. Check database
|
||||
psql -U postgres -h localhost -d attune \
|
||||
-c "SELECT id, status, action_ref FROM attune.execution ORDER BY id DESC LIMIT 10;"
|
||||
```
|
||||
|
||||
### Integration Tests
|
||||
- Test timer events trigger executions
|
||||
- Test Python actions (verify python runtime)
|
||||
- Test shell actions (verify shell runtime)
|
||||
- Test execution lifecycle transitions
|
||||
- Test message routing (no loops, no duplicate processing)
|
||||
|
||||
---
|
||||
|
||||
## Configuration Note
|
||||
|
||||
Added message queue configuration to `config.development.yaml`:
|
||||
```yaml
|
||||
message_queue:
|
||||
url: amqp://guest:guest@localhost:5672
|
||||
```
|
||||
|
||||
This was missing and caused sensor/executor/worker to fail on startup in development environment.
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
Both P0 blocking issues are **RESOLVED**:
|
||||
1. ✅ Message loop fixed via queue binding change
|
||||
2. ✅ Runtime resolution fixed via database-driven selection
|
||||
|
||||
The timer-driven automation pipeline now works end-to-end:
|
||||
```
|
||||
Timer (10s) → Event → Rule Match → Enforcement → Execution → Worker (shell) → Completion (2-3ms)
|
||||
```
|
||||
|
||||
**Next Priority:** Address message re-delivery issue to eliminate redundant processing.
|
||||
Reference in New Issue
Block a user