205 lines
7.6 KiB
Markdown
205 lines
7.6 KiB
Markdown
# Quick Reference: Execution State Ownership
|
|
|
|
**Last Updated**: 2026-02-09
|
|
|
|
## Ownership Model at a Glance
|
|
|
|
```
|
|
┌──────────────────────────────────────────────────────────┐
|
|
│ EXECUTOR OWNS │ WORKER OWNS │
|
|
│ Requested │ Running │
|
|
│ Scheduling │ Completed │
|
|
│ Scheduled │ Failed │
|
|
│ (+ pre-handoff Cancelled) │ (+ post-handoff │
|
|
│ │ Cancelled/Timeout/ │
|
|
│ │ Abandoned) │
|
|
└───────────────────────────────┴──────────────────────────┘
|
|
│ │
|
|
└─────── HANDOFF ──────────┘
|
|
execution.scheduled PUBLISHED
|
|
```
|
|
|
|
## Who Updates the Database?
|
|
|
|
### Executor Updates (Pre-Handoff Only)
|
|
- ✅ Creates execution record
|
|
- ✅ Updates status: `Requested` → `Scheduling` → `Scheduled`
|
|
- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
|
|
- ✅ Handles cancellations/failures BEFORE handoff (worker never notified)
|
|
- ❌ NEVER updates after `execution.scheduled` is published
|
|
|
|
### Worker Updates (Post-Handoff Only)
|
|
- ✅ Receives `execution.scheduled` message (takes ownership)
|
|
- ✅ Updates status: `Scheduled` → `Running`
|
|
- ✅ Updates status: `Running` → `Completed`/`Failed`/`Cancelled`/etc.
|
|
- ✅ Handles cancellations/failures AFTER handoff
|
|
- ✅ Updates result data
|
|
- ✅ Writes for every status change after receiving handoff
|
|
|
|
## Who Publishes Messages?
|
|
|
|
### Executor Publishes
|
|
- `enforcement.created` (from rules)
|
|
- `execution.requested` (to scheduler)
|
|
- `execution.scheduled` (to worker) **← HANDOFF MESSAGE - OWNERSHIP TRANSFER**
|
|
|
|
### Worker Publishes
|
|
- `execution.status_changed` (for each status change after handoff)
|
|
- `execution.completed` (when done)
|
|
|
|
### Executor Receives (But Doesn't Update DB Post-Handoff)
|
|
- `execution.status_changed` → triggers orchestration logic (read-only)
|
|
- `execution.completed` → releases queue slots
|
|
|
|
## Code Locations
|
|
|
|
### Executor Updates DB
|
|
```rust
|
|
// crates/executor/src/scheduler.rs
|
|
execution.status = ExecutionStatus::Scheduled;
|
|
ExecutionRepository::update(pool, execution.id, execution.into()).await?;
|
|
```
|
|
|
|
### Worker Updates DB
|
|
```rust
|
|
// crates/worker/src/executor.rs
|
|
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
|
|
// ...
|
|
ExecutionRepository::update(&self.pool, execution_id, input).await?;
|
|
```
|
|
|
|
### Executor Orchestrates (Read-Only)
|
|
```rust
|
|
// crates/executor/src/execution_manager.rs
|
|
async fn process_status_change(...) -> Result<()> {
|
|
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
|
|
// NO UPDATE - just orchestration logic
|
|
Self::handle_completion(pool, publisher, &execution).await?;
|
|
}
|
|
```
|
|
|
|
## Decision Tree: Should I Update the DB?
|
|
|
|
```
|
|
Are you in the Executor?
|
|
├─ Have you published execution.scheduled for this execution?
|
|
│ ├─ NO → Update DB (you own it)
|
|
│ │ └─ Includes: Requested/Scheduling/Scheduled/pre-handoff Cancelled
|
|
│ └─ YES → Don't update DB (worker owns it now)
|
|
│ └─ Just orchestrate (trigger workflows, etc)
|
|
│
|
|
Are you in the Worker?
|
|
├─ Have you received execution.scheduled for this execution?
|
|
│ ├─ YES → Update DB for ALL status changes (you own it)
|
|
│ │ └─ Includes: Running/Completed/Failed/post-handoff Cancelled/etc.
|
|
│ └─ NO → Don't touch this execution (doesn't exist for you yet)
|
|
```
|
|
|
|
## Common Patterns
|
|
|
|
### ✅ DO: Worker Updates After Handoff
|
|
```rust
|
|
// Worker receives execution.scheduled
|
|
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
|
|
self.publish_status_update(execution_id, ExecutionStatus::Running).await?;
|
|
```
|
|
|
|
### ✅ DO: Executor Orchestrates Without DB Write
|
|
```rust
|
|
// Executor receives execution.status_changed
|
|
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
|
|
if status == ExecutionStatus::Completed {
|
|
Self::trigger_child_executions(pool, publisher, &execution).await?;
|
|
}
|
|
```
|
|
|
|
### ❌ DON'T: Executor Updates After Handoff
|
|
```rust
|
|
// Executor receives execution.status_changed
|
|
execution.status = status;
|
|
ExecutionRepository::update(pool, execution.id, execution).await?; // ❌ WRONG!
|
|
```
|
|
|
|
### ❌ DON'T: Worker Updates Before Handoff
|
|
```rust
|
|
// Worker updates execution it hasn't received via execution.scheduled
|
|
ExecutionRepository::update(&self.pool, execution_id, input).await?; // ❌ WRONG!
|
|
```
|
|
|
|
### ✅ DO: Executor Handles Pre-Handoff Cancellation
|
|
```rust
|
|
// User cancels execution before it's scheduled to worker
|
|
// Execution is still in Requested/Scheduling state
|
|
execution.status = ExecutionStatus::Cancelled;
|
|
ExecutionRepository::update(pool, execution_id, execution).await?; // ✅ CORRECT!
|
|
// Worker never receives execution.scheduled, never knows execution existed
|
|
```
|
|
|
|
### ✅ DO: Worker Handles Post-Handoff Cancellation
|
|
```rust
|
|
// Worker received execution.scheduled, now owns execution
|
|
// User cancels execution while it's running
|
|
execution.status = ExecutionStatus::Cancelled;
|
|
ExecutionRepository::update(&self.pool, execution_id, execution).await?; // ✅ CORRECT!
|
|
self.publish_status_update(execution_id, ExecutionStatus::Cancelled).await?;
|
|
```
|
|
|
|
## Handoff Checklist
|
|
|
|
When an execution is scheduled:
|
|
|
|
**Executor Must**:
|
|
- [x] Update status to `Scheduled`
|
|
- [x] Write to database
|
|
- [x] Publish `execution.scheduled` message **← HANDOFF OCCURS HERE**
|
|
- [x] Stop updating this execution (ownership transferred)
|
|
- [x] Continue to handle orchestration (read-only)
|
|
|
|
**Worker Must**:
|
|
- [x] Receive `execution.scheduled` message **← OWNERSHIP RECEIVED**
|
|
- [x] Take ownership of execution state
|
|
- [x] Update DB for all future status changes
|
|
- [x] Handle any cancellations/failures after this point
|
|
- [x] Publish status notifications
|
|
|
|
**Important**: If execution is cancelled BEFORE executor publishes `execution.scheduled`, the executor updates status to `Cancelled` and worker never learns about it.
|
|
|
|
## Benefits Summary
|
|
|
|
| Aspect | Benefit |
|
|
|--------|---------|
|
|
| **Race Conditions** | Eliminated - only one owner per stage |
|
|
| **DB Writes** | Reduced by ~50% - no duplicates |
|
|
| **Code Clarity** | Clear boundaries - easy to reason about |
|
|
| **Message Traffic** | Reduced - no duplicate completions |
|
|
| **Idempotency** | Safe to receive duplicate messages |
|
|
|
|
## Troubleshooting
|
|
|
|
### Execution Stuck in "Scheduled"
|
|
**Problem**: Worker not updating status to Running
|
|
**Check**: Was execution.scheduled published? Worker received it? Worker healthy?
|
|
|
|
### Workflow Children Not Triggering
|
|
**Problem**: Orchestration not running
|
|
**Check**: Worker published execution.status_changed? Message queue healthy?
|
|
|
|
### Duplicate Status Updates
|
|
**Problem**: Both services updating DB
|
|
**Check**: Executor should NOT update after publishing execution.scheduled
|
|
|
|
### Execution Cancelled But Status Not Updated
|
|
**Problem**: Cancellation not reflected in database
|
|
**Check**: Was it cancelled before or after handoff?
|
|
**Fix**: If before handoff → executor updates; if after handoff → worker updates
|
|
|
|
### Queue Warnings
|
|
**Problem**: Duplicate completion notifications
|
|
**Check**: Only worker should publish execution.completed
|
|
|
|
## See Also
|
|
|
|
- **Full Architecture Doc**: `docs/ARCHITECTURE-execution-state-ownership.md`
|
|
- **Bug Fix Visualization**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
|
|
- **Work Summary**: `work-summary/2026-02-09-execution-state-ownership.md`
|