# Quick Reference: Execution State Ownership **Last Updated**: 2026-02-09 ## Ownership Model at a Glance ``` ┌──────────────────────────────────────────────────────────┐ │ EXECUTOR OWNS │ WORKER OWNS │ │ Requested │ Running │ │ Scheduling │ Completed │ │ Scheduled │ Failed │ │ (+ pre-handoff Cancelled) │ (+ post-handoff │ │ │ Cancelled/Timeout/ │ │ │ Abandoned) │ └───────────────────────────────┴──────────────────────────┘ │ │ └─────── HANDOFF ──────────┘ execution.scheduled PUBLISHED ``` ## Who Updates the Database? ### Executor Updates (Pre-Handoff Only) - ✅ Creates execution record - ✅ Updates status: `Requested` → `Scheduling` → `Scheduled` - ✅ Publishes `execution.scheduled` message **← HANDOFF POINT** - ✅ Handles cancellations/failures BEFORE handoff (worker never notified) - ❌ NEVER updates after `execution.scheduled` is published ### Worker Updates (Post-Handoff Only) - ✅ Receives `execution.scheduled` message (takes ownership) - ✅ Updates status: `Scheduled` → `Running` - ✅ Updates status: `Running` → `Completed`/`Failed`/`Cancelled`/etc. - ✅ Handles cancellations/failures AFTER handoff - ✅ Updates result data - ✅ Writes for every status change after receiving handoff ## Who Publishes Messages? ### Executor Publishes - `enforcement.created` (from rules) - `execution.requested` (to scheduler) - `execution.scheduled` (to worker) **← HANDOFF MESSAGE - OWNERSHIP TRANSFER** ### Worker Publishes - `execution.status_changed` (for each status change after handoff) - `execution.completed` (when done) ### Executor Receives (But Doesn't Update DB Post-Handoff) - `execution.status_changed` → triggers orchestration logic (read-only) - `execution.completed` → releases queue slots ## Code Locations ### Executor Updates DB ```rust // crates/executor/src/scheduler.rs execution.status = ExecutionStatus::Scheduled; ExecutionRepository::update(pool, execution.id, execution.into()).await?; ``` ### Worker Updates DB ```rust // crates/worker/src/executor.rs self.update_execution_status(execution_id, ExecutionStatus::Running).await?; // ... ExecutionRepository::update(&self.pool, execution_id, input).await?; ``` ### Executor Orchestrates (Read-Only) ```rust // crates/executor/src/execution_manager.rs async fn process_status_change(...) -> Result<()> { let execution = ExecutionRepository::find_by_id(pool, execution_id).await?; // NO UPDATE - just orchestration logic Self::handle_completion(pool, publisher, &execution).await?; } ``` ## Decision Tree: Should I Update the DB? ``` Are you in the Executor? ├─ Have you published execution.scheduled for this execution? │ ├─ NO → Update DB (you own it) │ │ └─ Includes: Requested/Scheduling/Scheduled/pre-handoff Cancelled │ └─ YES → Don't update DB (worker owns it now) │ └─ Just orchestrate (trigger workflows, etc) │ Are you in the Worker? ├─ Have you received execution.scheduled for this execution? │ ├─ YES → Update DB for ALL status changes (you own it) │ │ └─ Includes: Running/Completed/Failed/post-handoff Cancelled/etc. │ └─ NO → Don't touch this execution (doesn't exist for you yet) ``` ## Common Patterns ### ✅ DO: Worker Updates After Handoff ```rust // Worker receives execution.scheduled self.update_execution_status(execution_id, ExecutionStatus::Running).await?; self.publish_status_update(execution_id, ExecutionStatus::Running).await?; ``` ### ✅ DO: Executor Orchestrates Without DB Write ```rust // Executor receives execution.status_changed let execution = ExecutionRepository::find_by_id(pool, execution_id).await?; if status == ExecutionStatus::Completed { Self::trigger_child_executions(pool, publisher, &execution).await?; } ``` ### ❌ DON'T: Executor Updates After Handoff ```rust // Executor receives execution.status_changed execution.status = status; ExecutionRepository::update(pool, execution.id, execution).await?; // ❌ WRONG! ``` ### ❌ DON'T: Worker Updates Before Handoff ```rust // Worker updates execution it hasn't received via execution.scheduled ExecutionRepository::update(&self.pool, execution_id, input).await?; // ❌ WRONG! ``` ### ✅ DO: Executor Handles Pre-Handoff Cancellation ```rust // User cancels execution before it's scheduled to worker // Execution is still in Requested/Scheduling state execution.status = ExecutionStatus::Cancelled; ExecutionRepository::update(pool, execution_id, execution).await?; // ✅ CORRECT! // Worker never receives execution.scheduled, never knows execution existed ``` ### ✅ DO: Worker Handles Post-Handoff Cancellation ```rust // Worker received execution.scheduled, now owns execution // User cancels execution while it's running execution.status = ExecutionStatus::Cancelled; ExecutionRepository::update(&self.pool, execution_id, execution).await?; // ✅ CORRECT! self.publish_status_update(execution_id, ExecutionStatus::Cancelled).await?; ``` ## Handoff Checklist When an execution is scheduled: **Executor Must**: - [x] Update status to `Scheduled` - [x] Write to database - [x] Publish `execution.scheduled` message **← HANDOFF OCCURS HERE** - [x] Stop updating this execution (ownership transferred) - [x] Continue to handle orchestration (read-only) **Worker Must**: - [x] Receive `execution.scheduled` message **← OWNERSHIP RECEIVED** - [x] Take ownership of execution state - [x] Update DB for all future status changes - [x] Handle any cancellations/failures after this point - [x] Publish status notifications **Important**: If execution is cancelled BEFORE executor publishes `execution.scheduled`, the executor updates status to `Cancelled` and worker never learns about it. ## Benefits Summary | Aspect | Benefit | |--------|---------| | **Race Conditions** | Eliminated - only one owner per stage | | **DB Writes** | Reduced by ~50% - no duplicates | | **Code Clarity** | Clear boundaries - easy to reason about | | **Message Traffic** | Reduced - no duplicate completions | | **Idempotency** | Safe to receive duplicate messages | ## Troubleshooting ### Execution Stuck in "Scheduled" **Problem**: Worker not updating status to Running **Check**: Was execution.scheduled published? Worker received it? Worker healthy? ### Workflow Children Not Triggering **Problem**: Orchestration not running **Check**: Worker published execution.status_changed? Message queue healthy? ### Duplicate Status Updates **Problem**: Both services updating DB **Check**: Executor should NOT update after publishing execution.scheduled ### Execution Cancelled But Status Not Updated **Problem**: Cancellation not reflected in database **Check**: Was it cancelled before or after handoff? **Fix**: If before handoff → executor updates; if after handoff → worker updates ### Queue Warnings **Problem**: Duplicate completion notifications **Check**: Only worker should publish execution.completed ## See Also - **Full Architecture Doc**: `docs/ARCHITECTURE-execution-state-ownership.md` - **Bug Fix Visualization**: `docs/BUGFIX-duplicate-completion-2026-02-09.md` - **Work Summary**: `work-summary/2026-02-09-execution-state-ownership.md`