more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions
--- a/docs/QUICKREF-execution-state-ownership.md
+++ b/docs/QUICKREF-execution-state-ownership.md
@@ -0,0 +1,204 @@
+# Quick Reference: Execution State Ownership
+
+**Last Updated**: 2026-02-09
+
+## Ownership Model at a Glance
+
+```
+┌──────────────────────────────────────────────────────────┐
+│  EXECUTOR OWNS                │  WORKER OWNS             │
+│  Requested                    │  Running                 │
+│  Scheduling                   │  Completed               │
+│  Scheduled                    │  Failed                  │
+│  (+ pre-handoff Cancelled)    │  (+ post-handoff         │
+│                               │     Cancelled/Timeout/   │
+│                               │     Abandoned)           │
+└───────────────────────────────┴──────────────────────────┘
+            │                           │
+            └─────── HANDOFF ──────────┘
+        execution.scheduled PUBLISHED
+```
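+
+The diagram can be read as a lookup from status to owning service. The sketch below is illustrative only: the enum mirrors the statuses named above, while `Owner`, `owner_of`, and the `handed_off` flag are hypothetical names, not types from the codebase. Treating a pre-handoff `Failed` as executor-owned is an assumption based on the executor handling "cancellations/failures BEFORE handoff".
+
+```rust
+/// Hypothetical mirror of the statuses named in the diagram.
+#[allow(dead_code)]
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum ExecutionStatus {
+    Requested,
+    Scheduling,
+    Scheduled,
+    Running,
+    Completed,
+    Failed,
+    Cancelled,
+    Timeout,
+    Abandoned,
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum Owner {
+    Executor,
+    Worker,
+}
+
+/// Returns the service allowed to write `status`.
+/// `handed_off` is true once `execution.scheduled` has been published.
+fn owner_of(status: ExecutionStatus, handed_off: bool) -> Owner {
+    use ExecutionStatus::*;
+    match status {
+        Requested | Scheduling | Scheduled => Owner::Executor,
+        Running | Completed | Timeout | Abandoned => Owner::Worker,
+        // Assumption: these two follow the handoff flag.
+        Cancelled | Failed => {
+            if handed_off {
+                Owner::Worker
+            } else {
+                Owner::Executor
+            }
+        }
+    }
+}
+
+fn main() {
+    assert_eq!(owner_of(ExecutionStatus::Scheduled, false), Owner::Executor);
+    assert_eq!(owner_of(ExecutionStatus::Cancelled, false), Owner::Executor);
+    assert_eq!(owner_of(ExecutionStatus::Cancelled, true), Owner::Worker);
+    assert_eq!(owner_of(ExecutionStatus::Running, true), Owner::Worker);
+}
+```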
+
+## Who Updates the Database?
+
+### Executor Updates (Pre-Handoff Only)
+- ✅ Creates execution record
+- ✅ Updates status: `Requested` → `Scheduling` → `Scheduled`
+- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
+- ✅ Handles cancellations/failures BEFORE handoff (worker never notified)
+- ❌ NEVER updates after `execution.scheduled` is published
+
+### Worker Updates (Post-Handoff Only)
+- ✅ Receives `execution.scheduled` message (takes ownership)
+- ✅ Updates status: `Scheduled` → `Running`
+- ✅ Updates status: `Running` → `Completed`/`Failed`/`Cancelled`/etc.
+- ✅ Handles cancellations/failures AFTER handoff
+- ✅ Updates result data
+- ✅ Writes to the database for every status change after the handoff
+
+## Who Publishes Messages?
+
+### Executor Publishes
+- `enforcement.created` (from rules)
+- `execution.requested` (to scheduler)
+- `execution.scheduled` (to worker) **← HANDOFF MESSAGE - OWNERSHIP TRANSFER**
+
+### Worker Publishes
+- `execution.status_changed` (for each status change after handoff)
+- `execution.completed` (when done)
+
+### Executor Receives (But Doesn't Update DB Post-Handoff)
+- `execution.status_changed` → triggers orchestration logic (read-only)
+- `execution.completed` → releases queue slots
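+
+As a rough illustration of the executor's receive side, the sketch below maps each incoming worker message to an action that never writes the execution row. `WorkerMessage`, `ExecutorAction`, and `on_worker_message` are hypothetical names invented for this example; the real handlers may be structured differently.
+
+```rust
+/// Hypothetical shape of the two messages the executor receives.
+#[derive(Debug, PartialEq, Eq)]
+enum WorkerMessage {
+    StatusChanged { execution_id: u64 },
+    Completed { execution_id: u64 },
+}
+
+/// What the executor does in response. Neither variant is a DB update
+/// of the execution record.
+#[derive(Debug, PartialEq, Eq)]
+enum ExecutorAction {
+    /// Read the execution and run orchestration logic (read-only).
+    Orchestrate { execution_id: u64 },
+    /// Free the queue slot held by this execution.
+    ReleaseQueueSlot { execution_id: u64 },
+}
+
+fn on_worker_message(msg: &WorkerMessage) -> ExecutorAction {
+    match *msg {
+        WorkerMessage::StatusChanged { execution_id } => {
+            ExecutorAction::Orchestrate { execution_id }
+        }
+        WorkerMessage::Completed { execution_id } => {
+            ExecutorAction::ReleaseQueueSlot { execution_id }
+        }
+    }
+}
+
+fn main() {
+    let action = on_worker_message(&WorkerMessage::Completed { execution_id: 7 });
+    assert_eq!(action, ExecutorAction::ReleaseQueueSlot { execution_id: 7 });
+}
+```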
+
+## Code Locations
+
+### Executor Updates DB
+```rust
+// crates/executor/src/scheduler.rs
+execution.status = ExecutionStatus::Scheduled;
+ExecutionRepository::update(pool, execution.id, execution.into()).await?;
+```
+
+### Worker Updates DB
+```rust
+// crates/worker/src/executor.rs
+self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
+// ...
+ExecutionRepository::update(&self.pool, execution_id, input).await?;
+```
+
+### Executor Orchestrates (Read-Only)
+```rust
+// crates/executor/src/execution_manager.rs
+async fn process_status_change(...) -> Result<()> {
+    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
+    // NO UPDATE - just orchestration logic
+    Self::handle_completion(pool, publisher, &execution).await?;
+    Ok(())
+}
+```
+
+## Decision Tree: Should I Update the DB?
+
+```
+Are you in the Executor?
+├─ Have you published execution.scheduled for this execution?
+│  ├─ NO → Update DB (you own it)
+│  │  └─ Includes: Requested/Scheduling/Scheduled/pre-handoff Cancelled
+│  └─ YES → Don't update DB (worker owns it now)
+│     └─ Just orchestrate (trigger workflows, etc)
+│
+Are you in the Worker?
+├─ Have you received execution.scheduled for this execution?
+│  ├─ YES → Update DB for ALL status changes (you own it)
+│  │  └─ Includes: Running/Completed/Failed/post-handoff Cancelled/etc.
+│  └─ NO → Don't touch this execution (doesn't exist for you yet)
+```
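+
+The tree above has four leaves, so it can be written as one `match`. This is a minimal sketch with hypothetical names (`Service`, `Decision`, `decide`); `handoff_seen` means "published `execution.scheduled`" for the executor and "received `execution.scheduled`" for the worker.
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum Service {
+    Executor,
+    Worker,
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum Decision {
+    /// This service owns the execution and may write its status.
+    UpdateDb,
+    /// Executor after handoff: read and orchestrate only.
+    OrchestrateOnly,
+    /// Worker before handoff: the execution is not visible to it yet.
+    DoNotTouch,
+}
+
+fn decide(service: Service, handoff_seen: bool) -> Decision {
+    match (service, handoff_seen) {
+        (Service::Executor, false) => Decision::UpdateDb,
+        (Service::Executor, true) => Decision::OrchestrateOnly,
+        (Service::Worker, true) => Decision::UpdateDb,
+        (Service::Worker, false) => Decision::DoNotTouch,
+    }
+}
+
+fn main() {
+    assert_eq!(decide(Service::Executor, true), Decision::OrchestrateOnly);
+    assert_eq!(decide(Service::Worker, true), Decision::UpdateDb);
+}
+```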
+
+## Common Patterns
+
+### ✅ DO: Worker Updates After Handoff
+```rust
+// Worker receives execution.scheduled
+self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
+self.publish_status_update(execution_id, ExecutionStatus::Running).await?;
+```
+
+### ✅ DO: Executor Orchestrates Without DB Write
+```rust
+// Executor receives execution.status_changed
+let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
+if status == ExecutionStatus::Completed {
+    Self::trigger_child_executions(pool, publisher, &execution).await?;
+}
+```
+
+### ❌ DON'T: Executor Updates After Handoff
+```rust
+// Executor receives execution.status_changed
+execution.status = status;
+ExecutionRepository::update(pool, execution.id, execution).await?; // ❌ WRONG!
+```
+
+### ❌ DON'T: Worker Updates Before Handoff
+```rust
+// Worker updates execution it hasn't received via execution.scheduled
+ExecutionRepository::update(&self.pool, execution_id, input).await?; // ❌ WRONG!
+```
+
+### ✅ DO: Executor Handles Pre-Handoff Cancellation
+```rust
+// User cancels execution before it's handed off to the worker
+// Execution is still in Requested/Scheduling state
+execution.status = ExecutionStatus::Cancelled;
+ExecutionRepository::update(pool, execution_id, execution).await?; // ✅ CORRECT!
+// Worker never receives execution.scheduled, never knows execution existed
+```
+
+### ✅ DO: Worker Handles Post-Handoff Cancellation
+```rust
+// Worker received execution.scheduled, now owns execution
+// User cancels execution while it's running
+execution.status = ExecutionStatus::Cancelled;
+ExecutionRepository::update(&self.pool, execution_id, execution).await?; // ✅ CORRECT!
+self.publish_status_update(execution_id, ExecutionStatus::Cancelled).await?;
+```
+
+## Handoff Checklist
+
+When an execution is scheduled:
+
+**Executor Must**:
+- [x] Update status to `Scheduled`
+- [x] Write to database
+- [x] Publish `execution.scheduled` message **← HANDOFF OCCURS HERE**
+- [x] Stop updating this execution (ownership transferred)
+- [x] Continue to handle orchestration (read-only)
+
+**Worker Must**:
+- [x] Receive `execution.scheduled` message **← OWNERSHIP RECEIVED**
+- [x] Take ownership of execution state
+- [x] Update DB for all future status changes
+- [x] Handle any cancellations/failures after this point
+- [x] Publish status notifications
+
+**Important**: If an execution is cancelled BEFORE the executor publishes `execution.scheduled`, the executor updates the status to `Cancelled` and the worker never learns about it.
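+
+To make the ordering concrete, here is a toy model of the executor side of the checklist, using an in-memory map and vector as stand-ins for the database and message bus. Everything here (`Executor`, `db`, `outbox`, `schedule`, `cancel`, `handed_off`) is hypothetical and only illustrates the rule "write `Scheduled`, then publish, then stop writing".
+
+```rust
+use std::collections::HashMap;
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum Status {
+    Requested,
+    Scheduling,
+    Scheduled,
+    Cancelled,
+}
+
+#[derive(Default)]
+struct Executor {
+    db: HashMap<u64, Status>,          // stand-in for the execution table
+    outbox: Vec<(u64, &'static str)>,  // stand-in for the message bus
+}
+
+impl Executor {
+    /// True once `execution.scheduled` has been published for `id`.
+    fn handed_off(&self, id: u64) -> bool {
+        self.outbox
+            .iter()
+            .any(|(e, topic)| *e == id && *topic == "execution.scheduled")
+    }
+
+    /// Writes `Scheduled`, then publishes the handoff message.
+    /// Returns false if the execution was cancelled before handoff.
+    fn schedule(&mut self, id: u64) -> bool {
+        if self.db.get(&id) == Some(&Status::Cancelled) {
+            return false; // worker never hears about this execution
+        }
+        self.db.insert(id, Status::Scheduled);
+        self.outbox.push((id, "execution.scheduled"));
+        true
+    }
+
+    /// Executor-side cancel: allowed only before the handoff.
+    fn cancel(&mut self, id: u64) -> bool {
+        if self.handed_off(id) {
+            return false; // worker owns the state now
+        }
+        self.db.insert(id, Status::Cancelled);
+        true
+    }
+}
+
+fn main() {
+    let mut ex = Executor::default();
+    ex.db.insert(1, Status::Requested);
+    ex.db.insert(2, Status::Scheduling);
+
+    // Pre-handoff cancel: executor writes, nothing is published.
+    assert!(ex.cancel(1));
+    assert!(!ex.schedule(1));
+    assert!(!ex.handed_off(1));
+
+    // After handoff the executor refuses to write.
+    assert!(ex.schedule(2));
+    assert!(!ex.cancel(2));
+    assert_eq!(ex.db[&2], Status::Scheduled);
+}
+```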
+
+## Benefits Summary
+
+| Aspect | Benefit |
+|--------|---------|
+| **Race Conditions** | Eliminated - only one owner per stage |
+| **DB Writes** | Reduced by ~50% - no duplicates |
+| **Code Clarity** | Clear boundaries - easy to reason about |
+| **Message Traffic** | Reduced - no duplicate completions |
+| **Idempotency** | Safe to receive duplicate messages |
+
+## Troubleshooting
+
+### Execution Stuck in "Scheduled"
+**Problem**: Worker not updating status to Running  
+**Check**: Was `execution.scheduled` published? Did the worker receive it? Is the worker healthy?
+
+### Workflow Children Not Triggering
+**Problem**: Orchestration not running  
+**Check**: Did the worker publish `execution.status_changed`? Is the message queue healthy?
+
+### Duplicate Status Updates
+**Problem**: Both services updating DB  
+**Check**: Executor should NOT update the DB after publishing `execution.scheduled`
+
+### Execution Cancelled But Status Not Updated
+**Problem**: Cancellation not reflected in database  
+**Check**: Was it cancelled before or after handoff?  
+**Fix**: If before handoff → executor updates; if after handoff → worker updates
+
+### Queue Warnings
+**Problem**: Duplicate completion notifications  
+**Check**: Only worker should publish execution.completed
+
+## See Also
+
+- **Full Architecture Doc**: `docs/ARCHITECTURE-execution-state-ownership.md`
+- **Bug Fix Visualization**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
+- **Work Summary**: `work-summary/2026-02-09-execution-state-ownership.md`