more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions

@@ -0,0 +1,204 @@
# Quick Reference: Execution State Ownership
**Last Updated**: 2026-02-09
## Ownership Model at a Glance
```
┌───────────────────────────────┬──────────────────────────┐
│ EXECUTOR OWNS                 │ WORKER OWNS              │
│ Requested                     │ Running                  │
│ Scheduling                    │ Completed                │
│ Scheduled                     │ Failed                   │
│ (+ pre-handoff Cancelled)     │ (+ post-handoff          │
│                               │    Cancelled/Timeout/    │
│                               │    Abandoned)            │
└───────────────────────────────┴──────────────────────────┘
                │                             │
                └────────── HANDOFF ──────────┘
                 execution.scheduled PUBLISHED
```
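The split in the diagram can be written down as a small lookup. This is a minimal sketch, not the project's code: `Owner` and `owner_of` are hypothetical names, the `ExecutionStatus` variants are copied from the diagram (`Timeout` and `Abandoned` are assumed to be variants of the same enum), and the `handed_off` flag is an assumption standing in for "has `execution.scheduled` been published?".

```rust
/// Statuses named in the diagram above (assumed to mirror the real enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
    Timeout,
    Abandoned,
}

/// Hypothetical marker for the service allowed to write a status.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Owner {
    Executor,
    Worker,
}

/// Returns the service that owns a write of `status`.
/// `handed_off` is true once `execution.scheduled` has been published.
pub fn owner_of(status: ExecutionStatus, handed_off: bool) -> Owner {
    use ExecutionStatus::*;
    match status {
        Requested | Scheduling | Scheduled => Owner::Executor,
        Running | Completed | Failed | Timeout | Abandoned => Owner::Worker,
        // In this sketch, Cancelled is the only status whose owner
        // depends on whether the handoff has happened.
        Cancelled => {
            if handed_off {
                Owner::Worker
            } else {
                Owner::Executor
            }
        }
    }
}
```

A helper like this could back a debug assertion in each service's update path, so a write from the wrong side fails loudly in tests.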
## Who Updates the Database?
### Executor Updates (Pre-Handoff Only)
- ✅ Creates execution record
- ✅ Updates status: `Requested` → `Scheduling` → `Scheduled`
- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
- ✅ Handles cancellations/failures BEFORE handoff (worker never notified)
- ❌ NEVER updates after `execution.scheduled` is published
### Worker Updates (Post-Handoff Only)
- ✅ Receives `execution.scheduled` message (takes ownership)
- ✅ Updates status: `Scheduled` → `Running`
- ✅ Updates status: `Running` → `Completed`/`Failed`/`Cancelled`/etc.
- ✅ Handles cancellations/failures AFTER handoff
- ✅ Updates result data
- ✅ Writes to the DB for every status change after receiving the handoff
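The two lists above amount to a table of which service may write which transition. A minimal sketch of such a check follows; `Service` and `may_write` are hypothetical names, and the exact set of transitions is an assumption read off the bullets (for example, it assumes the worker may fail or cancel an execution it has received but not yet started). It also treats the `Scheduled` write as the boundary and ignores the short window between that write and the publish.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
}

/// Hypothetical identifier for the service attempting the write.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Service {
    Executor,
    Worker,
}

/// Returns true if `service` may write the transition `from -> to`.
pub fn may_write(service: Service, from: ExecutionStatus, to: ExecutionStatus) -> bool {
    use ExecutionStatus::*;
    match (service, from, to) {
        // Executor: pre-handoff progression and pre-handoff cancellation.
        (Service::Executor, Requested, Scheduling)
        | (Service::Executor, Scheduling, Scheduled)
        | (Service::Executor, Requested | Scheduling, Cancelled) => true,
        // Worker: every transition after it has received the handoff.
        (Service::Worker, Scheduled, Running | Failed | Cancelled)
        | (Service::Worker, Running, Completed | Failed | Cancelled) => true,
        // Anything else is a write from the wrong side of the handoff.
        _ => false,
    }
}
```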
## Who Publishes Messages?
### Executor Publishes
- `enforcement.created` (from rules)
- `execution.requested` (to scheduler)
- `execution.scheduled` (to worker) **← HANDOFF MESSAGE - OWNERSHIP TRANSFER**
### Worker Publishes
- `execution.status_changed` (for each status change after handoff)
- `execution.completed` (when done)
### Executor Receives (But Doesn't Update DB Post-Handoff)
- `execution.status_changed` → triggers orchestration logic (read-only)
- `execution.completed` → releases queue slots
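The publisher of each message can be captured in one function, which is handy in tests that guard against a service publishing a message it does not own. This is a sketch only: `Service`, `publisher_of`, and `is_handoff` are hypothetical names, and the routing keys are the ones listed above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Service {
    Executor,
    Worker,
}

/// Maps a routing key from the lists above to the service that publishes it.
/// Returns `None` for keys this reference does not mention.
pub fn publisher_of(routing_key: &str) -> Option<Service> {
    match routing_key {
        "enforcement.created" | "execution.requested" | "execution.scheduled" => {
            Some(Service::Executor)
        }
        "execution.status_changed" | "execution.completed" => Some(Service::Worker),
        _ => None,
    }
}

/// True only for the message that transfers ownership to the worker.
pub fn is_handoff(routing_key: &str) -> bool {
    routing_key == "execution.scheduled"
}
```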
## Code Locations
### Executor Updates DB
```rust
// crates/executor/src/scheduler.rs
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;
```
### Worker Updates DB
```rust
// crates/worker/src/executor.rs
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
// ...
ExecutionRepository::update(&self.pool, execution_id, input).await?;
```
### Executor Orchestrates (Read-Only)
```rust
// crates/executor/src/execution_manager.rs
async fn process_status_change(...) -> Result<()> {
    // Read-only: load the current state written by the worker
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    // NO UPDATE - just orchestration logic
    Self::handle_completion(pool, publisher, &execution).await?;
    Ok(())
}
```
## Decision Tree: Should I Update the DB?
```
Are you in the Executor?
└─ Have you published execution.scheduled for this execution?
   ├─ NO  → Update DB (you own it)
   │        └─ Includes: Requested/Scheduling/Scheduled/pre-handoff Cancelled
   └─ YES → Don't update DB (worker owns it now)
            └─ Just orchestrate (trigger workflows, etc.)

Are you in the Worker?
└─ Have you received execution.scheduled for this execution?
   ├─ YES → Update DB for ALL status changes (you own it)
   │        └─ Includes: Running/Completed/Failed/post-handoff Cancelled/etc.
   └─ NO  → Don't touch this execution (doesn't exist for you yet)
```
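The decision tree reduces to a two-input match. A minimal sketch, with hypothetical names (`Service`, `Action`, `decide`) that do not come from the codebase:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Service {
    Executor,
    Worker,
}

/// What the calling service is allowed to do with an execution.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    UpdateDb,
    OrchestrateOnly,
    Ignore,
}

/// `handoff_done` means "I have published execution.scheduled" for the
/// executor and "I have received execution.scheduled" for the worker.
pub fn decide(service: Service, handoff_done: bool) -> Action {
    match (service, handoff_done) {
        (Service::Executor, false) => Action::UpdateDb,
        (Service::Executor, true) => Action::OrchestrateOnly,
        (Service::Worker, true) => Action::UpdateDb,
        (Service::Worker, false) => Action::Ignore,
    }
}
```

Because the match is exhaustive, adding a third service or state to the enums forces every caller of `decide` to be revisited at compile time.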
## Common Patterns
### ✅ DO: Worker Updates After Handoff
```rust
// Worker receives execution.scheduled
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
self.publish_status_update(execution_id, ExecutionStatus::Running).await?;
```
### ✅ DO: Executor Orchestrates Without DB Write
```rust
// Executor receives execution.status_changed
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
if status == ExecutionStatus::Completed {
    Self::trigger_child_executions(pool, publisher, &execution).await?;
}
```
### ❌ DON'T: Executor Updates After Handoff
```rust
// Executor receives execution.status_changed
execution.status = status;
ExecutionRepository::update(pool, execution.id, execution).await?; // ❌ WRONG!
```
### ❌ DON'T: Worker Updates Before Handoff
```rust
// Worker updates execution it hasn't received via execution.scheduled
ExecutionRepository::update(&self.pool, execution_id, input).await?; // ❌ WRONG!
```
### ✅ DO: Executor Handles Pre-Handoff Cancellation
```rust
// User cancels execution before it's scheduled to worker
// Execution is still in Requested/Scheduling state
execution.status = ExecutionStatus::Cancelled;
ExecutionRepository::update(pool, execution_id, execution).await?; // ✅ CORRECT!
// Worker never receives execution.scheduled, never knows execution existed
```
### ✅ DO: Worker Handles Post-Handoff Cancellation
```rust
// Worker received execution.scheduled, now owns execution
// User cancels execution while it's running
execution.status = ExecutionStatus::Cancelled;
ExecutionRepository::update(&self.pool, execution_id, execution).await?; // ✅ CORRECT!
self.publish_status_update(execution_id, ExecutionStatus::Cancelled).await?;
```
## Handoff Checklist
When an execution is scheduled:
**Executor Must**:
- [x] Update status to `Scheduled`
- [x] Write to database
- [x] Publish `execution.scheduled` message **← HANDOFF OCCURS HERE**
- [x] Stop updating this execution (ownership transferred)
- [x] Continue to handle orchestration (read-only)
**Worker Must**:
- [x] Receive `execution.scheduled` message **← OWNERSHIP RECEIVED**
- [x] Take ownership of execution state
- [x] Update DB for all future status changes
- [x] Handle any cancellations/failures after this point
- [x] Publish status notifications
**Important**: If an execution is cancelled BEFORE the executor publishes `execution.scheduled`, the executor updates the status to `Cancelled` and the worker never learns about it.
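The executor side of the checklist can be sketched with in-memory stand-ins for the database and the message bus. Everything here is hypothetical and synchronous for brevity (the `Executor` struct, its fields, and the `u64` ids); the real code uses `ExecutionRepository` and an async publisher. The point is the ordering: write `Scheduled`, publish, then refuse further writes.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Cancelled,
}

/// In-memory stand-ins for the database and the message bus (hypothetical).
#[derive(Default)]
pub struct Executor {
    pub db: HashMap<u64, ExecutionStatus>,
    pub published: Vec<(String, u64)>,
    pub handed_off: HashSet<u64>,
}

impl Executor {
    /// Executor-side write; refused once ownership has been transferred.
    pub fn update(&mut self, id: u64, status: ExecutionStatus) -> Result<(), String> {
        if self.handed_off.contains(&id) {
            return Err(format!("execution {id} is owned by the worker"));
        }
        self.db.insert(id, status);
        Ok(())
    }

    /// Checklist order: write `Scheduled`, then publish, then stop writing.
    pub fn schedule(&mut self, id: u64) -> Result<(), String> {
        self.update(id, ExecutionStatus::Scheduled)?;
        self.published.push(("execution.scheduled".to_string(), id));
        self.handed_off.insert(id);
        Ok(())
    }
}
```

A pre-handoff cancellation is then just an `update(id, ExecutionStatus::Cancelled)` on an execution that was never passed to `schedule`, so nothing is published for it.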
## Benefits Summary
| Aspect | Benefit |
|--------|---------|
| **Race Conditions** | Eliminated - only one owner per stage |
| **DB Writes** | Reduced by ~50% - no duplicates |
| **Code Clarity** | Clear boundaries - easy to reason about |
| **Message Traffic** | Reduced - no duplicate completions |
| **Idempotency** | Safe to receive duplicate messages |
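One way the idempotency row could hold on the worker side is to record which executions have already been accepted, so a redelivered `execution.scheduled` is a no-op. This is an illustrative sketch under that assumption, not a description of the actual worker; `OwnedExecutions` and `accept_handoff` are hypothetical names.

```rust
use std::collections::HashSet;

/// Hypothetical worker-side record of executions it already owns.
#[derive(Default)]
pub struct OwnedExecutions {
    pub ids: HashSet<u64>,
}

impl OwnedExecutions {
    /// Handles an `execution.scheduled` delivery.
    /// Returns true only the first time an id is seen, so a redelivered
    /// message does not start the same execution twice.
    pub fn accept_handoff(&mut self, execution_id: u64) -> bool {
        self.ids.insert(execution_id)
    }
}
```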
## Troubleshooting
### Execution Stuck in "Scheduled"
**Problem**: Worker not updating status to Running
**Check**: Was `execution.scheduled` published? Did the worker receive it? Is the worker healthy?
### Workflow Children Not Triggering
**Problem**: Orchestration not running
**Check**: Did the worker publish `execution.status_changed`? Is the message queue healthy?
### Duplicate Status Updates
**Problem**: Both services updating DB
**Check**: The executor must NOT update the DB after publishing `execution.scheduled`
### Execution Cancelled But Status Not Updated
**Problem**: Cancellation not reflected in database
**Check**: Was it cancelled before or after handoff?
**Fix**: If before handoff → executor updates; if after handoff → worker updates
### Queue Warnings
**Problem**: Duplicate completion notifications
**Check**: Only the worker should publish `execution.completed`
## See Also
- **Full Architecture Doc**: `docs/ARCHITECTURE-execution-state-ownership.md`
- **Bug Fix Visualization**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
- **Work Summary**: `work-summary/2026-02-09-execution-state-ownership.md`