more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions

@@ -339,7 +339,7 @@ Understanding the execution lifecycle helps with monitoring and debugging:
```
1. requested → Action execution requested
2. scheduling → Finding available worker
3. scheduled → Assigned to worker, queued
3. scheduled → Assigned to worker, queued [HANDOFF TO WORKER]
4. running → Currently executing
5. completed → Finished successfully
OR
@@ -352,33 +352,78 @@ Understanding the execution lifecycle helps with monitoring and debugging:
abandoned → Worker lost
```
### State Ownership Model
Execution state is owned by different services at different lifecycle stages:
**Executor Ownership (Pre-Handoff):**
- `requested` → `scheduling` → `scheduled`
- Executor creates and updates execution records
- Executor selects worker and publishes `execution.scheduled`
- **Handles cancellations/failures BEFORE handoff** (before `execution.scheduled` is published)
**Handoff Point:**
- Occurs when the `execution.scheduled` message is **published to the worker**
- Before handoff: Executor owns and updates state
- After handoff: Worker owns and updates state
**Worker Ownership (Post-Handoff):**
- `running` → `completed` / `failed` / `cancelled` / `timeout` / `abandoned`
- Worker updates execution records directly
- Worker publishes status change notifications
- **Handles cancellations/failures AFTER handoff** (after receiving `execution.scheduled`)
- Worker only owns executions it has received
**Orchestration (Read-Only):**
- Executor receives status change notifications for orchestration
- Triggers workflow children, manages parent-child relationships
- Does NOT update execution state after handoff
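
The ownership rule above can be sketched as a small guard in Python. This is only an illustration under stated assumptions: the `Execution` class, the `handoff_published` flag, and the `owner` and `update_status` helpers are hypothetical names, not part of the actual executor or worker code.

```python
from dataclasses import dataclass


@dataclass
class Execution:
    """Hypothetical in-memory stand-in for an execution record."""
    status: str = "requested"
    # Assumed flag: True once `execution.scheduled` has been published.
    handoff_published: bool = False


def owner(execution: Execution) -> str:
    """Return the service allowed to update this execution's state."""
    # Ownership follows the handoff point, not the status value alone.
    return "worker" if execution.handoff_published else "executor"


def update_status(execution: Execution, service: str, new_status: str) -> None:
    """Apply a status update only if `service` currently owns the execution."""
    current_owner = owner(execution)
    if service != current_owner:
        raise PermissionError(
            f"{service} may not update state; current owner is {current_owner}"
        )
    execution.status = new_status
```

In this sketch the executor can still move an execution to `scheduled`, but once `handoff_published` is set any executor update is rejected, which mirrors the read-only orchestration role described above.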
### State Transitions
**Normal Flow:**
```
requested → scheduling → scheduled → running → completed
requested → scheduling → scheduled → [HANDOFF] → running → completed
└─ Executor Updates ─────────┘ └─ Worker Updates ─┘
```
**Failure Flow:**
```
requested → scheduling → scheduled → running → failed
requested → scheduling → scheduled → [HANDOFF] → running → failed
└─ Executor Updates ─────────┘ └─ Worker Updates ──┘
```
**Cancellation:**
**Cancellation (depends on handoff):**
```
(any state) → canceling → cancelled
Before handoff:
requested/scheduling/scheduled → cancelled
└─ Executor Updates (worker never notified) ──┘
After handoff:
running → canceling → cancelled
└─ Worker Updates ──┘
```
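
As a rough illustration of the two cancellation paths, the Python sketch below picks the updating service and the resulting status sequence. The function name `plan_cancellation` and the `handoff_published` argument are assumptions made for this example. Routing a published but not yet running execution to the worker is also an assumption, based on the rule that the worker owns state after handoff.

```python
from typing import List, Tuple

# States the executor owns before `execution.scheduled` is published.
PRE_HANDOFF_STATES = {"requested", "scheduling", "scheduled"}


def plan_cancellation(status: str, handoff_published: bool) -> Tuple[str, List[str]]:
    """Return (updating service, status sequence) for a cancel request."""
    if not handoff_published:
        if status not in PRE_HANDOFF_STATES:
            raise ValueError(f"unexpected pre-handoff status: {status}")
        # Executor cancels directly; the worker is never notified.
        return "executor", ["cancelled"]
    # After handoff the worker passes through `canceling` first.
    return "worker", ["canceling", "cancelled"]
```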
**Timeout:**
```
scheduled/running → timeout
scheduled/running → [HANDOFF] → timeout
└─ Worker Updates
```
**Abandonment:**
```
scheduled/running → abandoned
scheduled/running → [HANDOFF] → abandoned
└─ Worker Updates
```
**Key Points:**
- Only one service updates an execution at each lifecycle stage, so executor and worker writes never race
- Handoff occurs when `execution.scheduled` is **published**, not just when status is set to `scheduled`
- If cancelled before handoff: Executor updates (worker never knows execution existed)
- If cancelled after handoff: Worker updates (worker owns execution)
- Worker is authoritative source for execution state after receiving `execution.scheduled`
- Status changes are reflected in real-time via notifications
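
The flows listed in this section can be collected into a transition table for validation. The following Python sketch is built only from the arrows shown above. The names `ALLOWED_TRANSITIONS` and `is_valid_transition` are hypothetical, and the real services may enforce transitions differently.

```python
# Transition table assembled from the flows documented in this section.
ALLOWED_TRANSITIONS = {
    "requested": {"scheduling", "cancelled"},
    "scheduling": {"scheduled", "cancelled"},
    "scheduled": {"running", "cancelled", "timeout", "abandoned"},
    "running": {"completed", "failed", "canceling", "timeout", "abandoned"},
    "canceling": {"cancelled"},
}


def is_valid_transition(current: str, new: str) -> bool:
    """Check whether `current` -> `new` appears in the documented flows."""
    # Terminal states have no entry, so they allow no further transitions.
    return new in ALLOWED_TRANSITIONS.get(current, set())
```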
---
## Data Fields