# Execution State Ownership Model **Date**: 2026-02-09 **Status**: Implemented **Related Issues**: Duplicate completion notifications, unnecessary database updates ## Overview This document defines the **ownership model** for execution state management in Attune. It clarifies which service is responsible for updating execution records at each stage of the lifecycle, eliminating race conditions and redundant database writes. ## The Problem Prior to this change, both the executor and worker were updating execution state in the database, causing: 1. **Race conditions** - unclear which service's update would happen first 2. **Redundant writes** - both services writing the same status value 3. **Architectural confusion** - no clear ownership boundaries 4. **Warning logs** - duplicate completion notifications ## The Solution: Lifecycle-Based Ownership Execution state ownership is divided based on **lifecycle stage**, with a clear handoff point: ``` ┌─────────────────────────────────────────────────────────────────┐ │ EXECUTOR OWNERSHIP │ │ │ │ Requested → Scheduling → Scheduled │ │ │ │ │ (includes cancellations/failures │ │ │ before execution.scheduled │ │ │ message is published) │ │ │ │ │ │ Handoff Point: │ │ execution.scheduled message PUBLISHED │ │ ▼ │ └─────────────────────────────────────────────────────────────────┘ │ │ Worker receives message │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ WORKER OWNERSHIP │ │ │ │ Running → Completed / Failed / Cancelled / Timeout │ │ │ └─────────────────────────────────────────────────────────────────┘ ``` ### Executor Responsibilities The **Executor Service** owns execution state from creation through scheduling: - ✅ Creates execution records (`Requested`) - ✅ Updates status during scheduling (`Scheduling`) - ✅ Updates status when scheduled to worker (`Scheduled`) - ✅ Publishes `execution.scheduled` message **← HANDOFF POINT** - ✅ Handles cancellations/failures BEFORE `execution.scheduled` is published 
- ❌ Does NOT update status after `execution.scheduled` is published **Lifecycle stages**: `Requested` → `Scheduling` → `Scheduled` **Important**: If an execution is cancelled or fails before the executor publishes `execution.scheduled`, the executor is responsible for updating the status (e.g., to `Cancelled`). The worker never learns about executions that don't reach the handoff point. ### Worker Responsibilities The **Worker Service** owns execution state after receiving the handoff: - ✅ Receives `execution.scheduled` message **← TAKES OWNERSHIP** - ✅ Updates status when execution starts (`Running`) - ✅ Updates status when execution completes (`Completed`, `Failed`, etc.) - ✅ Handles cancellations AFTER receiving `execution.scheduled` - ✅ Updates execution result data - ✅ Publishes `execution.status_changed` notifications - ✅ Publishes `execution.completed` notifications - ❌ Does NOT update status for executions it hasn't received **Lifecycle stages**: `Running` → `Completed` / `Failed` / `Cancelled` / `Timeout` **Important**: The worker only owns executions it has received via `execution.scheduled`. If a cancellation happens before this message is sent, the worker is never involved. ## Message Flow ### 1. Executor Creates and Schedules ``` Executor Service ├─> Creates execution (status: Requested) ├─> Updates status: Scheduling ├─> Selects worker ├─> Updates status: Scheduled └─> Publishes: execution.scheduled → worker-specific queue ``` ### 2. Worker Receives and Executes ``` Worker Service ├─> Receives: execution.scheduled ├─> Updates DB: Scheduled → Running ├─> Publishes: execution.status_changed (running) ├─> Executes action ├─> Updates DB: Running → Completed/Failed ├─> Publishes: execution.status_changed (completed/failed) └─> Publishes: execution.completed ``` ### 3. 
Executor Handles Orchestration

```
Executor Service (ExecutionManager)
├─> Receives: execution.status_changed
├─> Does NOT update database
├─> Handles orchestration logic:
│   ├─> Triggers workflow children (if parent completed)
│   ├─> Updates workflow state
│   └─> Manages parent-child relationships
└─> Logs event for monitoring
```

### 4. Queue Management

```
Executor Service (CompletionListener)
├─> Receives: execution.completed
├─> Releases queue slot
├─> Notifies waiting executions
└─> Updates queue statistics
```

## Database Update Rules

### Executor (Pre-Scheduling)

**File**: `crates/executor/src/scheduler.rs`

```rust
// ✅ Executor updates DB before scheduling
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;

// Publish to worker
Self::queue_to_worker(...).await?;
```

### Worker (Post-Scheduling)

**File**: `crates/worker/src/executor.rs`

```rust
// ✅ Worker updates DB when starting
async fn execute(&self, execution_id: i64) -> Result<()> {
    // Update status to running
    self.update_execution_status(execution_id, ExecutionStatus::Running).await?;

    // Execute action...
}

// ✅ Worker updates DB when completing
async fn handle_execution_success(&self, execution_id: i64, result: &ExecutionResult) -> Result<()> {
    let input = UpdateExecutionInput {
        status: Some(ExecutionStatus::Completed),
        result: Some(result_data), // result_data serialized from `result`
        // ...
    };
    ExecutionRepository::update(&self.pool, execution_id, input).await?;
    Ok(())
}
```

### Executor (Post-Scheduling)

**File**: `crates/executor/src/execution_manager.rs`

```rust
// ❌ Executor does NOT update DB after scheduling
async fn process_status_change(...)
-> Result<()> {
    // Fetch execution (for orchestration logic only)
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Handle orchestration, but do NOT update DB
    match status {
        ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled => {
            Self::handle_completion(pool, publisher, &execution).await?;
        }
        _ => {}
    }

    Ok(())
}
```

## Benefits

### 1. Clear Ownership Boundaries

- No ambiguity about who updates what
- Easy to reason about system behavior
- Reduced cognitive load for developers

### 2. Eliminated Race Conditions

- Only one service updates each lifecycle stage
- No competing writes to the same fields
- Predictable state transitions

### 3. Better Performance

- No redundant database writes
- Reduced database contention
- Lower network overhead (fewer queries)

### 4. Cleaner Logs

Before:

```
executor | Updated execution 9061 status: Scheduled -> Running
executor | Updated execution 9061 status: Running -> Running
executor | Updated execution 9061 status: Completed -> Completed
executor | WARN: Completion notification for action 3 but active_count is 0
```

After:

```
executor | Execution 9061 scheduled to worker 29
worker   | Starting execution: 9061
worker   | Execution 9061 completed successfully in 142ms
executor | Execution 9061 reached terminal state: Completed, handling orchestration
```

### 5. Idempotent Message Handling

- Executor can safely receive duplicate status change messages
- Worker updates are authoritative
- No special logic needed for retries

## Edge Cases & Error Handling

### Cancellation Before Handoff

**Scenario**: An execution is queued due to a concurrency policy, and a user cancels it before scheduling.

**Handling**:
- Execution is in the `Requested` or `Scheduling` state
- Executor updates the status to `Cancelled`
- Worker never receives `execution.scheduled`
- No worker resources consumed ✅

### Cancellation After Handoff

**Scenario**: The execution is already scheduled to a worker, and a user cancels it while it is running.
**Handling**:
- Worker has received `execution.scheduled` and owns the execution
- Worker updates status: `Running` → `Cancelled`
- Worker publishes a status change notification
- Executor handles orchestration (e.g., skips workflow children)

### Worker Crashes Before Updating Status

**Scenario**: Worker receives `execution.scheduled` but crashes before updating the status to `Running`.

**Handling**:
- Execution remains in the `Scheduled` state
- Worker owned the execution but failed to update it
- Executor's heartbeat monitoring detects stale scheduled executions
- After a timeout, the executor can reschedule to another worker or mark the execution as abandoned
- Idempotent: if the worker already started, duplicate scheduling is rejected

### Message Delivery Delays

**Scenario**: Worker updates the DB but the `execution.status_changed` message is delayed.

**Handling**:
- Database reflects the correct state (source of truth)
- Executor eventually receives the notification and handles orchestration
- Orchestration logic is idempotent (safe to call multiple times)
- Critical: workflows may see a slight delay but remain consistent

### Partial Failures

**Scenario**: Worker updates the DB successfully but fails to publish the notification.

**Handling**:
- Database has the correct state (worker succeeded)
- Executor won't trigger orchestration until the notification arrives
- Future enhancement: periodic executor polling for stale completions
- Workaround: worker retries message publishing with exponential backoff

## Migration Notes

### Changes Required

1. **Executor Service** (`execution_manager.rs`):
   - ✅ Removed database updates from `process_status_change()`
   - ✅ Changed to read-only orchestration handler
   - ✅ Updated logs to reflect observer role
2. **Worker Service** (`service.rs`):
   - ✅ Already updates DB directly (no changes needed)
   - ✅ Updated comment: "we'll update the database directly"
3.
**Documentation**: - ✅ Updated module docs to reflect ownership model - ✅ Added ownership boundaries to architecture docs ### Backward Compatibility - ✅ No breaking changes to external APIs - ✅ Message formats unchanged - ✅ Database schema unchanged - ✅ Workflow behavior unchanged ## Testing Strategy ### Unit Tests - ✅ Executor tests verify no DB updates after scheduling - ✅ Worker tests verify DB updates at all lifecycle stages - ✅ Message handler tests verify orchestration without DB writes ### Integration Tests - Test full execution lifecycle end-to-end - Verify status transitions in database - Confirm orchestration logic (workflow children) still works - Test failure scenarios (worker crashes, message delays) ### Monitoring Monitor for: - Executions stuck in `Scheduled` state (worker not picking up) - Large delays between status changes (message queue lag) - Workflow children not triggering (orchestration failure) ## Future Enhancements ### 1. Executor Polling for Stale Completions If `execution.status_changed` messages are lost, executor could periodically poll for completed executions that haven't triggered orchestration. ### 2. Worker Health Checks More robust detection of worker failures before scheduled executions time out. ### 3. Explicit Handoff Messages Consider adding `execution.handoff` message to explicitly mark ownership transfer point. 
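The idempotency guarantees discussed above (duplicate scheduling rejected, duplicate status messages harmless) reduce to a compare-and-set transition over the lifecycle. The following is a minimal sketch of that rule as a pure state machine; all names here are illustrative, not the actual Attune types:

```rust
// Illustrative sketch only: the real types live in the executor/worker crates.
// In production this check would be a conditional UPDATE against the database.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
    Timeout,
}

/// True for the stages the executor owns (before the handoff point).
fn executor_owns(status: ExecutionStatus) -> bool {
    matches!(
        status,
        ExecutionStatus::Requested | ExecutionStatus::Scheduling | ExecutionStatus::Scheduled
    )
}

/// Compare-and-set transition: succeeds only if the execution is currently in
/// `expected`. A duplicate message replays the same transition, fails the
/// comparison, and is rejected without corrupting state.
fn try_transition(
    current: ExecutionStatus,
    expected: ExecutionStatus,
    next: ExecutionStatus,
) -> Result<ExecutionStatus, ExecutionStatus> {
    if current == expected {
        Ok(next)
    } else {
        Err(current) // e.g. a duplicate `execution.scheduled` after Running
    }
}

fn main() {
    // Normal handoff: Scheduled -> Running succeeds exactly once.
    assert_eq!(
        try_transition(ExecutionStatus::Scheduled, ExecutionStatus::Scheduled, ExecutionStatus::Running),
        Ok(ExecutionStatus::Running)
    );
    // Duplicate scheduling after the worker already started is rejected.
    assert_eq!(
        try_transition(ExecutionStatus::Running, ExecutionStatus::Scheduled, ExecutionStatus::Running),
        Err(ExecutionStatus::Running)
    );
    assert!(executor_owns(ExecutionStatus::Scheduled));
    assert!(!executor_owns(ExecutionStatus::Running));
}
```

Expressing the transition as compare-and-set is what makes retries safe without any dedicated retry logic: the same guard serves both the happy path and the duplicate-message path.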
## References - **Architecture Doc**: `docs/architecture/executor-service.md` - **Work Summary**: `work-summary/2026-02-09-duplicate-completion-fix.md` - **Bug Fix Doc**: `docs/BUGFIX-duplicate-completion-2026-02-09.md` - **ExecutionManager**: `crates/executor/src/execution_manager.rs` - **Worker Executor**: `crates/worker/src/executor.rs` - **Worker Service**: `crates/worker/src/service.rs` ## Summary The execution state ownership model provides **clear, lifecycle-based boundaries** for who updates execution records: - **Executor**: Owns state from creation through scheduling (including pre-handoff cancellations) - **Worker**: Owns state after receiving `execution.scheduled` message - **Handoff**: Occurs when `execution.scheduled` message is **published to worker** - **Key Principle**: Worker only knows about executions it receives; pre-handoff cancellations are executor's responsibility This eliminates race conditions, reduces database load, and provides a clean architectural foundation for future enhancements.
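As a closing illustration of the key principle, the cancellation-routing decision reduces to a pure function of the lifecycle stage. The names below are hypothetical (the real services consult the database and message state rather than an in-memory enum):

```rust
// Hypothetical sketch: which service owns a cancellation, given the stage.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Requested,
    Scheduling,
    // `Scheduled` alone is ambiguous: ownership turns on whether the
    // `execution.scheduled` message has been published yet.
    Scheduled { handoff_published: bool },
    Running,
}

#[derive(Debug, PartialEq, Eq)]
enum CancellationOwner {
    /// Executor updates the DB itself; the worker never hears about it.
    Executor,
    /// Worker owns the record; the executor must route the cancel to it.
    Worker,
}

fn route_cancellation(stage: Stage) -> CancellationOwner {
    match stage {
        Stage::Requested | Stage::Scheduling => CancellationOwner::Executor,
        Stage::Scheduled { handoff_published } => {
            if handoff_published {
                CancellationOwner::Worker
            } else {
                CancellationOwner::Executor
            }
        }
        Stage::Running => CancellationOwner::Worker,
    }
}

fn main() {
    assert_eq!(route_cancellation(Stage::Requested), CancellationOwner::Executor);
    assert_eq!(
        route_cancellation(Stage::Scheduled { handoff_published: true }),
        CancellationOwner::Worker
    );
}
```

The `handoff_published` flag is the reason the "Explicit Handoff Messages" enhancement is attractive: a dedicated `execution.handoff` event would make that bit observable instead of implicit in publish ordering.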