Execution State Ownership Model

Date: 2026-02-09
Status: Implemented
Related Issues: Duplicate completion notifications, unnecessary database updates

Overview

This document defines the ownership model for execution state management in Attune. It clarifies which service is responsible for updating execution records at each stage of the lifecycle, eliminating race conditions and redundant database writes.

The Problem

Prior to this change, both the executor and worker were updating execution state in the database, causing:

  1. Race conditions - the final status depended on which service's write landed last
  2. Redundant writes - both services writing the same status value
  3. Architectural confusion - no clear ownership boundaries
  4. Warning logs - duplicate completion notifications

The Solution: Lifecycle-Based Ownership

Execution state ownership is divided based on lifecycle stage, with a clear handoff point:

┌─────────────────────────────────────────────────────────────────┐
│                      EXECUTOR OWNERSHIP                         │
│                                                                 │
│  Requested → Scheduling → Scheduled                             │
│                                    │                            │
│  (includes cancellations/failures  │                            │
│   before execution.scheduled       │                            │
│   message is published)            │                            │
│                                    │                            │
│                          Handoff Point:                         │
│                          execution.scheduled message PUBLISHED  │
│                                    ▼                            │
└─────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Worker receives message
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                       WORKER OWNERSHIP                          │
│                                                                 │
│  Running → Completed / Failed / Cancelled / Timeout             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
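The split above hinges on a single fact rather than on the status value itself: whether the execution.scheduled message has been published yet. The following is a minimal sketch of that rule; the Service enum, owner, and may_update names are illustrative, not the actual codebase API.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Service {
    Executor,
    Worker,
}

/// Ownership is decided by one fact: has the `execution.scheduled`
/// message been published for this execution yet?
fn owner(scheduled_published: bool) -> Service {
    if scheduled_published {
        Service::Worker
    } else {
        Service::Executor
    }
}

/// A guard a repository layer could apply before any status write.
fn may_update(service: Service, scheduled_published: bool) -> bool {
    service == owner(scheduled_published)
}

fn main() {
    // Pre-handoff (e.g. a cancellation while still queued): executor writes.
    assert!(may_update(Service::Executor, false));
    assert!(!may_update(Service::Worker, false));
    // Post-handoff: only the worker writes.
    assert!(may_update(Service::Worker, true));
    assert!(!may_update(Service::Executor, true));
    println!("ownership guard ok");
}
```

Note that a pure status-to-owner mapping would be ambiguous for Cancelled, which either service may write depending on when the cancellation arrives; keying on the handoff point avoids that ambiguity.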

Executor Responsibilities

The Executor Service owns execution state from creation through scheduling:

  • Creates execution records (Requested)
  • Updates status during scheduling (Scheduling)
  • Updates status when scheduled to worker (Scheduled)
  • Publishes execution.scheduled message ← HANDOFF POINT
  • Handles cancellations/failures BEFORE execution.scheduled is published
  • Does NOT update status after execution.scheduled is published

Lifecycle stages: Requested → Scheduling → Scheduled

Important: If an execution is cancelled or fails before the executor publishes execution.scheduled, the executor is responsible for updating the status (e.g., to Cancelled). The worker never learns about executions that don't reach the handoff point.

Worker Responsibilities

The Worker Service owns execution state after receiving the handoff:

  • Receives execution.scheduled message ← TAKES OWNERSHIP
  • Updates status when execution starts (Running)
  • Updates status when execution completes (Completed, Failed, etc.)
  • Handles cancellations AFTER receiving execution.scheduled
  • Updates execution result data
  • Publishes execution.status_changed notifications
  • Publishes execution.completed notifications
  • Does NOT update status for executions it hasn't received

Lifecycle stages: Running → Completed / Failed / Cancelled / Timeout

Important: The worker only owns executions it has received via execution.scheduled. If a cancellation happens before this message is sent, the worker is never involved.

Message Flow

1. Executor Creates and Schedules

Executor Service
  ├─> Creates execution (status: Requested)
  ├─> Updates status: Scheduling
  ├─> Selects worker
  ├─> Updates status: Scheduled
  └─> Publishes: execution.scheduled → worker-specific queue

2. Worker Receives and Executes

Worker Service
  ├─> Receives: execution.scheduled
  ├─> Updates DB: Scheduled → Running
  ├─> Publishes: execution.status_changed (running)
  ├─> Executes action
  ├─> Updates DB: Running → Completed/Failed
  ├─> Publishes: execution.status_changed (completed/failed)
  └─> Publishes: execution.completed
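The worker's two database updates in step 2 follow a fixed set of legal transitions. A sketch of a guard the worker could apply before each write (a hypothetical helper, not the actual crates/worker API):

```rust
/// Transitions the worker is allowed to perform once it owns an execution.
fn worker_transition_allowed(from: &str, to: &str) -> bool {
    matches!(
        (from, to),
        ("Scheduled", "Running")
            | ("Running", "Completed")
            | ("Running", "Failed")
            | ("Running", "Cancelled")
            | ("Running", "Timeout")
    )
}

fn main() {
    assert!(worker_transition_allowed("Scheduled", "Running"));
    assert!(worker_transition_allowed("Running", "Completed"));
    // The worker never writes pre-handoff states.
    assert!(!worker_transition_allowed("Requested", "Scheduling"));
    println!("transition guard ok");
}
```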

3. Executor Handles Orchestration

Executor Service (ExecutionManager)
  ├─> Receives: execution.status_changed
  ├─> Does NOT update database
  ├─> Handles orchestration logic:
  │   ├─> Triggers workflow children (if parent completed)
  │   ├─> Updates workflow state
  │   └─> Manages parent-child relationships
  └─> Logs event for monitoring

4. Queue Management

Executor Service (CompletionListener)
  ├─> Receives: execution.completed
  ├─> Releases queue slot
  ├─> Notifies waiting executions
  └─> Updates queue statistics
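The slot release in step 4 is where the original "active_count is 0" warning came from: duplicate completion notifications decremented the counter twice. A sketch of a release that tolerates duplicates by saturating (the QueueSlots type is illustrative, not the actual CompletionListener internals):

```rust
struct QueueSlots {
    active: u32,
    capacity: u32,
}

impl QueueSlots {
    /// Release a slot on `execution.completed`. Saturating subtraction
    /// keeps a duplicate completion from underflowing the counter.
    fn release(&mut self) {
        self.active = self.active.saturating_sub(1);
    }

    fn available(&self) -> u32 {
        self.capacity.saturating_sub(self.active)
    }
}

fn main() {
    let mut slots = QueueSlots { active: 1, capacity: 4 };
    slots.release();
    slots.release(); // duplicate notification: no underflow, no panic
    assert_eq!(slots.available(), 4);
    println!("slot release ok");
}
```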

Database Update Rules

Executor (Pre-Scheduling)

File: crates/executor/src/scheduler.rs

// ✅ Executor updates DB before scheduling
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;

// Publish to worker
Self::queue_to_worker(...).await?;

Worker (Post-Scheduling)

File: crates/worker/src/executor.rs

// ✅ Worker updates DB when starting
async fn execute(&self, execution_id: i64) -> Result<ExecutionResult> {
    // Update status to running
    self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
    
    // Execute action...
}

// ✅ Worker updates DB when completing
async fn handle_execution_success(&self, execution_id: i64, result: &ExecutionResult) -> Result<()> {
    let input = UpdateExecutionInput {
        status: Some(ExecutionStatus::Completed),
        result: Some(result_data),
        // ...
    };
    ExecutionRepository::update(&self.pool, execution_id, input).await?;
}

Executor (Post-Scheduling)

File: crates/executor/src/execution_manager.rs

// ❌ Executor does NOT update DB after scheduling
async fn process_status_change(...) -> Result<()> {
    // Fetch execution (for orchestration logic only)
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    
    // Handle orchestration, but do NOT update DB
    match status {
        ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled => {
            Self::handle_completion(pool, publisher, &execution).await?;
        }
        _ => {}
    }
    
    Ok(())
}

Benefits

1. Clear Ownership Boundaries

  • No ambiguity about who updates what
  • Easy to reason about system behavior
  • Reduced cognitive load for developers

2. Eliminated Race Conditions

  • Only one service updates each lifecycle stage
  • No competing writes to same fields
  • Predictable state transitions

3. Better Performance

  • No redundant database writes
  • Reduced database contention
  • Lower network overhead (fewer queries)

4. Cleaner Logs

Before:

executor | Updated execution 9061 status: Scheduled -> Running
executor | Updated execution 9061 status: Running -> Running
executor | Updated execution 9061 status: Completed -> Completed
executor | WARN: Completion notification for action 3 but active_count is 0

After:

executor | Execution 9061 scheduled to worker 29
worker   | Starting execution: 9061
worker   | Execution 9061 completed successfully in 142ms
executor | Execution 9061 reached terminal state: Completed, handling orchestration

5. Idempotent Message Handling

  • Executor can safely receive duplicate status change messages
  • Worker updates are authoritative
  • No special logic needed for retries
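One way to keep the orchestration handler idempotent is to record which terminal executions have already been processed, so a redelivered execution.status_changed becomes a no-op. A sketch under that assumption (the in-memory handled set is an illustration; the real ExecutionManager may track this differently):

```rust
use std::collections::HashSet;

struct OrchestrationHandler {
    handled: HashSet<i64>,
}

impl OrchestrationHandler {
    /// Returns true if orchestration ran; non-terminal states and
    /// duplicate deliveries are no-ops.
    fn on_status_changed(&mut self, execution_id: i64, terminal: bool) -> bool {
        if !terminal {
            return false;
        }
        // `insert` returns false if the id was already present.
        if !self.handled.insert(execution_id) {
            return false;
        }
        // ... trigger workflow children, update workflow state, etc.
        true
    }
}

fn main() {
    let mut handler = OrchestrationHandler { handled: HashSet::new() };
    assert!(handler.on_status_changed(9061, true));
    assert!(!handler.on_status_changed(9061, true)); // redelivery: no-op
    println!("idempotent handler ok");
}
```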

Edge Cases & Error Handling

Cancellation Before Handoff

Scenario: Execution is queued due to concurrency policy, user cancels before scheduling.

Handling:

  • Execution in Requested or Scheduling state
  • Executor updates status: → Cancelled
  • Worker never receives execution.scheduled
  • No worker resources consumed

Cancellation After Handoff

Scenario: Execution already scheduled to worker, user cancels while running.

Handling:

  • Worker has received execution.scheduled and owns execution
  • Worker updates status: Running → Cancelled
  • Worker publishes status change notification
  • Executor handles orchestration (e.g., skip workflow children)

Worker Crashes Before Updating Status

Scenario: Worker receives execution.scheduled but crashes before updating status to Running.

Handling:

  • Execution remains in Scheduled state
  • Worker owned the execution but failed to update
  • Executor's heartbeat monitoring detects stale scheduled executions
  • After timeout, executor can reschedule to another worker or mark as abandoned
  • Idempotent: If worker already started, duplicate scheduling is rejected
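The heartbeat sweep could look something like the following sketch, assuming each Scheduled execution carries the time it was handed off (field and function names are illustrative):

```rust
use std::time::{Duration, SystemTime};

struct ScheduledExecution {
    id: i64,
    scheduled_at: SystemTime,
}

/// Executions still `Scheduled` past the timeout are candidates for
/// rescheduling to another worker or being marked abandoned.
fn stale_ids(
    scheduled: &[ScheduledExecution],
    timeout: Duration,
    now: SystemTime,
) -> Vec<i64> {
    scheduled
        .iter()
        .filter(|e| {
            now.duration_since(e.scheduled_at)
                .map_or(false, |age| age > timeout)
        })
        .map(|e| e.id)
        .collect()
}

fn main() {
    let now = SystemTime::now();
    let execs = vec![
        ScheduledExecution { id: 1, scheduled_at: now - Duration::from_secs(600) },
        ScheduledExecution { id: 2, scheduled_at: now - Duration::from_secs(5) },
    ];
    assert_eq!(stale_ids(&execs, Duration::from_secs(120), now), vec![1]);
    println!("stale sweep ok");
}
```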

Message Delivery Delays

Scenario: Worker updates DB but execution.status_changed message is delayed.

Handling:

  • Database reflects correct state (source of truth)
  • Executor eventually receives notification and handles orchestration
  • Orchestration logic is idempotent (safe to call multiple times)
  • Critical: Workflow progression may lag slightly behind the database, but remains consistent

Partial Failures

Scenario: Worker updates DB successfully but fails to publish notification.

Handling:

  • Database has correct state (worker succeeded)
  • Executor won't trigger orchestration until notification arrives
  • Future enhancement: Periodic executor polling for stale completions
  • Workaround: Worker retries message publishing with exponential backoff
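A retry schedule for the republish could be as simple as capped exponential backoff (constants below are placeholders, not values from the codebase):

```rust
use std::time::Duration;

/// Delay before retry `attempt` (0-based): base * 2^attempt, capped.
fn backoff(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    // Clamp the shift so large attempt counts can't overflow the multiplier.
    let ms = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(ms.min(cap_ms))
}

fn main() {
    assert_eq!(backoff(0, 100, 5_000), Duration::from_millis(100));
    assert_eq!(backoff(3, 100, 5_000), Duration::from_millis(800));
    assert_eq!(backoff(10, 100, 5_000), Duration::from_millis(5_000)); // capped
    println!("backoff ok");
}
```

Because the database already holds the correct state, retries here only affect how quickly the executor learns about the completion, never the state itself.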

Migration Notes

Changes Required

  1. Executor Service (execution_manager.rs):

    • Removed database updates from process_status_change()
    • Changed to read-only orchestration handler
    • Updated logs to reflect observer role
  2. Worker Service (service.rs):

    • Already updates DB directly (no changes needed)
    • Updated comment: "we'll update the database directly"
  3. Documentation:

    • Updated module docs to reflect ownership model
    • Added ownership boundaries to architecture docs

Backward Compatibility

  • No breaking changes to external APIs
  • Message formats unchanged
  • Database schema unchanged
  • Workflow behavior unchanged

Testing Strategy

Unit Tests

  • Executor tests verify no DB updates after scheduling
  • Worker tests verify DB updates at all lifecycle stages
  • Message handler tests verify orchestration without DB writes

Integration Tests

  • Test full execution lifecycle end-to-end
  • Verify status transitions in database
  • Confirm orchestration logic (workflow children) still works
  • Test failure scenarios (worker crashes, message delays)

Monitoring

Monitor for:

  • Executions stuck in Scheduled state (worker not picking up)
  • Large delays between status changes (message queue lag)
  • Workflow children not triggering (orchestration failure)

Future Enhancements

1. Executor Polling for Stale Completions

If execution.status_changed messages are lost, executor could periodically poll for completed executions that haven't triggered orchestration.

2. Worker Health Checks

More robust detection of worker failures before scheduled executions time out.

3. Explicit Handoff Messages

Consider adding execution.handoff message to explicitly mark ownership transfer point.

References

  • Architecture Doc: docs/architecture/executor-service.md
  • Work Summary: work-summary/2026-02-09-duplicate-completion-fix.md
  • Bug Fix Doc: docs/BUGFIX-duplicate-completion-2026-02-09.md
  • ExecutionManager: crates/executor/src/execution_manager.rs
  • Worker Executor: crates/worker/src/executor.rs
  • Worker Service: crates/worker/src/service.rs

Summary

The execution state ownership model provides clear, lifecycle-based boundaries for who updates execution records:

  • Executor: Owns state from creation through scheduling (including pre-handoff cancellations)
  • Worker: Owns state after receiving execution.scheduled message
  • Handoff: Occurs when execution.scheduled message is published to worker
  • Key Principle: Worker only knows about executions it receives; pre-handoff cancellations are executor's responsibility

This eliminates race conditions, reduces database load, and provides a clean architectural foundation for future enhancements.