# Execution State Ownership Model
**Date**: 2026-02-09
**Status**: Implemented
**Related Issues**: Duplicate completion notifications, unnecessary database updates
## Overview
This document defines the **ownership model** for execution state management in Attune. It clarifies which service is responsible for updating execution records at each stage of the lifecycle, eliminating race conditions and redundant database writes.
## The Problem
Prior to this change, both the executor and worker were updating execution state in the database, causing:
1. **Race conditions** - unclear which service's update would happen first
2. **Redundant writes** - both services writing the same status value
3. **Architectural confusion** - no clear ownership boundaries
4. **Warning logs** - duplicate completion notifications
## The Solution: Lifecycle-Based Ownership
Execution state ownership is divided based on **lifecycle stage**, with a clear handoff point:
```
┌──────────────────────────────────────────────────────┐
│                  EXECUTOR OWNERSHIP                  │
│                                                      │
│   Requested → Scheduling → Scheduled                 │
│                                                      │
│   (includes cancellations/failures before the        │
│    execution.scheduled message is published)         │
│                                                      │
│   Handoff Point:                                     │
│   execution.scheduled message PUBLISHED              │
└──────────────────────────┬───────────────────────────┘
                           │ Worker receives message
                           ▼
┌──────────────────────────────────────────────────────┐
│                   WORKER OWNERSHIP                   │
│                                                      │
│   Running → Completed / Failed / Cancelled / Timeout │
└──────────────────────────────────────────────────────┘
```
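The ownership split in the diagram can be captured as a small lookup. This is an illustrative sketch only: the enum variants follow the lifecycle stages above, but `status_owner` and `Owner` are hypothetical helpers, not part of Attune's actual API. Note that a pure status lookup is an approximation for `Cancelled`, which the executor writes when cancellation happens pre-handoff.

```rust
// Hypothetical sketch: which service may write each lifecycle stage.
// `status_owner` and `Owner` are illustrative, not Attune's real types.
#[derive(Debug, Clone, Copy)]
enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
    Timeout,
}

#[derive(Debug, PartialEq)]
enum Owner {
    Executor,
    Worker,
}

fn status_owner(status: ExecutionStatus) -> Owner {
    match status {
        // Everything up to (and including) Scheduled belongs to the executor.
        ExecutionStatus::Requested
        | ExecutionStatus::Scheduling
        | ExecutionStatus::Scheduled => Owner::Executor,
        // Everything after the handoff belongs to the worker. (A pre-handoff
        // cancellation is the exception: the executor writes it, so real
        // ownership keys off the handoff event, not the status alone.)
        _ => Owner::Worker,
    }
}

fn main() {
    assert_eq!(status_owner(ExecutionStatus::Scheduled), Owner::Executor);
    assert_eq!(status_owner(ExecutionStatus::Running), Owner::Worker);
    // Post-handoff cancellation is written by the worker.
    assert_eq!(status_owner(ExecutionStatus::Cancelled), Owner::Worker);
}
```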
### Executor Responsibilities
The **Executor Service** owns execution state from creation through scheduling:
- ✅ Creates execution records (`Requested`)
- ✅ Updates status during scheduling (`Scheduling`)
- ✅ Updates status when scheduled to worker (`Scheduled`)
- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
- ✅ Handles cancellations/failures BEFORE `execution.scheduled` is published
- ❌ Does NOT update status after `execution.scheduled` is published
**Lifecycle stages**: `Requested` → `Scheduling` → `Scheduled`
**Important**: If an execution is cancelled or fails before the executor publishes `execution.scheduled`, the executor is responsible for updating the status (e.g., to `Cancelled`). The worker never learns about executions that don't reach the handoff point.
### Worker Responsibilities
The **Worker Service** owns execution state after receiving the handoff:
- ✅ Receives `execution.scheduled` message **← TAKES OWNERSHIP**
- ✅ Updates status when execution starts (`Running`)
- ✅ Updates status when execution completes (`Completed`, `Failed`, etc.)
- ✅ Handles cancellations AFTER receiving `execution.scheduled`
- ✅ Updates execution result data
- ✅ Publishes `execution.status_changed` notifications
- ✅ Publishes `execution.completed` notifications
- ❌ Does NOT update status for executions it hasn't received
**Lifecycle stages**: `Running` → `Completed` / `Failed` / `Cancelled` / `Timeout`
**Important**: The worker only owns executions it has received via `execution.scheduled`. If a cancellation happens before this message is sent, the worker is never involved.
## Message Flow
### 1. Executor Creates and Schedules
```
Executor Service
├─> Creates execution (status: Requested)
├─> Updates status: Scheduling
├─> Selects worker
├─> Updates status: Scheduled
└─> Publishes: execution.scheduled → worker-specific queue
```
### 2. Worker Receives and Executes
```
Worker Service
├─> Receives: execution.scheduled
├─> Updates DB: Scheduled → Running
├─> Publishes: execution.status_changed (running)
├─> Executes action
├─> Updates DB: Running → Completed/Failed
├─> Publishes: execution.status_changed (completed/failed)
└─> Publishes: execution.completed
```
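The worker-side flow above implies a fixed set of legal transitions. As a sketch, they can be expressed as a guard; `worker_may_transition` is a hypothetical helper that mirrors the flow, not code from `crates/worker`:

```rust
// Hypothetical sketch: the transitions the worker is allowed to write,
// mirroring the flow above. `worker_may_transition` is illustrative only.
#[derive(Debug, Clone, Copy)]
enum ExecutionStatus {
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
    Timeout,
}

fn worker_may_transition(from: ExecutionStatus, to: ExecutionStatus) -> bool {
    use ExecutionStatus::*;
    matches!(
        (from, to),
        (Scheduled, Running)
            | (Running, Completed)
            | (Running, Failed)
            | (Running, Cancelled)
            | (Running, Timeout)
    )
}

fn main() {
    use ExecutionStatus::*;
    assert!(worker_may_transition(Scheduled, Running));
    assert!(worker_may_transition(Running, Completed));
    // Terminal states are final: no transitions out of them.
    assert!(!worker_may_transition(Completed, Running));
}
```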
### 3. Executor Handles Orchestration
```
Executor Service (ExecutionManager)
├─> Receives: execution.status_changed
├─> Does NOT update database
├─> Handles orchestration logic:
│   ├─> Triggers workflow children (if parent completed)
│   ├─> Updates workflow state
│   └─> Manages parent-child relationships
└─> Logs event for monitoring
```
### 4. Queue Management
```
Executor Service (CompletionListener)
├─> Receives: execution.completed
├─> Releases queue slot
├─> Notifies waiting executions
└─> Updates queue statistics
```
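The slot release in step 4 is where the old duplicate notifications produced the `active_count is 0` warning. A minimal sketch of a guarded release, where `QueueState` is a hypothetical stand-in for the CompletionListener's real bookkeeping:

```rust
use std::collections::HashMap;

// Hypothetical sketch: per-action slot bookkeeping in the CompletionListener.
// `QueueState` is an illustrative stand-in, not the real queue type.
struct QueueState {
    active: HashMap<i64, u32>, // action_id -> currently active executions
}

impl QueueState {
    /// Release one slot; returns false if there was nothing to release,
    /// which is exactly the duplicate-notification case.
    fn on_completed(&mut self, action_id: i64) -> bool {
        match self.active.get_mut(&action_id) {
            Some(n) if *n > 0 => {
                *n -= 1;
                true
            }
            _ => false,
        }
    }
}

fn main() {
    let mut q = QueueState {
        active: HashMap::from([(3_i64, 1_u32)]),
    };
    assert!(q.on_completed(3)); // normal release
    assert!(!q.on_completed(3)); // duplicate notification hits the guard
}
```

With a single owner per lifecycle stage, the duplicate path should no longer fire; the guard just keeps a stray redelivery from corrupting the counts.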
## Database Update Rules
### Executor (Pre-Scheduling)
**File**: `crates/executor/src/scheduler.rs`
```rust
// ✅ Executor updates DB before scheduling
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;

// Publish to worker
Self::queue_to_worker(...).await?;
```
### Worker (Post-Scheduling)
**File**: `crates/worker/src/executor.rs`
```rust
// ✅ Worker updates DB when starting
async fn execute(&self, execution_id: i64) -> Result<ExecutionResult> {
    // Update status to running
    self.update_execution_status(execution_id, ExecutionStatus::Running)
        .await?;

    // Execute action...
}

// ✅ Worker updates DB when completing
async fn handle_execution_success(&self, execution_id: i64, result: &ExecutionResult) -> Result<()> {
    let input = UpdateExecutionInput {
        status: Some(ExecutionStatus::Completed),
        result: Some(result_data),
        // ...
    };
    ExecutionRepository::update(&self.pool, execution_id, input).await?;
    Ok(())
}
```
### Executor (Post-Scheduling)
**File**: `crates/executor/src/execution_manager.rs`
```rust
// ❌ Executor does NOT update DB after scheduling
async fn process_status_change(...) -> Result<()> {
    // Fetch execution (for orchestration logic only)
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Handle orchestration, but do NOT update the database
    match status {
        ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled => {
            Self::handle_completion(pool, publisher, &execution).await?;
        }
        _ => {}
    }

    Ok(())
}
```
## Benefits
### 1. Clear Ownership Boundaries
- No ambiguity about who updates what
- Easy to reason about system behavior
- Reduced cognitive load for developers
### 2. Eliminated Race Conditions
- Only one service updates each lifecycle stage
- No competing writes to same fields
- Predictable state transitions
### 3. Better Performance
- No redundant database writes
- Reduced database contention
- Lower network overhead (fewer queries)
### 4. Cleaner Logs
Before:
```
executor | Updated execution 9061 status: Scheduled -> Running
executor | Updated execution 9061 status: Running -> Running
executor | Updated execution 9061 status: Completed -> Completed
executor | WARN: Completion notification for action 3 but active_count is 0
```
After:
```
executor | Execution 9061 scheduled to worker 29
worker | Starting execution: 9061
worker | Execution 9061 completed successfully in 142ms
executor | Execution 9061 reached terminal state: Completed, handling orchestration
```
### 5. Idempotent Message Handling
- Executor can safely receive duplicate status change messages
- Worker updates are authoritative
- No special logic needed for retries
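One way to keep the orchestration side duplicate-tolerant is to record which executions have already triggered orchestration, so a redelivered `execution.status_changed` becomes a no-op. This is an illustrative sketch, not the actual `ExecutionManager`; `OrchestrationState` and `should_orchestrate` are hypothetical names:

```rust
use std::collections::HashSet;

// Hypothetical sketch: a duplicate-tolerant orchestration handler.
// `handled` tracks which executions already triggered orchestration.
struct OrchestrationState {
    handled: HashSet<i64>,
}

impl OrchestrationState {
    /// Returns true if orchestration should run for this notification.
    /// A redelivered message for the same execution becomes a no-op,
    /// because HashSet::insert returns false for an existing key.
    fn should_orchestrate(&mut self, execution_id: i64, is_terminal: bool) -> bool {
        is_terminal && self.handled.insert(execution_id)
    }
}

fn main() {
    let mut state = OrchestrationState { handled: HashSet::new() };
    assert!(state.should_orchestrate(9061, true)); // first delivery: orchestrate
    assert!(!state.should_orchestrate(9061, true)); // duplicate: no-op
    assert!(!state.should_orchestrate(9062, false)); // non-terminal: no-op
}
```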
## Edge Cases & Error Handling
### Cancellation Before Handoff
**Scenario**: Execution is queued due to concurrency policy, user cancels before scheduling.
**Handling**:
- Execution in `Requested` or `Scheduling` state
- Executor updates status: → `Cancelled`
- Worker never receives `execution.scheduled`
- No worker resources consumed ✅
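The pre-handoff cancel path can be sketched as a guard that refuses to write once the handoff has been published. `executor_cancel` and the pared-down status enum are illustrative, not Attune's real types:

```rust
// Hypothetical sketch of the executor-side cancel guard.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Cancelled,
}

/// The executor may only write `Cancelled` while it still owns the
/// execution, i.e. before execution.scheduled has been published.
fn executor_cancel(current: ExecutionStatus, handoff_published: bool) -> Option<ExecutionStatus> {
    if handoff_published {
        // Ownership has passed to the worker; the executor must not write.
        return None;
    }
    match current {
        ExecutionStatus::Requested | ExecutionStatus::Scheduling => {
            Some(ExecutionStatus::Cancelled)
        }
        _ => None,
    }
}

fn main() {
    assert_eq!(
        executor_cancel(ExecutionStatus::Requested, false),
        Some(ExecutionStatus::Cancelled)
    );
    // After the handoff, the executor refuses to touch the record.
    assert_eq!(executor_cancel(ExecutionStatus::Scheduled, true), None);
}
```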
### Cancellation After Handoff
**Scenario**: Execution already scheduled to worker, user cancels while running.
**Handling**:
- Worker has received `execution.scheduled` and owns execution
- Worker updates status: `Running` → `Cancelled`
- Worker publishes status change notification
- Executor handles orchestration (e.g., skip workflow children)
### Worker Crashes Before Updating Status
**Scenario**: Worker receives `execution.scheduled` but crashes before updating status to `Running`.
**Handling**:
- Execution remains in `Scheduled` state
- Worker owned the execution but failed to update
- Executor's heartbeat monitoring detects stale scheduled executions
- After timeout, executor can reschedule to another worker or mark as abandoned
- Idempotent: If worker already started, duplicate scheduling is rejected
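The stale-execution scan the heartbeat monitoring performs can be sketched as a simple timeout filter. This is illustrative only (plain epoch seconds rather than Attune's actual heartbeat data; `ScheduledExecution` and `stale_executions` are hypothetical names):

```rust
// Hypothetical sketch: detect executions stuck in Scheduled past a timeout.
// Timestamps are plain epoch seconds to keep the example self-contained.
struct ScheduledExecution {
    id: i64,
    scheduled_at_secs: u64, // when execution.scheduled was published
}

fn stale_executions(execs: &[ScheduledExecution], now_secs: u64, timeout_secs: u64) -> Vec<i64> {
    execs
        .iter()
        .filter(|e| now_secs.saturating_sub(e.scheduled_at_secs) > timeout_secs)
        .map(|e| e.id)
        .collect()
}

fn main() {
    let execs = vec![
        ScheduledExecution { id: 9061, scheduled_at_secs: 1_000 },
        ScheduledExecution { id: 9062, scheduled_at_secs: 1_110 },
    ];
    // With now = 1_120 and a 60 s timeout, only the first execution is stale.
    assert_eq!(stale_executions(&execs, 1_120, 60), vec![9061]);
}
```

Anything this scan flags is a candidate for rescheduling or being marked abandoned, per the handling notes above.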
### Message Delivery Delays
**Scenario**: Worker updates DB but `execution.status_changed` message is delayed.
**Handling**:
- Database reflects correct state (source of truth)
- Executor eventually receives notification and handles orchestration
- Orchestration logic is idempotent (safe to call multiple times)
- Critical: workflows may see a slight delay but remain consistent
### Partial Failures
**Scenario**: Worker updates DB successfully but fails to publish notification.
**Handling**:
- Database has correct state (worker succeeded)
- Executor won't trigger orchestration until notification arrives
- Future enhancement: Periodic executor polling for stale completions
- Workaround: Worker retries message publishing with exponential backoff
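A sketch of the exponential backoff schedule mentioned in the workaround. `backoff_delay` is a hypothetical helper, and the base delay and cap are made-up values:

```rust
use std::time::Duration;

// Hypothetical sketch: delay before the Nth republish attempt,
// doubling each time and clamped to a cap.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // 2^attempt, with the exponent clamped so the multiplier cannot overflow.
    let multiplier = 1u32 << attempt.min(16);
    base.checked_mul(multiplier).unwrap_or(cap).min(cap)
}

fn main() {
    let base = Duration::from_millis(100);
    let cap = Duration::from_secs(30);
    assert_eq!(backoff_delay(0, base, cap), Duration::from_millis(100));
    assert_eq!(backoff_delay(3, base, cap), Duration::from_millis(800));
    // Large attempt counts saturate at the cap.
    assert_eq!(backoff_delay(20, base, cap), cap);
}
```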
## Migration Notes
### Changes Required
1. **Executor Service** (`execution_manager.rs`):
- ✅ Removed database updates from `process_status_change()`
- ✅ Changed to read-only orchestration handler
- ✅ Updated logs to reflect observer role
2. **Worker Service** (`service.rs`):
- ✅ Already updates DB directly (no changes needed)
- ✅ Updated comment: "we'll update the database directly"
3. **Documentation**:
- ✅ Updated module docs to reflect ownership model
- ✅ Added ownership boundaries to architecture docs
### Backward Compatibility
- ✅ No breaking changes to external APIs
- ✅ Message formats unchanged
- ✅ Database schema unchanged
- ✅ Workflow behavior unchanged
## Testing Strategy
### Unit Tests
- ✅ Executor tests verify no DB updates after scheduling
- ✅ Worker tests verify DB updates at all lifecycle stages
- ✅ Message handler tests verify orchestration without DB writes
### Integration Tests
- Test full execution lifecycle end-to-end
- Verify status transitions in database
- Confirm orchestration logic (workflow children) still works
- Test failure scenarios (worker crashes, message delays)
### Monitoring
Monitor for:
- Executions stuck in `Scheduled` state (worker not picking up)
- Large delays between status changes (message queue lag)
- Workflow children not triggering (orchestration failure)
## Future Enhancements
### 1. Executor Polling for Stale Completions
If `execution.status_changed` messages are lost, executor could periodically poll for completed executions that haven't triggered orchestration.
### 2. Worker Health Checks
More robust detection of worker failures before scheduled executions time out.
### 3. Explicit Handoff Messages
Consider adding `execution.handoff` message to explicitly mark ownership transfer point.
## References
- **Architecture Doc**: `docs/architecture/executor-service.md`
- **Work Summary**: `work-summary/2026-02-09-duplicate-completion-fix.md`
- **Bug Fix Doc**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
- **ExecutionManager**: `crates/executor/src/execution_manager.rs`
- **Worker Executor**: `crates/worker/src/executor.rs`
- **Worker Service**: `crates/worker/src/service.rs`
## Summary
The execution state ownership model provides **clear, lifecycle-based boundaries** for who updates execution records:
- **Executor**: Owns state from creation through scheduling (including pre-handoff cancellations)
- **Worker**: Owns state after receiving `execution.scheduled` message
- **Handoff**: Occurs when `execution.scheduled` message is **published to worker**
- **Key Principle**: Worker only knows about executions it receives; pre-handoff cancellations are executor's responsibility
This eliminates race conditions, reduces database load, and provides a clean architectural foundation for future enhancements.