# Workflow Execution Engine

## Overview

The Workflow Execution Engine orchestrates the execution of workflows in Attune. It manages task dependencies, parallel execution, state transitions, context passing, retries, timeouts, and error handling.

## Architecture

The execution engine consists of four main components:

### 1. Task Graph Builder (`workflow/graph.rs`)

**Purpose:** Converts workflow definitions into executable task graphs with dependency information.
**Key Features:**

- Builds a directed acyclic graph (DAG) from workflow tasks
- Topological sorting for execution order
- Dependency computation from task transitions
- Cycle detection
- Entry point identification

**Data Structures:**

- `TaskGraph`: Complete executable graph with nodes, dependencies, and execution order
- `TaskNode`: Individual task with configuration, transitions, and dependencies
- `TaskTransitions`: Success/failure/complete/timeout transitions and decision branches
- `RetryConfig`: Retry configuration with backoff strategies

**Example Usage:**

```rust
use std::collections::HashSet;

use attune_executor::workflow::{TaskGraph, parse_workflow_yaml};

let workflow = parse_workflow_yaml(yaml_content)?;
let graph = TaskGraph::from_workflow(&workflow)?;

// Get entry points (tasks with no dependencies)
for entry in &graph.entry_points {
    println!("Entry point: {}", entry);
}

// Get tasks ready to execute
let completed = HashSet::new();
let ready = graph.ready_tasks(&completed);
```
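The ordering and cycle-detection steps can be sketched with Kahn's algorithm. This is a stand-alone illustration, not the actual `TaskGraph` internals; `topo_sort` and its signature are invented for the sketch:

```rust
use std::collections::{HashMap, VecDeque};

/// Topologically sort tasks given their dependency lists (Kahn's
/// algorithm). Returns None when a cycle prevents a valid order.
fn topo_sort(deps: &HashMap<&str, Vec<&str>>) -> Option<Vec<String>> {
    // in-degree = number of unmet dependencies per task
    let mut indeg: HashMap<&str, usize> =
        deps.iter().map(|(&t, ds)| (t, ds.len())).collect();
    // reverse edges: dependency -> dependents
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();
    for (&task, ds) in deps {
        for &d in ds {
            dependents.entry(d).or_default().push(task);
        }
    }
    // entry points: tasks with no dependencies
    let mut queue: VecDeque<&str> = indeg
        .iter()
        .filter(|&(_, &d)| d == 0)
        .map(|(&t, _)| t)
        .collect();
    let mut order = Vec::new();
    while let Some(t) = queue.pop_front() {
        order.push(t.to_string());
        for &dep in dependents.get(t).map(|v| v.as_slice()).unwrap_or(&[]) {
            let e = indeg.get_mut(dep).expect("dependent must be a known task");
            *e -= 1;
            if *e == 0 {
                queue.push_back(dep);
            }
        }
    }
    // any task never reaching in-degree 0 is part of a cycle
    (order.len() == deps.len()).then_some(order)
}
```

A cyclic input (e.g. `a -> b -> a`) yields `None`, which is where parse-time cycle detection would reject the workflow.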
### 2. Context Manager (`workflow/context.rs`)

**Purpose:** Manages workflow execution context, including variables, parameters, and template rendering.

**Key Features:**

- Workflow-level and task-level variable management
- Jinja2-like template rendering with `{{ variable }}` syntax
- Task result storage and retrieval
- With-items iteration support (current item and index)
- Nested value access (e.g., `{{ parameters.config.server.port }}`)
- Context import/export for persistence

**Variable Scopes:**

- `parameters.*` - Input parameters to the workflow
- `vars.*` or `variables.*` - Workflow-scoped variables
- `task.*` or `tasks.*` - Task results
- `item` - Current item in with-items iteration
- `index` - Current index in with-items iteration
- `system.*` - System variables (e.g., workflow start time)

**Example Usage:**

```rust
use std::collections::HashMap;

use attune_executor::workflow::WorkflowContext;
use serde_json::json;

let params = json!({"name": "Alice"});
let mut ctx = WorkflowContext::new(params, HashMap::new());

// Render template
let result = ctx.render_template("Hello {{ parameters.name }}!")?;
// Result: "Hello Alice!"

// Store task result
ctx.set_task_result("task1", json!({"status": "success"}));

// Publish variables from result
let result = json!({"output": "value"});
ctx.publish_from_result(&result, &["my_var".to_string()], None)?;
```
### 3. Task Executor (`workflow/task_executor.rs`)

**Purpose:** Executes individual workflow tasks with support for different task types, retries, and timeouts.

**Key Features:**

- Action task execution (queues actions for workers)
- Parallel task execution (spawns multiple tasks concurrently)
- Workflow task execution (nested workflows - TODO)
- With-items iteration (batch processing with concurrency limits)
- Conditional execution (when clauses)
- Retry logic with configurable backoff strategies
- Timeout handling
- Task result publishing to context

**Task Types:**

- **Action**: Execute a single action
- **Parallel**: Execute multiple sub-tasks concurrently
- **Workflow**: Execute a nested workflow (not yet implemented)

**Retry Strategies:**

- **Constant**: Fixed delay between retries
- **Linear**: Linearly increasing delay
- **Exponential**: Exponentially increasing delay with optional max delay

**Example Task Execution Flow:**

```
1. Check if task should be skipped (when condition)
2. Check if task has with-items iteration
   - If yes, process items in batches with concurrency limits
   - If no, execute single task
3. Render task input with context
4. Execute based on task type (action/parallel/workflow)
5. Apply timeout if configured
6. Handle retries on failure
7. Publish variables from result
8. Update task execution record in database
```
### 4. Workflow Coordinator (`workflow/coordinator.rs`)

**Purpose:** Main orchestration component that manages the complete workflow execution lifecycle.

**Key Features:**

- Workflow lifecycle management (start, pause, resume, cancel)
- State management (completed, failed, skipped tasks)
- Concurrent task execution coordination
- Database state persistence
- Execution result aggregation
- Error handling and recovery

**Workflow Execution States:**

- `Requested` - Workflow execution requested
- `Scheduling` - Being scheduled
- `Scheduled` - Ready to execute
- `Running` - Currently executing
- `Completed` - Successfully completed
- `Failed` - Failed with errors
- `Cancelled` - Cancelled by user
- `Timeout` - Timed out

**Example Usage:**

```rust
use attune_executor::workflow::WorkflowCoordinator;
use serde_json::json;

let coordinator = WorkflowCoordinator::new(db_pool, mq);

// Start workflow execution
let handle = coordinator
    .start_workflow("my_pack.my_workflow", json!({"param": "value"}), None)
    .await?;

// Execute to completion
let result = handle.execute().await?;

println!("Status: {:?}", result.status);
println!("Completed tasks: {}", result.completed_tasks);
println!("Failed tasks: {}", result.failed_tasks);

// Or control execution
handle.pause(Some("User requested pause".to_string())).await?;
handle.resume().await?;
handle.cancel().await?;

// Check status
let status = handle.status().await;
println!("Current: {}/{} tasks", status.completed_tasks, status.total_tasks);
```
## Execution Flow

### High-Level Workflow Execution

```
1. Load workflow definition from database
2. Parse workflow YAML definition
3. Build task graph with dependencies
4. Create parent execution record
5. Initialize workflow context with parameters and variables
6. Create workflow execution record in database
7. Enter execution loop:
   a. Check if workflow is paused -> wait
   b. Check if workflow is complete -> exit
   c. Get ready tasks (dependencies satisfied)
   d. Spawn async execution for each ready task
   e. Wait briefly before checking again
8. Aggregate results and return
```
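Step 7's scheduling loop can be sketched as a simplified synchronous version. The real coordinator spawns ready tasks asynchronously and also handles pausing, failures, and persistence; `run_to_completion` is invented for this sketch:

```rust
use std::collections::{HashMap, HashSet};

/// Repeatedly pick tasks whose dependencies are all completed,
/// "execute" them, and stop when nothing more is runnable.
fn run_to_completion(
    deps: &HashMap<&str, Vec<&str>>,
    mut execute: impl FnMut(&str),
) -> Vec<String> {
    let mut completed: HashSet<&str> = HashSet::new();
    let mut order = Vec::new();
    loop {
        // ready = not yet completed, and every dependency already completed
        let ready: Vec<&str> = deps
            .iter()
            .filter(|&(&t, ds)| {
                !completed.contains(t) && ds.iter().all(|d| completed.contains(d))
            })
            .map(|(&t, _)| t)
            .collect();
        if ready.is_empty() {
            break; // everything runnable has run (or is blocked)
        }
        for t in ready {
            execute(t);
            completed.insert(t);
            order.push(t.to_string());
        }
    }
    order
}
```

Note how a failed dependency simply leaves its dependents out of `ready` forever, matching the "dependent tasks won't execute if prerequisites failed" behavior described later.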
### Task Execution Flow

```
1. Create task execution record in database
2. Get current workflow context
3. Execute task (action/parallel/workflow/with-items)
4. Update task execution record with result
5. Update workflow state:
   - Add to completed_tasks on success
   - Add to failed_tasks on failure (unless retrying)
   - Add to skipped_tasks if skipped
   - Update context with task result
6. Persist workflow state to database
```
## Database Schema

### workflow_execution Table

Stores workflow execution state:

```sql
CREATE TABLE attune.workflow_execution (
    id BIGSERIAL PRIMARY KEY,
    execution BIGINT NOT NULL REFERENCES attune.execution(id),
    workflow_def BIGINT NOT NULL REFERENCES attune.workflow_definition(id),
    current_tasks TEXT[] NOT NULL DEFAULT '{}',
    completed_tasks TEXT[] NOT NULL DEFAULT '{}',
    failed_tasks TEXT[] NOT NULL DEFAULT '{}',
    skipped_tasks TEXT[] NOT NULL DEFAULT '{}',
    variables JSONB NOT NULL DEFAULT '{}',
    task_graph JSONB NOT NULL,
    status execution_status_enum NOT NULL,
    error_message TEXT,
    paused BOOLEAN NOT NULL DEFAULT false,
    pause_reason TEXT,
    created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
```

### workflow_task_execution Table

Stores individual task execution state:

```sql
CREATE TABLE attune.workflow_task_execution (
    id BIGSERIAL PRIMARY KEY,
    workflow_execution BIGINT NOT NULL REFERENCES attune.workflow_execution(id),
    execution BIGINT NOT NULL REFERENCES attune.execution(id),
    task_name TEXT NOT NULL,
    task_index INTEGER,
    task_batch INTEGER,
    status execution_status_enum NOT NULL,
    started_at TIMESTAMP WITH TIME ZONE,
    completed_at TIMESTAMP WITH TIME ZONE,
    duration_ms BIGINT,
    result JSONB,
    error JSONB,
    retry_count INTEGER NOT NULL DEFAULT 0,
    max_retries INTEGER NOT NULL DEFAULT 0,
    next_retry_at TIMESTAMP WITH TIME ZONE,
    timeout_seconds INTEGER,
    timed_out BOOLEAN NOT NULL DEFAULT false,
    created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
```
## Template Rendering

### Syntax

Templates use Jinja2-like syntax with `{{ expression }}`:

```yaml
tasks:
  - name: greet
    action: core.echo
    input:
      message: "Hello {{ parameters.name }}!"

  - name: process
    action: core.process
    input:
      data: "{{ task.greet.result.output }}"
      count: "{{ variables.counter }}"
```

### Supported Expressions

**Parameters:**

```
{{ parameters.name }}
{{ parameters.config.server.port }}
```

**Variables:**

```
{{ vars.my_variable }}
{{ variables.counter }}
{{ my_var }}  # Direct variable reference
```

**Task Results:**

```
{{ task.task_name.result }}
{{ task.task_name.output.key }}
{{ tasks.previous_task.status }}
```

**With-Items Context:**

```
{{ item }}
{{ item.name }}
{{ index }}
```

**System Variables:**

```
{{ system.workflow_start }}
```
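The substitution mechanism can be sketched with dotted-path lookup over nested string maps. `Val`, `lookup`, and `render` are invented for this sketch; the real engine handles richer value types and expressions:

```rust
use std::collections::HashMap;

/// Nested context value: either a leaf string or a map of children.
enum Val {
    Str(String),
    Map(HashMap<String, Val>),
}

/// Resolve a dotted path like "parameters.name" against nested maps.
fn lookup<'a>(root: &'a Val, path: &str) -> Option<&'a str> {
    let mut cur = root;
    for part in path.split('.') {
        match cur {
            Val::Map(m) => cur = m.get(part)?,
            Val::Str(_) => return None, // path descends into a leaf
        }
    }
    match cur {
        Val::Str(s) => Some(s.as_str()),
        Val::Map(_) => None, // path stops at a non-leaf
    }
}

/// Substitute every `{{ expression }}` in the template; unknown
/// expressions render as empty strings in this sketch.
fn render(template: &str, ctx: &Val) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find("}}") {
            Some(end) => {
                out.push_str(lookup(ctx, after[..end].trim()).unwrap_or(""));
                rest = &after[end + 2..];
            }
            None => {
                // unterminated `{{` is emitted verbatim
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

With `parameters.name = "Alice"` in the context, `render("Hello {{ parameters.name }}!", &ctx)` yields `"Hello Alice!"`, matching the Context Manager example.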
## With-Items Iteration

Execute a task multiple times with different items:

```yaml
tasks:
  - name: process_servers
    action: server.configure
    with_items: "{{ parameters.servers }}"
    batch_size: 5    # Process 5 items at a time
    concurrency: 10  # Max 10 concurrent executions
    input:
      server: "{{ item.hostname }}"
      index: "{{ index }}"
```

**Features:**

- Batch processing: Items are processed in batches of the specified size
- Concurrency control: Limits the number of concurrent executions
- Context isolation: Each iteration gets its own `item` and `index`
- Result aggregation: All results are collected into an array
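The batch and index bookkeeping can be sketched as follows; `batches` is a hypothetical helper, not the executor's API, and the concurrency-limited execution within each batch is elided:

```rust
/// Split items into batches of `batch_size`, pairing each item with
/// the `index` it will see in its iteration context.
fn batches<T: Clone>(items: &[T], batch_size: usize) -> Vec<Vec<(usize, T)>> {
    items
        .iter()
        .cloned()
        .enumerate() // pairs each item with its `index` variable
        .collect::<Vec<_>>()
        .chunks(batch_size.max(1)) // guard against batch_size == 0
        .map(|chunk| chunk.to_vec())
        .collect()
}
```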
## Retry Strategies

### Constant Backoff

Fixed delay between retries:

```yaml
tasks:
  - name: flaky_task
    action: external.api_call
    retry:
      count: 3
      delay: 10  # 10 seconds between each retry
      backoff: constant
```

### Linear Backoff

Linearly increasing delay:

```yaml
retry:
  count: 5
  delay: 5
  backoff: linear
  # Delays: 5s, 10s, 15s, 20s, 25s
```

### Exponential Backoff

Exponentially increasing delay:

```yaml
retry:
  count: 5
  delay: 2
  backoff: exponential
  max_delay: 60
  # Delays: 2s, 4s, 8s, 16s, 32s (each delay is capped at max_delay)
```
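The delay sequences in the comments above follow a simple formula, sketched here; `retry_delay` and its signature are illustrative, not the engine's actual API:

```rust
/// Compute the delay (seconds) before retry number `attempt` (1-based)
/// for the three backoff strategies.
fn retry_delay(backoff: &str, delay: u64, attempt: u64, max_delay: Option<u64>) -> u64 {
    let attempt = attempt.max(1); // guard: attempts are 1-based
    let raw = match backoff {
        "linear" => delay.saturating_mul(attempt),
        "exponential" => delay.saturating_mul(2u64.saturating_pow((attempt - 1) as u32)),
        _ => delay, // "constant" (and unknown values) use a fixed delay
    };
    max_delay.map_or(raw, |cap| raw.min(cap))
}
```

For `delay: 2, backoff: exponential` this reproduces the 2s, 4s, 8s, 16s, 32s sequence; a later attempt that would reach 128s is clamped to `max_delay: 60`.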
## Task Transitions

Control workflow flow with transitions:

```yaml
tasks:
  - name: check
    action: core.check_status
    on_success: deploy    # Go to deploy on success
    on_failure: rollback  # Go to rollback on failure
    on_complete: notify   # Always go to notify
    on_timeout: alert     # Go to alert on timeout

  - name: decision
    action: core.evaluate
    decision:
      - when: "{{ task.decision.result.action == 'approve' }}"
        next: deploy
      - when: "{{ task.decision.result.action == 'reject' }}"
        next: rollback
      - default: true
        next: manual_review
```
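Decision branch selection can be sketched as first-match-wins with an unconditional default branch; `Branch` and `resolve_decision` are invented names, and the `when` conditions are assumed to be rendered and evaluated already:

```rust
/// One decision branch: a pre-evaluated condition (None = default
/// branch, which matches unconditionally) and the next task name.
struct Branch<'a> {
    when: Option<bool>,
    next: &'a str,
}

/// Return the first branch whose condition holds, checked in order.
fn resolve_decision<'a>(branches: &[Branch<'a>]) -> Option<&'a str> {
    branches
        .iter()
        .find(|b| b.when.unwrap_or(true))
        .map(|b| b.next)
}
```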
## Error Handling

### Task Execution Errors

Errors are captured with:

- Error message
- Error type
- Optional error details (JSON)

### Workflow Failure Handling

- Individual task failures don't immediately stop the workflow
- Dependent tasks won't execute if prerequisites failed
- The workflow completes when all executable tasks finish
- The final status is `Failed` if any task failed

### Retry on Error

```yaml
retry:
  count: 3
  delay: 5
  backoff: exponential
  on_error: "{{ result.error_code == 'RETRY_ABLE' }}"  # Only retry specific errors
```
## Parallel Execution

Execute multiple tasks concurrently:

```yaml
tasks:
  - name: parallel_checks
    type: parallel
    tasks:
      - name: check_service_a
        action: monitoring.check_health
        input:
          service: "service-a"

      - name: check_service_b
        action: monitoring.check_health
        input:
          service: "service-b"

      - name: check_database
        action: monitoring.check_db

    on_success: deploy
    on_failure: abort
```

**Features:**

- All sub-tasks execute concurrently
- The parent task waits for all sub-tasks to complete
- The parent succeeds only if all sub-tasks succeed
- Individual sub-task results are aggregated
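These semantics can be sketched with OS threads, though the real executor uses async futures rather than threads; `run_parallel` is illustrative only:

```rust
use std::thread;

/// Run all sub-tasks concurrently, wait for every one, and succeed
/// only if all of them succeeded (all results are still collected).
fn run_parallel(subtasks: Vec<fn() -> bool>) -> bool {
    let handles: Vec<_> = subtasks
        .into_iter()
        .map(|f| thread::spawn(move || f()))
        .collect();
    let results: Vec<bool> = handles
        .into_iter()
        .map(|h| h.join().expect("sub-task panicked"))
        .collect();
    results.iter().all(|&ok| ok)
}
```

Note that every handle is joined before the result is computed, so a single failing sub-task does not cut the others short.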
## Conditional Execution

Skip tasks based on conditions:

```yaml
tasks:
  - name: deploy
    action: deployment.deploy
    when: "{{ parameters.environment == 'production' }}"
    input:
      version: "{{ parameters.version }}"
```

**When Clause Evaluation:**

- The template is rendered with the current context
- The result is evaluated as a boolean (truthy/falsy)
- The task is skipped if the condition is false
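The truthy/falsy step can be sketched as below, assuming the rendered condition arrives as a string; `is_truthy` and the exact falsy set are assumptions, not the engine's documented behavior:

```rust
/// Treat empty strings and common "false-like" literals as falsy;
/// everything else is truthy.
fn is_truthy(rendered: &str) -> bool {
    !matches!(
        rendered.trim().to_ascii_lowercase().as_str(),
        "" | "false" | "0" | "null" | "none"
    )
}
```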
## State Persistence

Workflow state is persisted to the database after every task completion:

- Current executing tasks
- Completed tasks list
- Failed tasks list
- Skipped tasks list
- Workflow variables (entire context)
- Execution status
- Pause state and reason
- Error messages

This enables:

- Workflow resume after service restart
- Pause/resume functionality
- Execution history and auditing
- Progress monitoring

## Integration Points

### Message Queue

Tasks queue action executions via RabbitMQ:

```rust
// Task executor creates execution record
let execution = create_execution_record(...).await?;

// Queues execution for worker (TODO: implement MQ publishing)
self.mq.publish_execution_request(execution.id, action_ref, &input).await?;
```
### Worker Coordination

- Executor creates execution records
- Workers pick up and execute actions
- Workers update execution status
- Coordinator monitors completion (TODO: implement completion listener)

### Event Publishing

Workflow events should be published for:

- Workflow started
- Workflow completed/failed
- Task started/completed/failed
- Workflow paused/resumed/cancelled

## Future Enhancements

### TODO Items

1. **Completion Listener**: Listen for task completion events from workers
2. **Nested Workflows**: Execute workflows as tasks within workflows
3. **MQ Publishing**: Implement actual message queue publishing for action execution
4. **Advanced Expressions**: Support comparisons and logical operators in templates
5. **Error Condition Evaluation**: Evaluate `on_error` expressions for selective retries
6. **Workflow Timeouts**: Global workflow timeout configuration
7. **Task Dependencies**: Explicit `depends_on` task specification
8. **Loop Constructs**: While/until loops in addition to with-items
9. **Manual Steps**: Human-in-the-loop approval tasks
10. **Sub-workflow Output**: Capture and use nested workflow results

## Testing

### Unit Tests

Each module includes unit tests:

```bash
# Run all executor tests
cargo test -p attune-executor

# Run specific module tests
cargo test -p attune-executor --lib workflow::graph
cargo test -p attune-executor --lib workflow::context
```
### Integration Tests

Integration tests require a database and message queue:

```bash
# Set up test database
export DATABASE_URL="postgresql://attune_test:attune_test@localhost:5432/attune_test"
sqlx migrate run

# Run integration tests
cargo test -p attune-executor --test '*'
```

## Performance Considerations

### Concurrency

- Parallel tasks execute truly concurrently using `futures::join_all`
- With-items supports configurable concurrency limits
- Task graph execution is optimized with topological sorting

### Database Operations

- Workflow state is persisted after each task completion
- Batch operations are used where possible
- Connection pooling for database access

### Memory

- Task graphs and contexts can be large for complex workflows
- Consider workflow size limits in production
- Context variables should be reasonably sized
## Troubleshooting
|
|
|
|
### Workflow Not Progressing
|
|
|
|
**Symptoms**: Workflow stuck in Running state
|
|
|
|
**Causes**:
|
|
- Circular dependencies (should be caught during parsing)
|
|
- All tasks waiting on failed dependencies
|
|
- Database connection issues
|
|
|
|
**Solution**: Check workflow state in database, review task dependencies
|
|
|
|
### Tasks Not Executing
|
|
|
|
**Symptoms**: Ready tasks not starting
|
|
|
|
**Causes**:
|
|
- Worker service not running
|
|
- Message queue not connected
|
|
- Execution records not being created
|
|
|
|
**Solution**: Check worker logs, verify MQ connection, check database
|
|
|
|
### Template Rendering Errors
|
|
|
|
**Symptoms**: Tasks fail with template errors
|
|
|
|
**Causes**:
|
|
- Invalid variable references
|
|
- Missing context data
|
|
- Malformed expressions
|
|
|
|
**Solution**: Validate templates, check available context variables
|
|
|
|
## Examples
|
|
|
|
See `docs/workflows/` for complete workflow examples demonstrating:
|
|
- Sequential workflows
|
|
- Parallel execution
|
|
- With-items iteration
|
|
- Conditional execution
|
|
- Error handling and retries
|
|
- Complex workflows with decisions
|
|
|
|
## Related Documentation
|
|
|
|
- [Workflow Definition Format](workflow-definition-format.md)
|
|
- [Pack Integration](api-pack-workflows.md)
|
|
- [Execution API](api-executions.md)
|
|
- [Message Queue Architecture](message-queue.md) |