attune/docs/architecture/executor-service.md

# Executor Service Architecture

## Overview

The **Executor Service** is the core orchestration engine of the Attune automation platform. It is responsible for processing rule enforcements, scheduling executions to workers, managing execution lifecycle, and orchestrating complex workflows.

## Service Architecture

The Executor is structured as a distributed microservice with three main processing components:

```
┌─────────────────────────────────────────────────────────────┐
│                     Executor Service                         │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────┐  ┌──────────────────────┐          │
│  │ Enforcement         │  │ Execution            │          │
│  │ Processor           │  │ Scheduler            │          │
│  └─────────────────────┘  └──────────────────────┘          │
│           │                         │                         │
│           │                         │                         │
│           v                         v                         │
│  ┌─────────────────────────────────────────────┐             │
│  │         Execution Manager                   │             │
│  └─────────────────────────────────────────────┘             │
│                                                               │
└─────────────────────────────────────────────────────────────┘
         │                    │                    │
         v                    v                    v
   PostgreSQL            RabbitMQ              Workers
```

## Core Components

### 1. Enforcement Processor

**Purpose**: Processes triggered rules and creates execution requests.

**Responsibilities**:
- Listens for `enforcement.created` messages from triggered rules
- Fetches enforcement, rule, and event data from the database
- Evaluates rule conditions and policies
- Creates execution records in the database
- Publishes `execution.requested` messages to the scheduler

**Message Flow**:
```
Rule Triggered → Enforcement Created → Enforcement Processor → Execution Created
```

**Key Implementation Details**:
- Uses `consume_with_handler` pattern for message consumption
- All processing methods are static to enable shared state across async handlers
- Validates rule is enabled before creating executions
- Links executions to enforcements for audit trail

### 2. Execution Scheduler

**Purpose**: Routes execution requests to available workers.

**Responsibilities**:
- Listens for `execution.requested` messages
- Determines runtime requirements for the action
- Selects appropriate workers based on:
  - Runtime compatibility
  - Worker status (active only)
  - Load balancing (future: capacity, affinity, locality)
- Updates execution status to `Scheduled`
- Publishes `execution.scheduled` messages to worker queues

**Message Flow**:
```
Execution Requested → Scheduler → Worker Selection → Execution Scheduled → Worker
```

**Worker Selection Algorithm**:
1. Fetch all available workers
2. Filter by runtime compatibility (if action specifies runtime)
3. Filter by worker status (only active workers)
4. Apply load balancing strategy (currently: first available)
5. Future: Consider capacity, affinity, geographic locality

**Key Implementation Details**:
- Supports multiple worker types (local, remote, container)
- Handles worker unavailability with error responses
- Plans for intelligent scheduling based on worker capabilities

### 3. Execution Manager

**Purpose**: Orchestrates execution workflows and handles lifecycle events.

**Responsibilities**:
- Listens for `execution.status.*` messages from workers
- **Does NOT update execution state** (worker owns state after scheduling)
- Handles execution completion orchestration (triggering child executions)
- Manages workflow executions (parent-child relationships)
- Coordinates workflow state transitions

**Ownership Model**:
- **Executor owns**: Requested → Scheduling → Scheduled (updates DB)
  - Includes pre-handoff cancellations/failures (before `execution.scheduled` is published)
- **Worker owns**: Running → Completed/Failed/Cancelled (updates DB)
  - Includes post-handoff cancellations/failures (after receiving `execution.scheduled`)
- **Handoff Point**: When `execution.scheduled` message is **published** to worker
  - Before publish: Executor owns and updates state
  - After publish: Worker owns and updates state

**Message Flow**:
```
Worker Status Update → Execution Manager → Orchestration Logic (Read-Only)
                                         → Trigger Child Executions
```

**Status Lifecycle**:
```
Requested → Scheduling → Scheduled → [HANDOFF: execution.scheduled published] → Running → Completed/Failed/Cancelled
    │                       │                                                     │
    └─ Executor Updates ───┘                                                     └─ Worker Updates
    │  (includes pre-handoff                                                     │  (includes post-handoff
    │   Cancelled)                                                               │   Cancelled/Timeout/Abandoned)
                                                                                  │
                                                                                  └→ Child Executions (workflows)
```

**Key Implementation Details**:
- Receives status change notifications for orchestration purposes only
- Does not update execution state after handoff to worker
- Handles workflow orchestration (parent-child execution chaining)
- Only triggers child executions on successful parent completion
- Read-only access to execution records for orchestration logic

## Message Queue Integration

### Message Types

The Executor consumes and produces several message types:

**Consumed**:
- `enforcement.created` - New enforcement from triggered rules
- `execution.requested` - Execution scheduling requests
- `execution.status.changed` - Status change notifications from workers (for orchestration)
- `execution.completed` - Completion notifications from workers (for queue management)

**Published**:
- `execution.requested` - To scheduler (from enforcement processor)
- `execution.scheduled` - To workers (from scheduler) **← OWNERSHIP HANDOFF**

**Note**: The executor does NOT publish `execution.completed` messages. This is the worker's responsibility as the authoritative source of execution state after scheduling.

### Message Envelope Structure

All messages use the standardized `MessageEnvelope<T>` structure:

```rust
MessageEnvelope {
    message_id: Uuid,
    message_type: MessageType,
    source: String,
    timestamp: DateTime<Utc>,
    correlation_id: Option<Uuid>,
    trace_id: Option<String>,
    payload: T,
    retry_count: u32,
}
```

### Consumer Handler Pattern

All processors use the `consume_with_handler` pattern for robust message consumption:

```rust
consumer.consume_with_handler(move |envelope: MessageEnvelope<PayloadType>| {
    // Clone shared state
    let pool = pool.clone();
    let publisher = publisher.clone();

    async move {
        // Process message
        Self::process_message(&pool, &publisher, &envelope).await
            .map_err(|e| format!("Error: {}", e).into())
    }
}).await?;
```

**Benefits**:
- Automatic message acknowledgment on success
- Automatic nack with requeue on retriable errors
- Automatic dead letter queue routing on non-retriable errors
- Built-in error handling and logging

## Database Integration

### Repository Pattern

All database access uses the repository layer:

```rust
use attune_common::repositories::{
    enforcement::EnforcementRepository,
    execution::ExecutionRepository,
    rule::RuleRepository,
    Create, FindById, Update, List,
};
```

### Database Update Ownership

**Executor updates execution state** from creation through handoff:
- Creates execution records (`Requested` status)
- Updates status during scheduling (`Scheduling` → `Scheduled`)
- Publishes `execution.scheduled` message to worker **← HANDOFF POINT**
- **Handles cancellations/failures BEFORE handoff** (before message is published)
  - Example: User cancels execution while queued by concurrency policy
  - Executor updates to `Cancelled`, worker never receives message

**Worker updates execution state** after receiving handoff:
- Receives `execution.scheduled` message (takes ownership)
- Updates status when execution starts (`Running`)
- Updates status when execution completes (`Completed`, `Failed`, etc.)
- **Handles cancellations/failures AFTER handoff** (after receiving message)
- Updates result data and artifacts
- Worker only owns executions it has received

**Executor reads execution state** for orchestration after handoff:
- Receives status change notifications from workers
- Reads execution records to trigger workflow children
- Does NOT update execution state after publishing `execution.scheduled`

### Transaction Support

Future implementations will use database transactions for multi-step operations:
- Creating execution + publishing message (atomic)
- Enforcement processing + execution creation (atomic)

## Configuration

The Executor service uses the standard Attune configuration system:

```yaml
# config.yaml
database:
  url: postgresql://localhost/attune
  max_connections: 20

message_queue:
  url: amqp://localhost
  exchange: attune.executions
  prefetch_count: 10
```

Environment variable overrides:
```bash
ATTUNE__DATABASE__URL=postgresql://prod-db/attune
ATTUNE__MESSAGE_QUEUE__URL=amqp://prod-mq
```

## Error Handling

### Error Types

The Executor handles several error categories:

1. **Database Errors**: Connection issues, query failures
2. **Message Queue Errors**: Connection drops, serialization failures
3. **Business Logic Errors**: Missing entities, invalid states
4. **Worker Errors**: No workers available, incompatible runtimes

### Retry Strategy

- **Retriable Errors**: Requeued for retry (connection issues, timeouts)
- **Non-Retriable Errors**: Sent to dead letter queue (invalid data, missing entities)
- **Retry Limits**: Configured per queue (future implementation)

### Dead Letter Queues

Failed messages are automatically routed to dead letter queues for investigation:
- `executor.enforcement.created.dlq`
- `executor.execution.requested.dlq`
- `executor.execution.status.dlq`

## Workflow Orchestration

### Parent-Child Executions

The Executor supports complex workflows through parent-child execution relationships:

```
Parent Execution (Completed)
  ├── Child Execution 1 (action_ref: "pack.action1")
  ├── Child Execution 2 (action_ref: "pack.action2")
  └── Child Execution 3 (action_ref: "pack.action3")
```

**Implementation**:
- Parent execution stores child action references
- On parent completion, Execution Manager creates child executions
- Child executions inherit parent's configuration
- Each child is independently scheduled and executed

### Future Enhancements

- **Conditional Workflows**: Execute children based on parent result
- **Parallel vs Sequential**: Control execution order
- **Workflow DAGs**: Complex dependency graphs
- **Workflow Templates**: Reusable workflow definitions

## Policy Enforcement

### Planned Features

1. **Rate Limiting**: Limit executions per time window
2. **Concurrency Control**: Maximum concurrent executions per action/pack
3. **Priority Queuing**: High-priority executions jump the queue
4. **Resource Quotas**: Limit resource consumption per tenant
5. **Execution Windows**: Only execute during specified time periods

### Implementation Location

Policy enforcement will be implemented in:
- Enforcement Processor (pre-execution validation)
- Scheduler (runtime constraint checking)
- New `PolicyEnforcer` module (future)

## Monitoring & Observability

### Metrics (Future)

- Executions per second (throughput)
- Average execution duration
- Queue depth and processing lag
- Worker utilization
- Error rates by type

### Logging

Structured logging at multiple levels:
- `INFO`: Successful operations, state transitions
- `WARN`: Degraded states, retry attempts
- `ERROR`: Failures requiring attention
- `DEBUG`: Detailed flow for troubleshooting

Example:
```
INFO Processing enforcement: 123
INFO Selected worker 45 for execution 789
INFO Execution 789 scheduled to worker 45
```

### Tracing

Message correlation and distributed tracing:
- `correlation_id`: Links related messages
- `trace_id`: End-to-end request tracing (future integration with OpenTelemetry)

## Running the Service

### Prerequisites

- PostgreSQL 14+ with schema initialized
- RabbitMQ 3.12+ with exchanges and queues configured
- Environment variables or config file set up

### Startup

```bash
# Using cargo
cd crates/executor
cargo run

# Or with environment overrides
ATTUNE__DATABASE__URL=postgresql://localhost/attune \
ATTUNE__MESSAGE_QUEUE__URL=amqp://localhost \
cargo run
```

### Graceful Shutdown

The service supports graceful shutdown via SIGTERM/SIGINT:
1. Stop accepting new messages
2. Finish processing in-flight messages
3. Close message queue connections
4. Close database connections
5. Exit cleanly

## Testing

### Unit Tests

Each module includes unit tests for business logic:
- Rule evaluation
- Worker selection algorithms
- Status parsing
- Workflow creation

### Integration Tests

Integration tests require PostgreSQL and RabbitMQ:
- End-to-end enforcement → execution flow
- Message queue reliability
- Database consistency

### Running Tests

```bash
# Unit tests only
cargo test -p attune-executor --lib

# Integration tests (requires services)
cargo test -p attune-executor --test '*'
```

## Future Enhancements

### Phase 1: Core Functionality (Current)
- ✅ Enforcement processing
- ✅ Execution scheduling
- ✅ Lifecycle management
- ✅ Message queue integration

### Phase 2: Advanced Features (Next)
- Policy enforcement (rate limiting, concurrency)
- Advanced workflow orchestration
- Inquiry handling (human-in-the-loop)
- Retry and failure handling improvements

### Phase 3: Production Readiness
- Comprehensive monitoring and metrics
- Performance optimization
- High availability setup
- Load testing and tuning

### Phase 4: Enterprise Features
- Multi-tenancy isolation
- Advanced scheduling algorithms
- Resource quotas and limits
- Audit logging and compliance

## Troubleshooting

### Common Issues

**Problem**: Executions stuck in "Requested" status
- **Cause**: Scheduler not running or no workers available
- **Solution**: Verify scheduler is running, check worker status

**Problem**: Messages not being consumed
- **Cause**: RabbitMQ connection issues or queue misconfiguration
- **Solution**: Check MQ connection, verify queue bindings

**Problem**: Database connection errors
- **Cause**: Connection pool exhausted or database down
- **Solution**: Increase pool size, check database health

### Debug Mode

Enable detailed logging:
```bash
RUST_LOG=attune_executor=debug,attune_common=debug cargo run
```

## Related Documentation

- [API - Executions](./api-executions.md)
- [API - Events & Enforcements](./api-events-enforcements.md)
- [API - Rules](./api-rules.md)
- [Configuration](./configuration.md)
- [Quick Start](./quick-start.md)