# Executor Service Architecture
## Overview
The **Executor Service** is the core orchestration engine of the Attune automation platform. It is responsible for processing rule enforcements, scheduling executions to workers, managing execution lifecycle, and orchestrating complex workflows.
## Service Architecture
The Executor is structured as a distributed microservice with three main processing components:
```
┌─────────────────────────────────────────────────────────────┐
│                      Executor Service                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────┐      ┌──────────────────────┐      │
│  │     Enforcement     │      │      Execution       │      │
│  │      Processor      │      │      Scheduler       │      │
│  └─────────────────────┘      └──────────────────────┘      │
│             │                             │                 │
│             │                             │                 │
│             v                             v                 │
│  ┌─────────────────────────────────────────────┐            │
│  │              Execution Manager              │            │
│  └─────────────────────────────────────────────┘            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
          │                    │                    │
          v                    v                    v
     PostgreSQL            RabbitMQ              Workers
```
## Core Components
### 1. Enforcement Processor
**Purpose**: Processes triggered rules and creates execution requests.
**Responsibilities**:
- Listens for `enforcement.created` messages from triggered rules
- Fetches enforcement, rule, and event data from the database
- Evaluates rule conditions and policies
- Creates execution records in the database
- Publishes `execution.requested` messages to the scheduler
**Message Flow**:
```
Rule Triggered → Enforcement Created → Enforcement Processor → Execution Created
```
**Key Implementation Details**:
- Uses `consume_with_handler` pattern for message consumption
- All processing methods are static (no `self` receiver), so each async handler works on cloned handles to shared state instead of borrowing the processor
- Validates rule is enabled before creating executions
- Links executions to enforcements for audit trail
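The enabled-rule check and the enforcement link can be pictured as a small pure function. This is a minimal sketch under stated assumptions: `Rule`, `Enforcement`, `ExecutionRequest`, and `plan_execution` are hypothetical simplified names, not the actual Attune models, and a disabled rule is assumed to be skipped rather than reported as an error.

```rust
/// Hypothetical, simplified models; the real Attune types carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Rule {
    pub id: i64,
    pub enabled: bool,
    pub action_ref: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Enforcement {
    pub id: i64,
    pub rule_id: i64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ExecutionRequest {
    pub enforcement_id: i64,
    pub action_ref: String,
}

/// Returns an execution request for an enabled rule, or `None` when the rule
/// is disabled or does not belong to the enforcement (assumed behavior).
pub fn plan_execution(enforcement: &Enforcement, rule: &Rule) -> Option<ExecutionRequest> {
    if enforcement.rule_id != rule.id || !rule.enabled {
        return None;
    }
    Some(ExecutionRequest {
        // Keep the link back to the enforcement for the audit trail.
        enforcement_id: enforcement.id,
        action_ref: rule.action_ref.clone(),
    })
}
```

In the service itself, the resulting request would be persisted and then announced with an `execution.requested` message; both steps are omitted here.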
### 2. Execution Scheduler
**Purpose**: Routes execution requests to available workers.
**Responsibilities**:
- Listens for `execution.requested` messages
- Determines runtime requirements for the action
- Selects appropriate workers based on:
  - Runtime compatibility
  - Worker status (active only)
  - Load balancing (future: capacity, affinity, locality)
- Updates execution status to `Scheduled`
- Publishes `execution.scheduled` messages to worker queues
**Message Flow**:
```
Execution Requested → Scheduler → Worker Selection → Execution Scheduled → Worker
```
**Worker Selection Algorithm**:
1. Fetch all available workers
2. Filter by runtime compatibility (if action specifies runtime)
3. Filter by worker status (only active workers)
4. Apply load balancing strategy (currently: first available)
5. Future: Consider capacity, affinity, geographic locality
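Steps 1 to 4 can be sketched as a filter chain over an in-memory worker list. The `Worker` struct and its fields are assumptions for illustration only; the real worker record and the database fetch in step 1 are not shown in this document.

```rust
/// Hypothetical worker record; field names are illustrative, not the Attune schema.
#[derive(Debug, Clone, PartialEq)]
pub struct Worker {
    pub id: i64,
    pub runtimes: Vec<String>,
    pub active: bool,
}

/// Returns the first active worker that supports the requested runtime.
/// `required_runtime = None` means the action does not specify a runtime.
pub fn select_worker<'a>(
    workers: &'a [Worker],
    required_runtime: Option<&str>,
) -> Option<&'a Worker> {
    workers
        .iter()
        // Step 2: runtime compatibility (skipped when no runtime is specified)
        .filter(|w| match required_runtime {
            Some(rt) => w.runtimes.iter().any(|r| r.as_str() == rt),
            None => true,
        })
        // Steps 3 and 4: keep active workers only, then take the first one
        .find(|w| w.active)
}
```

A `None` result corresponds to the "no workers available" case, which the scheduler reports as an error.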
**Key Implementation Details**:
- Supports multiple worker types (local, remote, container)
- Handles worker unavailability with error responses
- Intelligent scheduling based on worker capabilities is planned but not yet implemented
### 3. Execution Manager
**Purpose**: Orchestrates execution workflows and handles lifecycle events.
**Responsibilities**:
- Listens for `execution.status.*` messages from workers
- **Does NOT update execution state** (worker owns state after scheduling)
- Handles execution completion orchestration (triggering child executions)
- Manages workflow executions (parent-child relationships)
- Coordinates workflow state transitions
**Ownership Model**:
- **Executor owns**: Requested → Scheduling → Scheduled (updates DB)
  - Includes pre-handoff cancellations/failures (before `execution.scheduled` is published)
- **Worker owns**: Running → Completed/Failed/Cancelled (updates DB)
  - Includes post-handoff cancellations/failures (after receiving `execution.scheduled`)
- **Handoff Point**: When `execution.scheduled` message is **published** to worker
  - Before publish: Executor owns and updates state
  - After publish: Worker owns and updates state
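Because ownership depends only on whether `execution.scheduled` has been published, the rule reduces to a single flag. The sketch below illustrates that idea; `Owner`, `state_owner`, and `executor_may_update` are hypothetical names, not part of the Attune codebase.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Owner {
    Executor,
    Worker,
}

/// The handoff happens when `execution.scheduled` is published, so one flag
/// is enough to decide who may write execution state.
pub fn state_owner(scheduled_published: bool) -> Owner {
    if scheduled_published {
        Owner::Worker
    } else {
        Owner::Executor
    }
}

/// Hypothetical guard for executor-side writes, such as a pre-handoff cancellation.
pub fn executor_may_update(scheduled_published: bool) -> Result<(), &'static str> {
    match state_owner(scheduled_published) {
        Owner::Executor => Ok(()),
        Owner::Worker => Err("execution already handed off; worker owns state"),
    }
}
```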
**Message Flow**:
```
Worker Status Update → Execution Manager → Orchestration Logic (Read-Only)
                                         → Trigger Child Executions
```
**Status Lifecycle**:
```
Requested → Scheduling → Scheduled → [HANDOFF: execution.scheduled published] → Running → Completed/Failed/Cancelled
    │                        │                                                     │
    └── Executor Updates ────┘                                                     └─ Worker Updates
    │   (includes pre-handoff                                                      │  (includes post-handoff
    │    Cancelled)                                                                │   Cancelled/Timeout/Abandoned)
                                                                                   └→ Child Executions (workflows)
```
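The lifecycle can also be expressed as a transition check. The sketch below is illustrative only: the `Status` enum and the exact set of allowed edges are assumptions read off the diagram, and the `Timeout` and `Abandoned` outcomes are left out for brevity.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Requested,
    Scheduling,
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
}

/// Returns true when `from -> to` is an edge in the (assumed) lifecycle graph.
pub fn is_valid_transition(from: Status, to: Status) -> bool {
    use Status::*;
    matches!(
        (from, to),
        // Executor-owned edges, including pre-handoff cancellation
        (Requested, Scheduling)
            | (Scheduling, Scheduled)
            | (Requested, Cancelled)
            | (Scheduling, Cancelled)
            // Worker-owned edges after the handoff
            | (Scheduled, Running)
            | (Running, Completed)
            | (Running, Failed)
            | (Running, Cancelled)
    )
}

/// Terminal states have no outgoing edges.
pub fn is_terminal(status: Status) -> bool {
    matches!(status, Status::Completed | Status::Failed | Status::Cancelled)
}
```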
**Key Implementation Details**:
- Receives status change notifications for orchestration purposes only
- Does not update execution state after handoff to worker
- Handles workflow orchestration (parent-child execution chaining)
- Only triggers child executions on successful parent completion
- Read-only access to execution records for orchestration logic
## Message Queue Integration
### Message Types
The Executor consumes and produces several message types:
**Consumed**:
- `enforcement.created` - New enforcement from triggered rules
- `execution.requested` - Execution scheduling requests
- `execution.status.changed` - Status change notifications from workers (for orchestration)
- `execution.completed` - Completion notifications from workers (for queue management)
**Published**:
- `execution.requested` - To scheduler (from enforcement processor)
- `execution.scheduled` - To workers (from scheduler) **← OWNERSHIP HANDOFF**
**Note**: The executor does NOT publish `execution.completed` messages. This is the worker's responsibility as the authoritative source of execution state after scheduling.
### Message Envelope Structure
All messages use the standardized `MessageEnvelope<T>` structure:
```rust
struct MessageEnvelope<T> {
    message_id: Uuid,
    message_type: MessageType,
    source: String,
    timestamp: DateTime<Utc>,
    correlation_id: Option<Uuid>, // links related messages
    trace_id: Option<String>,     // end-to-end request tracing
    payload: T,
    retry_count: u32,
}
```
### Consumer Handler Pattern
All processors use the `consume_with_handler` pattern for robust message consumption:
```rust
consumer.consume_with_handler(move |envelope: MessageEnvelope<PayloadType>| {
    // Clone shared state
    let pool = pool.clone();
    let publisher = publisher.clone();

    async move {
        // Process message
        Self::process_message(&pool, &publisher, &envelope).await
            .map_err(|e| format!("Error: {}", e).into())
    }
}).await?;
```
**Benefits**:
- Automatic message acknowledgment on success
- Automatic nack with requeue on retriable errors
- Automatic dead letter queue routing on non-retriable errors
- Built-in error handling and logging
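The mapping from handler result to broker action can be sketched as follows. This is not the `consume_with_handler` implementation, which this document does not show; `HandlerError`, `Disposition`, and `disposition` are hypothetical names used to illustrate the three outcomes listed above.

```rust
/// Hypothetical error type that records whether a failure is worth retrying.
#[derive(Debug, PartialEq, Eq)]
pub enum HandlerError {
    Retriable(String),
    NonRetriable(String),
}

/// What the consumer does with the delivery after the handler returns.
#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    Ack,
    NackRequeue,
    DeadLetter,
}

pub fn disposition(result: &Result<(), HandlerError>) -> Disposition {
    match result {
        Ok(()) => Disposition::Ack,
        Err(HandlerError::Retriable(_)) => Disposition::NackRequeue,
        Err(HandlerError::NonRetriable(_)) => Disposition::DeadLetter,
    }
}
```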
## Database Integration
### Repository Pattern
All database access uses the repository layer:
```rust
use attune_common::repositories::{
    enforcement::EnforcementRepository,
    execution::ExecutionRepository,
    rule::RuleRepository,
    Create, FindById, Update, List,
};
```
### Database Update Ownership
**Executor updates execution state** from creation through handoff:
- Creates execution records (`Requested` status)
- Updates status during scheduling (`Scheduling` → `Scheduled`)
- Publishes `execution.scheduled` message to worker **← HANDOFF POINT**
- **Handles cancellations/failures BEFORE handoff** (before message is published)
  - Example: User cancels execution while queued by concurrency policy
  - Executor updates to `Cancelled`, worker never receives message
**Worker updates execution state** after receiving handoff:
- Receives `execution.scheduled` message (takes ownership)
- Updates status when execution starts (`Running`)
- Updates status when execution completes (`Completed`, `Failed`, etc.)
- **Handles cancellations/failures AFTER handoff** (after receiving message)
- Updates result data and artifacts
- Worker only owns executions it has received
**Executor reads execution state** for orchestration after handoff:
- Receives status change notifications from workers
- Reads execution records to trigger workflow children
- Does NOT update execution state after publishing `execution.scheduled`
### Transaction Support
Future implementations will use database transactions for multi-step operations:
- Creating execution + publishing message (atomic)
- Enforcement processing + execution creation (atomic)
## Configuration
The Executor service uses the standard Attune configuration system:
```yaml
# config.yaml
database:
  url: postgresql://localhost/attune
  max_connections: 20

message_queue:
  url: amqp://localhost
  exchange: attune.executions
  prefetch_count: 10
Environment variable overrides:
```bash
ATTUNE__DATABASE__URL=postgresql://prod-db/attune
ATTUNE__MESSAGE_QUEUE__URL=amqp://prod-mq
```
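The two examples suggest a naming convention: an `ATTUNE__` prefix, with `__` separating nesting levels of the YAML file. The sketch below shows that mapping under this assumption; `env_key_to_path` is a hypothetical helper, and the actual configuration loader may behave differently.

```rust
/// Maps an override such as `ATTUNE__DATABASE__URL` to the config path
/// `["database", "url"]`. Returns `None` for variables without the prefix
/// or with an empty path segment.
pub fn env_key_to_path(key: &str) -> Option<Vec<String>> {
    let rest = key.strip_prefix("ATTUNE__")?;
    let parts: Vec<String> = rest.split("__").map(|p| p.to_lowercase()).collect();
    if parts.iter().any(|p| p.is_empty()) {
        None
    } else {
        Some(parts)
    }
}
```

Single underscores stay inside a segment, so `ATTUNE__MESSAGE_QUEUE__URL` resolves to `message_queue` then `url`.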
## Error Handling
### Error Types
The Executor handles several error categories:
1. **Database Errors**: Connection issues, query failures
2. **Message Queue Errors**: Connection drops, serialization failures
3. **Business Logic Errors**: Missing entities, invalid states
4. **Worker Errors**: No workers available, incompatible runtimes
### Retry Strategy
- **Retriable Errors**: Requeued for retry (connection issues, timeouts)
- **Non-Retriable Errors**: Sent to dead letter queue (invalid data, missing entities)
- **Retry Limits**: Configured per queue (future implementation)
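A possible shape for this strategy, once retry limits exist, is sketched below. Everything here is an assumption for illustration: `ErrorKind`, `RetryDecision`, `decide`, and the `max_retries` parameter are hypothetical, and `retry_count` is assumed to come from the message envelope.

```rust
/// Hypothetical error categories taken from the examples in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    ConnectionLost,
    Timeout,
    InvalidData,
    MissingEntity,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RetryDecision {
    Requeue,
    DeadLetter,
}

/// Requeues retriable errors until `max_retries` is reached; everything else
/// goes to the dead letter queue.
pub fn decide(kind: ErrorKind, retry_count: u32, max_retries: u32) -> RetryDecision {
    let retriable = matches!(kind, ErrorKind::ConnectionLost | ErrorKind::Timeout);
    if retriable && retry_count < max_retries {
        RetryDecision::Requeue
    } else {
        RetryDecision::DeadLetter
    }
}
```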
### Dead Letter Queues
Failed messages are automatically routed to dead letter queues for investigation:
- `executor.enforcement.created.dlq`
- `executor.execution.requested.dlq`
- `executor.execution.status.dlq`
## Workflow Orchestration
### Parent-Child Executions
The Executor supports complex workflows through parent-child execution relationships:
```
Parent Execution (Completed)
├── Child Execution 1 (action_ref: "pack.action1")
├── Child Execution 2 (action_ref: "pack.action2")
└── Child Execution 3 (action_ref: "pack.action3")
```
**Implementation**:
- Parent execution stores child action references
- On parent completion, Execution Manager creates child executions
- Child executions inherit parent's configuration
- Each child is independently scheduled and executed
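The fan-out on parent completion could look roughly like this. The sketch uses a hypothetical in-memory `Execution` model with a string `config` field and caller-supplied ids; the real records, id generation, and persistence are not shown in this document.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Status {
    Requested,
    Completed,
    Failed,
}

/// Hypothetical, simplified execution record.
#[derive(Debug, Clone, PartialEq)]
pub struct Execution {
    pub id: u64,
    pub parent: Option<u64>,
    pub action_ref: String,
    pub config: String,
    pub status: Status,
    pub child_action_refs: Vec<String>,
}

/// Creates one `Requested` child per stored action reference, but only when
/// the parent completed successfully. Ids are assigned from `next_id` upward.
pub fn spawn_children(parent: &Execution, next_id: u64) -> Vec<Execution> {
    if parent.status != Status::Completed {
        return Vec::new();
    }
    parent
        .child_action_refs
        .iter()
        .enumerate()
        .map(|(i, action_ref)| Execution {
            id: next_id + i as u64,
            parent: Some(parent.id),
            action_ref: action_ref.clone(),
            config: parent.config.clone(), // children inherit the parent's configuration
            status: Status::Requested,
            child_action_refs: Vec::new(),
        })
        .collect()
}
```

Each returned child starts in `Requested`, so it would pass through the scheduler like any other execution.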
### Future Enhancements
- **Conditional Workflows**: Execute children based on parent result
- **Parallel vs Sequential**: Control execution order
- **Workflow DAGs**: Complex dependency graphs
- **Workflow Templates**: Reusable workflow definitions
## Policy Enforcement
### Planned Features
1. **Rate Limiting**: Limit executions per time window
2. **Concurrency Control**: Maximum concurrent executions per action/pack
3. **Priority Queuing**: High-priority executions jump the queue
4. **Resource Quotas**: Limit resource consumption per tenant
5. **Execution Windows**: Only execute during specified time periods
### Implementation Location
Policy enforcement will be implemented in:
- Enforcement Processor (pre-execution validation)
- Scheduler (runtime constraint checking)
- New `PolicyEnforcer` module (future)
## Monitoring & Observability
### Metrics (Future)
- Executions per second (throughput)
- Average execution duration
- Queue depth and processing lag
- Worker utilization
- Error rates by type
### Logging
Structured logging at multiple levels:
- `INFO`: Successful operations, state transitions
- `WARN`: Degraded states, retry attempts
- `ERROR`: Failures requiring attention
- `DEBUG`: Detailed flow for troubleshooting
Example:
```
INFO Processing enforcement: 123
INFO Selected worker 45 for execution 789
INFO Execution 789 scheduled to worker 45
```
### Tracing
Message correlation and distributed tracing:
- `correlation_id`: Links related messages
- `trace_id`: End-to-end request tracing (future integration with OpenTelemetry)
## Running the Service
### Prerequisites
- PostgreSQL 14+ with schema initialized
- RabbitMQ 3.12+ with exchanges and queues configured
- Environment variables or config file set up
### Startup
```bash
# Using cargo
cd crates/executor
cargo run
# Or with environment overrides
ATTUNE__DATABASE__URL=postgresql://localhost/attune \
ATTUNE__MESSAGE_QUEUE__URL=amqp://localhost \
cargo run
```
### Graceful Shutdown
The service supports graceful shutdown via SIGTERM/SIGINT:
1. Stop accepting new messages
2. Finish processing in-flight messages
3. Close message queue connections
4. Close database connections
5. Exit cleanly
## Testing
### Unit Tests
Each module includes unit tests for business logic:
- Rule evaluation
- Worker selection algorithms
- Status parsing
- Workflow creation
### Integration Tests
Integration tests require PostgreSQL and RabbitMQ:
- End-to-end enforcement → execution flow
- Message queue reliability
- Database consistency
### Running Tests
```bash
# Unit tests only
cargo test -p attune-executor --lib
# Integration tests (requires services)
cargo test -p attune-executor --test '*'
```
## Future Enhancements
### Phase 1: Core Functionality (Current)
- ✅ Enforcement processing
- ✅ Execution scheduling
- ✅ Lifecycle management
- ✅ Message queue integration
### Phase 2: Advanced Features (Next)
- Policy enforcement (rate limiting, concurrency)
- Advanced workflow orchestration
- Inquiry handling (human-in-the-loop)
- Retry and failure handling improvements
### Phase 3: Production Readiness
- Comprehensive monitoring and metrics
- Performance optimization
- High availability setup
- Load testing and tuning
### Phase 4: Enterprise Features
- Multi-tenancy isolation
- Advanced scheduling algorithms
- Resource quotas and limits
- Audit logging and compliance
## Troubleshooting
### Common Issues
**Problem**: Executions stuck in "Requested" status
- **Cause**: Scheduler not running or no workers available
- **Solution**: Verify scheduler is running, check worker status
**Problem**: Messages not being consumed
- **Cause**: RabbitMQ connection issues or queue misconfiguration
- **Solution**: Check MQ connection, verify queue bindings
**Problem**: Database connection errors
- **Cause**: Connection pool exhausted or database down
- **Solution**: Increase pool size, check database health
### Debug Mode
Enable detailed logging:
```bash
RUST_LOG=attune_executor=debug,attune_common=debug cargo run
```
## Related Documentation
- [API - Executions](./api-executions.md)
- [API - Events & Enforcements](./api-events-enforcements.md)
- [API - Rules](./api-rules.md)
- [Configuration](./configuration.md)
- [Quick Start](./quick-start.md)