re-uploading work

This commit is contained in:
2026-02-04 17:46:30 -06:00
commit 3b14c65998
1388 changed files with 381262 additions and 0 deletions

@@ -0,0 +1,548 @@
# Worker Service Architecture
## Overview
The **Worker Service** executes automation actions in the Attune platform. It receives execution requests from the Executor service, runs each action in a suitable runtime environment (Python, Shell, Node.js, containers), and reports the results back. Only the Python and Shell runtimes are implemented so far (see Implementation Status); the container runtime is planned.
## Service Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                       Worker Service                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────────┐        ┌──────────────────────┐   │
│   │  Worker             │        │  Heartbeat           │   │
│   │  Registration       │        │  Manager             │   │
│   └─────────────────────┘        └──────────────────────┘   │
│              │                              │               │
│              v                              v               │
│       ┌─────────────────────────────────────────────┐       │
│       │               Action Executor               │       │
│       │   ┌─────────────────────────────────────┐   │       │
│       │   │          Runtime Registry           │   │       │
│       │   │  - Python Runtime                   │   │       │
│       │   │  - Shell Runtime                    │   │       │
│       │   │  - Local Runtime (Facade)           │   │       │
│       │   │  - Container Runtime (Future)       │   │       │
│       │   └─────────────────────────────────────┘   │       │
│       └─────────────────────────────────────────────┘       │
│              │                              │               │
│              v                              v               │
│   ┌─────────────────────┐        ┌──────────────────────┐   │
│   │  Artifact           │        │  Message Queue       │   │
│   │  Manager            │        │  Consumer/Publisher  │   │
│   └─────────────────────┘        └──────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
            │                  │                    │
            v                  v                    v
       PostgreSQL          RabbitMQ         Local Filesystem
```
## Core Components
### 1. Worker Registration
**Purpose**: Register worker in the database and maintain worker metadata.
**Responsibilities**:
- Register worker on startup with name, type, capabilities
- Update existing worker records to active status on restart
- Deregister worker on shutdown (mark as inactive)
- Update worker capabilities dynamically
**Key Implementation Details**:
- Worker name defaults to hostname if not specified
- Capabilities include supported runtimes (python, shell, node)
- Worker type can be Local, Remote, or Container
- Uses direct SQL queries for registration (no repository pattern needed)
**Database Table**: `attune.worker`
### 2. Heartbeat Manager
**Purpose**: Keep worker status fresh in the database with periodic heartbeat updates.
**Responsibilities**:
- Send periodic heartbeat updates (default: every 30 seconds)
- Update `last_heartbeat` timestamp in database
- Run in background task until stopped
- Handle transient database errors gracefully
**Key Implementation Details**:
- Runs as a tokio background task with interval ticker
- Configurable heartbeat interval via worker config
- Logs errors but doesn't fail the worker on heartbeat issues
- Clean shutdown on service stop
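
The heartbeat loop described above can be sketched as follows. This is a minimal, dependency-free illustration, not the service's implementation: the real manager runs as a tokio task, whereas this sketch swaps in a plain `std::thread` with a channel and `recv_timeout`. The names `Heartbeat`, `start_heartbeat`, and the `beat` callback (which stands in for the `last_heartbeat` database update) are hypothetical.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Handle to a running heartbeat loop (hypothetical type).
struct Heartbeat {
    stop_tx: Sender<()>,
    handle: JoinHandle<()>,
}

/// Call `beat` once per `interval` until stopped.
/// `beat` stands in for the database update of `last_heartbeat`.
fn start_heartbeat<F>(interval: Duration, mut beat: F) -> Heartbeat
where
    F: FnMut() -> Result<(), String> + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || loop {
        match stop_rx.recv_timeout(interval) {
            // Interval elapsed without a stop signal: send one heartbeat.
            Err(RecvTimeoutError::Timeout) => {
                // Log the error but keep the worker alive.
                if let Err(err) = beat() {
                    eprintln!("heartbeat failed (will retry): {err}");
                }
            }
            // Stop requested, or the handle was dropped: exit the loop.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    });
    Heartbeat { stop_tx, handle }
}

impl Heartbeat {
    /// Signal the loop to stop and wait for the thread to finish.
    fn stop(self) {
        let _ = self.stop_tx.send(());
        let _ = self.handle.join();
    }
}
```

Waiting on the stop channel with a timeout doubles as the interval ticker, so a stop request takes effect immediately rather than after the next tick.
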
### 3. Runtime System
**Purpose**: Abstraction layer for executing actions in different environments.
**Components**:
#### Runtime Trait
```rust
pub trait Runtime: Send + Sync {
    fn name(&self) -> &str;
    fn can_execute(&self, context: &ExecutionContext) -> bool;
    async fn execute(&self, context: ExecutionContext) -> RuntimeResult<ExecutionResult>;
    async fn setup(&self) -> RuntimeResult<()>;
    async fn cleanup(&self) -> RuntimeResult<()>;
    async fn validate(&self) -> RuntimeResult<()>;
}
```
#### Python Runtime
- Executes Python scripts via subprocess
- Generates wrapper script to inject parameters
- Supports timeout, stdout/stderr capture
- Parses JSON results from stdout
- Default entry point: `run()` function
**Example Action**:
```python
def run(x, y):
    return x + y
```
#### Shell Runtime
- Executes bash/shell scripts via subprocess
- Injects parameters as environment variables (PARAM_*)
- Supports timeout, output capture
- Executes with `set -e` for error propagation
**Example Action**:
```bash
echo "Hello, $PARAM_NAME!"
```
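
On the worker side, the `PARAM_*` injection might look like the sketch below. It assumes parameter values have already been rendered to strings (the real context holds JSON values), and the helper name `shell_param_env` is hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical helper: turn action parameters into `PARAM_*`
/// environment variables for a shell action.
/// Assumption: values are already strings; the real context stores JSON.
fn shell_param_env(params: &HashMap<String, String>) -> HashMap<String, String> {
    params
        .iter()
        .map(|(key, value)| (format!("PARAM_{}", key.to_uppercase()), value.clone()))
        .collect()
}
```

With a parameter `name = "World"`, the resulting map contains `PARAM_NAME`, which the example action above reads as `$PARAM_NAME`.
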
#### Local Runtime
- Facade that delegates to Python or Shell runtime
- Selects runtime based on action metadata
- Currently supports Python and Shell
- Extensible for additional local runtimes
#### Runtime Registry
- Manages collection of registered runtimes
- Selects appropriate runtime for each action
- Handles runtime setup/cleanup lifecycle
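
Runtime selection can be pictured as a "first runtime that accepts the context wins" lookup. The sketch below is an assumption about how the registry might choose, not a description of the actual code: it uses a synchronous subset of the `Runtime` trait, and `SelectionContext`, its `runtime_hint` field, and `NamedRuntime` are hypothetical stand-ins.

```rust
/// Simplified stand-in for `ExecutionContext` (hypothetical).
struct SelectionContext {
    runtime_hint: String, // assumed metadata, e.g. "python" or "shell"
}

/// Synchronous subset of the `Runtime` trait, for illustration only.
trait Runtime {
    fn name(&self) -> &str;
    fn can_execute(&self, context: &SelectionContext) -> bool;
}

/// Toy runtime that accepts contexts whose hint matches its name.
struct NamedRuntime(&'static str);

impl Runtime for NamedRuntime {
    fn name(&self) -> &str {
        self.0
    }

    fn can_execute(&self, context: &SelectionContext) -> bool {
        context.runtime_hint == self.0
    }
}

struct RuntimeRegistry {
    runtimes: Vec<Box<dyn Runtime>>,
}

impl RuntimeRegistry {
    /// Return the first registered runtime that accepts the context.
    fn select(&self, context: &SelectionContext) -> Option<&dyn Runtime> {
        self.runtimes
            .iter()
            .find(|runtime| runtime.can_execute(context))
            .map(|runtime| runtime.as_ref())
    }
}
```

Because the lookup is order-dependent, registration order acts as the priority when more than one runtime accepts the same context.
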
### 4. Action Executor
**Purpose**: Orchestrate the complete execution flow for an action.
**Execution Flow**:
```
1. Load execution record from database
2. Update status to Running
3. Load action definition by reference
4. Prepare execution context (parameters, env vars, timeout)
5. Select and execute in appropriate runtime
6. Capture results (stdout, stderr, return value)
7. Store artifacts (logs, results)
8. Update execution status (Succeeded/Failed)
9. Publish status update messages
```
**Responsibilities**:
- Coordinate execution lifecycle
- Load action and execution data from database
- Prepare execution context with parameters and environment
- Execute action via runtime registry
- Handle success and failure cases
- Store execution artifacts
**Key Implementation Details**:
- Parameters merged: action defaults + execution overrides
- Environment variables include execution metadata
- Default timeout: 5 minutes (300 seconds)
- Errors captured and stored as execution result
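
The parameter merge and the timeout default can be sketched as below. The function names are hypothetical, and string values are used for brevity where the real context stores JSON values.

```rust
use std::collections::HashMap;

/// Default timeout stated above: 5 minutes.
const DEFAULT_TIMEOUT_SECS: u64 = 300;

/// Merge action defaults with execution overrides.
/// Overrides win when both maps define the same key.
fn merge_parameters(
    defaults: &HashMap<String, String>,
    overrides: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = defaults.clone();
    merged.extend(overrides.iter().map(|(k, v)| (k.clone(), v.clone())));
    merged
}

/// Fall back to the default when no timeout is set for the execution.
fn effective_timeout(requested: Option<u64>) -> u64 {
    requested.unwrap_or(DEFAULT_TIMEOUT_SECS)
}
```
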
### 5. Artifact Manager
**Purpose**: Store and manage execution artifacts (logs, results, files).
**Artifact Types**:
- **Log**: stdout/stderr from execution
- **Result**: JSON result data from action
- **File**: Custom file outputs from actions
- **Trace**: Debug/trace information (future)
**Storage Structure**:
```
/tmp/attune/artifacts/{worker_name}/
└── execution_{id}/
    ├── stdout.log
    ├── stderr.log
    └── result.json
```
**Responsibilities**:
- Store logs (stdout/stderr) for each execution
- Store JSON result data
- Support custom file artifacts
- Clean up old artifacts (retention policy)
- Delete artifacts for specific executions
**Key Implementation Details**:
- Creates execution-specific directories
- Reports all IO errors as `Internal` errors
- Configurable base directory per worker
- Retention policy based on file modification time
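
A minimal sketch of the directory layout and the retention check, assuming the structure shown above. The function names and the `max_age` parameter are hypothetical; the actual retention settings are not specified here.

```rust
use std::path::{Path, PathBuf};
use std::time::{Duration, SystemTime};

/// Build `{base_dir}/{worker_name}/execution_{id}`.
fn execution_dir(base_dir: &Path, worker_name: &str, execution_id: i64) -> PathBuf {
    base_dir
        .join(worker_name)
        .join(format!("execution_{}", execution_id))
}

/// Retention check based on a file's modification time.
/// A file modified "in the future" (clock skew) is never treated as expired.
fn is_expired(modified: SystemTime, now: SystemTime, max_age: Duration) -> bool {
    now.duration_since(modified)
        .map(|age| age > max_age)
        .unwrap_or(false)
}
```

A cleanup pass would walk the worker's artifact directory, read each entry's modification time from its metadata, and delete the entries for which `is_expired` returns `true`.
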
### 6. Secret Management
**Purpose**: Securely manage and inject secrets into action execution environments.
**Responsibilities**:
- Fetch secrets from database based on ownership hierarchy
- Decrypt encrypted secrets using AES-256-GCM
- Inject secrets as environment variables
- Clean up secrets after execution
**Secret Ownership Hierarchy**:
1. **System-level secrets** - Available to all actions
2. **Pack-level secrets** - Available to all actions in a pack
3. **Action-level secrets** - Available to specific action only
More specific secrets override less specific ones with the same name.
**Environment Variable Injection**:
- Secret names transformed: `api_key` → `SECRET_API_KEY`
- Prefix: `SECRET_`
- Uppercase with hyphens replaced by underscores
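
The naming rule and the ownership hierarchy can be combined in one small sketch. It assumes secrets have already been fetched and decrypted into plain maps; `secret_env_name` and `resolve_secrets` are hypothetical names.

```rust
use std::collections::HashMap;

/// `api-key` or `api_key` becomes `SECRET_API_KEY`.
fn secret_env_name(name: &str) -> String {
    format!("SECRET_{}", name.to_uppercase().replace('-', "_"))
}

/// Apply layers from least to most specific, so that action-level
/// secrets override pack-level ones, which override system-level ones.
/// Assumption: the maps hold already-decrypted values.
fn resolve_secrets(
    system: &HashMap<String, String>,
    pack: &HashMap<String, String>,
    action: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut resolved = HashMap::new();
    for layer in [system, pack, action] {
        for (name, value) in layer {
            resolved.insert(secret_env_name(name), value.clone());
        }
    }
    resolved
}
```

In this sketch the override is keyed on the transformed name, so `api-key` at one level and `api_key` at another are treated as the same secret.
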
**Encryption**:
- Algorithm: AES-256-GCM (authenticated encryption)
- Key derivation: SHA-256 hash of configured password
- Format: `nonce:ciphertext` (Base64-encoded)
- Random nonce per encryption operation
**Key Implementation Details**:
- Encryption key loaded from `security.encryption_key` config
- Key hash validation ensures correct decryption key
- Graceful handling of missing secrets (warning, not failure)
- Secrets never logged or exposed in artifacts
- Automatic injection during execution context preparation
**Configuration**:
```yaml
security:
  encryption_key: "your-secret-encryption-password"
```
**Database Table**: `attune.key`
See `docs/secrets-management.md` for comprehensive documentation.
### 7. Worker Service
**Purpose**: Main service orchestration and message queue integration.
**Responsibilities**:
- Initialize all service components
- Register worker in database
- Start heartbeat manager
- Consume execution messages from worker-specific queue
- Publish execution status updates
- Handle graceful shutdown
**Message Flow**:
```
Executor (Scheduler)
  → Publishes: execution.scheduled
  → Queue: worker.{worker_id}.executions
  → Worker consumes message
  → Executes action
  → Publishes: execution.status.running
  → Publishes: execution.status.succeeded/failed
```
**Message Types**:
**Consumed**:
- `execution.scheduled` - New execution assigned to this worker
**Published**:
- `execution.status.running` - Execution started
- `execution.status.succeeded` - Execution completed successfully
- `execution.status.failed` - Execution failed
**Key Implementation Details**:
- Worker-specific queues enable direct routing from scheduler
- Database and MQ connections initialized on startup
- Graceful shutdown deregisters worker
- Message handlers run async and report errors
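
The queue name and message types listed above could be derived as in the sketch below. The `i64` worker ID type, the `StatusUpdate` enum, and both function names are assumptions made for illustration.

```rust
/// Status updates the worker publishes (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum StatusUpdate {
    Running,
    Succeeded,
    Failed,
}

/// Worker-specific queue: `worker.{worker_id}.executions`.
/// Assumption: worker IDs are integers.
fn worker_queue_name(worker_id: i64) -> String {
    format!("worker.{}.executions", worker_id)
}

/// Message type published for each status update.
fn status_message_type(status: StatusUpdate) -> &'static str {
    match status {
        StatusUpdate::Running => "execution.status.running",
        StatusUpdate::Succeeded => "execution.status.succeeded",
        StatusUpdate::Failed => "execution.status.failed",
    }
}
```
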
## Configuration
The Worker Service uses the standard Attune configuration system:
```yaml
# config.yaml
database:
  url: postgresql://localhost/attune
  max_connections: 20

message_queue:
  url: amqp://localhost
  exchange: attune.executions

worker:
  name: worker-01              # Optional, defaults to hostname
  worker_type: Local           # Local, Remote, Container
  runtime_id: null             # Optional runtime association
  host: null                   # Optional, defaults to hostname
  port: null                   # Optional
  max_concurrent_tasks: 10     # Max parallel executions
  heartbeat_interval: 30       # Seconds between heartbeats
  task_timeout: 300            # Default task timeout (seconds)

security:
  encryption_key: "your-encryption-key"  # Required for encrypted secrets
```
Environment variable overrides:
```bash
ATTUNE__WORKER__NAME=my-worker
ATTUNE__WORKER__MAX_CONCURRENT_TASKS=20
ATTUNE__WORKER__HEARTBEAT_INTERVAL=60
```
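
Judging from the variable names above, the override convention appears to be an `ATTUNE__` prefix with `__` separating nesting levels. The sketch below illustrates that mapping only; it is not the configuration loader itself, and `env_to_config_path` is a hypothetical helper.

```rust
/// Map `ATTUNE__WORKER__NAME` to the config path `worker.name`.
/// Returns `None` for variables without the `ATTUNE__` prefix.
/// Assumption: `__` separates nesting levels and keys are lowercased.
fn env_to_config_path(var: &str) -> Option<String> {
    let rest = var.strip_prefix("ATTUNE__")?;
    if rest.is_empty() {
        return None;
    }
    let path = rest
        .split("__")
        .map(|segment| segment.to_lowercase())
        .collect::<Vec<_>>()
        .join(".");
    Some(path)
}
```

Single underscores stay inside a key, so `ATTUNE__WORKER__MAX_CONCURRENT_TASKS` maps to `worker.max_concurrent_tasks`.
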
## Running the Service
### Prerequisites
- PostgreSQL 14+ with Attune schema initialized
- RabbitMQ 3.12+ with exchanges and queues configured
- Python 3.x and/or bash (for local runtimes)
- Environment variables or config file set up
### Startup
```bash
# Using cargo
cd crates/worker
cargo run
# With custom config
cargo run -- --config /path/to/config.yaml
# With custom worker name
cargo run -- --name worker-prod-01
# Or with environment overrides
ATTUNE__WORKER__NAME=worker-01 \
ATTUNE__WORKER__MAX_CONCURRENT_TASKS=20 \
cargo run
```
### Graceful Shutdown
The service supports graceful shutdown via SIGTERM/SIGINT (Ctrl+C):
1. Stop accepting new execution messages
2. Finish processing in-flight executions (future enhancement)
3. Stop heartbeat manager
4. Deregister worker (mark as inactive)
5. Close message queue connections
6. Close database connections
7. Exit cleanly
## Execution Context
The executor prepares a comprehensive execution context for each action:
```rust
pub struct ExecutionContext {
    pub execution_id: i64,
    pub action_ref: String,                      // "pack.action"
    pub parameters: HashMap<String, JsonValue>,
    pub env: HashMap<String, String>,            // Environment variables
    pub timeout: Option<u64>,                    // Timeout in seconds
    pub working_dir: Option<PathBuf>,            // Working directory
    pub entry_point: String,                     // Function/script entry point
    pub code: Option<String>,                    // Action code (inline)
    pub code_path: Option<PathBuf>,              // Action code (file path)
}
```
### Environment Variables
The executor injects these environment variables:
- `ATTUNE_EXECUTION_ID` - Execution ID
- `ATTUNE_ACTION` - Action reference (pack.action)
- `ATTUNE_RUNNER` - Runner type (if specified)
- `ATTUNE_CONTEXT_*` - Context data as environment variables
For shell actions, parameters are also injected as:
- `PARAM_{KEY}` - Each parameter as uppercase env var
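
A sketch of how the execution metadata variables might be assembled. The function name `execution_env` is hypothetical, and the `ATTUNE_CONTEXT_*` variables are omitted because their key format is not specified here.

```rust
use std::collections::HashMap;

/// Build the metadata environment for one execution.
/// `ATTUNE_RUNNER` is only set when a runner type is specified.
fn execution_env(
    execution_id: i64,
    action_ref: &str,
    runner: Option<&str>,
) -> HashMap<String, String> {
    let mut env = HashMap::new();
    env.insert("ATTUNE_EXECUTION_ID".to_string(), execution_id.to_string());
    env.insert("ATTUNE_ACTION".to_string(), action_ref.to_string());
    if let Some(runner) = runner {
        env.insert("ATTUNE_RUNNER".to_string(), runner.to_string());
    }
    env
}
```
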
## Execution Result
Actions return a standardized result:
```rust
pub struct ExecutionResult {
    pub exit_code: i32,             // 0 = success
    pub stdout: String,             // Standard output
    pub stderr: String,             // Standard error
    pub result: Option<JsonValue>,  // Parsed result data
    pub duration_ms: u64,           // Execution duration
    pub error: Option<String>,      // Error message if failed
}
```
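
One way the executor could turn such a result into a final status is sketched below. The rule used here (success requires exit code 0 and no error message) is an assumption, and `ResultSummary`, `Outcome`, and `outcome` are hypothetical names; the struct is trimmed to the two fields the decision needs.

```rust
/// Final execution status (hypothetical enum).
#[derive(Debug, PartialEq)]
enum Outcome {
    Succeeded,
    Failed,
}

/// Trimmed-down view of `ExecutionResult` for this sketch.
struct ResultSummary {
    exit_code: i32,
    error: Option<String>,
}

/// Assumed rule: succeed only on exit code 0 with no error message.
fn outcome(result: &ResultSummary) -> Outcome {
    if result.exit_code == 0 && result.error.is_none() {
        Outcome::Succeeded
    } else {
        Outcome::Failed
    }
}
```
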
## Error Handling
### Error Categories
1. **Setup Errors**: Runtime initialization failures
2. **Execution Errors**: Action execution failures
3. **Timeout Errors**: Execution exceeded timeout
4. **IO Errors**: File/network operations
5. **Database Errors**: Connection, query failures
### Error Propagation
- Runtime errors captured in `ExecutionResult.error`
- Execution status updated to Failed in database
- Error published in status update message
- Artifacts still stored for failed executions
- Logs preserved for debugging
## Testing
### Unit Tests
Each runtime includes unit tests:
- Simple execution
- Parameter passing
- Timeout handling
- Error handling
### Integration Tests
Integration tests require PostgreSQL and RabbitMQ:
- Worker registration and heartbeat
- End-to-end action execution
- Message queue integration
- Artifact storage
### Running Tests
```bash
# Unit tests only
cargo test -p attune-worker --lib
# Integration tests (requires services)
cargo test -p attune-worker --test '*'
# Specific runtime tests
cargo test -p attune-worker python_runtime
cargo test -p attune-worker shell_runtime
```
## Implementation Status
### Phase 5.1: Worker Foundation ✅ COMPLETE
- [x] Worker registration module
- [x] Heartbeat manager
- [x] Service initialization
- [x] Configuration loading
### Phase 5.2: Runtime System ✅ COMPLETE
- [x] Runtime trait abstraction
- [x] Python runtime implementation
- [x] Shell runtime implementation
- [x] Local runtime facade
- [x] Runtime registry
### Phase 5.3: Execution Logic ⏳ IN PROGRESS
- [x] Action executor module
- [x] Execution context preparation
- [ ] Fix data model mismatches
- [ ] Complete message queue integration
- [ ] Test end-to-end flow
### Phase 5.4: Artifact Management ✅ COMPLETE
- [x] Artifact manager module
- [x] Log storage (stdout/stderr)
- [x] Result storage (JSON)
- [x] File artifact storage
- [x] Cleanup/retention policies
### Phase 5.5: Testing 📋 TODO
- [x] Runtime unit tests (basic)
- [ ] Integration tests with database
- [ ] End-to-end execution tests
- [ ] Error handling tests
### Phase 5.6: Advanced Features 📋 TODO
- [ ] Container runtime (Docker)
- [ ] Remote worker support
- [ ] Concurrent execution limits
- [ ] Worker capacity management
- [ ] Execution queuing
## Known Issues
### Data Model Mismatches
The current implementation has several mismatches with the actual database schema:
1. **Execution.action**: Expected String, actual is `Option<i64>`
2. **Execution fields**: Missing `parameters`, `context`, `runner` fields
3. **Action fields**: `entry_point` → `entrypoint`, missing `timeout`
4. **Repository pattern**: Repositories don't have `::new()` constructors
5. **Error types**: `Error::BadRequest` and `Error::NotFound` have different signatures
### Required Fixes
1. Update executor to use `action_ref` field instead of `action`
2. Fix action loading to query by ID from execution
3. Update execution context preparation for actual schema
4. Fix repository usage patterns
5. Update error construction calls
6. Implement `From<MqError>` for `Error`
## Future Enhancements
### Phase 1: Core Improvements
- Concurrent execution management (max_concurrent_tasks)
- Worker capacity tracking and reporting
- Execution queuing when at capacity
- Retry logic for transient failures
### Phase 2: Advanced Runtimes
- Container runtime with Docker
- Container image management and caching
- Volume mounting for code injection
- Network isolation for security
### Phase 3: Remote Workers
- Remote worker registration
- Worker-to-worker communication
- Geographic distribution
- Load balancing strategies
### Phase 4: Monitoring & Observability
- Execution metrics (duration, success rate)
- Worker health metrics
- Runtime-specific metrics
- OpenTelemetry integration
### Phase 5: Security
- Execution sandboxing
- Resource limits (CPU, memory)
- Secret injection from key store
- Encrypted artifact storage
## Related Documentation
- [Executor Service](./executor-service.md)
- [API - Executions](./api-executions.md)
- [API - Actions](./api-actions.md)
- [Configuration](./configuration.md)
- [Quick Start](./quick-start.md)