Files

David Culbreth e31ecb781b more internal polish, resilient workers

2026-02-09 18:32:34 -06:00

19 KiB

Raw Blame History

Worker Service Architecture

Overview

The Worker Service is responsible for executing automation actions in the Attune platform. It receives execution requests from the Executor service, runs actions in appropriate runtime environments (Python, Shell, Node.js, containers), and reports results back.

Service Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Worker Service                           │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────┐  ┌──────────────────────┐          │
│  │ Worker              │  │ Heartbeat            │          │
│  │ Registration        │  │ Manager              │          │
│  └─────────────────────┘  └──────────────────────┘          │
│           │                         │                         │
│           v                         v                         │
│  ┌─────────────────────────────────────────────┐             │
│  │         Action Executor                     │             │
│  │  ┌─────────────────────────────────────┐   │             │
│  │  │      Runtime Registry               │   │             │
│  │  │  - Python Runtime                   │   │             │
│  │  │  - Shell Runtime                    │   │             │
│  │  │  - Local Runtime (Facade)           │   │             │
│  │  │  - Container Runtime (Future)       │   │             │
│  │  └─────────────────────────────────────┘   │             │
│  └─────────────────────────────────────────────┘             │
│           │                         │                         │
│           v                         v                         │
│  ┌─────────────────────┐  ┌──────────────────────┐          │
│  │ Artifact            │  │ Message Queue        │          │
│  │ Manager             │  │ Consumer/Publisher   │          │
│  └─────────────────────┘  └──────────────────────┘          │
│                                                               │
└─────────────────────────────────────────────────────────────┘
         │                    │                    │
         v                    v                    v
   PostgreSQL            RabbitMQ           Local Filesystem

Core Components

1. Worker Registration

Purpose: Register worker in the database and maintain worker metadata.

Responsibilities:

Register worker on startup with name, type, capabilities
Update existing worker records to active status on restart
Deregister worker on shutdown (mark as inactive)
Update worker capabilities dynamically

Key Implementation Details:

Worker name defaults to hostname if not specified
Capabilities include supported runtimes (python, shell, node)
Worker type can be Local, Remote, or Container
Uses direct SQL queries for registration (no repository pattern needed)

Database Table: attune.worker

2. Heartbeat Manager

Purpose: Keep worker status fresh in the database with periodic heartbeat updates.

Responsibilities:

Send periodic heartbeat updates (default: every 30 seconds)
Update last_heartbeat timestamp in database
Run in background task until stopped
Handle transient database errors gracefully

Key Implementation Details:

Runs as a tokio background task with interval ticker
Configurable heartbeat interval via worker config
Logs errors but doesn't fail the worker on heartbeat issues
Clean shutdown on service stop

3. Runtime System

Purpose: Abstraction layer for executing actions in different environments.

Components:

Runtime Trait

pub trait Runtime: Send + Sync {
    fn name(&self) -> &str;
    fn can_execute(&self, context: &ExecutionContext) -> bool;
    async fn execute(&self, context: ExecutionContext) -> RuntimeResult<ExecutionResult>;
    async fn setup(&self) -> RuntimeResult<()>;
    async fn cleanup(&self) -> RuntimeResult<()>;
    async fn validate(&self) -> RuntimeResult<()>;
}

Python Runtime

Executes Python scripts via subprocess
Generates wrapper script to inject parameters
Supports timeout, stdout/stderr capture
Parses JSON results from stdout
Default entry point: run() function

Example Action:

def run(x, y):
    return x + y

Shell Runtime

Executes bash/shell scripts via subprocess
Injects parameters as environment variables (PARAM_*)
Supports timeout, output capture
Executes with set -e for error propagation

Example Action:

echo "Hello, $PARAM_NAME!"

Local Runtime

Facade that delegates to Python or Shell runtime
Selects runtime based on action metadata
Currently supports Python and Shell
Extensible for additional local runtimes

Runtime Registry

Manages collection of registered runtimes
Selects appropriate runtime for each action
Handles runtime setup/cleanup lifecycle

4. Action Executor

Purpose: Orchestrate the complete execution flow for an action and own execution state after handoff.

Execution Flow:

1. Receive execution.scheduled message from executor
2. Load execution record from database
3. Update status to Running (owns state after handoff)
4. Load action definition by reference
5. Prepare execution context (parameters, env vars, timeout)
6. Select and execute in appropriate runtime
7. Capture results (stdout, stderr, return value)
8. Store artifacts (logs, results)
9. Update execution status (Completed/Failed) in database
10. Publish status change notifications
11. Publish completion notification for queue management

Ownership Model:

Worker owns execution state after receiving execution.scheduled
Authoritative source for all status updates: Running, Completed, Failed, Cancelled, etc.
Updates database directly for all state changes
Publishes notifications for orchestration and monitoring

Responsibilities:

Coordinate execution lifecycle
Load action and execution data from database
Update execution state in database (after handoff from executor)
Prepare execution context with parameters and environment
Execute action via runtime registry
Handle success and failure cases
Store execution artifacts
Publish status change notifications

Key Implementation Details:

Parameters merged: action defaults + execution overrides
Environment variables include execution metadata
Default timeout: 5 minutes (300 seconds)
Errors captured and stored as execution result

5. Artifact Manager

Purpose: Store and manage execution artifacts (logs, results, files).

Artifact Types:

Log: stdout/stderr from execution
Result: JSON result data from action
File: Custom file outputs from actions
Trace: Debug/trace information (future)

Storage Structure:

/tmp/attune/artifacts/{worker_name}/
  └── execution_{id}/
      ├── stdout.log
      ├── stderr.log
      └── result.json

Responsibilities:

Store logs (stdout/stderr) for each execution
Store JSON result data
Support custom file artifacts
Clean up old artifacts (retention policy)
Delete artifacts for specific executions

Key Implementation Details:

Creates execution-specific directories
Stores all IO errors as Internal errors
Configurable base directory per worker
Retention policy based on file modification time

6. Secret Management

Purpose: Securely manage and inject secrets into action execution environments.

Responsibilities:

Fetch secrets from database based on ownership hierarchy
Decrypt encrypted secrets using AES-256-GCM
Inject secrets as environment variables
Clean up secrets after execution

Secret Ownership Hierarchy:

System-level secrets - Available to all actions
Pack-level secrets - Available to all actions in a pack
Action-level secrets - Available to specific action only

More specific secrets override less specific ones with the same name.

Environment Variable Injection:

Secret names transformed: api_key → SECRET_API_KEY
Prefix: SECRET_
Uppercase with hyphens replaced by underscores

Encryption:

Algorithm: AES-256-GCM (authenticated encryption)
Key derivation: SHA-256 hash of configured password
Format: nonce:ciphertext (Base64-encoded)
Random nonce per encryption operation

Key Implementation Details:

Encryption key loaded from security.encryption_key config
Key hash validation ensures correct decryption key
Graceful handling of missing secrets (warning, not failure)
Secrets never logged or exposed in artifacts
Automatic injection during execution context preparation

Configuration:

security:
  encryption_key: "your-secret-encryption-password"

Database Table: attune.key

See docs/secrets-management.md for comprehensive documentation.

7. Worker Service

Purpose: Main service orchestration and message queue integration.

Responsibilities:

Initialize all service components
Register worker in database
Start heartbeat manager
Consume execution messages from worker-specific queue
Own execution state after receiving scheduled executions
Update execution status in database (Running, Completed, Failed, etc.)
Publish execution status change notifications
Publish execution completion notifications
Handle graceful shutdown

Message Flow:

Executor (Scheduler) 
  → Publishes: execution.scheduled 
    → Queue: worker.{worker_id}.executions
      → Worker consumes message
        → Executes action
          → Publishes: execution.status.running
            → Publishes: execution.status.succeeded/failed

Message Types:

Consumed:

execution.scheduled - New execution assigned to this worker

Published:

execution.status.running - Execution started
execution.status.succeeded - Execution completed successfully
execution.status.failed - Execution failed

Key Implementation Details:

Worker-specific queues enable direct routing from scheduler
Database and MQ connections initialized on startup
Graceful shutdown deregisters worker
Message handlers run async and report errors

Configuration

Worker service uses the standard Attune configuration system:

# config.yaml
database:
  url: postgresql://localhost/attune
  max_connections: 20

message_queue:
  url: amqp://localhost
  exchange: attune.executions

worker:
  name: worker-01                    # Optional, defaults to hostname
  worker_type: Local                 # Local, Remote, Container
  runtime_id: null                   # Optional runtime association
  host: null                         # Optional, defaults to hostname
  port: null                         # Optional
  max_concurrent_tasks: 10           # Max parallel executions
  heartbeat_interval: 30             # Seconds between heartbeats
  task_timeout: 300                  # Default task timeout (seconds)

security:
  encryption_key: "your-encryption-key"  # Required for encrypted secrets

Environment variable overrides:

ATTUNE__WORKER__NAME=my-worker
ATTUNE__WORKER__MAX_CONCURRENT_TASKS=20
ATTUNE__WORKER__HEARTBEAT_INTERVAL=60

Running the Service

Prerequisites

PostgreSQL 14+ with Attune schema initialized
RabbitMQ 3.12+ with exchanges and queues configured
Python 3.x and/or bash (for local runtimes)
Environment variables or config file set up

Startup

# Using cargo
cd crates/worker
cargo run

# With custom config
cargo run -- --config /path/to/config.yaml

# With custom worker name
cargo run -- --name worker-prod-01

# Or with environment overrides
ATTUNE__WORKER__NAME=worker-01 \
ATTUNE__WORKER__MAX_CONCURRENT_TASKS=20 \
cargo run

Graceful Shutdown

The service supports graceful shutdown via SIGTERM/SIGINT (Ctrl+C):

Stop accepting new execution messages
Finish processing in-flight executions (future enhancement)
Stop heartbeat manager
Deregister worker (mark as inactive)
Close message queue connections
Close database connections
Exit cleanly

Execution Context

The executor prepares a comprehensive execution context for each action:

pub struct ExecutionContext {
    pub execution_id: i64,
    pub action_ref: String,              // "pack.action"
    pub parameters: HashMap<String, JsonValue>,
    pub env: HashMap<String, String>,    // Environment variables
    pub timeout: Option<u64>,            // Timeout in seconds
    pub working_dir: Option<PathBuf>,    // Working directory
    pub entry_point: String,             // Function/script entry point
    pub code: Option<String>,            // Action code (inline)
    pub code_path: Option<PathBuf>,      // Action code (file path)
}

Environment Variables

The executor injects these environment variables:

ATTUNE_EXECUTION_ID - Execution ID
ATTUNE_ACTION - Action reference (pack.action)
ATTUNE_RUNNER - Runner type (if specified)
ATTUNE_CONTEXT_* - Context data as environment variables

For shell actions, parameters are also injected as:

PARAM_{KEY} - Each parameter as uppercase env var

Execution Result

Actions return a standardized result:

pub struct ExecutionResult {
    pub exit_code: i32,              // 0 = success
    pub stdout: String,              // Standard output
    pub stderr: String,              // Standard error
    pub result: Option<JsonValue>,   // Parsed result data
    pub duration_ms: u64,            // Execution duration
    pub error: Option<String>,       // Error message if failed
}

Error Handling

Error Categories

Setup Errors: Runtime initialization failures
Execution Errors: Action execution failures
Timeout Errors: Execution exceeded timeout
IO Errors: File/network operations
Database Errors: Connection, query failures

Error Propagation

Runtime errors captured in ExecutionResult.error
Worker updates execution status to Failed in database (owns state)
Error published in status change notification message
Error published in completion notification message
Artifacts still stored for failed executions
Logs preserved for debugging

Testing

Unit Tests

Each runtime includes unit tests:

Simple execution
Parameter passing
Timeout handling
Error handling

Integration Tests

Integration tests require PostgreSQL and RabbitMQ:

Worker registration and heartbeat
End-to-end action execution
Message queue integration
Artifact storage

Running Tests

# Unit tests only
cargo test -p attune-worker --lib

# Integration tests (requires services)
cargo test -p attune-worker --test '*'

# Specific runtime tests
cargo test -p attune-worker python_runtime
cargo test -p attune-worker shell_runtime

Implementation Status

Phase 5.1: Worker Foundation ✅ COMPLETE

Worker registration module
Heartbeat manager
Service initialization
Configuration loading

Phase 5.2: Runtime System ✅ COMPLETE

Runtime trait abstraction
Python runtime implementation
Shell runtime implementation
Local runtime facade
Runtime registry

Phase 5.3: Execution Logic ⏳ IN PROGRESS

Action executor module
Execution context preparation
Fix data model mismatches
Complete message queue integration
Test end-to-end flow

Phase 5.4: Artifact Management ✅ COMPLETE

Artifact manager module
Log storage (stdout/stderr)
Result storage (JSON)
File artifact storage
Cleanup/retention policies

Phase 5.5: Testing 📋 TODO

Runtime unit tests (basic)
Integration tests with database
End-to-end execution tests
Error handling tests

Phase 5.6: Advanced Features 📋 TODO

Container runtime (Docker)
Remote worker support
Concurrent execution limits
Worker capacity management
Execution queuing

Known Issues

Data Model Mismatches

The current implementation has several mismatches with the actual database schema:

Execution.action: Expected String, actual is Option<i64>
Execution fields: Missing parameters, context, runner fields
Action fields: entry_point → entrypoint, missing timeout
Repository pattern: Repositories don't have ::new() constructors
Error types: Error::BadRequest and Error::NotFound have different signatures

Required Fixes

Update executor to use action_ref field instead of action
Fix action loading to query by ID from execution
Update execution context preparation for actual schema
Fix repository usage patterns
Update error construction calls
Implement From for Error

Future Enhancements

Phase 1: Core Improvements

Concurrent execution management (max_concurrent_tasks)
Worker capacity tracking and reporting
Execution queuing when at capacity
Retry logic for transient failures

Phase 2: Advanced Runtimes

Container runtime with Docker
Container image management and caching
Volume mounting for code injection
Network isolation for security

Phase 3: Remote Workers

Remote worker registration
Worker-to-worker communication
Geographic distribution
Load balancing strategies

Phase 4: Monitoring & Observability

Execution metrics (duration, success rate)
Worker health metrics
Runtime-specific metrics
OpenTelemetry integration

Phase 5: Security

Execution sandboxing
Resource limits (CPU, memory)
Secret injection from key store
Encrypted artifact storage

19 KiB Raw Blame History

Worker Service Architecture

Overview

Service Architecture

Core Components

1. Worker Registration

2. Heartbeat Manager

3. Runtime System

Runtime Trait

Python Runtime

Shell Runtime

Local Runtime

Runtime Registry

4. Action Executor

5. Artifact Manager

6. Secret Management

7. Worker Service

Configuration

Running the Service

Prerequisites

Startup

Graceful Shutdown

Execution Context

Environment Variables

Execution Result

Error Handling

Error Categories

Error Propagation

Testing

Unit Tests

Integration Tests

Running Tests

Implementation Status

Phase 5.1: Worker Foundation ✅ COMPLETE

Phase 5.2: Runtime System ✅ COMPLETE

Phase 5.3: Execution Logic ⏳ IN PROGRESS

Phase 5.4: Artifact Management ✅ COMPLETE

Phase 5.5: Testing 📋 TODO

Phase 5.6: Advanced Features 📋 TODO

Known Issues

Data Model Mismatches

Required Fixes

Future Enhancements

Phase 1: Core Improvements

Phase 2: Advanced Runtimes

Phase 3: Remote Workers

Phase 4: Monitoring & Observability

Phase 5: Security

Related Documentation

19 KiB

Raw Blame History