Worker Service Architecture

Overview

The Worker Service is responsible for executing automation actions in the Attune platform. It receives execution requests from the Executor service, runs actions in appropriate runtime environments (Python, Shell, Node.js, containers), and reports results back.

Service Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Worker Service                           │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────┐  ┌──────────────────────┐          │
│  │ Worker              │  │ Heartbeat            │          │
│  │ Registration        │  │ Manager              │          │
│  └─────────────────────┘  └──────────────────────┘          │
│           │                         │                         │
│           v                         v                         │
│  ┌─────────────────────────────────────────────┐             │
│  │         Action Executor                     │             │
│  │  ┌─────────────────────────────────────┐   │             │
│  │  │      Runtime Registry               │   │             │
│  │  │  - Python Runtime                   │   │             │
│  │  │  - Shell Runtime                    │   │             │
│  │  │  - Local Runtime (Facade)           │   │             │
│  │  │  - Container Runtime (Future)       │   │             │
│  │  └─────────────────────────────────────┘   │             │
│  └─────────────────────────────────────────────┘             │
│           │                         │                         │
│           v                         v                         │
│  ┌─────────────────────┐  ┌──────────────────────┐          │
│  │ Artifact            │  │ Message Queue        │          │
│  │ Manager             │  │ Consumer/Publisher   │          │
│  └─────────────────────┘  └──────────────────────┘          │
│                                                               │
└─────────────────────────────────────────────────────────────┘
         │                    │                    │
         v                    v                    v
   PostgreSQL            RabbitMQ           Local Filesystem

Core Components

1. Worker Registration

Purpose: Register worker in the database and maintain worker metadata.

Responsibilities:

  • Register worker on startup with name, type, capabilities
  • Update existing worker records to active status on restart
  • Deregister worker on shutdown (mark as inactive)
  • Update worker capabilities dynamically

Key Implementation Details:

  • Worker name defaults to hostname if not specified
  • Capabilities include supported runtimes (python, shell, node)
  • Worker type can be Local, Remote, or Container
  • Uses direct SQL queries for registration (no repository pattern needed)

Database Table: attune.worker

2. Heartbeat Manager

Purpose: Keep worker status fresh in the database with periodic heartbeat updates.

Responsibilities:

  • Send periodic heartbeat updates (default: every 30 seconds)
  • Update last_heartbeat timestamp in database
  • Run in background task until stopped
  • Handle transient database errors gracefully

Key Implementation Details:

  • Runs as a tokio background task with interval ticker
  • Configurable heartbeat interval via worker config
  • Logs errors but doesn't fail the worker on heartbeat issues
  • Clean shutdown on service stop
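
The loop described above can be sketched with a standard-library thread and channel standing in for the tokio task and interval ticker. This is an illustration, not the service's code: `HeartbeatHandle`, `start_heartbeat`, and the `beat` callback (a placeholder for the `last_heartbeat` database update) are hypothetical names.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Handle to a running heartbeat loop (hypothetical type for this sketch).
pub struct HeartbeatHandle {
    stop_tx: Sender<()>,
    join: JoinHandle<u64>,
}

impl HeartbeatHandle {
    /// Signals the loop to stop and returns how many beats were attempted.
    pub fn stop(self) -> u64 {
        // Ignore send errors: the loop may already have exited.
        let _ = self.stop_tx.send(());
        self.join.join().expect("heartbeat thread panicked")
    }
}

/// Spawns a loop that calls `beat` once per `interval` until stopped.
/// `beat` stands in for the `last_heartbeat` database update.
pub fn start_heartbeat<F>(interval: Duration, mut beat: F) -> HeartbeatHandle
where
    F: FnMut() -> Result<(), String> + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let join = thread::spawn(move || {
        let mut attempts = 0u64;
        loop {
            match stop_rx.recv_timeout(interval) {
                // Interval elapsed with no stop signal: send a heartbeat.
                Err(RecvTimeoutError::Timeout) => {
                    attempts += 1;
                    if let Err(e) = beat() {
                        // Log and continue; a failed beat must not stop the worker.
                        eprintln!("heartbeat failed: {e}");
                    }
                }
                // Stop requested, or the handle was dropped.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        attempts
    });
    HeartbeatHandle { stop_tx, join }
}
```

Waiting on the stop channel with a timeout serves as both the interval ticker and the shutdown signal, so stopping does not have to wait out a full interval.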

3. Runtime System

Purpose: Abstraction layer for executing actions in different environments.

Components:

Runtime Trait

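pub trait Runtime: Send + Sync {
    /// Identifier for this runtime.
    fn name(&self) -> &str;
    /// Whether this runtime can run the given execution context.
    fn can_execute(&self, context: &ExecutionContext) -> bool;
    /// Runs the action described by `context` and returns its result.
    async fn execute(&self, context: ExecutionContext) -> RuntimeResult<ExecutionResult>;
    /// Lifecycle hooks, driven by the Runtime Registry.
    async fn setup(&self) -> RuntimeResult<()>;
    async fn cleanup(&self) -> RuntimeResult<()>;
    /// Checks that the runtime is usable.
    async fn validate(&self) -> RuntimeResult<()>;
}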

Python Runtime

  • Executes Python scripts via subprocess
  • Generates wrapper script to inject parameters
  • Supports timeout, stdout/stderr capture
  • Parses JSON results from stdout
  • Default entry point: run() function

Example Action:

def run(x, y):
    return x + y

Shell Runtime

  • Executes bash/shell scripts via subprocess
  • Injects parameters as environment variables (PARAM_*)
  • Supports timeout, output capture
  • Executes with set -e for error propagation

Example Action:

echo "Hello, $PARAM_NAME!"
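
A rough sketch of how a shell action could be launched with `PARAM_*` variables and `set -e`, using only `std::process`. The function names are hypothetical, parameters are simplified to strings, and timeout handling is left out.

```rust
use std::collections::BTreeMap;
use std::io;
use std::process::{Command, Output};

/// Maps a parameter name to its environment variable, e.g. `name` -> `PARAM_NAME`.
pub fn param_env_name(key: &str) -> String {
    format!("PARAM_{}", key.to_uppercase())
}

/// Runs `script` under `sh` with `set -e`, exposing each parameter as `PARAM_*`.
/// Captures stdout, stderr, and the exit status in the returned `Output`.
pub fn run_shell_action(script: &str, params: &BTreeMap<String, String>) -> io::Result<Output> {
    let mut cmd = Command::new("sh");
    // Prepend `set -e` so the script aborts on the first failing command.
    cmd.arg("-c").arg(format!("set -e\n{script}"));
    for (key, value) in params {
        cmd.env(param_env_name(key), value);
    }
    cmd.output()
}
```

With a `name` parameter set to `World`, the example action above would see `PARAM_NAME=World` in its environment.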

Local Runtime

  • Facade that delegates to Python or Shell runtime
  • Selects runtime based on action metadata
  • Currently supports Python and Shell
  • Extensible for additional local runtimes

Runtime Registry

  • Manages collection of registered runtimes
  • Selects appropriate runtime for each action
  • Handles runtime setup/cleanup lifecycle
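
One way to picture runtime selection is a registry that returns the first runtime whose `can_execute` accepts the context. The sketch below uses a simplified, synchronous trait rather than the full `Runtime` trait. `Ctx`, `SelectableRuntime`, and `NamedRuntime` are hypothetical names, and first-match ordering is an assumption.

```rust
/// Minimal stand-in for the document's `ExecutionContext` (hypothetical).
pub struct Ctx {
    /// Runtime requested by the action metadata, e.g. "python" or "shell".
    pub runtime: String,
}

/// Synchronous subset of the `Runtime` trait, enough to show selection.
pub trait SelectableRuntime {
    fn name(&self) -> &str;
    fn can_execute(&self, ctx: &Ctx) -> bool;
}

/// A runtime that accepts any context whose `runtime` field matches its name.
pub struct NamedRuntime(pub &'static str);

impl SelectableRuntime for NamedRuntime {
    fn name(&self) -> &str {
        self.0
    }

    fn can_execute(&self, ctx: &Ctx) -> bool {
        ctx.runtime == self.0
    }
}

/// Holds registered runtimes in registration order.
#[derive(Default)]
pub struct RuntimeRegistry {
    runtimes: Vec<Box<dyn SelectableRuntime>>,
}

impl RuntimeRegistry {
    pub fn register(&mut self, runtime: Box<dyn SelectableRuntime>) {
        self.runtimes.push(runtime);
    }

    /// Returns the first registered runtime that reports it can run `ctx`.
    pub fn select(&self, ctx: &Ctx) -> Option<&dyn SelectableRuntime> {
        for rt in &self.runtimes {
            if rt.can_execute(ctx) {
                return Some(rt.as_ref());
            }
        }
        None
    }
}
```

Returning `None` when no runtime matches lets the caller decide how to report an unsupported action.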

4. Action Executor

Purpose: Orchestrate the complete execution flow for an action and own execution state after handoff.

Execution Flow:

1. Receive execution.scheduled message from the Executor service
2. Load execution record from database
3. Update status to Running (owns state after handoff)
4. Load action definition by reference
5. Prepare execution context (parameters, env vars, timeout)
6. Select and execute in appropriate runtime
7. Capture results (stdout, stderr, return value)
8. Store artifacts (logs, results)
9. Update execution status (Completed/Failed) in database
10. Publish status change notifications
11. Publish completion notification for queue management

Ownership Model:

  • Worker owns execution state after receiving execution.scheduled
  • Authoritative source for all status updates: Running, Completed, Failed, Cancelled, etc.
  • Updates database directly for all state changes
  • Publishes notifications for orchestration and monitoring

Responsibilities:

  • Coordinate execution lifecycle
  • Load action and execution data from database
  • Update execution state in database (after handoff from executor)
  • Prepare execution context with parameters and environment
  • Execute action via runtime registry
  • Handle success and failure cases
  • Store execution artifacts
  • Publish status change notifications

Key Implementation Details:

  • Parameters merged: action defaults + execution overrides
  • Environment variables include execution metadata
  • Default timeout: 5 minutes (300 seconds)
  • Errors captured and stored as execution result

5. Artifact Manager

Purpose: Store and manage execution artifacts (logs, results, files).

Artifact Types:

  • Log: stdout/stderr from execution
  • Result: JSON result data from action
  • File: Custom file outputs from actions
  • Trace: Debug/trace information (future)

Storage Structure:

/tmp/attune/artifacts/{worker_name}/
  └── execution_{id}/
      ├── stdout.log
      ├── stderr.log
      └── result.json

Responsibilities:

  • Store logs (stdout/stderr) for each execution
  • Store JSON result data
  • Support custom file artifacts
  • Clean up old artifacts (retention policy)
  • Delete artifacts for specific executions

Key Implementation Details:

  • Creates execution-specific directories
  • Maps all IO errors to Internal errors
  • Configurable base directory per worker
  • Retention policy based on file modification time

6. Secret Management

Purpose: Securely manage and inject secrets into action execution environments.

Responsibilities:

  • Fetch secrets from database based on ownership hierarchy
  • Decrypt encrypted secrets using AES-256-GCM
  • Inject secrets as environment variables
  • Clean up secrets after execution

Secret Ownership Hierarchy:

  1. System-level secrets - Available to all actions
  2. Pack-level secrets - Available to all actions in a pack
  3. Action-level secrets - Available to specific action only

More specific secrets override less specific ones with the same name.

Environment Variable Injection:

  • Secret names transformed: api_key → SECRET_API_KEY
  • Prefix: SECRET_
  • Uppercase with hyphens replaced by underscores
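
The naming rule and the ownership hierarchy can be combined in a short sketch. It assumes secrets are already decrypted and represented as plain string maps; `secret_env_name` and `resolve_secrets` are hypothetical names.

```rust
use std::collections::HashMap;

/// Converts a secret name to its environment variable, e.g. `api-key` -> `SECRET_API_KEY`.
pub fn secret_env_name(name: &str) -> String {
    format!("SECRET_{}", name.to_uppercase().replace('-', "_"))
}

/// Merges secrets from least to most specific scope; later scopes override earlier ones.
pub fn resolve_secrets(
    system: &HashMap<String, String>,
    pack: &HashMap<String, String>,
    action: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut env = HashMap::new();
    // Insertion order encodes precedence: system < pack < action.
    for scope in [system, pack, action] {
        for (name, value) in scope {
            env.insert(secret_env_name(name), value.clone());
        }
    }
    env
}
```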

Encryption:

  • Algorithm: AES-256-GCM (authenticated encryption)
  • Key derivation: SHA-256 hash of configured password
  • Format: nonce:ciphertext (Base64-encoded)
  • Random nonce per encryption operation

Key Implementation Details:

  • Encryption key loaded from security.encryption_key config
  • Key hash validation ensures correct decryption key
  • Graceful handling of missing secrets (warning, not failure)
  • Secrets never logged or exposed in artifacts
  • Automatic injection during execution context preparation

Configuration:

security:
  encryption_key: "your-secret-encryption-password"

Database Table: attune.key

See docs/secrets-management.md for comprehensive documentation.

7. Worker Service

Purpose: Main service orchestration and message queue integration.

Responsibilities:

  • Initialize all service components
  • Register worker in database
  • Start heartbeat manager
  • Consume execution messages from worker-specific queue
  • Own execution state after receiving scheduled executions
  • Update execution status in database (Running, Completed, Failed, etc.)
  • Publish execution status change notifications
  • Publish execution completion notifications
  • Handle graceful shutdown

Message Flow:

Executor (Scheduler) 
  → Publishes: execution.scheduled 
    → Queue: worker.{worker_id}.executions
      → Worker consumes message
        → Executes action
          → Publishes: execution.status.running
            → Publishes: execution.status.succeeded/failed

Message Types:

Consumed:

  • execution.scheduled - New execution assigned to this worker

Published:

  • execution.status.running - Execution started
  • execution.status.succeeded - Execution completed successfully
  • execution.status.failed - Execution failed
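
For illustration, the queue name and the published message types could be modeled as follows. `StatusEvent` and `worker_queue_name` are hypothetical names, and the `i64` worker id is an assumption.

```rust
/// Status notifications the worker publishes (names taken from the list above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatusEvent {
    Running,
    Succeeded,
    Failed,
}

impl StatusEvent {
    /// Message type string for this event.
    pub fn message_type(self) -> &'static str {
        match self {
            StatusEvent::Running => "execution.status.running",
            StatusEvent::Succeeded => "execution.status.succeeded",
            StatusEvent::Failed => "execution.status.failed",
        }
    }
}

/// Name of the worker-specific queue, `worker.{worker_id}.executions`.
/// The `i64` id type is an assumption of this sketch.
pub fn worker_queue_name(worker_id: i64) -> String {
    format!("worker.{worker_id}.executions")
}
```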

Key Implementation Details:

  • Worker-specific queues enable direct routing from scheduler
  • Database and MQ connections initialized on startup
  • Graceful shutdown deregisters worker
  • Message handlers run async and report errors

Configuration

The Worker Service uses the standard Attune configuration system:

# config.yaml
database:
  url: postgresql://localhost/attune
  max_connections: 20

message_queue:
  url: amqp://localhost
  exchange: attune.executions

worker:
  name: worker-01                    # Optional, defaults to hostname
  worker_type: Local                 # Local, Remote, Container
  runtime_id: null                   # Optional runtime association
  host: null                         # Optional, defaults to hostname
  port: null                         # Optional
  max_concurrent_tasks: 10           # Max parallel executions
  heartbeat_interval: 30             # Seconds between heartbeats
  task_timeout: 300                  # Default task timeout (seconds)

security:
  encryption_key: "your-encryption-key"  # Required for encrypted secrets

Environment variable overrides:

ATTUNE__WORKER__NAME=my-worker
ATTUNE__WORKER__MAX_CONCURRENT_TASKS=20
ATTUNE__WORKER__HEARTBEAT_INTERVAL=60
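
Judging from the examples, a double underscore appears to separate nesting levels after the `ATTUNE` prefix. The sketch below maps a variable name to a config path on that assumption. `env_key_to_path` is a hypothetical helper, not part of the Attune configuration system.

```rust
/// Splits `ATTUNE__WORKER__NAME` into the config path `["worker", "name"]`.
/// Returns `None` for variables without the `ATTUNE__` prefix or with empty segments.
pub fn env_key_to_path(key: &str) -> Option<Vec<String>> {
    let rest = key.strip_prefix("ATTUNE__")?;
    let path: Vec<String> = rest.split("__").map(|s| s.to_lowercase()).collect();
    if path.iter().any(|s| s.is_empty()) {
        return None;
    }
    Some(path)
}
```

Under this reading, `ATTUNE__WORKER__MAX_CONCURRENT_TASKS` addresses `worker.max_concurrent_tasks` in `config.yaml`.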

Running the Service

Prerequisites

  • PostgreSQL 14+ with Attune schema initialized
  • RabbitMQ 3.12+ with exchanges and queues configured
  • Python 3.x and/or bash (for local runtimes)
  • Environment variables or config file set up

Startup

# Using cargo
cd crates/worker
cargo run

# With custom config
cargo run -- --config /path/to/config.yaml

# With custom worker name
cargo run -- --name worker-prod-01

# Or with environment overrides
ATTUNE__WORKER__NAME=worker-01 \
ATTUNE__WORKER__MAX_CONCURRENT_TASKS=20 \
cargo run

Graceful Shutdown

The service supports graceful shutdown via SIGTERM/SIGINT (Ctrl+C):

  1. Stop accepting new execution messages
  2. Finish processing in-flight executions (future enhancement)
  3. Stop heartbeat manager
  4. Deregister worker (mark as inactive)
  5. Close message queue connections
  6. Close database connections
  7. Exit cleanly

Execution Context

The Action Executor prepares an execution context for each action:

pub struct ExecutionContext {
    pub execution_id: i64,
    pub action_ref: String,              // "pack.action"
    pub parameters: HashMap<String, JsonValue>,
    pub env: HashMap<String, String>,    // Environment variables
    pub timeout: Option<u64>,            // Timeout in seconds
    pub working_dir: Option<PathBuf>,    // Working directory
    pub entry_point: String,             // Function/script entry point
    pub code: Option<String>,            // Action code (inline)
    pub code_path: Option<PathBuf>,      // Action code (file path)
}

Environment Variables

The Action Executor injects these environment variables:

  • ATTUNE_EXECUTION_ID - Execution ID
  • ATTUNE_ACTION - Action reference (pack.action)
  • ATTUNE_RUNNER - Runner type (if specified)
  • ATTUNE_CONTEXT_* - Context data as environment variables

For shell actions, parameters are also injected as:

  • PARAM_{KEY} - Each parameter as uppercase env var
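
A minimal sketch of how the `ATTUNE_*` variables might be assembled. `execution_env` is a hypothetical helper, context values are simplified to strings, and upper-casing context keys for the `ATTUNE_CONTEXT_*` suffix is an assumption.

```rust
use std::collections::HashMap;

/// Builds the `ATTUNE_*` variables for one execution.
pub fn execution_env(
    execution_id: i64,
    action_ref: &str,
    runner: Option<&str>,
    context: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut env = HashMap::new();
    env.insert("ATTUNE_EXECUTION_ID".to_string(), execution_id.to_string());
    env.insert("ATTUNE_ACTION".to_string(), action_ref.to_string());
    // ATTUNE_RUNNER is only set when a runner type was specified.
    if let Some(runner) = runner {
        env.insert("ATTUNE_RUNNER".to_string(), runner.to_string());
    }
    // Assumption: context keys are upper-cased to form the ATTUNE_CONTEXT_* suffix.
    for (key, value) in context {
        env.insert(format!("ATTUNE_CONTEXT_{}", key.to_uppercase()), value.clone());
    }
    env
}
```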

Execution Result

Actions return a standardized result:

pub struct ExecutionResult {
    pub exit_code: i32,              // 0 = success
    pub stdout: String,              // Standard output
    pub stderr: String,              // Standard error
    pub result: Option<JsonValue>,   // Parsed result data
    pub duration_ms: u64,            // Execution duration
    pub error: Option<String>,       // Error message if failed
}

Error Handling

Error Categories

  1. Setup Errors: Runtime initialization failures
  2. Execution Errors: Action execution failures
  3. Timeout Errors: Execution exceeded timeout
  4. IO Errors: File/network operations
  5. Database Errors: Connection, query failures

Error Propagation

  • Runtime errors captured in ExecutionResult.error
  • Worker updates execution status to Failed in database (owns state)
  • Error published in status change notification message
  • Error published in completion notification message
  • Artifacts still stored for failed executions
  • Logs preserved for debugging
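
The mapping from a runtime result to a final status could look like the sketch below. It uses a trimmed copy of `ExecutionResult` and a hypothetical `FinalStatus` enum, and it assumes that success means exit code 0 with no runtime error.

```rust
/// Trimmed copy of the document's `ExecutionResult` (the `result` field is omitted).
pub struct ExecutionResult {
    pub exit_code: i32,
    pub stdout: String,
    pub stderr: String,
    pub duration_ms: u64,
    pub error: Option<String>,
}

/// Final status written to the database (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum FinalStatus {
    Completed,
    Failed { reason: String },
}

/// Assumption: success means exit code 0 and no runtime error.
pub fn final_status(result: &ExecutionResult) -> FinalStatus {
    match (&result.error, result.exit_code) {
        // A runtime error takes precedence over the exit code.
        (Some(err), _) => FinalStatus::Failed { reason: err.clone() },
        (None, 0) => FinalStatus::Completed,
        (None, code) => FinalStatus::Failed { reason: format!("exit code {code}") },
    }
}
```

The `reason` string is what would travel in the status change and completion notifications.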

Testing

Unit Tests

Each runtime includes unit tests:

  • Simple execution
  • Parameter passing
  • Timeout handling
  • Error handling

Integration Tests

Integration tests require PostgreSQL and RabbitMQ:

  • Worker registration and heartbeat
  • End-to-end action execution
  • Message queue integration
  • Artifact storage

Running Tests

# Unit tests only
cargo test -p attune-worker --lib

# Integration tests (requires services)
cargo test -p attune-worker --test '*'

# Specific runtime tests
cargo test -p attune-worker python_runtime
cargo test -p attune-worker shell_runtime

Implementation Status

Phase 5.1: Worker Foundation COMPLETE

  • Worker registration module
  • Heartbeat manager
  • Service initialization
  • Configuration loading

Phase 5.2: Runtime System COMPLETE

  • Runtime trait abstraction
  • Python runtime implementation
  • Shell runtime implementation
  • Local runtime facade
  • Runtime registry

Phase 5.3: Execution Logic IN PROGRESS

  • Action executor module
  • Execution context preparation
  • Fix data model mismatches
  • Complete message queue integration
  • Test end-to-end flow

Phase 5.4: Artifact Management COMPLETE

  • Artifact manager module
  • Log storage (stdout/stderr)
  • Result storage (JSON)
  • File artifact storage
  • Cleanup/retention policies

Phase 5.5: Testing 📋 TODO

  • Runtime unit tests (basic)
  • Integration tests with database
  • End-to-end execution tests
  • Error handling tests

Phase 5.6: Advanced Features 📋 TODO

  • Container runtime (Docker)
  • Remote worker support
  • Concurrent execution limits
  • Worker capacity management
  • Execution queuing

Known Issues

Data Model Mismatches

The current implementation has several mismatches with the actual database schema:

  1. Execution.action: Expected String, actual is Option<i64>
  2. Execution fields: Missing parameters, context, runner fields
  3. Action fields: entry_point → entrypoint, missing timeout
  4. Repository pattern: Repositories don't have ::new() constructors
  5. Error types: Error::BadRequest and Error::NotFound have different signatures

Required Fixes

  1. Update the Action Executor to use the action_ref field instead of action
  2. Fix action loading to query by ID from execution
  3. Update execution context preparation for actual schema
  4. Fix repository usage patterns
  5. Update error construction calls
  6. Implement From for Error

Future Enhancements

Phase 1: Core Improvements

  • Concurrent execution management (max_concurrent_tasks)
  • Worker capacity tracking and reporting
  • Execution queuing when at capacity
  • Retry logic for transient failures

Phase 2: Advanced Runtimes

  • Container runtime with Docker
  • Container image management and caching
  • Volume mounting for code injection
  • Network isolation for security

Phase 3: Remote Workers

  • Remote worker registration
  • Worker-to-worker communication
  • Geographic distribution
  • Load balancing strategies

Phase 4: Monitoring & Observability

  • Execution metrics (duration, success rate)
  • Worker health metrics
  • Runtime-specific metrics
  • OpenTelemetry integration

Phase 5: Security

  • Execution sandboxing
  • Resource limits (CPU, memory)
  • Secret injection from key store
  • Encrypted artifact storage