attune/docs/plans/universal-worker-agent.md

Universal Worker Agent Injection

Overview

This plan describes a new deployment model for Attune workers: a statically-linked agent binary that can be injected into any Docker container at runtime, turning arbitrary images into Attune workers. This eliminates the need to build custom worker Docker images for each runtime environment.

Problem

Today, every Attune worker is a purpose-built Docker image: the same attune-worker Rust binary baked into Debian images with specific interpreters installed (see docker/Dockerfile.worker.optimized). Adding a new runtime (e.g., Ruby, Go, Java, R) means:

  1. Modifying Dockerfile.worker.optimized to add a new build stage
  2. Installing the interpreter via apt or a package repository
  3. Managing the combinatorial explosion of worker variants
  4. Rebuilding images (~5 min) for every change
  5. Standardizing on debian:bookworm-slim as the base (not the runtime's official image)

Solution

Flip the model: any Docker image becomes an Attune worker by injecting a lightweight agent binary at container startup. The agent binary is a statically-linked (musl) Rust executable that connects to MQ/DB, consumes execution messages, spawns subprocesses, and reports results — functionally identical to the current worker, but packaged for universal deployment.

Want Ruby support? Point at ruby:3.3 and go. Need a GPU runtime? Use nvidia/cuda:12.3-runtime. Need a specific Python version with scientific libraries pre-installed? Use any image that has them.

Industry Precedent

This pattern is battle-tested in major CI/CD and workflow systems:

| System | Pattern | How It Works |
| --- | --- | --- |
| Tekton | InitContainer + shared volume | Copies a static Go entrypoint binary into an emptyDir; overrides the user container's entrypoint to use it. Steps coordinate via file-based signaling. |
| Argo Workflows (Emissary) | InitContainer + sidecar | The emissary binary runs as both an init container and a sidecar. Disk-based coordination, no Docker socket, no privileged access. |
| GitLab CI Runner (Step Runner) | Binary injection | Newer "Native Step Runner" mode injects a step-runner binary into the build container and adjusts $PATH. Communicates via gRPC. |
| Istio | Mutating webhook | Kubernetes admission controller adds init + sidecar containers transparently. |

The Tekton/Argo pattern (static binary + shared volume) is the best fit for Attune because:

  • It works with Docker Compose (not K8s-only) via bind mounts / named volumes
  • It requires zero dependencies in the user image (just a Linux kernel)
  • A static Rust binary (musl-linked) is ~15–25 MB and runs anywhere
  • No privileged access, no Docker socket needed inside the container

Compatibility

This plan is purely additive. Nothing changes for existing workers:

  • Dockerfile.worker.optimized and its four targets remain unchanged and functional
  • Current docker-compose.yaml worker services keep working
  • All MQ protocols, DB schemas, and execution flows remain identical
  • The agent is just another way to run the same execution engine

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                      Attune Control Plane                        │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │   API    │    │ Executor │    │ RabbitMQ │    │ Postgres │  │
│  └────┬─────┘    └────┬─────┘    └─────┬────┘    └────┬─────┘  │
│       │               │               │               │        │
└───────┼───────────────┼───────────────┼───────────────┼────────┘
        │               │               │               │
  ┌─────┼───────────────┼───────────────┼───────────────┼──────┐
  │     ▼               ▼               ▼               ▼      │
  │  ┌──────────────────────────────────────────────────────┐  │
  │  │            attune-agent (injected binary)            │  │
  │  │  ┌──────────┐ ┌──────────┐ ┌─────────┐ ┌────────┐  │  │
  │  │  │MQ Client │ │DB Client │ │ Process │ │Artifact│  │  │
  │  │  │(lapin)   │ │(sqlx)    │ │Executor │ │Manager │  │  │
  │  │  └──────────┘ └──────────┘ └─────────┘ └────────┘  │  │
  │  └──────────────────────────────────────────────────────┘  │
  │                                                            │
  │  ┌──────────────────────────────────────────────────────┐  │
  │  │     User Container (ANY Docker image)                │  │
  │  │     ruby:3.3, python:3.12, nvidia/cuda, alpine, ...  │  │
  │  └──────────────────────────────────────────────────────┘  │
  │                                                            │
  │  Shared Volumes:                                           │
  │    /opt/attune/agent/       (agent binary, read-only)      │
  │    /opt/attune/packs/       (pack files, read-only)        │
  │    /opt/attune/runtime_envs/(virtualenvs, node_modules)    │
  │    /opt/attune/artifacts/   (artifact files)               │
  └────────────────────────────────────────────────────────────┘

Agent vs. Full Worker Comparison

The agent binary is functionally identical to the current attune-worker. The difference is packaging and startup behavior:

| Capability | Full Worker (attune-worker) | Agent (attune-agent) |
| --- | --- | --- |
| MQ consumption | ✓ | ✓ |
| DB access | ✓ | ✓ |
| Process execution | ✓ | ✓ |
| Artifact management | ✓ | ✓ |
| Secret management | ✓ | ✓ |
| Cancellation / timeout | ✓ | ✓ |
| Heartbeat | ✓ | ✓ |
| Runtime env setup (venvs) | Proactive at startup | Lazy on first use |
| Version verification | Full sweep at startup | On-demand per-execution |
| Runtime discovery | Manual (ATTUNE_WORKER_RUNTIMES) | Auto-detect + optional manual override |
| Linking | Dynamic (glibc) | Static (musl) |
| Base image requirement | debian:bookworm-slim | None (any Linux container) |
| Binary size | ~30–50 MB | ~15–25 MB (stripped, musl) |

Binary Distribution Methods

Two methods for getting the agent binary into a container:

Method A: Shared Volume (Docker Compose — recommended)

An init container copies the agent binary into a Docker named volume. User containers mount this volume read-only and use the binary as their entrypoint.

Method B: HTTP Download (remote / cloud deployments)

A new API endpoint (GET /api/v1/agent/binary) serves the static binary. A small wrapper script in the container downloads it on first run. Useful for Kubernetes, ECS, or remote Docker hosts where shared volumes are impractical.

Implementation Phases

Phase 1: Static Binary Build Infrastructure

Goal: Produce a statically-linked attune-agent binary that runs in any Linux container.

Effort: 3–5 days

Dependencies: None

1.1 TLS Backend Audit and Alignment

The agent must link statically with musl. This requires all TLS to use rustls (pure Rust) instead of OpenSSL/native-tls.

Current state (from Cargo.toml workspace dependencies):

  • sqlx: Already uses runtime-tokio-rustls
  • reqwest: Uses default features (native-tls) — needs rustls-tls feature
  • tokio-tungstenite: Uses native-tls feature — needs rustls
  • lapin (v4.3): Uses native-tls by default — needs rustls feature

Changes needed in workspace Cargo.toml:

# Change reqwest to use rustls
reqwest = { version = "0.13", features = ["json", "rustls-tls"], default-features = false }

# Change tokio-tungstenite to use rustls
tokio-tungstenite = { version = "0.28", features = ["rustls"] }

# Check lapin's TLS features — if using amqps://, need rustls support.
# For plain amqp:// (typical in Docker Compose), no TLS needed.
# For production amqps://, evaluate lapin's rustls support or use a TLS-terminating proxy.

Important: These changes affect the entire workspace. Test all services (api, executor, worker, notifier, sensor, cli) after switching TLS backends. If switching workspace-wide is too disruptive, use feature flags to conditionally select the TLS backend for the agent build only.

Alternative: If workspace-wide rustls migration is too risky, the agent crate can override specific dependencies:

[dependencies]
reqwest = { workspace = true, default-features = false, features = ["json", "rustls-tls"] }

1.2 New Crate or New Binary Target

Option A (recommended): New binary target in the worker crate

Add a second binary target to crates/worker/Cargo.toml:

[[bin]]
name = "attune-worker"
path = "src/main.rs"

[[bin]]
name = "attune-agent"
path = "src/agent_main.rs"

This reuses all existing code — ActionExecutor, ProcessRuntime, WorkerService, RuntimeRegistry, SecretManager, ArtifactManager, etc. The agent entrypoint is a thin wrapper with different startup behavior (auto-detection instead of manual config).

Pros: Zero code duplication. Same test suite covers both binaries. Cons: The agent binary includes unused code paths (e.g., full worker service setup).

Option B: New crate crates/agent/

crates/agent/
├── Cargo.toml          # Depends on attune-common + selected worker modules
├── src/
│   ├── main.rs         # Entry point
│   ├── agent.rs        # Core agent loop
│   ├── detect.rs       # Runtime auto-detection
│   └── health.rs       # Health check (file-based or tiny HTTP)

Pros: Cleaner separation, can minimize binary size by excluding unused deps. Cons: Requires extracting shared execution code into a library or duplicating it.

Recommendation: Start with Option A (new binary target in worker crate) for speed. Refactor into a separate crate later if binary size becomes a concern.

1.3 Agent Entrypoint (src/agent_main.rs)

The agent entrypoint differs from main.rs in:

  1. Runtime auto-detection instead of relying on ATTUNE_WORKER_RUNTIMES
  2. Lazy environment setup instead of proactive startup sweep
  3. Simplified config loading — env vars are the primary config source (no config file required, but supported if mounted)
  4. Container-aware defaults — sensible defaults for paths, timeouts, concurrency

src/agent_main.rs responsibilities:
  1. Parse CLI args / env vars for DB URL, MQ URL, worker name
  2. Run runtime auto-detection (Phase 2) to discover available interpreters
  3. Initialize WorkerService with detected capabilities
  4. Start the normal execution consumer loop
  5. Handle SIGTERM/SIGINT for graceful shutdown
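Step 1 can be sketched as a minimal config loader. The environment variable names follow the compose examples later in this plan; the function name, defaults, and lookup-closure shape are illustrative assumptions, not the final API:

```rust
/// Minimal agent configuration resolved from environment variables with
/// container-friendly defaults. Parameterized over a lookup closure so the
/// logic is testable without touching the process environment.
#[derive(Debug, PartialEq)]
pub struct AgentConfig {
    pub worker_name: String,
    pub db_url: String,
    pub mq_url: String,
}

pub fn load_agent_config<F: Fn(&str) -> Option<String>>(lookup: F) -> AgentConfig {
    let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());
    AgentConfig {
        // Defaults below are illustrative, matching the compose examples.
        worker_name: get("ATTUNE_WORKER_NAME", "attune-agent"),
        db_url: get("ATTUNE__DATABASE__URL", "postgres://attune:attune@postgres:5432/attune"),
        mq_url: get("ATTUNE__MESSAGE_QUEUE__URL", "amqp://rabbitmq:5672"),
    }
}
```

In agent_main.rs this would be called as `load_agent_config(|k| std::env::var(k).ok())`.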

1.4 Dockerfile for Agent Binary

Create docker/Dockerfile.agent:

# Stage 1: Build the statically-linked agent binary
FROM rust:1.83-bookworm AS builder

RUN apt-get update && apt-get install -y musl-tools
RUN rustup target add x86_64-unknown-linux-musl

WORKDIR /build
ENV RUST_MIN_STACK=67108864

COPY Cargo.toml Cargo.lock ./
COPY crates/ ./crates/
COPY migrations/ ./migrations/
COPY .sqlx/ ./.sqlx/

# Build only the agent binary, statically linked
RUN --mount=type=cache,target=/usr/local/cargo/registry,sharing=shared \
    --mount=type=cache,target=/usr/local/cargo/git,sharing=shared \
    --mount=type=cache,id=agent-target,target=/build/target \
    SQLX_OFFLINE=true cargo build --release \
      --target x86_64-unknown-linux-musl \
      --bin attune-agent \
    && cp /build/target/x86_64-unknown-linux-musl/release/attune-agent /attune-agent \
    && strip /attune-agent

# Stage 2: Minimal image for volume population
FROM scratch AS agent-binary
COPY --from=builder /attune-agent /attune-agent

Multi-architecture support: For ARM64 (Apple Silicon, Graviton), add a parallel build stage targeting aarch64-unknown-linux-musl. Use Docker buildx multi-platform builds or separate images.

1.5 Makefile Targets

Add to Makefile:

build-agent:
	SQLX_OFFLINE=true cargo build --release --target x86_64-unknown-linux-musl --bin attune-agent
	strip target/x86_64-unknown-linux-musl/release/attune-agent

docker-build-agent:
	docker buildx build -f docker/Dockerfile.agent -t attune-agent:latest .

Phase 2: Runtime Auto-Detection

Goal: The agent automatically discovers what interpreters are available in the container, without requiring ATTUNE_WORKER_RUNTIMES to be set.

Effort: 1–2 days

Dependencies: Phase 1 (agent binary exists)

2.1 Interpreter Discovery Module

Create a new module (in crates/worker/src/ or crates/common/src/) that probes the container's filesystem for known interpreters:

src/runtime_detect.rs (or extend existing crates/common/src/runtime_detection.rs)

struct DetectedInterpreter {
    runtime_name: String,       // "python", "ruby", "node", etc.
    binary_path: PathBuf,       // e.g., "/usr/local/bin/python3"
    version: Option<String>,    // e.g., "3.12.1" (parsed from version command output)
}

/// Static description of how to probe for one runtime's interpreter.
struct InterpreterProbe {
    runtime_name: &'static str,
    binaries: &'static [&'static str],
    version_flag: &'static str,
    version_regex: &'static str,
}

/// Probe the container for available interpreters.
///
/// For each known runtime, checks common binary names via `which` or
/// direct path existence, then runs the version command to extract
/// the version string.
fn detect_interpreters() -> Vec<DetectedInterpreter> {
    let probes = [
        InterpreterProbe {
            runtime_name: "python",
            binaries: &["python3", "python"],
            version_flag: "--version",
            version_regex: r"Python (\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "node",
            binaries: &["node", "nodejs"],
            version_flag: "--version",
            version_regex: r"v(\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "ruby",
            binaries: &["ruby"],
            version_flag: "--version",
            version_regex: r"ruby (\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "go",
            binaries: &["go"],
            version_flag: "version",
            version_regex: r"go(\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "java",
            binaries: &["java"],
            version_flag: "-version",
            version_regex: r#""(\d+[\.\d+]*)""#,
        },
        InterpreterProbe {
            runtime_name: "perl",
            binaries: &["perl"],
            version_flag: "--version",
            version_regex: r"v(\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "r",
            binaries: &["Rscript", "R"],
            version_flag: "--version",
            version_regex: r"R.*version (\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "shell",
            binaries: &["bash", "sh"],
            version_flag: "--version",
            version_regex: r"(\d+\.\d+\.\d+)",
        },
    ];

    // For each probe:
    // 1. Run `which <binary>` or check known paths
    // 2. If found, run `<binary> <version_flag>` with a short timeout (2s)
    // 3. Parse version from output using the regex
    // 4. Return DetectedInterpreter with the results
}

Integration with existing code: The existing crates/common/src/runtime_detection.rs already has normalize_runtime_name() and alias groups. The auto-detection module should use these for matching detected interpreters against DB runtime records.
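The version-extraction step (3) can even be sketched without the regex crate, assuming version commands emit a dotted numeric run somewhere in their output; the function name is illustrative:

```rust
/// Extract the first dotted numeric token from a version command's output,
/// e.g. "Python 3.12.1" → "3.12.1", "ruby 3.3.0p0 (...)" → "3.3.0".
/// Returns None if no token containing a '.' and bounded by digits exists.
pub fn extract_version(output: &str) -> Option<String> {
    output
        // Split on anything that is not a digit or a dot.
        .split(|c: char| !(c.is_ascii_digit() || c == '.'))
        // Keep the first token that looks like a version: has a dot and
        // starts/ends with a digit (rejects stray "." tokens).
        .find(|t| {
            t.contains('.')
                && t.chars().next().map_or(false, |c| c.is_ascii_digit())
                && t.chars().last().map_or(false, |c| c.is_ascii_digit())
        })
        .map(|t| t.to_string())
}
```

The per-probe regexes above remain the more precise option; this is only a dependency-free fallback.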

2.2 Integration with Worker Registration

The agent startup sequence:

  1. Run detect_interpreters()
  2. Match detected interpreters against known runtimes in the runtime table (using alias-aware matching from runtime_detection.rs)
  3. If ATTUNE_WORKER_RUNTIMES is set, use it as an override (intersection or union — TBD, probably override wins)
  4. Register the worker with the detected/configured capabilities
  5. Log what was detected for debugging:
    [INFO] Detected runtimes: python 3.12.1 (/usr/local/bin/python3), ruby 3.3.0 (/usr/local/bin/ruby), shell 5.2.21 (/bin/bash)
    [INFO] Registering worker with capabilities: [python, ruby, shell]
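Step 3's "override wins" option can be expressed as a small pure function; the name and signature are illustrative:

```rust
/// Resolve worker capabilities: an explicit ATTUNE_WORKER_RUNTIMES value
/// (comma-separated) overrides auto-detection entirely; otherwise the
/// auto-detected set is used as-is.
pub fn resolve_capabilities(detected: &[&str], manual_override: Option<&str>) -> Vec<String> {
    match manual_override {
        Some(list) => list
            .split(',')
            .map(|s| s.trim().to_string())
            .filter(|s| !s.is_empty())
            .collect(),
        None => detected.iter().map(|s| s.to_string()).collect(),
    }
}
```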
    

2.3 Runtime Hints File (Optional Enhancement)

Allow a .attune-runtime.yaml file in the container that declares runtime capabilities and custom configuration. This handles cases where auto-detection isn't sufficient (e.g., custom interpreters, non-standard paths, special environment setup).

# /opt/attune/.attune-runtime.yaml (or /.attune-runtime.yaml)
runtimes:
  - name: ruby
    interpreter: /usr/local/bin/ruby
    file_extension: .rb
    version_command: "ruby --version"
    env_setup:
      create_command: "mkdir -p {env_dir}"
      install_command: "cd {env_dir} && bundle install --gemfile {pack_dir}/Gemfile"
  - name: custom-ml
    interpreter: /opt/conda/bin/python
    file_extension: .py
    version_command: "/opt/conda/bin/python --version"

The agent checks for this file at startup and merges it with auto-detected runtimes (hints file takes precedence for conflicting runtime names).
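The merge rule can be sketched as follows; pairs here are (runtime name, interpreter path), and the concrete paths are illustrative:

```rust
use std::collections::BTreeMap;

/// Merge auto-detected runtimes with hints-file entries. Detection results
/// seed the map; hints-file entries then overwrite conflicting names, so
/// the hints file takes precedence.
pub fn merge_runtimes(
    detected: &[(&str, &str)],
    hints: &[(&str, &str)],
) -> BTreeMap<String, String> {
    let mut merged: BTreeMap<String, String> = detected
        .iter()
        .map(|(name, path)| (name.to_string(), path.to_string()))
        .collect();
    for (name, path) in hints {
        // Hints win on conflicting runtime names.
        merged.insert(name.to_string(), path.to_string());
    }
    merged
}
```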

This is a nice-to-have for Phase 2 — implement only if auto-detection proves insufficient for common use cases.


Phase 3: Refactor Worker for Code Reuse

Goal: Ensure the execution engine is cleanly reusable between the full attune-worker and the attune-agent binary, without code duplication.

Effort: 2–3 days

Dependencies: Phase 1 (agent entrypoint exists), can be done in parallel with Phase 2

3.1 Identify Shared vs. Agent-Specific Code

Current worker crate modules and their reuse status:

| Module | File(s) | Shared? | Notes |
| --- | --- | --- | --- |
| ActionExecutor | executor.rs | Fully shared | Core execution orchestration |
| ProcessRuntime | runtime/process.rs | Fully shared | Subprocess spawning, interpreter resolution |
| process_executor | runtime/process_executor.rs | Fully shared | Streaming output capture, timeout, cancellation |
| NativeRuntime | runtime/native.rs | Fully shared | Direct binary execution |
| LocalRuntime | runtime/local.rs | Fully shared | Fallback runtime facade |
| RuntimeRegistry | runtime/mod.rs | Fully shared | Runtime selection and registration |
| ExecutionContext | runtime/mod.rs | Fully shared | Execution parameters, env vars, secrets |
| BoundedLogWriter | runtime/log_writer.rs | Fully shared | Streaming log capture with size limits |
| parameter_passing | runtime/parameter_passing.rs | Fully shared | Stdin/file/env parameter delivery |
| SecretManager | secrets.rs | Fully shared | Secret decryption via attune_common::crypto |
| ArtifactManager | artifacts.rs | Fully shared | Artifact finalization (file stat, size update) |
| HeartbeatManager | heartbeat.rs | Fully shared | Periodic DB heartbeat |
| WorkerRegistration | registration.rs | Shared, extended | Needs auto-detection integration |
| env_setup | env_setup.rs | Shared, lazy mode | Agent uses lazy setup instead of proactive |
| version_verify | version_verify.rs | Shared, on-demand mode | Agent verifies on-demand instead of full sweep |
| WorkerService | service.rs | ⚠️ Needs refactoring | Extract reusable AgentService or parameterize |

Conclusion: Almost everything is already reusable. The main work is in service.rs, which needs to be parameterized for the two startup modes (proactive vs. lazy).

3.2 Refactor WorkerService for Dual Modes

Instead of duplicating WorkerService, add a configuration enum:

// In service.rs or a new config module

/// Controls how the worker initializes its runtime environment.
pub enum StartupMode {
    /// Full worker mode: proactive environment setup, full version
    /// verification sweep at startup. Used by `attune-worker`.
    Worker,

    /// Agent mode: lazy environment setup (on first use), on-demand
    /// version verification, auto-detected runtimes. Used by `attune-agent`.
    Agent {
        /// Runtimes detected by the auto-detection module.
        detected_runtimes: Vec<DetectedInterpreter>,
    },
}

The WorkerService::start() method checks this mode:

match &self.startup_mode {
    StartupMode::Worker => {
        // Existing behavior: full version verification sweep
        self.verify_all_runtime_versions().await?;
        // Existing behavior: proactive environment setup for all packs
        self.setup_all_environments().await?;
    }
    StartupMode::Agent { .. } => {
        // Skip proactive setup — will happen lazily on first execution
        info!("Agent mode: deferring environment setup to first execution");
    }
}

3.3 Lazy Environment Setup

In agent mode, the first execution for a given pack+runtime combination triggers environment setup:

// In executor.rs, within execute_with_cancel()

// Before executing, ensure the runtime environment exists
if !env_dir.exists() {
    info!("Creating runtime environment on first use: {}", env_dir.display());
    self.env_setup.setup_environment(&pack_ref, &runtime_name, &env_dir).await?;
}

The current worker already handles this partially — the ProcessRuntime::execute() method has auto-repair logic for broken venvs. The lazy setup extends this to handle the case where the env directory doesn't exist at all.
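Under concurrent executions, the lazy path should perform setup exactly once per pack+runtime pair. A minimal sketch of such a guard (type and method names are illustrative, not existing worker APIs):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Tracks which (pack, runtime) environments have been claimed for setup,
/// so only the first execution for a pair runs environment creation.
pub struct LazyEnvGuard {
    ready: Mutex<HashSet<(String, String)>>,
}

impl LazyEnvGuard {
    pub fn new() -> Self {
        Self { ready: Mutex::new(HashSet::new()) }
    }

    /// Returns true exactly once per (pack, runtime) key; the caller that
    /// receives `true` performs setup_environment before executing.
    pub fn claim_setup(&self, pack: &str, runtime: &str) -> bool {
        self.ready
            .lock()
            .unwrap()
            .insert((pack.to_string(), runtime.to_string()))
    }
}
```

A production version would also need to make later executions wait until the winning claimant finishes setup, e.g. via a per-key tokio::sync::OnceCell.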


Phase 4: Docker Compose Integration

Goal: Make it trivial to add agent-based workers to docker-compose.yaml.

Effort: 1 day

Dependencies: Phase 1 (agent binary and Dockerfile exist)

4.1 Init Service for Agent Volume

Add to docker-compose.yaml:

services:
  # Populates the agent binary volume (runs once)
  init-agent:
    build:
      context: .
      dockerfile: docker/Dockerfile.agent
    volumes:
      - agent_bin:/opt/attune/agent
    entrypoint: ["/bin/sh", "-c", "cp /attune-agent /opt/attune/agent/attune-agent && chmod +x /opt/attune/agent/attune-agent"]
    restart: "no"
    networks:
      - attune

volumes:
  agent_bin:  # Named volume holding the static agent binary

Note: The init-agent service needs a minimal base with /bin/sh for the cp command. Since the agent Dockerfile's final stage is FROM scratch, the init service should use the builder stage or a separate FROM alpine stage.

Revised Dockerfile.agent approach — use Alpine for the init image so it has a shell:

# Stage 1: Build
FROM rust:1.83-bookworm AS builder
# ... (build steps from Phase 1.4)

# Stage 2: Init image (has a shell for cp)
FROM alpine:3.20 AS agent-init
COPY --from=builder /attune-agent /attune-agent
# Default command copies the binary into the mounted volume
CMD ["cp", "/attune-agent", "/opt/attune/agent/attune-agent"]

# Stage 3: Bare binary (for HTTP download or direct use)
FROM scratch AS agent-binary
COPY --from=builder /attune-agent /attune-agent

4.2 Agent-Based Worker Services

Example services that can be added to docker-compose.yaml or a user's docker-compose.override.yaml:

  # Ruby worker — uses the official Ruby image
  worker-ruby:
    image: ruby:3.3-slim
    depends_on:
      init-agent:
        condition: service_completed_successfully
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    entrypoint: ["/opt/attune/agent/attune-agent"]
    volumes:
      - agent_bin:/opt/attune/agent:ro
      - packs_data:/opt/attune/packs:ro
      - runtime_envs:/opt/attune/runtime_envs
      - artifacts_data:/opt/attune/artifacts
      - ${ATTUNE_DOCKER_CONFIG_PATH:-./config.docker.yaml}:/opt/attune/config/config.yaml:ro
    environment:
      ATTUNE_WORKER_NAME: worker-ruby-1
      # ATTUNE_WORKER_RUNTIMES omitted — auto-detected as ruby,shell
    networks:
      - attune
    restart: unless-stopped
    stop_grace_period: 45s

  # R worker — uses the official R base image
  worker-r:
    image: r-base:4.4.0
    depends_on:
      init-agent:
        condition: service_completed_successfully
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    entrypoint: ["/opt/attune/agent/attune-agent"]
    volumes:
      - agent_bin:/opt/attune/agent:ro
      - packs_data:/opt/attune/packs:ro
      - runtime_envs:/opt/attune/runtime_envs
      - artifacts_data:/opt/attune/artifacts
      - ${ATTUNE_DOCKER_CONFIG_PATH:-./config.docker.yaml}:/opt/attune/config/config.yaml:ro
    environment:
      ATTUNE_WORKER_NAME: worker-r-1
    networks:
      - attune
    restart: unless-stopped

  # GPU worker — NVIDIA CUDA image with Python
  worker-gpu:
    image: nvidia/cuda:12.3.1-runtime-ubuntu22.04
    depends_on:
      init-agent:
        condition: service_completed_successfully
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    entrypoint: ["/opt/attune/agent/attune-agent"]
    runtime: nvidia
    volumes:
      - agent_bin:/opt/attune/agent:ro
      - packs_data:/opt/attune/packs:ro
      - runtime_envs:/opt/attune/runtime_envs
      - artifacts_data:/opt/attune/artifacts
      - ${ATTUNE_DOCKER_CONFIG_PATH:-./config.docker.yaml}:/opt/attune/config/config.yaml:ro
    environment:
      ATTUNE_WORKER_NAME: worker-gpu-1
      ATTUNE_WORKER_RUNTIMES: python,shell  # Manual override (image has python pre-installed)
    networks:
      - attune
    restart: unless-stopped

4.3 User Experience Summary

Adding a new runtime to an Attune deployment becomes a ~12-line addition to docker-compose.override.yaml:

services:
  worker-my-runtime:
    image: my-org/my-custom-image:latest
    depends_on:
      init-agent:
        condition: service_completed_successfully
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    entrypoint: ["/opt/attune/agent/attune-agent"]
    volumes:
      - agent_bin:/opt/attune/agent:ro
      - packs_data:/opt/attune/packs:ro
      - runtime_envs:/opt/attune/runtime_envs
      - artifacts_data:/opt/attune/artifacts
      - ${ATTUNE_DOCKER_CONFIG_PATH:-./config.docker.yaml}:/opt/attune/config/config.yaml:ro
    networks:
      - attune

No Dockerfiles. No rebuilds. No waiting for Rust compilation. Start to finish in seconds.


Phase 5: API Binary Download Endpoint

Goal: Support deployments where shared Docker volumes are impractical (Kubernetes, ECS, remote Docker hosts).

Effort: 1 day

Dependencies: Phase 1 (agent binary exists)

5.1 New API Route

Add to crates/api/src/routes/:

GET /api/v1/agent/binary
GET /api/v1/agent/binary?arch=x86_64    (default)
GET /api/v1/agent/binary?arch=aarch64

Response: application/octet-stream
Headers: Content-Disposition: attachment; filename="attune-agent"

The API serves the binary from a configurable filesystem path (e.g., /opt/attune/agent/attune-agent). The binary can be placed there at build time (baked into the API image) or mounted via volume.

Configuration (config.yaml):

agent:
  binary_dir: /opt/attune/agent   # Directory containing agent binaries
  # Files expected: attune-agent-x86_64, attune-agent-aarch64
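The route handler's arch-to-filename mapping, following the file naming convention in the config above (function name illustrative; unknown architectures map to the 404 case):

```rust
/// Map the optional `arch` query parameter to the binary filename served
/// from `agent.binary_dir`. Defaults to x86_64 when the parameter is absent.
pub fn agent_binary_filename(arch: Option<&str>) -> Option<&'static str> {
    match arch.unwrap_or("x86_64") {
        "x86_64" => Some("attune-agent-x86_64"),
        "aarch64" => Some("attune-agent-aarch64"),
        _ => None, // unsupported architecture → respond 404
    }
}
```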

OpenAPI documentation via utoipa:

#[utoipa::path(
    get,
    path = "/api/v1/agent/binary",
    params(("arch" = Option<String>, Query, description = "Target architecture (x86_64, aarch64)")),
    responses(
        (status = 200, description = "Agent binary", content_type = "application/octet-stream"),
        (status = 404, description = "Binary not found for requested architecture"),
    ),
    tag = "agent"
)]

Authentication: This endpoint should be unauthenticated or use a simple shared token, since the agent needs to download the binary before it can authenticate. Alternatively, require basic auth or a bootstrap token passed via environment variable.

5.2 Bootstrap Wrapper Script

Provide scripts/attune-agent-wrapper.sh for use as a container entrypoint:

#!/bin/sh
# attune-agent-wrapper.sh — Bootstrap the Attune agent in any container
set -e

AGENT_DIR="${ATTUNE_AGENT_DIR:-/opt/attune/agent}"
AGENT_BIN="$AGENT_DIR/attune-agent"
AGENT_URL="${ATTUNE_AGENT_URL:-http://attune-api:8080/api/v1/agent/binary}"

# Use volume-mounted binary if available, otherwise download
if [ ! -x "$AGENT_BIN" ]; then
  echo "[attune] Agent binary not found at $AGENT_BIN, downloading from $AGENT_URL..."
  mkdir -p "$AGENT_DIR"
  if command -v wget >/dev/null 2>&1; then
    wget -q -O "$AGENT_BIN" "$AGENT_URL"
  elif command -v curl >/dev/null 2>&1; then
    curl -sL "$AGENT_URL" -o "$AGENT_BIN"
  else
    echo "[attune] ERROR: Neither wget nor curl available. Cannot download agent." >&2
    exit 1
  fi
  chmod +x "$AGENT_BIN"
  echo "[attune] Agent binary downloaded successfully."
fi

echo "[attune] Starting agent..."
exec "$AGENT_BIN" "$@"

Usage:

# In docker-compose or K8s — when volume mount isn't available
worker-remote:
  image: python:3.12-slim
  entrypoint: ["/opt/attune/scripts/attune-agent-wrapper.sh"]
  volumes:
    - ./scripts/attune-agent-wrapper.sh:/opt/attune/scripts/attune-agent-wrapper.sh:ro
  environment:
    ATTUNE_AGENT_URL: http://attune-api:8080/api/v1/agent/binary

Phase 6: Database & Runtime Registry Extensions

Goal: Support arbitrary runtimes without requiring every possible runtime to be pre-registered in the DB.

Effort: 1–2 days

Dependencies: Phase 2 (auto-detection working)

6.1 Extended Runtime Detection Metadata

Add a migration to support auto-detected runtimes:

-- Migration: NNNNNN_agent_runtime_detection.sql

-- Track whether a runtime was auto-registered by an agent
ALTER TABLE runtime ADD COLUMN IF NOT EXISTS auto_detected BOOLEAN NOT NULL DEFAULT FALSE;

-- Store detection configuration for auto-discovered runtimes
-- Example: { "binaries": ["ruby", "ruby3.2"], "version_command": "--version",
--            "version_regex": "ruby (\\d+\\.\\d+\\.\\d+)" }
ALTER TABLE runtime ADD COLUMN IF NOT EXISTS detection_config JSONB NOT NULL DEFAULT '{}';

6.2 Runtime Template Packs

Ship pre-configured runtime definitions for common languages in the core pack (or a new runtimes pack). These are registered during pack loading and provide the execution_config that auto-detected interpreters need.

Add runtime YAML files for new languages:

packs/core/runtimes/ruby.yaml
packs/core/runtimes/go.yaml
packs/core/runtimes/java.yaml
packs/core/runtimes/perl.yaml
packs/core/runtimes/r.yaml

Example ruby.yaml:

ref: core.ruby
name: Ruby
label: Ruby Runtime
description: Execute Ruby scripts
execution_config:
  interpreter:
    binary: ruby
    file_extension: .rb
  env_vars:
    GEM_HOME: "{env_dir}/gems"
    GEM_PATH: "{env_dir}/gems"
    BUNDLE_PATH: "{env_dir}/gems"
  environment:
    create_command: "mkdir -p {env_dir}/gems"
    install_command: "cd {pack_dir} && GEM_HOME={env_dir}/gems bundle install --quiet 2>/dev/null || true"
    dependency_file: Gemfile

6.3 Dynamic Runtime Registration

When the agent detects an interpreter that matches a runtime template (by name/alias) but the runtime doesn't exist in the DB yet, the agent can auto-register it:

  1. Look up the runtime by name in the DB using alias-aware matching
  2. If found → use it (existing behavior)
  3. If not found → check if a runtime template exists in loaded packs
  4. If template found → register the runtime using the template's execution_config
  5. If no template → register a minimal runtime with just the detected interpreter binary path
  6. Mark auto-registered runtimes with auto_detected = true

This ensures the agent can work with new runtimes immediately, even if the runtime hasn't been explicitly configured.
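The fallback chain in steps 1–6 reduces to a small decision function; the enum and function names here are illustrative:

```rust
/// Outcome of resolving a detected interpreter against the DB and
/// pack-shipped runtime templates.
#[derive(Debug, PartialEq)]
pub enum RuntimeResolution {
    /// Runtime already registered in the DB — use it as-is.
    UseExisting,
    /// Register using a runtime template's execution_config.
    RegisterFromTemplate,
    /// Register a minimal runtime with just the detected binary path.
    RegisterMinimal,
}

pub fn resolve_runtime(found_in_db: bool, template_available: bool) -> RuntimeResolution {
    if found_in_db {
        RuntimeResolution::UseExisting
    } else if template_available {
        RuntimeResolution::RegisterFromTemplate
    } else {
        RuntimeResolution::RegisterMinimal
    }
}
```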


Phase 7: Kubernetes Support

Status: Complete

Goal: Provide Kubernetes manifests and Helm chart support for agent-based workers.

Effort: 1–2 days

Dependencies: Phase 4 (Docker Compose working), Phase 5 (binary download)

Implemented:

  • Helm chart agent-workers.yaml template — creates a Deployment per agentWorkers[] entry
  • InitContainer pattern (agent-loader) copies the statically-linked binary via emptyDir volume
  • Full scheduling support: nodeSelector, tolerations, runtimeClassName (GPU/nvidia)
  • Runtime auto-detect by default; explicit runtimes list override
  • Custom env vars, resource limits, log level, termination grace period
  • images.agent added to values.yaml for registry-aware image resolution
  • attune-agent image added to the Gitea Actions publish workflow (agent-init target)
  • NOTES.txt updated to list enabled agent workers on install
  • Quick-reference docs at docs/QUICKREF-kubernetes-agent-workers.md

7.1 InitContainer Pattern

The agent maps naturally to Kubernetes using the same Tekton/Argo pattern:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: attune-worker-ruby
spec:
  replicas: 2
  selector:
    matchLabels:
      app: attune-worker-ruby
  template:
    metadata:
      labels:
        app: attune-worker-ruby
    spec:
      initContainers:
        - name: agent-loader
          image: attune/agent:latest    # Built from Dockerfile.agent, agent-init target
          command: ["cp", "/attune-agent", "/opt/attune/agent/attune-agent"]
          volumeMounts:
            - name: agent-bin
              mountPath: /opt/attune/agent
      containers:
        - name: worker
          image: ruby:3.3
          command: ["/opt/attune/agent/attune-agent"]
          env:
            - name: ATTUNE__DATABASE__URL
              valueFrom:
                secretKeyRef:
                  name: attune-secrets
                  key: database-url
            - name: ATTUNE__MESSAGE_QUEUE__URL
              valueFrom:
                secretKeyRef:
                  name: attune-secrets
                  key: mq-url
          volumeMounts:
            - name: agent-bin
              mountPath: /opt/attune/agent
              readOnly: true
            - name: packs
              mountPath: /opt/attune/packs
              readOnly: true
            - name: runtime-envs
              mountPath: /opt/attune/runtime_envs
            - name: artifacts
              mountPath: /opt/attune/artifacts
      volumes:
        - name: agent-bin
          emptyDir: {}
        - name: packs
          persistentVolumeClaim:
            claimName: attune-packs
        - name: runtime-envs
          persistentVolumeClaim:
            claimName: attune-runtime-envs
        - name: artifacts
          persistentVolumeClaim:
            claimName: attune-artifacts

7.2 Helm Chart Values

# values.yaml (plan-stage sketch; the implemented chart keys these entries under agentWorkers[])
workers:
  - name: ruby
    image: ruby:3.3
    replicas: 2
    runtimes: []  # auto-detect
  - name: python-gpu
    image: nvidia/cuda:12.3.1-runtime-ubuntu22.04
    replicas: 1
    runtimes: [python, shell]
    resources:
      limits:
        nvidia.com/gpu: 1

Implementation Order & Effort Summary

| Phase | Description | Effort | Dependencies | Priority |
|---|---|---|---|---|
| Phase 1 | Static binary build infrastructure | 3-5 days | None | Critical |
| Phase 3 | Refactor worker for code reuse | 2-3 days | Phase 1 | Critical |
| Phase 2 | Runtime auto-detection | 1-2 days | Phase 1 | High |
| Phase 4 | Docker Compose integration | 1 day | Phase 1 | High |
| Phase 6 | DB runtime registry extensions | 1-2 days | Phase 2 | Medium |
| Phase 5 | API binary download endpoint | 1 day | Phase 1 | Medium |
| Phase 7 | Kubernetes manifests | 1-2 days | Phase 4, 5 | Complete |

Total estimated effort: 10-16 days

Phases 2 and 3 can be done in parallel. Phase 4 can start as soon as Phase 1 produces a working binary.

Minimum viable feature: Phases 1 + 3 + 4 (~6-9 days) produce a working agent that can be injected into any container via Docker Compose, with manual ATTUNE_WORKER_RUNTIMES configuration. Auto-detection (Phase 2) and dynamic registration (Phase 6) add polish.

Risks & Mitigations

musl + Crate Compatibility

Risk: Some crates may not compile cleanly with x86_64-unknown-linux-musl due to C library dependencies.

Impact: Build failures or runtime issues.

Mitigation:

  • SQLx already uses rustls (no OpenSSL dependency)
  • Switch reqwest and tokio-tungstenite to rustls features (Phase 1.1)
  • lapin uses pure Rust AMQP — no C dependencies
  • Test the musl build early in Phase 1 to surface issues quickly
  • If a specific crate is problematic, evaluate alternatives or use cross for cross-compilation
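The rustls feature switches from the bullets above might look like this in Cargo.toml. This is a sketch only: the version numbers are placeholders, and the exact feature names should be verified against the versions pinned in the workspace:

```toml
# Sketch only — verify feature names against the crate versions the workspace pins.
[dependencies]
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls"] }
tokio-tungstenite = { version = "0.23", default-features = false, features = ["rustls-tls-webpki-roots"] }
sqlx = { version = "0.8", default-features = false, features = ["runtime-tokio", "tls-rustls", "postgres"] }
```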

DNS Resolution with musl

Risk: musl's DNS resolver behaves differently from glibc (no /etc/nsswitch.conf, limited mDNS support). This can cause DNS resolution failures in Docker networks.

Impact: Agent can't resolve postgres, rabbitmq, etc. by Docker service name.

Mitigation:

  • Use trust-dns (now hickory-dns) resolver feature in SQLx and reqwest instead of the system resolver
  • Test DNS resolution in Docker Compose early
  • If issues arise, document the workaround: use IP addresses or add dns configuration to the container
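As a sketch, the resolver switch is a Cargo feature flag. The `hickory-dns` feature name is current as of reqwest 0.12 (it was `trust-dns` in earlier versions); the equivalent switch for the SQLx version in use should be checked rather than assumed:

```toml
# Sketch: opt reqwest into the hickory (formerly trust-dns) resolver
# instead of the system getaddrinfo, avoiding musl's resolver quirks.
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "hickory-dns"] }
```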

Binary Size

Risk: A full statically-linked binary with all worker deps could be 40MB+.

Impact: Slow volume population, slow download via API.

Mitigation:

  • Strip debug symbols (strip command), typically reducing size by 50-70%
  • Use opt-level = 'z' and lto = true in release profile
  • Consider upx compression (trades CPU at startup for smaller binary)
  • Feature-gate unused functionality if size is excessive
  • Target: <25MB stripped
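The size-oriented build settings above can be collected in a release profile. A sketch, not the project's actual configuration: `codegen-units` and `panic` are common additions beyond the flags named in the bullets, and `strip = true` (Rust 1.59+) replaces the separate strip step:

```toml
# Sketch of a size-focused release profile for the agent binary.
[profile.release]
opt-level = "z"    # optimize for size rather than speed
lto = true         # whole-program link-time optimization
codegen-units = 1  # better optimization at the cost of compile time
strip = true       # strip symbols at link time (Rust 1.59+)
panic = "abort"    # optional: drops unwinding machinery for further savings
```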

Non-root User Conflicts

Risk: Different base images run as different UIDs. The agent needs write access to runtime_envs and artifacts volumes.

Impact: Permission denied errors when the container UID doesn't match the volume owner.

Mitigation:

  • Document the UID requirement (current standard: UID 1000)
  • Provide guidance for running the agent as root with privilege drop
  • Consider adding a --user flag to the agent that drops privileges after setup
  • For Kubernetes, use securityContext.runAsUser in the Pod spec
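For the Kubernetes case in the last bullet, the Pod-level settings might look like the following sketch, using the documented standard UID 1000 from above; `fsGroup` additionally covers PVC-backed volume ownership:

```yaml
# Sketch: align the worker container's UID/GID with the volume owner (UID 1000).
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000   # volumes are made group-writable for this GID, covering PVC mounts
```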

Existing Workers Must Keep Working

Risk: Refactoring WorkerService (Phase 3) could introduce regressions in existing workers.

Impact: Production workers break.

Mitigation:

  • The refactoring is additive — existing code paths don't change behavior
  • Run the full test suite after Phase 3
  • Both attune-worker and attune-agent share the same test infrastructure
  • The StartupMode::Worker path is the existing code path with no behavioral changes

Volume Mount Ordering

Risk: The agent container starts before the init-agent service has populated the volume.

Impact: Agent binary not found, container crashes.

Mitigation:

  • Use depends_on: { init-agent: { condition: service_completed_successfully } } in Docker Compose
  • The wrapper script (Phase 5.2) retries with a short sleep
  • For Kubernetes, the initContainer pattern guarantees ordering
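The Compose-side ordering guard from the first bullet, sketched below. The worker service name and volume name are illustrative; `init-agent` is the populate-the-volume service named above:

```yaml
# Sketch: the worker starts only after init-agent exits successfully,
# guaranteeing the agent binary is on the shared volume first.
services:
  worker-ruby:
    image: ruby:3.3
    depends_on:
      init-agent:
        condition: service_completed_successfully
    volumes:
      - agent-bin:/opt/attune/agent:ro
```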

Testing Strategy

Unit Tests

  • Auto-detection module: mock filesystem and process execution to test interpreter discovery
  • StartupMode::Agent code paths: ensure lazy setup and on-demand verification work correctly
  • All existing worker tests continue to pass (regression safety net)

Integration Tests

  • Build the agent binary with musl and run it in various container images:
    • ruby:3.3-slim (Ruby + shell)
    • python:3.12-slim (Python + shell)
    • node:20-slim (Node.js + shell)
    • alpine:3.20 (shell only)
    • ubuntu:24.04 (shell only)
    • debian:bookworm-slim (shell only, matches current worker)
  • Verify: agent starts, auto-detects runtimes, registers with correct capabilities, executes a simple action, reports results
  • Verify: DNS resolution works for Docker service names

Docker Compose Tests

  • Spin up the full stack with agent-based workers alongside traditional workers
  • Execute actions that target specific runtimes
  • Verify the scheduler routes to the correct worker based on capabilities
  • Verify graceful shutdown (SIGTERM handling)

Binary Compatibility Tests

  • Test the musl binary on: Alpine, Debian, Ubuntu, CentOS/Rocky, Amazon Linux
  • Test on both x86_64 and aarch64 (if multi-arch build is implemented)
  • Verify no glibc dependency: ldd attune-agent should report "not a dynamic executable"

Future Enhancements

These are not part of the initial implementation but are natural extensions:

  1. Per-execution container isolation: Instead of a long-running agent, spawn a fresh container per execution with the agent injected. Provides maximum isolation (each action runs in a clean environment) at the cost of startup latency.

  2. Container image selection in action YAML: Allow actions to declare container: ruby:3.3 in their YAML, and have the executor spin up an appropriate container with the agent injected. Similar to GitHub Actions' container actions.

  3. Warm pool: Pre-start a pool of agent containers for common runtimes to reduce first-execution latency.

  4. Agent self-update: The agent periodically checks for a newer version of itself (via the API endpoint) and restarts if updated.

  5. Windows support: Cross-compile the agent for Windows (MSVC static linking) to support Windows containers.

  6. WebAssembly runtime: Compile actions to WASM and execute them inside the agent using wasmtime, eliminating the need for interpreter binaries entirely.

References