attune/docs/plans/universal-worker-agent.md
# Universal Worker Agent Injection
## Overview
This plan describes a new deployment model for Attune workers: a **statically-linked agent binary** that can be injected into any Docker container at runtime, turning arbitrary images into Attune workers. This eliminates the need to build custom worker Docker images for each runtime environment.
### Problem
Today, every Attune worker is a purpose-built Docker image: the same `attune-worker` Rust binary baked into Debian images with specific interpreters installed (see `docker/Dockerfile.worker.optimized`). Adding a new runtime (e.g., Ruby, Go, Java, R) means:
1. Modifying `Dockerfile.worker.optimized` to add a new build stage
2. Installing the interpreter via apt or a package repository
3. Managing the combinatorial explosion of worker variants
4. Rebuilding images (~5 min) for every change
5. Accepting `debian:bookworm-slim` as the base image (rather than the runtime's official image)
### Solution
Flip the model: **any Docker image becomes an Attune worker** by injecting a lightweight agent binary at container startup. The agent binary is a statically-linked (musl) Rust executable that connects to MQ/DB, consumes execution messages, spawns subprocesses, and reports results — functionally identical to the current worker, but packaged for universal deployment.
Want Ruby support? Point at `ruby:3.3` and go. Need a GPU runtime? Use `nvidia/cuda:12.3-runtime`. Need a specific Python version with scientific libraries pre-installed? Use any image that has them.
### Industry Precedent
This pattern is battle-tested in major CI/CD and workflow systems:
| System | Pattern | How It Works |
|--------|---------|-------------|
| **Tekton** | InitContainer + shared volume | Copies a static Go `entrypoint` binary into an `emptyDir`; overrides the user container's entrypoint to use it. Steps coordinate via file-based signaling. |
| **Argo Workflows (Emissary)** | InitContainer + sidecar | The `emissary` binary runs as both an init container and a sidecar. Disk-based coordination, no Docker socket, no privileged access. |
| **GitLab CI Runner (Step Runner)** | Binary injection | Newer "Native Step Runner" mode injects a `step-runner` binary into the build container and adjusts `$PATH`. Communicates via gRPC. |
| **Istio** | Mutating webhook | Kubernetes admission controller adds init + sidecar containers transparently. |
The **Tekton/Argo pattern** (static binary + shared volume) is the best fit for Attune because:
- It works with Docker Compose (not K8s-only) via bind mounts / named volumes
- It requires zero dependencies in the user image (just a Linux kernel)
- A static Rust binary (musl-linked) is ~15–25 MB and runs anywhere
- No privileged access, no Docker socket needed inside the container
### Compatibility
This plan is **purely additive**. Nothing changes for existing workers:
- `Dockerfile.worker.optimized` and its four targets remain unchanged and functional
- Current `docker-compose.yaml` worker services keep working
- All MQ protocols, DB schemas, and execution flows remain identical
- The agent is just another way to run the same execution engine
## Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                    Attune Control Plane                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │   API    │  │ Executor │  │ RabbitMQ │  │ Postgres │      │
│  └────┬─────┘  └────┬─────┘  └─────┬────┘  └────┬─────┘      │
│       │             │              │            │            │
└───────┼─────────────┼──────────────┼────────────┼────────────┘
        │             │              │            │
┌───────┼─────────────┼──────────────┼────────────┼────────────┐
│       ▼             ▼              ▼            ▼            │
│  ┌────────────────────────────────────────────────────┐      │
│  │           attune-agent (injected binary)           │      │
│  │  ┌──────────┐ ┌──────────┐ ┌─────────┐ ┌────────┐  │      │
│  │  │MQ Client │ │DB Client │ │ Process │ │Artifact│  │      │
│  │  │ (lapin)  │ │  (sqlx)  │ │Executor │ │Manager │  │      │
│  │  └──────────┘ └──────────┘ └─────────┘ └────────┘  │      │
│  └────────────────────────────────────────────────────┘      │
│                                                              │
│  ┌────────────────────────────────────────────────────┐      │
│  │          User Container (ANY Docker image)         │      │
│  │   ruby:3.3, python:3.12, nvidia/cuda, alpine, ...  │      │
│  └────────────────────────────────────────────────────┘      │
│                                                              │
│  Shared Volumes:                                             │
│    /opt/attune/agent/        (agent binary, read-only)       │
│    /opt/attune/packs/        (pack files, read-only)         │
│    /opt/attune/runtime_envs/ (virtualenvs, node_modules)     │
│    /opt/attune/artifacts/    (artifact files)                │
└──────────────────────────────────────────────────────────────┘
```
### Agent vs. Full Worker Comparison
The agent binary is functionally identical to the current `attune-worker`. The difference is packaging and startup behavior:
| Capability | Full Worker (`attune-worker`) | Agent (`attune-agent`) |
|-----------|------------------------------|------------------------|
| MQ consumption | ✅ | ✅ |
| DB access | ✅ | ✅ |
| Process execution | ✅ | ✅ |
| Artifact management | ✅ | ✅ |
| Secret management | ✅ | ✅ |
| Cancellation / timeout | ✅ | ✅ |
| Heartbeat | ✅ | ✅ |
| Runtime env setup (venvs) | ✅ Proactive at startup | ✅ Lazy on first use |
| Version verification | ✅ Full sweep at startup | ✅ On-demand per-execution |
| Runtime discovery | Manual (`ATTUNE_WORKER_RUNTIMES`) | Auto-detect + optional manual override |
| Linking | Dynamic (glibc) | Static (musl) |
| Base image requirement | `debian:bookworm-slim` | None (any Linux container) |
| Binary size | ~30–50 MB | ~15–25 MB (stripped, musl) |
### Binary Distribution Methods
Two methods for getting the agent binary into a container:
**Method A: Shared Volume (Docker Compose — recommended)**
An init container copies the agent binary into a Docker named volume. User containers mount this volume read-only and use the binary as their entrypoint.
**Method B: HTTP Download (remote / cloud deployments)**
A new API endpoint (`GET /api/v1/agent/binary`) serves the static binary. A small wrapper script in the container downloads it on first run. Useful for Kubernetes, ECS, or remote Docker hosts where shared volumes are impractical.
## Implementation Phases
### Phase 1: Static Binary Build Infrastructure
**Goal**: Produce a statically-linked `attune-agent` binary that runs in any Linux container.
**Effort**: 3–5 days
**Dependencies**: None
#### 1.1 TLS Backend Audit and Alignment
The agent must link statically with musl. This requires all TLS to use `rustls` (pure Rust) instead of OpenSSL/native-tls.
**Current state** (from `Cargo.toml` workspace dependencies):
- `sqlx`: Already uses `runtime-tokio-rustls`
- `reqwest`: Uses default features (native-tls) — needs `rustls-tls` feature ❌
- `tokio-tungstenite`: Uses `native-tls` feature — needs `rustls`
- `lapin` (v4.3): Uses native-tls by default — needs `rustls` feature ❌
**Changes needed in workspace `Cargo.toml`**:
```toml
# Change reqwest to use rustls
reqwest = { version = "0.13", features = ["json", "rustls-tls"], default-features = false }
# Change tokio-tungstenite to use rustls
tokio-tungstenite = { version = "0.28", features = ["rustls"] }
# Check lapin's TLS features — if using amqps://, need rustls support.
# For plain amqp:// (typical in Docker Compose), no TLS needed.
# For production amqps://, evaluate lapin's rustls support or use a TLS-terminating proxy.
```
**Important**: These changes affect the entire workspace. Test all services (`api`, `executor`, `worker`, `notifier`, `sensor`, `cli`) after switching TLS backends. If switching workspace-wide is too disruptive, use feature flags to conditionally select the TLS backend for the agent build only.
**Alternative**: If workspace-wide rustls migration is too risky, the agent crate can override specific dependencies:
```toml
[dependencies]
reqwest = { workspace = true, default-features = false, features = ["json", "rustls-tls"] }
```
#### 1.2 New Crate or New Binary Target
**Option A (recommended): New binary target in the worker crate**
Add a second binary target to `crates/worker/Cargo.toml`:
```toml
[[bin]]
name = "attune-worker"
path = "src/main.rs"
[[bin]]
name = "attune-agent"
path = "src/agent_main.rs"
```
This reuses all existing code — `ActionExecutor`, `ProcessRuntime`, `WorkerService`, `RuntimeRegistry`, `SecretManager`, `ArtifactManager`, etc. The agent entrypoint is a thin wrapper with different startup behavior (auto-detection instead of manual config).
**Pros**: Zero code duplication. Same test suite covers both binaries.
**Cons**: The agent binary includes unused code paths (e.g., full worker service setup).
**Option B: New crate `crates/agent/`**
```
crates/agent/
├── Cargo.toml        # Depends on attune-common + selected worker modules
└── src/
    ├── main.rs       # Entry point
    ├── agent.rs      # Core agent loop
    ├── detect.rs     # Runtime auto-detection
    └── health.rs     # Health check (file-based or tiny HTTP)
```
**Pros**: Cleaner separation, can minimize binary size by excluding unused deps.
**Cons**: Requires extracting shared execution code into a library or duplicating it.
**Recommendation**: Start with **Option A** (new binary target in worker crate) for speed. Refactor into a separate crate later if binary size becomes a concern.
#### 1.3 Agent Entrypoint (`src/agent_main.rs`)
The agent entrypoint differs from `main.rs` in:
1. **Runtime auto-detection** instead of relying on `ATTUNE_WORKER_RUNTIMES`
2. **Lazy environment setup** instead of proactive startup sweep
3. **Simplified config loading** — env vars are the primary config source (no config file required, but supported if mounted)
4. **Container-aware defaults** — sensible defaults for paths, timeouts, concurrency
```
src/agent_main.rs responsibilities:
1. Parse CLI args / env vars for DB URL, MQ URL, worker name
2. Run runtime auto-detection (Phase 2) to discover available interpreters
3. Initialize WorkerService with detected capabilities
4. Start the normal execution consumer loop
5. Handle SIGTERM/SIGINT for graceful shutdown
```
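Step 1 of the responsibilities above can be sketched in a few lines. This is a hypothetical, std-only sketch — the variable names mirror the compose and Kubernetes examples elsewhere in this plan, and the env lookup is injected as a closure so the logic is testable without touching the process environment:

```rust
/// Illustrative agent config (not the real struct in the codebase).
#[derive(Debug)]
pub struct AgentConfig {
    pub worker_name: String,
    pub database_url: String,
    pub mq_url: String,
}

/// Resolve agent settings from env vars with container-friendly defaults.
/// `get` abstracts `std::env::var`; pass `&|k| std::env::var(k).ok()` in main.
pub fn load_config(get: &dyn Fn(&str) -> Option<String>) -> Result<AgentConfig, String> {
    // Default the worker name to the container hostname (Docker sets HOSTNAME).
    let hostname = get("HOSTNAME").unwrap_or_else(|| "agent".to_string());
    Ok(AgentConfig {
        worker_name: get("ATTUNE_WORKER_NAME")
            .unwrap_or_else(|| format!("agent-{hostname}")),
        // DB and MQ URLs are required; fail fast with a clear message.
        database_url: get("ATTUNE__DATABASE__URL")
            .ok_or_else(|| "ATTUNE__DATABASE__URL is not set".to_string())?,
        mq_url: get("ATTUNE__MESSAGE_QUEUE__URL")
            .ok_or_else(|| "ATTUNE__MESSAGE_QUEUE__URL is not set".to_string())?,
    })
}
```

Failing fast on missing URLs keeps misconfiguration visible in container logs rather than surfacing later as connection errors.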
#### 1.4 Dockerfile for Agent Binary
Create `docker/Dockerfile.agent`:
```dockerfile
# Stage 1: Build the statically-linked agent binary
FROM rust:1.83-bookworm AS builder
RUN apt-get update && apt-get install -y musl-tools
RUN rustup target add x86_64-unknown-linux-musl
WORKDIR /build
ENV RUST_MIN_STACK=67108864
COPY Cargo.toml Cargo.lock ./
COPY crates/ ./crates/
COPY migrations/ ./migrations/
COPY .sqlx/ ./.sqlx/
# Build only the agent binary, statically linked
RUN --mount=type=cache,target=/usr/local/cargo/registry,sharing=shared \
    --mount=type=cache,target=/usr/local/cargo/git,sharing=shared \
    --mount=type=cache,id=agent-target,target=/build/target \
    SQLX_OFFLINE=true cargo build --release \
      --target x86_64-unknown-linux-musl \
      --bin attune-agent \
    && cp /build/target/x86_64-unknown-linux-musl/release/attune-agent /attune-agent \
    && strip /attune-agent
# Stage 2: Minimal image for volume population
FROM scratch AS agent-binary
COPY --from=builder /attune-agent /attune-agent
```
**Multi-architecture support**: For ARM64 (Apple Silicon, Graviton), add a parallel build stage targeting `aarch64-unknown-linux-musl`. Use Docker buildx multi-platform builds or separate images.
#### 1.5 Makefile Targets
Add to `Makefile`:
```makefile
build-agent:
	SQLX_OFFLINE=true cargo build --release --target x86_64-unknown-linux-musl --bin attune-agent
	strip target/x86_64-unknown-linux-musl/release/attune-agent

docker-build-agent:
	docker buildx build -f docker/Dockerfile.agent -t attune-agent:latest .
```
---
### Phase 2: Runtime Auto-Detection
**Goal**: The agent automatically discovers what interpreters are available in the container, without requiring `ATTUNE_WORKER_RUNTIMES` to be set.
**Effort**: 1–2 days
**Dependencies**: Phase 1 (agent binary exists)
#### 2.1 Interpreter Discovery Module
Create a new module (in `crates/worker/src/` or `crates/common/src/`) that probes the container's filesystem for known interpreters:
```
src/runtime_detect.rs (or extend existing crates/common/src/runtime_detection.rs)
struct DetectedInterpreter {
    runtime_name: String,     // "python", "ruby", "node", etc.
    binary_path: PathBuf,     // "/usr/local/bin/python3"
    version: Option<String>,  // "3.12.1" (parsed from version command output)
}

/// Probe the container for available interpreters.
///
/// For each known runtime, checks common binary names via `which` or
/// direct path existence, then runs the version command to extract
/// the version string.
fn detect_interpreters() -> Vec<DetectedInterpreter> {
    let probes = [
        InterpreterProbe {
            runtime_name: "python",
            binaries: &["python3", "python"],
            version_flag: "--version",
            version_regex: r"Python (\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "node",
            binaries: &["node", "nodejs"],
            version_flag: "--version",
            version_regex: r"v(\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "ruby",
            binaries: &["ruby"],
            version_flag: "--version",
            version_regex: r"ruby (\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "go",
            binaries: &["go"],
            version_flag: "version",
            version_regex: r"go(\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "java",
            binaries: &["java"],
            version_flag: "-version",
            version_regex: r#""(\d+[\.\d+]*)""#,
        },
        InterpreterProbe {
            runtime_name: "perl",
            binaries: &["perl"],
            version_flag: "--version",
            version_regex: r"v(\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "r",
            binaries: &["Rscript", "R"],
            version_flag: "--version",
            version_regex: r"R.*version (\d+\.\d+\.\d+)",
        },
        InterpreterProbe {
            runtime_name: "shell",
            binaries: &["bash", "sh"],
            version_flag: "--version",
            version_regex: r"(\d+\.\d+\.\d+)",
        },
    ];

    // For each probe:
    // 1. Run `which <binary>` or check known paths
    // 2. If found, run `<binary> <version_flag>` with a short timeout (2s)
    // 3. Parse version from output using the regex
    // 4. Return DetectedInterpreter with the results
}
```
**Integration with existing code**: The existing `crates/common/src/runtime_detection.rs` already has `normalize_runtime_name()` and alias groups. The auto-detection module should use these for matching detected interpreters against DB runtime records.
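The probe table pairs each runtime with a version regex; the parsing step itself is simple enough to sketch without a regex dependency. A hypothetical, std-only version of it (the real module would apply the per-runtime regexes above):

```rust
/// Pull the first dotted version number (e.g. "3.12.1") out of
/// interpreter output. Illustrative only; works for typical outputs
/// like "Python 3.12.1", "v20.11.0", or "go version go1.22.0".
fn extract_version(output: &str) -> Option<String> {
    let bytes = output.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i].is_ascii_digit() {
            let start = i;
            let mut dots = 0;
            // Consume a run of digits and dots: "3.12.1" in "Python 3.12.1".
            while i < bytes.len() && (bytes[i].is_ascii_digit() || bytes[i] == b'.') {
                if bytes[i] == b'.' {
                    dots += 1;
                }
                i += 1;
            }
            let candidate = &output[start..i];
            // Accept "major.minor" or longer; reject a bare number or "1.".
            if dots >= 1 && !candidate.ends_with('.') {
                return Some(candidate.to_string());
            }
        } else {
            i += 1;
        }
    }
    None
}
```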
#### 2.2 Integration with Worker Registration
The agent startup sequence:
1. Run `detect_interpreters()`
2. Match detected interpreters against known runtimes in the `runtime` table (using alias-aware matching from `runtime_detection.rs`)
3. If `ATTUNE_WORKER_RUNTIMES` is set, use it as an override (intersection or union — TBD, probably override wins)
4. Register the worker with the detected/configured capabilities
5. Log what was detected for debugging:
```
[INFO] Detected runtimes: python 3.12.1 (/usr/local/bin/python3), ruby 3.3.0 (/usr/local/bin/ruby), shell 5.2.21 (/bin/bash)
[INFO] Registering worker with capabilities: [python, ruby, shell]
```
#### 2.3 Runtime Hints File (Optional Enhancement)
Allow a `.attune-runtime.yaml` file in the container that declares runtime capabilities and custom configuration. This handles cases where auto-detection isn't sufficient (e.g., custom interpreters, non-standard paths, special environment setup).
```yaml
# /opt/attune/.attune-runtime.yaml (or /.attune-runtime.yaml)
runtimes:
  - name: ruby
    interpreter: /usr/local/bin/ruby
    file_extension: .rb
    version_command: "ruby --version"
    env_setup:
      create_command: "mkdir -p {env_dir}"
      install_command: "cd {env_dir} && bundle install --gemfile {pack_dir}/Gemfile"
  - name: custom-ml
    interpreter: /opt/conda/bin/python
    file_extension: .py
    version_command: "/opt/conda/bin/python --version"
```
The agent checks for this file at startup and merges it with auto-detected runtimes (hints file takes precedence for conflicting runtime names).
**This is a nice-to-have for Phase 2 — implement only if auto-detection proves insufficient for common use cases.**
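The merge rule described above (auto-detected runtimes as the base, hints-file entries winning on name conflicts) is a one-liner worth pinning down. A hypothetical sketch, keying runtimes by name and mapping to interpreter path:

```rust
use std::collections::BTreeMap;

/// Merge auto-detected runtimes with hints-file entries.
/// Both inputs are (runtime_name, interpreter_path) pairs; hints-file
/// entries override detected ones when the names collide.
fn merge_runtimes(
    detected: &[(String, String)],
    hints: &[(String, String)],
) -> BTreeMap<String, String> {
    let mut merged: BTreeMap<String, String> = detected.iter().cloned().collect();
    for (name, path) in hints {
        // Hints take precedence for conflicting runtime names.
        merged.insert(name.clone(), path.clone());
    }
    merged
}
```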
---
### Phase 3: Refactor Worker for Code Reuse
**Goal**: Ensure the execution engine is cleanly reusable between the full `attune-worker` and the `attune-agent` binary, without code duplication.
**Effort**: 2–3 days
**Dependencies**: Phase 1 (agent entrypoint exists), can be done in parallel with Phase 2
#### 3.1 Identify Shared vs. Agent-Specific Code
Current worker crate modules and their reuse status:
| Module | File(s) | Shared? | Notes |
|--------|---------|---------|-------|
| `ActionExecutor` | `executor.rs` | ✅ Fully shared | Core execution orchestration |
| `ProcessRuntime` | `runtime/process.rs` | ✅ Fully shared | Subprocess spawning, interpreter resolution |
| `process_executor` | `runtime/process_executor.rs` | ✅ Fully shared | Streaming output capture, timeout, cancellation |
| `NativeRuntime` | `runtime/native.rs` | ✅ Fully shared | Direct binary execution |
| `LocalRuntime` | `runtime/local.rs` | ✅ Fully shared | Fallback runtime facade |
| `RuntimeRegistry` | `runtime/mod.rs` | ✅ Fully shared | Runtime selection and registration |
| `ExecutionContext` | `runtime/mod.rs` | ✅ Fully shared | Execution parameters, env vars, secrets |
| `BoundedLogWriter` | `runtime/log_writer.rs` | ✅ Fully shared | Streaming log capture with size limits |
| `parameter_passing` | `runtime/parameter_passing.rs` | ✅ Fully shared | Stdin/file/env parameter delivery |
| `SecretManager` | `secrets.rs` | ✅ Fully shared | Secret decryption via `attune_common::crypto` |
| `ArtifactManager` | `artifacts.rs` | ✅ Fully shared | Artifact finalization (file stat, size update) |
| `HeartbeatManager` | `heartbeat.rs` | ✅ Fully shared | Periodic DB heartbeat |
| `WorkerRegistration` | `registration.rs` | ✅ Shared, extended | Needs auto-detection integration |
| `env_setup` | `env_setup.rs` | ✅ Shared, lazy mode | Agent uses lazy setup instead of proactive |
| `version_verify` | `version_verify.rs` | ✅ Shared, on-demand mode | Agent verifies on-demand instead of full sweep |
| `WorkerService` | `service.rs` | ⚠️ Needs refactoring | Extract reusable `AgentService` or parameterize |
**Conclusion**: Almost everything is already reusable. The main work is in `service.rs`, which needs to be parameterized for the two startup modes (proactive vs. lazy).
#### 3.2 Refactor `WorkerService` for Dual Modes
Instead of duplicating `WorkerService`, add a configuration enum:
```rust
// In service.rs or a new config module
/// Controls how the worker initializes its runtime environment.
pub enum StartupMode {
    /// Full worker mode: proactive environment setup, full version
    /// verification sweep at startup. Used by `attune-worker`.
    Worker,
    /// Agent mode: lazy environment setup (on first use), on-demand
    /// version verification, auto-detected runtimes. Used by `attune-agent`.
    Agent {
        /// Runtimes detected by the auto-detection module.
        detected_runtimes: Vec<DetectedInterpreter>,
    },
}
```
The `WorkerService::start()` method checks this mode:
```rust
match &self.startup_mode {
    StartupMode::Worker => {
        // Existing behavior: full version verification sweep
        self.verify_all_runtime_versions().await?;
        // Existing behavior: proactive environment setup for all packs
        self.setup_all_environments().await?;
    }
    StartupMode::Agent { .. } => {
        // Skip proactive setup — will happen lazily on first execution
        info!("Agent mode: deferring environment setup to first execution");
    }
}
```
#### 3.3 Lazy Environment Setup
In agent mode, the first execution for a given pack+runtime combination triggers environment setup:
```rust
// In executor.rs, within execute_with_cancel()
// Before executing, ensure the runtime environment exists
if !env_dir.exists() {
    info!("Creating runtime environment on first use: {}", env_dir.display());
    self.env_setup.setup_environment(&pack_ref, &runtime_name, &env_dir).await?;
}
```
The current worker already handles this partially — the `ProcessRuntime::execute()` method has auto-repair logic for broken venvs. The lazy setup extends this to handle the case where the env directory doesn't exist at all.
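With concurrent executions, only the first one for a given pack+runtime pair should run the (slow) setup; later ones should skip it. A minimal sketch of that guard, assuming an in-memory set is acceptable (the real implementation would also hold a per-key lock across the async setup call so concurrent first executions wait for it to finish):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Tracks which pack+runtime environments have been set up.
/// Names are illustrative, not from the codebase.
struct LazySetupGuard {
    ready: Mutex<HashSet<(String, String)>>,
}

impl LazySetupGuard {
    fn new() -> Self {
        Self { ready: Mutex::new(HashSet::new()) }
    }

    /// Returns true exactly once per (pack, runtime) pair; the caller
    /// that sees `true` performs the environment setup.
    fn should_setup(&self, pack: &str, runtime: &str) -> bool {
        let mut ready = self.ready.lock().unwrap();
        // HashSet::insert returns true only if the value was newly inserted.
        ready.insert((pack.to_string(), runtime.to_string()))
    }
}
```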
---
### Phase 4: Docker Compose Integration
**Goal**: Make it trivial to add agent-based workers to `docker-compose.yaml`.
**Effort**: 1 day
**Dependencies**: Phase 1 (agent binary and Dockerfile exist)
#### 4.1 Init Service for Agent Volume
Add to `docker-compose.yaml`:
```yaml
services:
  # Populates the agent binary volume (runs once)
  init-agent:
    build:
      context: .
      dockerfile: docker/Dockerfile.agent
    volumes:
      - agent_bin:/opt/attune/agent
    entrypoint: ["/bin/sh", "-c", "cp /attune-agent /opt/attune/agent/attune-agent && chmod +x /opt/attune/agent/attune-agent"]
    restart: "no"
    networks:
      - attune

volumes:
  agent_bin:  # Named volume holding the static agent binary
```
Note: The init-agent service needs a minimal base with `/bin/sh` for the `cp` command. Since the agent Dockerfile's final stage is `FROM scratch`, the init service should use the builder stage or a separate `FROM alpine` stage.
**Revised Dockerfile.agent approach** — use Alpine for the init image so it has a shell:
```dockerfile
# Stage 1: Build
FROM rust:1.83-bookworm AS builder
# ... (build steps from Phase 1.4)
# Stage 2: Init image (has a shell for cp)
FROM alpine:3.20 AS agent-init
COPY --from=builder /attune-agent /attune-agent
# Default command copies the binary into the mounted volume
CMD ["cp", "/attune-agent", "/opt/attune/agent/attune-agent"]
# Stage 3: Bare binary (for HTTP download or direct use)
FROM scratch AS agent-binary
COPY --from=builder /attune-agent /attune-agent
```
#### 4.2 Agent-Based Worker Services
Example services that can be added to `docker-compose.yaml` or a user's `docker-compose.override.yaml`:
```yaml
# Ruby worker — uses the official Ruby image
worker-ruby:
  image: ruby:3.3-slim
  depends_on:
    init-agent:
      condition: service_completed_successfully
    postgres:
      condition: service_healthy
    rabbitmq:
      condition: service_healthy
  entrypoint: ["/opt/attune/agent/attune-agent"]
  volumes:
    - agent_bin:/opt/attune/agent:ro
    - packs_data:/opt/attune/packs:ro
    - runtime_envs:/opt/attune/runtime_envs
    - artifacts_data:/opt/attune/artifacts
    - ${ATTUNE_DOCKER_CONFIG_PATH:-./config.docker.yaml}:/opt/attune/config/config.yaml:ro
  environment:
    ATTUNE_WORKER_NAME: worker-ruby-1
    # ATTUNE_WORKER_RUNTIMES omitted — auto-detected as ruby,shell
  networks:
    - attune
  restart: unless-stopped
  stop_grace_period: 45s

# R worker — uses the official R base image
worker-r:
  image: r-base:4.4.0
  depends_on:
    init-agent:
      condition: service_completed_successfully
    postgres:
      condition: service_healthy
    rabbitmq:
      condition: service_healthy
  entrypoint: ["/opt/attune/agent/attune-agent"]
  volumes:
    - agent_bin:/opt/attune/agent:ro
    - packs_data:/opt/attune/packs:ro
    - runtime_envs:/opt/attune/runtime_envs
    - artifacts_data:/opt/attune/artifacts
    - ${ATTUNE_DOCKER_CONFIG_PATH:-./config.docker.yaml}:/opt/attune/config/config.yaml:ro
  environment:
    ATTUNE_WORKER_NAME: worker-r-1
  networks:
    - attune
  restart: unless-stopped

# GPU worker — NVIDIA CUDA image with Python
worker-gpu:
  image: nvidia/cuda:12.3.1-runtime-ubuntu22.04
  depends_on:
    init-agent:
      condition: service_completed_successfully
    postgres:
      condition: service_healthy
    rabbitmq:
      condition: service_healthy
  entrypoint: ["/opt/attune/agent/attune-agent"]
  runtime: nvidia
  volumes:
    - agent_bin:/opt/attune/agent:ro
    - packs_data:/opt/attune/packs:ro
    - runtime_envs:/opt/attune/runtime_envs
    - artifacts_data:/opt/attune/artifacts
    - ${ATTUNE_DOCKER_CONFIG_PATH:-./config.docker.yaml}:/opt/attune/config/config.yaml:ro
  environment:
    ATTUNE_WORKER_NAME: worker-gpu-1
    ATTUNE_WORKER_RUNTIMES: python,shell  # Manual override (image has python pre-installed)
  networks:
    - attune
  restart: unless-stopped
```
#### 4.3 User Experience Summary
Adding a new runtime to an Attune deployment becomes a short (~20-line) addition to `docker-compose.override.yaml`:
```yaml
services:
  worker-my-runtime:
    image: my-org/my-custom-image:latest
    depends_on:
      init-agent:
        condition: service_completed_successfully
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    entrypoint: ["/opt/attune/agent/attune-agent"]
    volumes:
      - agent_bin:/opt/attune/agent:ro
      - packs_data:/opt/attune/packs:ro
      - runtime_envs:/opt/attune/runtime_envs
      - artifacts_data:/opt/attune/artifacts
      - ${ATTUNE_DOCKER_CONFIG_PATH:-./config.docker.yaml}:/opt/attune/config/config.yaml:ro
    networks:
      - attune
```
No Dockerfiles. No rebuilds. No waiting for Rust compilation. Start to finish in seconds.
---
### Phase 5: API Binary Download Endpoint
**Goal**: Support deployments where shared Docker volumes are impractical (Kubernetes, ECS, remote Docker hosts).
**Effort**: 1 day
**Dependencies**: Phase 1 (agent binary exists)
#### 5.1 New API Route
Add to `crates/api/src/routes/`:
```
GET /api/v1/agent/binary
GET /api/v1/agent/binary?arch=x86_64 (default)
GET /api/v1/agent/binary?arch=aarch64
Response: application/octet-stream
Headers: Content-Disposition: attachment; filename="attune-agent"
```
The API serves the binary from a configurable filesystem path (e.g., `/opt/attune/agent/attune-agent`). The binary can be placed there at build time (baked into the API image) or mounted via volume.
**Configuration** (`config.yaml`):
```yaml
agent:
  binary_dir: /opt/attune/agent  # Directory containing agent binaries
  # Files expected: attune-agent-x86_64, attune-agent-aarch64
```
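The endpoint's arch-to-filename mapping follows directly from the file naming convention above. A hypothetical sketch (the `amd64`/`arm64` aliases are an assumption, added because container platforms commonly report those names; an unknown arch maps to `None`, which the handler would turn into a 404):

```rust
/// Map a requested architecture to the expected binary filename.
/// Defaults to x86_64 when no `arch` query parameter is given.
fn agent_binary_filename(arch: Option<&str>) -> Option<&'static str> {
    match arch.unwrap_or("x86_64") {
        // "amd64"/"arm64" aliases are an assumption, not in the plan above.
        "x86_64" | "amd64" => Some("attune-agent-x86_64"),
        "aarch64" | "arm64" => Some("attune-agent-aarch64"),
        _ => None,
    }
}
```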
**OpenAPI documentation** via `utoipa`:
```rust
#[utoipa::path(
    get,
    path = "/api/v1/agent/binary",
    params(("arch" = Option<String>, Query, description = "Target architecture (x86_64, aarch64)")),
    responses(
        (status = 200, description = "Agent binary", content_type = "application/octet-stream"),
        (status = 404, description = "Binary not found for requested architecture"),
    ),
    tag = "agent"
)]
```
**Authentication**: This endpoint should be **unauthenticated** or use a simple shared token, since the agent needs to download the binary before it can authenticate. Alternatively, require basic auth or a bootstrap token passed via environment variable.
#### 5.2 Bootstrap Wrapper Script
Provide `scripts/attune-agent-wrapper.sh` for use as a container entrypoint:
```bash
#!/bin/sh
# attune-agent-wrapper.sh — Bootstrap the Attune agent in any container
set -e
AGENT_DIR="${ATTUNE_AGENT_DIR:-/opt/attune/agent}"
AGENT_BIN="$AGENT_DIR/attune-agent"
AGENT_URL="${ATTUNE_AGENT_URL:-http://attune-api:8080/api/v1/agent/binary}"
# Use volume-mounted binary if available, otherwise download
if [ ! -x "$AGENT_BIN" ]; then
    echo "[attune] Agent binary not found at $AGENT_BIN, downloading from $AGENT_URL..."
    mkdir -p "$AGENT_DIR"
    if command -v wget >/dev/null 2>&1; then
        wget -q -O "$AGENT_BIN" "$AGENT_URL"
    elif command -v curl >/dev/null 2>&1; then
        curl -sL "$AGENT_URL" -o "$AGENT_BIN"
    else
        echo "[attune] ERROR: Neither wget nor curl available. Cannot download agent." >&2
        exit 1
    fi
    chmod +x "$AGENT_BIN"
    echo "[attune] Agent binary downloaded successfully."
fi
echo "[attune] Starting agent..."
exec "$AGENT_BIN" "$@"
```
Usage:
```yaml
# In docker-compose or K8s — when volume mount isn't available
worker-remote:
  image: python:3.12-slim
  entrypoint: ["/opt/attune/scripts/attune-agent-wrapper.sh"]
  volumes:
    - ./scripts/attune-agent-wrapper.sh:/opt/attune/scripts/attune-agent-wrapper.sh:ro
  environment:
    ATTUNE_AGENT_URL: http://attune-api:8080/api/v1/agent/binary
```
---
### Phase 6: Database & Runtime Registry Extensions
**Goal**: Support arbitrary runtimes without requiring every possible runtime to be pre-registered in the DB.
**Effort**: 1–2 days
**Dependencies**: Phase 2 (auto-detection working)
#### 6.1 Extended Runtime Detection Metadata
Add a migration to support auto-detected runtimes:
```sql
-- Migration: NNNNNN_agent_runtime_detection.sql
-- Track whether a runtime was auto-registered by an agent
ALTER TABLE runtime ADD COLUMN IF NOT EXISTS auto_detected BOOLEAN NOT NULL DEFAULT FALSE;
-- Store detection configuration for auto-discovered runtimes
-- Example: { "binaries": ["ruby", "ruby3.2"], "version_command": "--version",
-- "version_regex": "ruby (\\d+\\.\\d+\\.\\d+)" }
ALTER TABLE runtime ADD COLUMN IF NOT EXISTS detection_config JSONB NOT NULL DEFAULT '{}';
```
#### 6.2 Runtime Template Packs
Ship pre-configured runtime definitions for common languages in the `core` pack (or a new `runtimes` pack). These are registered during pack loading and provide the `execution_config` that auto-detected interpreters need.
Add runtime YAML files for new languages:
```
packs/core/runtimes/ruby.yaml
packs/core/runtimes/go.yaml
packs/core/runtimes/java.yaml
packs/core/runtimes/perl.yaml
packs/core/runtimes/r.yaml
```
Example `ruby.yaml`:
```yaml
ref: core.ruby
name: Ruby
label: Ruby Runtime
description: Execute Ruby scripts
execution_config:
  interpreter:
    binary: ruby
    file_extension: .rb
    env_vars:
      GEM_HOME: "{env_dir}/gems"
      GEM_PATH: "{env_dir}/gems"
      BUNDLE_PATH: "{env_dir}/gems"
  environment:
    create_command: "mkdir -p {env_dir}/gems"
    install_command: "cd {pack_dir} && GEM_HOME={env_dir}/gems bundle install --quiet 2>/dev/null || true"
    dependency_file: Gemfile
```
#### 6.3 Dynamic Runtime Registration
When the agent detects an interpreter that matches a runtime template (by name/alias) but the runtime doesn't exist in the DB yet, the agent can auto-register it:
1. Look up the runtime by name in the DB using alias-aware matching
2. If found → use it (existing behavior)
3. If not found → check if a runtime template exists in loaded packs
4. If template found → register the runtime using the template's `execution_config`
5. If no template → register a minimal runtime with just the detected interpreter binary path
6. Mark auto-registered runtimes with `auto_detected = true`
This ensures the agent can work with new runtimes immediately, even if the runtime hasn't been explicitly configured.
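The lookup-or-register decision in steps 1–6 reduces to a three-way branch. A hypothetical sketch with the DB and pack lookups stubbed as name slices (the real code would do alias-aware queries against the `runtime` table and loaded packs):

```rust
/// Possible outcomes of the auto-registration decision.
#[derive(Debug, PartialEq)]
enum RegistrationAction {
    UseExisting,           // runtime already in the DB
    RegisterFromTemplate,  // register using a pack template's execution_config
    RegisterMinimal,       // register with just the detected interpreter path
}

/// Decide how to register a detected runtime, given the names already
/// in the DB and the runtime templates available in loaded packs.
fn decide(detected: &str, db_runtimes: &[&str], templates: &[&str]) -> RegistrationAction {
    if db_runtimes.contains(&detected) {
        RegistrationAction::UseExisting
    } else if templates.contains(&detected) {
        RegistrationAction::RegisterFromTemplate
    } else {
        RegistrationAction::RegisterMinimal
    }
}
```

Runtimes registered via the last two branches would be marked `auto_detected = true` per the migration above.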
---
### Phase 7: Kubernetes Support ✅
**Status**: Complete
**Goal**: Provide Kubernetes manifests and Helm chart support for agent-based workers.
**Effort**: 1–2 days
**Dependencies**: Phase 4 (Docker Compose working), Phase 5 (binary download)
**Implemented**:
- Helm chart `agent-workers.yaml` template — creates a Deployment per `agentWorkers[]` entry
- InitContainer pattern (`agent-loader`) copies the statically-linked binary via `emptyDir` volume
- Full scheduling support: `nodeSelector`, `tolerations`, `runtimeClassName` (GPU/nvidia)
- Runtime auto-detect by default; explicit `runtimes` list override
- Custom env vars, resource limits, log level, termination grace period
- `images.agent` added to `values.yaml` for registry-aware image resolution
- `attune-agent` image added to the Gitea Actions publish workflow (`agent-init` target)
- `NOTES.txt` updated to list enabled agent workers on install
- Quick-reference docs at `docs/QUICKREF-kubernetes-agent-workers.md`
#### 7.1 InitContainer Pattern
The agent maps naturally to Kubernetes using the same Tekton/Argo pattern:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: attune-worker-ruby
spec:
  replicas: 2
  selector:
    matchLabels:
      app: attune-worker-ruby
  template:
    metadata:
      labels:
        app: attune-worker-ruby
    spec:
      initContainers:
        - name: agent-loader
          image: attune/agent:latest  # Built from Dockerfile.agent, agent-init target
          command: ["cp", "/attune-agent", "/opt/attune/agent/attune-agent"]
          volumeMounts:
            - name: agent-bin
              mountPath: /opt/attune/agent
      containers:
        - name: worker
          image: ruby:3.3
          command: ["/opt/attune/agent/attune-agent"]
          env:
            - name: ATTUNE__DATABASE__URL
              valueFrom:
                secretKeyRef:
                  name: attune-secrets
                  key: database-url
            - name: ATTUNE__MESSAGE_QUEUE__URL
              valueFrom:
                secretKeyRef:
                  name: attune-secrets
                  key: mq-url
          volumeMounts:
            - name: agent-bin
              mountPath: /opt/attune/agent
              readOnly: true
            - name: packs
              mountPath: /opt/attune/packs
              readOnly: true
            - name: runtime-envs
              mountPath: /opt/attune/runtime_envs
            - name: artifacts
              mountPath: /opt/attune/artifacts
      volumes:
        - name: agent-bin
          emptyDir: {}
        - name: packs
          persistentVolumeClaim:
            claimName: attune-packs
        - name: runtime-envs
          persistentVolumeClaim:
            claimName: attune-runtime-envs
        - name: artifacts
          persistentVolumeClaim:
            claimName: attune-artifacts
```
#### 7.2 Helm Chart Values
```yaml
# values.yaml (future Helm chart)
workers:
- name: ruby
image: ruby:3.3
replicas: 2
runtimes: [] # auto-detect
- name: python-gpu
image: nvidia/cuda:12.3.1-runtime-ubuntu22.04
replicas: 1
runtimes: [python, shell]
resources:
limits:
nvidia.com/gpu: 1
```
---
## Implementation Order & Effort Summary
| Phase | Description | Effort | Dependencies | Priority |
|-------|------------|--------|-------------|----------|
| **Phase 1** | Static binary build infrastructure | 3–5 days | None | Critical |
| **Phase 3** | Refactor worker for code reuse | 2–3 days | Phase 1 | Critical |
| **Phase 2** | Runtime auto-detection | 1–2 days | Phase 1 | High |
| **Phase 4** | Docker Compose integration | 1 day | Phase 1 | High |
| **Phase 6** | DB runtime registry extensions | 1–2 days | Phase 2 | Medium |
| **Phase 5** | API binary download endpoint | 1 day | Phase 1 | Medium |
| **Phase 7** ✅ | Kubernetes manifests | 1–2 days | Phase 4, 5 | Complete |
**Total estimated effort: 10–16 days**
Phases 2 and 3 can be done in parallel. Phase 4 can start as soon as Phase 1 produces a working binary.
**Minimum viable feature**: Phases 1 + 3 + 4 (~6–9 days) produce a working agent that can be injected into any container via Docker Compose, with manual `ATTUNE_WORKER_RUNTIMES` configuration. Auto-detection (Phase 2) and dynamic registration (Phase 6) add polish.
## Risks & Mitigations
### musl + Crate Compatibility
**Risk**: Some crates may not compile cleanly with `x86_64-unknown-linux-musl` due to C library dependencies.
**Impact**: Build failures or runtime issues.
**Mitigation**:
- SQLx already uses `rustls` (no OpenSSL dependency) ✅
- Switch `reqwest` and `tokio-tungstenite` to `rustls` features (Phase 1.1)
- `lapin` uses pure Rust AMQP — no C dependencies ✅
- Test the musl build early in Phase 1 to surface issues quickly
- If a specific crate is problematic, evaluate alternatives or use `cross` for cross-compilation
### DNS Resolution with musl
**Risk**: musl's DNS resolver behaves differently from glibc (no `/etc/nsswitch.conf`, limited mDNS support). This can cause DNS resolution failures in Docker networks.
**Impact**: Agent can't resolve `postgres`, `rabbitmq`, etc. by Docker service name.
**Mitigation**:
- Use `trust-dns` (now `hickory-dns`) resolver feature in SQLx and reqwest instead of the system resolver
- Test DNS resolution in Docker Compose early
- If issues arise, document the workaround: use IP addresses or add `dns` configuration to the container
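A hedged `Cargo.toml` sketch of the resolver switch (feature names should be verified against each crate's current docs — reqwest renamed its feature from `trust-dns` to `hickory-dns`, and the exact SQLx feature set may differ):

```toml
# Sketch: prefer rustls plus an in-process DNS resolver over musl's system resolver.
[dependencies]
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "hickory-dns"] }
sqlx = { version = "0.8", default-features = false, features = ["runtime-tokio", "tls-rustls", "postgres"] }
```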
### Binary Size
**Risk**: A full statically-linked binary with all worker deps could be 40MB+.
**Impact**: Slow volume population, slow download via API.
**Mitigation**:
- Strip debug symbols (`strip` command) — typically reduces by 50–70%
- Use `opt-level = 'z'` and `lto = true` in release profile
- Consider `upx` compression (trades CPU at startup for smaller binary)
- Feature-gate unused functionality if size is excessive
- Target: <25MB stripped
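The size-related settings above map to a Cargo release profile along these lines (a sketch: `strip = true` replaces the manual `strip` step, and `codegen-units = 1` is a common companion setting, not one called out in the plan):

```toml
# Sketch: size-focused release profile for the agent binary.
[profile.release]
opt-level = "z"     # optimize for size rather than speed
lto = true          # whole-program link-time optimization
codegen-units = 1   # better optimization at the cost of build time
strip = true        # drop debug symbols at link time
```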
### Non-root User Conflicts
**Risk**: Different base images run as different UIDs. The agent needs write access to `runtime_envs` and `artifacts` volumes.
**Impact**: Permission denied errors when the container UID doesn't match the volume owner.
**Mitigation**:
- Document the UID requirement (current standard: UID 1000)
- Provide guidance for running the agent as root with privilege drop
- Consider adding a `--user` flag to the agent that drops privileges after setup
- For Kubernetes, use `securityContext.runAsUser` in the Pod spec
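In Kubernetes terms, that mitigation is a pod-level `securityContext`. UID 1000 matches the current standard noted above; using `fsGroup` to fix volume group ownership is an assumption:

```yaml
# Pod spec fragment (sketch): run the agent as the volume-owning UID.
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000   # re-groups mounted volumes so UID 1000 can write
```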
### Existing Workers Must Keep Working
**Risk**: Refactoring `WorkerService` (Phase 3) could introduce regressions in existing workers.
**Impact**: Production workers break.
**Mitigation**:
- The refactoring is additive — existing code paths don't change behavior
- Run the full test suite after Phase 3
- Both `attune-worker` and `attune-agent` share the same test infrastructure
- The `StartupMode::Worker` path is the existing code path with no behavioral changes
### Volume Mount Ordering
**Risk**: The agent container starts before the `init-agent` service has populated the volume.
**Impact**: Agent binary not found, container crashes.
**Mitigation**:
- Use `depends_on: { init-agent: { condition: service_completed_successfully } }` in Docker Compose
- The wrapper script (Phase 5.2) retries with a short sleep
- For Kubernetes, the initContainer pattern guarantees ordering
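The Compose-side ordering guard can be sketched as follows (service, image, and volume names are assumptions, not the actual compose file):

```yaml
# Sketch only: worker starts after the binary copy completes.
services:
  init-agent:
    image: attune/agent:latest
    command: ["cp", "/attune-agent", "/shared/attune-agent"]
    volumes:
      - agent-bin:/shared
  worker-ruby:
    image: ruby:3.3
    command: ["/opt/attune/agent/attune-agent"]
    depends_on:
      init-agent:
        condition: service_completed_successfully  # wait for the copy to finish
    volumes:
      - agent-bin:/opt/attune/agent:ro
volumes:
  agent-bin:
```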
## Testing Strategy
### Unit Tests
- Auto-detection module: mock filesystem and process execution to test interpreter discovery
- `StartupMode::Agent` code paths: ensure lazy setup and on-demand verification work correctly
- All existing worker tests continue to pass (regression safety net)
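The auto-detection logic under test essentially probes `PATH` for interpreters. A minimal sketch, with hypothetical names and probe strategy (not the actual `runtime_detection` API):

```rust
// Hypothetical sketch of interpreter auto-detection; the real module lives in
// crates/common/src/runtime_detection.rs and may probe differently.
use std::process::Command;

/// Returns true if `bin` can be spawned from PATH; a spawn error
/// (e.g. NotFound) means the interpreter is absent.
fn on_path(bin: &str) -> bool {
    Command::new(bin).arg("--version").output().is_ok()
}

fn main() {
    // Candidate runtimes the agent could register, keyed by probe binary.
    for (runtime, bin) in [("shell", "sh"), ("python", "python3"), ("ruby", "ruby")] {
        let state = if on_path(bin) { "available" } else { "missing" };
        println!("{runtime}: {state}");
    }
}
```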
### Integration Tests
- Build the agent binary with musl and run it in various container images:
- `ruby:3.3-slim` (Ruby + shell)
- `python:3.12-slim` (Python + shell)
- `node:20-slim` (Node.js + shell)
- `alpine:3.20` (shell only)
- `ubuntu:24.04` (shell only)
- `debian:bookworm-slim` (shell only, matches current worker)
- Verify: agent starts, auto-detects runtimes, registers with correct capabilities, executes a simple action, reports results
- Verify: DNS resolution works for Docker service names
### Docker Compose Tests
- Spin up the full stack with agent-based workers alongside traditional workers
- Execute actions that target specific runtimes
- Verify the scheduler routes to the correct worker based on capabilities
- Verify graceful shutdown (SIGTERM handling)
### Binary Compatibility Tests
- Test the musl binary on: Alpine, Debian, Ubuntu, CentOS/Rocky, Amazon Linux
- Test on both x86_64 and aarch64 (if multi-arch build is implemented)
- Verify no glibc dependency: `ldd attune-agent` should report "not a dynamic executable"
## Future Enhancements
These are not part of the initial implementation but are natural extensions:
1. **Per-execution container isolation**: Instead of a long-running agent, spawn a fresh container per execution with the agent injected. Provides maximum isolation (each action runs in a clean environment) at the cost of startup latency.
2. **Container image selection in action YAML**: Allow actions to declare `container: ruby:3.3` in their YAML, and have the executor spin up an appropriate container with the agent injected. Similar to GitHub Actions' container actions.
3. **Warm pool**: Pre-start a pool of agent containers for common runtimes to reduce first-execution latency.
4. **Agent self-update**: The agent periodically checks for a newer version of itself (via the API endpoint) and restarts if updated.
5. **Windows support**: Cross-compile the agent for Windows (MSVC static linking) to support Windows containers.
6. **WebAssembly runtime**: Compile actions to WASM and execute them inside the agent using wasmtime, eliminating the need for interpreter binaries entirely.
## References
- Tekton Entrypoint: https://github.com/tektoncd/pipeline/tree/main/cmd/entrypoint
- Argo Emissary Executor: https://argoproj.github.io/argo-workflows/workflow-executors/
- GitLab Runner Docker Executor: https://docs.gitlab.com/runner/executors/docker.html
- Current worker containerization: `docs/worker-containerization.md`
- Current runtime detection: `crates/common/src/runtime_detection.rs`
- Worker service: `crates/worker/src/service.rs`
- Process executor: `crates/worker/src/runtime/process_executor.rs`
- Worker Dockerfile: `docker/Dockerfile.worker.optimized`