# File-Based Artifact Storage Plan

## Overview

Replace PostgreSQL BYTEA storage for file-type artifacts with a shared filesystem volume. Execution processes write artifact files directly to disk via paths assigned by the API; the API serves those files from disk on download. The database stores only metadata (path, size, content type) — no binary content for file-based artifacts.

**Motivation:**

- Eliminates PostgreSQL bloat from large binary artifacts
- Enables executions to write files incrementally (streaming logs, large outputs) without buffering in memory for an API upload
- Artifacts can be retained independently of execution records (executions are hypertables with 90-day retention)
- Decouples artifact lifecycle from execution lifecycle — artifacts created by one execution can be accessed by others or by external systems

## Artifact Type Classification

| Type | Storage | Notes |
|------|---------|-------|
| `FileBinary` | **Disk** (shared volume) | Binary files produced by executions |
| `FileDatatable` | **Disk** (shared volume) | Tabular data files (CSV, etc.) |
| `FileText` | **Disk** (shared volume) | Text files, logs |
| `Log` | **Disk** (shared volume) | Execution stdout/stderr logs |
| `Progress` | **DB** (`artifact.data` JSONB) | Small structured progress entries — unchanged |
| `Url` | **DB** (`artifact.data` JSONB) | URL references — unchanged |

## Directory Structure

```
/opt/attune/artifacts/        # artifacts_dir (configurable)
└── {artifact_ref_slug}/      # derived from artifact ref (globally unique)
    ├── v1.txt                # version 1
    ├── v2.txt                # version 2
    └── v3.txt                # version 3
```

**Key decisions:**

- **No execution ID in the path.** Artifacts may outlive execution records (hypertable retention) and may be shared across executions or created externally.
- **Keyed by artifact ref.** The `ref` column has a unique index, making it a stable, globally unique identifier. Dots in refs become directory separators (e.g., `mypack.build_log` → `mypack/build_log/`).
- **Version files named `v{N}.{ext}`** where `N` is the version number from `next_artifact_version()` and `ext` is derived from `content_type`.

## End-to-End Flow

### Happy Path

```
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌────────────────┐
│  Worker  │────▶│Execution │────▶│   API    │────▶│ Shared Volume  │
│ Service  │     │ Process  │     │ Service  │     │ /opt/attune/   │
│          │     │(Py/Node/ │     │          │     │   artifacts/   │
│          │     │  Shell)  │     │          │     │                │
└──────────┘     └──────────┘     └──────────┘     └────────────────┘
     │                │                │                   │
     │ 1. Start exec  │                │                   │
     │    Set ATTUNE_ │                │                   │
     │    ARTIFACTS_DIR                │                   │
     │───────────────▶│                │                   │
     │                │                │                   │
     │                │ 2. POST /api/v1/artifacts          │
     │                │    {ref, type, execution}          │
     │                │───────────────▶│                   │
     │                │                │ 3. Create artifact│
     │                │                │    row in DB      │
     │                │◀───────────────│                   │
     │                │  {id, ref, ...}│                   │
     │                │                │                   │
     │                │ 4. POST /api/v1/artifacts/{id}/versions
     │                │    {content_type}                  │
     │                │───────────────▶│                   │
     │                │                │ 5. Create version │
     │                │                │    row (file_path,│
     │                │                │    no BYTEA content)
     │                │                │    + mkdir on disk│
     │                │◀───────────────│                   │
     │                │  {id, version, │                   │
     │                │   file_path}   │                   │
     │                │                │                   │
     │                │ 6. Write file to                   │
     │                │    $ATTUNE_ARTIFACTS_DIR/file_path │
     │                │────────────────────────────────────▶
     │                │                │                   │
     │ 7. Exec exits  │                │                   │
     │◀───────────────│                │                   │
     │                                 │                   │
     │ 8. Finalize: stat files,        │                   │
     │    update size_bytes in DB      │                   │
     │    (direct DB access)           │                   │
     │─────────────────────────────────┘                   │
     │                                                     │
     ▼                                                     │
┌──────────┐                                               │
│  Client  │ 9. GET /api/v1/artifacts/{id}/download        │
│   (UI)   │──────────────────▶ API reads from disk ◀──────┘
└──────────┘
```

### Step-by-Step

1. **Worker receives execution from MQ**, prepares the `ExecutionContext`, and sets the `ATTUNE_ARTIFACTS_DIR` environment variable.
2. **Execution process** calls `POST /api/v1/artifacts` to create the artifact record (ref, type, execution ID, content_type).
3. **API** creates the `artifact` row in the DB and returns the artifact ID.
4. **Execution process** calls `POST /api/v1/artifacts/{id}/versions` to create a new version. For file-type artifacts, the request body contains content_type and optional metadata — **no file content**.
5. **API** creates the `artifact_version` row with a computed `file_path` (e.g., `mypack/build_log/v1.txt`), leaving the `content` BYTEA NULL. Creates the parent directory on disk. Returns the version ID and `file_path`.
6. **Execution process** writes file content to `$ATTUNE_ARTIFACTS_DIR/{file_path}`. It can write incrementally (append, stream, etc.).
7. **Execution process exits.**
8. **Worker finalizes**: scans artifact versions linked to this execution, `stat()`s each file on disk, and updates `artifact_version.size_bytes` and `artifact.size_bytes` in the DB via direct repository access.
9. **Client requests download**: the API reads from `{artifacts_dir}/{file_path}` on disk and streams the response.

## Implementation Phases

### Phase 1: Configuration & Volume Infrastructure

**`crates/common/src/config.rs`**
- Add `artifacts_dir: String` to the `Config` struct with default `/opt/attune/artifacts`
- Add a `default_artifacts_dir()` function

**`config.development.yaml`**
- Add `artifacts_dir: ./artifacts`

**`config.docker.yaml`**
- Add `artifacts_dir: /opt/attune/artifacts`

**`docker-compose.yaml`**
- Add an `artifacts_data` named volume
- Mount `artifacts_data:/opt/attune/artifacts` in: api (rw), all workers (rw), executor (ro)
- Add `ATTUNE__ARTIFACTS_DIR: /opt/attune/artifacts` to service environments where needed

### Phase 2: Database Schema Changes

**New migration: `migrations/20250101000011_artifact_file_storage.sql`**

```sql
-- Add file_path to artifact_version for disk-based storage
ALTER TABLE artifact_version
    ADD COLUMN IF NOT EXISTS file_path TEXT;

-- Index for finding versions by file_path (orphan cleanup)
CREATE INDEX IF NOT EXISTS idx_artifact_version_file_path
    ON artifact_version(file_path)
    WHERE file_path IS NOT NULL;

COMMENT ON COLUMN artifact_version.file_path IS
    'Relative path from artifacts_dir root for disk-stored content. '
    'When set, content BYTEA is NULL — file lives on shared volume.';
```

**`crates/common/src/models.rs`** — `artifact_version` module:
- Add `file_path: Option<String>` to the `ArtifactVersion` struct
- Update `SELECT_COLUMNS` and `SELECT_COLUMNS_WITH_CONTENT` to include `file_path`

**`crates/common/src/repositories/artifact.rs`** — `ArtifactVersionRepository`:
- Add `file_path: Option<String>` to `CreateArtifactVersionInput`
- Wire `file_path` through the `create` query
- Add an `update_size_bytes(executor, version_id, size_bytes)` method for worker finalization
- Add a `find_file_versions_by_execution(executor, execution_id)` method — joins `artifact_version` → `artifact` on `artifact.execution` to find all file-based versions for an execution

### Phase 3: API Changes

#### Create Version Endpoint (modified)

`POST /api/v1/artifacts/{id}/versions` — currently `create_version_json`. Add a new endpoint or modify the existing behavior:

**`POST /api/v1/artifacts/{id}/versions/file`** (new endpoint)
- Request body: `CreateFileVersionRequest { content_type: Option<String>, meta: Option<serde_json::Value>, created_by: Option<String> }`
- **No file content in the request** — this is the key difference from `upload_version`
- API computes `file_path` from artifact ref + version number + content_type extension
- Creates the `artifact_version` row with `file_path` set and `content` NULL
- Creates the parent directory on disk: `{artifacts_dir}/{file_path_parent}/`
- Returns `ArtifactVersionResponse` **with `file_path` included**

**File path computation logic:**

```rust
fn compute_file_path(artifact_ref: &str, version: i32, content_type: &str) -> String {
    // "mypack.build_log" → "mypack/build_log"
    let ref_path = artifact_ref.replace('.', "/");
    let ext = extension_from_content_type(content_type);
    format!("{}/v{}.{}", ref_path, version, ext)
}

fn extension_from_content_type(content_type: &str) -> &'static str {
    // Illustrative MIME-to-extension mapping; extend as needed.
    match content_type { "text/plain" => "txt", "text/csv" => "csv", _ => "bin" }
}
```

#### Download Endpoints (modified)

`GET /api/v1/artifacts/{id}/download` and `GET /api/v1/artifacts/{id}/versions/{v}/download`:

- If `artifact_version.file_path` is set:
  - Resolve the absolute path: `{artifacts_dir}/{file_path}`
  - Verify the file exists; return 404 if not
  - `stat()` the file for the Content-Length header
  - Stream the file content as the response body
- If `file_path` is NULL:
  - Fall back to the existing BYTEA/JSON content from the DB (backward compatible)

#### Upload Endpoint (unchanged for now)

`POST /api/v1/artifacts/{id}/versions/upload` (multipart) continues to store content in DB BYTEA. This remains available for non-execution uploads (external systems, small files, etc.).

#### Response DTO Changes

**`crates/api/src/dto/artifact.rs`**:
- Add `file_path: Option<String>` to `ArtifactVersionResponse`
- Add `file_path: Option<String>` to `ArtifactVersionSummary`
- Add a `CreateFileVersionRequest` DTO

### Phase 4: Worker Changes

#### Environment Variable Injection

**`crates/worker/src/executor.rs`** — `prepare_execution_context()`:
- Add `ATTUNE_ARTIFACTS_DIR` to the standard env vars block:

```rust
env.insert("ATTUNE_ARTIFACTS_DIR".to_string(), self.artifacts_dir.clone());
```

- The `ActionExecutor` struct needs to hold the `artifacts_dir` value (sourced from config)

#### Post-Execution Finalization

**`crates/worker/src/executor.rs`** — after execution completes (success or failure):

```
async fn finalize_artifacts(&self, execution_id: i64) -> Result<()>
```

1. Query `artifact_version` rows joined through `artifact.execution = execution_id` where `file_path IS NOT NULL`
2. For each version with a `file_path`:
   - Resolve the absolute path: `{artifacts_dir}/{file_path}`
   - `tokio::fs::metadata(path).await` to get the file size
   - If the file exists: update `artifact_version.size_bytes` via the repository
   - If the file doesn't exist: set `size_bytes = 0` (the execution didn't produce the file)
3. For each parent artifact: update `artifact.size_bytes` to the latest version's `size_bytes`

This runs after every execution regardless of success/failure status, since even failed executions may have written partial artifacts.
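The per-file size pass is ordinary filesystem work, so it can be sketched in plain synchronous Rust. This is a minimal sketch, not the worker implementation: `finalize_size`, `finalize_artifacts`, and the `(version_id, file_path)` tuples are illustrative stand-ins for the repository types, and the real code would use `tokio::fs::metadata` plus `update_size_bytes` as described above.

```rust
use std::fs;
use std::path::Path;

/// Resolve a version's absolute path and return its on-disk size,
/// or 0 when the execution never produced the file.
fn finalize_size(artifacts_dir: &Path, file_path: &str) -> u64 {
    let abs = artifacts_dir.join(file_path);
    fs::metadata(&abs).map(|m| m.len()).unwrap_or(0)
}

/// For each (version_id, file_path) pair, compute the size update
/// the worker would write back to `artifact_version.size_bytes`.
fn finalize_artifacts(artifacts_dir: &Path, versions: &[(i64, String)]) -> Vec<(i64, u64)> {
    versions
        .iter()
        .map(|(id, fp)| (*id, finalize_size(artifacts_dir, fp)))
        .collect()
}
```

A version whose file was never written simply finalizes to `size_bytes = 0`, matching step 2 above.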
#### Simplify Old ArtifactManager

**`crates/worker/src/artifacts.rs`**:
- The existing `ArtifactManager` is a standalone prototype disconnected from the DB-backed system. It can be simplified to handle only `artifacts_dir` path resolution and directory creation, or removed entirely since the API now manages paths.
- Keep the struct as a thin wrapper if it's useful for the finalization logic, but remove the `store_logs`, `store_result`, and `store_file` methods that duplicate what the API does.

### Phase 5: Retention & Cleanup

#### DB Trigger (existing, minor update)

The `enforce_artifact_retention` trigger fires `AFTER INSERT ON artifact_version` and deletes old version rows when the count exceeds the limit. This continues to work for row deletion. However, it **cannot** delete files on disk (triggers can't do filesystem I/O).

#### Orphan File Cleanup (new)

Add an async cleanup mechanism — either a periodic task in the worker/executor or a dedicated CLI command:

**`attune artifact cleanup`** (CLI) or periodic task:

1. Scan all files under `{artifacts_dir}/`
2. For each file, check whether a matching `artifact_version.file_path` row exists
3. If no row exists (orphaned file), delete the file
4. Also delete empty directories

This handles:

- Files left behind after the retention trigger deletes version rows
- Files from crashed executions that created directories but whose version rows were cleaned up
- Manual DB cleanup scenarios

**Frequency:** Daily or on-demand via CLI. Orphaned files are not harmful (just wasted disk space), so aggressive cleanup isn't critical.

#### Artifact Deletion Endpoint

The existing `DELETE /api/v1/artifacts/{id}` cascades to `artifact_version` rows via FK.
Enhance it to also delete files on disk:

- Before deleting the DB row, query all versions with `file_path IS NOT NULL`
- Delete each file from disk
- Then delete the DB row (cascades to version rows)
- Clean up empty parent directories

Similarly for `DELETE /api/v1/artifacts/{id}/versions/{v}`.

## Schema Summary

### artifact table (unchanged)

Existing columns remain. `size_bytes` continues to reflect the latest version's size (updated by worker finalization for file-based artifacts, updated by DB trigger for DB-stored artifacts).

### artifact_version table (modified)

| Column | Type | Notes |
|--------|------|-------|
| `id` | BIGSERIAL | PK |
| `artifact` | BIGINT | FK → artifact(id) ON DELETE CASCADE |
| `version` | INTEGER | Auto-assigned by `next_artifact_version()` |
| `content_type` | TEXT | MIME type |
| `size_bytes` | BIGINT | Set by worker finalization for file-based; set at insert for DB-stored |
| `content` | BYTEA | NULL for file-based artifacts; populated for DB-stored uploads |
| `content_json` | JSONB | For JSON content versions (unchanged) |
| **`file_path`** | **TEXT** | **NEW — relative path from `artifacts_dir`. When set, `content` is NULL** |
| `meta` | JSONB | Free-form metadata |
| `created_by` | TEXT | Who created this version |
| `created` | TIMESTAMPTZ | Immutable |

**Invariant:** Exactly one of `content`, `content_json`, or `file_path` should be non-NULL for a given version row.
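The invariant can be made explicit at load time with a small sum type. A minimal sketch, assuming string stand-ins for the raw column values (the `VersionContent` enum and `classify` helper are illustrative, not part of the plan's API; `content_json` is shown as a plain `String` rather than a JSON value for brevity):

```rust
/// The three mutually exclusive places a version's content can live.
enum VersionContent {
    Bytea(Vec<u8>), // `content` column (DB-stored uploads)
    Json(String),   // `content_json` column
    File(String),   // `file_path` column (shared volume)
}

/// Enforce the invariant on raw column values: exactly one of
/// `content`, `content_json`, `file_path` may be non-NULL.
fn classify(
    content: Option<Vec<u8>>,
    content_json: Option<String>,
    file_path: Option<String>,
) -> Result<VersionContent, &'static str> {
    match (content, content_json, file_path) {
        (Some(b), None, None) => Ok(VersionContent::Bytea(b)),
        (None, Some(j), None) => Ok(VersionContent::Json(j)),
        (None, None, Some(p)) => Ok(VersionContent::File(p)),
        _ => Err("invariant violated: expected exactly one content column"),
    }
}
```

Download handlers and the worker finalizer can then match on one value instead of re-checking three nullable columns.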
## Files Changed

| File | Changes |
|------|---------|
| `crates/common/src/config.rs` | Add `artifacts_dir` field with default |
| `crates/common/src/models.rs` | Add `file_path` to `ArtifactVersion` |
| `crates/common/src/repositories/artifact.rs` | Wire `file_path` through create; add `update_size_bytes`, `find_file_versions_by_execution` |
| `crates/api/src/dto/artifact.rs` | Add `file_path` to version response DTOs; add `CreateFileVersionRequest` |
| `crates/api/src/routes/artifacts.rs` | New `create_version_file` endpoint; modify download endpoints for disk reads |
| `crates/api/src/state.rs` | No change needed — `config` already accessible via `AppState.config` |
| `crates/worker/src/executor.rs` | Inject `ATTUNE_ARTIFACTS_DIR` env var; add `finalize_artifacts()` post-execution |
| `crates/worker/src/service.rs` | Pass `artifacts_dir` config to `ActionExecutor` |
| `crates/worker/src/artifacts.rs` | Simplify or remove old `ArtifactManager` |
| `migrations/20250101000011_artifact_file_storage.sql` | Add `file_path` column to `artifact_version` |
| `config.development.yaml` | Add `artifacts_dir: ./artifacts` |
| `config.docker.yaml` | Add `artifacts_dir: /opt/attune/artifacts` |
| `docker-compose.yaml` | Add `artifacts_data` volume; mount in api + worker services |

## Environment Variables

| Variable | Set By | Available To | Value |
|----------|--------|--------------|-------|
| `ATTUNE_ARTIFACTS_DIR` | Worker | Execution process | Absolute path to artifacts volume (e.g., `/opt/attune/artifacts`) |
| `ATTUNE__ARTIFACTS_DIR` | Docker Compose | API / Worker services | Config override for `artifacts_dir` |

## Backward Compatibility

- **Existing DB-stored artifacts continue to work.** Download endpoints check `file_path` first and fall back to BYTEA/JSON content.
- **Existing multipart upload endpoint unchanged.** External systems can still upload small files via `POST /artifacts/{id}/versions/upload` — those go to the DB as before.
- **Progress and URL artifacts unchanged.** They don't use `artifact_version` content at all.
- **No data migration needed.** Existing artifacts have `file_path = NULL` and continue to serve from the DB.

## Future Considerations

- **External object storage (S3/MinIO):** The `file_path` abstraction makes it straightforward to swap the local filesystem for S3 later — the path becomes an object key, and the download endpoint proxies or redirects.
- **Streaming writes:** With disk-based storage, a future enhancement could allow the API to stream large file uploads directly to disk instead of buffering in memory.
- **Artifact garbage collection:** The orphan cleanup could be integrated into the executor's periodic maintenance loop alongside execution timeout monitoring.
- **Cross-execution artifact access:** Since artifacts are keyed by ref (not execution ID), a future enhancement could let actions declare artifact dependencies, and the worker could resolve and mount those paths.
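As a rough sketch of the first point (with hypothetical names `ArtifactStore`, `LocalDisk`, and `S3Like`, none of which are planned types): the relative `file_path` stays the stable identifier, and only the resolution step varies per backend.

```rust
use std::path::PathBuf;

/// Minimal storage-backend abstraction: `file_path` is the stable key;
/// each backend only decides where those bytes physically live.
trait ArtifactStore {
    fn locate(&self, file_path: &str) -> String;
}

struct LocalDisk {
    artifacts_dir: PathBuf,
}

struct S3Like {
    bucket: String,
}

impl ArtifactStore for LocalDisk {
    fn locate(&self, file_path: &str) -> String {
        self.artifacts_dir.join(file_path).display().to_string()
    }
}

impl ArtifactStore for S3Like {
    fn locate(&self, file_path: &str) -> String {
        // The relative path becomes the object key verbatim.
        format!("s3://{}/{}", self.bucket, file_path)
    }
}
```

Nothing stored in the DB would change under such a swap; only the download and cleanup paths would resolve through a different backend.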