artifacts!

This commit is contained in:
2026-03-03 13:42:41 -06:00
parent 5da940639a
commit 8299e5efcb
50 changed files with 4779 additions and 341 deletions

# File-Based Artifact Storage Plan
## Overview
Replace PostgreSQL BYTEA storage for file-type artifacts with a shared filesystem volume. Execution processes write artifact files directly to disk via paths assigned by the API; the API serves those files from disk on download. The database stores only metadata (path, size, content type) — no binary content for file-based artifacts.
**Motivation:**
- Eliminates PostgreSQL bloat from large binary artifacts
- Enables executions to write files incrementally (streaming logs, large outputs) without buffering in memory for an API upload
- Artifacts can be retained independently of execution records (executions are hypertables with 90-day retention)
- Decouples artifact lifecycle from execution lifecycle — artifacts created by one execution can be accessed by others or by external systems
## Artifact Type Classification
| Type | Storage | Notes |
|------|---------|-------|
| `FileBinary` | **Disk** (shared volume) | Binary files produced by executions |
| `FileDatatable` | **Disk** (shared volume) | Tabular data files (CSV, etc.) |
| `FileText` | **Disk** (shared volume) | Text files, logs |
| `Log` | **Disk** (shared volume) | Execution stdout/stderr logs |
| `Progress` | **DB** (`artifact.data` JSONB) | Small structured progress entries — unchanged |
| `Url` | **DB** (`artifact.data` JSONB) | URL references — unchanged |
## Directory Structure
```
/opt/attune/artifacts/ # artifacts_dir (configurable)
└── {artifact_ref_slug}/ # derived from artifact ref (globally unique)
├── v1.txt # version 1
├── v2.txt # version 2
└── v3.txt # version 3
```
**Key decisions:**
- **No execution ID in the path.** Artifacts may outlive execution records (hypertable retention) and may be shared across executions or created externally.
- **Keyed by artifact ref.** The `ref` column has a unique index, making it a stable, globally unique identifier. Dots in refs become directory separators (e.g., `mypack.build_log` → `mypack/build_log/`).
- **Version files named `v{N}.{ext}`** where `N` is the version number from `next_artifact_version()` and `ext` is derived from `content_type`.
## End-to-End Flow
### Happy Path
```
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────────┐
│ Worker │────▶│Execution │────▶│ API │────▶│ Shared Volume │
│ Service │ │ Process │ │ Service │ │ /opt/attune/ │
│ │ │(Py/Node/ │ │ │ │ artifacts/ │
│ │ │ Shell) │ │ │ │ │
└──────────┘ └──────────┘ └──────────┘ └────────────────┘
│ │ │ │
│ 1. Start exec │ │ │
│ Set ATTUNE_ │ │ │
│ ARTIFACTS_DIR │ │ │
│───────────────▶│ │ │
│ │ │ │
│ │ 2. POST /api/v1/artifacts │
│ │ {ref, type, execution} │
│ │───────────────▶│ │
│ │ │ 3. Create artifact │
│ │ │ row in DB │
│ │ │ │
│ │◀───────────────│ │
│ │ {id, ref, ...}│ │
│ │ │ │
│ │ 4. POST /api/v1/artifacts/{id}/versions
│ │ {content_type} │
│ │───────────────▶│ │
│ │ │ 5. Create version │
│ │ │ row (file_path, │
│ │ │ no BYTEA content) │
│ │ │ + mkdir on disk │
│ │◀───────────────│ │
│ │ {id, version, │ │
│ │ file_path} │ │
│ │ │ │
│ │ 6. Write file to │
│ │ $ATTUNE_ARTIFACTS_DIR/file_path │
│ │─────────────────────────────────────▶│
│ │ │ │
│ 7. Exec exits │ │ │
│◀───────────────│ │ │
│ │ │
│ 8. Finalize: stat files, │ │
│ update size_bytes in DB │ │
│ (direct DB access) │ │
│─────────────────────────────────┘ │
│ │
▼ │
┌──────────┐ │
│ Client │ 9. GET /api/v1/artifacts/{id}/download │
│ (UI) │──────────────────▶ API reads from disk ◀──────┘
└──────────┘
```
### Step-by-Step
1. **Worker receives execution from MQ**, prepares `ExecutionContext`, sets `ATTUNE_ARTIFACTS_DIR` environment variable.
2. **Execution process** calls `POST /api/v1/artifacts` to create the artifact record (ref, type, execution ID, content_type).
3. **API** creates the `artifact` row in DB, returns the artifact ID.
4. **Execution process** calls `POST /api/v1/artifacts/{id}/versions` to create a new version. For file-type artifacts, the request body contains content_type and optional metadata — **no file content**.
5. **API** creates the `artifact_version` row with a computed `file_path` (e.g., `mypack/build_log/v1.txt`), `content` BYTEA left NULL. Creates the parent directory on disk. Returns version ID and `file_path`.
6. **Execution process** writes file content to `$ATTUNE_ARTIFACTS_DIR/{file_path}`. Can write incrementally (append, stream, etc.).
7. **Execution process exits.**
8. **Worker finalizes**: scans artifact versions linked to this execution, `stat()`s each file on disk, updates `artifact_version.size_bytes` and `artifact.size_bytes` in the DB via direct repository access.
9. **Client requests download**: API reads from `{artifacts_dir}/{file_path}` on disk and streams the response.
## Implementation Phases
### Phase 1: Configuration & Volume Infrastructure
**`crates/common/src/config.rs`**
- Add `artifacts_dir: String` to `Config` struct with default `/opt/attune/artifacts`
- Add `default_artifacts_dir()` function
**`config.development.yaml`**
- Add `artifacts_dir: ./artifacts`
**`config.docker.yaml`**
- Add `artifacts_dir: /opt/attune/artifacts`
**`docker-compose.yaml`**
- Add `artifacts_data` named volume
- Mount `artifacts_data:/opt/attune/artifacts` in: api (rw), all workers (rw), executor (ro)
- Add `ATTUNE__ARTIFACTS_DIR: /opt/attune/artifacts` to service environments where needed
### Phase 2: Database Schema Changes
**New migration: `migrations/20250101000011_artifact_file_storage.sql`**
```sql
-- Add file_path to artifact_version for disk-based storage
ALTER TABLE artifact_version ADD COLUMN IF NOT EXISTS file_path TEXT;
-- Index for finding versions by file_path (orphan cleanup)
CREATE INDEX IF NOT EXISTS idx_artifact_version_file_path
ON artifact_version(file_path) WHERE file_path IS NOT NULL;
COMMENT ON COLUMN artifact_version.file_path IS
'Relative path from artifacts_dir root for disk-stored content. '
'When set, content BYTEA is NULL — file lives on shared volume.';
```
**`crates/common/src/models.rs`** — `artifact_version` module:
- Add `file_path: Option<String>` to `ArtifactVersion` struct
- Update `SELECT_COLUMNS` and `SELECT_COLUMNS_WITH_CONTENT` to include `file_path`
**`crates/common/src/repositories/artifact.rs`** — `ArtifactVersionRepository`:
- Add `file_path: Option<String>` to `CreateArtifactVersionInput`
- Wire `file_path` through the `create` query
- Add `update_size_bytes(executor, version_id, size_bytes)` method for worker finalization
- Add `find_file_versions_by_execution(executor, execution_id)` method — joins `artifact_version` → `artifact` on `artifact.execution` to find all file-based versions for an execution
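The join behind `find_file_versions_by_execution` might look like this sketch (table and column names follow the schema summary in this document; the exact SELECT list is an assumption):

```sql
-- All file-backed versions produced by one execution,
-- for worker finalization.
SELECT av.id, av.artifact, av.version, av.file_path, av.size_bytes
FROM artifact_version av
JOIN artifact a ON a.id = av.artifact
WHERE a.execution = $1
  AND av.file_path IS NOT NULL;
```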
### Phase 3: API Changes
#### Create Version Endpoint (new file variant)
`POST /api/v1/artifacts/{id}/versions` is currently handled by `create_version_json`. Rather than modifying its behavior, add a dedicated file-backed variant:
**`POST /api/v1/artifacts/{id}/versions/file`** (new endpoint)
- Request body: `CreateFileVersionRequest { content_type: Option<String>, meta: Option<Value>, created_by: Option<String> }`
- **No file content in the request** — this is the key difference from `upload_version`
- API computes `file_path` from artifact ref + version number + content_type extension
- Creates `artifact_version` row with `file_path` set, `content` NULL
- Creates parent directory on disk: `{artifacts_dir}/{file_path_parent}/`
- Returns `ArtifactVersionResponse` **with `file_path` included**
**File path computation logic:**
```rust
fn compute_file_path(artifact_ref: &str, version: i32, content_type: &str) -> String {
// "mypack.build_log" → "mypack/build_log"
let ref_path = artifact_ref.replace('.', "/");
let ext = extension_from_content_type(content_type);
format!("{}/v{}.{}", ref_path, version, ext)
}
```
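`compute_file_path` relies on `extension_from_content_type`, which the plan doesn't spell out. A minimal sketch, assuming a small MIME-to-extension table with a generic `bin` fallback and tolerance for MIME parameters:

```rust
/// Map a MIME type to a file extension for artifact version filenames.
/// The table here is an illustrative subset; unknown types fall back to "bin".
fn extension_from_content_type(content_type: &str) -> &'static str {
    // Drop any parameters, e.g. "text/plain; charset=utf-8" → "text/plain".
    let mime = content_type.split(';').next().unwrap_or("").trim();
    match mime {
        "text/plain" => "txt",
        "text/csv" => "csv",
        "application/json" => "json",
        "application/x-ndjson" => "ndjson",
        "application/gzip" => "gz",
        _ => "bin",
    }
}
```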
#### Download Endpoints (modified)
`GET /api/v1/artifacts/{id}/download` and `GET /api/v1/artifacts/{id}/versions/{v}/download`:
- If `artifact_version.file_path` is set:
- Resolve absolute path: `{artifacts_dir}/{file_path}`
- Verify file exists, return 404 if not
- `stat()` the file for Content-Length header
- Stream file content as response body
- If `file_path` is NULL:
- Fall back to existing BYTEA/JSON content from DB (backward compatible)
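Since `file_path` is only ever computed by the API and read back from the DB, joining it onto `artifacts_dir` should always be safe, but a defensive check costs little. A hypothetical `resolve_artifact_path` sketch (the traversal guard is an addition, not something the plan calls for):

```rust
use std::path::{Component, Path, PathBuf};

/// Resolve a version's relative file_path against artifacts_dir.
/// Rejects absolute paths and `..` components, so a corrupted or
/// tampered file_path can never escape the artifacts root.
fn resolve_artifact_path(artifacts_dir: &Path, file_path: &str) -> Option<PathBuf> {
    let rel = Path::new(file_path);
    let safe = rel.components().all(|c| matches!(c, Component::Normal(_)));
    if safe {
        Some(artifacts_dir.join(rel))
    } else {
        None
    }
}
```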
#### Upload Endpoint (unchanged for now)
`POST /api/v1/artifacts/{id}/versions/upload` (multipart) — continues to store in DB BYTEA. This remains available for non-execution uploads (external systems, small files, etc.).
#### Response DTO Changes
**`crates/api/src/dto/artifact.rs`**:
- Add `file_path: Option<String>` to `ArtifactVersionResponse`
- Add `file_path: Option<String>` to `ArtifactVersionSummary`
- Add `CreateFileVersionRequest` DTO
### Phase 4: Worker Changes
#### Environment Variable Injection
**`crates/worker/src/executor.rs`** — `prepare_execution_context()`:
- Add `ATTUNE_ARTIFACTS_DIR` to the standard env vars block:
```rust
env.insert("ATTUNE_ARTIFACTS_DIR".to_string(), self.artifacts_dir.clone());
```
- The `ActionExecutor` struct needs to hold the `artifacts_dir` value (sourced from config)
#### Post-Execution Finalization
**`crates/worker/src/executor.rs`** — after execution completes (success or failure):
```rust
async fn finalize_artifacts(&self, execution_id: i64) -> Result<()>
```
1. Query `artifact_version` rows joined through `artifact.execution = execution_id` where `file_path IS NOT NULL`
2. For each version with a `file_path`:
- Resolve absolute path: `{artifacts_dir}/{file_path}`
- `tokio::fs::metadata(path).await` to get file size
- If file exists: update `artifact_version.size_bytes` via repository
- If file doesn't exist: set `size_bytes = 0` (execution didn't produce the file)
3. For each parent artifact: update `artifact.size_bytes` to the latest version's `size_bytes`
This runs after every execution regardless of success/failure status, since even failed executions may have written partial artifacts.
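The per-file part of step 2 can be sketched as a small helper (the real code would use `tokio::fs::metadata` and the repository's `update_size_bytes`; here, blocking `std::fs` stands in):

```rust
use std::path::Path;

/// Determine the size_bytes to record for a file-backed artifact version.
/// A missing (or unreadable) file means the execution never produced it,
/// recorded as 0 per the finalization rules.
fn finalized_size_bytes(artifacts_dir: &Path, file_path: &str) -> u64 {
    let abs = artifacts_dir.join(file_path);
    match std::fs::metadata(&abs) {
        Ok(meta) => meta.len(),
        Err(_) => 0,
    }
}
```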
#### Simplify Old ArtifactManager
**`crates/worker/src/artifacts.rs`**:
- The existing `ArtifactManager` is a standalone prototype disconnected from the DB-backed system. It can be simplified to only handle the `artifacts_dir` path resolution and directory creation, or removed entirely since the API now manages paths.
- Keep the struct as a thin wrapper if it's useful for the finalization logic, but remove the `store_logs`, `store_result`, `store_file` methods that duplicate what the API does.
### Phase 5: Retention & Cleanup
#### DB Trigger (existing, minor update)
The `enforce_artifact_retention` trigger fires `AFTER INSERT ON artifact_version` and deletes old version rows when the count exceeds the limit. This continues to work for row deletion. However, it **cannot** delete files on disk (triggers can't do filesystem I/O).
#### Orphan File Cleanup (new)
Add an async cleanup mechanism — either a periodic task in the worker/executor or a dedicated CLI command:
**`attune artifact cleanup`** (CLI) or periodic task:
1. Scan all files under `{artifacts_dir}/`
2. For each file, check if a matching `artifact_version.file_path` row exists
3. If no row exists (orphaned file), delete the file
4. Also delete empty directories
This handles:
- Files left behind after the retention trigger deletes version rows
- Files from crashed executions that created directories but whose version rows were cleaned up
- Manual DB cleanup scenarios
**Frequency:** Daily or on-demand via CLI. Orphaned files are not harmful (just wasted disk space), so aggressive cleanup isn't critical.
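Steps 1 and 2 of the scan can be sketched as a recursive walk that returns files with no matching `artifact_version.file_path` row (the `known` set stands in for a DB lookup; deletion and empty-directory cleanup are left to the caller):

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Recursively collect files under `root` whose path relative to `root`
/// does not appear in `known` (the set of file_path values from the DB).
fn find_orphans(root: &Path, known: &HashSet<String>) -> std::io::Result<Vec<PathBuf>> {
    let mut orphans = Vec::new();
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in std::fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                stack.push(path);
            } else {
                let rel = path.strip_prefix(root).unwrap().to_string_lossy().to_string();
                if !known.contains(&rel) {
                    orphans.push(path);
                }
            }
        }
    }
    Ok(orphans)
}
```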
#### Artifact Deletion Endpoint
The existing `DELETE /api/v1/artifacts/{id}` cascades to `artifact_version` rows via FK. Enhance it to also delete files on disk:
- Before deleting the DB row, query all versions with `file_path IS NOT NULL`
- Delete each file from disk
- Then delete the DB row (cascades to version rows)
- Clean up empty parent directories
Similarly for `DELETE /api/v1/artifacts/{id}/versions/{v}`.
## Schema Summary
### artifact table (unchanged)
Existing columns remain. `size_bytes` continues to reflect the latest version's size (updated by worker finalization for file-based artifacts, updated by DB trigger for DB-stored artifacts).
### artifact_version table (modified)
| Column | Type | Notes |
|--------|------|-------|
| `id` | BIGSERIAL | PK |
| `artifact` | BIGINT | FK → artifact(id) ON DELETE CASCADE |
| `version` | INTEGER | Auto-assigned by `next_artifact_version()` |
| `content_type` | TEXT | MIME type |
| `size_bytes` | BIGINT | Set by worker finalization for file-based; set at insert for DB-stored |
| `content` | BYTEA | NULL for file-based artifacts; populated for DB-stored uploads |
| `content_json` | JSONB | For JSON content versions (unchanged) |
| **`file_path`** | **TEXT** | **NEW — relative path from `artifacts_dir`. When set, `content` is NULL** |
| `meta` | JSONB | Free-form metadata |
| `created_by` | TEXT | Who created this version |
| `created` | TIMESTAMPTZ | Immutable |
**Invariant:** Exactly one of `content`, `content_json`, or `file_path` should be non-NULL for a given version row.
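If desired, the migration could enforce this invariant directly; a sketch using PostgreSQL's `num_nonnulls` (not part of the planned migration, and existing rows should be verified against it first):

```sql
-- Optional: enforce that exactly one content column is populated.
ALTER TABLE artifact_version ADD CONSTRAINT chk_artifact_version_one_body
CHECK (num_nonnulls(content, content_json, file_path) = 1);
```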
## Files Changed
| File | Changes |
|------|---------|
| `crates/common/src/config.rs` | Add `artifacts_dir` field with default |
| `crates/common/src/models.rs` | Add `file_path` to `ArtifactVersion` |
| `crates/common/src/repositories/artifact.rs` | Wire `file_path` through create; add `update_size_bytes`, `find_file_versions_by_execution` |
| `crates/api/src/dto/artifact.rs` | Add `file_path` to version response DTOs; add `CreateFileVersionRequest` |
| `crates/api/src/routes/artifacts.rs` | New `create_version_file` endpoint; modify download endpoints for disk reads |
| `crates/api/src/state.rs` | No change needed — `config` already accessible via `AppState.config` |
| `crates/worker/src/executor.rs` | Inject `ATTUNE_ARTIFACTS_DIR` env var; add `finalize_artifacts()` post-execution |
| `crates/worker/src/service.rs` | Pass `artifacts_dir` config to `ActionExecutor` |
| `crates/worker/src/artifacts.rs` | Simplify or remove old `ArtifactManager` |
| `migrations/20250101000011_artifact_file_storage.sql` | Add `file_path` column to `artifact_version` |
| `config.development.yaml` | Add `artifacts_dir: ./artifacts` |
| `config.docker.yaml` | Add `artifacts_dir: /opt/attune/artifacts` |
| `docker-compose.yaml` | Add `artifacts_data` volume; mount in api + worker services |
## Environment Variables
| Variable | Set By | Available To | Value |
|----------|--------|--------------|-------|
| `ATTUNE_ARTIFACTS_DIR` | Worker | Execution process | Absolute path to artifacts volume (e.g., `/opt/attune/artifacts`) |
| `ATTUNE__ARTIFACTS_DIR` | Docker Compose | API / Worker services | Config override for `artifacts_dir` |
## Backward Compatibility
- **Existing DB-stored artifacts continue to work.** Download endpoints check `file_path` first, fall back to BYTEA/JSON content.
- **Existing multipart upload endpoint unchanged.** External systems can still upload small files via `POST /artifacts/{id}/versions/upload` — those go to DB as before.
- **Progress and URL artifacts unchanged.** They don't use `artifact_version` content at all.
- **No data migration needed.** Existing artifacts have `file_path = NULL` and continue to serve from DB.
## Future Considerations
- **External object storage (S3/MinIO):** The `file_path` abstraction makes it straightforward to swap the local filesystem for S3 later — the path becomes an object key, and the download endpoint proxies or redirects.
- **Streaming writes:** With disk-based storage, a future enhancement could allow the API to stream large file uploads directly to disk instead of buffering in memory.
- **Artifact garbage collection:** The orphan cleanup could be integrated into the executor's periodic maintenance loop alongside execution timeout monitoring.
- **Cross-execution artifact access:** Since artifacts are keyed by ref (not execution ID), a future enhancement could let actions declare artifact dependencies, and the worker could resolve and mount those paths.