# StackStorm Pitfalls Analysis: Current Implementation Review

**Date:** 2024-01-02
**Status:** Analysis Complete - Action Items Identified

## Executive Summary

This document analyzes the current Attune implementation against the StackStorm lessons learned to identify replicated pitfalls and propose solutions. The analysis reveals **3 critical issues** and **2 moderate concerns** that must be addressed before production deployment.

---

## 1. HIGH COUPLING WITH CUSTOM ACTIONS ✅ AVOIDED

### StackStorm Problem

- Custom actions are tightly coupled to st2 services
- Minimal documentation around action/sensor service interfaces
- Actions must import st2 libraries and inherit from st2 classes

### Current Attune Status: **GOOD**

- ✅ Actions are executed as standalone processes via `tokio::process::Command`
- ✅ No Attune-specific imports or base classes required
- ✅ Runtime abstraction layer in `worker/src/runtime/` is well designed
- ✅ Actions receive data via environment variables and stdin (code execution)

### Recommendations

- **Keep current approach** - the runtime abstraction is solid
- Consider documenting the runtime interface contract for pack developers
- Add examples of "pure" Python/Shell/Node.js actions that work without any Attune dependencies

---

## 2. TYPE SAFETY AND DOCUMENTATION ✅ AVOIDED

### StackStorm Problem

- Python with minimal type hints
- Runtime property injection makes types hard to determine
- Poor documentation of service interfaces

### Current Attune Status: **EXCELLENT**

- ✅ Built in Rust with full compile-time type checking
- ✅ All models in `common/src/models.rs` are strongly typed with SQLx
- ✅ Clear type definitions for `ExecutionContext`, `ExecutionResult`, `RuntimeError`
- ✅ Repository pattern enforces type contracts

### Recommendations

- **No changes needed** - Rust's type system provides the safety we need
- Continue documenting public APIs in the `docs/` folder
- Consider generating OpenAPI specs from Axum routes for external consumers

---

## 3. LIMITED LANGUAGE ECOSYSTEM SUPPORT ⚠️ PARTIALLY ADDRESSED

### StackStorm Problem

- Only Python packs natively supported
- Other languages require custom installation logic
- No standard way to declare dependencies per language ecosystem

### Current Attune Status: **NEEDS WORK**

#### What's Good

- ✅ Runtime abstraction supports multiple languages (Python, Shell, Node.js planned)
- ✅ `Pack` model has a `runtime_deps: Vec<String>` field for dependencies
- ✅ `Runtime` table has `distributions` JSONB and `installation` JSONB fields

#### Problems Identified

**Problem 3.1: No Dependency Installation Implementation**

```rust
// In crates/common/src/models.rs
pub struct Pack {
    // ...
    pub runtime_deps: Vec<String>, // ← DEFINED BUT NOT USED
    // ...
}

pub struct Runtime {
    // ...
    pub distributions: JsonDict,       // ← NO INSTALLATION LOGIC
    pub installation: Option<JsonDict>,
    // ...
}
```

**Problem 3.2: No Pack Installation/Setup Service**

- No code exists to process the `runtime_deps` field
- No integration with pip, npm, cargo, etc.
- No isolation of dependencies between packs

**Problem 3.3: Runtime Detection is Naive**

```rust
// In crates/worker/src/runtime/python.rs:279
fn can_execute(&self, context: &ExecutionContext) -> bool {
    // Only checks file extension - doesn't verify runtime availability
    context.action_ref.contains(".py") || context.entry_point.ends_with(".py")
    // ...
}
```

### Recommendations

**IMMEDIATE (Before Production):**

1. **Implement Pack Installation Service**
   - Create an `attune-packman` service or add it to `attune-api`
   - Support installing Python deps via `pip install -r requirements.txt`
   - Support installing Node.js deps via `npm install`
   - Store pack code in isolated directories: `/var/lib/attune/packs/{pack_ref}/`

2. **Enhance Runtime Model**
   - Add an `installation_status` enum: `not_installed`, `installing`, `installed`, `failed`
   - Add an `installed_at` timestamp
   - Add an `installation_log` field for troubleshooting

3. **Implement Dependency Isolation**
   - Python: use a `venv` per pack in `/var/lib/attune/packs/{pack_ref}/.venv/`
   - Node.js: use a local `node_modules` per pack
   - Document in the pack schema how to declare dependencies

**FUTURE (v2.0):**

4. **Container-based Runtime**
   - Each pack gets its own container image
   - Dependencies baked into the image
   - Complete isolation from the Attune system

---

## 4. DEPENDENCY HELL AND SYSTEM COUPLING 🔴 CRITICAL ISSUE

### StackStorm Problem

- st2 services run on Python 2.7/3.6 (EOL)
- Upgrading the st2 system breaks user actions
- User actions are coupled to the st2 Python version

### Current Attune Status: **VULNERABLE**

#### Problems Identified

**Problem 4.1: Shared System Python Runtime**

```rust
// In crates/worker/src/runtime/python.rs:19
pub fn new() -> Self {
    Self {
        python_path: PathBuf::from("python3"), // ← SYSTEM PYTHON!
        // ...
    }
}
```

- Currently uses the system-wide `python3`
- If Attune upgrades the system Python, user actions may break
- No version pinning or isolation

**Problem 4.2: No Runtime Version Management**

- No way to specify Python 3.9 vs 3.11 vs 3.12
- The `Runtime` table has a `name` field, but it is not used for version selection
- Shell runtime is hardcoded to `/bin/bash`

**Problem 4.3: Attune System Dependencies Could Conflict**

- If the Attune worker needs a Python library (e.g., for parsing), it could conflict with action deps
- No separation between "Attune system dependencies" and "action dependencies"

### Recommendations

**CRITICAL (Must Fix Before v1.0):**

1. **Implement Per-Pack Virtual Environments**

   ```rust
   // Pseudocode for python.rs enhancement
   pub struct PythonRuntime {
       python_path: PathBuf,            // System python3 for venv creation
       venv_base: PathBuf,              // /var/lib/attune/packs/
       default_python_version: String,  // "3.11"
   }

   impl PythonRuntime {
       async fn get_or_create_venv(&self, pack_ref: &str) -> Result<PathBuf> {
           let venv_path = self.venv_base.join(pack_ref).join(".venv");
           if !venv_path.exists() {
               self.create_venv(&venv_path).await?;
               self.install_pack_deps(pack_ref, &venv_path).await?;
           }
           Ok(venv_path.join("bin/python"))
       }
   }
   ```

2. **Support Multiple Runtime Versions**
   - Store available Python versions: `/opt/attune/runtimes/python-3.9/`, `.../python-3.11/`
   - Pack declares its required version in metadata: `"runtime_version": "3.11"`
   - Worker selects the appropriate runtime based on pack requirements

3. **Decouple Attune System from Action Execution**
   - Attune services (API, executor, worker) remain in Rust - no Python coupling
   - Actions run in isolated environments
   - Clear boundary: Attune communicates with actions only via stdin/stdout/env/files

**DESIGN PRINCIPLE:**

> "Upgrading Attune system dependencies should NEVER break existing user actions."

---

## 5. INSECURE SECRET PASSING 🔴 CRITICAL SECURITY ISSUE

### StackStorm Problem

- Secrets passed as environment variables or CLI arguments
- Visible to all users with login access via `ps`, `/proc/{pid}/environ`
- Major security vulnerability

### Current Attune Status: **VULNERABLE**

#### Problems Identified

**Problem 5.1: Secrets Exposed in Environment Variables**

```rust
// In crates/worker/src/secrets.rs:142
pub fn prepare_secret_env(&self, secrets: &HashMap<String, String>) -> HashMap<String, String> {
    secrets
        .iter()
        .map(|(name, value)| {
            let env_name = format!("SECRET_{}", name.to_uppercase().replace('-', "_"));
            (env_name, value.clone()) // ← EXPOSED IN PROCESS ENV!
        })
        .collect()
}

// In crates/worker/src/executor.rs:228
env.extend(secret_env); // ← Secrets added to env vars
```

**Problem 5.2: Secrets Visible in Process Table**

```rust
// In crates/worker/src/runtime/python.rs:122
let mut cmd = Command::new(&self.python_path);
cmd.arg("-c").arg(&script)
    .stdin(Stdio::null()); // ← NOT USING STDIN!
// ...

for (key, value) in env {
    cmd.env(key, value); // ← Secrets visible via /proc/{pid}/environ
}
```

**Problem 5.3: Parameters Also Exposed (Lower Risk)**

```rust
// In crates/worker/src/runtime/shell.rs:49
for (key, value) in &context.parameters {
    script.push_str(&format!(
        "export PARAM_{}='{}'\n", // ← Parameters visible in env
        key.to_uppercase(),
        value_str
    ));
}
```

### Security Impact

- **HIGH**: Any user with shell access can view secrets via:
  - `ps auxwwe` - shows environment variables
  - `cat /proc/{pid}/environ` - shows the full environment
  - `strings /proc/{pid}/environ` - extracts secret values
- **MEDIUM**: Short-lived processes reduce the exposure window, but the process is still vulnerable while it runs

### Recommendations

**CRITICAL (Must Fix Before v1.0):**

1. **Pass Secrets via Stdin (Preferred Method)**

   ```rust
   // Enhanced approach for python.rs
   async fn execute_python_code(
       &self,
       script: String,
       secrets: &HashMap<String, String>,
       parameters: &HashMap<String, serde_json::Value>,
       env: &HashMap<String, String>, // Only non-secret env vars
       timeout_secs: Option<u64>,
   ) -> RuntimeResult {
       // Serialize secrets and parameters as JSON
       let secrets_json = serde_json::to_string(&serde_json::json!({
           "secrets": secrets,
           "parameters": parameters,
       }))?;

       let mut cmd = Command::new(&self.python_path);
       cmd.arg("-c").arg(&script)
           .stdin(Stdio::piped()) // ← Use stdin!
           .stdout(Stdio::piped())
           .stderr(Stdio::piped());

       // Only add non-secret env vars
       for (key, value) in env {
           if !key.starts_with("SECRET_") {
               cmd.env(key, value);
           }
       }

       let mut child = cmd.spawn()?;

       // Write secrets to stdin and close it
       if let Some(mut stdin) = child.stdin.take() {
           stdin.write_all(secrets_json.as_bytes()).await?;
           drop(stdin); // Close stdin
       }

       let output = child.wait_with_output().await?;
       // ...
   }
   ```

2. **Alternative: Use Temporary Secret Files**

   ```rust
   // Create a secure temporary file (0600 permissions)
   let secrets_file = format!("/tmp/attune-secrets-{}-{}.json", execution_id, uuid::Uuid::new_v4());
   let mut file = OpenOptions::new()
       .create_new(true)
       .write(true)
       .mode(0o600) // Read/write for owner only
       .open(&secrets_file).await?;
   file.write_all(serde_json::to_string(secrets)?.as_bytes()).await?;
   file.sync_all().await?;
   drop(file);

   // Pass the file path via env (not the secrets themselves)
   cmd.env("ATTUNE_SECRETS_FILE", &secrets_file);

   // Clean up after execution
   tokio::fs::remove_file(&secrets_file).await?;
   ```

3. **Update Python Wrapper Script**

   ```python
   # Modified wrapper script generator
   def main():
       import sys, json
       # Read secrets and parameters from stdin
       input_data = json.load(sys.stdin)
       secrets = input_data.get('secrets', {})
       parameters = input_data.get('parameters', {})
       # Secrets available in code but not in the environment
       # ...
   ```

4. **Document Secure Secret Access Pattern**
   - Create `docs/secure-secret-handling.md`
   - Provide action templates that read from stdin
   - Add a security best practices guide for pack developers

**IMPLEMENTATION PRIORITY: IMMEDIATE**

- This is a security vulnerability that must be fixed before any production use
- Should be addressed in Phase 3 (Worker Service completion)

---

## 6. STDERR DATABASE STORAGE CAUSING FAILURES ⚠️ MODERATE ISSUE

### StackStorm Problem

- stderr output stored directly in the database
- Excessive logging can exceed database field limits
- Jobs fail unexpectedly due to log size

### Current Attune Status: **GOOD APPROACH, NEEDS LIMITS**

#### What's Good

✅ **Attune uses filesystem storage for logs**

```rust
// In crates/worker/src/artifacts.rs:72
pub async fn store_logs(
    &self,
    execution_id: i64,
    stdout: &str,
    stderr: &str,
) -> Result<Vec<PathBuf>> {
    // Stores to files: /tmp/attune/artifacts/execution_{id}/stdout.log
    //                  /tmp/attune/artifacts/execution_{id}/stderr.log
    // NOT stored in the database!
}
```

✅ **Database only stores the result JSON**

```rust
// In crates/worker/src/executor.rs:331
let input = UpdateExecutionInput {
    status: Some(ExecutionStatus::Completed),
    result: result.result.clone(), // ← Only the structured result, not logs
    executor: None,
};
```

#### Problems Identified

**Problem 6.1: No Size Limits on Log Files**

```rust
// In artifacts.rs - no size checks!
file.write_all(stdout.as_bytes()).await?; // ← Could be gigabytes!
```

**Problem 6.2: No Log Rotation**

- Single file per execution
- If an action produces GBs of logs, the file grows unbounded
- Could fill the disk

**Problem 6.3: In-Memory Log Collection**

```rust
// In python.rs and shell.rs
let output = execution_future.await?;
let stdout = String::from_utf8_lossy(&output.stdout).to_string(); // ← ALL in memory!
let stderr = String::from_utf8_lossy(&output.stderr).to_string();
```

- If an action produces 1 GB of output, the worker could OOM

### Recommendations

**HIGH PRIORITY (Before Production):**

1. **Implement Streaming Log Collection**

   ```rust
   // Replace `.output()` with a streaming approach
   use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};

   async fn execute_with_streaming_logs(
       &self,
       mut cmd: Command,
       execution_id: i64,
       max_log_size: usize, // e.g., 10MB
   ) -> RuntimeResult {
       let mut child = cmd.spawn()?;

       // Stream stdout to a file with a size limit
       if let Some(stdout) = child.stdout.take() {
           let reader = BufReader::new(stdout);
           let mut lines = reader.lines();
           let mut total_size = 0;
           let mut log_file = /* open stdout.log */;
           while let Some(line) = lines.next_line().await? {
               total_size += line.len();
               if total_size > max_log_size {
                   // Truncate and add a warning
                   let msg = format!("\n[TRUNCATED: Log exceeded {}MB]", max_log_size / 1024 / 1024);
                   log_file.write_all(msg.as_bytes()).await?;
                   break;
               }
               log_file.write_all(line.as_bytes()).await?;
               log_file.write_all(b"\n").await?;
           }
       }
       // Similar for stderr
       // ...
   }
   ```

2. **Add Configuration Limits**

   ```yaml
   # config.yaml
   worker:
     log_limits:
       max_stdout_size: 10485760  # 10MB
       max_stderr_size: 10485760  # 10MB
       max_total_size: 20971520   # 20MB
       truncate_on_exceed: true
   ```

3. **Implement Log Rotation Per Execution**

   ```
   /var/lib/attune/artifacts/
     execution_123/
       stdout.0.log   (first 10MB)
       stdout.1.log   (next 10MB)
       stdout.2.log   (final chunk)
       stderr.0.log
       result.json
   ```

4. **Add Log Streaming API Endpoint**
   - API endpoint: `GET /api/v1/executions/{id}/logs/stdout?follow=true`
   - Stream logs to the client as the execution progresses
   - Similar to `docker logs --follow`

**MEDIUM PRIORITY (v1.1):**

5. **Implement Log Compression**
   - Compress logs after execution completes
   - Save disk space for long-term retention
   - Decompress on demand for viewing

---

## 7. POLICY EXECUTION ORDERING 🔴 CRITICAL ISSUE

### Problem Statement

When multiple executions are delayed due to policy enforcement (e.g., concurrency limits), there is no guaranteed ordering for when they will be scheduled once resources become available.
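To make the desired fairness property concrete before diving into the current implementation, here is a minimal, self-contained sketch of a FIFO "turnstile" built on std-library primitives (`Mutex` + `Condvar`) rather than the tokio types the executor would actually use. `FifoLimiter` and `run_demo` are hypothetical names for illustration, not existing Attune code: a waiter proceeds only when it is at the head of the queue *and* a slot is free, which is exactly the guarantee missing today.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

struct State {
    waiting: VecDeque<u64>, // FIFO queue of waiting execution IDs
    running: usize,         // currently running executions
    limit: usize,           // concurrency limit
}

/// Minimal FIFO turnstile: slots are granted strictly in arrival order.
pub struct FifoLimiter {
    state: Mutex<State>,
    cond: Condvar,
}

impl FifoLimiter {
    pub fn new(limit: usize) -> Self {
        Self {
            state: Mutex::new(State { waiting: VecDeque::new(), running: 0, limit }),
            cond: Condvar::new(),
        }
    }

    /// Block until this execution is at the front of the queue AND a slot is free.
    pub fn acquire(&self, id: u64) {
        let mut s = self.state.lock().unwrap();
        s.waiting.push_back(id);
        while !(s.waiting.front() == Some(&id) && s.running < s.limit) {
            s = self.cond.wait(s).unwrap();
        }
        s.waiting.pop_front();
        s.running += 1;
    }

    /// Free a slot and wake all waiters; only the head of the queue proceeds.
    pub fn release(&self) {
        let mut s = self.state.lock().unwrap();
        s.running -= 1;
        self.cond.notify_all();
    }
}

/// Three executions arrive 100 ms apart with limit = 1; returns their start order.
fn run_demo() -> Vec<u64> {
    let limiter = Arc::new(FifoLimiter::new(1));
    let order = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (1u64..=3)
        .map(|id| {
            let (limiter, order) = (Arc::clone(&limiter), Arc::clone(&order));
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(id * 100)); // staggered arrival
                limiter.acquire(id);
                order.lock().unwrap().push(id);
                thread::sleep(Duration::from_millis(400)); // simulated work
                limiter.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    Arc::try_unwrap(order).unwrap().into_inner().unwrap()
}

fn main() {
    assert_eq!(run_demo(), vec![1, 2, 3]); // FIFO preserved
    println!("FIFO order preserved: [1, 2, 3]");
}
```

Even though every `release` wakes all waiters, the head-of-queue check makes only the earliest arrival proceed; later waiters loop back to sleep. The same check appears in the tokio-based queue design recommended later in this section.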
### Current Implementation Status: **MISSING CRITICAL FEATURE**

#### What Exists

✅ **Policy enforcement framework**

```rust
// In crates/executor/src/policy_enforcer.rs:428
pub async fn wait_for_policy_compliance(
    &self,
    action_id: Id,
    pack_id: Option<Id>,
    max_wait_seconds: u32,
) -> Result<bool> {
    // Polls until policies allow execution
    // BUT: No queue management!
}
```

✅ **Concurrency and rate limiting**

```rust
// Can detect when limits are exceeded
PolicyViolation::ConcurrencyLimitExceeded { limit: 5, current_count: 7 }
```

#### Problems Identified

**Problem 7.1: Non-Deterministic Scheduling Order**

**Scenario:**

```
Action has concurrency limit: 2

Time 0: E1 requested → starts (slot 1/2)
Time 1: E2 requested → starts (slot 2/2)
Time 2: E3 requested → DELAYED (no slots)
Time 3: E4 requested → DELAYED (no slots)
Time 4: E5 requested → DELAYED (no slots)
Time 5: E1 completes → which delayed execution runs?

Current behavior:  UNDEFINED ORDER (possibly E5, then E3, then E4)
Expected behavior: FIFO - E3, then E4, then E5
```

**Problem 7.2: No Queue Data Structure**

```rust
// Current implementation in policy_enforcer.rs
// Only polls for compliance - no queue!
loop {
    if self.check_policies(action_id, pack_id).await?.is_none() {
        return Ok(true); // ← Just returns true, no coordination
    }
    tokio::time::sleep(Duration::from_secs(1)).await;
}
```

**Problem 7.3: Race Conditions**

- Multiple delayed executions poll simultaneously
- When a slot opens, multiple executions might see it
- First to update wins; the others keep waiting
- No fairness guarantee

**Problem 7.4: No Visibility into the Queue**

- Can't see how many executions are waiting
- Can't see position in the queue
- No way to estimate wait time
- Difficult to debug policy issues

### Business Impact

**Fairness Issues:**

- Later requests might execute before earlier ones
- Violates user expectations (FIFO is standard)
- Unpredictable execution order

**Workflow Dependencies:**

- Workflow step B requested after step A
- Step B might execute before A completes
- Data dependencies violated
- Incorrect results or failures

**Testing/Debugging:**

- Non-deterministic behavior is hard to reproduce
- Integration tests become flaky
- Production issues are difficult to diagnose

**Performance:**

- Polling wastes CPU cycles
- Multiple executions wake up unnecessarily
- Database load from repeated policy checks

### Recommendations

**CRITICAL (Must Fix Before v1.0):**

1. **Implement Per-Action Execution Queue**

   ```rust
   // New file: crates/executor/src/execution_queue.rs
   use std::collections::{HashMap, VecDeque};
   use std::sync::Arc;
   use tokio::sync::{Mutex, Notify};

   /// Manages FIFO queues of delayed executions per action
   pub struct ExecutionQueueManager {
       /// Queue per action_id
       queues: Arc<Mutex<HashMap<i64, ActionQueue>>>,
   }

   struct ActionQueue {
       /// FIFO queue of waiting execution IDs
       waiting: VecDeque<i64>,
       /// Notify when a slot becomes available
       notify: Arc<Notify>,
       /// Current running count
       running_count: u32,
       /// Concurrency limit for this action
       limit: u32,
   }

   impl ExecutionQueueManager {
       /// Enqueue an execution (returns position in queue)
       pub async fn enqueue(&self, action_id: i64, execution_id: i64) -> usize {
           let mut queues = self.queues.lock().await;
           let queue = queues.entry(action_id).or_insert_with(ActionQueue::new);
           queue.waiting.push_back(execution_id);
           queue.waiting.len()
       }

       /// Wait for turn (blocks until this execution can proceed)
       pub async fn wait_for_turn(&self, action_id: i64, execution_id: i64) -> Result<()> {
           loop {
               // Check if it's our turn
               let notify = {
                   let mut queues = self.queues.lock().await;
                   let queue = queues.get_mut(&action_id).unwrap();
                   // Are we at the front AND is there capacity?
                   if queue.waiting.front() == Some(&execution_id)
                       && queue.running_count < queue.limit
                   {
                       // It's our turn!
                       queue.waiting.pop_front();
                       queue.running_count += 1;
                       return Ok(());
                   }
                   queue.notify.clone()
               };
               // Not our turn; wait for a notification
               notify.notified().await;
           }
       }

       /// Mark execution as complete (frees up a slot)
       pub async fn complete(&self, action_id: i64, execution_id: i64) {
           let mut queues = self.queues.lock().await;
           if let Some(queue) = queues.get_mut(&action_id) {
               queue.running_count = queue.running_count.saturating_sub(1);
               queue.notify.notify_one(); // Wake the next waiting execution
           }
       }

       /// Get queue stats for monitoring
       pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats {
           let queues = self.queues.lock().await;
           if let Some(queue) = queues.get(&action_id) {
               QueueStats {
                   waiting: queue.waiting.len(),
                   running: queue.running_count as usize,
                   limit: queue.limit as usize,
               }
           } else {
               QueueStats::default()
           }
       }
   }
   ```

2. **Integrate with PolicyEnforcer**

   ```rust
   // Update policy_enforcer.rs
   pub struct PolicyEnforcer {
       pool: PgPool,
       queue_manager: Arc<ExecutionQueueManager>, // ← NEW
       // ... existing fields
   }

   pub async fn enforce_and_wait(
       &self,
       action_id: Id,
       execution_id: Id,
       pack_id: Option<Id>,
   ) -> Result<()> {
       // Check if a policy would be violated
       if let Some(violation) = self.check_policies(action_id, pack_id).await? {
           match violation {
               PolicyViolation::ConcurrencyLimitExceeded { .. } => {
                   // Enqueue and wait for turn
                   let position = self.queue_manager.enqueue(action_id, execution_id).await;
                   info!("Execution {} queued at position {}", execution_id, position);
                   self.queue_manager.wait_for_turn(action_id, execution_id).await?;
                   info!("Execution {} proceeding after queue wait", execution_id);
               }
               _ => {
                   // Other policy types: retry with backoff
                   self.retry_with_backoff(action_id, pack_id).await?;
               }
           }
       }
       Ok(())
   }
   ```

3. **Update Scheduler to Use the Queue**

   ```rust
   // In scheduler.rs
   async fn process_execution_requested(
       pool: &PgPool,
       publisher: &Publisher,
       policy_enforcer: &PolicyEnforcer, // ← NEW parameter
       envelope: &MessageEnvelope,
   ) -> Result<()> {
       let execution_id = envelope.payload.execution_id;
       let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
       let action = Self::get_action_for_execution(pool, &execution).await?;

       // Enforce policies with queueing
       policy_enforcer.enforce_and_wait(
           action.id,
           execution_id,
           Some(action.pack),
       ).await?;

       // Now proceed with scheduling
       let worker = Self::select_worker(pool, &action).await?;
       // ...
   }
   ```

4. **Add Completion Notification**

   ```rust
   // The worker must notify when an execution completes
   // In worker/src/executor.rs
   async fn handle_execution_success(
       &self,
       execution_id: i64,
       action_id: i64,
       result: &ExecutionResult,
   ) -> Result<()> {
       // Update the database
       ExecutionRepository::update(...).await?;

       // Notify the queue manager (via message queue)
       let payload = ExecutionCompletedPayload {
           execution_id,
           action_id,
           status: ExecutionStatus::Completed,
       };
       self.publisher.publish("execution.completed", payload).await?;
       Ok(())
   }
   ```

5. **Add Queue Monitoring API**

   ```rust
   // New endpoint in the API service
   /// GET /api/v1/actions/:id/queue-stats
   async fn get_action_queue_stats(
       State(state): State<Arc<AppState>>,
       Path(action_id): Path<i64>,
   ) -> Result<Json<ApiResponse<QueueStats>>> {
       let stats = state.queue_manager.get_queue_stats(action_id).await;
       Ok(Json(ApiResponse::success(stats)))
   }

   #[derive(Serialize)]
   pub struct QueueStats {
       pub waiting: usize,
       pub running: usize,
       pub limit: usize,
       pub avg_wait_time_seconds: Option<f64>,
   }
   ```

**IMPLEMENTATION PRIORITY: CRITICAL**

- This affects the correctness and fairness of the system
- Must be implemented before production use
- Should be addressed in Phase 3 (Executor Service completion)

### Testing Requirements

**Unit Tests:**

- [ ] Queue maintains FIFO order
- [ ] Multiple executions enqueue correctly
- [ ] Dequeue happens in order
- [ ] Notify wakes the correct waiting execution
- [ ] Concurrent enqueue/dequeue operations are safe

**Integration Tests:**

- [ ] End-to-end execution ordering with policies
- [ ] Three executions with limit=1 execute in order
- [ ] Queue stats reflect actual state
- [ ] Worker completion notification releases the queue slot

**Load Tests:**

- [ ] 1000 concurrent delayed executions
- [ ] Correct ordering maintained under load
- [ ] No missed notifications or deadlocks

---

## Summary of Critical Issues

| Issue | Severity | Status | Must Fix Before v1.0 |
|-------|----------|--------|----------------------|
| 1. Action Coupling | ✅ Good | Avoided | No |
| 2. Type Safety | ✅ Excellent | Avoided | No |
| 3. Language Ecosystems | ⚠️ Moderate | Partial | **Yes** - Implement pack installation |
| 4. Dependency Hell | 🔴 Critical | Vulnerable | **Yes** - Implement venv isolation |
| 5. Secret Security | 🔴 Critical | Vulnerable | **Yes** - Use stdin/files for secrets |
| 6. Log Storage | ⚠️ Moderate | Good Design | **Yes** - Add size limits |
| 7. Policy Execution Order | 🔴 Critical | Missing | **Yes** - Implement FIFO queue |

---

## Recommended Implementation Order

### Phase 1: Security & Correctness Fixes (Sprint 1 - Weeks 1-3)

**Priority: CRITICAL - Block All Other Work**

1. Fix the secret passing vulnerability (Issue 5)
   - Implement stdin-based secret injection
   - Remove secrets from environment variables
   - Update Python/Shell runtime wrappers
   - Add security documentation

2. Implement execution queue for policies (Issue 7) **NEW**
   - FIFO queue per action
   - Notify mechanism for slot availability
   - Integration with PolicyEnforcer
   - Queue monitoring API

### Phase 2: Runtime Isolation (Sprint 2 - Weeks 4-5)

**Priority: HIGH - Required for Production**

3. Implement per-pack virtual environments (Issue 4)
   - Python venv creation per pack
   - Dependency installation service
   - Runtime version management

4. Add pack installation service (Issue 3)
   - Pack setup/teardown lifecycle
   - Dependency resolution
   - Installation status tracking

### Phase 3: Operational Hardening (Sprint 3 - Weeks 6-7)

**Priority: MEDIUM - Quality of Life**

5. Implement log size limits (Issue 6)
   - Streaming log collection
   - Size-based truncation
   - Configuration options

6. Add log rotation and compression
   - Multi-file logs
   - Automatic compression
   - Retention policies

### Phase 4: Advanced Features (v1.1+)

**Priority: LOW - Future Enhancement**

7. Container-based runtimes
8. Multi-version runtime support
9. Advanced dependency management
10. Log streaming API
11. Pack marketplace/registry

---

## Testing Checklist

Before marking issues as resolved, verify:

### Issue 5 (Secret Security)

- [ ] Secrets not visible in `ps auxwwe`
- [ ] Secrets not readable from `/proc/{pid}/environ`
- [ ] Actions can successfully read secrets from stdin/file
- [ ] Python wrapper script reads secrets securely
- [ ] Shell wrapper script reads secrets securely
- [ ] Documentation updated with secure patterns

### Issue 7 (Policy Execution Order) **NEW**

- [ ] Execution queue maintains FIFO order
- [ ] Three executions with limit=1 execute in the correct order
- [ ] Queue stats API returns accurate counts
- [ ] Worker completion notification releases the queue slot
- [ ] No race conditions under concurrent load
- [ ] Correct ordering with 1000 delayed executions

### Issue 4 (Dependency Isolation)

- [ ] Each pack gets an isolated venv
- [ ] Installing pack A's dependencies doesn't affect pack B
- [ ] Upgrading the system Python doesn't break existing packs
- [ ] Runtime version can be specified per pack
- [ ] Multiple Python versions can coexist

### Issue 3 (Language Support)

- [ ] Python packs can declare dependencies in metadata
- [ ] `pip install` runs during pack installation
- [ ] Node.js packs supported with `npm install`
- [ ] Pack installation status tracked
- [ ] Failed installations reported with logs

### Issue 6 (Log Limits)

- [ ] Logs truncated at the configured size limit
- [ ] Worker doesn't OOM on large output
- [ ] Truncation is clearly marked in the logs
- [ ] Multiple log files created for rotation
- [ ] Old logs cleaned up per retention policy

---

## Architecture Decision Records

### ADR-001: Use Stdin for Secret Injection

**Decision:** Pass secrets via stdin as JSON instead of environment variables.
**Rationale:**

- Environment variables are visible in `/proc/{pid}/environ`
- stdin content is not exposed to other processes
- Follows the principle of least privilege
- Industry best practice (used by Kubernetes, HashiCorp Vault)

**Consequences:**

- Requires wrapper script modifications
- Actions must explicitly read from stdin
- Slight increase in complexity
- **Major security improvement**

### ADR-002: Per-Pack Virtual Environments

**Decision:** Each pack gets an isolated Python virtual environment.

**Rationale:**

- Prevents dependency conflicts between packs
- Allows different Python versions per pack
- Protects against system Python upgrades
- Standard practice in the Python ecosystem

**Consequences:**

- Increased disk usage (one venv per pack)
- Pack installation takes longer
- Worker must manage the venv lifecycle
- **Eliminates dependency hell**

### ADR-003: Filesystem-Based Log Storage

**Decision:** Store logs in the filesystem, not the database.

**Rationale:**

- The database is not designed for large blob storage
- The filesystem handles large files efficiently
- Easy to implement rotation and compression
- Can stream logs without loading the entire file

**Consequences:**

- Logs are separate from structured execution data
- Need a backup strategy for the log directory
- Cleanup/retention requires a separate process
- **Avoids database bloat and failures**

---

## References

- StackStorm Lessons Learned: `work-summary/StackStorm-Lessons-Learned.md`
- Current Worker Implementation: `crates/worker/src/`
- Runtime Abstraction: `crates/worker/src/runtime/`
- Secret Management: `crates/worker/src/secrets.rs`
- Artifact Storage: `crates/worker/src/artifacts.rs`
- Database Schema: `migrations/20240101000004_create_runtime_worker.sql`

---

## Next Steps

1. **Review this analysis with the team** - Discuss priorities and timeline
2. **Create GitHub issues** - One issue per critical problem
3. **Update TODO.md** - Add tasks from the Implementation Order section
4. **Begin Phase 1** - Security fixes first, before any other work
5. **Schedule a security review** - After Phase 1 completion

---

**Document Status:** Complete - Ready for Review
**Author:** AI Assistant
**Reviewers Needed:** Security Team, Architecture Team, DevOps Lead