StackStorm Pitfalls Analysis: Current Implementation Review
Date: 2024-01-02
Status: Analysis Complete - Action Items Identified
Executive Summary
This document analyzes the current Attune implementation against the StackStorm lessons learned to identify replicated pitfalls and propose solutions. The analysis reveals 3 critical issues and 2 moderate concerns that need to be addressed before production deployment.
1. HIGH COUPLING WITH CUSTOM ACTIONS ✅ AVOIDED
StackStorm Problem
- Custom actions are tightly coupled to st2 services
- Minimal documentation around action/sensor service interfaces
- Actions must import st2 libraries and inherit from st2 classes
Current Attune Status: GOOD
- ✅ Actions are executed as standalone processes via `tokio::process::Command`
- ✅ No Attune-specific imports or base classes required
- ✅ Runtime abstraction layer in `worker/src/runtime/` is well-designed
- ✅ Actions receive data via environment variables and stdin (code execution)
Recommendations
- Keep current approach - the runtime abstraction is solid
- Consider documenting the runtime interface contract for pack developers
- Add examples of "pure" Python/Shell/Node.js actions that work without any Attune dependencies
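To make that contract concrete, here is a minimal sketch of invoking a "pure" action as a standalone process: parameters go in via `PARAM_*` environment variables, results come out via stdout. It uses blocking `std::process::Command` for brevity (the worker itself uses `tokio::process::Command` asynchronously), and `run_pure_action` is an illustrative name, not an existing API:

```rust
use std::process::Command;

// Invoke an action as a plain OS process: parameters as env vars in,
// structured output on stdout out. The action needs no Attune imports.
fn run_pure_action(entry_point: &str, params: &[(&str, &str)]) -> std::io::Result<String> {
    let mut cmd = Command::new("sh");
    cmd.arg("-c").arg(entry_point);
    for (key, value) in params {
        // Same PARAM_* convention the shell runtime already uses.
        cmd.env(format!("PARAM_{}", key.to_uppercase()), value);
    }
    let output = cmd.output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() -> std::io::Result<()> {
    // A "pure" shell action: reads its parameter from the environment,
    // writes its result to stdout -- no Attune dependencies at all.
    let out = run_pure_action("echo \"hello, $PARAM_NAME\"", &[("name", "world")])?;
    println!("{}", out.trim()); // prints "hello, world"
    Ok(())
}
```

The same contract carries over unchanged to Python or Node.js actions: the action only reads its environment (or stdin) and writes to stdout.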
2. TYPE SAFETY AND DOCUMENTATION ✅ AVOIDED
StackStorm Problem
- Python with minimal type hints
- Runtime property injection makes types hard to determine
- Poor documentation of service interfaces
Current Attune Status: EXCELLENT
- ✅ Built in Rust with full compile-time type checking
- ✅ All models in `common/src/models.rs` are strongly typed with SQLx
- ✅ Clear type definitions for `ExecutionContext`, `ExecutionResult`, and `RuntimeError`
- ✅ Repository pattern enforces type contracts
Recommendations
- No changes needed - Rust's type system provides the safety we need
- Continue documenting public APIs in the `docs/` folder
- Consider generating OpenAPI specs from Axum routes for external consumers
3. LIMITED LANGUAGE ECOSYSTEM SUPPORT ⚠️ PARTIALLY ADDRESSED
StackStorm Problem
- Only Python packs natively supported
- Other languages require custom installation logic
- No standard way to declare dependencies per language ecosystem
Current Attune Status: NEEDS WORK
What's Good
- ✅ Runtime abstraction supports multiple languages (Python, Shell, Node.js planned)
- ✅ `Pack` model has a `runtime_deps: Vec<String>` field for dependencies
- ✅ `Runtime` table has `distributions` JSONB and `installation` JSONB fields
Problems Identified
Problem 3.1: No Dependency Installation Implementation
```rust
// In crates/common/src/models.rs
pub struct Pack {
    // ...
    pub runtime_deps: Vec<String>, // ← DEFINED BUT NOT USED
    // ...
}

pub struct Runtime {
    // ...
    pub distributions: JsonDict, // ← NO INSTALLATION LOGIC
    pub installation: Option<JsonDict>,
    // ...
}
```
Problem 3.2: No Pack Installation/Setup Service
- No code exists to process the `runtime_deps` field
- No integration with pip, npm, cargo, etc.
- No isolation of dependencies between packs
Problem 3.3: Runtime Detection is Naive
```rust
// In crates/worker/src/runtime/python.rs:279
fn can_execute(&self, context: &ExecutionContext) -> bool {
    // Only checks file extension - doesn't verify runtime availability
    context.action_ref.contains(".py")
        || context.entry_point.ends_with(".py")
    // ...
}
```
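A sketched fix, for illustration: gate the extension check on an actual probe of the interpreter. The free-function shape and the `python_is_available` helper are assumptions here, not the current trait method:

```rust
use std::path::Path;
use std::process::Command;

// Probe that the interpreter exists and actually runs, instead of
// trusting the file extension alone. (`python_is_available` and this
// free-function form are illustrative, not the current API.)
fn python_is_available(python_path: &Path) -> bool {
    Command::new(python_path)
        .arg("--version")
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

fn can_execute(entry_point: &str, python_path: &Path) -> bool {
    // The extension match is necessary but no longer sufficient.
    entry_point.ends_with(".py") && python_is_available(python_path)
}

fn main() {
    // A matching extension with a missing interpreter must be rejected.
    assert!(!can_execute("actions/hello.py", Path::new("/nonexistent/python3")));
}
```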
Recommendations
IMMEDIATE (Before Production):
1. Implement Pack Installation Service
   - Create an `attune-packman` service or add to `attune-api`
   - Support installing Python deps via `pip install -r requirements.txt`
   - Support installing Node.js deps via `npm install`
   - Store pack code in isolated directories: `/var/lib/attune/packs/{pack_ref}/`

2. Enhance Runtime Model
   - Add an `installation_status` enum: `not_installed`, `installing`, `installed`, `failed`
   - Add an `installed_at` timestamp
   - Add an `installation_log` field for troubleshooting

3. Implement Dependency Isolation
   - Python: use a `venv` per pack in `/var/lib/attune/packs/{pack_ref}/.venv/`
   - Node.js: use a local `node_modules` per pack
   - Document in the pack schema how to declare dependencies
FUTURE (v2.0):

4. Container-based Runtime
- Each pack gets its own container image
- Dependencies baked into image
- Complete isolation from Attune system
4. DEPENDENCY HELL AND SYSTEM COUPLING 🔴 CRITICAL ISSUE
StackStorm Problem
- st2 services run on Python 2.7/3.6 (EOL)
- Upgrading st2 system breaks user actions
- User actions are coupled to st2 Python version
Current Attune Status: VULNERABLE
Problems Identified
Problem 4.1: Shared System Python Runtime
```rust
// In crates/worker/src/runtime/python.rs:19
pub fn new() -> Self {
    Self {
        python_path: PathBuf::from("python3"), // ← SYSTEM PYTHON!
        // ...
    }
}
```
- Currently uses the system-wide `python3`
- If Attune upgrades system Python, user actions may break
- No version pinning or isolation
Problem 4.2: No Runtime Version Management
- No way to specify Python 3.9 vs 3.11 vs 3.12
- Runtime table has a `name` field, but it's not used for version selection
- Shell runtime hardcoded to `/bin/bash`
Problem 4.3: Attune System Dependencies Could Conflict
- If Attune worker needs a Python library (e.g., for parsing), it could conflict with action deps
- No separation between "Attune system dependencies" and "action dependencies"
Recommendations
CRITICAL (Must Fix Before v1.0):
1. Implement Per-Pack Virtual Environments

```rust
// Pseudocode for python.rs enhancement
pub struct PythonRuntime {
    python_path: PathBuf,           // System python3 for venv creation
    venv_base: PathBuf,             // /var/lib/attune/packs/
    default_python_version: String, // "3.11"
}

impl PythonRuntime {
    async fn get_or_create_venv(&self, pack_ref: &str) -> Result<PathBuf> {
        let venv_path = self.venv_base.join(pack_ref).join(".venv");
        if !venv_path.exists() {
            self.create_venv(&venv_path).await?;
            self.install_pack_deps(pack_ref, &venv_path).await?;
        }
        Ok(venv_path.join("bin/python"))
    }
}
```

2. Support Multiple Runtime Versions
   - Store available Python versions: `/opt/attune/runtimes/python-3.9/`, `.../python-3.11/`
   - Pack declares required version in metadata: `"runtime_version": "3.11"`
   - Worker selects appropriate runtime based on pack requirements
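Under that layout, version selection can be as small as a path lookup. This is a sketch under assumed conventions (`/opt/attune/runtimes/python-{version}/bin/python` and a `runtime_version` metadata field), not existing code:

```rust
use std::path::PathBuf;

// Map a pack's requested runtime version to an interpreter path, falling
// back to a system-wide default when the pack pins nothing. The base
// directory layout and parameter names are assumptions for illustration.
fn select_python(runtime_base: &str, requested: Option<&str>, default: &str) -> PathBuf {
    let version = requested.unwrap_or(default);
    PathBuf::from(runtime_base)
        .join(format!("python-{}", version))
        .join("bin/python")
}

fn main() {
    // One pack pins 3.11; another omits the field and gets the default.
    let pinned = select_python("/opt/attune/runtimes", Some("3.11"), "3.12");
    let fallback = select_python("/opt/attune/runtimes", None, "3.12");
    assert_eq!(pinned, PathBuf::from("/opt/attune/runtimes/python-3.11/bin/python"));
    assert_eq!(fallback, PathBuf::from("/opt/attune/runtimes/python-3.12/bin/python"));
}
```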
3. Decouple Attune System from Action Execution
- Attune services (API, executor, worker) remain in Rust - no Python coupling
- Actions run in isolated environments
- Clear boundary: Attune communicates with actions only via stdin/stdout/env/files
DESIGN PRINCIPLE:
"Upgrading Attune system dependencies should NEVER break existing user actions."
5. INSECURE SECRET PASSING 🔴 CRITICAL SECURITY ISSUE
StackStorm Problem
- Secrets passed as environment variables or CLI arguments
- Visible to all users with login access via `ps` and `/proc/{pid}/environ`
- Major security vulnerability
Current Attune Status: VULNERABLE
Problems Identified
Problem 5.1: Secrets Exposed in Environment Variables
```rust
// In crates/worker/src/secrets.rs:142
pub fn prepare_secret_env(&self, secrets: &HashMap<String, String>)
    -> HashMap<String, String>
{
    secrets
        .iter()
        .map(|(name, value)| {
            let env_name = format!("SECRET_{}", name.to_uppercase().replace('-', "_"));
            (env_name, value.clone()) // ← EXPOSED IN PROCESS ENV!
        })
        .collect()
}

// In crates/worker/src/executor.rs:228
env.extend(secret_env); // ← Secrets added to env vars
```
Problem 5.2: Secrets Visible in Process Table
```rust
// In crates/worker/src/runtime/python.rs:122
let mut cmd = Command::new(&self.python_path);
cmd.arg("-c").arg(&script)
    .stdin(Stdio::null()); // ← NOT USING STDIN!
// ...
for (key, value) in env {
    cmd.env(key, value); // ← Secrets visible via /proc/{pid}/environ
}
```
Problem 5.3: Parameters Also Exposed (Lower Risk)
```rust
// In crates/worker/src/runtime/shell.rs:49
for (key, value) in &context.parameters {
    script.push_str(&format!(
        "export PARAM_{}='{}'\n", // ← Parameters visible in env
        key.to_uppercase(),
        value_str
    ));
}
```
Security Impact
- HIGH: Any user with shell access can view secrets via:
  - `ps auxwwe` - shows environment variables
  - `cat /proc/{pid}/environ` - shows the full environment
  - `strings /proc/{pid}/environ` - extracts secret values
- MEDIUM: Short-lived processes reduce exposure window, but still vulnerable
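The exposure is easy to reproduce. This Linux-only snippet spawns a child the way the worker currently does (secret in the environment) and reads it straight back through `/proc/{pid}/environ`, as any same-user process could:

```rust
use std::fs;
use std::process::Command;
use std::thread;
use std::time::Duration;

fn main() {
    // Spawn a child with a "secret" in its environment -- exactly what
    // the worker currently does for actions.
    let mut child = Command::new("sleep")
        .arg("2")
        .env("SECRET_API_KEY", "hunter2")
        .spawn()
        .expect("spawn sleep");
    thread::sleep(Duration::from_millis(100));

    // Any process running as the same user (or root) can now read it back.
    let environ = fs::read(format!("/proc/{}/environ", child.id())).expect("read environ");
    let text = String::from_utf8_lossy(&environ);
    assert!(text.contains("SECRET_API_KEY=hunter2"));

    child.kill().ok();
    child.wait().ok();
}
```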
Recommendations
CRITICAL (Must Fix Before v1.0):
1. Pass Secrets via Stdin (Preferred Method)

```rust
// Enhanced approach for python.rs
use tokio::io::AsyncWriteExt;

async fn execute_python_code(
    &self,
    script: String,
    secrets: &HashMap<String, String>,
    parameters: &HashMap<String, serde_json::Value>,
    env: &HashMap<String, String>, // Only non-secret env vars
    timeout_secs: Option<u64>,
) -> RuntimeResult<ExecutionResult> {
    // Serialize secrets and parameters as a single JSON document for stdin
    let secrets_json = serde_json::to_string(&serde_json::json!({
        "secrets": secrets,
        "parameters": parameters,
    }))?;

    let mut cmd = Command::new(&self.python_path);
    cmd.arg("-c").arg(&script)
        .stdin(Stdio::piped()) // ← Use stdin!
        .stdout(Stdio::piped())
        .stderr(Stdio::piped());

    // Only add non-secret env vars
    for (key, value) in env {
        if !key.starts_with("SECRET_") {
            cmd.env(key, value);
        }
    }

    let mut child = cmd.spawn()?;

    // Write secrets to stdin, then close it
    if let Some(mut stdin) = child.stdin.take() {
        stdin.write_all(secrets_json.as_bytes()).await?;
        drop(stdin); // Close stdin
    }

    let output = child.wait_with_output().await?;
    // ...
}
```
2. Alternative: Use Temporary Secret Files

```rust
// Create a secure temporary file (0600 permissions; Unix-only `mode`)
let secrets_file = format!("/tmp/attune-secrets-{}-{}.json", execution_id, uuid::Uuid::new_v4());
let mut file = OpenOptions::new()
    .create_new(true)
    .write(true)
    .mode(0o600) // Read/write for owner only
    .open(&secrets_file)
    .await?;
file.write_all(serde_json::to_string(secrets)?.as_bytes()).await?;
file.sync_all().await?;
drop(file);

// Pass the file path via env (not the secrets themselves)
cmd.env("ATTUNE_SECRETS_FILE", &secrets_file);

// Clean up after execution
tokio::fs::remove_file(&secrets_file).await?;
```
3. Update Python Wrapper Script

```python
# Modified wrapper script generator
def main():
    import sys, json
    # Read secrets and parameters from stdin
    input_data = json.load(sys.stdin)
    secrets = input_data.get('secrets', {})
    parameters = input_data.get('parameters', {})
    # Secrets available in code but not in the environment
    # ...
```
4. Document Secure Secret Access Pattern
   - Create `docs/secure-secret-handling.md`
   - Provide action templates that read from stdin
   - Add a security best practices guide for pack developers
IMPLEMENTATION PRIORITY: IMMEDIATE
- This is a security vulnerability that must be fixed before any production use
- Should be addressed in Phase 3 (Worker Service completion)
6. STDERR DATABASE STORAGE CAUSING FAILURES ⚠️ MODERATE ISSUE
StackStorm Problem
- stderr output stored directly in database
- Excessive logging can exceed database field limits
- Jobs fail unexpectedly due to log size
Current Attune Status: GOOD APPROACH, NEEDS LIMITS
What's Good
✅ Attune uses filesystem storage for logs
```rust
// In crates/worker/src/artifacts.rs:72
pub async fn store_logs(
    &self,
    execution_id: i64,
    stdout: &str,
    stderr: &str,
) -> Result<Vec<Artifact>> {
    // Stores to files: /tmp/attune/artifacts/execution_{id}/stdout.log
    //                  /tmp/attune/artifacts/execution_{id}/stderr.log
    // NOT stored in database!
}
```
✅ Database only stores result JSON
```rust
// In crates/worker/src/executor.rs:331
let input = UpdateExecutionInput {
    status: Some(ExecutionStatus::Completed),
    result: result.result.clone(), // ← Only structured result, not logs
    executor: None,
};
```
Problems Identified
Problem 6.1: No Size Limits on Log Files
```rust
// In artifacts.rs - no size checks!
file.write_all(stdout.as_bytes()).await?; // ← Could be gigabytes!
```
Problem 6.2: No Log Rotation
- Single file per execution
- If action produces GB of logs, file grows unbounded
- Could fill disk
Problem 6.3: In-Memory Log Collection
```rust
// In python.rs and shell.rs
let output = execution_future.await?;
let stdout = String::from_utf8_lossy(&output.stdout).to_string(); // ← ALL in memory!
let stderr = String::from_utf8_lossy(&output.stderr).to_string();
```
- If action produces 1GB of output, worker could OOM
Recommendations
HIGH PRIORITY (Before Production):
1. Implement Streaming Log Collection

```rust
// Replace `.output()` with a streaming approach
use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};

async fn execute_with_streaming_logs(
    &self,
    mut cmd: Command,
    execution_id: i64,
    max_log_size: usize, // e.g., 10MB
) -> RuntimeResult<ExecutionResult> {
    let mut child = cmd.spawn()?;

    // Stream stdout to a file, enforcing the size limit
    if let Some(stdout) = child.stdout.take() {
        let reader = BufReader::new(stdout);
        let mut lines = reader.lines();
        let mut total_size = 0;
        let mut log_file = /* open stdout.log */;
        while let Some(line) = lines.next_line().await? {
            total_size += line.len();
            if total_size > max_log_size {
                // Truncate and append a warning marker
                let note = format!("\n[TRUNCATED: Log exceeded {}MB]", max_log_size / 1024 / 1024);
                log_file.write_all(note.as_bytes()).await?;
                break;
            }
            log_file.write_all(line.as_bytes()).await?;
            log_file.write_all(b"\n").await?;
        }
    }
    // Similar for stderr
    // ...
}
```
2. Add Configuration Limits

```yaml
# config.yaml
worker:
  log_limits:
    max_stdout_size: 10485760  # 10MB
    max_stderr_size: 10485760  # 10MB
    max_total_size: 20971520   # 20MB
    truncate_on_exceed: true
```
3. Implement Log Rotation Per Execution

```
/var/lib/attune/artifacts/
  execution_123/
    stdout.0.log  (first 10MB)
    stdout.1.log  (next 10MB)
    stdout.2.log  (final chunk)
    stderr.0.log
    result.json
```
4. Add Log Streaming API Endpoint
   - API endpoint: `GET /api/v1/executions/{id}/logs/stdout?follow=true`
   - Stream logs to the client as the execution progresses
   - Similar to `docker logs --follow`
MEDIUM PRIORITY (v1.1):
- Implement Log Compression
- Compress logs after execution completes
- Save disk space for long-term retention
- Decompress on-demand for viewing
7. POLICY EXECUTION ORDERING 🔴 CRITICAL ISSUE
Problem Statement
When multiple executions are delayed due to policy enforcement (e.g., concurrency limits), there is no guaranteed ordering for when they will be scheduled once resources become available.
Current Implementation Status: MISSING CRITICAL FEATURE
What Exists
✅ Policy enforcement framework
```rust
// In crates/executor/src/policy_enforcer.rs:428
pub async fn wait_for_policy_compliance(
    &self,
    action_id: Id,
    pack_id: Option<Id>,
    max_wait_seconds: u32,
) -> Result<bool> {
    // Polls until policies allow execution
    // BUT: No queue management!
}
```
✅ Concurrency and rate limiting
```rust
// Can detect when limits are exceeded
PolicyViolation::ConcurrencyLimitExceeded { limit: 5, current_count: 7 }
```
Problems Identified
Problem 7.1: Non-Deterministic Scheduling Order
Scenario: an action with a concurrency limit of 2

```
Time 0: E1 requested → starts (slot 1/2)
Time 1: E2 requested → starts (slot 2/2)
Time 2: E3 requested → DELAYED (no slots)
Time 3: E4 requested → DELAYED (no slots)
Time 4: E5 requested → DELAYED (no slots)
Time 5: E1 completes → which delayed execution runs?
```

Current behavior: UNDEFINED ORDER (possibly E5, then E3, then E4)
Expected behavior: FIFO - E3, then E4, then E5
Problem 7.2: No Queue Data Structure
```rust
// Current implementation in policy_enforcer.rs
// Only polls for compliance - no queue!
loop {
    if self.check_policies(action_id, pack_id).await?.is_none() {
        return Ok(true); // ← Just returns true, no coordination
    }
    tokio::time::sleep(Duration::from_secs(1)).await;
}
```
Problem 7.3: Race Conditions
- Multiple delayed executions poll simultaneously
- When slot opens, multiple executions might see it
- First to update wins, others keep waiting
- No fairness guarantee
Problem 7.4: No Visibility into Queue
- Can't see how many executions are waiting
- Can't see position in queue
- No way to estimate wait time
- Difficult to debug policy issues
Business Impact
Fairness Issues:
- Later requests might execute before earlier ones
- Violates user expectations (FIFO is standard)
- Unpredictable execution order
Workflow Dependencies:
- Workflow step B requested after step A
- Step B might execute before A completes
- Data dependencies violated
- Incorrect results or failures
Testing/Debugging:
- Non-deterministic behavior hard to reproduce
- Integration tests become flaky
- Production issues difficult to diagnose
Performance:
- Polling wastes CPU cycles
- Multiple executions wake up unnecessarily
- Database load from repeated policy checks
Recommendations
CRITICAL (Must Fix Before v1.0):
1. Implement Per-Action Execution Queue

```rust
// New file: crates/executor/src/execution_queue.rs
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::{Mutex, Notify};

/// Manages FIFO queues of delayed executions per action
pub struct ExecutionQueueManager {
    /// Queue per action_id
    queues: Arc<Mutex<HashMap<i64, ActionQueue>>>,
}

struct ActionQueue {
    /// FIFO queue of waiting execution IDs
    waiting: VecDeque<i64>,
    /// Notified when a slot becomes available
    notify: Arc<Notify>,
    /// Current running count
    running_count: u32,
    /// Concurrency limit for this action
    limit: u32,
}

impl ExecutionQueueManager {
    /// Enqueue an execution (returns its position in the queue)
    pub async fn enqueue(&self, action_id: i64, execution_id: i64) -> usize {
        let mut queues = self.queues.lock().await;
        let queue = queues.entry(action_id).or_insert_with(ActionQueue::new);
        queue.waiting.push_back(execution_id);
        queue.waiting.len()
    }

    /// Wait for turn (blocks until this execution can proceed)
    pub async fn wait_for_turn(&self, action_id: i64, execution_id: i64) -> Result<()> {
        loop {
            // Check if it's our turn
            let notify = {
                let mut queues = self.queues.lock().await;
                let queue = queues.get_mut(&action_id).unwrap();
                // Are we at the front AND is there capacity?
                if queue.waiting.front() == Some(&execution_id)
                    && queue.running_count < queue.limit
                {
                    // It's our turn!
                    queue.waiting.pop_front();
                    queue.running_count += 1;
                    return Ok(());
                }
                queue.notify.clone()
            };
            // Not our turn: wait for a notification. The periodic re-check
            // guards against a wakeup racing with the lock release above.
            tokio::select! {
                _ = notify.notified() => {}
                _ = tokio::time::sleep(Duration::from_millis(500)) => {}
            }
        }
    }

    /// Mark an execution as complete (frees up a slot)
    pub async fn complete(&self, action_id: i64, execution_id: i64) {
        let mut queues = self.queues.lock().await;
        if let Some(queue) = queues.get_mut(&action_id) {
            queue.running_count = queue.running_count.saturating_sub(1);
            // Wake all waiters; only the one at the queue front will proceed.
            queue.notify.notify_waiters();
        }
    }

    /// Get queue stats for monitoring
    pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats {
        let queues = self.queues.lock().await;
        if let Some(queue) = queues.get(&action_id) {
            QueueStats {
                waiting: queue.waiting.len(),
                running: queue.running_count as usize,
                limit: queue.limit as usize,
            }
        } else {
            QueueStats::default()
        }
    }
}
```
2. Integrate with PolicyEnforcer

```rust
// Update policy_enforcer.rs
pub struct PolicyEnforcer {
    pool: PgPool,
    queue_manager: Arc<ExecutionQueueManager>, // ← NEW
    // ... existing fields
}

pub async fn enforce_and_wait(
    &self,
    action_id: Id,
    execution_id: Id,
    pack_id: Option<Id>,
) -> Result<()> {
    // Check if a policy would be violated
    if let Some(violation) = self.check_policies(action_id, pack_id).await? {
        match violation {
            PolicyViolation::ConcurrencyLimitExceeded { .. } => {
                // Enqueue and wait for our turn
                let position = self.queue_manager.enqueue(action_id, execution_id).await;
                info!("Execution {} queued at position {}", execution_id, position);
                self.queue_manager.wait_for_turn(action_id, execution_id).await?;
                info!("Execution {} proceeding after queue wait", execution_id);
            }
            _ => {
                // Other policy types: retry with backoff
                self.retry_with_backoff(action_id, pack_id).await?;
            }
        }
    }
    Ok(())
}
```
3. Update Scheduler to Use Queue

```rust
// In scheduler.rs
async fn process_execution_requested(
    pool: &PgPool,
    publisher: &Publisher,
    policy_enforcer: &PolicyEnforcer, // ← NEW parameter
    envelope: &MessageEnvelope<ExecutionRequestedPayload>,
) -> Result<()> {
    let execution_id = envelope.payload.execution_id;
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    let action = Self::get_action_for_execution(pool, &execution).await?;

    // Enforce policies with queueing
    policy_enforcer.enforce_and_wait(
        action.id,
        execution_id,
        Some(action.pack),
    ).await?;

    // Now proceed with scheduling
    let worker = Self::select_worker(pool, &action).await?;
    // ...
}
```
4. Add Completion Notification

```rust
// Worker must notify when an execution completes
// In worker/src/executor.rs
async fn handle_execution_success(
    &self,
    execution_id: i64,
    action_id: i64,
    result: &ExecutionResult,
) -> Result<()> {
    // Update the database
    ExecutionRepository::update(...).await?;

    // Notify the queue manager (via the message queue)
    let payload = ExecutionCompletedPayload {
        execution_id,
        action_id,
        status: ExecutionStatus::Completed,
    };
    self.publisher.publish("execution.completed", payload).await?;

    Ok(())
}
```
5. Add Queue Monitoring API

```rust
// New endpoint in the API service
/// GET /api/v1/actions/:id/queue-stats
async fn get_action_queue_stats(
    State(state): State<Arc<AppState>>,
    Path(action_id): Path<i64>,
) -> Result<Json<ApiResponse<QueueStats>>> {
    let stats = state.queue_manager.get_queue_stats(action_id).await;
    Ok(Json(ApiResponse::success(stats)))
}

#[derive(Serialize)]
pub struct QueueStats {
    pub waiting: usize,
    pub running: usize,
    pub limit: usize,
    pub avg_wait_time_seconds: Option<f64>,
}
```
IMPLEMENTATION PRIORITY: CRITICAL
- This affects correctness and fairness of the system
- Must be implemented before production use
- Should be addressed in Phase 3 (Executor Service completion)
Testing Requirements
Unit Tests:
- Queue maintains FIFO order
- Multiple executions enqueue correctly
- Dequeue happens in order
- Notify wakes correct waiting execution
- Concurrent enqueue/dequeue operations safe
Integration Tests:
- End-to-end execution ordering with policies
- Three executions with limit=1 execute in order
- Queue stats reflect actual state
- Worker completion notification releases queue slot
Load Tests:
- 1000 concurrent delayed executions
- Correct ordering maintained under load
- No missed notifications or deadlocks
Summary of Critical Issues
| Issue | Severity | Status | Must Fix Before v1.0 |
|---|---|---|---|
| 1. Action Coupling | ✅ Good | Avoided | No |
| 2. Type Safety | ✅ Excellent | Avoided | No |
| 3. Language Ecosystems | ⚠️ Moderate | Partial | Yes - Implement pack installation |
| 4. Dependency Hell | 🔴 Critical | Vulnerable | Yes - Implement venv isolation |
| 5. Secret Security | 🔴 Critical | Vulnerable | Yes - Use stdin/files for secrets |
| 6. Log Storage | ⚠️ Moderate | Good Design | Yes - Add size limits |
| 7. Policy Execution Order | 🔴 Critical | Missing | Yes - Implement FIFO queue |
Recommended Implementation Order
Phase 1: Security & Correctness Fixes (Sprint 1 - Week 1-3)
Priority: CRITICAL - Block All Other Work
1. Fix secret passing vulnerability (Issue 5)
- Implement stdin-based secret injection
- Remove secrets from environment variables
- Update Python/Shell runtime wrappers
- Add security documentation
2. Implement execution queue for policies (Issue 7) NEW
- FIFO queue per action
- Notify mechanism for slot availability
- Integration with PolicyEnforcer
- Queue monitoring API
Phase 2: Runtime Isolation (Sprint 2 - Week 4-5)
Priority: HIGH - Required for Production
1. Implement per-pack virtual environments (Issue 4)
- Python venv creation per pack
- Dependency installation service
- Runtime version management
2. Add pack installation service (Issue 3)
- Pack setup/teardown lifecycle
- Dependency resolution
- Installation status tracking
Phase 3: Operational Hardening (Sprint 3 - Week 6-7)
Priority: MEDIUM - Quality of Life
1. Implement log size limits (Issue 6)
- Streaming log collection
- Size-based truncation
- Configuration options
2. Add log rotation and compression
- Multi-file logs
- Automatic compression
- Retention policies
Phase 4: Advanced Features (v1.1+)
Priority: LOW - Future Enhancement
- Container-based runtimes
- Multi-version runtime support
- Advanced dependency management
- Log streaming API
- Pack marketplace/registry
Testing Checklist
Before marking issues as resolved, verify:
Issue 5 (Secret Security)
- Secrets not visible in `ps auxwwe`
- Secrets not readable from `/proc/{pid}/environ`
- Actions can successfully read secrets from stdin/file
- Python wrapper script reads secrets securely
- Shell wrapper script reads secrets securely
- Documentation updated with secure patterns
Issue 7 (Policy Execution Order) NEW
- Execution queue maintains FIFO order
- Three executions with limit=1 execute in correct order
- Queue stats API returns accurate counts
- Worker completion notification releases queue slot
- No race conditions under concurrent load
- Correct ordering with 1000 delayed executions
Issue 4 (Dependency Isolation)
- Each pack gets isolated venv
- Installing pack A dependencies doesn't affect pack B
- Upgrading system Python doesn't break existing packs
- Runtime version can be specified per pack
- Multiple Python versions can coexist
Issue 3 (Language Support)
- Python packs can declare dependencies in metadata
- `pip install` runs during pack installation
- Node.js packs supported with `npm install`
- Pack installation status tracked
- Failed installations reported with logs
Issue 6 (Log Limits)
- Logs truncated at configured size limit
- Worker doesn't OOM on large output
- Truncation is clearly marked in logs
- Multiple log files created for rotation
- Old logs cleaned up per retention policy
Architecture Decision Records
ADR-001: Use Stdin for Secret Injection
Decision: Pass secrets via stdin as JSON instead of environment variables.
Rationale:
- Environment variables are visible in `/proc/{pid}/environ`
- stdin content is not exposed to other processes
- Follows principle of least privilege
- Industry best practice (used by Kubernetes, HashiCorp Vault)
Consequences:
- Requires wrapper script modifications
- Actions must explicitly read from stdin
- Slight increase in complexity
- Major security improvement
ADR-002: Per-Pack Virtual Environments
Decision: Each pack gets isolated Python virtual environment.
Rationale:
- Prevents dependency conflicts between packs
- Allows different Python versions per pack
- Protects against system Python upgrades
- Standard practice in Python ecosystem
Consequences:
- Increased disk usage (one venv per pack)
- Pack installation takes longer
- Worker must manage venv lifecycle
- Eliminates dependency hell
ADR-003: Filesystem-Based Log Storage
Decision: Store logs in filesystem, not database.
Rationale:
- Database not designed for large blob storage
- Filesystem handles large files efficiently
- Easy to implement rotation and compression
- Can stream logs without loading entire file
Consequences:
- Logs separate from structured execution data
- Need backup strategy for log directory
- Cleanup/retention requires separate process
- Avoids database bloat and failures
References
- StackStorm Lessons Learned: `work-summary/StackStorm-Lessons-Learned.md`
- Current Worker Implementation: `crates/worker/src/`
- Runtime Abstraction: `crates/worker/src/runtime/`
- Secret Management: `crates/worker/src/secrets.rs`
- Artifact Storage: `crates/worker/src/artifacts.rs`
- Database Schema: `migrations/20240101000004_create_runtime_worker.sql`
Next Steps
- Review this analysis with the team - discuss priorities and timeline
- Create GitHub issues - One issue per critical problem
- Update TODO.md - Add tasks from Implementation Order section
- Begin Phase 1 - Security fixes first, before any other work
- Schedule security review - After Phase 1 completion
Document Status: Complete - Ready for Review
Author: AI Assistant
Reviewers Needed: Security Team, Architecture Team, DevOps Lead