# StackStorm Pitfalls Analysis: Current Implementation Review
**Date:** 2024-01-02
**Status:** Analysis Complete - Action Items Identified
## Executive Summary
This document analyzes the current Attune implementation against the StackStorm lessons learned to identify replicated pitfalls and propose solutions. The analysis reveals **3 critical issues** and **2 moderate concerns** that need to be addressed before production deployment.
---
## 1. HIGH COUPLING WITH CUSTOM ACTIONS ✅ AVOIDED
### StackStorm Problem
- Custom actions are tightly coupled to st2 services
- Minimal documentation around action/sensor service interfaces
- Actions must import st2 libraries and inherit from st2 classes
### Current Attune Status: **GOOD**
- ✅ Actions are executed as standalone processes via `tokio::process::Command`
- ✅ No Attune-specific imports or base classes required
- ✅ Runtime abstraction layer in `worker/src/runtime/` is well-designed
- ✅ Actions receive data via environment variables and stdin (code execution)
### Recommendations
- **Keep current approach** - the runtime abstraction is solid
- Consider documenting the runtime interface contract for pack developers
- Add examples of "pure" Python/Shell/Node.js actions that work without any Attune dependencies
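To make the interface contract concrete, a "pure" action only needs to read `PARAM_*` environment variables (and, for code execution, stdin) and print a structured result to stdout. A minimal sketch, written in Rust for consistency with the rest of this document (the same shape applies to Python/Shell/Node.js; the `PARAM_` prefix convention follows the shell runtime shown later, and the helper name is illustrative):

```rust
use std::env;

// Collect action parameters from PARAM_* environment variables,
// stripping the prefix and lowercasing the name (illustrative convention).
fn collect_params(vars: impl Iterator<Item = (String, String)>) -> Vec<(String, String)> {
    let mut params: Vec<(String, String)> = vars
        .filter_map(|(k, v)| k.strip_prefix("PARAM_").map(|name| (name.to_lowercase(), v)))
        .collect();
    params.sort(); // deterministic output order
    params
}

fn main() {
    let params = collect_params(env::vars());
    // Emit a structured result on stdout; no Attune imports required.
    let fields: Vec<String> = params
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", k, v))
        .collect();
    println!("{{\"status\":\"succeeded\",\"params\":{{{}}}}}", fields.join(","));
}
```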
---
## 2. TYPE SAFETY AND DOCUMENTATION ✅ AVOIDED
### StackStorm Problem
- Python with minimal type hints
- Runtime property injection makes types hard to determine
- Poor documentation of service interfaces
### Current Attune Status: **EXCELLENT**
- ✅ Built in Rust with full compile-time type checking
- ✅ All models in `common/src/models.rs` are strongly typed with SQLx
- ✅ Clear type definitions for `ExecutionContext`, `ExecutionResult`, `RuntimeError`
- ✅ Repository pattern enforces type contracts
### Recommendations
- **No changes needed** - Rust's type system provides the safety we need
- Continue documenting public APIs in `docs/` folder
- Consider generating OpenAPI specs from Axum routes for external consumers
---
## 3. LIMITED LANGUAGE ECOSYSTEM SUPPORT ⚠️ PARTIALLY ADDRESSED
### StackStorm Problem
- Only Python packs natively supported
- Other languages require custom installation logic
- No standard way to declare dependencies per language ecosystem
### Current Attune Status: **NEEDS WORK**
#### What's Good
- ✅ Runtime abstraction supports multiple languages (Python, Shell, Node.js planned)
- ✅ `Pack` model has `runtime_deps: Vec<String>` field for dependencies
- ✅ `Runtime` table has `distributions` JSONB and `installation` JSONB fields
#### Problems Identified
**Problem 3.1: No Dependency Installation Implementation**
```rust
// In crates/common/src/models.rs
pub struct Pack {
    // ...
    pub runtime_deps: Vec<String>, // ← DEFINED BUT NOT USED
    // ...
}

pub struct Runtime {
    // ...
    pub distributions: JsonDict, // ← NO INSTALLATION LOGIC
    pub installation: Option<JsonDict>,
    // ...
}
```
**Problem 3.2: No Pack Installation/Setup Service**
- No code exists to process `runtime_deps` field
- No integration with pip, npm, cargo, etc.
- No isolation of dependencies between packs
**Problem 3.3: Runtime Detection is Naive**
```rust
// In crates/worker/src/runtime/python.rs:279
fn can_execute(&self, context: &ExecutionContext) -> bool {
    // Only checks file extension - doesn't verify runtime availability
    context.action_ref.contains(".py")
        || context.entry_point.ends_with(".py")
    // ...
}
```
### Recommendations
**IMMEDIATE (Before Production):**
1. **Implement Pack Installation Service**
- Create `attune-packman` service or add to `attune-api`
- Support installing Python deps via `pip install -r requirements.txt`
- Support installing Node.js deps via `npm install`
- Store pack code in isolated directories: `/var/lib/attune/packs/{pack_ref}/`
2. **Enhance Runtime Model**
- Add `installation_status` enum: `not_installed`, `installing`, `installed`, `failed`
- Add `installed_at` timestamp
- Add `installation_log` field for troubleshooting
3. **Implement Dependency Isolation**
- Python: Use `venv` per pack in `/var/lib/attune/packs/{pack_ref}/.venv/`
- Node.js: Use local `node_modules` per pack
- Document in pack schema: how to declare dependencies
**FUTURE (v2.0):**
4. **Container-based Runtime**
- Each pack gets its own container image
- Dependencies baked into image
- Complete isolation from Attune system
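As a concrete starting point for the installation service, the installer only needs to shell out to each ecosystem's own tooling. A sketch of the command construction (the paths follow the layout proposed above; the function names and the plan-as-data shape are illustrative — the real service would spawn these via `tokio::process::Command` and capture output for an installation log):

```rust
use std::path::{Path, PathBuf};

const PACKS_BASE: &str = "/var/lib/attune/packs";

// Directory a pack's code and venv live in, e.g. /var/lib/attune/packs/{pack_ref}
fn pack_dir(pack_ref: &str) -> PathBuf {
    Path::new(PACKS_BASE).join(pack_ref)
}

// Commands to create an isolated venv and install the pack's declared
// runtime_deps into it. Returned as (program, args) pairs so the caller
// can spawn each one and record its output.
fn python_install_plan(pack_ref: &str, runtime_deps: &[String]) -> Vec<(String, Vec<String>)> {
    let venv = pack_dir(pack_ref).join(".venv");
    let pip = venv.join("bin/pip");
    let mut plan = vec![(
        "python3".to_string(),
        vec!["-m".to_string(), "venv".to_string(), venv.display().to_string()],
    )];
    if !runtime_deps.is_empty() {
        let mut args = vec!["install".to_string()];
        args.extend(runtime_deps.iter().cloned());
        plan.push((pip.display().to_string(), args));
    }
    plan
}
```

The same shape extends to Node.js (`npm install` run with the pack directory as the working directory), keeping each ecosystem's dependencies local to the pack.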
---
## 4. DEPENDENCY HELL AND SYSTEM COUPLING 🔴 CRITICAL ISSUE
### StackStorm Problem
- st2 services run on Python 2.7/3.6 (EOL)
- Upgrading st2 system breaks user actions
- User actions are coupled to st2 Python version
### Current Attune Status: **VULNERABLE**
#### Problems Identified
**Problem 4.1: Shared System Python Runtime**
```rust
// In crates/worker/src/runtime/python.rs:19
pub fn new() -> Self {
    Self {
        python_path: PathBuf::from("python3"), // ← SYSTEM PYTHON!
        // ...
    }
}
```
- Currently uses system-wide `python3`
- If Attune upgrades system Python, user actions may break
- No version pinning or isolation
**Problem 4.2: No Runtime Version Management**
- No way to specify Python 3.9 vs 3.11 vs 3.12
- Runtime table has `name` field but it's not used for version selection
- Shell runtime hardcoded to `/bin/bash`
**Problem 4.3: Attune System Dependencies Could Conflict**
- If Attune worker needs a Python library (e.g., for parsing), it could conflict with action deps
- No separation between "Attune system dependencies" and "action dependencies"
### Recommendations
**CRITICAL (Must Fix Before v1.0):**
1. **Implement Per-Pack Virtual Environments**
```rust
// Pseudocode for python.rs enhancement
pub struct PythonRuntime {
    python_path: PathBuf,           // System python3 for venv creation
    venv_base: PathBuf,             // /var/lib/attune/packs/
    default_python_version: String, // "3.11"
}

impl PythonRuntime {
    async fn get_or_create_venv(&self, pack_ref: &str) -> Result<PathBuf> {
        let venv_path = self.venv_base.join(pack_ref).join(".venv");
        if !venv_path.exists() {
            self.create_venv(&venv_path).await?;
            self.install_pack_deps(pack_ref, &venv_path).await?;
        }
        Ok(venv_path.join("bin/python"))
    }
}
```
2. **Support Multiple Runtime Versions**
- Store available Python versions: `/opt/attune/runtimes/python-3.9/`, `.../python-3.11/`
- Pack declares required version in metadata: `"runtime_version": "3.11"`
- Worker selects appropriate runtime based on pack requirements
3. **Decouple Attune System from Action Execution**
- Attune services (API, executor, worker) remain in Rust - no Python coupling
- Actions run in isolated environments
- Clear boundary: Attune communicates with actions only via stdin/stdout/env/files
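The version-selection step in recommendation 2 can be a pure path lookup over the proposed runtime layout. A minimal sketch (the helper name is illustrative; the worker would check `.exists()` on the result and fail fast with a clear error when the requested version is not installed):

```rust
use std::path::{Path, PathBuf};

// Resolve the interpreter for a pack's declared runtime_version, e.g. "3.11"
// -> {base}/python-3.11/bin/python3, falling back to the configured default
// when the pack declares nothing.
fn resolve_python(base: &Path, requested: Option<&str>, default_version: &str) -> PathBuf {
    let version = requested.unwrap_or(default_version);
    base.join(format!("python-{version}")).join("bin/python3")
}
```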
**DESIGN PRINCIPLE:**
> "Upgrading Attune system dependencies should NEVER break existing user actions."
---
## 5. INSECURE SECRET PASSING 🔴 CRITICAL SECURITY ISSUE
### StackStorm Problem
- Secrets passed as environment variables or CLI arguments
- Visible to all users with login access via `ps`, `/proc/{pid}/environ`
- Major security vulnerability
### Current Attune Status: **VULNERABLE**
#### Problems Identified
**Problem 5.1: Secrets Exposed in Environment Variables**
```rust
// In crates/worker/src/secrets.rs:142
pub fn prepare_secret_env(&self, secrets: &HashMap<String, String>)
    -> HashMap<String, String> {
    secrets
        .iter()
        .map(|(name, value)| {
            let env_name = format!("SECRET_{}", name.to_uppercase().replace('-', "_"));
            (env_name, value.clone()) // ← EXPOSED IN PROCESS ENV!
        })
        .collect()
}

// In crates/worker/src/executor.rs:228
env.extend(secret_env); // ← Secrets added to env vars
```
**Problem 5.2: Secrets Visible in Process Table**
```rust
// In crates/worker/src/runtime/python.rs:122
let mut cmd = Command::new(&self.python_path);
cmd.arg("-c").arg(&script)
    .stdin(Stdio::null()); // ← NOT USING STDIN!
// ...
for (key, value) in env {
    cmd.env(key, value); // ← Secrets visible via /proc/{pid}/environ
}
```
**Problem 5.3: Parameters Also Exposed (Lower Risk)**
```rust
// In crates/worker/src/runtime/shell.rs:49
for (key, value) in &context.parameters {
    script.push_str(&format!(
        "export PARAM_{}='{}'\n", // ← Parameters visible in env
        key.to_uppercase(),
        value_str
    ));
}
```
### Security Impact
- **HIGH**: Any user with shell access can view secrets via:
- `ps auxwwe` - shows environment variables
- `cat /proc/{pid}/environ` - shows full environment
- `strings /proc/{pid}/environ` - extracts secret values
- **MEDIUM**: Short-lived processes reduce exposure window, but still vulnerable
### Recommendations
**CRITICAL (Must Fix Before v1.0):**
1. **Pass Secrets via Stdin (Preferred Method)**
```rust
// Enhanced approach for python.rs
use tokio::io::AsyncWriteExt; // needed for stdin.write_all

async fn execute_python_code(
    &self,
    script: String,
    secrets: &HashMap<String, String>,
    parameters: &HashMap<String, serde_json::Value>,
    env: &HashMap<String, String>, // Only non-secret env vars
    timeout_secs: Option<u64>,
) -> RuntimeResult<ExecutionResult> {
    // Serialize secrets and parameters into a single JSON payload
    let secrets_json = serde_json::to_string(&serde_json::json!({
        "secrets": secrets,
        "parameters": parameters,
    }))?;

    let mut cmd = Command::new(&self.python_path);
    cmd.arg("-c").arg(&script)
        .stdin(Stdio::piped()) // ← Use stdin!
        .stdout(Stdio::piped())
        .stderr(Stdio::piped());

    // Only add non-secret env vars
    for (key, value) in env {
        if !key.starts_with("SECRET_") {
            cmd.env(key, value);
        }
    }

    let mut child = cmd.spawn()?;

    // Write the payload to stdin, then close it so the child sees EOF
    if let Some(mut stdin) = child.stdin.take() {
        stdin.write_all(secrets_json.as_bytes()).await?;
        drop(stdin); // Close stdin
    }

    let output = child.wait_with_output().await?;
    // ...
}
```
2. **Alternative: Use Temporary Secret Files**
```rust
// Create secure temporary file (0600 permissions)
use tokio::fs::OpenOptions;
use tokio::io::AsyncWriteExt; // needed for write_all

let secrets_file = format!(
    "/tmp/attune-secrets-{}-{}.json",
    execution_id,
    uuid::Uuid::new_v4()
);
let mut file = OpenOptions::new()
    .create_new(true) // fails if the path already exists (symlink safety)
    .write(true)
    .mode(0o600) // Read/write for owner only
    .open(&secrets_file)
    .await?;
file.write_all(serde_json::to_string(secrets)?.as_bytes()).await?;
file.sync_all().await?;
drop(file);

// Pass file path via env (not the secrets themselves)
cmd.env("ATTUNE_SECRETS_FILE", &secrets_file);

// ... execute the action ...

// Clean up after execution
tokio::fs::remove_file(&secrets_file).await?;
```
3. **Update Python Wrapper Script**
```python
# Modified wrapper script generator output
import sys
import json

def main():
    # Read secrets and parameters from stdin
    input_data = json.load(sys.stdin)
    secrets = input_data.get('secrets', {})
    parameters = input_data.get('parameters', {})
    # Secrets available in code but never in the environment
    # ...
```
4. **Document Secure Secret Access Pattern**
- Create `docs/secure-secret-handling.md`
- Provide action templates that read from stdin
- Add security best practices guide for pack developers
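One caveat with the temporary-file variant in option 2: the `remove_file` cleanup is skipped whenever execution errors out early. A small RAII guard (a hypothetical helper, shown std-only and synchronous for brevity) makes removal unconditional:

```rust
use std::fs;
use std::path::PathBuf;

// Deletes the secrets file when dropped, even on early return or error paths.
struct SecretsFileGuard {
    path: PathBuf,
}

impl Drop for SecretsFileGuard {
    fn drop(&mut self) {
        // Best-effort: the file may already be gone; ignore errors on cleanup
        let _ = fs::remove_file(&self.path);
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("attune-secrets-demo.json");
    fs::write(&path, b"{\"secrets\":{}}")?;
    {
        let _guard = SecretsFileGuard { path: path.clone() };
        // ... spawn the action with ATTUNE_SECRETS_FILE={path} ...
    } // guard dropped here -> file removed, success or failure alike
    assert!(!path.exists());
    Ok(())
}
```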
**IMPLEMENTATION PRIORITY: IMMEDIATE**
- This is a security vulnerability that must be fixed before any production use
- Should be addressed in Phase 3 (Worker Service completion)
---
## 6. STDERR DATABASE STORAGE CAUSING FAILURES ⚠️ MODERATE ISSUE
### StackStorm Problem
- stderr output stored directly in database
- Excessive logging can exceed database field limits
- Jobs fail unexpectedly due to log size
### Current Attune Status: **GOOD APPROACH, NEEDS LIMITS**
#### What's Good
✅ **Attune uses filesystem storage for logs**
```rust
// In crates/worker/src/artifacts.rs:72
pub async fn store_logs(
    &self,
    execution_id: i64,
    stdout: &str,
    stderr: &str,
) -> Result<Vec<Artifact>> {
    // Stores to files: /tmp/attune/artifacts/execution_{id}/stdout.log
    //                  /tmp/attune/artifacts/execution_{id}/stderr.log
    // NOT stored in database!
}
```
✅ **Database only stores result JSON**
```rust
// In crates/worker/src/executor.rs:331
let input = UpdateExecutionInput {
    status: Some(ExecutionStatus::Completed),
    result: result.result.clone(), // ← Only structured result, not logs
    executor: None,
};
```
#### Problems Identified
**Problem 6.1: No Size Limits on Log Files**
```rust
// In artifacts.rs - no size checks!
file.write_all(stdout.as_bytes()).await?; // ← Could be gigabytes!
```
**Problem 6.2: No Log Rotation**
- Single file per execution
- If action produces GB of logs, file grows unbounded
- Could fill disk
**Problem 6.3: In-Memory Log Collection**
```rust
// In python.rs and shell.rs
let output = execution_future.await?;
let stdout = String::from_utf8_lossy(&output.stdout).to_string(); // ← ALL in memory!
let stderr = String::from_utf8_lossy(&output.stderr).to_string();
```
- If action produces 1GB of output, worker could OOM
### Recommendations
**HIGH PRIORITY (Before Production):**
1. **Implement Streaming Log Collection**
```rust
// Replace `.output()` with a streaming approach
use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};

async fn execute_with_streaming_logs(
    &self,
    mut cmd: Command,
    execution_id: i64,
    max_log_size: usize, // e.g., 10MB
) -> RuntimeResult<ExecutionResult> {
    let mut child = cmd.spawn()?;

    // Stream stdout to file with a size limit
    if let Some(stdout) = child.stdout.take() {
        let reader = BufReader::new(stdout);
        let mut lines = reader.lines();
        let mut total_size = 0;
        let mut log_file = /* open stdout.log */;
        while let Some(line) = lines.next_line().await? {
            total_size += line.len();
            if total_size > max_log_size {
                // Truncate and add warning
                let marker = format!(
                    "\n[TRUNCATED: Log exceeded {}MB]",
                    max_log_size / 1024 / 1024
                );
                log_file.write_all(marker.as_bytes()).await?;
                break;
            }
            log_file.write_all(line.as_bytes()).await?;
            log_file.write_all(b"\n").await?;
        }
    }
    // Similar for stderr
    // ...
}
```
2. **Add Configuration Limits**
```yaml
# config.yaml
worker:
  log_limits:
    max_stdout_size: 10485760  # 10MB
    max_stderr_size: 10485760  # 10MB
    max_total_size: 20971520   # 20MB
    truncate_on_exceed: true
```
3. **Implement Log Rotation Per Execution**
```
/var/lib/attune/artifacts/
  execution_123/
    stdout.0.log   (first 10MB)
    stdout.1.log   (next 10MB)
    stdout.2.log   (final chunk)
    stderr.0.log
    result.json
```
4. **Add Log Streaming API Endpoint**
- API endpoint: `GET /api/v1/executions/{id}/logs/stdout?follow=true`
- Stream logs to client as execution progresses
- Similar to `docker logs --follow`
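Items 1–3 share one core mechanism: a writer that tracks bytes written and cuts off with a marker at the cap. A std-only sketch of that bookkeeping (sync and generic over any `Write` for clarity; the worker would wrap a tokio file the same way, and the type name is illustrative):

```rust
use std::io::Write;

// Wraps any writer, counts bytes, and truncates with a marker once the cap is hit.
struct CappedLog<W: Write> {
    inner: W,
    written: usize,
    max_bytes: usize,
    truncated: bool,
}

impl<W: Write> CappedLog<W> {
    fn new(inner: W, max_bytes: usize) -> Self {
        Self { inner, written: 0, max_bytes, truncated: false }
    }

    // Append one log line; returns false once the cap has been reached.
    fn push_line(&mut self, line: &str) -> std::io::Result<bool> {
        if self.truncated {
            return Ok(false); // already cut off; drop further output
        }
        if self.written + line.len() > self.max_bytes {
            writeln!(self.inner, "[TRUNCATED: log exceeded {} bytes]", self.max_bytes)?;
            self.truncated = true;
            return Ok(false);
        }
        self.written += line.len();
        writeln!(self.inner, "{line}")?;
        Ok(true)
    }
}
```

Rotation (item 3) is the same bookkeeping, except that on hitting the cap the writer opens `stdout.{n+1}.log` and resets the counter instead of stopping.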
**MEDIUM PRIORITY (v1.1):**
5. **Implement Log Compression**
- Compress logs after execution completes
- Save disk space for long-term retention
- Decompress on-demand for viewing
---
## 7. POLICY EXECUTION ORDERING 🔴 CRITICAL ISSUE
### Problem Statement
When multiple executions are delayed due to policy enforcement (e.g., concurrency limits), there is no guaranteed ordering for when they will be scheduled once resources become available.
### Current Implementation Status: **MISSING CRITICAL FEATURE**
#### What Exists
✅ **Policy enforcement framework**
```rust
// In crates/executor/src/policy_enforcer.rs:428
pub async fn wait_for_policy_compliance(
    &self,
    action_id: Id,
    pack_id: Option<Id>,
    max_wait_seconds: u32,
) -> Result<bool> {
    // Polls until policies allow execution
    // BUT: No queue management!
}
```
✅ **Concurrency and rate limiting**
```rust
// Can detect when limits are exceeded
PolicyViolation::ConcurrencyLimitExceeded { limit: 5, current_count: 7 }
```
#### Problems Identified
**Problem 7.1: Non-Deterministic Scheduling Order**
**Scenario:**
```
Action has concurrency limit: 2
Time 0: E1 requested → starts (slot 1/2)
Time 1: E2 requested → starts (slot 2/2)
Time 2: E3 requested → DELAYED (no slots)
Time 3: E4 requested → DELAYED (no slots)
Time 4: E5 requested → DELAYED (no slots)
Time 5: E1 completes → which delayed execution runs?
Current behavior: UNDEFINED ORDER (possibly E5, then E3, then E4)
Expected behavior: FIFO - E3, then E4, then E5
```
**Problem 7.2: No Queue Data Structure**
```rust
// Current implementation in policy_enforcer.rs
// Only polls for compliance - no queue!
loop {
    if self.check_policies(action_id, pack_id).await?.is_none() {
        return Ok(true); // ← Just returns true, no coordination
    }
    tokio::time::sleep(Duration::from_secs(1)).await;
}
```
**Problem 7.3: Race Conditions**
- Multiple delayed executions poll simultaneously
- When slot opens, multiple executions might see it
- First to update wins, others keep waiting
- No fairness guarantee
**Problem 7.4: No Visibility into Queue**
- Can't see how many executions are waiting
- Can't see position in queue
- No way to estimate wait time
- Difficult to debug policy issues
### Business Impact
**Fairness Issues:**
- Later requests might execute before earlier ones
- Violates user expectations (FIFO is standard)
- Unpredictable execution order
**Workflow Dependencies:**
- Workflow step B requested after step A
- Step B might execute before A completes
- Data dependencies violated
- Incorrect results or failures
**Testing/Debugging:**
- Non-deterministic behavior hard to reproduce
- Integration tests become flaky
- Production issues difficult to diagnose
**Performance:**
- Polling wastes CPU cycles
- Multiple executions wake up unnecessarily
- Database load from repeated policy checks
### Recommendations
**CRITICAL (Must Fix Before v1.0):**
1. **Implement Per-Action Execution Queue**
```rust
// New file: crates/executor/src/execution_queue.rs
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;
use tokio::sync::{Mutex, Notify};

/// Manages FIFO queues of delayed executions per action
pub struct ExecutionQueueManager {
    /// Queue per action_id
    queues: Arc<Mutex<HashMap<i64, ActionQueue>>>,
}

struct ActionQueue {
    /// FIFO queue of waiting execution IDs
    waiting: VecDeque<i64>,
    /// Notified when a slot becomes available
    notify: Arc<Notify>,
    /// Current running count
    running_count: u32,
    /// Concurrency limit for this action
    limit: u32,
}

impl ExecutionQueueManager {
    /// Enqueue an execution (returns position in queue)
    pub async fn enqueue(&self, action_id: i64, execution_id: i64) -> usize {
        let mut queues = self.queues.lock().await;
        // ActionQueue::new would initialize with the action's configured limit
        let queue = queues.entry(action_id).or_insert_with(ActionQueue::new);
        queue.waiting.push_back(execution_id);
        queue.waiting.len()
    }

    /// Wait for turn (blocks until this execution can proceed)
    pub async fn wait_for_turn(&self, action_id: i64, execution_id: i64) -> Result<()> {
        loop {
            // Check if it's our turn
            let notify = {
                let mut queues = self.queues.lock().await;
                let queue = queues.get_mut(&action_id).unwrap();
                // Are we at the front AND is there capacity?
                if queue.waiting.front() == Some(&execution_id)
                    && queue.running_count < queue.limit
                {
                    // It's our turn!
                    queue.waiting.pop_front();
                    queue.running_count += 1;
                    return Ok(());
                }
                queue.notify.clone()
            };
            // Not our turn: wait for a notification
            // (production code must register interest in the notification
            // before releasing the lock to avoid a missed-wakeup race)
            notify.notified().await;
        }
    }

    /// Mark execution as complete (frees up a slot)
    pub async fn complete(&self, action_id: i64, _execution_id: i64) {
        let mut queues = self.queues.lock().await;
        if let Some(queue) = queues.get_mut(&action_id) {
            queue.running_count = queue.running_count.saturating_sub(1);
            // Wake all waiters: notify_one could wake a non-front waiter,
            // which would re-check, go back to sleep, and strand the front one
            queue.notify.notify_waiters();
        }
    }

    /// Get queue stats for monitoring
    pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats {
        let queues = self.queues.lock().await;
        if let Some(queue) = queues.get(&action_id) {
            QueueStats {
                waiting: queue.waiting.len(),
                running: queue.running_count as usize,
                limit: queue.limit as usize,
            }
        } else {
            QueueStats::default()
        }
    }
}
```
2. **Integrate with PolicyEnforcer**
```rust
// Update policy_enforcer.rs
pub struct PolicyEnforcer {
    pool: PgPool,
    queue_manager: Arc<ExecutionQueueManager>, // ← NEW
    // ... existing fields
}

pub async fn enforce_and_wait(
    &self,
    action_id: Id,
    execution_id: Id,
    pack_id: Option<Id>,
) -> Result<()> {
    // Check if a policy would be violated
    if let Some(violation) = self.check_policies(action_id, pack_id).await? {
        match violation {
            PolicyViolation::ConcurrencyLimitExceeded { .. } => {
                // Enqueue and wait for turn
                let position = self.queue_manager.enqueue(action_id, execution_id).await;
                info!("Execution {} queued at position {}", execution_id, position);
                self.queue_manager.wait_for_turn(action_id, execution_id).await?;
                info!("Execution {} proceeding after queue wait", execution_id);
            }
            _ => {
                // Other policy types: retry with backoff
                self.retry_with_backoff(action_id, pack_id).await?;
            }
        }
    }
    Ok(())
}
```
3. **Update Scheduler to Use Queue**
```rust
// In scheduler.rs
async fn process_execution_requested(
    pool: &PgPool,
    publisher: &Publisher,
    policy_enforcer: &PolicyEnforcer, // ← NEW parameter
    envelope: &MessageEnvelope<ExecutionRequestedPayload>,
) -> Result<()> {
    let execution_id = envelope.payload.execution_id;
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    let action = Self::get_action_for_execution(pool, &execution).await?;

    // Enforce policies with queueing
    policy_enforcer
        .enforce_and_wait(action.id, execution_id, Some(action.pack))
        .await?;

    // Now proceed with scheduling
    let worker = Self::select_worker(pool, &action).await?;
    // ...
}
```
4. **Add Completion Notification**
```rust
// Worker must notify when an execution completes
// In worker/src/executor.rs
async fn handle_execution_success(
    &self,
    execution_id: i64,
    action_id: i64,
    result: &ExecutionResult,
) -> Result<()> {
    // Update database
    ExecutionRepository::update(...).await?;

    // Notify queue manager (via message queue)
    let payload = ExecutionCompletedPayload {
        execution_id,
        action_id,
        status: ExecutionStatus::Completed,
    };
    self.publisher.publish("execution.completed", payload).await?;
    Ok(())
}
```
5. **Add Queue Monitoring API**
```rust
// New endpoint in the API service
/// GET /api/v1/actions/:id/queue-stats
async fn get_action_queue_stats(
    State(state): State<Arc<AppState>>,
    Path(action_id): Path<i64>,
) -> Result<Json<ApiResponse<QueueStats>>> {
    let stats = state.queue_manager.get_queue_stats(action_id).await;
    Ok(Json(ApiResponse::success(stats)))
}

#[derive(Serialize)]
pub struct QueueStats {
    pub waiting: usize,
    pub running: usize,
    pub limit: usize,
    pub avg_wait_time_seconds: Option<f64>,
}
```
**IMPLEMENTATION PRIORITY: CRITICAL**
- This affects correctness and fairness of the system
- Must be implemented before production use
- Should be addressed in Phase 3 (Executor Service completion)
### Testing Requirements
**Unit Tests:**
- [ ] Queue maintains FIFO order
- [ ] Multiple executions enqueue correctly
- [ ] Dequeue happens in order
- [ ] Notify wakes correct waiting execution
- [ ] Concurrent enqueue/dequeue operations safe
**Integration Tests:**
- [ ] End-to-end execution ordering with policies
- [ ] Three executions with limit=1 execute in order
- [ ] Queue stats reflect actual state
- [ ] Worker completion notification releases queue slot
**Load Tests:**
- [ ] 1000 concurrent delayed executions
- [ ] Correct ordering maintained under load
- [ ] No missed notifications or deadlocks
---
## Summary of Critical Issues
| Issue | Severity | Status | Must Fix Before v1.0 |
|-------|----------|--------|---------------------|
| 1. Action Coupling | ✅ Good | Avoided | No |
| 2. Type Safety | ✅ Excellent | Avoided | No |
| 3. Language Ecosystems | ⚠️ Moderate | Partial | **Yes** - Implement pack installation |
| 4. Dependency Hell | 🔴 Critical | Vulnerable | **Yes** - Implement venv isolation |
| 5. Secret Security | 🔴 Critical | Vulnerable | **Yes** - Use stdin/files for secrets |
| 6. Log Storage | ⚠️ Moderate | Good Design | **Yes** - Add size limits |
| 7. Policy Execution Order | 🔴 Critical | Missing | **Yes** - Implement FIFO queue |
---
## Recommended Implementation Order
### Phase 1: Security & Correctness Fixes (Sprint 1 - Week 1-3)
**Priority: CRITICAL - Block All Other Work**
1. Fix secret passing vulnerability (Issue 5)
- Implement stdin-based secret injection
- Remove secrets from environment variables
- Update Python/Shell runtime wrappers
- Add security documentation
2. Implement execution queue for policies (Issue 7) **NEW**
- FIFO queue per action
- Notify mechanism for slot availability
- Integration with PolicyEnforcer
- Queue monitoring API
### Phase 2: Runtime Isolation (Sprint 2 - Week 4-5)
**Priority: HIGH - Required for Production**
3. Implement per-pack virtual environments (Issue 4)
- Python venv creation per pack
- Dependency installation service
- Runtime version management
4. Add pack installation service (Issue 3)
- Pack setup/teardown lifecycle
- Dependency resolution
- Installation status tracking
### Phase 3: Operational Hardening (Sprint 3 - Week 6-7)
**Priority: MEDIUM - Quality of Life**
5. Implement log size limits (Issue 6)
- Streaming log collection
- Size-based truncation
- Configuration options
6. Add log rotation and compression
- Multi-file logs
- Automatic compression
- Retention policies
### Phase 4: Advanced Features (v1.1+)
**Priority: LOW - Future Enhancement**
7. Container-based runtimes
8. Multi-version runtime support
9. Advanced dependency management
10. Log streaming API
11. Pack marketplace/registry
---
## Testing Checklist
Before marking issues as resolved, verify:
### Issue 5 (Secret Security)
- [ ] Secrets not visible in `ps auxwwe`
- [ ] Secrets not readable from `/proc/{pid}/environ`
- [ ] Actions can successfully read secrets from stdin/file
- [ ] Python wrapper script reads secrets securely
- [ ] Shell wrapper script reads secrets securely
- [ ] Documentation updated with secure patterns
### Issue 7 (Policy Execution Order) **NEW**
- [ ] Execution queue maintains FIFO order
- [ ] Three executions with limit=1 execute in correct order
- [ ] Queue stats API returns accurate counts
- [ ] Worker completion notification releases queue slot
- [ ] No race conditions under concurrent load
- [ ] Correct ordering with 1000 delayed executions
### Issue 4 (Dependency Isolation)
- [ ] Each pack gets isolated venv
- [ ] Installing pack A dependencies doesn't affect pack B
- [ ] Upgrading system Python doesn't break existing packs
- [ ] Runtime version can be specified per pack
- [ ] Multiple Python versions can coexist
### Issue 3 (Language Support)
- [ ] Python packs can declare dependencies in metadata
- [ ] `pip install` runs during pack installation
- [ ] Node.js packs supported with npm install
- [ ] Pack installation status tracked
- [ ] Failed installations reported with logs
### Issue 6 (Log Limits)
- [ ] Logs truncated at configured size limit
- [ ] Worker doesn't OOM on large output
- [ ] Truncation is clearly marked in logs
- [ ] Multiple log files created for rotation
- [ ] Old logs cleaned up per retention policy
---
## Architecture Decision Records
### ADR-001: Use Stdin for Secret Injection
**Decision:** Pass secrets via stdin as JSON instead of environment variables.
**Rationale:**
- Environment variables visible in `/proc/{pid}/environ`
- stdin content not exposed to other processes
- Follows principle of least privilege
- Industry best practice (used by Kubernetes, HashiCorp Vault)
**Consequences:**
- Requires wrapper script modifications
- Actions must explicitly read from stdin
- Slight increase in complexity
- **Major security improvement**
### ADR-002: Per-Pack Virtual Environments
**Decision:** Each pack gets isolated Python virtual environment.
**Rationale:**
- Prevents dependency conflicts between packs
- Allows different Python versions per pack
- Protects against system Python upgrades
- Standard practice in Python ecosystem
**Consequences:**
- Increased disk usage (one venv per pack)
- Pack installation takes longer
- Worker must manage venv lifecycle
- **Eliminates dependency hell**
### ADR-003: Filesystem-Based Log Storage
**Decision:** Store logs in filesystem, not database.
**Rationale:**
- Database not designed for large blob storage
- Filesystem handles large files efficiently
- Easy to implement rotation and compression
- Can stream logs without loading entire file
**Consequences:**
- Logs separate from structured execution data
- Need backup strategy for log directory
- Cleanup/retention requires separate process
- **Avoids database bloat and failures**
---
## References
- StackStorm Lessons Learned: `work-summary/StackStorm-Lessons-Learned.md`
- Current Worker Implementation: `crates/worker/src/`
- Runtime Abstraction: `crates/worker/src/runtime/`
- Secret Management: `crates/worker/src/secrets.rs`
- Artifact Storage: `crates/worker/src/artifacts.rs`
- Database Schema: `migrations/20240101000004_create_runtime_worker.sql`
---
## Next Steps
1. **Review this analysis with team** - Discuss priorities and timeline
2. **Create GitHub issues** - One issue per critical problem
3. **Update TODO.md** - Add tasks from Implementation Order section
4. **Begin Phase 1** - Security fixes first, before any other work
5. **Schedule security review** - After Phase 1 completion
---
**Document Status:** Complete - Ready for Review
**Author:** AI Assistant
**Reviewers Needed:** Security Team, Architecture Team, DevOps Lead