
# StackStorm Pitfalls Analysis: Current Implementation Review

**Date:** 2024-01-02
**Status:** Analysis Complete - Action Items Identified

## Executive Summary

This document analyzes the current Attune implementation against the StackStorm lessons learned to identify replicated pitfalls and propose solutions. The analysis reveals **3 critical issues** and **2 moderate concerns** that need to be addressed before production deployment.

---

## 1. HIGH COUPLING WITH CUSTOM ACTIONS ✅ AVOIDED

### StackStorm Problem
- Custom actions are tightly coupled to st2 services
- Minimal documentation around action/sensor service interfaces
- Actions must import st2 libraries and inherit from st2 classes

### Current Attune Status: **GOOD**
- ✅ Actions are executed as standalone processes via `tokio::process::Command`
- ✅ No Attune-specific imports or base classes required
- ✅ Runtime abstraction layer in `crates/worker/src/runtime/` is well-designed
- ✅ Actions receive data via environment variables and stdin (code execution)

### Recommendations
- **Keep current approach** - the runtime abstraction is solid
- Consider documenting the runtime interface contract for pack developers
- Add examples of "pure" Python/Shell/Node.js actions that work without any Attune dependencies

---

## 2. TYPE SAFETY AND DOCUMENTATION ✅ AVOIDED

### StackStorm Problem
- Python with minimal type hints
- Runtime property injection makes types hard to determine
- Poor documentation of service interfaces

### Current Attune Status: **EXCELLENT**
- ✅ Built in Rust with full compile-time type checking
- ✅ All models in `crates/common/src/models.rs` are strongly typed with SQLx
- ✅ Clear type definitions for `ExecutionContext`, `ExecutionResult`, `RuntimeError`
- ✅ Repository pattern enforces type contracts

### Recommendations
- **No changes needed** - Rust's type system provides the safety we need
- Continue documenting public APIs in `docs/` folder
- Consider generating OpenAPI specs from Axum routes for external consumers

---

## 3. LIMITED LANGUAGE ECOSYSTEM SUPPORT ⚠️ PARTIALLY ADDRESSED

### StackStorm Problem
- Only Python packs natively supported
- Other languages require custom installation logic
- No standard way to declare dependencies per language ecosystem

### Current Attune Status: **NEEDS WORK**

#### What's Good
- ✅ Runtime abstraction supports multiple languages (Python, Shell, Node.js planned)
- ✅ `Pack` model has `runtime_deps: Vec<String>` field for dependencies
- ✅ `Runtime` table has `distributions` JSONB and `installation` JSONB fields

#### Problems Identified

**Problem 3.1: No Dependency Installation Implementation**
```rust
// In crates/common/src/models.rs
pub struct Pack {
    // ...
    pub runtime_deps: Vec<String>,  // ← DEFINED BUT NOT USED
    // ...
}

pub struct Runtime {
    // ...
    pub distributions: JsonDict,     // ← NO INSTALLATION LOGIC
    pub installation: Option<JsonDict>,
    // ...
}
```

**Problem 3.2: No Pack Installation/Setup Service**
- No code exists to process `runtime_deps` field
- No integration with pip, npm, cargo, etc.
- No isolation of dependencies between packs

**Problem 3.3: Runtime Detection is Naive**
```rust
// In crates/worker/src/runtime/python.rs:279
fn can_execute(&self, context: &ExecutionContext) -> bool {
    // Only checks file extension - doesn't verify runtime availability
    context.action_ref.contains(".py")
        || context.entry_point.ends_with(".py")
        // ...
}
```
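
A stricter check would confirm that the interpreter is actually installed before claiming an action. The following is a minimal sketch using only the standard library; `find_executable` and `can_execute_python` are hypothetical names, and the real `can_execute` would take an `ExecutionContext` rather than plain strings.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Returns the first executable called `name` found in `search_path`,
/// a PATH-style string such as "/usr/local/bin:/usr/bin".
pub fn find_executable(name: &str, search_path: &str) -> Option<PathBuf> {
    env::split_paths(search_path)
        .map(|dir| dir.join(name))
        .find(|candidate| is_executable(candidate))
}

#[cfg(unix)]
fn is_executable(path: &Path) -> bool {
    use std::os::unix::fs::PermissionsExt;
    // A regular file with at least one execute bit set.
    path.metadata()
        .map(|m| m.is_file() && m.permissions().mode() & 0o111 != 0)
        .unwrap_or(false)
}

#[cfg(not(unix))]
fn is_executable(path: &Path) -> bool {
    path.is_file()
}

/// Hypothetical stand-in for `can_execute`: the entry point must look like
/// Python and the interpreter must be present on the search path.
pub fn can_execute_python(entry_point: &str, interpreter: &str, search_path: &str) -> bool {
    entry_point.ends_with(".py") && find_executable(interpreter, search_path).is_some()
}
```

In the worker this lookup would likely run once at startup and be cached, since scanning the search path on every execution adds avoidable filesystem calls.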

### Recommendations

**IMMEDIATE (Before Production):**
1. **Implement Pack Installation Service**
   - Create `attune-packman` service or add to `attune-api`
   - Support installing Python deps via `pip install -r requirements.txt`
   - Support installing Node.js deps via `npm install`
   - Store pack code in isolated directories: `/var/lib/attune/packs/{pack_ref}/`

2. **Enhance Runtime Model**
   - Add `installation_status` enum: `not_installed`, `installing`, `installed`, `failed`
   - Add `installed_at` timestamp
   - Add `installation_log` field for troubleshooting

3. **Implement Dependency Isolation**
   - Python: Use `venv` per pack in `/var/lib/attune/packs/{pack_ref}/.venv/`
   - Node.js: Use local `node_modules` per pack
   - Document in pack schema: how to declare dependencies

**FUTURE (v2.0):**
4. **Container-based Runtime**
   - Each pack gets its own container image
   - Dependencies baked into image
   - Complete isolation from Attune system
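
The installation-status tracking and per-pack directory layout recommended above could look roughly like this. It is a sketch under assumptions: the transition rules, the `pack_ref` character whitelist, and the function names are illustrative and not part of the current schema.

```rust
use std::path::{Path, PathBuf};

/// Mirrors the proposed `installation_status` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstallationStatus {
    NotInstalled,
    Installing,
    Installed,
    Failed,
}

impl InstallationStatus {
    /// String form as it would be stored in the database.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::NotInstalled => "not_installed",
            Self::Installing => "installing",
            Self::Installed => "installed",
            Self::Failed => "failed",
        }
    }

    /// Assumed lifecycle: every install or reinstall passes through `Installing`.
    pub fn can_transition_to(self, next: Self) -> bool {
        use InstallationStatus::*;
        matches!(
            (self, next),
            (NotInstalled, Installing)
                | (Installing, Installed)
                | (Installing, Failed)
                | (Failed, Installing)
                | (Installed, Installing)
        )
    }
}

/// Resolves `{base}/{pack_ref}`, rejecting refs that could escape `base`
/// (empty, hidden, or containing path separators).
pub fn pack_dir(base: &Path, pack_ref: &str) -> Option<PathBuf> {
    let valid = !pack_ref.is_empty()
        && !pack_ref.starts_with('.')
        && pack_ref
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | '.'));
    if valid {
        Some(base.join(pack_ref))
    } else {
        None
    }
}

/// Per-pack virtual environment location: `{base}/{pack_ref}/.venv`.
pub fn venv_dir(base: &Path, pack_ref: &str) -> Option<PathBuf> {
    pack_dir(base, pack_ref).map(|dir| dir.join(".venv"))
}
```

Validating `pack_ref` before joining matters because the value comes from pack metadata; a ref such as `../etc` would otherwise resolve outside the packs directory.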

---

## 4. DEPENDENCY HELL AND SYSTEM COUPLING 🔴 CRITICAL ISSUE

### StackStorm Problem
- st2 services run on Python 2.7/3.6 (EOL)
- Upgrading st2 system breaks user actions
- User actions are coupled to st2 Python version

### Current Attune Status: **VULNERABLE**

#### Problems Identified

**Problem 4.1: Shared System Python Runtime**
```rust
// In crates/worker/src/runtime/python.rs:19
pub fn new() -> Self {
    Self {
        python_path: PathBuf::from("python3"),  // ← SYSTEM PYTHON!
        // ...
    }
}
```
- Currently uses system-wide `python3`
- If the host's system Python is upgraded (for example by an OS or Attune package update), user actions may break
- No version pinning or isolation

**Problem 4.2: No Runtime Version Management**
- No way to specify Python 3.9 vs 3.11 vs 3.12
- Runtime table has `name` field but it's not used for version selection
- Shell runtime hardcoded to `/bin/bash`

**Problem 4.3: Attune System Dependencies Could Conflict**
- If Attune worker needs a Python library (e.g., for parsing), it could conflict with action deps
- No separation between "Attune system dependencies" and "action dependencies"

### Recommendations

**CRITICAL (Must Fix Before v1.0):**

1. **Implement Per-Pack Virtual Environments**
   ```rust
   // Pseudocode for python.rs enhancement
   pub struct PythonRuntime {
       python_path: PathBuf,          // System python3 for venv creation
       venv_base: PathBuf,             // /var/lib/attune/packs/
       default_python_version: String, // "3.11"
   }

   impl PythonRuntime {
       async fn get_or_create_venv(&self, pack_ref: &str) -> Result<PathBuf> {
           let venv_path = self.venv_base.join(pack_ref).join(".venv");
           if !venv_path.exists() {
               self.create_venv(&venv_path).await?;
               self.install_pack_deps(pack_ref, &venv_path).await?;
           }
           Ok(venv_path.join("bin/python"))
       }
   }
   ```

2. **Support Multiple Runtime Versions**
   - Store available Python versions: `/opt/attune/runtimes/python-3.9/`, `.../python-3.11/`
   - Pack declares required version in metadata: `"runtime_version": "3.11"`
   - Worker selects appropriate runtime based on pack requirements

3. **Decouple Attune System from Action Execution**
   - Attune services (API, executor, worker) remain in Rust - no Python coupling
   - Actions run in isolated environments
   - Clear boundary: Attune communicates with actions only via stdin/stdout/env/files
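
Version selection as described in recommendation 2 can be sketched as a pure function. The `python-{version}` directory naming follows the layout above; the `bin/python` suffix, the function names, and the exact-match policy on major.minor are assumptions made for illustration.

```rust
use std::path::{Path, PathBuf};

/// Parses "3.11" or "3.11.4" into (major, minor); returns None for anything else.
fn parse_major_minor(version: &str) -> Option<(u32, u32)> {
    let mut parts = version.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// Picks the interpreter for a pack.
///
/// `installed` lists the versions present under `runtimes_root`,
/// `requested` is the pack's optional `runtime_version`, and
/// `default_version` is used when the pack does not declare one.
pub fn resolve_python(
    runtimes_root: &Path,
    installed: &[&str],
    requested: Option<&str>,
    default_version: &str,
) -> Option<PathBuf> {
    let wanted = parse_major_minor(requested.unwrap_or(default_version))?;
    installed
        .iter()
        .find(|v| parse_major_minor(v) == Some(wanted))
        .map(|v| {
            runtimes_root
                .join(format!("python-{}", v))
                .join("bin")
                .join("python")
        })
}
```

Returning `None` for a missing version, rather than silently falling back to another interpreter, keeps the failure visible at pack installation time instead of at action runtime.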

**DESIGN PRINCIPLE:**
> "Upgrading Attune system dependencies should NEVER break existing user actions."

---

## 5. INSECURE SECRET PASSING 🔴 CRITICAL SECURITY ISSUE

### StackStorm Problem
- Secrets passed as environment variables or CLI arguments
- CLI arguments are visible to every local user via `ps`; environment variables are readable via `/proc/{pid}/environ` by the process owner and root
- Major security vulnerability

### Current Attune Status: **VULNERABLE**

#### Problems Identified

**Problem 5.1: Secrets Exposed in Environment Variables**
```rust
// In crates/worker/src/secrets.rs:142
pub fn prepare_secret_env(&self, secrets: &HashMap<String, String>)
    -> HashMap<String, String> {
    secrets
        .iter()
        .map(|(name, value)| {
            let env_name = format!("SECRET_{}", name.to_uppercase().replace('-', "_"));
            (env_name, value.clone())  // ← EXPOSED IN PROCESS ENV!
        })
        .collect()
}

// In crates/worker/src/executor.rs:228
env.extend(secret_env);  // ← Secrets added to env vars
```

**Problem 5.2: Secrets Visible in Process Table**
```rust
// In crates/worker/src/runtime/python.rs:122
let mut cmd = Command::new(&self.python_path);
cmd.arg("-c").arg(&script)
    .stdin(Stdio::null())  // ← NOT USING STDIN!
    // ...
for (key, value) in env {
    cmd.env(key, value);  // ← Secrets visible via /proc/{pid}/environ
}
```

**Problem 5.3: Parameters Also Exposed (Lower Risk)**
```rust
// In crates/worker/src/runtime/shell.rs:49
for (key, value) in &context.parameters {
    script.push_str(&format!(
        "export PARAM_{}='{}'\n",  // ← Parameters visible in env
        key.to_uppercase(),
        value_str
    ));
}
```

### Security Impact
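- **HIGH**: Any process running as the worker's user (or as root) can read secrets via:
  - `ps auxwwe` - shows environment variables of processes the caller is allowed to inspect
  - `cat /proc/{pid}/environ` - shows full environment
  - `strings /proc/{pid}/environ` - extracts secret values
- **HIGH**: Environment variables are inherited by every child process an action spawns, and anything passed as a CLI argument is visible to all local users via `ps`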
- **MEDIUM**: Short-lived processes reduce exposure window, but still vulnerable

### Recommendations

**CRITICAL (Must Fix Before v1.0):**

1. **Pass Secrets via Stdin (Preferred Method)**
   ```rust
   // Enhanced approach for python.rs
   use tokio::io::AsyncWriteExt;  // provides `write_all` on the child's stdin

   async fn execute_python_code(
       &self,
       script: String,
       secrets: &HashMap<String, String>,
       parameters: &HashMap<String, serde_json::Value>,
       env: &HashMap<String, String>,  // Only non-secret env vars
       timeout_secs: Option<u64>,
   ) -> RuntimeResult<ExecutionResult> {
       // Serialize secrets and parameters into a single JSON payload (kept in memory, never on disk)
       let secrets_json = serde_json::to_string(&serde_json::json!({
           "secrets": secrets,
           "parameters": parameters,
       }))?;

       let mut cmd = Command::new(&self.python_path);
       cmd.arg("-c").arg(&script)
           .stdin(Stdio::piped())   // ← Use stdin!
           .stdout(Stdio::piped())
           .stderr(Stdio::piped());

       // Only add non-secret env vars
       for (key, value) in env {
           if !key.starts_with("SECRET_") {
               cmd.env(key, value);
           }
       }

       let mut child = cmd.spawn()?;

       // Write secrets to stdin and close
       if let Some(mut stdin) = child.stdin.take() {
           stdin.write_all(secrets_json.as_bytes()).await?;
           drop(stdin);  // Close stdin
       }

       let output = child.wait_with_output().await?;
       // ...
   }
   ```

2. **Alternative: Use Temporary Secret Files**
   ```rust
   // Create secure temporary file (0600 permissions)
   let secrets_file = format!("/tmp/attune-secrets-{}-{}.json",
                               execution_id, uuid::Uuid::new_v4());
   let mut file = OpenOptions::new()
       .create_new(true)
       .write(true)
       .mode(0o600)  // Read/write for owner only
       .open(&secrets_file).await?;

   file.write_all(serde_json::to_string(secrets)?.as_bytes()).await?;
   file.sync_all().await?;
   drop(file);

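   // Pass file path via env (not the secrets themselves)
   cmd.env("ATTUNE_SECRETS_FILE", &secrets_file);

   // ... spawn the action and wait for it to exit ...

   // Clean up after execution (this must also run when the action fails or times out)
   tokio::fs::remove_file(&secrets_file).await?;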
   ```

3. **Update Python Wrapper Script**
   ```python
   # Modified wrapper script generator
   def main():
       import sys, json

       # Read secrets and parameters from stdin
       input_data = json.load(sys.stdin)
       secrets = input_data.get('secrets', {})
       parameters = input_data.get('parameters', {})

       # Secrets available in code but not in environment
       # ...
   ```

4. **Document Secure Secret Access Pattern**
   - Create `docs/secure-secret-handling.md`
   - Provide action templates that read from stdin
   - Add security best practices guide for pack developers
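
For the temporary-file alternative, cleanup is easy to miss on error paths. A guard type that deletes the file when dropped avoids that. This is a Unix-only sketch using blocking `std::fs` rather than the worker's tokio APIs; `SecretsFile` is a hypothetical name.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::os::unix::fs::OpenOptionsExt;
use std::path::{Path, PathBuf};

/// Owns a secrets file on disk and removes it when dropped.
pub struct SecretsFile {
    path: PathBuf,
}

impl SecretsFile {
    /// Creates `path` with mode 0600 and writes `contents` to it.
    pub fn create(path: PathBuf, contents: &[u8]) -> io::Result<Self> {
        let mut file = OpenOptions::new()
            .write(true)
            .create_new(true) // fail instead of reusing or following an existing path
            .mode(0o600) // read/write for the owner only
            .open(&path)?;
        // From here on the guard removes the file even if the write fails.
        let guard = SecretsFile { path };
        file.write_all(contents)?;
        file.sync_all()?;
        Ok(guard)
    }

    /// Path to hand to the action, e.g. via `ATTUNE_SECRETS_FILE`.
    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for SecretsFile {
    fn drop(&mut self) {
        // Best effort: the file may already be gone.
        let _ = fs::remove_file(&self.path);
    }
}
```

Holding the guard for the lifetime of the child process means the file disappears on success, failure, and early return alike.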

**IMPLEMENTATION PRIORITY: IMMEDIATE**
- This is a security vulnerability that must be fixed before any production use
- Should be addressed in Phase 3 (Worker Service completion)

---

## 6. STDERR DATABASE STORAGE CAUSING FAILURES ⚠️ MODERATE ISSUE

### StackStorm Problem
- stderr output stored directly in database
- Excessive logging can exceed database field limits
- Jobs fail unexpectedly due to log size

### Current Attune Status: **GOOD APPROACH, NEEDS LIMITS**

#### What's Good
✅ **Attune uses filesystem storage for logs**
```rust
// In crates/worker/src/artifacts.rs:72
pub async fn store_logs(
    &self,
    execution_id: i64,
    stdout: &str,
    stderr: &str,
) -> Result<Vec<Artifact>> {
    // Stores to files: /tmp/attune/artifacts/execution_{id}/stdout.log
    //                  /tmp/attune/artifacts/execution_{id}/stderr.log
    // NOT stored in database!
}
```

✅ **Database only stores result JSON**
```rust
// In crates/worker/src/executor.rs:331
let input = UpdateExecutionInput {
    status: Some(ExecutionStatus::Completed),
    result: result.result.clone(),  // ← Only structured result, not logs
    executor: None,
};
```

#### Problems Identified

**Problem 6.1: No Size Limits on Log Files**
```rust
// In artifacts.rs - no size checks!
file.write_all(stdout.as_bytes()).await?;  // ← Could be gigabytes!
```

**Problem 6.2: No Log Rotation**
- Single file per execution
- If action produces GB of logs, file grows unbounded
- Could fill disk

**Problem 6.3: In-Memory Log Collection**
```rust
// In python.rs and shell.rs
let output = execution_future.await?;
let stdout = String::from_utf8_lossy(&output.stdout).to_string();  // ← ALL in memory!
let stderr = String::from_utf8_lossy(&output.stderr).to_string();
```
- If action produces 1GB of output, worker could OOM

### Recommendations

**HIGH PRIORITY (Before Production):**

1. **Implement Streaming Log Collection**
   ```rust
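   // Replace `.output()` with a streaming approach
   // (the command must be configured with `.stdout(Stdio::piped())`)
   use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};

   async fn execute_with_streaming_logs(
       &self,
       mut cmd: Command,
       execution_id: i64,
       max_log_size: usize,  // e.g., 10MB
   ) -> RuntimeResult<ExecutionResult> {
       let mut child = cmd.spawn()?;

       // Stream stdout to file with size limit
       if let Some(stdout) = child.stdout.take() {
           let mut lines = BufReader::new(stdout).lines();
           let mut total_size = 0;
           let mut truncated = false;
           let mut log_file = /* open stdout.log */;

           while let Some(line) = lines.next_line().await? {
               if truncated {
                   continue;  // Keep draining so the child never blocks on a full pipe
               }
               total_size += line.len() + 1;  // +1 for the newline
               if total_size > max_log_size {
                   // Mark the truncation once, then discard the remaining output
                   let notice = format!("\n[TRUNCATED: Log exceeded {}MB]\n",
                                        max_log_size / 1024 / 1024);
                   log_file.write_all(notice.as_bytes()).await?;
                   truncated = true;
                   continue;
               }
               log_file.write_all(line.as_bytes()).await?;
               log_file.write_all(b"\n").await?;
           }
       }

       // Similar for stderr; read both streams concurrently (e.g. `tokio::join!`)
       // so that neither pipe fills up while the other is being read
       // ...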
   }
   ```

2. **Add Configuration Limits**
   ```yaml
   # config.yaml
   worker:
     log_limits:
       max_stdout_size: 10485760  # 10MB
       max_stderr_size: 10485760  # 10MB
       max_total_size: 20971520   # 20MB
       truncate_on_exceed: true
   ```

3. **Implement Log Rotation Per Execution**
   ```
   /var/lib/attune/artifacts/
     execution_123/
       stdout.0.log      (first 10MB)
       stdout.1.log      (next 10MB)
       stdout.2.log      (final chunk)
       stderr.0.log
       result.json
   ```

4. **Add Log Streaming API Endpoint**
   - API endpoint: `GET /api/v1/executions/{id}/logs/stdout?follow=true`
   - Stream logs to client as execution progresses
   - Similar to `docker logs --follow`

**MEDIUM PRIORITY (v1.1):**

5. **Implement Log Compression**
   - Compress logs after execution completes
   - Save disk space for long-term retention
   - Decompress on-demand for viewing
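
The size limit itself can be isolated in a small writer wrapper, which keeps the truncation logic testable without spawning a process. The sketch below uses blocking `std::io` for brevity (the worker would need the tokio equivalent); `CappedWriter` and `copy_capped` are hypothetical names.

```rust
use std::io::{self, Read, Write};

/// Forwards at most `max_bytes` to `inner`, then writes one truncation
/// marker and silently discards everything after it.
pub struct CappedWriter<W: Write> {
    inner: W,
    remaining: usize,
    truncated: bool,
}

impl<W: Write> CappedWriter<W> {
    pub fn new(inner: W, max_bytes: usize) -> Self {
        Self { inner, remaining: max_bytes, truncated: false }
    }

    pub fn truncated(&self) -> bool {
        self.truncated
    }

    pub fn into_inner(self) -> W {
        self.inner
    }
}

impl<W: Write> Write for CappedWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let keep = buf.len().min(self.remaining);
        self.inner.write_all(&buf[..keep])?;
        self.remaining -= keep;
        if keep < buf.len() && !self.truncated {
            self.truncated = true;
            self.inner.write_all(b"\n[TRUNCATED: log exceeded size limit]\n")?;
        }
        // Report the whole buffer as consumed so the caller keeps draining
        // the child's pipe instead of stalling it.
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Copies `reader` into `writer`, keeping at most `max_bytes` of payload.
/// Returns the writer and whether truncation happened.
pub fn copy_capped<R: Read, W: Write>(
    mut reader: R,
    writer: W,
    max_bytes: usize,
) -> io::Result<(W, bool)> {
    let mut capped = CappedWriter::new(writer, max_bytes);
    io::copy(&mut reader, &mut capped)?;
    let truncated = capped.truncated();
    Ok((capped.into_inner(), truncated))
}
```

Because the wrapper counts bytes rather than lines, a single very long line cannot bypass the limit, and memory use stays bounded by the copy buffer.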

---

## 7. POLICY EXECUTION ORDERING 🔴 CRITICAL ISSUE

### Problem Statement
When multiple executions are delayed due to policy enforcement (e.g., concurrency limits), there is no guaranteed ordering for when they will be scheduled once resources become available.

### Current Implementation Status: **MISSING CRITICAL FEATURE**

#### What Exists
✅ **Policy enforcement framework**
```rust
// In crates/executor/src/policy_enforcer.rs:428
pub async fn wait_for_policy_compliance(
    &self,
    action_id: Id,
    pack_id: Option<Id>,
    max_wait_seconds: u32,
) -> Result<bool> {
    // Polls until policies allow execution
    // BUT: No queue management!
}
```

✅ **Concurrency and rate limiting**
```rust
// Can detect when limits are exceeded
PolicyViolation::ConcurrencyLimitExceeded { limit: 5, current_count: 7 }
```

#### Problems Identified

**Problem 7.1: Non-Deterministic Scheduling Order**

**Scenario:**
```
Action has concurrency limit: 2
Time 0: E1 requested → starts (slot 1/2)
Time 1: E2 requested → starts (slot 2/2)
Time 2: E3 requested → DELAYED (no slots)
Time 3: E4 requested → DELAYED (no slots)
Time 4: E5 requested → DELAYED (no slots)
Time 5: E1 completes → which delayed execution runs?

Current behavior: UNDEFINED ORDER (possibly E5, then E3, then E4)
Expected behavior: FIFO - E3, then E4, then E5
```
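
The expected FIFO behavior in this scenario can be captured in a small synchronous model, which is useful as a reference when testing the real queue. The type and method names here are illustrative only and ignore async notification entirely.

```rust
use std::collections::VecDeque;

/// Single-action model: a concurrency limit plus a FIFO of waiting executions.
pub struct ActionQueueModel {
    waiting: VecDeque<i64>,
    running: u32,
    limit: u32,
}

impl ActionQueueModel {
    pub fn new(limit: u32) -> Self {
        Self { waiting: VecDeque::new(), running: 0, limit }
    }

    /// Returns true if the execution starts immediately, false if it was queued.
    pub fn request(&mut self, execution_id: i64) -> bool {
        // A newcomer may not overtake executions that are already waiting.
        if self.waiting.is_empty() && self.running < self.limit {
            self.running += 1;
            true
        } else {
            self.waiting.push_back(execution_id);
            false
        }
    }

    /// Frees one slot and returns the execution that should start next, if any.
    pub fn complete(&mut self) -> Option<i64> {
        self.running = self.running.saturating_sub(1);
        if self.running < self.limit {
            let next = self.waiting.pop_front()?;
            self.running += 1;
            Some(next)
        } else {
            None
        }
    }

    pub fn waiting(&self) -> usize {
        self.waiting.len()
    }
}
```

Replaying the scenario with a limit of 2: requests for E1 and E2 return `true`, E3 through E5 return `false`, and successive calls to `complete()` yield 3, 4, and 5 in that order.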

**Problem 7.2: No Queue Data Structure**
```rust
// Current implementation in policy_enforcer.rs
// Only polls for compliance - no queue!
loop {
    if self.check_policies(action_id, pack_id).await?.is_none() {
        return Ok(true);  // ← Just returns true, no coordination
    }
    tokio::time::sleep(Duration::from_secs(1)).await;
}
```

**Problem 7.3: Race Conditions**
- Multiple delayed executions poll simultaneously
- When slot opens, multiple executions might see it
- First to update wins, others keep waiting
- No fairness guarantee

**Problem 7.4: No Visibility into Queue**
- Can't see how many executions are waiting
- Can't see position in queue
- No way to estimate wait time
- Difficult to debug policy issues

### Business Impact

**Fairness Issues:**
- Later requests might execute before earlier ones
- Violates user expectations (FIFO is standard)
- Unpredictable execution order

**Workflow Dependencies:**
- Workflow step B requested after step A
- Step B might be scheduled before step A, even though A was requested first
- Data dependencies violated
- Incorrect results or failures

**Testing/Debugging:**
- Non-deterministic behavior hard to reproduce
- Integration tests become flaky
- Production issues difficult to diagnose

**Performance:**
- Polling wastes CPU cycles
- Multiple executions wake up unnecessarily
- Database load from repeated policy checks

### Recommendations

**CRITICAL (Must Fix Before v1.0):**

1. **Implement Per-Action Execution Queue**
   ```rust
   // New file: crates/executor/src/execution_queue.rs

   use std::collections::{HashMap, VecDeque};
   use std::sync::Arc;
   use tokio::sync::{Mutex, Notify};

   /// Manages FIFO queues of delayed executions per action
   pub struct ExecutionQueueManager {
       /// Queue per action_id
       queues: Arc<Mutex<HashMap<i64, ActionQueue>>>,
   }

   struct ActionQueue {
       /// FIFO queue of waiting execution IDs
       waiting: VecDeque<i64>,
       /// Notify when slot becomes available
       notify: Arc<Notify>,
       /// Current running count
       running_count: u32,
       /// Concurrency limit for this action
       limit: u32,
   }

   impl ExecutionQueueManager {
       /// Enqueue an execution (returns position in queue)
       pub async fn enqueue(&self, action_id: i64, execution_id: i64) -> usize {
           let mut queues = self.queues.lock().await;
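           // NOTE: `ActionQueue::new` is not shown; a real version needs the action's
           // concurrency limit, so it would take `limit` as a parameter
           let queue = queues.entry(action_id).or_insert_with(ActionQueue::new);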
           queue.waiting.push_back(execution_id);
           queue.waiting.len()
       }

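       /// Wait for turn (suspends until this execution can proceed)
       pub async fn wait_for_turn(&self, action_id: i64, execution_id: i64) -> Result<()> {
           loop {
               // Grab the action's Notify handle
               let notify = {
                   let queues = self.queues.lock().await;
                   queues
                       .get(&action_id)
                       .expect("execution must be enqueued before waiting")
                       .notify
                       .clone()
               };

               // Create the Notified future BEFORE checking state: it receives
               // notify_waiters() wakeups from the moment it is created, so a slot
               // freed between the check and the await is not missed.
               let notified = notify.notified();

               {
                   let mut queues = self.queues.lock().await;
                   let queue = queues
                       .get_mut(&action_id)
                       .expect("execution must be enqueued before waiting");

                   // Are we at the front AND is there capacity?
                   if queue.waiting.front() == Some(&execution_id)
                       && queue.running_count < queue.limit {
                       // It's our turn!
                       queue.waiting.pop_front();
                       queue.running_count += 1;
                       // Let the new front re-check in case more slots are free
                       queue.notify.notify_waiters();
                       return Ok(());
                   }
               }

               // Not our turn, wait for notification
               notified.await;
           }
       }

       /// Mark execution as complete (frees up slot)
       pub async fn complete(&self, action_id: i64, execution_id: i64) {
           let mut queues = self.queues.lock().await;
           if let Some(queue) = queues.get_mut(&action_id) {
               queue.running_count = queue.running_count.saturating_sub(1);
               // Wake every waiter; only the one at the front will proceed.
               // notify_one() could wake a non-front waiter and stall the queue.
               queue.notify.notify_waiters();
           }
       }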

       /// Get queue stats for monitoring
       pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats {
           let queues = self.queues.lock().await;
           if let Some(queue) = queues.get(&action_id) {
               QueueStats {
                   waiting: queue.waiting.len(),
                   running: queue.running_count as usize,
                   limit: queue.limit as usize,
                   avg_wait_time_seconds: None,  // Not tracked in this sketch
               }
           } else {
               QueueStats::default()
           }
       }
   }
   ```

2. **Integrate with PolicyEnforcer**
   ```rust
   // Update policy_enforcer.rs
   pub struct PolicyEnforcer {
       pool: PgPool,
       queue_manager: Arc<ExecutionQueueManager>,  // ← NEW
       // ... existing fields
   }

   pub async fn enforce_and_wait(
       &self,
       action_id: Id,
       execution_id: Id,
       pack_id: Option<Id>,
   ) -> Result<()> {
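       // Check if policy would be violated.
       // NOTE: executions that pass this check must still be counted in the queue's
       // `running_count`; otherwise a queued execution sees a count of 0 and proceeds at once.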
       if let Some(violation) = self.check_policies(action_id, pack_id).await? {
           match violation {
               PolicyViolation::ConcurrencyLimitExceeded { .. } => {
                   // Enqueue and wait for turn
                   let position = self.queue_manager.enqueue(action_id, execution_id).await;
                   info!("Execution {} queued at position {}", execution_id, position);

                   self.queue_manager.wait_for_turn(action_id, execution_id).await?;

                   info!("Execution {} proceeding after queue wait", execution_id);
               }
               _ => {
                   // Other policy types: retry with backoff
                   self.retry_with_backoff(action_id, pack_id).await?;
               }
           }
       }
       Ok(())
   }
   ```

3. **Update Scheduler to Use Queue**
   ```rust
   // In scheduler.rs
   async fn process_execution_requested(
       pool: &PgPool,
       publisher: &Publisher,
       policy_enforcer: &PolicyEnforcer,  // ← NEW parameter
       envelope: &MessageEnvelope<ExecutionRequestedPayload>,
   ) -> Result<()> {
       let execution_id = envelope.payload.execution_id;
       let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
       let action = Self::get_action_for_execution(pool, &execution).await?;

       // Enforce policies with queueing
       policy_enforcer.enforce_and_wait(
           action.id,
           execution_id,
           Some(action.pack),
       ).await?;

       // Now proceed with scheduling
       let worker = Self::select_worker(pool, &action).await?;
       // ...
   }
   ```

4. **Add Completion Notification**
   ```rust
   // Worker must notify when execution completes
   // In crates/worker/src/executor.rs

   async fn handle_execution_success(
       &self,
       execution_id: i64,
       action_id: i64,
       result: &ExecutionResult,
   ) -> Result<()> {
       // Update database
       ExecutionRepository::update(...).await?;

       // Notify queue manager (via message queue)
       let payload = ExecutionCompletedPayload {
           execution_id,
           action_id,
           status: ExecutionStatus::Completed,
       };
       self.publisher.publish("execution.completed", payload).await?;

       Ok(())
   }
   ```

5. **Add Queue Monitoring API**
   ```rust
   // New endpoint in API service
   /// GET /api/v1/actions/:id/queue-stats
   async fn get_action_queue_stats(
       State(state): State<Arc<AppState>>,
       Path(action_id): Path<i64>,
   ) -> Result<Json<ApiResponse<QueueStats>>> {
       let stats = state.queue_manager.get_queue_stats(action_id).await;
       Ok(Json(ApiResponse::success(stats)))
   }

   #[derive(Serialize, Default)]  // Default is required by `QueueStats::default()` in get_queue_stats
   pub struct QueueStats {
       pub waiting: usize,
       pub running: usize,
       pub limit: usize,
       pub avg_wait_time_seconds: Option<f64>,
   }
   ```

**IMPLEMENTATION PRIORITY: CRITICAL**
- This affects correctness and fairness of the system
- Must be implemented before production use
- Should be addressed in Phase 3 (Executor Service completion)

### Testing Requirements

**Unit Tests:**
- [ ] Queue maintains FIFO order
- [ ] Multiple executions enqueue correctly
- [ ] Dequeue happens in order
- [ ] Notify wakes correct waiting execution
- [ ] Concurrent enqueue/dequeue operations safe

**Integration Tests:**
- [ ] End-to-end execution ordering with policies
- [ ] Three executions with limit=1 execute in order
- [ ] Queue stats reflect actual state
- [ ] Worker completion notification releases queue slot

**Load Tests:**
- [ ] 1000 concurrent delayed executions
- [ ] Correct ordering maintained under load
- [ ] No missed notifications or deadlocks
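
A concurrency test for ordering needs a deterministic way to prove that spawn order does not matter. One option, sketched here with `std::sync::Condvar` and threads instead of tokio's `Notify`, is a ticket-based gate: tickets are issued in request order, then waiters are started in reverse. All names are hypothetical.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

struct GateState {
    next_ticket: u64,
    now_serving: u64,
    running: u32,
    limit: u32,
}

/// FIFO admission gate: tickets are served strictly in issue order.
pub struct FairGate {
    state: Mutex<GateState>,
    cv: Condvar,
}

impl FairGate {
    pub fn new(limit: u32) -> Self {
        let state = GateState { next_ticket: 0, now_serving: 0, running: 0, limit };
        Self { state: Mutex::new(state), cv: Condvar::new() }
    }

    /// Takes a place in line; the returned ticket fixes the caller's position.
    pub fn enqueue(&self) -> u64 {
        let mut s = self.state.lock().unwrap();
        let ticket = s.next_ticket;
        s.next_ticket += 1;
        ticket
    }

    /// Blocks until `ticket` is at the front and a slot is free.
    pub fn wait_for_turn(&self, ticket: u64) {
        let mut s = self.state.lock().unwrap();
        while !(s.now_serving == ticket && s.running < s.limit) {
            s = self.cv.wait(s).unwrap();
        }
        s.now_serving += 1;
        s.running += 1;
        // The next ticket may also fit if more slots are free.
        self.cv.notify_all();
    }

    /// Releases a slot and wakes all waiters so the front one can re-check.
    pub fn complete(&self) {
        let mut s = self.state.lock().unwrap();
        s.running = s.running.saturating_sub(1);
        self.cv.notify_all();
    }
}

/// Issues `n` tickets in order, starts the waiting threads in reverse order,
/// and returns the order in which tickets were actually admitted.
pub fn run_in_reverse_spawn_order(limit: u32, n: u64) -> Vec<u64> {
    let gate = Arc::new(FairGate::new(limit));
    let order = Arc::new(Mutex::new(Vec::new()));
    let tickets: Vec<u64> = (0..n).map(|_| gate.enqueue()).collect();

    let handles: Vec<_> = tickets
        .into_iter()
        .rev()
        .map(|ticket| {
            let gate = Arc::clone(&gate);
            let order = Arc::clone(&order);
            thread::spawn(move || {
                gate.wait_for_turn(ticket);
                order.lock().unwrap().push(ticket);
                gate.complete();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    let result = order.lock().unwrap().clone();
    result
}
```

With a limit of 1 the admitted order must equal the ticket order regardless of thread scheduling, which makes the assertion stable enough for CI. With higher limits only completeness (every ticket admitted, no deadlock) can be asserted.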

---

## Summary of Critical Issues

| Issue | Severity | Status | Must Fix Before v1.0 |
|-------|----------|--------|---------------------|
| 1. Action Coupling | ✅ Good | Avoided | No |
| 2. Type Safety | ✅ Excellent | Avoided | No |
| 3. Language Ecosystems | ⚠️ Moderate | Partial | **Yes** - Implement pack installation |
| 4. Dependency Hell | 🔴 Critical | Vulnerable | **Yes** - Implement venv isolation |
| 5. Secret Security | 🔴 Critical | Vulnerable | **Yes** - Use stdin/files for secrets |
| 6. Log Storage | ⚠️ Moderate | Good Design | **Yes** - Add size limits |
| 7. Policy Execution Order | 🔴 Critical | Missing | **Yes** - Implement FIFO queue |

---

## Recommended Implementation Order

### Phase 1: Security & Correctness Fixes (Sprint 1 - Week 1-3)
**Priority: CRITICAL - Block All Other Work**

1. Fix secret passing vulnerability (Issue 5)
   - Implement stdin-based secret injection
   - Remove secrets from environment variables
   - Update Python/Shell runtime wrappers
   - Add security documentation

2. Implement execution queue for policies (Issue 7) **NEW**
   - FIFO queue per action
   - Notify mechanism for slot availability
   - Integration with PolicyEnforcer
   - Queue monitoring API

### Phase 2: Runtime Isolation (Sprint 2 - Week 4-5)
**Priority: HIGH - Required for Production**

3. Implement per-pack virtual environments (Issue 4)
   - Python venv creation per pack
   - Dependency installation service
   - Runtime version management

4. Add pack installation service (Issue 3)
   - Pack setup/teardown lifecycle
   - Dependency resolution
   - Installation status tracking

### Phase 3: Operational Hardening (Sprint 3 - Week 6-7)
**Priority: HIGH for log size limits (required before production), MEDIUM for rotation and compression**

5. Implement log size limits (Issue 6)
   - Streaming log collection
   - Size-based truncation
   - Configuration options

6. Add log rotation and compression
   - Multi-file logs
   - Automatic compression
   - Retention policies

### Phase 4: Advanced Features (v1.1+)
**Priority: LOW - Future Enhancement**

7. Container-based runtimes
8. Multi-version runtime support
9. Advanced dependency management
10. Log streaming API
11. Pack marketplace/registry

---

## Testing Checklist

Before marking issues as resolved, verify:

### Issue 5 (Secret Security)
- [ ] Secrets not visible in `ps auxwwe`
- [ ] Secrets not readable from `/proc/{pid}/environ`
- [ ] Actions can successfully read secrets from stdin/file
- [ ] Python wrapper script reads secrets securely
- [ ] Shell wrapper script reads secrets securely
- [ ] Documentation updated with secure patterns
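
The `/proc/{pid}/environ` check can be automated in an integration test. The helper below is a sketch that scans an environ block (NUL-separated `KEY=VALUE` entries) for variable names with a given prefix; the function name and the `SECRET_` prefix convention are taken as assumptions from the current `prepare_secret_env` naming.

```rust
/// Returns the names of variables in a `/proc/<pid>/environ` block whose
/// name starts with `prefix`. An empty result means nothing leaked.
pub fn leaked_secret_vars(environ: &[u8], prefix: &str) -> Vec<String> {
    environ
        .split(|&byte| byte == 0)
        .filter(|entry| !entry.is_empty())
        .filter_map(|entry| {
            // Entries without '=' are malformed and skipped.
            let name_end = entry.iter().position(|&byte| byte == b'=')?;
            let name = String::from_utf8_lossy(&entry[..name_end]).into_owned();
            if name.starts_with(prefix) {
                Some(name)
            } else {
                None
            }
        })
        .collect()
}
```

In a test, the block would come from `std::fs::read(format!("/proc/{}/environ", child_pid))` while the action is still running. The file reflects the environment the process was started with, which is exactly what this check needs.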

### Issue 7 (Policy Execution Order) **NEW**
- [ ] Execution queue maintains FIFO order
- [ ] Three executions with limit=1 execute in correct order
- [ ] Queue stats API returns accurate counts
- [ ] Worker completion notification releases queue slot
- [ ] No race conditions under concurrent load
- [ ] Correct ordering with 1000 delayed executions

### Issue 4 (Dependency Isolation)
- [ ] Each pack gets isolated venv
- [ ] Installing pack A dependencies doesn't affect pack B
- [ ] Upgrading system Python doesn't break existing packs
- [ ] Runtime version can be specified per pack
- [ ] Multiple Python versions can coexist

### Issue 3 (Language Support)
- [ ] Python packs can declare dependencies in metadata
- [ ] `pip install` runs during pack installation
- [ ] Node.js packs supported with npm install
- [ ] Pack installation status tracked
- [ ] Failed installations reported with logs

### Issue 6 (Log Limits)
- [ ] Logs truncated at configured size limit
- [ ] Worker doesn't OOM on large output
- [ ] Truncation is clearly marked in logs
- [ ] Multiple log files created for rotation
- [ ] Old logs cleaned up per retention policy

---

## Architecture Decision Records

### ADR-001: Use Stdin for Secret Injection
**Decision:** Pass secrets via stdin as JSON instead of environment variables.

**Rationale:**
- Environment variables visible in `/proc/{pid}/environ`
- stdin content not exposed to other processes
- Follows principle of least privilege
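- Consistent with the common practice of keeping secrets out of environment variables (for example, Kubernetes can mount Secrets as files, and `docker login --password-stdin` reads credentials from stdin)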

**Consequences:**
- Requires wrapper script modifications
- Actions must explicitly read from stdin
- Slight increase in complexity
- **Major security improvement**

### ADR-002: Per-Pack Virtual Environments
**Decision:** Each pack gets isolated Python virtual environment.

**Rationale:**
- Prevents dependency conflicts between packs
- Allows different Python versions per pack
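- Protects against system Python upgrades, provided the venv is built from an Attune-managed interpreter (a venv created from the system `python3` still links back to it)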
- Standard practice in Python ecosystem

**Consequences:**
- Increased disk usage (one venv per pack)
- Pack installation takes longer
- Worker must manage venv lifecycle
- **Eliminates dependency hell**

### ADR-003: Filesystem-Based Log Storage
**Decision:** Store logs in filesystem, not database.

**Rationale:**
- Database not designed for large blob storage
- Filesystem handles large files efficiently
- Easy to implement rotation and compression
- Can stream logs without loading entire file

**Consequences:**
- Logs separate from structured execution data
- Need backup strategy for log directory
- Cleanup/retention requires separate process
- **Avoids database bloat and failures**

---

## References

- StackStorm Lessons Learned: `work-summary/StackStorm-Lessons-Learned.md`
- Current Worker Implementation: `crates/worker/src/`
- Runtime Abstraction: `crates/worker/src/runtime/`
- Secret Management: `crates/worker/src/secrets.rs`
- Artifact Storage: `crates/worker/src/artifacts.rs`
- Database Schema: `migrations/20240101000004_create_runtime_worker.sql`

---

## Next Steps

1. **Review this analysis with team** - Discuss priorities and timeline
2. **Create GitHub issues** - One issue per critical problem
3. **Update TODO.md** - Add tasks from Implementation Order section
4. **Begin Phase 1** - Security fixes first, before any other work
5. **Schedule security review** - After Phase 1 completion

---

**Document Status:** Complete - Ready for Review
**Author:** AI Assistant
**Reviewers Needed:** Security Team, Architecture Team, DevOps Lead