attune/work-summary/phases/session-2024-01-02-stackstorm-analysis.md

# Session Summary: StackStorm Pitfall Analysis

**Date:** 2024-01-02
**Duration:** ~2 hours
**Focus:** Analysis of StackStorm lessons learned and identification of replicated pitfalls in current Attune implementation

---

## Session Objectives

1. Review StackStorm lessons learned document
2. Analyze current Attune implementation against known pitfalls
3. Identify security vulnerabilities and architectural issues
4. Create comprehensive remediation plan
5. Document findings without beginning implementation

---

## Work Completed

### 1. Comprehensive Pitfall Analysis
**File Created:** `work-summary/StackStorm-Pitfalls-Analysis.md` (659 lines)

**Key Findings:**
- ✅ **2 Issues Avoided**: Action coupling, type safety (Rust's strong typing prevents these)
- ⚠️ **2 Moderate Issues**: Language ecosystem support, log size limits
- 🔴 **3 Critical Issues**: Dependency hell, insecure secret passing, policy execution ordering

**Critical Security Vulnerability Identified:**
```rust
// CURRENT IMPLEMENTATION - INSECURE!
env.insert("SECRET_API_KEY", "my-secret-value");  // ← Visible in /proc/pid/environ
cmd.env("SECRET_API_KEY", "my-secret-value");     // ← Visible in ps auxwwe
```

Any user with shell access can view secrets via:
- `ps auxwwe` - shows environment variables
- `cat /proc/{pid}/environ` - shows full environment
- Process table inspection tools

### 2. Detailed Resolution Plan
**File Created:** `work-summary/Pitfall-Resolution-Plan.md` (1,153 lines)

**Implementation Phases Defined:**
1. **Phase 1: Security Critical** (3-5 days) - Fix secret passing via stdin
2. **Phase 2: Dependency Isolation** (7-10 days) - Per-pack virtual environments
3. **Phase 3: Language Support** (5-7 days) - Multi-language dependency management
4. **Phase 4: Log Limits** (3-4 days) - Streaming logs with size limits

**Total Estimated Effort:** 18-26 days (3.5-5 weeks)

### 3. Updated TODO Roadmap
**File Modified:** `work-summary/TODO.md`

Added new Phase 0 (StackStorm Pitfall Remediation) as CRITICAL priority, blocking production deployment.

---

## Critical Issues Discovered

### Issue P5: Insecure Secret Passing (🔴 CRITICAL - P0)

**Current Implementation:**
- Secrets passed as environment variables
- Visible in process table (`ps`, `/proc/pid/environ`)
- Major security vulnerability

**Proposed Solution:**
- Pass secrets via stdin as JSON payload
- Separate secrets from environment variables
- Update Python/Shell runtime wrappers to read from stdin
- Add security tests to verify secrets not exposed

**Files Affected:**
- `crates/worker/src/secrets.rs`
- `crates/worker/src/executor.rs`
- `crates/worker/src/runtime/python.rs`
- `crates/worker/src/runtime/shell.rs`
- `crates/worker/src/runtime/mod.rs`

**Security Test Requirements:**
```rust
#[test]
fn test_secrets_not_in_process_env() {
    // Verify secrets not readable from /proc/pid/environ
}

#[test]
fn test_secrets_not_visible_in_ps() {
    // Verify secrets not in ps output
}
```

### Issue P7: Policy Execution Ordering (🔴 CRITICAL - P0) **NEW**

**Current Implementation:**
```rust
// In policy_enforcer.rs - only polls, no queue!
pub async fn wait_for_policy_compliance(...) -> Result<bool> {
    loop {
        if self.check_policies(action_id, pack_id).await?.is_none() {
            return Ok(true);  // ← Just returns, no coordination!
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```

**Problems:**
- No queue data structure for delayed executions
- Multiple executions poll simultaneously
- Non-deterministic order when slot opens
- Race conditions - first to update wins
- Violates FIFO expectations

**Business Scenario:**
```
Action with concurrency limit: 2
Time 0: E1 requested → starts (slot 1/2)
Time 1: E2 requested → starts (slot 2/2)
Time 2: E3 requested → DELAYED
Time 3: E4 requested → DELAYED
Time 4: E5 requested → DELAYED
Time 5: E1 completes → which executes next?

Current: UNDEFINED ORDER (might be E5, E3, E4)
Expected: FIFO ORDER (E3, then E4, then E5)
```

**Proposed Solution:**
- Implement ExecutionQueueManager with FIFO queue per action
- Use tokio::sync::Notify for slot availability notifications
- Integrate with PolicyEnforcer.enforce_and_wait
- Worker publishes completion messages to release slots
- Add queue monitoring API endpoint

**Implementation:**
```rust
pub struct ExecutionQueueManager {
    queues: Arc<Mutex<HashMap<i64, ActionQueue>>>,
}

struct ActionQueue {
    waiting: VecDeque<QueueEntry>,
    notify: Arc<Notify>,
    running_count: u32,
    limit: u32,
}
```

### Issue P4: Dependency Hell (🔴 CRITICAL - P1)

**Current Implementation:**
```rust
pub fn new() -> Self {
    Self {
        python_path: PathBuf::from("python3"),  // ← SYSTEM PYTHON!
        // ...
    }
}
```

**Problems:**
- All packs share system Python
- Upgrading system Python breaks existing packs
- No dependency isolation between packs
- Conflicts between pack requirements

**Proposed Solution:**
- Create virtual environment per pack: `/var/lib/attune/packs/{pack_ref}/.venv/`
- Install dependencies during pack installation
- Use pack-specific venv for execution
- Support multiple Python versions

**Implementation:**
```rust
pub struct VenvManager {
    python_path: PathBuf,
    venv_base: PathBuf,
}

impl VenvManager {
    async fn create_venv(&self, pack_ref: &str) -> Result<PathBuf>
    async fn install_requirements(&self, pack_ref: &str, requirements: &[String]) -> Result<()>
    fn get_venv_python(&self, pack_ref: &str) -> PathBuf
}
```

### Issue P6: Log Size Limits (⚠️ MODERATE - P1)

**Current Implementation:**
```rust
// Buffers entire output in memory!
let output = execution_future.await?;
let stdout = String::from_utf8_lossy(&output.stdout).to_string();  // Could be GB!
```

**Problems:**
- No size limits on log output
- Worker can OOM on large output
- No streaming - everything buffered in memory

**Proposed Solution:**
- Stream logs to files during execution
- Implement size-based truncation (e.g., 10MB limit)
- Add configuration for log limits
- Truncation notice in logs when limit exceeded

### Issue P3: Language Ecosystem Support (⚠️ MODERATE - P2)

**Current Implementation:**
- Pack has `runtime_deps` field but not used
- No pack installation service
- No npm/pip integration
- Manual dependency management required

**Proposed Solution:**
- Implement PackInstaller service
- Support `requirements.txt` for Python
- Support `package.json` for Node.js
- Add pack installation API endpoint
- Track installation status in database

---

## Architecture Decisions Made

### ADR-001: Use Stdin for Secret Injection
**Decision:** Pass secrets via stdin as JSON instead of environment variables.

**Rationale:**
- Environment variables visible in `/proc/{pid}/environ`
- stdin content not exposed to other processes
- Follows principle of least privilege
- Industry best practice (Kubernetes, HashiCorp Vault)

### ADR-002: Per-Pack Virtual Environments
**Decision:** Each pack gets isolated Python virtual environment.

**Rationale:**
- Prevents dependency conflicts between packs
- Allows different Python versions per pack
- Protects against system Python upgrades
- Standard practice in Python ecosystem

### ADR-003: Filesystem-Based Log Storage
**Decision:** Store logs in filesystem, not database (already implemented).

**Rationale:**
- Database not designed for large blob storage
- Filesystem handles large files efficiently
- Easy to implement rotation and compression
- Can stream logs without loading entire file

---

## Implementation Priority

### Immediate (Before Any Production Use)
1. **P5: Secret Security Fix** - BLOCKING all other work
2. **P4: Dependency Isolation** - Required for production
3. **P6: Log Size Limits** - Worker stability

### Short-Term (v1.0 Release)
4. **P3: Language Ecosystem Support** - Pack ecosystem growth

### Medium-Term (v1.1+)
5. Multiple runtime versions
6. Container-based runtimes
7. Log streaming API
8. Pack marketplace

---

## Files Created

1. `work-summary/StackStorm-Pitfalls-Analysis.md` (659 lines)
   - Comprehensive analysis of 6 potential pitfalls
   - 3 critical issues identified and documented
   - Testing checklist and success criteria

2. `work-summary/Pitfall-Resolution-Plan.md` (1,153 lines)
   - Detailed implementation tasks for each issue
   - Code examples and acceptance criteria
   - Estimated effort and dependencies
   - Testing strategy and rollout plan

3. `work-summary/TODO.md` (updated)
   - Added Phase 0: StackStorm Pitfall Remediation
   - Marked as CRITICAL priority
   - Blocks production deployment

---

## Code Analysis Performed

### Files Reviewed
- `crates/common/src/models.rs` - Data models
- `crates/worker/src/executor.rs` - Action execution orchestration
- `crates/worker/src/runtime/python.rs` - Python runtime implementation
- `crates/worker/src/runtime/shell.rs` - Shell runtime implementation
- `crates/worker/src/runtime/mod.rs` - Runtime abstraction
- `crates/worker/src/secrets.rs` - Secret management
- `crates/worker/src/artifacts.rs` - Log storage
- `migrations/20240101000004_create_runtime_worker.sql` - Database schema

### Security Audit Findings

**CRITICAL: Secret Exposure**
```rust
// Line 142 in secrets.rs - INSECURE!
pub fn prepare_secret_env(&self, secrets: &HashMap<String, String>)
    -> HashMap<String, String> {
    secrets
        .iter()
        .map(|(name, value)| {
            let env_name = format!("SECRET_{}", name.to_uppercase().replace('-', "_"));
            (env_name, value.clone())  // ← EXPOSED IN PROCESS ENV!
        })
        .collect()
}

// Line 228 in executor.rs - INSECURE!
env.extend(secret_env);  // ← Secrets added to environment
```

**CRITICAL: Dependency Coupling**
```rust
// Line 19 in python.rs - PROBLEMATIC!
pub fn new() -> Self {
    Self {
        python_path: PathBuf::from("python3"),  // ← SYSTEM PYTHON!
        work_dir: PathBuf::from("/tmp/attune/actions"),
    }
}
```

**MODERATE: Log Buffer Issue**
```rust
// Line 122+ in python.rs - COULD OOM!
let output = execution_future.await?;
let stdout = String::from_utf8_lossy(&output.stdout).to_string();  // ← ALL in memory!
let stderr = String::from_utf8_lossy(&output.stderr).to_string();
```

---

## Recommendations

### Immediate Actions Required

1. **STOP any production deployment** until P5 (secret security) and P7 (execution ordering) are fixed
2. **Begin Phase 1 implementation** (policy ordering + secret passing fixes) immediately
3. **Schedule security review** after Phase 1 completion
4. **Create GitHub issues** for each critical problem
5. **Update project timeline** to include 4.5-6.5 week remediation period

### Development Workflow Changes

1. **Add security tests to CI/CD pipeline**
   - Verify secrets not in environment
   - Verify secrets not in command line
   - Verify pack isolation

2. **Require security review for:**
   - Any changes to secret handling
   - Any changes to runtime execution
   - Any changes to pack installation

3. **Add to PR checklist:**
   - [ ] No secrets passed via environment variables
   - [ ] No unbounded memory usage for logs
   - [ ] Pack dependencies isolated

---

## Testing Strategy Defined

### Correctness Tests (Must Pass Before v1.0)
- [ ] Three executions with limit=1 execute in FIFO order
- [ ] Queue maintains order with 1000 concurrent enqueues
- [ ] Worker completion notification releases queue slot
- [ ] Queue stats API returns accurate counts
- [ ] No race conditions under concurrent load

### Security Tests (Must Pass Before v1.0)
- [ ] Secrets not visible in `ps auxwwe`
- [ ] Secrets not readable from `/proc/{pid}/environ`
- [ ] Actions can successfully read secrets from stdin
- [ ] Python wrapper script reads secrets securely
- [ ] Shell wrapper script reads secrets securely

### Isolation Tests
- [ ] Each pack gets isolated venv
- [ ] Installing pack A dependencies doesn't affect pack B
- [ ] Upgrading system Python doesn't break existing packs
- [ ] Multiple Python versions can coexist

### Stability Tests
- [ ] Logs truncated at configured size limit
- [ ] Worker doesn't OOM on large output
- [ ] Multiple log files created for rotation
- [ ] Old logs cleaned up per retention policy

---

## Documentation Created

### Analysis Documents
1. **StackStorm-Pitfalls-Analysis.md**
   - Executive summary
   - Issue-by-issue analysis
   - Recommendations and priorities
   - Architecture decision records
   - Testing checklist

2. **Pitfall-Resolution-Plan.md**
   - Phase-by-phase implementation plan
   - Detailed task breakdown with code examples
   - Effort estimates and dependencies
   - Testing strategy
   - Rollout plan
   - Risk mitigation

### Updates to Existing Docs
3. **TODO.md**
   - New Phase 0 for critical remediation
   - Added P7 (Policy Execution Ordering) as P0 priority
   - Priority markers (P0, P1, P2)
   - Updated estimated timelines (now 4.5-6.5 weeks)
   - Completion criteria

---

## Next Session Tasks

### Before Starting Implementation
1. **Team review of analysis documents**
   - Discuss findings and priorities
   - Approve implementation plan
   - Assign task owners

2. **Create GitHub issues**
   - Issue for P5 (secret security)
   - Issue for P4 (dependency isolation)
   - Issue for P6 (log limits)
   - Issue for P3 (language support)

3. **Update project milestones**
   - Add Phase 0 completion milestone
   - Adjust v1.0 release date (+3-5 weeks)
   - Schedule security audit

### Implementation Start
4. **Begin Phase 1A: Policy Execution Ordering**
   - Create feature branch: `fix/policy-execution-ordering`
   - Implement ExecutionQueueManager
   - Integrate with PolicyEnforcer
   - Add completion notification system
   - Add queue monitoring API

5. **Begin Phase 1B: Secret Security Fix**
   - Create feature branch: `fix/secure-secret-passing`
   - Implement stdin-based secret injection
   - Update Python runtime
   - Update Shell runtime
   - Add security tests

---

## Metrics

- **Lines of Analysis Written:** 2,500+ lines
- **Issues Identified:** 7 total (2 avoided, 2 moderate, 3 critical)
- **Files Analyzed:** 10 source files (added executor services)
- **Security Vulnerabilities Found:** 1 critical (secret exposure)
- **Correctness Issues Found:** 1 critical (execution ordering)
- **Architectural Issues Found:** 3 (dependency hell, log limits, language support)
- **Estimated Remediation Time:** 22-32 days (updated from 18-26)
- **Documentation Files Created:** 2 new, 1 updated

---

## Session Outcome

✅ **Objectives Achieved:**
- Comprehensive analysis of StackStorm pitfalls completed
- Critical security vulnerability identified and documented
- Detailed remediation plan created with concrete tasks
- Implementation priorities established
- No implementation work started (as requested)

⚠️ **Critical Findings:**
- **BLOCKING ISSUE #1:** Policy execution ordering violates FIFO expectations and workflow dependencies
- **BLOCKING ISSUE #2:** Secret exposure vulnerability must be fixed before production
- **HIGH PRIORITY:** Dependency isolation required for stable operation
- **MODERATE:** Log size limits needed for worker stability

📋 **Ready for Next Phase:**
- Analysis documents ready for team review
- Implementation plan provides clear roadmap
- All tasks have acceptance criteria and time estimates
- Testing strategy defined and comprehensive

---

**Status:** Analysis Complete - Ready for Implementation Planning
**Blocking Issues:** 2 critical security/architectural issues identified
**Recommended Next Action:** Team review and approval, then begin Phase 1 (Security Fix)

---

## Key Takeaways

1. **Good News:** Rust's type system already prevents 2 major StackStorm pitfalls
2. **Bad News:** 2 critical issues found - security vulnerability + correctness bug
3. **Action Required:** 4.5-6.5 week remediation period needed before production
4. **Silver Lining:** Issues caught early, before production deployment
5. **Lesson Learned:** Security AND correctness review should be part of initial design phase
6. **User Contribution:** P7 (execution ordering) discovered by user input during analysis

---

**End of Session Summary**