524 lines
16 KiB
Markdown
524 lines
16 KiB
Markdown
# Session Summary: StackStorm Pitfall Analysis
|
|
|
|
**Date:** 2024-01-02
|
|
**Duration:** ~2 hours
|
|
**Focus:** Analysis of StackStorm lessons learned and identification of replicated pitfalls in current Attune implementation
|
|
|
|
---
|
|
|
|
## Session Objectives
|
|
|
|
1. Review StackStorm lessons learned document
|
|
2. Analyze current Attune implementation against known pitfalls
|
|
3. Identify security vulnerabilities and architectural issues
|
|
4. Create comprehensive remediation plan
|
|
5. Document findings without beginning implementation
|
|
|
|
---
|
|
|
|
## Work Completed
|
|
|
|
### 1. Comprehensive Pitfall Analysis
|
|
**File Created:** `work-summary/StackStorm-Pitfalls-Analysis.md` (659 lines)
|
|
|
|
**Key Findings:**
|
|
- ✅ **2 Issues Avoided**: Action coupling, type safety (Rust's strong typing prevents these)
|
|
- ⚠️ **2 Moderate Issues**: Language ecosystem support, log size limits
|
|
- 🔴 **3 Critical Issues**: Dependency hell, insecure secret passing, policy execution ordering
|
|
|
|
**Critical Security Vulnerability Identified:**
|
|
```rust
|
|
// CURRENT IMPLEMENTATION - INSECURE!
|
|
env.insert("SECRET_API_KEY", "my-secret-value"); // ← Visible in /proc/pid/environ
|
|
cmd.env("SECRET_API_KEY", "my-secret-value"); // ← Visible in ps auxwwe
|
|
```
|
|
|
|
Any user with shell access can view secrets via:
|
|
- `ps auxwwe` - shows environment variables
|
|
- `cat /proc/{pid}/environ` - shows full environment
|
|
- Process table inspection tools
|
|
|
|
### 2. Detailed Resolution Plan
|
|
**File Created:** `work-summary/Pitfall-Resolution-Plan.md` (1,153 lines)
|
|
|
|
**Implementation Phases Defined:**
|
|
1. **Phase 1: Security Critical** (3-5 days) - Fix secret passing via stdin
|
|
2. **Phase 2: Dependency Isolation** (7-10 days) - Per-pack virtual environments
|
|
3. **Phase 3: Language Support** (5-7 days) - Multi-language dependency management
|
|
4. **Phase 4: Log Limits** (3-4 days) - Streaming logs with size limits
|
|
|
|
**Total Estimated Effort:** 18-26 days (3.5-5 weeks)
|
|
|
|
### 3. Updated TODO Roadmap
|
|
**File Modified:** `work-summary/TODO.md`
|
|
|
|
Added new Phase 0 (StackStorm Pitfall Remediation) as CRITICAL priority, blocking production deployment.
|
|
|
|
---
|
|
|
|
## Critical Issues Discovered
|
|
|
|
### Issue P5: Insecure Secret Passing (🔴 CRITICAL - P0)
|
|
|
|
**Current Implementation:**
|
|
- Secrets passed as environment variables
|
|
- Visible in process table (`ps`, `/proc/pid/environ`)
|
|
- Major security vulnerability
|
|
|
|
**Proposed Solution:**
|
|
- Pass secrets via stdin as JSON payload
|
|
- Separate secrets from environment variables
|
|
- Update Python/Shell runtime wrappers to read from stdin
|
|
- Add security tests to verify secrets not exposed
|
|
|
|
**Files Affected:**
|
|
- `crates/worker/src/secrets.rs`
|
|
- `crates/worker/src/executor.rs`
|
|
- `crates/worker/src/runtime/python.rs`
|
|
- `crates/worker/src/runtime/shell.rs`
|
|
- `crates/worker/src/runtime/mod.rs`
|
|
|
|
**Security Test Requirements:**
|
|
```rust
|
|
#[test]
|
|
fn test_secrets_not_in_process_env() {
|
|
// Verify secrets not readable from /proc/pid/environ
|
|
}
|
|
|
|
#[test]
|
|
fn test_secrets_not_visible_in_ps() {
|
|
// Verify secrets not in ps output
|
|
}
|
|
```
|
|
|
|
### Issue P7: Policy Execution Ordering (🔴 CRITICAL - P0) **NEW**
|
|
|
|
**Current Implementation:**
|
|
```rust
|
|
// In policy_enforcer.rs - only polls, no queue!
|
|
pub async fn wait_for_policy_compliance(...) -> Result<bool> {
|
|
loop {
|
|
if self.check_policies(action_id, pack_id).await?.is_none() {
|
|
return Ok(true); // ← Just returns, no coordination!
|
|
}
|
|
tokio::time::sleep(Duration::from_secs(1)).await;
|
|
}
|
|
}
|
|
```
|
|
|
|
**Problems:**
|
|
- No queue data structure for delayed executions
|
|
- Multiple executions poll simultaneously
|
|
- Non-deterministic order when slot opens
|
|
- Race conditions - first to update wins
|
|
- Violates FIFO expectations
|
|
|
|
**Business Scenario:**
|
|
```
|
|
Action with concurrency limit: 2
|
|
Time 0: E1 requested → starts (slot 1/2)
|
|
Time 1: E2 requested → starts (slot 2/2)
|
|
Time 2: E3 requested → DELAYED
|
|
Time 3: E4 requested → DELAYED
|
|
Time 4: E5 requested → DELAYED
|
|
Time 5: E1 completes → which executes next?
|
|
|
|
Current: UNDEFINED ORDER (might be E5, E3, E4)
|
|
Expected: FIFO ORDER (E3, then E4, then E5)
|
|
```
|
|
|
|
**Proposed Solution:**
|
|
- Implement ExecutionQueueManager with FIFO queue per action
|
|
- Use tokio::sync::Notify for slot availability notifications
|
|
- Integrate with PolicyEnforcer.enforce_and_wait
|
|
- Worker publishes completion messages to release slots
|
|
- Add queue monitoring API endpoint
|
|
|
|
**Implementation:**
|
|
```rust
|
|
pub struct ExecutionQueueManager {
|
|
queues: Arc<Mutex<HashMap<i64, ActionQueue>>>,
|
|
}
|
|
|
|
struct ActionQueue {
|
|
waiting: VecDeque<QueueEntry>,
|
|
notify: Arc<Notify>,
|
|
running_count: u32,
|
|
limit: u32,
|
|
}
|
|
```
|
|
|
|
### Issue P4: Dependency Hell (🔴 CRITICAL - P1)
|
|
|
|
**Current Implementation:**
|
|
```rust
|
|
pub fn new() -> Self {
|
|
Self {
|
|
python_path: PathBuf::from("python3"), // ← SYSTEM PYTHON!
|
|
// ...
|
|
}
|
|
}
|
|
```
|
|
|
|
**Problems:**
|
|
- All packs share system Python
|
|
- Upgrading system Python breaks existing packs
|
|
- No dependency isolation between packs
|
|
- Conflicts between pack requirements
|
|
|
|
**Proposed Solution:**
|
|
- Create virtual environment per pack: `/var/lib/attune/packs/{pack_ref}/.venv/`
|
|
- Install dependencies during pack installation
|
|
- Use pack-specific venv for execution
|
|
- Support multiple Python versions
|
|
|
|
**Implementation:**
|
|
```rust
|
|
pub struct VenvManager {
|
|
python_path: PathBuf,
|
|
venv_base: PathBuf,
|
|
}
|
|
|
|
impl VenvManager {
|
|
async fn create_venv(&self, pack_ref: &str) -> Result<PathBuf>
|
|
async fn install_requirements(&self, pack_ref: &str, requirements: &[String]) -> Result<()>
|
|
fn get_venv_python(&self, pack_ref: &str) -> PathBuf
|
|
}
|
|
```
|
|
|
|
### Issue P6: Log Size Limits (⚠️ MODERATE - P1)
|
|
|
|
**Current Implementation:**
|
|
```rust
|
|
// Buffers entire output in memory!
|
|
let output = execution_future.await?;
|
|
let stdout = String::from_utf8_lossy(&output.stdout).to_string(); // Could be GB!
|
|
```
|
|
|
|
**Problems:**
|
|
- No size limits on log output
|
|
- Worker can OOM on large output
|
|
- No streaming - everything buffered in memory
|
|
|
|
**Proposed Solution:**
|
|
- Stream logs to files during execution
|
|
- Implement size-based truncation (e.g., 10MB limit)
|
|
- Add configuration for log limits
|
|
- Truncation notice in logs when limit exceeded
|
|
|
|
### Issue P3: Language Ecosystem Support (⚠️ MODERATE - P2)
|
|
|
|
**Current Implementation:**
|
|
- Pack has `runtime_deps` field but not used
|
|
- No pack installation service
|
|
- No npm/pip integration
|
|
- Manual dependency management required
|
|
|
|
**Proposed Solution:**
|
|
- Implement PackInstaller service
|
|
- Support `requirements.txt` for Python
|
|
- Support `package.json` for Node.js
|
|
- Add pack installation API endpoint
|
|
- Track installation status in database
|
|
|
|
---
|
|
|
|
## Architecture Decisions Made
|
|
|
|
### ADR-001: Use Stdin for Secret Injection
|
|
**Decision:** Pass secrets via stdin as JSON instead of environment variables.
|
|
|
|
**Rationale:**
|
|
- Environment variables visible in `/proc/{pid}/environ`
|
|
- stdin content not exposed to other processes
|
|
- Follows principle of least privilege
|
|
- Industry best practice (Kubernetes, HashiCorp Vault)
|
|
|
|
### ADR-002: Per-Pack Virtual Environments
|
|
**Decision:** Each pack gets isolated Python virtual environment.
|
|
|
|
**Rationale:**
|
|
- Prevents dependency conflicts between packs
|
|
- Allows different Python versions per pack
|
|
- Protects against system Python upgrades
|
|
- Standard practice in Python ecosystem
|
|
|
|
### ADR-003: Filesystem-Based Log Storage
|
|
**Decision:** Store logs in filesystem, not database (already implemented).
|
|
|
|
**Rationale:**
|
|
- Database not designed for large blob storage
|
|
- Filesystem handles large files efficiently
|
|
- Easy to implement rotation and compression
|
|
- Can stream logs without loading entire file
|
|
|
|
---
|
|
|
|
## Implementation Priority
|
|
|
|
### Immediate (Before Any Production Use)
|
|
1. **P5: Secret Security Fix** - BLOCKING all other work
|
|
2. **P4: Dependency Isolation** - Required for production
|
|
3. **P6: Log Size Limits** - Worker stability
|
|
|
|
### Short-Term (v1.0 Release)
|
|
4. **P3: Language Ecosystem Support** - Pack ecosystem growth
|
|
|
|
### Medium-Term (v1.1+)
|
|
5. Multiple runtime versions
|
|
6. Container-based runtimes
|
|
7. Log streaming API
|
|
8. Pack marketplace
|
|
|
|
---
|
|
|
|
## Files Created
|
|
|
|
1. `work-summary/StackStorm-Pitfalls-Analysis.md` (659 lines)
|
|
- Comprehensive analysis of 6 potential pitfalls
|
|
- 3 critical issues identified and documented
|
|
- Testing checklist and success criteria
|
|
|
|
2. `work-summary/Pitfall-Resolution-Plan.md` (1,153 lines)
|
|
- Detailed implementation tasks for each issue
|
|
- Code examples and acceptance criteria
|
|
- Estimated effort and dependencies
|
|
- Testing strategy and rollout plan
|
|
|
|
3. `work-summary/TODO.md` (updated)
|
|
- Added Phase 0: StackStorm Pitfall Remediation
|
|
- Marked as CRITICAL priority
|
|
- Blocks production deployment
|
|
|
|
---
|
|
|
|
## Code Analysis Performed
|
|
|
|
### Files Reviewed
|
|
- `crates/common/src/models.rs` - Data models
|
|
- `crates/worker/src/executor.rs` - Action execution orchestration
|
|
- `crates/worker/src/runtime/python.rs` - Python runtime implementation
|
|
- `crates/worker/src/runtime/shell.rs` - Shell runtime implementation
|
|
- `crates/worker/src/runtime/mod.rs` - Runtime abstraction
|
|
- `crates/worker/src/secrets.rs` - Secret management
|
|
- `crates/worker/src/artifacts.rs` - Log storage
|
|
- `migrations/20240101000004_create_runtime_worker.sql` - Database schema
|
|
|
|
### Security Audit Findings
|
|
|
|
**CRITICAL: Secret Exposure**
|
|
```rust
|
|
// Line 142 in secrets.rs - INSECURE!
|
|
pub fn prepare_secret_env(&self, secrets: &HashMap<String, String>)
|
|
-> HashMap<String, String> {
|
|
secrets
|
|
.iter()
|
|
.map(|(name, value)| {
|
|
let env_name = format!("SECRET_{}", name.to_uppercase().replace('-', "_"));
|
|
(env_name, value.clone()) // ← EXPOSED IN PROCESS ENV!
|
|
})
|
|
.collect()
|
|
}
|
|
|
|
// Line 228 in executor.rs - INSECURE!
|
|
env.extend(secret_env); // ← Secrets added to environment
|
|
```
|
|
|
|
**CRITICAL: Dependency Coupling**
|
|
```rust
|
|
// Line 19 in python.rs - PROBLEMATIC!
|
|
pub fn new() -> Self {
|
|
Self {
|
|
python_path: PathBuf::from("python3"), // ← SYSTEM PYTHON!
|
|
work_dir: PathBuf::from("/tmp/attune/actions"),
|
|
}
|
|
}
|
|
```
|
|
|
|
**MODERATE: Log Buffer Issue**
|
|
```rust
|
|
// Line 122+ in python.rs - COULD OOM!
|
|
let output = execution_future.await?;
|
|
let stdout = String::from_utf8_lossy(&output.stdout).to_string(); // ← ALL in memory!
|
|
let stderr = String::from_utf8_lossy(&output.stderr).to_string();
|
|
```
|
|
|
|
---
|
|
|
|
## Recommendations
|
|
|
|
### Immediate Actions Required
|
|
|
|
1. **STOP any production deployment** until P5 (secret security) and P7 (execution ordering) are fixed
|
|
2. **Begin Phase 1 implementation** (policy ordering + secret passing fixes) immediately
|
|
3. **Schedule security review** after Phase 1 completion
|
|
4. **Create GitHub issues** for each critical problem
|
|
5. **Update project timeline** to include 4.5-6.5 week remediation period
|
|
|
|
### Development Workflow Changes
|
|
|
|
1. **Add security tests to CI/CD pipeline**
|
|
- Verify secrets not in environment
|
|
- Verify secrets not in command line
|
|
- Verify pack isolation
|
|
|
|
2. **Require security review for:**
|
|
- Any changes to secret handling
|
|
- Any changes to runtime execution
|
|
- Any changes to pack installation
|
|
|
|
3. **Add to PR checklist:**
|
|
- [ ] No secrets passed via environment variables
|
|
- [ ] No unbounded memory usage for logs
|
|
- [ ] Pack dependencies isolated
|
|
|
|
---
|
|
|
|
## Testing Strategy Defined
|
|
|
|
### Correctness Tests (Must Pass Before v1.0)
|
|
- [ ] Three executions with limit=1 execute in FIFO order
|
|
- [ ] Queue maintains order with 1000 concurrent enqueues
|
|
- [ ] Worker completion notification releases queue slot
|
|
- [ ] Queue stats API returns accurate counts
|
|
- [ ] No race conditions under concurrent load
|
|
|
|
### Security Tests (Must Pass Before v1.0)
|
|
- [ ] Secrets not visible in `ps auxwwe`
|
|
- [ ] Secrets not readable from `/proc/{pid}/environ`
|
|
- [ ] Actions can successfully read secrets from stdin
|
|
- [ ] Python wrapper script reads secrets securely
|
|
- [ ] Shell wrapper script reads secrets securely
|
|
|
|
### Isolation Tests
|
|
- [ ] Each pack gets isolated venv
|
|
- [ ] Installing pack A dependencies doesn't affect pack B
|
|
- [ ] Upgrading system Python doesn't break existing packs
|
|
- [ ] Multiple Python versions can coexist
|
|
|
|
### Stability Tests
|
|
- [ ] Logs truncated at configured size limit
|
|
- [ ] Worker doesn't OOM on large output
|
|
- [ ] Multiple log files created for rotation
|
|
- [ ] Old logs cleaned up per retention policy
|
|
|
|
---
|
|
|
|
## Documentation Created
|
|
|
|
### Analysis Documents
|
|
1. **StackStorm-Pitfalls-Analysis.md**
|
|
- Executive summary
|
|
- Issue-by-issue analysis
|
|
- Recommendations and priorities
|
|
- Architecture decision records
|
|
- Testing checklist
|
|
|
|
2. **Pitfall-Resolution-Plan.md**
|
|
- Phase-by-phase implementation plan
|
|
- Detailed task breakdown with code examples
|
|
- Effort estimates and dependencies
|
|
- Testing strategy
|
|
- Rollout plan
|
|
- Risk mitigation
|
|
|
|
### Updates to Existing Docs
|
|
3. **TODO.md**
|
|
- New Phase 0 for critical remediation
|
|
- Added P7 (Policy Execution Ordering) as P0 priority
|
|
- Priority markers (P0, P1, P2)
|
|
- Updated estimated timelines (now 4.5-6.5 weeks)
|
|
- Completion criteria
|
|
|
|
---
|
|
|
|
## Next Session Tasks
|
|
|
|
### Before Starting Implementation
|
|
1. **Team review of analysis documents**
|
|
- Discuss findings and priorities
|
|
- Approve implementation plan
|
|
- Assign task owners
|
|
|
|
2. **Create GitHub issues**
|
|
- Issue for P5 (secret security)
|
|
- Issue for P4 (dependency isolation)
|
|
- Issue for P6 (log limits)
|
|
- Issue for P3 (language support)
|
|
|
|
3. **Update project milestones**
|
|
- Add Phase 0 completion milestone
|
|
- Adjust v1.0 release date (+3-5 weeks)
|
|
- Schedule security audit
|
|
|
|
### Implementation Start
|
|
4. **Begin Phase 1A: Policy Execution Ordering**
|
|
- Create feature branch: `fix/policy-execution-ordering`
|
|
- Implement ExecutionQueueManager
|
|
- Integrate with PolicyEnforcer
|
|
- Add completion notification system
|
|
- Add queue monitoring API
|
|
|
|
5. **Begin Phase 1B: Secret Security Fix**
|
|
- Create feature branch: `fix/secure-secret-passing`
|
|
- Implement stdin-based secret injection
|
|
- Update Python runtime
|
|
- Update Shell runtime
|
|
- Add security tests
|
|
|
|
---
|
|
|
|
## Metrics
|
|
|
|
- **Lines of Analysis Written:** 2,500+ lines
|
|
- **Issues Identified:** 7 total (2 avoided, 2 moderate, 3 critical)
|
|
- **Files Analyzed:** 10 source files (added executor services)
|
|
- **Security Vulnerabilities Found:** 1 critical (secret exposure)
|
|
- **Correctness Issues Found:** 1 critical (execution ordering)
|
|
- **Architectural Issues Found:** 3 (dependency hell, log limits, language support)
|
|
- **Estimated Remediation Time:** 22-32 days (updated from 18-26)
|
|
- **Documentation Files Created:** 2 new, 1 updated
|
|
|
|
---
|
|
|
|
## Session Outcome
|
|
|
|
✅ **Objectives Achieved:**
|
|
- Comprehensive analysis of StackStorm pitfalls completed
|
|
- Critical security vulnerability identified and documented
|
|
- Detailed remediation plan created with concrete tasks
|
|
- Implementation priorities established
|
|
- No implementation work started (as requested)
|
|
|
|
⚠️ **Critical Findings:**
|
|
- **BLOCKING ISSUE #1:** Policy execution ordering violates FIFO expectations and workflow dependencies
|
|
- **BLOCKING ISSUE #2:** Secret exposure vulnerability must be fixed before production
|
|
- **HIGH PRIORITY:** Dependency isolation required for stable operation
|
|
- **MODERATE:** Log size limits needed for worker stability
|
|
|
|
📋 **Ready for Next Phase:**
|
|
- Analysis documents ready for team review
|
|
- Implementation plan provides clear roadmap
|
|
- All tasks have acceptance criteria and time estimates
|
|
- Testing strategy defined and comprehensive
|
|
|
|
---
|
|
|
|
**Status:** Analysis Complete - Ready for Implementation Planning
|
|
**Blocking Issues:** 2 critical security/architectural issues identified
|
|
**Recommended Next Action:** Team review and approval, then begin Phase 1 (Security Fix)
|
|
|
|
---
|
|
|
|
## Key Takeaways
|
|
|
|
1. **Good News:** Rust's type system already prevents 2 major StackStorm pitfalls
|
|
2. **Bad News:** 2 critical issues found - security vulnerability + correctness bug
|
|
3. **Action Required:** 4.5-6.5 week remediation period needed before production
|
|
4. **Silver Lining:** Issues caught early, before production deployment
|
|
5. **Lesson Learned:** Security AND correctness review should be part of initial design phase
|
|
6. **User Contribution:** P7 (execution ordering) discovered by user input during analysis
|
|
|
|
---
|
|
|
|
**End of Session Summary** |