# Workflow Performance Analysis Session Summary

**Date**: 2025-01-17
**Session Focus**: Performance analysis of workflow list iteration patterns
**Status**: ✅ Analysis Complete - Implementation Required

---

## Session Overview

Conducted a comprehensive performance analysis of Attune's workflow execution engine in response to concerns about quadratic/exponential computation issues similar to those found in StackStorm/Orquesta's workflow implementation. The analysis focused on list iteration patterns (`with-items`) and identified critical performance bottlenecks.

---

## Key Findings

### 1. Critical Issue: O(N*C) Context Cloning

**Location**: `crates/executor/src/workflow/task_executor.rs:453-581`

**Problem Identified**:
- When processing lists with `with-items`, each item receives a full clone of the `WorkflowContext`
- `WorkflowContext` contains a `task_results` HashMap that grows with every completed task
- As the workflow progresses: more task results → larger context → more expensive clones

**Complexity Analysis**:
```
For N items with M completed tasks:
- Item 1: Clone context with M results
- Item 2: Clone context with M results
- ...
- Item N: Clone context with M results

Total cost: O(N * M * avg_result_size)
```

**Real-World Impact**:
- Workflow with 100 completed tasks (1MB context)
- Processing a 1000-item list
- Result: **1GB of cloning operations**

This is the same issue documented in StackStorm/Orquesta.
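
The pattern above can be sketched as follows. This is a minimal, dependency-free model of the problem, not the actual Attune code: `WorkflowContext` is reduced to its `task_results` map, and `run_with_items` stands in for the per-item loop in `task_executor.rs`.

```rust
use std::collections::HashMap;

// Minimal stand-in for WorkflowContext; field names are illustrative,
// not the real Attune definitions.
#[derive(Clone)]
struct WorkflowContext {
    task_results: HashMap<String, String>,
}

// Sketch of the current with-items pattern: every item deep-clones the
// whole context, so total copying scales with items * accumulated results.
fn run_with_items(ctx: &WorkflowContext, items: &[String]) -> usize {
    let mut bytes_cloned = 0;
    for _item in items {
        let per_item_ctx = ctx.clone(); // O(M) copy per item
        bytes_cloned += per_item_ctx
            .task_results
            .values()
            .map(|v| v.len())
            .sum::<usize>();
    }
    bytes_cloned // total copying is O(N * M * avg_result_size)
}
```

With 100 prior results of 1KB each and 50 items, this clones 5MB; scaling both dimensions up reproduces the 1GB figure above.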

---

### 2. Secondary Issues Identified

#### Mutex Lock Contention (Medium Priority)
- `on_task_completion()` locks/unlocks the mutex once per next task
- Creates contention under high concurrent task completions
- Not quadratic, but reduces parallelism

#### Polling Loop Overhead (Low Priority)
- Main execution loop polls every 100ms
- Could use an event-driven approach with channels
- Adds 0-100ms latency to completion

#### Per-Task State Persistence (Medium Priority)
- Database write after every task completion
- High concurrent task counts cause DB contention
- Should batch state updates

---

### 3. Graph Algorithm Analysis

**Good News**: Core graph algorithms are optimal
- `compute_inbound_edges()`: O(N * T) - optimal for graph construction
- `next_tasks()`: O(1) - optimal lookup
- `get_inbound_tasks()`: O(1) - optimal lookup

Where N = tasks, T = avg transitions per task (1-3)

**No quadratic algorithms found in core workflow logic.**
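
To illustrate why the graph construction stays O(N * T), here is a sketch of an inbound-edge computation. The function name matches the one in the analysis, but the types and implementation are hypothetical: one pass over every task's outgoing transitions builds a reverse-edge map, after which inbound lookups are O(1) HashMap hits.

```rust
use std::collections::HashMap;

// One pass over N tasks with T transitions each: O(N * T) total work.
fn compute_inbound_edges(
    transitions: &HashMap<String, Vec<String>>, // task -> next tasks
) -> HashMap<String, Vec<String>> {
    let mut inbound: HashMap<String, Vec<String>> = HashMap::new();
    for (task, nexts) in transitions {
        for next in nexts {
            inbound.entry(next.clone()).or_default().push(task.clone());
        }
    }
    inbound
}
```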

---

## Recommended Solutions

### Priority 1: Arc-Based Context Sharing (CRITICAL)

**Current Structure**:
```rust
#[derive(Clone)]
pub struct WorkflowContext {
    variables: HashMap<String, JsonValue>,    // Cloned every iteration
    task_results: HashMap<String, JsonValue>, // Grows with workflow
    parameters: JsonValue,                    // Cloned every iteration
    // ...
}
```

**Proposed Solution**:
```rust
#[derive(Clone)]
pub struct WorkflowContext {
    // Shared immutable data (cheap to clone via Arc)
    parameters: Arc<JsonValue>,
    task_results: Arc<DashMap<String, JsonValue>>, // Thread-safe, shared
    variables: Arc<DashMap<String, JsonValue>>,

    // Per-item data (minimal, cheap to clone)
    current_item: Option<JsonValue>,
    current_index: Option<usize>,
}
```

**Benefits**:
- Clone becomes O(1) - just increment Arc reference counts
- Zero memory duplication
- DashMap provides concurrent access without a single global lock
- **Expected improvement**: 10-100x for large contexts
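
A small sketch of why the clone becomes O(1), using only std types (`Arc<HashMap>` stands in for `Arc<DashMap>`, and `String` for `JsonValue`, to keep the example dependency-free). Cloning copies two `Arc` pointers and two small `Option`s; the underlying maps are never duplicated.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical simplified layout mirroring the proposed WorkflowContext.
#[derive(Clone)]
struct SharedContext {
    parameters: Arc<String>,
    task_results: Arc<HashMap<String, String>>,
    current_item: Option<String>,
    current_index: Option<usize>,
}
```

Cloning this 1000 times for a 1000-item list bumps reference counts instead of copying data, so memory stays constant regardless of how many task results have accumulated.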

---

### Priority 2: Batch Lock Acquisitions (MEDIUM)

**Current Pattern**:
```rust
for next_task_name in next_tasks {
    let mut state = state.lock().await; // Lock per iteration
    // Process task
} // Lock dropped each iteration
```

**Proposed Pattern**:
```rust
let mut state = state.lock().await; // Lock once
for next_task_name in next_tasks {
    // Process all tasks under a single lock
}
// Lock dropped once
```

**Benefits**:
- Reduced lock contention
- Better cache locality
- Simpler consistency model
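
The difference can be made concrete by counting acquisitions. This sketch uses a plain `std::sync::Mutex` in place of the executor's async mutex, with a hypothetical `State` type:

```rust
use std::sync::Mutex;

struct State {
    processed: Vec<String>,
}

// Proposed: one acquisition covers the whole batch.
fn process_batched(state: &Mutex<State>, next_tasks: &[String]) -> usize {
    let mut guard = state.lock().unwrap();
    for task in next_tasks {
        guard.processed.push(task.clone());
    }
    1 // acquisitions
}

// Current: one acquisition per next task.
fn process_per_task(state: &Mutex<State>, next_tasks: &[String]) -> usize {
    let mut acquisitions = 0;
    for task in next_tasks {
        let mut guard = state.lock().unwrap();
        acquisitions += 1;
        guard.processed.push(task.clone());
    }
    acquisitions
}
```

For 10 next tasks the current pattern acquires the lock 10 times, the batched pattern once; under concurrent completions that gap is what drives contention.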

---

### Priority 3: Event-Driven Execution (LOW)

Replace the polling loop with channels for task completion events.

**Benefits**:
- Eliminates the 100ms polling delay
- More idiomatic async Rust
- Better resource utilization
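
A sketch of the event-driven shape, using std threads and `std::sync::mpsc` rather than the tokio tasks and channels the real executor would use: the loop blocks on the channel instead of waking every 100ms to poll.

```rust
use std::sync::mpsc;
use std::thread;

// Spawns `total_tasks` workers and collects their completion events.
fn run_until_done(total_tasks: usize) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();

    for i in 0..total_tasks {
        let tx = tx.clone();
        thread::spawn(move || {
            // Simulated task completion event.
            tx.send(format!("task{i}")).unwrap();
        });
    }
    drop(tx); // drop the original sender so the channel closes when workers finish

    // No polling: recv() wakes exactly when a completion event arrives,
    // and the iterator ends once every sender is dropped.
    let mut completed: Vec<String> = rx.iter().collect();
    completed.sort();
    completed
}
```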

---

### Priority 4: Batch State Persistence (MEDIUM)

Implement a write-behind cache for workflow state.

**Benefits**:
- Reduces DB writes by 10-100x
- Better performance under load

**Trade-offs**:
- Potential data loss on crash (needs recovery logic)
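
A minimal write-behind sketch, with a counter standing in for the real database layer (the type and method names are hypothetical): completions accumulate in memory and are persisted once per batch instead of once per task.

```rust
struct WriteBehindState {
    pending: Vec<String>,
    batch_size: usize,
    db_writes: usize, // stand-in for actual DB round-trips
}

impl WriteBehindState {
    fn new(batch_size: usize) -> Self {
        Self { pending: Vec::new(), batch_size, db_writes: 0 }
    }

    fn record_completion(&mut self, task: &str) {
        self.pending.push(task.to_string());
        if self.pending.len() >= self.batch_size {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.pending.is_empty() {
            self.db_writes += 1; // one DB round-trip for the whole batch
            self.pending.clear();
        }
    }
}
```

Anything still in `pending` at crash time is lost, which is exactly the recovery-logic trade-off noted above.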

---

## Documentation Created

### Primary Document
**`docs/performance-analysis-workflow-lists.md`** (414 lines)
- Executive summary of findings
- Detailed analysis of each performance issue
- Algorithmic complexity breakdown
- Complete solution proposals with code examples
- Benchmarking recommendations
- Implementation priority matrix
- References and resources

### Updated Files
**`work-summary/TODO.md`**
- Added Phase 0.6: Workflow List Iteration Performance (P0 - BLOCKING)
- 10 implementation tasks
- Estimated 5-7 days
- Marked as blocking for production

---

## Benchmarking Strategy

### Proposed Benchmarks

1. **Context Cloning Benchmark**
   - Measure clone time with varying numbers of task results (0, 10, 50, 100, 500)
   - Measure memory allocation
   - Compare before/after the Arc implementation

2. **with-items Scaling Benchmark**
   - Test with 10, 100, 1000, 10000 items
   - Measure total execution time
   - Measure peak memory usage
   - Verify linear scaling after optimization

3. **Lock Contention Benchmark**
   - Simulate 100 concurrent task completions
   - Measure throughput before/after batching
   - Verify reduced lock acquisition count

---

## Implementation Plan

### Phase 1: Preparation (1 day)
- [ ] Set up benchmark infrastructure
- [ ] Create baseline measurements
- [ ] Document current performance characteristics

### Phase 2: Core Refactoring (3-4 days)
- [ ] Implement Arc-based WorkflowContext
- [ ] Update all context access patterns
- [ ] Refactor execute_with_items to use the shared context
- [ ] Update template rendering for Arc-wrapped data

### Phase 3: Secondary Optimizations (1-2 days)
- [ ] Batch lock acquisitions in on_task_completion
- [ ] Add basic state persistence batching

### Phase 4: Validation (1 day)
- [ ] Run all benchmarks
- [ ] Verify 10-100x improvement
- [ ] Run full test suite
- [ ] Validate that memory usage is constant

---

## Risk Assessment

### Low Risk
- Arc-based refactoring is a well-understood pattern in Rust
- DashMap is a battle-tested crate
- Changes are internal to the executor service
- No API changes required

### Potential Issues
- Mutable context operations need careful handling
- DashMap's API differs slightly from HashMap's
- Template rendering may need adjustment for Arc-wrapped values

### Mitigation
- Comprehensive test coverage
- Benchmark validation at each step
- Cow<> can be used as an intermediate step if needed

---

## Success Criteria

1. ✅ Context clone is O(1) regardless of task_results size
2. ✅ Memory usage remains constant across list iterations
3. ✅ A 1000-item list with 100 prior tasks completes efficiently
4. ✅ All existing tests continue to pass
5. ✅ Benchmarks show a 10-100x improvement
6. ✅ No breaking changes to workflow YAML syntax

---

## Next Steps

1. **Immediate**: Get stakeholder approval for the implementation approach
2. **Week 1**: Implement Arc-based context and batch locking
3. **Week 2**: Benchmarking, validation, and documentation
4. **Deploy**: Performance improvements to staging environment
5. **Monitor**: Validate improvements with real-world workflows

---

## References

- **Analysis Document**: `docs/performance-analysis-workflow-lists.md`
- **TODO Entry**: Phase 0.6 in `work-summary/TODO.md`
- **StackStorm Issue**: Similar O(N*C) issue documented in Orquesta
- **Rust Arc**: https://doc.rust-lang.org/std/sync/struct.Arc.html
- **DashMap**: https://docs.rs/dashmap/latest/dashmap/

---

## Technical Debt Identified

1. **Polling loop**: Should be event-driven (future improvement)
2. **State persistence**: Should be batched (medium priority)
3. **Error handling**: Some .unwrap() calls in with-items execution
4. **Observability**: Need metrics for queue depth and execution time

---

## Lessons Learned

### What Went Well
- Comprehensive analysis prevented premature optimization
- Clear identification of the root cause (context cloning)
- Found the optimal solution (Arc) before implementing
- Good documentation of the problem and solution

### What Could Be Better
- Should have had benchmarks from the start
- Performance testing should be part of CI/CD

### Recommendations for the Future
- Add performance regression tests to CI
- Set performance budgets for critical paths
- Profile realistic workflows periodically
- Document performance characteristics in code

---

## Conclusion

The analysis successfully identified the critical performance bottleneck in workflow list iteration. The issue is **not** an algorithmic problem (no quadratic algorithms were found), but rather a **practical implementation issue**: context cloning creates O(N*C) behavior.

**The solution is straightforward and low-risk**: use Arc<> to share immutable context data instead of cloning it. This is a well-established Rust pattern that will provide dramatic performance improvements (10-100x) with minimal code changes.

**This work is marked as P0 (BLOCKING)** because it is the same issue that caused problems in StackStorm/Orquesta, and we should fix it before it impacts production users.

---

**Status**: ✅ Analysis Complete - Ready for Implementation
**Blocking**: Production deployment
**Estimated Implementation Time**: 5-7 days
**Expected Performance Gain**: 10-100x for workflows with large contexts and lists