Workflow Performance Analysis Session Summary
Date: 2025-01-17
Session Focus: Performance analysis of workflow list iteration patterns
Status: ✅ Analysis Complete - Implementation Required
Session Overview
Conducted comprehensive performance analysis of Attune's workflow execution engine in response to concerns about quadratic/exponential computation issues similar to those found in StackStorm/Orquesta's workflow implementation. The analysis focused on list iteration patterns (with-items) and identified critical performance bottlenecks.
Key Findings
1. Critical Issue: O(N*C) Context Cloning
Location: crates/executor/src/workflow/task_executor.rs:453-581
Problem Identified:
- When processing lists with `with-items`, each item receives a full clone of the `WorkflowContext`
- `WorkflowContext` contains a `task_results` HashMap that grows with every completed task
- As the workflow progresses: more task results → larger context → more expensive clones
Complexity Analysis:
For N items with M completed tasks:
- Item 1: Clone context with M results
- Item 2: Clone context with M results
- ...
- Item N: Clone context with M results
Total cost: O(N * M * avg_result_size)
Real-World Impact:
- Workflow with 100 completed tasks (1MB context)
- Processing 1000-item list
- Result: 1GB of cloning operations
This is the same issue documented in StackStorm/Orquesta.
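The cloning cost is easy to reproduce in isolation. The sketch below uses a simplified stand-in for the context (hypothetical field names, not the actual `WorkflowContext`) to show how per-item clones multiply the accumulated result data:

```rust
use std::collections::HashMap;

// Simplified stand-in for WorkflowContext (hypothetical fields).
#[derive(Clone)]
struct Context {
    task_results: HashMap<String, String>, // grows with every completed task
}

fn main() {
    let mut ctx = Context { task_results: HashMap::new() };
    // Simulate M = 100 completed tasks, each with a small result payload.
    for m in 0..100 {
        ctx.task_results.insert(format!("task_{m}"), "result".repeat(10));
    }
    // Processing N = 1000 items: each iteration deep-copies all M results.
    let clones: Vec<Context> = (0..1000).map(|_| ctx.clone()).collect();
    // Total copied entries: N * M = 100_000 — this is the O(N * M) cost.
    assert_eq!(clones.len() * clones[0].task_results.len(), 100_000);
}
```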
2. Secondary Issues Identified
Mutex Lock Contention (Medium Priority)
- `on_task_completion()` locks/unlocks the mutex once per next task
- Creates contention with high concurrent task completions
- Not quadratic, but reduces parallelism
Polling Loop Overhead (Low Priority)
- Main execution loop polls every 100ms
- Could use event-driven approach with channels
- Adds 0-100ms latency to completion
Per-Task State Persistence (Medium Priority)
- Database write after every task completion
- High concurrent tasks = DB contention
- Should batch state updates
3. Graph Algorithm Analysis
Good News: Core graph algorithms are optimal
- `compute_inbound_edges()`: O(N * T) - optimal for graph construction
- `next_tasks()`: O(1) - optimal lookup
- `get_inbound_tasks()`: O(1) - optimal lookup
Where N = tasks, T = avg transitions per task (1-3)
No quadratic algorithms found in core workflow logic.
Recommended Solutions
Priority 1: Arc-Based Context Sharing (CRITICAL)
Current Structure:
```rust
#[derive(Clone)]
pub struct WorkflowContext {
    variables: HashMap<String, JsonValue>,    // Cloned every iteration
    task_results: HashMap<String, JsonValue>, // Grows with workflow
    parameters: JsonValue,                    // Cloned every iteration
    // ...
}
```
Proposed Solution:
```rust
#[derive(Clone)]
pub struct WorkflowContext {
    // Shared immutable data (cheap to clone via Arc)
    parameters: Arc<JsonValue>,
    task_results: Arc<DashMap<String, JsonValue>>, // Thread-safe, shared
    variables: Arc<DashMap<String, JsonValue>>,
    // Per-item data (minimal, cheap to clone)
    current_item: Option<JsonValue>,
    current_index: Option<usize>,
}
```
Benefits:
- Clone operation becomes O(1) - just increment Arc reference counts
- Zero memory duplication
- DashMap provides concurrent access without locks
- Expected improvement: 10-100x for large contexts
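The cheap-clone behavior can be verified in isolation. This sketch uses `Arc<HashMap>` rather than `DashMap` so it compiles with std alone, and uses illustrative field names rather than the real struct:

```rust
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Clone)]
struct SharedContext {
    // Shared immutable data: cloning only bumps the reference count.
    task_results: Arc<HashMap<String, String>>,
    // Per-item data stays small and cheap to copy.
    current_index: Option<usize>,
}

fn main() {
    let mut results = HashMap::new();
    results.insert("task_a".to_string(), "ok".to_string());
    let base = SharedContext { task_results: Arc::new(results), current_index: None };

    // 1000 "item" clones all point at the same underlying map.
    let items: Vec<SharedContext> = (0..1000)
        .map(|i| SharedContext { current_index: Some(i), ..base.clone() })
        .collect();

    assert_eq!(Arc::strong_count(&base.task_results), 1001); // base + 1000 items
    assert!(Arc::ptr_eq(&base.task_results, &items[42].task_results)); // no duplication
}
```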
Priority 2: Batch Lock Acquisitions (MEDIUM)
Current Pattern:
```rust
for next_task_name in next_tasks {
    let mut state = state.lock().await; // Lock per iteration
    // Process task
} // Lock dropped
```
Proposed Pattern:
```rust
let mut state = state.lock().await; // Lock once
for next_task_name in next_tasks {
    // Process all tasks under single lock
}
// Lock dropped once
```
Benefits:
- Reduced lock contention
- Better cache locality
- Simpler consistency model
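A runnable version of the batched pattern, with `std::sync::Mutex` standing in for the executor's async mutex (the state is simplified to a counter for illustration):

```rust
use std::sync::{Arc, Mutex};

fn main() {
    // Shared scheduler state, simplified to a count of scheduled tasks.
    let state = Arc::new(Mutex::new(0u32));
    let next_tasks = vec!["a", "b", "c", "d"];

    // Batched pattern: one lock acquisition covers the whole batch,
    // instead of locking and unlocking once per next task.
    {
        let mut scheduled = state.lock().unwrap();
        for _task in &next_tasks {
            *scheduled += 1; // process each task under the single lock
        }
    } // lock dropped once here

    assert_eq!(*state.lock().unwrap(), 4);
}
```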
Priority 3: Event-Driven Execution (LOW)
Replace polling loop with channels for task completion events.
Benefits:
- Eliminates 100ms polling delay
- More idiomatic async Rust
- Better resource utilization
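The event-driven shape might look like this sketch, with `std::sync::mpsc` standing in for whatever async channel the executor ultimately uses (the event type and names are illustrative, not the real API):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical completion event, standing in for the executor's real type.
struct TaskCompleted {
    task_name: String,
}

fn main() {
    let (tx, rx) = mpsc::channel::<TaskCompleted>();

    // Workers send events instead of the main loop polling shared state every 100ms.
    for i in 0..3 {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(TaskCompleted { task_name: format!("task_{i}") }).unwrap();
        });
    }
    drop(tx); // close the channel once all senders are gone

    // The scheduler wakes exactly when an event arrives — no polling latency.
    let completed: Vec<String> = rx.into_iter().map(|e| e.task_name).collect();
    assert_eq!(completed.len(), 3);
}
```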
Priority 4: Batch State Persistence (MEDIUM)
Implement write-behind cache for workflow state.
Benefits:
- Reduces DB writes by 10-100x
- Better performance under load
Trade-offs:
- Potential data loss on crash (needs recovery logic)
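A minimal write-behind sketch, assuming a flush threshold (the real implementation would write the batch to the database and add a time-based flush plus crash-recovery logic; here a flush just increments a counter):

```rust
// Minimal write-behind buffer sketch (hypothetical API).
struct WriteBehind {
    pending: Vec<String>,
    batch_size: usize,
    flushes: usize,
}

impl WriteBehind {
    fn new(batch_size: usize) -> Self {
        Self { pending: Vec::new(), batch_size, flushes: 0 }
    }

    fn record(&mut self, state: String) {
        self.pending.push(state);
        if self.pending.len() >= self.batch_size {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.pending.is_empty() {
            // One DB round-trip for the whole batch.
            self.flushes += 1;
            self.pending.clear();
        }
    }
}

fn main() {
    let mut wb = WriteBehind::new(50);
    for i in 0..1000 {
        wb.record(format!("task_{i} done"));
    }
    wb.flush(); // drain any tail on shutdown
    assert_eq!(wb.flushes, 20); // 1000 writes collapsed into 20 flushes
}
```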
Documentation Created
Primary Document
docs/performance-analysis-workflow-lists.md (414 lines)
- Executive summary of findings
- Detailed analysis of each performance issue
- Algorithmic complexity breakdown
- Complete solution proposals with code examples
- Benchmarking recommendations
- Implementation priority matrix
- References and resources
Updated Files
work-summary/TODO.md
- Added Phase 0.6: Workflow List Iteration Performance (P0 - BLOCKING)
- 10 implementation tasks
- Estimated 5-7 days
- Marked as blocking for production
Benchmarking Strategy
Proposed Benchmarks
1. Context Cloning Benchmark
   - Measure clone time with varying numbers of task results (0, 10, 50, 100, 500)
   - Measure memory allocation
   - Compare before/after Arc implementation
2. with-items Scaling Benchmark
   - Test with 10, 100, 1000, 10000 items
   - Measure total execution time
   - Measure peak memory usage
   - Verify linear scaling after optimization
3. Lock Contention Benchmark
   - Simulate 100 concurrent task completions
   - Measure throughput before/after batching
   - Verify reduced lock acquisition count
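The context-cloning benchmark can start as a rough probe before wiring up criterion. This sketch measures clone time across the proposed result counts (payload size is an assumption, roughly 1 KiB per result):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Rough clone-cost probe across growing result counts — not a criterion
// benchmark, just a sanity check of the O(M) clone behavior.
fn main() {
    for m in [0usize, 10, 50, 100, 500] {
        let mut results: HashMap<String, String> = HashMap::new();
        for i in 0..m {
            results.insert(format!("task_{i}"), "x".repeat(1024)); // ~1 KiB each
        }
        let start = Instant::now();
        for _ in 0..100 {
            let _copy = results.clone(); // the cost under test
        }
        println!("M={m:3}: {:?} for 100 clones", start.elapsed());
    }
}
```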
Implementation Plan
Phase 1: Preparation (1 day)
- Set up benchmark infrastructure
- Create baseline measurements
- Document current performance characteristics
Phase 2: Core Refactoring (3-4 days)
- Implement Arc-based WorkflowContext
- Update all context access patterns
- Refactor execute_with_items to use shared context
- Update template rendering for Arc-wrapped data
Phase 3: Secondary Optimizations (1-2 days)
- Batch lock acquisitions in on_task_completion
- Add basic state persistence batching
Phase 4: Validation (1 day)
- Run all benchmarks
- Verify 10-100x improvement
- Run full test suite
- Validate memory usage is constant
Risk Assessment
Low Risk
- Arc-based refactoring is well-understood pattern in Rust
- DashMap is battle-tested crate
- Changes are internal to executor service
- No API changes required
Potential Issues
- Need careful handling of mutable context operations
- DashMap API slightly different from HashMap
- Template rendering may need adjustment for Arc-wrapped values
Mitigation
- Comprehensive test coverage
- Benchmark validation at each step
- Can use Cow<> as intermediate step if needed
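The `Cow<>` fallback could look like this sketch: borrow the shared data read-only in the common case and pay for a clone only when an item actually needs to mutate it (function and field names are illustrative):

```rust
use std::borrow::Cow;
use std::collections::HashMap;

// Borrow the shared map read-only; clone only if a write is needed.
fn render_item<'a>(
    shared: &'a HashMap<String, String>,
    override_kv: Option<(&str, &str)>,
) -> Cow<'a, HashMap<String, String>> {
    match override_kv {
        None => Cow::Borrowed(shared), // common case: zero copies
        Some((k, v)) => {
            let mut owned = shared.clone(); // pay the clone only when mutating
            owned.insert(k.to_string(), v.to_string());
            Cow::Owned(owned)
        }
    }
}

fn main() {
    let mut shared = HashMap::new();
    shared.insert("base".to_string(), "1".to_string());

    assert!(matches!(render_item(&shared, None), Cow::Borrowed(_)));
    let modified = render_item(&shared, Some(("item", "2")));
    assert_eq!(modified.len(), 2);
}
```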
Success Criteria
- ✅ Context clone is O(1) regardless of task_results size
- ✅ Memory usage remains constant across list iterations
- ✅ 1000-item list with 100 prior tasks completes efficiently
- ✅ All existing tests continue to pass
- ✅ Benchmarks show 10-100x improvement
- ✅ No breaking changes to workflow YAML syntax
Next Steps
- Immediate: Get stakeholder approval for implementation approach
- Week 1: Implement Arc-based context and batch locking
- Week 2: Benchmarking, validation, and documentation
- Deploy: Performance improvements to staging environment
- Monitor: Validate improvements with real-world workflows
References
- Analysis Document: docs/performance-analysis-workflow-lists.md
- TODO Entry: Phase 0.6 in work-summary/TODO.md
- StackStorm Issue: Similar O(N*C) issue documented in Orquesta
- Rust Arc: https://doc.rust-lang.org/std/sync/struct.Arc.html
- DashMap: https://docs.rs/dashmap/latest/dashmap/
Technical Debt Identified
- Polling loop: Should be event-driven (future improvement)
- State persistence: Should be batched (medium priority)
- Error handling: Some .unwrap() calls in with-items execution
- Observability: Need metrics for queue depth, execution time
Lessons Learned
What Went Well
- Comprehensive analysis prevented premature optimization
- Clear identification of root cause (context cloning)
- Found optimal solution (Arc) before implementing
- Good documentation of problem and solution
What Could Be Better
- Should have had benchmarks from the start
- Performance testing should be part of CI/CD
Recommendations for Future
- Add performance regression tests to CI
- Set performance budgets for critical paths
- Profile realistic workflows periodically
- Document performance characteristics in code
Conclusion
The analysis successfully identified the critical performance bottleneck in workflow list iteration. The issue is not an algorithmic problem (no quadratic algorithms), but rather a practical implementation issue with context cloning creating O(N*C) behavior.
The solution is straightforward and low-risk: Use Arc<> to share immutable context data instead of cloning. This is a well-established Rust pattern that will provide dramatic performance improvements (10-100x) with minimal code changes.
This work is marked as P0 (BLOCKING) because it's the same issue that caused problems in StackStorm/Orquesta, and we should fix it before it impacts production users.
Status: ✅ Analysis Complete - Ready for Implementation
Blocking: Production deployment
Estimated Implementation Time: 5-7 days
Expected Performance Gain: 10-100x for workflows with large contexts and lists