Workflow Performance Analysis Session Summary
Date: 2025-01-17
Session Focus: Performance analysis of workflow list iteration patterns
Status: ✅ Analysis Complete - Implementation Required
Session Overview
Conducted comprehensive performance analysis of Attune's workflow execution engine in response to concerns about quadratic/exponential computation issues similar to those found in StackStorm/Orquesta's workflow implementation. The analysis focused on list iteration patterns (with-items) and identified critical performance bottlenecks.
Key Findings
1. Critical Issue: O(N*C) Context Cloning
Location: crates/executor/src/workflow/task_executor.rs:453-581
Problem Identified:
- When processing lists with `with-items`, each item receives a full clone of the `WorkflowContext`
- `WorkflowContext` contains a `task_results` HashMap that grows with every completed task
- As the workflow progresses: more task results → larger context → more expensive clones
Complexity Analysis:
For N items with M completed tasks:
- Item 1: Clone context with M results
- Item 2: Clone context with M results
- ...
- Item N: Clone context with M results
Total cost: O(N * M * avg_result_size)
Real-World Impact:
- Workflow with 100 completed tasks (1MB context)
- Processing 1000-item list
- Result: 1GB of cloning operations
This is the same issue documented in StackStorm/Orquesta.
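The cloning cost is easy to reproduce in isolation. The sketch below uses a simplified stand-in for the context (hypothetical field names, not the actual `WorkflowContext`) to show how per-item clones multiply the accumulated result data:

```rust
use std::collections::HashMap;

// Simplified stand-in for WorkflowContext (hypothetical fields).
#[derive(Clone)]
struct Context {
    task_results: HashMap<String, String>, // grows with every completed task
}

fn main() {
    let mut ctx = Context { task_results: HashMap::new() };
    // Simulate M = 100 completed tasks, each with a small result payload.
    for m in 0..100 {
        ctx.task_results.insert(format!("task_{m}"), "result".repeat(10));
    }
    // Processing N = 1000 items: each iteration deep-copies all M results.
    let clones: Vec<Context> = (0..1000).map(|_| ctx.clone()).collect();
    // Total copied entries: N * M = 100_000 — this is the O(N * M) cost.
    assert_eq!(clones.len() * clones[0].task_results.len(), 100_000);
}
```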
2. Secondary Issues Identified
Mutex Lock Contention (Medium Priority)
- `on_task_completion()` locks/unlocks the mutex once per next task
- Creates contention with high concurrent task completions
- Not quadratic, but reduces parallelism
Polling Loop Overhead (Low Priority)
- Main execution loop polls every 100ms
- Could use event-driven approach with channels
- Adds 0-100ms latency to completion
Per-Task State Persistence (Medium Priority)
- Database write after every task completion
- High concurrent tasks = DB contention
- Should batch state updates
3. Graph Algorithm Analysis
Good News: Core graph algorithms are optimal
- `compute_inbound_edges()`: O(N * T) - optimal for graph construction
- `next_tasks()`: O(1) - optimal lookup
- `get_inbound_tasks()`: O(1) - optimal lookup
Where N = tasks, T = avg transitions per task (1-3)
No quadratic algorithms found in core workflow logic.
Recommended Solutions
Priority 1: Arc-Based Context Sharing (CRITICAL)
Current Structure:
```rust
#[derive(Clone)]
pub struct WorkflowContext {
    variables: HashMap<String, JsonValue>,    // Cloned every iteration
    task_results: HashMap<String, JsonValue>, // Grows with workflow
    parameters: JsonValue,                    // Cloned every iteration
    // ...
}
```
Proposed Solution:
```rust
#[derive(Clone)]
pub struct WorkflowContext {
    // Shared immutable data (cheap to clone via Arc)
    parameters: Arc<JsonValue>,
    task_results: Arc<DashMap<String, JsonValue>>, // Thread-safe, shared
    variables: Arc<DashMap<String, JsonValue>>,
    // Per-item data (minimal, cheap to clone)
    current_item: Option<JsonValue>,
    current_index: Option<usize>,
}
```
Benefits:
- Clone operation becomes O(1) - just increment Arc reference counts
- Zero memory duplication
- DashMap provides concurrent access without locks
- Expected improvement: 10-100x for large contexts
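The cheap-clone behavior can be verified in isolation. This sketch uses `Arc<HashMap>` rather than `DashMap` so it compiles with std alone, and uses illustrative field names rather than the real struct:

```rust
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Clone)]
struct SharedContext {
    // Shared immutable data: cloning only bumps the reference count.
    task_results: Arc<HashMap<String, String>>,
    // Per-item data stays small and cheap to copy.
    current_index: Option<usize>,
}

fn main() {
    let mut results = HashMap::new();
    results.insert("task_a".to_string(), "ok".to_string());
    let base = SharedContext { task_results: Arc::new(results), current_index: None };

    // 1000 "item" clones all point at the same underlying map.
    let items: Vec<SharedContext> = (0..1000)
        .map(|i| SharedContext { current_index: Some(i), ..base.clone() })
        .collect();

    assert_eq!(Arc::strong_count(&base.task_results), 1001); // base + 1000 items
    assert!(Arc::ptr_eq(&base.task_results, &items[42].task_results)); // no duplication
}
```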
Priority 2: Batch Lock Acquisitions (MEDIUM)
Current Pattern:
```rust
for next_task_name in next_tasks {
    let mut state = state.lock().await; // Lock per iteration
    // Process task
} // Lock dropped
```
Proposed Pattern:
```rust
let mut state = state.lock().await; // Lock once
for next_task_name in next_tasks {
    // Process all tasks under single lock
}
// Lock dropped once
```
Benefits:
- Reduced lock contention
- Better cache locality
- Simpler consistency model
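A runnable version of the batched pattern, with `std::sync::Mutex` standing in for the executor's async mutex (the state is simplified to a counter for illustration):

```rust
use std::sync::{Arc, Mutex};

fn main() {
    // Shared scheduler state, simplified to a count of scheduled tasks.
    let state = Arc::new(Mutex::new(0u32));
    let next_tasks = vec!["a", "b", "c", "d"];

    // Batched pattern: one lock acquisition covers the whole batch,
    // instead of locking and unlocking once per next task.
    {
        let mut scheduled = state.lock().unwrap();
        for _task in &next_tasks {
            *scheduled += 1; // process each task under the single lock
        }
    } // lock dropped once here

    assert_eq!(*state.lock().unwrap(), 4);
}
```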
Priority 3: Event-Driven Execution (LOW)
Replace polling loop with channels for task completion events.
Benefits:
- Eliminates 100ms polling delay
- More idiomatic async Rust
- Better resource utilization
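The event-driven shape might look like this sketch, with `std::sync::mpsc` standing in for whatever async channel the executor ultimately uses (the event type and names are illustrative, not the real API):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical completion event, standing in for the executor's real type.
struct TaskCompleted {
    task_name: String,
}

fn main() {
    let (tx, rx) = mpsc::channel::<TaskCompleted>();

    // Workers send events instead of the main loop polling shared state every 100ms.
    for i in 0..3 {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(TaskCompleted { task_name: format!("task_{i}") }).unwrap();
        });
    }
    drop(tx); // close the channel once all senders are gone

    // The scheduler wakes exactly when an event arrives — no polling latency.
    let completed: Vec<String> = rx.into_iter().map(|e| e.task_name).collect();
    assert_eq!(completed.len(), 3);
}
```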
Priority 4: Batch State Persistence (MEDIUM)
Implement write-behind cache for workflow state.
Benefits:
- Reduces DB writes by 10-100x
- Better performance under load
Trade-offs:
- Potential data loss on crash (needs recovery logic)
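A minimal write-behind sketch, assuming a flush threshold (the real implementation would write the batch to the database and add a time-based flush plus crash-recovery logic; here a flush just increments a counter):

```rust
// Minimal write-behind buffer sketch (hypothetical API).
struct WriteBehind {
    pending: Vec<String>,
    batch_size: usize,
    flushes: usize,
}

impl WriteBehind {
    fn new(batch_size: usize) -> Self {
        Self { pending: Vec::new(), batch_size, flushes: 0 }
    }

    fn record(&mut self, state: String) {
        self.pending.push(state);
        if self.pending.len() >= self.batch_size {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.pending.is_empty() {
            // One DB round-trip for the whole batch.
            self.flushes += 1;
            self.pending.clear();
        }
    }
}

fn main() {
    let mut wb = WriteBehind::new(50);
    for i in 0..1000 {
        wb.record(format!("task_{i} done"));
    }
    wb.flush(); // drain any tail on shutdown
    assert_eq!(wb.flushes, 20); // 1000 writes collapsed into 20 flushes
}
```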
Documentation Created
Primary Document
docs/performance-analysis-workflow-lists.md (414 lines)
- Executive summary of findings
- Detailed analysis of each performance issue
- Algorithmic complexity breakdown
- Complete solution proposals with code examples
- Benchmarking recommendations
- Implementation priority matrix
- References and resources
Updated Files
work-summary/TODO.md
- Added Phase 0.6: Workflow List Iteration Performance (P0 - BLOCKING)
- 10 implementation tasks
- Estimated 5-7 days
- Marked as blocking for production
Benchmarking Strategy
Proposed Benchmarks
1. Context Cloning Benchmark
   - Measure clone time with varying numbers of task results (0, 10, 50, 100, 500)
   - Measure memory allocation
   - Compare before/after Arc implementation
2. with-items Scaling Benchmark
   - Test with 10, 100, 1000, 10000 items
   - Measure total execution time
   - Measure peak memory usage
   - Verify linear scaling after optimization
3. Lock Contention Benchmark
   - Simulate 100 concurrent task completions
   - Measure throughput before/after batching
   - Verify reduced lock acquisition count
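The context-cloning benchmark can start as a rough probe before wiring up criterion. This sketch measures clone time across the proposed result counts (payload size is an assumption, roughly 1 KiB per result):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Rough clone-cost probe across growing result counts — not a criterion
// benchmark, just a sanity check of the O(M) clone behavior.
fn main() {
    for m in [0usize, 10, 50, 100, 500] {
        let mut results: HashMap<String, String> = HashMap::new();
        for i in 0..m {
            results.insert(format!("task_{i}"), "x".repeat(1024)); // ~1 KiB each
        }
        let start = Instant::now();
        for _ in 0..100 {
            let _copy = results.clone(); // the cost under test
        }
        println!("M={m:3}: {:?} for 100 clones", start.elapsed());
    }
}
```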
Implementation Plan
Phase 1: Preparation (1 day)
- Set up benchmark infrastructure
- Create baseline measurements
- Document current performance characteristics
Phase 2: Core Refactoring (3-4 days)
- Implement Arc-based WorkflowContext
- Update all context access patterns
- Refactor execute_with_items to use shared context
- Update template rendering for Arc-wrapped data
Phase 3: Secondary Optimizations (1-2 days)
- Batch lock acquisitions in on_task_completion
- Add basic state persistence batching
Phase 4: Validation (1 day)
- Run all benchmarks
- Verify 10-100x improvement
- Run full test suite
- Validate memory usage is constant
Risk Assessment
Low Risk
- Arc-based refactoring is well-understood pattern in Rust
- DashMap is battle-tested crate
- Changes are internal to executor service
- No API changes required
Potential Issues
- Need careful handling of mutable context operations
- DashMap API slightly different from HashMap
- Template rendering may need adjustment for Arc-wrapped values
Mitigation
- Comprehensive test coverage
- Benchmark validation at each step
- Can use Cow<> as intermediate step if needed
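The `Cow<>` fallback could look like this sketch: borrow the shared data read-only in the common case and pay for a clone only when an item actually needs to mutate it (function and field names are illustrative):

```rust
use std::borrow::Cow;
use std::collections::HashMap;

// Borrow the shared map read-only; clone only if a write is needed.
fn render_item<'a>(
    shared: &'a HashMap<String, String>,
    override_kv: Option<(&str, &str)>,
) -> Cow<'a, HashMap<String, String>> {
    match override_kv {
        None => Cow::Borrowed(shared), // common case: zero copies
        Some((k, v)) => {
            let mut owned = shared.clone(); // pay the clone only when mutating
            owned.insert(k.to_string(), v.to_string());
            Cow::Owned(owned)
        }
    }
}

fn main() {
    let mut shared = HashMap::new();
    shared.insert("base".to_string(), "1".to_string());

    assert!(matches!(render_item(&shared, None), Cow::Borrowed(_)));
    let modified = render_item(&shared, Some(("item", "2")));
    assert_eq!(modified.len(), 2);
}
```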
Success Criteria
- ✅ Context clone is O(1) regardless of task_results size
- ✅ Memory usage remains constant across list iterations
- ✅ 1000-item list with 100 prior tasks completes efficiently
- ✅ All existing tests continue to pass
- ✅ Benchmarks show 10-100x improvement
- ✅ No breaking changes to workflow YAML syntax
Next Steps
- Immediate: Get stakeholder approval for implementation approach
- Week 1: Implement Arc-based context and batch locking
- Week 2: Benchmarking, validation, and documentation
- Deploy: Performance improvements to staging environment
- Monitor: Validate improvements with real-world workflows
References
- Analysis Document: docs/performance-analysis-workflow-lists.md
- TODO Entry: Phase 0.6 in work-summary/TODO.md
- StackStorm Issue: Similar O(N*C) issue documented in Orquesta
- Rust Arc: https://doc.rust-lang.org/std/sync/struct.Arc.html
- DashMap: https://docs.rs/dashmap/latest/dashmap/
Technical Debt Identified
- Polling loop: Should be event-driven (future improvement)
- State persistence: Should be batched (medium priority)
- Error handling: Some .unwrap() calls in with-items execution
- Observability: Need metrics for queue depth, execution time
Lessons Learned
What Went Well
- Comprehensive analysis prevented premature optimization
- Clear identification of root cause (context cloning)
- Found optimal solution (Arc) before implementing
- Good documentation of problem and solution
What Could Be Better
- Should have had benchmarks from the start
- Performance testing should be part of CI/CD
Recommendations for Future
- Add performance regression tests to CI
- Set performance budgets for critical paths
- Profile realistic workflows periodically
- Document performance characteristics in code
Conclusion
The analysis successfully identified the critical performance bottleneck in workflow list iteration. The issue is not an algorithmic problem (no quadratic algorithms), but rather a practical implementation issue with context cloning creating O(N*C) behavior.
The solution is straightforward and low-risk: Use Arc<> to share immutable context data instead of cloning. This is a well-established Rust pattern that will provide dramatic performance improvements (10-100x) with minimal code changes.
This work is marked as P0 (BLOCKING) because it's the same issue that caused problems in StackStorm/Orquesta, and we should fix it before it impacts production users.
Status: ✅ Analysis Complete - Ready for Implementation
Blocking: Production deployment
Estimated Implementation Time: 5-7 days
Expected Performance Gain: 10-100x for workflows with large contexts and lists