Workflow Performance Analysis Session Summary

Date: 2025-01-17
Session Focus: Performance analysis of workflow list iteration patterns
Status: Analysis Complete - Implementation Required


Session Overview

Conducted comprehensive performance analysis of Attune's workflow execution engine in response to concerns about quadratic/exponential computation issues similar to those found in StackStorm/Orquesta's workflow implementation. The analysis focused on list iteration patterns (with-items) and identified critical performance bottlenecks.


Key Findings

1. Critical Issue: O(N*C) Context Cloning

Location: crates/executor/src/workflow/task_executor.rs:453-581

Problem Identified:

  • When processing lists with with-items, each item receives a full clone of the WorkflowContext
  • WorkflowContext contains task_results HashMap that grows with every completed task
  • As workflow progresses: more task results → larger context → more expensive clones

Complexity Analysis:

```
For N items with M completed tasks:
- Item 1: clone context with M results
- Item 2: clone context with M results
- ...
- Item N: clone context with M results

Total cost: O(N * M * avg_result_size) = O(N * C), where C is the context size
```

Real-World Impact:

  • Workflow with 100 completed tasks (1MB context)
  • Processing 1000-item list
  • Result: 1GB of cloning operations

This is the same issue documented in StackStorm/Orquesta.
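The blow-up can be reproduced with a minimal stand-in for the current pattern (the struct and function names here are hypothetical, not the real executor types):

```rust
use std::collections::HashMap;

// Simplified stand-in for WorkflowContext (illustrative only; the real
// struct lives in crates/executor/src/workflow/task_executor.rs).
#[derive(Clone)]
struct Context {
    task_results: HashMap<String, String>,
}

// Mimics the current with-items pattern: every item gets a full clone.
// Returns the total number of entries copied, which grows as N * M.
fn process_items(ctx: &Context, items: &[&str]) -> usize {
    let mut copied = 0;
    for _item in items {
        let per_item_ctx = ctx.clone(); // full deep copy of M results
        copied += per_item_ctx.task_results.len();
    }
    copied
}
```

With M = 100 results and N = 1000 items, 100,000 entries are deep-copied even though no item ever mutates the shared results.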


2. Secondary Issues Identified

Mutex Lock Contention (Medium Priority)

  • on_task_completion() locks/unlocks mutex once per next task
  • Creates contention with high concurrent task completions
  • Not quadratic, but reduces parallelism

Polling Loop Overhead (Low Priority)

  • Main execution loop polls every 100ms
  • Could use event-driven approach with channels
  • Adds 0-100ms latency to completion

Per-Task State Persistence (Medium Priority)

  • Database write after every task completion
  • High concurrent tasks = DB contention
  • Should batch state updates

3. Graph Algorithm Analysis

Good News: Core graph algorithms are optimal

  • compute_inbound_edges(): O(N * T) - optimal for graph construction
  • next_tasks(): O(1) - optimal lookup
  • get_inbound_tasks(): O(1) - optimal lookup

Where N = tasks, T = avg transitions per task (1-3)

No quadratic algorithms found in core workflow logic.
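As a sketch of why these lookups stay O(1): the edge map is built once up front in O(N * T), after which `next_tasks()` reduces to a single hash lookup (types and names here are illustrative, not the actual executor API):

```rust
use std::collections::HashMap;

// Hypothetical sketch: precompute outbound edges once, then look up in O(1).
struct TaskGraph {
    outbound: HashMap<String, Vec<String>>,
}

impl TaskGraph {
    // O(N * T) construction over the (from, to) transition list.
    fn new(transitions: &[(&str, &str)]) -> Self {
        let mut outbound: HashMap<String, Vec<String>> = HashMap::new();
        for (from, to) in transitions {
            outbound.entry(from.to_string()).or_default().push(to.to_string());
        }
        TaskGraph { outbound }
    }

    // O(1) average-case lookup; no rescanning of the transition list.
    fn next_tasks(&self, task: &str) -> &[String] {
        self.outbound.get(task).map(Vec::as_slice).unwrap_or(&[])
    }
}
```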


Priority 1: Arc-Based Context Sharing (CRITICAL)

Current Structure:

```rust
#[derive(Clone)]
pub struct WorkflowContext {
    variables: HashMap<String, JsonValue>,    // Cloned every iteration
    task_results: HashMap<String, JsonValue>, // Grows with workflow
    parameters: JsonValue,                    // Cloned every iteration
    // ...
}
```

Proposed Solution:

```rust
#[derive(Clone)]
pub struct WorkflowContext {
    // Shared immutable data (cheap to clone via Arc)
    parameters: Arc<JsonValue>,
    task_results: Arc<DashMap<String, JsonValue>>, // Thread-safe, shared
    variables: Arc<DashMap<String, JsonValue>>,

    // Per-item data (minimal, cheap to clone)
    current_item: Option<JsonValue>,
    current_index: Option<usize>,
}
```

Benefits:

  • Clone operation becomes O(1) - just increment Arc reference counts
  • Zero memory duplication
  • DashMap provides concurrent access via fine-grained sharded locking (no single global lock)
  • Expected improvement: 10-100x for large contexts
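A dependency-free sketch of the same idea, with `Arc<HashMap>` standing in for the proposed `Arc<DashMap>` (names here are illustrative, not the real executor types):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Shared data sits behind Arc; only the small per-item fields are owned.
#[derive(Clone)]
struct SharedContext {
    task_results: Arc<HashMap<String, String>>, // clone = refcount bump
    current_index: Option<usize>,               // per-item, cheap to copy
}

// Each item now gets an O(1) clone: no map entries are copied.
fn clone_per_item(ctx: &SharedContext, n_items: usize) -> Vec<SharedContext> {
    (0..n_items)
        .map(|i| {
            let mut per_item = ctx.clone();
            per_item.current_index = Some(i);
            per_item
        })
        .collect()
}
```

Cloning 1000 per-item contexts leaves exactly one copy of the results map in memory, shared by 1001 `Arc` handles.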

Priority 2: Batch Lock Acquisitions (MEDIUM)

Current Pattern:

```rust
for next_task_name in next_tasks {
    let mut state = state.lock().await; // Lock acquired per iteration
    // Process task
} // Lock dropped each iteration
```

Proposed Pattern:

```rust
let mut state = state.lock().await; // Lock acquired once
for next_task_name in next_tasks {
    // Process all tasks under the single lock
}
// Lock dropped once
```

Benefits:

  • Reduced lock contention
  • Better cache locality
  • Simpler consistency model

Priority 3: Event-Driven Execution (LOW)

Replace polling loop with channels for task completion events.

Benefits:

  • Eliminates 100ms polling delay
  • More idiomatic async Rust
  • Better resource utilization
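A minimal sketch of the channel-based approach using `std::sync::mpsc` and threads (the real async executor would presumably use tokio channels instead; names here are hypothetical):

```rust
use std::sync::mpsc;
use std::thread;

// Tasks notify completion over a channel, so the coordinator wakes
// immediately instead of polling every 100ms.
fn run_event_driven(task_ids: Vec<u32>) -> Vec<u32> {
    let (tx, rx) = mpsc::channel();
    for id in task_ids {
        let tx = tx.clone();
        thread::spawn(move || {
            // ... execute the task ...
            tx.send(id).expect("coordinator receiver alive");
        });
    }
    drop(tx); // close the channel once all worker senders are done

    // Blocks until events arrive; no busy loop, no added latency.
    let mut completed: Vec<u32> = rx.iter().collect();
    completed.sort_unstable();
    completed
}
```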

Priority 4: Batch State Persistence (MEDIUM)

Implement write-behind cache for workflow state.

Benefits:

  • Reduces DB writes by 10-100x
  • Better performance under load

Trade-offs:

  • Potential data loss on crash (needs recovery logic)
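A sketch of what the write-behind buffer could look like; `flush` here simply hands back the batch, where a real implementation would issue a single DB transaction (and, per the trade-off above, needs crash-recovery logic for unflushed updates). All names are illustrative:

```rust
// Updates accumulate in memory and are flushed as one batch.
struct WriteBehindBuffer {
    pending: Vec<(String, String)>, // (task_id, serialized state)
    flush_threshold: usize,
}

impl WriteBehindBuffer {
    fn new(flush_threshold: usize) -> Self {
        WriteBehindBuffer { pending: Vec::new(), flush_threshold }
    }

    // Record an update; returns the batch to persist once it is full.
    fn record(&mut self, task_id: &str, state: &str) -> Option<Vec<(String, String)>> {
        self.pending.push((task_id.to_string(), state.to_string()));
        if self.pending.len() >= self.flush_threshold {
            Some(self.flush())
        } else {
            None
        }
    }

    // Drain the buffer; the caller writes the batch in one transaction.
    fn flush(&mut self) -> Vec<(String, String)> {
        std::mem::take(&mut self.pending)
    }
}
```

With a threshold of 100, a burst of task completions turns 100 DB writes into one.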

Documentation Created

Primary Document

docs/performance-analysis-workflow-lists.md (414 lines)

  • Executive summary of findings
  • Detailed analysis of each performance issue
  • Algorithmic complexity breakdown
  • Complete solution proposals with code examples
  • Benchmarking recommendations
  • Implementation priority matrix
  • References and resources

Updated Files

work-summary/TODO.md

  • Added Phase 0.6: Workflow List Iteration Performance (P0 - BLOCKING)
  • 10 implementation tasks
  • Estimated 5-7 days
  • Marked as blocking for production

Benchmarking Strategy

Proposed Benchmarks

  1. Context Cloning Benchmark

    • Measure clone time with varying numbers of task results (0, 10, 50, 100, 500)
    • Measure memory allocation
    • Compare before/after Arc implementation
  2. with-items Scaling Benchmark

    • Test with 10, 100, 1000, 10000 items
    • Measure total execution time
    • Measure peak memory usage
    • Verify linear scaling after optimization
  3. Lock Contention Benchmark

    • Simulate 100 concurrent task completions
    • Measure throughput before/after batching
    • Verify reduced lock acquisition count
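The context-cloning benchmark could start as a plain `Instant`-based micro-measurement before graduating to a proper harness such as criterion (sizes and names here are illustrative):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Time a full context clone at several result-map sizes; this is only
// the shape of the baseline measurement, not a rigorous benchmark.
fn clone_times(sizes: &[usize]) -> Vec<(usize, Duration)> {
    sizes
        .iter()
        .map(|&m| {
            let ctx: HashMap<String, String> = (0..m)
                .map(|i| (format!("task{}", i), "result".repeat(16)))
                .collect();
            let start = Instant::now();
            let cloned = ctx.clone();
            let elapsed = start.elapsed();
            assert_eq!(cloned.len(), m); // keep the clone from being optimized out
            (m, elapsed)
        })
        .collect()
}
```

After the Arc refactor, the same harness applied to the shared context should show clone time flat in `m` rather than growing linearly.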

Implementation Plan

Phase 1: Preparation (1 day)

  • Set up benchmark infrastructure
  • Create baseline measurements
  • Document current performance characteristics

Phase 2: Core Refactoring (3-4 days)

  • Implement Arc-based WorkflowContext
  • Update all context access patterns
  • Refactor execute_with_items to use shared context
  • Update template rendering for Arc-wrapped data

Phase 3: Secondary Optimizations (1-2 days)

  • Batch lock acquisitions in on_task_completion
  • Add basic state persistence batching

Phase 4: Validation (1 day)

  • Run all benchmarks
  • Verify 10-100x improvement
  • Run full test suite
  • Validate memory usage is constant

Risk Assessment

Low Risk

  • Arc-based refactoring is well-understood pattern in Rust
  • DashMap is battle-tested crate
  • Changes are internal to executor service
  • No API changes required

Potential Issues

  • Need careful handling of mutable context operations
  • DashMap API slightly different from HashMap
  • Template rendering may need adjustment for Arc-wrapped values

Mitigation

  • Comprehensive test coverage
  • Benchmark validation at each step
  • Can use Cow<> as intermediate step if needed

Success Criteria

  1. Context clone is O(1) regardless of task_results size
  2. Memory usage remains constant across list iterations
  3. 1000-item list with 100 prior tasks completes efficiently
  4. All existing tests continue to pass
  5. Benchmarks show 10-100x improvement
  6. No breaking changes to workflow YAML syntax

Next Steps

  1. Immediate: Get stakeholder approval for implementation approach
  2. Week 1: Implement Arc-based context and batch locking
  3. Week 2: Benchmarking, validation, and documentation
  4. Deploy: Performance improvements to staging environment
  5. Monitor: Validate improvements with real-world workflows


Technical Debt Identified

  1. Polling loop: Should be event-driven (future improvement)
  2. State persistence: Should be batched (medium priority)
  3. Error handling: Some .unwrap() calls in with-items execution
  4. Observability: Need metrics for queue depth, execution time

Lessons Learned

What Went Well

  • Comprehensive analysis prevented premature optimization
  • Clear identification of root cause (context cloning)
  • Found optimal solution (Arc) before implementing
  • Good documentation of problem and solution

What Could Be Better

  • Should have had benchmarks from the start
  • Performance testing should be part of CI/CD

Recommendations for Future

  • Add performance regression tests to CI
  • Set performance budgets for critical paths
  • Profile realistic workflows periodically
  • Document performance characteristics in code

Conclusion

The analysis successfully identified the critical performance bottleneck in workflow list iteration. The issue is not an algorithmic problem (no quadratic algorithms), but rather a practical implementation issue with context cloning creating O(N*C) behavior.

The solution is straightforward and low-risk: Use Arc<> to share immutable context data instead of cloning. This is a well-established Rust pattern that will provide dramatic performance improvements (10-100x) with minimal code changes.

This work is marked as P0 (BLOCKING) because it's the same issue that caused problems in StackStorm/Orquesta, and we should fix it before it impacts production users.


Status: Analysis Complete - Ready for Implementation
Blocking: Production deployment
Estimated Implementation Time: 5-7 days
Expected Performance Gain: 10-100x for workflows with large contexts and lists