# Workflow List Iteration Performance Analysis
## Executive Summary
This document analyzes potential performance bottlenecks in Attune's workflow execution engine, particularly focusing on list iteration patterns (`with-items`). The analysis reveals that while the current implementation avoids truly quadratic algorithms, there is a **significant performance issue with context cloning** that creates O(N*C) complexity where N is the number of items and C is the context size.
**Key Finding**: As workflows progress and accumulate task results, the context grows linearly. When iterating over large lists, each item clones the entire context, so the total cloning and allocation overhead scales with both the item count and the accumulated context size.
---
## 1. Performance Issues Identified
### 1.1 Critical Issue: Context Cloning in with-items (O(N*C))
**Location**: `crates/executor/src/workflow/task_executor.rs:453-581`
**The Problem**:
```rust
for (item_idx, item) in batch.iter().enumerate() {
    let global_idx = batch_idx * batch_size + item_idx;
    let permit = semaphore.clone().acquire_owned().await.unwrap();
    let executor = TaskExecutor::new(self.db_pool.clone(), self.mq.clone());
    let task = task.clone();
    let mut item_context = context.clone(); // ⚠️ EXPENSIVE CLONE
    item_context.set_current_item(item.clone(), global_idx);
    // ...
}
```
**Why This is Problematic**:
The `WorkflowContext` structure (in `crates/executor/src/workflow/context.rs`) contains:
- `variables: HashMap<String, JsonValue>` - grows with workflow progress
- `task_results: HashMap<String, JsonValue>` - **grows with each completed task**
- `parameters: JsonValue` - fixed size
- `system: HashMap<String, JsonValue>` - fixed size
When processing a list of N items in a workflow that has already completed M tasks:
- Item 1 clones context with M task results
- Item 2 clones context with M task results
- ...
- Item N clones context with M task results
**Total cloning cost**: O(N * M * avg_result_size)
**Worst Case Scenario**:
1. Long-running workflow with 100 completed tasks
2. Each task produces 10KB of result data
3. Context size = 1MB
4. Processing 1000 items = 1000 * 1MB = **1GB of cloning operations**
This is similar to the performance issue documented in StackStorm/Orquesta.
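To make the worst-case arithmetic above concrete, here is a minimal std-only sketch (the 100-task, 10 KB, 1000-item figures are the hypothetical scenario above, not measured values) that tallies the bytes copied when every item clones the full context:

```rust
use std::collections::HashMap;

fn main() {
    // Hypothetical context: 100 completed tasks, each with a 10 KB result.
    let m_tasks = 100;
    let result_size = 10 * 1024; // bytes
    let mut context: HashMap<String, Vec<u8>> = HashMap::new();
    for i in 0..m_tasks {
        context.insert(format!("task_{i}"), vec![0u8; result_size]);
    }

    // Processing N items clones the whole context once per item.
    let n_items = 1000;
    let bytes_per_clone: usize = context.values().map(|v| v.len()).sum();
    let total = n_items * bytes_per_clone;

    // prints: 1024000000 bytes cloned (~976 MB)
    println!("{} bytes cloned (~{} MB)", total, total / (1024 * 1024));
}
```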
---
### 1.2 Secondary Issue: Mutex Lock Pattern in Task Completion
**Location**: `crates/executor/src/workflow/coordinator.rs:593-659`
**The Problem**:
```rust
for next_task_name in next_tasks {
    let mut state = state.lock().await; // ⚠️ Lock acquired per task
    if state.scheduled_tasks.contains(&next_task_name) { /* ... */ }
    // ...
    // Lock dropped at end of loop iteration
}
```
**Why This Could Be Better**:
- The mutex is locked/unlocked once per next task
- With high concurrency (many tasks completing simultaneously), this creates lock contention
- Not quadratic, but reduces parallelism
**Impact**: Medium - mainly affects workflows with high fan-out/fan-in patterns
---
### 1.3 Minor Issue: Polling Loop Overhead
**Location**: `crates/executor/src/workflow/coordinator.rs:384-456`
**The Pattern**:
```rust
loop {
    // Collect scheduled tasks
    let tasks_to_spawn = { /* ... */ };

    // Spawn tasks
    for task_name in tasks_to_spawn { /* ... */ }

    tokio::time::sleep(tokio::time::Duration::from_millis(100)).await; // ⚠️ Polling

    // Check completion
    if state.executing_tasks.is_empty() && state.scheduled_tasks.is_empty() {
        break;
    }
}
```
**Why This Could Be Better**:
- Polls every 100ms even when no work is scheduled
- Could use event-driven approach with channels or condition variables
- Adds 0-100ms latency to workflow completion
**Impact**: Low - acceptable for most workflows, but could be optimized
---
### 1.4 Minor Issue: State Persistence Per Task
**Location**: `crates/executor/src/workflow/coordinator.rs:580-581`
**The Pattern**:
```rust
// After each task completes:
coordinator
    .update_workflow_execution_state(workflow_execution_id, &state)
    .await?;
```
**Why This Could Be Better**:
- Database write after every task completion
- With 1000 concurrent tasks completing, this is 1000 sequential DB writes
- Creates database contention
**Impact**: Medium - could batch state updates or use write-behind caching
---
## 2. Algorithmic Complexity Analysis
### Graph Operations
| Operation | Current Complexity | Optimal | Assessment |
|-----------|-------------------|---------|------------|
| `compute_inbound_edges()` | O(N * T) | O(N * T) | ✅ Optimal |
| `next_tasks()` | O(1) | O(1) | ✅ Optimal |
| `get_inbound_tasks()` | O(1) | O(1) | ✅ Optimal |
Where:
- N = number of tasks in workflow
- T = average transitions per task (typically 1-3)
### Execution Operations
| Operation | Current Complexity | Issue |
|-----------|-------------------|-------|
| `execute_with_items()` | O(N * C) | ❌ Context cloning |
| `on_task_completion()` | O(T) with mutex | ⚠️ Lock contention |
| `execute()` main loop | O(T) per poll | ⚠️ Polling overhead |
Where:
- N = number of items in list
- C = size of workflow context
- T = number of next tasks
---
## 3. Recommended Solutions
### 3.1 High Priority: Optimize Context Cloning
**Solution 1: Use Arc for Immutable Data**
```rust
#[derive(Clone)]
pub struct WorkflowContext {
    // Shared immutable data
    parameters: Arc<JsonValue>,
    task_results: Arc<DashMap<String, JsonValue>>, // Thread-safe concurrent map
    variables: Arc<DashMap<String, JsonValue>>,

    // Per-item data (cheap to clone)
    current_item: Option<JsonValue>,
    current_index: Option<usize>,
}
```
**Benefits**:
- Cloning only increments reference counts - O(1)
- Shared data accessed via Arc - no copies
- DashMap allows concurrent reads without locks
**Trade-offs**:
- Slightly more complex API
- Need to handle mutability carefully
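As a sanity check on the claim that Arc cloning is O(1), here is a std-only sketch (a plain `HashMap` stands in for `DashMap`, which is an external crate): cloning the `Arc` a thousand times copies pointers and bumps a refcount, never the payload.

```rust
use std::collections::HashMap;
use std::sync::Arc;

fn main() {
    // Hypothetical immutable snapshot of task results, shared via Arc.
    let mut results = HashMap::new();
    results.insert("task_0".to_string(), vec![0u8; 10 * 1024]);
    let shared = Arc::new(results);

    // Cloning the Arc copies a pointer and bumps a refcount; the 10 KB
    // payload is never duplicated.
    let clones: Vec<_> = (0..1000).map(|_| Arc::clone(&shared)).collect();
    assert_eq!(Arc::strong_count(&shared), 1001);

    // All clones point at the same allocation.
    assert!(Arc::ptr_eq(&clones[0], &clones[999]));
    println!("refcount = {}", Arc::strong_count(&shared)); // prints: refcount = 1001
}
```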
---
**Solution 2: Context-on-Demand (Lazy Evaluation)**
```rust
pub struct ItemContext {
    parent_context: Arc<WorkflowContext>,
    item: JsonValue,
    index: usize,
}

impl ItemContext {
    fn resolve(&self, expr: &str) -> ContextResult<JsonValue> {
        // Check item-specific data first
        if expr.starts_with("item") || expr == "index" {
            // Return item data
        } else {
            // Delegate to parent context
            self.parent_context.resolve(expr)
        }
    }
}
```
**Benefits**:
- Zero cloning - parent context is shared via Arc
- Item-specific data is minimal (just item + index)
- Clear separation of concerns
**Trade-offs**:
- More complex implementation
- Need to refactor template rendering
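A self-contained sketch of the delegation idea, with deliberately simplified types (`String` values instead of `JsonValue`, and a hypothetical `resolve` returning `Option`): each item gets a tiny wrapper that answers item-local lookups itself and forwards everything else to the shared parent.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical simplified parent context: just task results.
struct WorkflowContext {
    task_results: HashMap<String, String>,
}

struct ItemContext {
    parent: Arc<WorkflowContext>,
    item: String,
    index: usize,
}

impl ItemContext {
    // Item-local lookups are answered here; everything else delegates
    // to the shared parent, so no per-item cloning is needed.
    fn resolve(&self, expr: &str) -> Option<String> {
        match expr {
            "item" => Some(self.item.clone()),
            "index" => Some(self.index.to_string()),
            _ => self.parent.task_results.get(expr).cloned(),
        }
    }
}

fn main() {
    let mut task_results = HashMap::new();
    task_results.insert("task_a".to_string(), "done".to_string());
    let parent = Arc::new(WorkflowContext { task_results });

    let items = ["alpha", "beta"];
    for (index, item) in items.iter().enumerate() {
        let ctx = ItemContext {
            parent: Arc::clone(&parent), // refcount bump, no data copy
            item: item.to_string(),
            index,
        };
        assert_eq!(ctx.resolve("item").as_deref(), Some(*item));
        assert_eq!(ctx.resolve("task_a").as_deref(), Some("done"));
    }
    println!("ok"); // prints: ok
}
```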
---
### 3.2 Medium Priority: Optimize Task Completion Locking
**Solution: Batch Lock Acquisitions**
```rust
async fn on_task_completion(...) -> Result<()> {
    let next_tasks = graph.next_tasks(&completed_task, success);

    // Acquire lock once, process all next tasks
    let mut state = state.lock().await;
    for next_task_name in next_tasks {
        if state.scheduled_tasks.contains(&next_task_name) { /* ... */ }
        // All processing done under single lock
    }
    // Lock released once at end
    Ok(())
}
```
**Benefits**:
- Reduced lock contention
- Better cache locality
- Simpler reasoning about state consistency
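A std-only illustration of the single-guard pattern (the `State` struct with just a `scheduled_tasks` set is a hypothetical stand-in for the coordinator state): one lock acquisition covers the whole batch, and the guard is released exactly once when it goes out of scope.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

// Hypothetical scheduler state guarded by one mutex.
struct State {
    scheduled_tasks: HashSet<String>,
}

fn schedule_next(state: &Arc<Mutex<State>>, next_tasks: &[&str]) {
    // One lock acquisition covers the whole batch, instead of
    // locking and unlocking once per task.
    let mut state = state.lock().unwrap();
    for task in next_tasks {
        state.scheduled_tasks.insert(task.to_string());
    }
} // guard dropped here, releasing the lock once

fn main() {
    let state = Arc::new(Mutex::new(State { scheduled_tasks: HashSet::new() }));
    schedule_next(&state, &["b", "c", "d"]);
    assert_eq!(state.lock().unwrap().scheduled_tasks.len(), 3);
    println!("scheduled 3 tasks under a single lock");
}
```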
---
### 3.3 Low Priority: Event-Driven Execution
**Solution: Replace Polling with Channels**
```rust
pub async fn execute(&self) -> Result<WorkflowExecutionResult> {
    let (tx, mut rx) = mpsc::channel(100);

    // Schedule entry points
    for task in &self.graph.entry_points {
        self.spawn_task(task, tx.clone()).await;
    }

    // Wait for task completions
    while let Some(event) = rx.recv().await {
        match event {
            TaskEvent::Completed { task, success } => {
                self.on_task_completion(task, success, tx.clone()).await?;
            }
            TaskEvent::WorkflowComplete => break,
        }
    }

    // ... build and return the final WorkflowExecutionResult
}
```
**Benefits**:
- Eliminates polling delay
- Event-driven is more idiomatic for async Rust
- Better resource utilization
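The same event-driven shape can be sketched with std channels, with threads standing in for tokio tasks (the `TaskEvent` enum and the three task names are hypothetical): the coordinator wakes only when an event arrives, with no polling interval.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical completion events, mirroring the async sketch above.
enum TaskEvent {
    Completed { task: String },
    WorkflowComplete,
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Stand-ins for spawned tasks: each sends a completion event.
    for name in ["extract", "transform", "load"] {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(TaskEvent::Completed { task: name.to_string() }).unwrap();
        });
    }

    // Coordinator loop: blocks until an event arrives, no 100ms sleep.
    let mut completed = 0;
    while let Ok(event) = rx.recv() {
        match event {
            TaskEvent::Completed { task } => {
                completed += 1;
                println!("completed: {task}");
                if completed == 3 {
                    tx.send(TaskEvent::WorkflowComplete).unwrap();
                }
            }
            TaskEvent::WorkflowComplete => break,
        }
    }
    assert_eq!(completed, 3);
}
```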
---
### 3.4 Low Priority: Batch State Persistence
**Solution: Write-Behind Cache**
```rust
pub struct StateCache {
    dirty_states: Arc<DashMap<Id, WorkflowExecutionState>>,
    flush_interval: Duration,
}

impl StateCache {
    async fn flush_periodically(&self) {
        loop {
            sleep(self.flush_interval).await;
            self.flush_to_db().await;
        }
    }

    async fn flush_to_db(&self) {
        // Batch update all dirty states
        let states: Vec<_> = self.dirty_states.iter()
            .map(|entry| entry.value().clone())
            .collect();
        // Single transaction for all updates
        db::batch_update_states(&states).await;
    }
}
```
**Benefits**:
- Can reduce database write operations by roughly 10-100x, depending on flush interval and completion rate
- Better database performance under high load
**Trade-offs**:
- Potential data loss if process crashes
- Need careful crash recovery logic
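A std-only sketch of the coalescing behavior that makes write-behind effective (types simplified to `u64` ids and `String` states, both hypothetical): repeated updates to one workflow collapse into a single pending write, which is where the write reduction comes from.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical dirty-state buffer: the newest state per workflow wins,
// so N completions for one workflow collapse into a single write.
struct StateCache {
    dirty: Mutex<HashMap<u64, String>>,
}

impl StateCache {
    fn mark_dirty(&self, id: u64, state: &str) {
        self.dirty.lock().unwrap().insert(id, state.to_string());
    }

    // Drain everything in one batch; the caller would persist these
    // in a single transaction.
    fn take_batch(&self) -> HashMap<u64, String> {
        std::mem::take(&mut *self.dirty.lock().unwrap())
    }
}

fn main() {
    let cache = StateCache { dirty: Mutex::new(HashMap::new()) };

    // 1000 updates to the same workflow coalesce into one pending row.
    for i in 0..1000 {
        cache.mark_dirty(42, &format!("step_{i}"));
    }
    cache.mark_dirty(7, "running");

    let batch = cache.take_batch();
    assert_eq!(batch.len(), 2);
    assert_eq!(batch[&42], "step_999");
    println!("flushing {} states in one transaction", batch.len());
}
```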
---
## 4. Benchmarking Recommendations
To validate these issues and solutions, implement benchmarks for:
### 4.1 Context Cloning Benchmark
```rust
#[bench]
fn bench_context_clone_with_growing_results(b: &mut Bencher) {
    let mut ctx = WorkflowContext::new(json!({}), HashMap::new());

    // Simulate 100 completed tasks
    for i in 0..100 {
        ctx.set_task_result(&format!("task_{}", i),
            json!({"data": vec![0u8; 10240]})); // 10KB per task
    }

    // Measure clone time
    b.iter(|| ctx.clone());
}
```
### 4.2 with-items Scaling Benchmark
```rust
#[bench]
fn bench_with_items_scaling(b: &mut Bencher) {
    // `Bencher::iter` takes a sync closure, so block on the async call.
    let runtime = tokio::runtime::Runtime::new().unwrap();

    // Test with 10, 100, 1000, 10000 items
    for item_count in [10, 100, 1000, 10000] {
        let items = vec![json!({"value": 1}); item_count];
        b.iter(|| {
            // Measure time to process all items
            runtime.block_on(executor.execute_with_items(&task, &mut context, items.clone()))
        });
    }
}
```
### 4.3 Lock Contention Benchmark
```rust
#[bench]
fn bench_concurrent_task_completions(b: &mut Bencher) {
    let runtime = tokio::runtime::Runtime::new().unwrap();

    b.iter(|| {
        runtime.block_on(async {
            // Simulate 100 tasks completing simultaneously
            let handles: Vec<_> = (0..100).map(|i| {
                tokio::spawn(async move {
                    on_task_completion(state.clone(), graph.clone(),
                        format!("task_{}", i), true).await
                })
            }).collect();
            join_all(handles).await
        })
    });
}
```
---
## 5. Implementation Priority
| Issue | Priority | Effort | Impact | Recommendation |
|-------|----------|--------|--------|----------------|
| Context cloning (1.1) | 🔴 Critical | High | Very High | Implement Arc-based solution |
| Lock contention (1.2) | 🟡 Medium | Low | Medium | Quick win - refactor locking |
| Polling overhead (1.3) | 🟢 Low | Medium | Low | Future improvement |
| State persistence (1.4) | 🟡 Medium | Medium | Medium | Implement after Arc solution |
---
## 6. Conclusion
The Attune workflow engine's current implementation is **algorithmically sound** - there are no truly quadratic or exponential algorithms in the core logic. However, the **context cloning pattern in with-items execution** creates a practical O(N*C) cost that, because the context itself grows as the workflow completes tasks, behaves like quadratic growth in real-world workflows with large contexts and long lists.
**Immediate Action**: Implement Arc-based context sharing to eliminate the cloning overhead. Based on the context sizes above, this single change could plausibly yield a 10-100x improvement for workflows with large lists and many task results; benchmarks should confirm the actual gain.
**Next Steps**:
1. Create benchmarks to measure current performance
2. Implement Arc<> wrapper for WorkflowContext immutable data
3. Refactor execute_with_items to use shared context
4. Re-run benchmarks to validate improvements
5. Consider event-driven execution model for future optimization
---
## 7. References
- StackStorm Orquesta Performance Issues: https://github.com/StackStorm/orquesta/issues
- Rust Arc Documentation: https://doc.rust-lang.org/std/sync/struct.Arc.html
- DashMap (concurrent HashMap): https://docs.rs/dashmap/latest/dashmap/
- Tokio Sync Primitives: https://docs.rs/tokio/latest/tokio/sync/
---
**Document Version**: 1.0
**Date**: 2025-01-17
**Author**: Performance Analysis Team