Orquesta-Style Workflow Refactoring Plan
Goal
Refactor the workflow execution engine from a dependency-based DAG model to a transition-based graph traversal model inspired by StackStorm's Orquesta engine. This will simplify the code and naturally support workflow cycles.
Current Problems
- Over-engineered: Computing dependencies, levels, and topological sort that we never actually use
- Not using transitions: We parse next transitions but execute based on dependencies instead
- Artificial DAG restriction: Prevents legitimate use cases like monitoring loops
- Polling-based: Continuously polls for "ready tasks" instead of reacting to completions
Orquesta Model Benefits
- Simpler: Pure graph traversal following transitions
- Event-driven: Task completions trigger next task scheduling
- Naturally supports cycles: Workflows terminate when transitions stop scheduling tasks
- Intuitive: Follow the next arrows in the workflow definition
Implementation Plan
Phase 1: Documentation Updates
Files to modify:
- docs/workflow-execution-engine.md
- work-summary/TODO.md
Changes:
- Remove references to DAG and topological sort
- Document transition-based execution model
- Add examples of cyclic workflows (monitoring loops)
- Document join semantics clearly
- Document workflow termination conditions
Phase 2: Refactor Graph Module (crates/executor/src/workflow/graph.rs)
Remove:
- CircularDependency error variant (cycles are now valid)
- NoEntryPoint error variant (a workflow in which every task has inbound edges is valid if it is started manually)
- level field from TaskNode
- execution_order field from TaskGraph
- compute_levels() method (not needed)
- Topological sort logic in From<GraphBuilder> for TaskGraph
Keep/Modify:
- entry_points - still useful as default starting tasks
- Renamed dependencies to inbound_edges - needed for entry point detection and join tracking
- Renamed dependents to outbound_edges - needed for identifying edges
- next_tasks() - KEY METHOD - evaluates transitions
- Simplified compute_dependencies() to compute_inbound_edges() - only tracks inbound edges
- Updated TaskNode.dependencies to TaskNode.inbound_tasks
Add:
- get_inbound_tasks(&self, task_name: &str) -> Vec<String> - returns all tasks that can transition to this task
- Documentation explaining that cycles are supported
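As a rough illustration of the edge bookkeeping described in this phase, the sketch below tracks inbound and outbound edges and derives entry points from them. It is not the real TaskGraph: the struct layout, add_task(), and add_transition() are hypothetical, and BTreeMap/BTreeSet are used only to keep the output order deterministic.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified stand-in for the real TaskGraph.
#[derive(Debug, Default)]
pub struct TaskGraph {
    /// Task name -> tasks it can transition to.
    outbound_edges: BTreeMap<String, BTreeSet<String>>,
    /// Task name -> tasks that can transition to it.
    inbound_edges: BTreeMap<String, BTreeSet<String>>,
}

impl TaskGraph {
    /// Registers a task so that it is known even if it has no edges.
    pub fn add_task(&mut self, name: &str) {
        self.outbound_edges.entry(name.to_string()).or_default();
        self.inbound_edges.entry(name.to_string()).or_default();
    }

    /// Records a transition edge `from -> to`. Self-loops and cycles are
    /// accepted; nothing here checks for them.
    pub fn add_transition(&mut self, from: &str, to: &str) {
        self.add_task(from);
        self.add_task(to);
        self.outbound_edges
            .entry(from.to_string())
            .or_default()
            .insert(to.to_string());
        self.inbound_edges
            .entry(to.to_string())
            .or_default()
            .insert(from.to_string());
    }

    /// Returns all tasks that can transition to `task_name`.
    pub fn get_inbound_tasks(&self, task_name: &str) -> Vec<String> {
        self.inbound_edges
            .get(task_name)
            .map(|set| set.iter().cloned().collect())
            .unwrap_or_default()
    }

    /// Entry points are tasks with no inbound edges. The result may be
    /// empty when every task sits on a cycle.
    pub fn entry_points(&self) -> Vec<String> {
        self.inbound_edges
            .iter()
            .filter(|(_, inbound)| inbound.is_empty())
            .map(|(name, _)| name.clone())
            .collect()
    }
}
```

With transitions start -> check, check -> wait, and wait -> check, this sketch reports start as the only entry point and lists both start and wait as inbound tasks of check.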
Phase 3: Enhance Transition Evaluation
Files to modify:
crates/executor/src/workflow/graph.rs
Changes:
- next_tasks() already returns task names based on success/failure
- Add support for evaluating when conditions (deferred - needs context)
- Consider returning a struct with task name + transition info instead of just String (deferred)
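The deferred idea of returning a struct instead of a bare String could look roughly like the sketch below. Every type here (TaskOutcome, When, Transition, NextTask) is a hypothetical stand-in, and the free function is not the real next_tasks() signature. It assumes the first-match-wins ordering recorded in the Orquesta notes of this plan, and it models a when condition as a simple enum rather than an expression evaluated against context.

```rust
/// Outcome of a finished task.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskOutcome {
    Succeeded,
    Failed,
}

/// Hypothetical stand-in for a real `when` expression.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum When {
    OnSuccess,
    OnFailure,
    Always,
}

impl When {
    fn matches(self, outcome: TaskOutcome) -> bool {
        match (self, outcome) {
            (When::Always, _) => true,
            (When::OnSuccess, TaskOutcome::Succeeded) => true,
            (When::OnFailure, TaskOutcome::Failed) => true,
            _ => false,
        }
    }
}

/// One entry of a task's transition list.
#[derive(Debug, Clone)]
pub struct Transition {
    pub when: When,
    pub do_tasks: Vec<String>,
}

/// Task name plus the index of the transition that selected it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NextTask {
    pub task: String,
    pub transition_index: usize,
}

/// Evaluates transitions in order and returns the tasks of the first
/// transition whose condition matches (assumption: first match wins).
pub fn next_tasks(transitions: &[Transition], outcome: TaskOutcome) -> Vec<NextTask> {
    transitions
        .iter()
        .enumerate()
        .find(|(_, t)| t.when.matches(outcome))
        .map(|(index, t)| {
            t.do_tasks
                .iter()
                .map(|task| NextTask {
                    task: task.clone(),
                    transition_index: index,
                })
                .collect::<Vec<_>>()
        })
        .unwrap_or_default()
}
```

Carrying transition_index alongside the task name is one way to let the coordinator later look up per-transition data without re-evaluating conditions.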
Phase 4: Add Join Tracking (crates/executor/src/workflow/coordinator.rs)
Add to WorkflowExecutionState:
- scheduled_tasks: HashSet<String> - tasks scheduled but not yet executing
- join_state: HashMap<String, HashSet<String>> - track which predecessors completed for each join task
- Renamed current_tasks to executing_tasks for clarity
Add methods:
- Join checking logic implemented in on_task_completion() method
  - Checks if join conditions are met
  - Returns true immediately if no join specified
  - Returns true if join count reached
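The join check can be sketched as follows. The three state fields come from this plan, but join_satisfied() is a hypothetical helper (the plan places this logic inside on_task_completion()), and clearing the barrier once it fires is an assumption made here so that a join task inside a cycle can wait again on a later iteration.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for the real WorkflowExecutionState.
#[derive(Debug, Default)]
pub struct WorkflowExecutionState {
    /// Tasks scheduled but not yet executing.
    pub scheduled_tasks: HashSet<String>,
    /// Tasks currently executing.
    pub executing_tasks: HashSet<String>,
    /// Join task -> predecessors that have completed so far.
    pub join_state: HashMap<String, HashSet<String>>,
}

impl WorkflowExecutionState {
    /// Records that `predecessor` completed and transitioned to `task`.
    /// `join` is the number of distinct inbound transitions required;
    /// `None` means the task has no join barrier.
    /// Returns true when `task` should be scheduled now.
    pub fn join_satisfied(&mut self, task: &str, predecessor: &str, join: Option<usize>) -> bool {
        let required = match join {
            // No join specified: schedule immediately.
            None => return true,
            Some(n) => n,
        };
        let completed = self.join_state.entry(task.to_string()).or_default();
        completed.insert(predecessor.to_string());
        if completed.len() >= required {
            // Assumption: reset the barrier once it fires so the task can
            // be joined again in a later loop iteration.
            self.join_state.remove(task);
            true
        } else {
            false
        }
    }
}
```

Because predecessors are stored in a set, the same predecessor completing twice counts once, so a join of 2 fires only after two different predecessors have reported.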
Phase 5: Refactor Workflow Coordinator
Files to modify:
crates/executor/src/workflow/coordinator.rs
Major refactor of WorkflowExecutionHandle::execute():
// NEW EXECUTION MODEL:
// 1. Schedule entry point tasks
// 2. Wait for task completions
// 3. On completion, evaluate transitions and schedule next tasks
// 4. Terminate when nothing executing and nothing scheduled
Changes:
- Replaced polling ready_tasks with checking scheduled_tasks
- Start execution by scheduling all entry point tasks
- Removed graph.ready_tasks() call
- Added spawn_task_execution() method that:
  - Spawns task execution from main loop
- Modified execute_task_async() to:
  - Move task from scheduled to executing when starting
  - On completion, evaluate graph.next_tasks()
  - Call on_task_completion() to schedule next tasks
  - Handle join state updates
- Updated termination condition: scheduled_tasks.is_empty() && executing_tasks.is_empty()
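The four-step execution model above can be illustrated with a synchronous simulation. This is only a sketch: the real coordinator runs tasks asynchronously, whereas here each task runs to completion inline, so the executing set is always empty when the loop condition is checked. The function signature, the transitions map, and the run_task callback are all hypothetical, and join handling is left out.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Runs a workflow to completion and returns the order in which tasks ran.
/// `transitions` maps a task to the tasks scheduled when it completes.
/// `run_task` is a hypothetical stand-in for real task execution; it returns
/// true when the task's transitions should be followed.
pub fn execute<F>(
    entry_points: &[&str],
    transitions: &HashMap<&str, Vec<&str>>,
    mut run_task: F,
) -> Vec<String>
where
    F: FnMut(&str) -> bool,
{
    // 1. Schedule entry point tasks.
    let mut scheduled: VecDeque<String> =
        entry_points.iter().map(|t| t.to_string()).collect();
    let mut executing: HashSet<String> = HashSet::new();
    let mut history = Vec::new();

    // 4. Terminate when nothing is executing and nothing is scheduled.
    while !(scheduled.is_empty() && executing.is_empty()) {
        if let Some(task) = scheduled.pop_front() {
            // Move the task from scheduled to executing.
            executing.insert(task.clone());

            // 2. Wait for completion (inline in this simulation).
            let follow = run_task(&task);
            history.push(task.clone());
            executing.remove(&task);

            // 3. On completion, evaluate transitions and schedule next tasks.
            if follow {
                for next in transitions.get(task.as_str()).into_iter().flatten() {
                    scheduled.push_back(next.to_string());
                }
            }
        }
    }
    history
}
```

A monitoring loop with transitions check -> wait and wait -> check terminates as soon as run_task returns false for check, because no further task is scheduled and nothing is executing.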
Specific implementation steps:
- Added spawn_task_execution() method
- Added on_task_completion() method that evaluates transitions
- Refactored execute() to start with entry points
- Changed main loop to spawn scheduled tasks and check for completion
- Updated execute_task_async() to call on_task_completion() at the end
- Implemented join barrier logic in on_task_completion()
Phase 6: Update Tests
Files to modify:
- crates/executor/src/workflow/graph.rs (tests module)
- crates/executor/src/workflow/coordinator.rs (tests module)
- Add new test files if needed
Test cases to add:
- Simple cycle (task transitions to itself) - test_cycle_support
- Complex cycle (task A -> B -> C -> A)
- Cycle with termination condition (monitoring loop that exits)
- Join with 2 parallel tasks
- Join with N tasks (where join = 2 of 3)
- Multiple entry points
- Workflow with no entry points (all tasks have inbound edges) - test_cycle_support covers this
- Task that transitions to multiple next tasks - test_parallel_entry_points covers this
Test cases to update:
- Updated existing tests to work with new model
- Removed dependency on circular dependency errors
Phase 7: Add Cycle Protection
Safety mechanisms to add:
- Workflow execution timeout (max total execution time)
- Task iteration limit (max times a single task can execute in one workflow)
- Add to config: max_workflow_duration_seconds
- Add to config: max_task_iterations_per_workflow
- Track iteration count per task in WorkflowExecutionState
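One possible shape for these safety mechanisms is sketched below. The two config field names are taken from this plan; CycleLimits, LimitError, IterationTracker, and record_schedule() are hypothetical names, and checking the limits at scheduling time (rather than at task start) is an assumption.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Config values named in this plan; the struct itself is hypothetical.
#[derive(Debug, Clone, Copy)]
pub struct CycleLimits {
    pub max_workflow_duration_seconds: u64,
    pub max_task_iterations_per_workflow: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    WorkflowTimeout,
    TaskIterationLimit { task: String, limit: u32 },
}

/// Tracks how many times each task has been scheduled in one workflow run.
pub struct IterationTracker {
    limits: CycleLimits,
    started_at: Instant,
    iterations: HashMap<String, u32>,
}

impl IterationTracker {
    pub fn new(limits: CycleLimits) -> Self {
        Self {
            limits,
            started_at: Instant::now(),
            iterations: HashMap::new(),
        }
    }

    /// Call before scheduling `task`. Returns the task's new iteration
    /// count, or an error if either limit would be exceeded.
    pub fn record_schedule(&mut self, task: &str) -> Result<u32, LimitError> {
        let max_duration = Duration::from_secs(self.limits.max_workflow_duration_seconds);
        if self.started_at.elapsed() > max_duration {
            return Err(LimitError::WorkflowTimeout);
        }
        let limit = self.limits.max_task_iterations_per_workflow;
        let count = self.iterations.entry(task.to_string()).or_insert(0);
        if *count >= limit {
            return Err(LimitError::TaskIterationLimit {
                task: task.to_string(),
                limit,
            });
        }
        *count += 1;
        Ok(*count)
    }
}
```

Counting per task rather than per workflow lets a long-running monitoring loop coexist with one-shot tasks, since only the task that actually repeats hits the limit.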
Phase 8: Update Workflow YAML Examples
Files to create/update:
- Add example workflows demonstrating cycles
  - docs/examples/monitoring-loop.yaml
  - docs/examples/retry-with-cycle.yaml
  - docs/examples/conditional-loop.yaml
Phase 9: Final Documentation
Update:
- README.md - mention cycle support
- docs/workflow-execution-engine.md - complete rewrite of execution model section
- docs/testing-status.md - add new test requirements
- CHANGELOG.md - document the breaking change
Testing Strategy
- Unit Tests: Test graph building, transition evaluation, join logic
- Integration Tests: Test full workflow execution with cycles
- Manual Testing: Run example workflows with monitoring loops
- Performance Testing: Ensure cycle protection (timeouts and per-task iteration tracking) doesn't cause performance issues
Migration Notes
Breaking Changes:
- Workflows that relied on implicit execution order from levels may behave differently
- Cycles that were previously errors are now valid
- Entry point detection behavior may change slightly
Backwards Compatibility:
- All valid DAG workflows should continue to work
- The transition model is more explicit and should be more predictable
Estimated Effort
- Phase 1 (Docs): 1 hour (DEFERRED)
- Phase 2 (Graph refactor): 2-3 hours ✅ COMPLETE
- Phase 3 (Transition enhancement): 1 hour (PARTIAL - basic implementation done)
- Phase 4 (Join tracking): 1-2 hours ✅ COMPLETE
- Phase 5 (Coordinator refactor): 3-4 hours ✅ COMPLETE
- Phase 6 (Tests): 2-3 hours (PARTIAL - basic tests updated, more needed)
- Phase 7 (Cycle protection): 1-2 hours (DEFERRED - not critical for now)
- Phase 8 (Examples): 1 hour (TODO)
- Phase 9 (Final docs): 1 hour (TODO)
Total: 13-18 hours
Completed so far: ~6-8 hours
Success Criteria
- All existing tests pass ✅
- New cycle tests pass ✅
- Example monitoring loop workflow executes successfully
- Documentation is complete and accurate
- No performance regression (not tested yet)
- Code is simpler than before (fewer lines, less complexity) ✅
Core Implementation Complete ✅
The fundamental refactoring from DAG to transition-based graph traversal is complete:
- Removed all cycle detection code
- Refactored graph building to use inbound/outbound edges
- Implemented transition-based task scheduling
- Added join barrier support
- Updated tests to validate cycle support
Remaining work is primarily documentation and additional examples.
Implementation Order
Execute phases in order 1-9, completing all tasks in each phase before moving to the next. Commit after each phase for easy rollback if needed.
Notes from Orquesta Documentation
Key insights:
- Tasks are nodes, transitions are edges
- Entry points are tasks with no inbound edges
- Workflow terminates when no tasks running AND no tasks scheduled
- Join creates a barrier - single instance waits for multiple inbound transitions
- Without join, task is invoked multiple times (once per inbound transition)
- Fail-fast: task failure with no transition terminates workflow
- Transitions evaluated in order, first matching transition wins