# Policy Execution Ordering Implementation Plan

**Date**: 2025-01-XX
**Status**: Planning
**Priority**: P0 - BLOCKING (Critical Correctness)

## Problem Statement

Currently, when execution policies (concurrency limits, delays) are enforced, there is **no guaranteed ordering** for which executions proceed when slots become available. This leads to:

1. **Fairness Violations**: Later requests can execute before earlier ones
2. **Non-deterministic Behavior**: Same workflow produces different orders across runs
3. **Workflow Dependencies Break**: Parent executions may proceed after children
4. **Poor User Experience**: Unpredictable queue behavior

### Current Flow (Broken)
```
Request A arrives → Policy blocks (concurrency=1, 1 running)
Request B arrives → Policy blocks (concurrency=1, 1 running)
Request C arrives → Policy blocks (concurrency=1, 1 running)
Running execution completes
  → A, B, or C might proceed (RANDOM, based on tokio scheduling)
```

### Desired Flow (FIFO)
```
Request A arrives → Enqueued at position 0
Request B arrives → Enqueued at position 1
Request C arrives → Enqueued at position 2
Running execution completes → Notify position 0 → A proceeds
A completes → Notify position 1 → B proceeds
B completes → Notify position 2 → C proceeds
```

## Architecture Design

### 1. ExecutionQueueManager

A new component that manages FIFO queues per action and provides slot-based synchronization.

**Key Features:**
- One queue per `action_id` (per-action concurrency control)
- FIFO ordering guarantee using `VecDeque`
- Tokio `Notify` for efficient async waiting
- Thread-safe with `Arc<Mutex<>>` or `DashMap`
- Queue statistics for monitoring

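**Data Structures:**
```rust
use std::collections::VecDeque;
use std::sync::Arc;

use chrono::{DateTime, Utc};
use dashmap::DashMap;
use tokio::sync::Notify;

/// One waiting execution; `notifier` is fired when its turn comes.
struct QueueEntry {
    execution_id: i64,
    enqueued_at: DateTime<Utc>,
    notifier: Arc<Notify>,
}

/// Per-action state: waiters in arrival order plus slot accounting.
struct ActionQueue {
    queue: VecDeque<QueueEntry>,
    active_count: u32,
    max_concurrent: u32,
}

struct ExecutionQueueManager {
    queues: DashMap<i64, ActionQueue>, // key: action_id
}
```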
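
The slot accounting implied by `ActionQueue` can be read as a small state machine. The sketch below is an illustration under assumptions, not the planned implementation: it uses only the standard library, stores bare `execution_id` values instead of `QueueEntry`, and the `Admission` enum and the `admit` and `complete` names are hypothetical.

```rust
use std::collections::VecDeque;

/// Outcome of asking the queue for a slot (hypothetical type).
#[derive(Debug, PartialEq)]
enum Admission {
    /// A slot was free; the execution may start immediately.
    Granted,
    /// All slots are busy; the execution waits at this 0-based position.
    Queued(usize),
}

struct ActionQueue {
    queue: VecDeque<i64>, // waiting execution_ids, oldest at the front
    active_count: u32,
    max_concurrent: u32,
}

impl ActionQueue {
    fn new(max_concurrent: u32) -> Self {
        Self { queue: VecDeque::new(), active_count: 0, max_concurrent }
    }

    /// Take a slot only if one is free AND nobody is already waiting;
    /// otherwise join the back of the queue.
    fn admit(&mut self, execution_id: i64) -> Admission {
        if self.queue.is_empty() && self.active_count < self.max_concurrent {
            self.active_count += 1;
            Admission::Granted
        } else {
            self.queue.push_back(execution_id);
            Admission::Queued(self.queue.len() - 1)
        }
    }

    /// Release one slot and hand it to the oldest waiter, if any.
    /// Returns the execution_id that should be woken.
    fn complete(&mut self) -> Option<i64> {
        self.active_count = self.active_count.saturating_sub(1);
        if self.active_count < self.max_concurrent {
            if let Some(next) = self.queue.pop_front() {
                self.active_count += 1;
                return Some(next);
            }
        }
        None
    }
}
```

The `queue.is_empty()` check in `admit` is what keeps a newcomer from overtaking existing waiters when a slot happens to be free. In the real component, the ID returned by `complete` would presumably be used to fire that entry's `Notify`.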

### 2. Integration Points

#### A. EnforcementProcessor
- **Before**: Directly creates execution and publishes to scheduler
- **After**: Calls `queue_manager.enqueue_and_wait()` before creating execution
- **Change**: Async wait until queue allows execution

#### B. PolicyEnforcer
- **Before**: `wait_for_policy_compliance()` polls every 1 second
- **After**: `enforce_and_wait()` combines policy check + queue wait
- **Change**: More efficient (no polling), guaranteed ordering

#### C. ExecutionScheduler
- **No Change**: Receives ExecutionRequested messages as before
- **Note**: Queueing happens before scheduling, not during it

#### D. Worker → Executor Completion
- **New**: Worker publishes `execution.completed` message
- **New**: Executor's CompletionListener consumes these messages
- **New**: CompletionListener calls `queue_manager.notify_completion(action_id)`

### 3. Message Flow

```
┌─────────────────────────────────────────────────────────────────┐
│  EnforcementProcessor                                           │
│                                                                 │
│  1. Receive enforcement.created                                 │
│  2. queue_manager.enqueue_and_wait(action_id, execution_id)     │
│     ├─ Check policy compliance                                  │
│     ├─ Enqueue to action's FIFO queue                           │
│     ├─ Wait on notifier if queue full                           │
│     └─ Return when slot available                               │
│  3. Create execution record                                     │
│  4. Publish execution.requested                                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  ExecutionScheduler                                             │
│                                                                 │
│  5. Receive execution.requested                                 │
│  6. Select worker                                               │
│  7. Publish to worker queue                                     │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Worker                                                         │
│                                                                 │
│  8. Execute action                                              │
│  9. Publish execution.completed (NEW)                           │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  CompletionListener (NEW)                                       │
│                                                                 │
│  10. Receive execution.completed                                │
│  11. queue_manager.notify_completion(action_id)                 │
│      └─ Notify next waiter in queue                             │
└─────────────────────────────────────────────────────────────────┘
```

## Implementation Steps

### Step 1: Create ExecutionQueueManager (2 days)

**Files to Create:**
- `crates/executor/src/queue_manager.rs`

**Implementation:**
```rust
pub struct ExecutionQueueManager {
    queues: DashMap<i64, Arc<Mutex<ActionQueue>>>,
}

// Planned public API; signatures only, bodies omitted.
impl ExecutionQueueManager {
    /// Waits until the execution may proceed under `max_concurrent`.
    pub async fn enqueue_and_wait(
        &self,
        action_id: i64,
        execution_id: i64,
        max_concurrent: u32,
    ) -> Result<()>;

    /// Releases one slot for `action_id` and wakes the next waiter.
    pub async fn notify_completion(&self, action_id: i64) -> Result<()>;

    /// Returns current statistics for the action's queue.
    pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats;

    /// Removes a waiting execution from its queue.
    pub async fn cancel_execution(&self, execution_id: i64) -> Result<()>;
}
```
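
Cancellation has to remove an entry from the middle of a queue without disturbing the order of the remaining waiters. A minimal std-only sketch of that step follows; the free function `cancel_waiting` and the trimmed-down `QueueEntry` are hypothetical and stand in for whatever `cancel_execution()` ends up doing internally.

```rust
use std::collections::VecDeque;

/// Trimmed-down entry for illustration; the planned struct has more fields.
struct QueueEntry {
    execution_id: i64,
}

/// Remove a waiting execution from the queue, keeping the relative
/// order of everything else. Returns true if an entry was removed.
fn cancel_waiting(queue: &mut VecDeque<QueueEntry>, execution_id: i64) -> bool {
    match queue.iter().position(|e| e.execution_id == execution_id) {
        Some(index) => {
            // VecDeque::remove shifts elements but preserves their order.
            queue.remove(index);
            true
        }
        None => false,
    }
}
```

Since `cancel_execution()` receives only an `execution_id`, the manager would either scan every action's queue or keep a side index from `execution_id` to `action_id`. The plan does not say which, so that choice is left open here.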

**Tests:**
- FIFO ordering with 3 concurrent enqueues, limit=1
- 1000 concurrent enqueues maintain order
- Completion notification releases correct waiter
- Multiple actions have independent queues
- Cancel removes from queue correctly

### Step 2: Integrate with PolicyEnforcer (1 day)

**Files to Modify:**
- `crates/executor/src/policy_enforcer.rs`

**Changes:**
- Add `queue_manager: Arc<ExecutionQueueManager>` field
- Create `enforce_and_wait()` method that combines:
  1. Policy compliance check
  2. Queue enqueue and wait
- Keep existing `check_policies()` for validation

**Tests:**
- Policy violation prevents queue entry
- Policy pass allows queue entry
- Queue respects concurrency limits

### Step 3: Update EnforcementProcessor (1 day)

**Files to Modify:**
- `crates/executor/src/enforcement_processor.rs`

**Changes:**
- Add `queue_manager: Arc<ExecutionQueueManager>` field
- In `create_execution()`, before creating execution record:
  ```rust
  // Get action's concurrency limit from policy
  // (no limit configured means effectively unbounded)
  let concurrency_limit = policy_enforcer
      .get_concurrency_limit(rule.action)
      .unwrap_or(u32::MAX);

  // Wait for queue slot.
  // The execution record does not exist yet, so `enforcement.id`
  // is passed where `enqueue_and_wait()` expects an execution ID.
  queue_manager
      .enqueue_and_wait(rule.action, enforcement.id, concurrency_limit)
      .await?;

  // Now create execution (we have a slot)
  let execution = ExecutionRepository::create(pool, execution_input).await?;
  ```
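
One gap in the snippet above: if `ExecutionRepository::create` fails after the slot has been granted, the `?` returns early and nothing gives the slot back. A possible remedy is an RAII guard that releases the slot on drop unless it is explicitly disarmed. The sketch below is std-only, and every name in it (`SlotGuard`, `disarm`, `create_with_slot`, the stand-in `create_execution`) is hypothetical.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical guard: holds one slot and gives it back when dropped.
struct SlotGuard {
    active_count: Arc<Mutex<u32>>,
    armed: bool,
}

impl SlotGuard {
    fn acquire(active_count: Arc<Mutex<u32>>) -> Self {
        *active_count.lock().unwrap() += 1;
        Self { active_count, armed: true }
    }

    /// Call once the execution record exists. From then on the slot is
    /// released by the completion path, not by this guard.
    fn disarm(mut self) {
        self.armed = false;
    }
}

impl Drop for SlotGuard {
    fn drop(&mut self) {
        if self.armed {
            let mut count = self.active_count.lock().unwrap();
            *count = count.saturating_sub(1);
        }
    }
}

/// Stand-in for the database insert; `fail` simulates an error.
fn create_execution(fail: bool) -> Result<i64, String> {
    if fail { Err("insert failed".to_string()) } else { Ok(42) }
}

fn create_with_slot(active_count: Arc<Mutex<u32>>, fail: bool) -> Result<i64, String> {
    let guard = SlotGuard::acquire(active_count);
    // On error, `?` returns early, the guard drops, and the slot is freed.
    let id = create_execution(fail)?;
    guard.disarm();
    Ok(id)
}
```

In the real queue manager, releasing a slot this way would also need to wake the next waiter, which the counter-only sketch leaves out.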

**Tests:**
- Three executions with limit=1 execute in FIFO order
- Queue blocks until slot available
- Execution created only after queue allows

### Step 4: Create CompletionListener (1 day)

**Files to Create:**
- `crates/executor/src/completion_listener.rs`

**Implementation:**
- New component that consumes `execution.completed` messages
- Calls `queue_manager.notify_completion(action_id)`
- Updates execution status in database (if needed)
- Publishes notifications

**Message Type:**
```rust
// In attune_common/mq/messages.rs
pub struct ExecutionCompletedPayload {
    pub execution_id: i64,
    pub action_id: i64,
    pub status: ExecutionStatus,
    pub result: Option<JsonValue>,
}
```
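
Message brokers commonly redeliver, so the completion path should tolerate seeing the same `execution.completed` twice. Because the payload carries `execution_id`, one option is to track which executions currently hold a slot and ignore completions for anything else. This is a std-only sketch of that idea, not the planned design: the plan's `notify_completion()` takes only `action_id`, and the `active` set and `on_completed` method are hypothetical.

```rust
use std::collections::{HashSet, VecDeque};

struct ActionQueue {
    waiting: VecDeque<i64>, // execution_ids in arrival order
    active: HashSet<i64>,   // execution_ids currently holding a slot
}

impl ActionQueue {
    /// Handle one completion message. Returns the execution to wake
    /// next, if any. Unknown or repeated execution_ids are ignored.
    fn on_completed(&mut self, execution_id: i64) -> Option<i64> {
        if !self.active.remove(&execution_id) {
            return None; // duplicate or stale message: no slot to free
        }
        // The freed slot goes straight to the oldest waiter.
        let next = self.waiting.pop_front()?;
        self.active.insert(next);
        Some(next)
    }
}
```

With a plain counter, a duplicated message would free a second slot and let two executions run under a limit of one; the set makes the release idempotent.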

**Tests:**
- Completion message triggers queue notification
- Correct action_id used for notification
- Database status updated correctly

### Step 5: Update Worker to Publish Completions (0.5 day)

**Files to Modify:**
- `crates/worker/src/executor.rs`

**Changes:**
- After execution completes (success or failure), publish `execution.completed`
- Include action_id in message payload
- Use reliable publishing (ensure message is sent)

**Tests:**
- Worker publishes on success
- Worker publishes on failure
- Worker publishes on timeout
- Worker publishes on cancel

### Step 6: Add Queue Stats API Endpoint (0.5 day)

**Files to Modify:**
- `crates/api/src/routes/actions.rs`

**New Endpoint:**
```
GET /api/v1/actions/:ref/queue-stats

Response:
{
  "action_id": 123,
  "action_ref": "core.echo",
  "queue_length": 5,
  "active_count": 2,
  "max_concurrent": 3,
  "oldest_enqueued_at": "2025-01-15T10:30:00Z"
}
```
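
Most of the response fields can be read directly off the per-action queue state. The sketch below shows one way to derive them, assuming a FIFO queue whose front entry is the oldest. It is std-only, so it uses `SystemTime` where the plan uses `DateTime<Utc>`, and it leaves out `action_id` and `action_ref`, which the API layer would presumably fill in. The `stats` method name is hypothetical.

```rust
use std::collections::VecDeque;
use std::time::SystemTime;

struct QueueEntry {
    execution_id: i64,
    enqueued_at: SystemTime,
}

struct ActionQueue {
    queue: VecDeque<QueueEntry>,
    active_count: u32,
    max_concurrent: u32,
}

#[derive(Debug, PartialEq)]
struct QueueStats {
    queue_length: usize,
    active_count: u32,
    max_concurrent: u32,
    oldest_enqueued_at: Option<SystemTime>,
}

impl ActionQueue {
    /// Snapshot of the queue for monitoring.
    fn stats(&self) -> QueueStats {
        QueueStats {
            queue_length: self.queue.len(),
            active_count: self.active_count,
            max_concurrent: self.max_concurrent,
            // FIFO: the front entry is always the oldest waiter.
            oldest_enqueued_at: self.queue.front().map(|e| e.enqueued_at),
        }
    }
}
```

`oldest_enqueued_at` is an `Option` because an empty queue has no oldest entry; how the endpoint serializes that case (for example as `null`) is not specified in the plan.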

**Tests:**
- Endpoint returns correct stats
- Queue stats update in real-time
- Non-existent action returns 404

### Step 7: Integration Testing (1 day)

**Test Scenarios:**
1. **FIFO Ordering**: 10 executions, limit=1, verify order
2. **Concurrent Actions**: Multiple actions don't interfere
3. **High Concurrency**: 1000 simultaneous enqueues
4. **Completion Handling**: Verify queue progresses on completion
5. **Failure Scenarios**: Worker crash, timeout, cancel
6. **Policy Integration**: Rate limit + queue interaction
7. **API Stats**: Verify queue stats are accurate

**Files:**
- `crates/executor/tests/queue_ordering_test.rs`
- `crates/executor/tests/queue_stress_test.rs`

### Step 8: Documentation (0.5 day)

**Files to Create/Update:**
- `docs/queue-architecture.md` - Queue design and behavior
- `docs/api-actions.md` - Add queue-stats endpoint
- `README.md` - Mention queue ordering guarantees

**Content:**
- How queues work per action
- FIFO guarantees
- Monitoring queue stats
- Performance characteristics
- Troubleshooting queue issues

## API Changes

### New Endpoint
- `GET /api/v1/actions/:ref/queue-stats` - View queue statistics

### Message Types
- `execution.completed` (new) - Worker notifies completion

## Database Changes

**None required** - All queue state is in-memory

## Configuration

Add to `ExecutorConfig`:
```yaml
executor:
  queue:
    max_queue_length: 10000      # Per-action queue limit
    queue_timeout_seconds: 3600  # Max time in queue
    enable_queue_metrics: true
```
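
On the Rust side these settings could map onto a small config struct whose defaults mirror the YAML above. This is a sketch only: the `QueueConfig` name and the `can_enqueue` helper are hypothetical, and deserialization (for example via serde) is left out to keep the block std-only.

```rust
/// Hypothetical Rust mirror of the `executor.queue` config section.
#[derive(Debug, Clone, PartialEq)]
struct QueueConfig {
    /// Per-action queue limit.
    max_queue_length: usize,
    /// Maximum time an execution may wait in the queue.
    queue_timeout_seconds: u64,
    enable_queue_metrics: bool,
}

impl Default for QueueConfig {
    fn default() -> Self {
        Self {
            max_queue_length: 10_000,
            queue_timeout_seconds: 3_600,
            enable_queue_metrics: true,
        }
    }
}

impl QueueConfig {
    /// Whether a queue currently holding `current_length` waiters
    /// may accept one more.
    fn can_enqueue(&self, current_length: usize) -> bool {
        current_length < self.max_queue_length
    }
}
```

Checking `can_enqueue` before pushing onto an action's queue is one way to apply `max_queue_length`; what the caller does on rejection (fail the enforcement, retry later) is not decided in this plan.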

## Performance Considerations

1. **Memory Usage**: O(n) in the number of queued executions (one `QueueEntry` each)
   - Mitigation: `max_queue_length` config
   - Typical: 100-1000 queued per action

2. **Lock Contention**: Keying the `DashMap` by `action_id` keeps contention low
   - Each action's queue has its own lock
   - `Notify` parks waiting tasks until they are woken, with no polling

3. **Message Overhead**: One additional message per execution
   - `execution.completed` is lightweight
   - Published asynchronously, so the worker does not block on it

## Testing Strategy

### Unit Tests
- QueueManager FIFO behavior
- Notify mechanism correctness
- Queue stats accuracy
- Cancellation handling

### Integration Tests
- End-to-end execution ordering
- Multiple workers, one action
- Concurrent actions independent
- Stress test: 1000 concurrent enqueues

### Performance Tests
- Throughput with queuing enabled
- Latency impact of queuing
- Memory usage under load

## Migration & Rollout

### Phase 1: Deploy with Queue Disabled (Default)
- Deploy code with queue feature
- Queue disabled by default (concurrency_limit = None)
- Monitor for issues

### Phase 2: Enable for Select Actions
- Enable queue for specific high-concurrency actions
- Monitor ordering and performance
- Gather metrics

### Phase 3: Enable Globally
- Set default concurrency limits
- Enable queue for all actions
- Document behavior change

## Success Criteria

- [ ] All tests pass (unit, integration, performance)
- [ ] FIFO ordering guaranteed for same action
- [ ] Completion notification releases queue slot
- [ ] Queue stats API endpoint works
- [ ] Documentation complete
- [ ] No performance regression (< 5% latency increase)
- [ ] Zero race conditions under stress test

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Memory exhaustion | HIGH | max_queue_length config |
| Deadlock in notify | CRITICAL | Timeout on queue wait |
| Worker crash loses completion | MEDIUM | Executor timeout cleanup |
| Race in queue state | HIGH | Careful lock ordering |
| Performance regression | MEDIUM | Benchmark before/after |
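
The "Timeout on queue wait" mitigation could take the form of a periodic sweep that evicts entries that have waited longer than `queue_timeout_seconds`. The following std-only sketch shows the eviction step under that assumption; `sweep_expired` is a hypothetical helper, and it uses `Instant` where the plan's `QueueEntry` stores `DateTime<Utc>`.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct QueueEntry {
    execution_id: i64,
    enqueued_at: Instant,
}

/// Remove every entry that has waited at least `timeout` and return
/// the evicted execution_ids. Survivors keep their relative order.
fn sweep_expired(
    queue: &mut VecDeque<QueueEntry>,
    now: Instant,
    timeout: Duration,
) -> Vec<i64> {
    let mut expired = Vec::new();
    queue.retain(|entry| {
        let waited = now.saturating_duration_since(entry.enqueued_at);
        if waited >= timeout {
            expired.push(entry.execution_id);
            false // drop from the queue
        } else {
            true // keep waiting
        }
    });
    expired
}
```

Passing `now` in as a parameter keeps the function deterministic under test. An alternative is to wrap each individual wait in a timer (tokio provides `tokio::time::timeout` for this), in which case the waiter must still remove its own entry when the timer fires.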

## Timeline

- **Total Estimate**: 7.5 days (sum of the steps below)
- **Step 1 (QueueManager)**: 2 days
- **Step 2 (PolicyEnforcer)**: 1 day
- **Step 3 (EnforcementProcessor)**: 1 day
- **Step 4 (CompletionListener)**: 1 day
- **Step 5 (Worker updates)**: 0.5 day
- **Step 6 (API endpoint)**: 0.5 day
- **Step 7 (Integration tests)**: 1 day
- **Step 8 (Documentation)**: 0.5 day

## Next Steps

1. Review plan with team
2. Create `queue_manager.rs` with core data structures
3. Implement `enqueue_and_wait()` with tests
4. Integrate with policy enforcer
5. Continue with remaining steps

---

**Related Documents:**
- `work-summary/TODO.md` - Phase 0.1 task list
- `docs/architecture.md` - Overall system architecture
- `crates/executor/src/policy_enforcer.rs` - Current policy implementation