re-uploading work
This commit is contained in:
427
work-summary/phases/2025-01-policy-ordering-plan.md
Normal file
427
work-summary/phases/2025-01-policy-ordering-plan.md
Normal file
@@ -0,0 +1,427 @@
|
||||
# Policy Execution Ordering Implementation Plan
|
||||
|
||||
**Date**: 2025-01-XX
|
||||
**Status**: Planning
|
||||
**Priority**: P0 - BLOCKING (Critical Correctness)
|
||||
|
||||
## Problem Statement
|
||||
|
||||
Currently, when execution policies (concurrency limits, delays) are enforced, there is **no guaranteed ordering** for which executions proceed when slots become available. This leads to:
|
||||
|
||||
1. **Fairness Violations**: Later requests can execute before earlier ones
|
||||
2. **Non-deterministic Behavior**: Same workflow produces different orders across runs
|
||||
3. **Workflow Dependencies Break**: Parent executions may proceed after children
|
||||
4. **Poor User Experience**: Unpredictable queue behavior
|
||||
|
||||
### Current Flow (Broken)
|
||||
```
|
||||
Request A arrives → Policy blocks (concurrency=1, 1 running)
|
||||
Request B arrives → Policy blocks (concurrency=1, 1 running)
|
||||
Request C arrives → Policy blocks (concurrency=1, 1 running)
|
||||
Running execution completes
|
||||
→ A, B, or C might proceed (RANDOM, based on tokio scheduling)
|
||||
```
|
||||
|
||||
### Desired Flow (FIFO)
|
||||
```
|
||||
Request A arrives → Enqueued at position 0
|
||||
Request B arrives → Enqueued at position 1
|
||||
Request C arrives → Enqueued at position 2
|
||||
Running execution completes → Notify position 0 → A proceeds
|
||||
A completes → Notify position 1 → B proceeds
|
||||
B completes → Notify position 2 → C proceeds
|
||||
```
|
||||
|
||||
## Architecture Design
|
||||
|
||||
### 1. ExecutionQueueManager
|
||||
|
||||
A new component that manages FIFO queues per action and provides slot-based synchronization.
|
||||
|
||||
**Key Features:**
|
||||
- One queue per `action_id` (per-action concurrency control)
|
||||
- FIFO ordering guarantee using `VecDeque`
|
||||
- Tokio `Notify` for efficient async waiting
|
||||
- Thread-safe with `Arc<Mutex<>>` or `DashMap`
|
||||
- Queue statistics for monitoring
|
||||
|
||||
**Data Structures:**
|
||||
```rust
|
||||
struct QueueEntry {
|
||||
execution_id: i64,
|
||||
enqueued_at: DateTime<Utc>,
|
||||
notifier: Arc<Notify>,
|
||||
}
|
||||
|
||||
struct ActionQueue {
|
||||
queue: VecDeque<QueueEntry>,
|
||||
active_count: u32,
|
||||
max_concurrent: u32,
|
||||
}
|
||||
|
||||
struct ExecutionQueueManager {
|
||||
queues: DashMap<i64, ActionQueue>, // key: action_id
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Integration Points
|
||||
|
||||
#### A. EnforcementProcessor
|
||||
- **Before**: Directly creates execution and publishes to scheduler
|
||||
- **After**: Calls `queue_manager.enqueue_and_wait()` before creating execution
|
||||
- **Change**: Async wait until queue allows execution
|
||||
|
||||
#### B. PolicyEnforcer
|
||||
- **Before**: `wait_for_policy_compliance()` polls every 1 second
|
||||
- **After**: `enforce_and_wait()` combines policy check + queue wait
|
||||
- **Change**: More efficient, guaranteed ordering
|
||||
|
||||
#### C. ExecutionScheduler
|
||||
- **No Change**: Receives ExecutionRequested messages as before
|
||||
- **Note**: Queue happens before scheduling, not during
|
||||
|
||||
#### D. Worker → Executor Completion
|
||||
- **New**: Worker publishes `execution.completed` message
|
||||
- **New**: Executor's CompletionListener consumes these messages
|
||||
- **New**: CompletionListener calls `queue_manager.notify_completion(action_id)`
|
||||
|
||||
### 3. Message Flow
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ EnforcementProcessor │
|
||||
│ │
|
||||
│ 1. Receive enforcement.created │
|
||||
│ 2. queue_manager.enqueue_and_wait(action_id, execution_id) │
|
||||
│ ├─ Check policy compliance │
|
||||
│ ├─ Enqueue to action's FIFO queue │
|
||||
│ ├─ Wait on notifier if queue full │
|
||||
│ └─ Return when slot available │
|
||||
│ 3. Create execution record │
|
||||
│ 4. Publish execution.requested │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ ExecutionScheduler │
|
||||
│ │
|
||||
│ 5. Receive execution.requested │
|
||||
│ 6. Select worker │
|
||||
│ 7. Publish to worker queue │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Worker │
|
||||
│ │
|
||||
│ 8. Execute action │
|
||||
│ 9. Publish execution.completed (NEW) │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ CompletionListener (NEW) │
|
||||
│ │
|
||||
│ 10. Receive execution.completed │
|
||||
│ 11. queue_manager.notify_completion(action_id) │
|
||||
│ └─ Notify next waiter in queue │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## Implementation Steps
|
||||
|
||||
### Step 1: Create ExecutionQueueManager (2 days)
|
||||
|
||||
**Files to Create:**
|
||||
- `crates/executor/src/queue_manager.rs`
|
||||
|
||||
**Implementation:**
|
||||
```rust
|
||||
pub struct ExecutionQueueManager {
|
||||
queues: DashMap<i64, Arc<Mutex<ActionQueue>>>,
|
||||
}
|
||||
|
||||
impl ExecutionQueueManager {
|
||||
pub async fn enqueue_and_wait(
|
||||
&self,
|
||||
action_id: i64,
|
||||
execution_id: i64,
|
||||
max_concurrent: u32,
|
||||
) -> Result<()>;
|
||||
|
||||
pub async fn notify_completion(&self, action_id: i64) -> Result<()>;
|
||||
|
||||
pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats;
|
||||
|
||||
pub async fn cancel_execution(&self, execution_id: i64) -> Result<()>;
|
||||
}
|
||||
```
|
||||
|
||||
**Tests:**
|
||||
- FIFO ordering with 3 concurrent enqueues, limit=1
|
||||
- 1000 concurrent enqueues maintain order
|
||||
- Completion notification releases correct waiter
|
||||
- Multiple actions have independent queues
|
||||
- Cancel removes from queue correctly
|
||||
|
||||
### Step 2: Integrate with PolicyEnforcer (1 day)
|
||||
|
||||
**Files to Modify:**
|
||||
- `crates/executor/src/policy_enforcer.rs`
|
||||
|
||||
**Changes:**
|
||||
- Add `queue_manager: Arc<ExecutionQueueManager>` field
|
||||
- Create `enforce_and_wait()` method that combines:
|
||||
1. Policy compliance check
|
||||
2. Queue enqueue and wait
|
||||
- Keep existing `check_policies()` for validation
|
||||
|
||||
**Tests:**
|
||||
- Policy violation prevents queue entry
|
||||
- Policy pass allows queue entry
|
||||
- Queue respects concurrency limits
|
||||
|
||||
### Step 3: Update EnforcementProcessor (1 day)
|
||||
|
||||
**Files to Modify:**
|
||||
- `crates/executor/src/enforcement_processor.rs`
|
||||
|
||||
**Changes:**
|
||||
- Add `queue_manager: Arc<ExecutionQueueManager>` field
|
||||
- In `create_execution()`, before creating execution record:
|
||||
```rust
|
||||
// Get action's concurrency limit from policy
|
||||
let concurrency_limit = policy_enforcer
|
||||
.get_concurrency_limit(rule.action)
|
||||
.unwrap_or(u32::MAX);
|
||||
|
||||
// Wait for queue slot
|
||||
queue_manager
|
||||
.enqueue_and_wait(rule.action, enforcement.id, concurrency_limit)
|
||||
.await?;
|
||||
|
||||
// Now create execution (we have a slot)
|
||||
let execution = ExecutionRepository::create(pool, execution_input).await?;
|
||||
```
|
||||
|
||||
**Tests:**
|
||||
- Three executions with limit=1 execute in FIFO order
|
||||
- Queue blocks until slot available
|
||||
- Execution created only after queue allows
|
||||
|
||||
### Step 4: Create CompletionListener (1 day)
|
||||
|
||||
**Files to Create:**
|
||||
- `crates/executor/src/completion_listener.rs`
|
||||
|
||||
**Implementation:**
|
||||
- New component that consumes `execution.completed` messages
|
||||
- Calls `queue_manager.notify_completion(action_id)`
|
||||
- Updates execution status in database (if needed)
|
||||
- Publishes notifications
|
||||
|
||||
**Message Type:**
|
||||
```rust
|
||||
// In attune_common/mq/messages.rs
|
||||
pub struct ExecutionCompletedPayload {
|
||||
pub execution_id: i64,
|
||||
pub action_id: i64,
|
||||
pub status: ExecutionStatus,
|
||||
pub result: Option<JsonValue>,
|
||||
}
|
||||
```
|
||||
|
||||
**Tests:**
|
||||
- Completion message triggers queue notification
|
||||
- Correct action_id used for notification
|
||||
- Database status updated correctly
|
||||
|
||||
### Step 5: Update Worker to Publish Completions (0.5 day)
|
||||
|
||||
**Files to Modify:**
|
||||
- `crates/worker/src/executor.rs`
|
||||
|
||||
**Changes:**
|
||||
- After execution completes (success or failure), publish `execution.completed`
|
||||
- Include action_id in message payload
|
||||
- Use reliable publishing (ensure message is sent)
|
||||
|
||||
**Tests:**
|
||||
- Worker publishes on success
|
||||
- Worker publishes on failure
|
||||
- Worker publishes on timeout
|
||||
- Worker publishes on cancel
|
||||
|
||||
### Step 6: Add Queue Stats API Endpoint (0.5 day)
|
||||
|
||||
**Files to Modify:**
|
||||
- `crates/api/src/routes/actions.rs`
|
||||
|
||||
**New Endpoint:**
|
||||
```
|
||||
GET /api/v1/actions/:ref/queue-stats
|
||||
|
||||
Response:
|
||||
{
|
||||
"action_id": 123,
|
||||
"action_ref": "core.echo",
|
||||
"queue_length": 5,
|
||||
"active_count": 2,
|
||||
"max_concurrent": 3,
|
||||
"oldest_enqueued_at": "2025-01-15T10:30:00Z"
|
||||
}
|
||||
```
|
||||
|
||||
**Tests:**
|
||||
- Endpoint returns correct stats
|
||||
- Queue stats update in real-time
|
||||
- Non-existent action returns 404
|
||||
|
||||
### Step 7: Integration Testing (1 day)
|
||||
|
||||
**Test Scenarios:**
|
||||
1. **FIFO Ordering**: 10 executions, limit=1, verify order
|
||||
2. **Concurrent Actions**: Multiple actions don't interfere
|
||||
3. **High Concurrency**: 1000 simultaneous enqueues
|
||||
4. **Completion Handling**: Verify queue progresses on completion
|
||||
5. **Failure Scenarios**: Worker crash, timeout, cancel
|
||||
6. **Policy Integration**: Rate limit + queue interaction
|
||||
7. **API Stats**: Verify queue stats are accurate
|
||||
|
||||
**Files:**
|
||||
- `crates/executor/tests/queue_ordering_test.rs`
|
||||
- `crates/executor/tests/queue_stress_test.rs`
|
||||
|
||||
### Step 8: Documentation (0.5 day)
|
||||
|
||||
**Files to Create/Update:**
|
||||
- `docs/queue-architecture.md` - Queue design and behavior
|
||||
- `docs/api-actions.md` - Add queue-stats endpoint
|
||||
- `README.md` - Mention queue ordering guarantees
|
||||
|
||||
**Content:**
|
||||
- How queues work per action
|
||||
- FIFO guarantees
|
||||
- Monitoring queue stats
|
||||
- Performance characteristics
|
||||
- Troubleshooting queue issues
|
||||
|
||||
## API Changes
|
||||
|
||||
### New Endpoint
|
||||
- `GET /api/v1/actions/:ref/queue-stats` - View queue statistics
|
||||
|
||||
### Message Types
|
||||
- `execution.completed` (new) - Worker notifies completion
|
||||
|
||||
## Database Changes
|
||||
|
||||
**None required** - All queue state is in-memory
|
||||
|
||||
## Configuration
|
||||
|
||||
Add to `ExecutorConfig`:
|
||||
```yaml
|
||||
executor:
|
||||
queue:
|
||||
max_queue_length: 10000 # Per-action queue limit
|
||||
queue_timeout_seconds: 3600 # Max time in queue
|
||||
enable_queue_metrics: true
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
1. **Memory Usage**: O(n) per queued execution
|
||||
- Mitigation: `max_queue_length` config
|
||||
- Typical: 100-1000 queued per action
|
||||
|
||||
2. **Lock Contention**: DashMap per action reduces contention
|
||||
- Each action has independent lock
|
||||
- Notify uses efficient futex-based waiting
|
||||
|
||||
3. **Message Overhead**: One additional message per execution
|
||||
- `execution.completed` is lightweight
|
||||
- Published async, no blocking
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
- QueueManager FIFO behavior
|
||||
- Notify mechanism correctness
|
||||
- Queue stats accuracy
|
||||
- Cancellation handling
|
||||
|
||||
### Integration Tests
|
||||
- End-to-end execution ordering
|
||||
- Multiple workers, one action
|
||||
- Concurrent actions independent
|
||||
- Stress test: 1000 concurrent enqueues
|
||||
|
||||
### Performance Tests
|
||||
- Throughput with queuing enabled
|
||||
- Latency impact of queuing
|
||||
- Memory usage under load
|
||||
|
||||
## Migration & Rollout
|
||||
|
||||
### Phase 1: Deploy with Queue Disabled (Default)
|
||||
- Deploy code with queue feature
|
||||
- Queue disabled by default (concurrency_limit = None)
|
||||
- Monitor for issues
|
||||
|
||||
### Phase 2: Enable for Select Actions
|
||||
- Enable queue for specific high-concurrency actions
|
||||
- Monitor ordering and performance
|
||||
- Gather metrics
|
||||
|
||||
### Phase 3: Enable Globally
|
||||
- Set default concurrency limits
|
||||
- Enable queue for all actions
|
||||
- Document behavior change
|
||||
|
||||
## Success Criteria
|
||||
|
||||
- [ ] All tests pass (unit, integration, performance)
|
||||
- [ ] FIFO ordering guaranteed for same action
|
||||
- [ ] Completion notification releases queue slot
|
||||
- [ ] Queue stats API endpoint works
|
||||
- [ ] Documentation complete
|
||||
- [ ] No performance regression (< 5% latency increase)
|
||||
- [ ] Zero race conditions under stress test
|
||||
|
||||
## Risks & Mitigations
|
||||
|
||||
| Risk | Impact | Mitigation |
|
||||
|------|--------|------------|
|
||||
| Memory exhaustion | HIGH | max_queue_length config |
|
||||
| Deadlock in notify | CRITICAL | Timeout on queue wait |
|
||||
| Worker crash loses completion | MEDIUM | Executor timeout cleanup |
|
||||
| Race in queue state | HIGH | Careful lock ordering |
|
||||
| Performance regression | MEDIUM | Benchmark before/after |
|
||||
|
||||
## Timeline
|
||||
|
||||
- **Total Estimate**: 6-7 days
|
||||
- **Step 1 (QueueManager)**: 2 days
|
||||
- **Step 2 (PolicyEnforcer)**: 1 day
|
||||
- **Step 3 (EnforcementProcessor)**: 1 day
|
||||
- **Step 4 (CompletionListener)**: 1 day
|
||||
- **Step 5 (Worker updates)**: 0.5 day
|
||||
- **Step 6 (API endpoint)**: 0.5 day
|
||||
- **Step 7 (Integration tests)**: 1 day
|
||||
- **Step 8 (Documentation)**: 0.5 day
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. Review plan with team
|
||||
2. Create `queue_manager.rs` with core data structures
|
||||
3. Implement `enqueue_and_wait()` with tests
|
||||
4. Integrate with policy enforcer
|
||||
5. Continue with remaining steps
|
||||
|
||||
---
|
||||
|
||||
**Related Documents:**
|
||||
- `work-summary/TODO.md` - Phase 0.1 task list
|
||||
- `docs/architecture.md` - Overall system architecture
|
||||
- `crates/executor/src/policy_enforcer.rs` - Current policy implementation
|
||||
Reference in New Issue
Block a user