# Policy Execution Ordering Implementation Plan
**Date**: 2025-01-XX
**Status**: Planning
**Priority**: P0 - BLOCKING (Critical Correctness)
## Problem Statement
Currently, when execution policies (concurrency limits, delays) are enforced, there is **no guaranteed ordering** of which executions proceed when slots become available. This leads to:
1. **Fairness Violations**: Later requests can execute before earlier ones
2. **Non-deterministic Behavior**: Same workflow produces different orders across runs
3. **Workflow Dependencies Break**: Parent executions may proceed after children
4. **Poor User Experience**: Unpredictable queue behavior
### Current Flow (Broken)
```
Request A arrives → Policy blocks (concurrency=1, 1 running)
Request B arrives → Policy blocks (concurrency=1, 1 running)
Request C arrives → Policy blocks (concurrency=1, 1 running)
Running execution completes
→ A, B, or C might proceed (RANDOM, based on tokio scheduling)
```
### Desired Flow (FIFO)
```
Request A arrives → Enqueued at position 0
Request B arrives → Enqueued at position 1
Request C arrives → Enqueued at position 2
Running execution completes → Notify position 0 → A proceeds
A completes → Notify position 1 → B proceeds
B completes → Notify position 2 → C proceeds
```
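The desired discipline above can be modeled as a single-threaded sketch: a FIFO of waiting requests plus an active count, where each completion hands the freed slot to the head of the queue. Names and the string IDs are illustrative only.

```rust
use std::collections::VecDeque;

/// Minimal model of one per-action queue: waiting requests in arrival
/// order, plus a count of currently running executions.
struct ActionQueue {
    waiting: VecDeque<&'static str>,
    active: u32,
    max_concurrent: u32,
}

impl ActionQueue {
    fn new(max_concurrent: u32) -> Self {
        Self { waiting: VecDeque::new(), active: 0, max_concurrent }
    }

    /// A request either starts immediately (slot free) or joins the FIFO tail.
    /// Returns true when it started, false when it queued.
    fn request(&mut self, id: &'static str) -> bool {
        if self.active < self.max_concurrent {
            self.active += 1;
            true
        } else {
            self.waiting.push_back(id);
            false
        }
    }

    /// On completion, hand the freed slot to the head of the queue.
    fn complete(&mut self) -> Option<&'static str> {
        self.active = self.active.saturating_sub(1);
        let next = self.waiting.pop_front();
        if next.is_some() {
            self.active += 1;
        }
        next
    }
}

fn main() {
    let mut q = ActionQueue::new(1);
    assert!(q.request("running")); // occupies the single slot
    q.request("A");
    q.request("B");
    q.request("C");
    // Completions release waiters strictly in arrival order.
    assert_eq!(q.complete(), Some("A"));
    assert_eq!(q.complete(), Some("B"));
    assert_eq!(q.complete(), Some("C"));
    println!("FIFO order preserved");
}
```

The real implementation replaces the synchronous `request`/`complete` pair with an async wait, but the slot accounting is the same.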
## Architecture Design
### 1. ExecutionQueueManager
A new component that manages FIFO queues per action and provides slot-based synchronization.
**Key Features:**
- One queue per `action_id` (per-action concurrency control)
- FIFO ordering guarantee using `VecDeque`
- Tokio `Notify` for efficient async waiting
- Thread-safe with `Arc<Mutex<>>` or `DashMap`
- Queue statistics for monitoring
**Data Structures:**
```rust
struct QueueEntry {
execution_id: i64,
enqueued_at: DateTime<Utc>,
notifier: Arc<Notify>,
}
struct ActionQueue {
queue: VecDeque<QueueEntry>,
active_count: u32,
max_concurrent: u32,
}
struct ExecutionQueueManager {
queues: DashMap<i64, ActionQueue>, // key: action_id
}
```
### 2. Integration Points
#### A. EnforcementProcessor
- **Before**: Directly creates execution and publishes to scheduler
- **After**: Calls `queue_manager.enqueue_and_wait()` before creating execution
- **Change**: Async wait until queue allows execution
#### B. PolicyEnforcer
- **Before**: `wait_for_policy_compliance()` polls every second
- **After**: `enforce_and_wait()` combines policy check + queue wait
- **Change**: More efficient, guaranteed ordering
#### C. ExecutionScheduler
- **No Change**: Receives ExecutionRequested messages as before
- **Note**: Queue happens before scheduling, not during
#### D. Worker → Executor Completion
- **New**: Worker publishes `execution.completed` message
- **New**: Executor's CompletionListener consumes these messages
- **New**: CompletionListener calls `queue_manager.notify_completion(action_id)`
### 3. Message Flow
```
┌─────────────────────────────────────────────────────────────────┐
│ EnforcementProcessor │
│ │
│ 1. Receive enforcement.created │
│ 2. queue_manager.enqueue_and_wait(action_id, execution_id) │
│ ├─ Check policy compliance │
│ ├─ Enqueue to action's FIFO queue │
│ ├─ Wait on notifier if queue full │
│ └─ Return when slot available │
│ 3. Create execution record │
│ 4. Publish execution.requested │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ ExecutionScheduler │
│ │
│ 5. Receive execution.requested │
│ 6. Select worker │
│ 7. Publish to worker queue │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Worker │
│ │
│ 8. Execute action │
│ 9. Publish execution.completed (NEW) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ CompletionListener (NEW) │
│ │
│ 10. Receive execution.completed │
│ 11. queue_manager.notify_completion(action_id) │
│ └─ Notify next waiter in queue │
└─────────────────────────────────────────────────────────────────┘
```
## Implementation Steps
### Step 1: Create ExecutionQueueManager (2 days)
**Files to Create:**
- `crates/executor/src/queue_manager.rs`
**Implementation:**
```rust
pub struct ExecutionQueueManager {
queues: DashMap<i64, Arc<Mutex<ActionQueue>>>,
}
impl ExecutionQueueManager {
pub async fn enqueue_and_wait(
&self,
action_id: i64,
execution_id: i64,
max_concurrent: u32,
) -> Result<()>;
pub async fn notify_completion(&self, action_id: i64) -> Result<()>;
pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats;
pub async fn cancel_execution(&self, execution_id: i64) -> Result<()>;
}
```
**Tests:**
- FIFO ordering with 3 concurrent enqueues, limit=1
- 1000 concurrent enqueues maintain order
- Completion notification releases correct waiter
- Multiple actions have independent queues
- Cancel removes from queue correctly
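The skeleton above depends on tokio and DashMap; its slot discipline can be sketched with std primitives alone. This is a synchronous stand-in: `try_acquire` plays the role of the non-blocking half of `enqueue_and_wait` (the async version would park on a per-entry `Notify` instead of returning `false`), and method semantics otherwise follow the skeleton.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;

struct ActionQueue {
    waiting: VecDeque<i64>, // execution_ids in arrival order
    active_count: u32,
    max_concurrent: u32,
}

pub struct ExecutionQueueManager {
    queues: Mutex<HashMap<i64, ActionQueue>>, // key: action_id
}

impl ExecutionQueueManager {
    pub fn new() -> Self {
        Self { queues: Mutex::new(HashMap::new()) }
    }

    /// Claim a slot if one is free; otherwise join the FIFO tail.
    /// Returns true when the execution may proceed immediately.
    pub fn try_acquire(&self, action_id: i64, execution_id: i64, max_concurrent: u32) -> bool {
        let mut queues = self.queues.lock().unwrap();
        let q = queues.entry(action_id).or_insert_with(|| ActionQueue {
            waiting: VecDeque::new(),
            active_count: 0,
            max_concurrent,
        });
        if q.active_count < q.max_concurrent {
            q.active_count += 1;
            true
        } else {
            q.waiting.push_back(execution_id);
            false
        }
    }

    /// Release a slot and promote the oldest waiter, if any.
    pub fn notify_completion(&self, action_id: i64) -> Option<i64> {
        let mut queues = self.queues.lock().unwrap();
        let q = queues.get_mut(&action_id)?;
        q.active_count = q.active_count.saturating_sub(1);
        let next = q.waiting.pop_front();
        if next.is_some() {
            q.active_count += 1;
        }
        next
    }
}

fn main() {
    let mgr = ExecutionQueueManager::new();
    // limit=1: the first execution runs, the next two queue in FIFO order
    assert!(mgr.try_acquire(1, 100, 1));
    assert!(!mgr.try_acquire(1, 101, 1));
    assert!(!mgr.try_acquire(1, 102, 1));
    // a different action_id has an independent queue
    assert!(mgr.try_acquire(2, 200, 1));
    // completions promote waiters strictly in arrival order
    assert_eq!(mgr.notify_completion(1), Some(101));
    assert_eq!(mgr.notify_completion(1), Some(102));
    assert_eq!(mgr.notify_completion(1), None);
}
```

A single `Mutex` over the whole map serializes all actions; this is exactly the contention DashMap avoids in the real design.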
### Step 2: Integrate with PolicyEnforcer (1 day)
**Files to Modify:**
- `crates/executor/src/policy_enforcer.rs`
**Changes:**
- Add `queue_manager: Arc<ExecutionQueueManager>` field
- Create `enforce_and_wait()` method that combines:
1. Policy compliance check
2. Queue enqueue and wait
- Keep existing `check_policies()` for validation
**Tests:**
- Policy violation prevents queue entry
- Policy pass allows queue entry
- Queue respects concurrency limits
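The composition order in `enforce_and_wait()` is the point of the first test above: a policy violation must short-circuit before the execution ever touches the queue. A minimal sketch, with the policy check and queue stubbed out for illustration:

```rust
#[derive(Debug, PartialEq)]
enum PolicyDecision {
    Allow,
    Deny(&'static str),
}

/// Stub standing in for the existing check_policies().
fn check_policies(action_id: i64) -> PolicyDecision {
    if action_id > 0 { PolicyDecision::Allow } else { PolicyDecision::Deny("unknown action") }
}

/// Policy check first, queue entry second: denied requests never hold a slot.
fn enforce_and_wait(action_id: i64, queue: &mut Vec<i64>, execution_id: i64) -> Result<(), &'static str> {
    if let PolicyDecision::Deny(reason) = check_policies(action_id) {
        return Err(reason); // short-circuit: never enqueued
    }
    // In the real code: queue_manager.enqueue_and_wait(action_id, execution_id, limit).await?
    queue.push(execution_id);
    Ok(())
}

fn main() {
    let mut queue = Vec::new();
    assert!(enforce_and_wait(42, &mut queue, 1).is_ok());
    assert!(enforce_and_wait(-1, &mut queue, 2).is_err());
    assert_eq!(queue, vec![1]); // the denied execution never entered the queue
}
```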
### Step 3: Update EnforcementProcessor (1 day)
**Files to Modify:**
- `crates/executor/src/enforcement_processor.rs`
**Changes:**
- Add `queue_manager: Arc<ExecutionQueueManager>` field
- In `create_execution()`, before creating execution record:
```rust
// Get action's concurrency limit from policy
let concurrency_limit = policy_enforcer
.get_concurrency_limit(rule.action)
.unwrap_or(u32::MAX);
// Wait for queue slot
queue_manager
.enqueue_and_wait(rule.action, enforcement.id, concurrency_limit)
.await?;
// Now create execution (we have a slot)
let execution = ExecutionRepository::create(pool, execution_input).await?;
```
**Tests:**
- Three executions with limit=1 execute in FIFO order
- Queue blocks until slot available
- Execution created only after queue allows
### Step 4: Create CompletionListener (1 day)
**Files to Create:**
- `crates/executor/src/completion_listener.rs`
**Implementation:**
- New component that consumes `execution.completed` messages
- Calls `queue_manager.notify_completion(action_id)`
- Updates execution status in database (if needed)
- Publishes notifications
**Message Type:**
```rust
// In attune_common/mq/messages.rs
pub struct ExecutionCompletedPayload {
pub execution_id: i64,
pub action_id: i64,
pub status: ExecutionStatus,
pub result: Option<JsonValue>,
}
```
**Tests:**
- Completion message triggers queue notification
- Correct action_id used for notification
- Database status updated correctly
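The listener's essential step can be sketched as follows. The payload struct mirrors `ExecutionCompletedPayload` (the `result` field is omitted to stay stdlib-only), and the vector stands in for the queue manager; the invariant shown is that the notification fires for every status, so failed executions also free their slot.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ExecutionStatus { Succeeded, Failed }

/// Local mirror of ExecutionCompletedPayload (result field omitted).
struct ExecutionCompletedPayload {
    execution_id: i64,
    action_id: i64,
    status: ExecutionStatus,
}

/// Core listener step: notify the queue with the payload's action_id
/// regardless of status.
fn handle_completion(payload: &ExecutionCompletedPayload, notified: &mut Vec<i64>) {
    // In the real code: queue_manager.notify_completion(payload.action_id).await
    notified.push(payload.action_id);
    let _ = (payload.execution_id, payload.status); // db status update elided
}

fn main() {
    let mut notified = Vec::new();
    let ok = ExecutionCompletedPayload { execution_id: 1, action_id: 42, status: ExecutionStatus::Succeeded };
    let failed = ExecutionCompletedPayload { execution_id: 2, action_id: 42, status: ExecutionStatus::Failed };
    handle_completion(&ok, &mut notified);
    handle_completion(&failed, &mut notified);
    assert_eq!(notified, vec![42, 42]); // both outcomes released a slot
}
```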
### Step 5: Update Worker to Publish Completions (0.5 day)
**Files to Modify:**
- `crates/worker/src/executor.rs`
**Changes:**
- After execution completes (success or failure), publish `execution.completed`
- Include action_id in message payload
- Use reliable publishing (ensure message is sent)
**Tests:**
- Worker publishes on success
- Worker publishes on failure
- Worker publishes on timeout
- Worker publishes on cancel
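The four test cases above all check one invariant: every exit path publishes `execution.completed`, because a missed publish permanently leaks the action's queue slot. A sketch with a recording stub in place of the real MQ publisher:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome { Success, Failure, Timeout, Cancelled }

/// Stub publisher that records messages instead of hitting the MQ.
struct Publisher {
    sent: Vec<(i64, Outcome)>,
}

impl Publisher {
    fn publish_completed(&mut self, action_id: i64, outcome: Outcome) {
        self.sent.push((action_id, outcome));
    }
}

/// Every exit path publishes; otherwise the queue slot is never released.
fn finish_execution(publisher: &mut Publisher, action_id: i64, outcome: Outcome) {
    // ... status update, result persistence, etc. would happen here ...
    publisher.publish_completed(action_id, outcome); // unconditional
}

fn main() {
    let mut p = Publisher { sent: Vec::new() };
    for outcome in [Outcome::Success, Outcome::Failure, Outcome::Timeout, Outcome::Cancelled] {
        finish_execution(&mut p, 7, outcome);
    }
    assert_eq!(p.sent.len(), 4); // one message per outcome, no exceptions
}
```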
### Step 6: Add Queue Stats API Endpoint (0.5 day)
**Files to Modify:**
- `crates/api/src/routes/actions.rs`
**New Endpoint:**
```
GET /api/v1/actions/:ref/queue-stats
Response:
{
"action_id": 123,
"action_ref": "core.echo",
"queue_length": 5,
"active_count": 2,
"max_concurrent": 3,
"oldest_enqueued_at": "2025-01-15T10:30:00Z"
}
```
**Tests:**
- Endpoint returns correct stats
- Queue stats update in real-time
- Non-existent action returns 404
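A response struct mirroring the body above might look like the following sketch (the real handler would derive `serde::Serialize` rather than checking fields by hand; the consistency helper is an assumption, not existing API):

```rust
/// Mirrors the queue-stats response body; serialization elided.
struct QueueStats {
    action_id: i64,
    action_ref: &'static str,
    queue_length: usize,
    active_count: u32,
    max_concurrent: u32,
}

impl QueueStats {
    /// Sanity check the endpoint can enforce before responding:
    /// active executions never exceed the configured limit.
    fn is_consistent(&self) -> bool {
        self.active_count <= self.max_concurrent
    }
}

fn main() {
    let stats = QueueStats {
        action_id: 123,
        action_ref: "core.echo",
        queue_length: 5,
        active_count: 2,
        max_concurrent: 3,
    };
    assert!(stats.is_consistent());
    println!(
        "{}: {} queued, {}/{} active",
        stats.action_ref, stats.queue_length, stats.active_count, stats.max_concurrent
    );
}
```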
### Step 7: Integration Testing (1 day)
**Test Scenarios:**
1. **FIFO Ordering**: 10 executions, limit=1, verify order
2. **Concurrent Actions**: Multiple actions don't interfere
3. **High Concurrency**: 1000 simultaneous enqueues
4. **Completion Handling**: Verify queue progresses on completion
5. **Failure Scenarios**: Worker crash, timeout, cancel
6. **Policy Integration**: Rate limit + queue interaction
7. **API Stats**: Verify queue stats are accurate
**Files:**
- `crates/executor/tests/queue_ordering_test.rs`
- `crates/executor/tests/queue_stress_test.rs`
### Step 8: Documentation (0.5 day)
**Files to Create/Update:**
- `docs/queue-architecture.md` - Queue design and behavior
- `docs/api-actions.md` - Add queue-stats endpoint
- `README.md` - Mention queue ordering guarantees
**Content:**
- How queues work per action
- FIFO guarantees
- Monitoring queue stats
- Performance characteristics
- Troubleshooting queue issues
## API Changes
### New Endpoint
- `GET /api/v1/actions/:ref/queue-stats` - View queue statistics
### Message Types
- `execution.completed` (new) - Worker notifies completion
## Database Changes
**None required** - All queue state is in-memory
## Configuration
Add to `ExecutorConfig`:
```yaml
executor:
queue:
max_queue_length: 10000 # Per-action queue limit
queue_timeout_seconds: 3600 # Max time in queue
enable_queue_metrics: true
```
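A corresponding config struct could take this shape, with `Default` matching the YAML values above (field names are assumptions about `ExecutorConfig`'s layout):

```rust
/// Sketch of the queue section of ExecutorConfig.
#[derive(Debug, PartialEq)]
struct QueueConfig {
    max_queue_length: usize,     // per-action queue limit
    queue_timeout_seconds: u64,  // max time in queue
    enable_queue_metrics: bool,
}

impl Default for QueueConfig {
    fn default() -> Self {
        Self {
            max_queue_length: 10_000,
            queue_timeout_seconds: 3_600,
            enable_queue_metrics: true,
        }
    }
}

fn main() {
    let cfg = QueueConfig::default();
    assert_eq!(cfg.max_queue_length, 10_000);
    assert_eq!(cfg.queue_timeout_seconds, 3_600);
    assert!(cfg.enable_queue_metrics);
}
```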
## Performance Considerations
1. **Memory Usage**: O(n) per queued execution
- Mitigation: `max_queue_length` config
- Typical: 100-1000 queued per action
2. **Lock Contention**: DashMap per action reduces contention
- Each action has independent lock
- Notify provides efficient async wakeups without polling
3. **Message Overhead**: One additional message per execution
- `execution.completed` is lightweight
- Published async, no blocking
## Testing Strategy
### Unit Tests
- QueueManager FIFO behavior
- Notify mechanism correctness
- Queue stats accuracy
- Cancellation handling
### Integration Tests
- End-to-end execution ordering
- Multiple workers, one action
- Concurrent actions independent
- Stress test: 1000 concurrent enqueues
### Performance Tests
- Throughput with queuing enabled
- Latency impact of queuing
- Memory usage under load
## Migration & Rollout
### Phase 1: Deploy with Queue Disabled (Default)
- Deploy code with queue feature
- Queue disabled by default (concurrency_limit = None)
- Monitor for issues
### Phase 2: Enable for Select Actions
- Enable queue for specific high-concurrency actions
- Monitor ordering and performance
- Gather metrics
### Phase 3: Enable Globally
- Set default concurrency limits
- Enable queue for all actions
- Document behavior change
## Success Criteria
- [ ] All tests pass (unit, integration, performance)
- [ ] FIFO ordering guaranteed for same action
- [ ] Completion notification releases queue slot
- [ ] Queue stats API endpoint works
- [ ] Documentation complete
- [ ] No performance regression (< 5% latency increase)
- [ ] Zero race conditions under stress test
## Risks & Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| Memory exhaustion | HIGH | max_queue_length config |
| Deadlock in notify | CRITICAL | Timeout on queue wait |
| Worker crash loses completion | MEDIUM | Executor timeout cleanup |
| Race in queue state | HIGH | Careful lock ordering |
| Performance regression | MEDIUM | Benchmark before/after |
## Timeline
- **Total Estimate**: 6-7 days
- **Step 1 (QueueManager)**: 2 days
- **Step 2 (PolicyEnforcer)**: 1 day
- **Step 3 (EnforcementProcessor)**: 1 day
- **Step 4 (CompletionListener)**: 1 day
- **Step 5 (Worker updates)**: 0.5 day
- **Step 6 (API endpoint)**: 0.5 day
- **Step 7 (Integration tests)**: 1 day
- **Step 8 (Documentation)**: 0.5 day
## Next Steps
1. Review plan with team
2. Create `queue_manager.rs` with core data structures
3. Implement `enqueue_and_wait()` with tests
4. Integrate with policy enforcer
5. Continue with remaining steps
---
**Related Documents:**
- `work-summary/TODO.md` - Phase 0.1 task list
- `docs/architecture.md` - Overall system architecture
- `crates/executor/src/policy_enforcer.rs` - Current policy implementation