re-uploading work

2026-02-04 17:46:30 -06:00
commit 3b14c65998
1388 changed files with 381262 additions and 0 deletions
--- a/work-summary/phases/2025-01-policy-ordering-plan.md
+++ b/work-summary/phases/2025-01-policy-ordering-plan.md
@@ -0,0 +1,427 @@
+# Policy Execution Ordering Implementation Plan
+
+**Date**: 2025-01-XX  
+**Status**: Planning  
+**Priority**: P0 - BLOCKING (Critical Correctness)
+
+## Problem Statement
+
+Currently, when execution policies (concurrency limits, delays) are enforced, there is **no guaranteed ordering** for which executions proceed when slots become available. This leads to:
+
+1. **Fairness Violations**: Later requests can execute before earlier ones
+2. **Non-deterministic Behavior**: Same workflow produces different orders across runs
+3. **Workflow Dependencies Break**: Parent executions may proceed after children
+4. **Poor User Experience**: Unpredictable queue behavior
+
+### Current Flow (Broken)
+```
+Request A arrives → Policy blocks (concurrency=1, 1 running)
+Request B arrives → Policy blocks (concurrency=1, 1 running)
+Request C arrives → Policy blocks (concurrency=1, 1 running)
+Running execution completes
+→ A, B, or C might proceed (arbitrary; depends on tokio task scheduling)
+```
+
+### Desired Flow (FIFO)
+```
+Request A arrives → Enqueued at position 0
+Request B arrives → Enqueued at position 1
+Request C arrives → Enqueued at position 2
+Running execution completes → Notify position 0 → A proceeds
+A completes → Notify position 1 → B proceeds
+B completes → Notify position 2 → C proceeds
+```
+
+## Architecture Design
+
+### 1. ExecutionQueueManager
+
+A new component that manages FIFO queues per action and provides slot-based synchronization.
+
+**Key Features:**
+- One queue per `action_id` (per-action concurrency control)
+- FIFO ordering guarantee using `VecDeque`
+- Tokio `Notify` for efficient async waiting
+- Thread-safe with `Arc<Mutex<>>` or `DashMap`
+- Queue statistics for monitoring
+
+**Data Structures:**
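+```rust
+use std::collections::VecDeque;
+use std::sync::Arc;
+
+use chrono::{DateTime, Utc};
+use dashmap::DashMap;
+use tokio::sync::Notify;
+
+/// One waiting execution and the handle used to wake it.
+struct QueueEntry {
+    execution_id: i64,
+    enqueued_at: DateTime<Utc>,
+    notifier: Arc<Notify>,
+}
+
+/// FIFO queue and slot accounting for a single action.
+struct ActionQueue {
+    queue: VecDeque<QueueEntry>,
+    active_count: u32,
+    max_concurrent: u32,
+}
+
+/// Owns one `ActionQueue` per action.
+struct ExecutionQueueManager {
+    queues: DashMap<i64, ActionQueue>, // key: action_id
+}
+```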
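
As a minimal sketch of the slot accounting these structures imply, the block below models a single action's queue using only the standard library. It deliberately leaves out `Notify`, `DashMap`, and timestamps, and the method names `admit` and `complete` are hypothetical, not part of the plan.

```rust
use std::collections::VecDeque;

/// Simplified per-action state: waiting execution ids plus slot accounting.
#[derive(Debug)]
pub struct ActionQueue {
    pub waiting: VecDeque<i64>,
    pub active_count: u32,
    pub max_concurrent: u32,
}

impl ActionQueue {
    pub fn new(max_concurrent: u32) -> Self {
        Self { waiting: VecDeque::new(), active_count: 0, max_concurrent }
    }

    /// Returns true if the execution may start now; otherwise it is queued.
    pub fn admit(&mut self, execution_id: i64) -> bool {
        // Bypass the queue only when nobody is waiting, so a newcomer
        // can never overtake an earlier request.
        if self.waiting.is_empty() && self.active_count < self.max_concurrent {
            self.active_count += 1;
            true
        } else {
            self.waiting.push_back(execution_id);
            false
        }
    }

    /// Called when a running execution finishes.
    /// Returns the id of the execution that should be woken, if any.
    pub fn complete(&mut self) -> Option<i64> {
        self.active_count = self.active_count.saturating_sub(1);
        let next = self.waiting.pop_front();
        if next.is_some() {
            // Hand the freed slot directly to the head of the queue.
            self.active_count += 1;
        }
        next
    }
}
```

Handing the slot to the head inside `complete` is one way to keep a late arrival from grabbing the slot between the completion and the wake-up; the real implementation may choose a different mechanism.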
+
+### 2. Integration Points
+
+#### A. EnforcementProcessor
+- **Before**: Directly creates execution and publishes to scheduler
+- **After**: Calls `queue_manager.enqueue_and_wait()` before creating execution
+- **Change**: Async wait until queue allows execution
+
+#### B. PolicyEnforcer
+- **Before**: `wait_for_policy_compliance()` polls every 1 second
+- **After**: `enforce_and_wait()` combines policy check + queue wait
+- **Change**: More efficient, guaranteed ordering
+
+#### C. ExecutionScheduler
+- **No Change**: Receives ExecutionRequested messages as before
+- **Note**: Queuing happens before scheduling, not during it
+
+#### D. Worker → Executor Completion
+- **New**: Worker publishes `execution.completed` message
+- **New**: Executor's CompletionListener consumes these messages
+- **New**: CompletionListener calls `queue_manager.notify_completion(action_id)`
+
+### 3. Message Flow
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│ EnforcementProcessor                                             │
+│                                                                  │
+│  1. Receive enforcement.created                                 │
+│  2. queue_manager.enqueue_and_wait(action_id, execution_id)     │
+│     ├─ Check policy compliance                                  │
+│     ├─ Enqueue to action's FIFO queue                           │
+│     ├─ Wait on notifier if no slot free                         │
+│     └─ Return when slot available                               │
+│  3. Create execution record                                     │
+│  4. Publish execution.requested                                 │
+└─────────────────────────────────────────────────────────────────┘
+                                  │
+                                  ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ ExecutionScheduler                                              │
+│                                                                  │
+│  5. Receive execution.requested                                 │
+│  6. Select worker                                               │
+│  7. Publish to worker queue                                     │
+└─────────────────────────────────────────────────────────────────┘
+                                  │
+                                  ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ Worker                                                          │
+│                                                                  │
+│  8. Execute action                                              │
+│  9. Publish execution.completed (NEW)                           │
+└─────────────────────────────────────────────────────────────────┘
+                                  │
+                                  ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ CompletionListener (NEW)                                        │
+│                                                                  │
+│ 10. Receive execution.completed                                 │
+│ 11. queue_manager.notify_completion(action_id)                  │
+│     └─ Notify next waiter in queue                              │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## Implementation Steps
+
+### Step 1: Create ExecutionQueueManager (2 days)
+
+**Files to Create:**
+- `crates/executor/src/queue_manager.rs`
+
+**Implementation:**
+```rust
+pub struct ExecutionQueueManager {
+    queues: DashMap<i64, Arc<Mutex<ActionQueue>>>,
+}
+
+impl ExecutionQueueManager {
+    pub async fn enqueue_and_wait(
+        &self,
+        action_id: i64,
+        execution_id: i64,
+        max_concurrent: u32,
+    ) -> Result<()>;
+    
+    pub async fn notify_completion(&self, action_id: i64) -> Result<()>;
+    
+    pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats;
+    
+    pub async fn cancel_execution(&self, execution_id: i64) -> Result<()>;
+}
+```
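
To make the intended semantics of `enqueue_and_wait` and `notify_completion` concrete, here is a blocking, single-action analog built on `std::sync::Mutex` and `Condvar`. It is a sketch under stated assumptions, not the planned implementation: the plan calls for async waiting with tokio `Notify`, the type name `BlockingFifoSlots` is hypothetical, and execution ids are assumed to be unique.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

struct State {
    waiting: VecDeque<i64>,
    active: u32,
}

/// Blocking stand-in for one action's queue (hypothetical name).
pub struct BlockingFifoSlots {
    state: Mutex<State>,
    changed: Condvar,
    max_concurrent: u32,
}

impl BlockingFifoSlots {
    pub fn new(max_concurrent: u32) -> Self {
        Self {
            state: Mutex::new(State { waiting: VecDeque::new(), active: 0 }),
            changed: Condvar::new(),
            max_concurrent,
        }
    }

    /// Joins the back of the queue, then blocks until this execution is at
    /// the head of the queue AND a slot is free.
    pub fn enqueue_and_wait(&self, execution_id: i64) {
        let mut st = self.state.lock().unwrap();
        st.waiting.push_back(execution_id);
        while st.waiting.front() != Some(&execution_id) || st.active >= self.max_concurrent {
            // The loop re-checks the condition, so spurious wake-ups are harmless.
            st = self.changed.wait(st).unwrap();
        }
        st.waiting.pop_front();
        st.active += 1;
        // The new head may also be runnable when more than one slot is free.
        self.changed.notify_all();
    }

    /// Frees one slot and wakes the waiters so the head can proceed.
    pub fn notify_completion(&self) {
        let mut st = self.state.lock().unwrap();
        st.active = st.active.saturating_sub(1);
        self.changed.notify_all();
    }

    /// Number of executions currently waiting.
    pub fn queue_length(&self) -> usize {
        self.state.lock().unwrap().waiting.len()
    }
}
```

Waking every waiter and letting only the head proceed is simple but costs O(n) wake-ups per completion. A per-entry notifier, as in the plan's `QueueEntry`, wakes only the execution at the head.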
+
+**Tests:**
+- FIFO ordering with 3 concurrent enqueues, limit=1
+- 1000 concurrent enqueues maintain order
+- Completion notification releases correct waiter
+- Multiple actions have independent queues
+- Cancel removes from queue correctly
+
+### Step 2: Integrate with PolicyEnforcer (1 day)
+
+**Files to Modify:**
+- `crates/executor/src/policy_enforcer.rs`
+
+**Changes:**
+- Add `queue_manager: Arc<ExecutionQueueManager>` field
+- Create `enforce_and_wait()` method that combines:
+  1. Policy compliance check
+  2. Queue enqueue and wait
+- Keep existing `check_policies()` for validation
+
+**Tests:**
+- Policy violation prevents queue entry
+- Policy pass allows queue entry
+- Queue respects concurrency limits
+
+### Step 3: Update EnforcementProcessor (1 day)
+
+**Files to Modify:**
+- `crates/executor/src/enforcement_processor.rs`
+
+**Changes:**
+- Add `queue_manager: Arc<ExecutionQueueManager>` field
+- In `create_execution()`, before creating execution record:
+  ```rust
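+  // Get the action's concurrency limit from policy.
+  // No configured limit is treated as effectively unbounded.
+  let concurrency_limit = policy_enforcer
+      .get_concurrency_limit(rule.action)
+      .unwrap_or(u32::MAX);
+  
+  // Wait for a queue slot. The execution record does not exist yet, so
+  // `enforcement.id` is passed where `enqueue_and_wait` expects `execution_id`.
+  queue_manager
+      .enqueue_and_wait(rule.action, enforcement.id, concurrency_limit)
+      .await?;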
+  
+  // Now create execution (we have a slot)
+  let execution = ExecutionRepository::create(pool, execution_input).await?;
+  ```
+
+**Tests:**
+- Three executions with limit=1 execute in FIFO order
+- Queue blocks until slot available
+- Execution created only after queue allows
+
+### Step 4: Create CompletionListener (1 day)
+
+**Files to Create:**
+- `crates/executor/src/completion_listener.rs`
+
+**Implementation:**
+- New component that consumes `execution.completed` messages
+- Calls `queue_manager.notify_completion(action_id)`
+- Updates execution status in database (if needed)
+- Publishes notifications
+
+**Message Type:**
+```rust
+// In attune_common/mq/messages.rs
+pub struct ExecutionCompletedPayload {
+    pub execution_id: i64,
+    pub action_id: i64,
+    pub status: ExecutionStatus,
+    pub result: Option<JsonValue>,
+}
+```
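
The listener's core loop can be sketched with a standard-library channel standing in for the message queue. Everything here is an assumption made for illustration: the `ExecutionStatus` variants, the trimmed-down `ExecutionCompleted` struct, and the `run_listener` function are hypothetical, and the real component would consume from the MQ and call `queue_manager.notify_completion(action_id)`.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical terminal statuses; the real enum lives elsewhere in the codebase.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ExecutionStatus {
    Completed,
    Failed,
    TimedOut,
    Cancelled,
}

/// Trimmed-down stand-in for `ExecutionCompletedPayload` (no `result` field).
pub struct ExecutionCompleted {
    pub execution_id: i64,
    pub action_id: i64,
    pub status: ExecutionStatus,
}

/// Consumes completion messages until every sender is dropped.
/// `notify_completion` stands in for the queue manager call.
/// Returns the number of messages handled.
pub fn run_listener<F: FnMut(i64)>(rx: Receiver<ExecutionCompleted>, mut notify_completion: F) -> usize {
    let mut handled = 0;
    for msg in rx {
        // Every terminal status frees a slot, not only success.
        notify_completion(msg.action_id);
        handled += 1;
    }
    handled
}
```

The loop notifies on every message regardless of `status`, since a failed, timed-out, or cancelled execution releases its slot just as a successful one does.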
+
+**Tests:**
+- Completion message triggers queue notification
+- Correct action_id used for notification
+- Database status updated correctly
+
+### Step 5: Update Worker to Publish Completions (0.5 day)
+
+**Files to Modify:**
+- `crates/worker/src/executor.rs`
+
+**Changes:**
+- After execution completes (success or failure), publish `execution.completed`
+- Include action_id in message payload
+- Use reliable publishing (ensure message is sent)
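
One possible way to publish on every exit path is a drop guard, sketched below with a standard-library channel in place of the real publisher. The names `CompletionGuard`, `Completion`, `Outcome`, and `run_action` are hypothetical. A guard like this covers early returns and unwinding panics, but not a worker process crash, which the plan handles separately through executor timeout cleanup.

```rust
use std::sync::mpsc::Sender;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Succeeded,
    Failed,
    TimedOut,
    Cancelled,
}

#[derive(Debug, PartialEq)]
pub struct Completion {
    pub execution_id: i64,
    pub action_id: i64,
    pub outcome: Outcome,
}

/// Publishes exactly one completion when dropped, whichever path leaves the scope.
pub struct CompletionGuard {
    tx: Sender<Completion>,
    execution_id: i64,
    action_id: i64,
    outcome: Outcome,
}

impl CompletionGuard {
    pub fn new(tx: Sender<Completion>, execution_id: i64, action_id: i64) -> Self {
        // Pessimistic default: an early return is reported as a failure.
        Self { tx, execution_id, action_id, outcome: Outcome::Failed }
    }

    pub fn set_outcome(&mut self, outcome: Outcome) {
        self.outcome = outcome;
    }
}

impl Drop for CompletionGuard {
    fn drop(&mut self) {
        // A send error only means the receiver is gone, so it is ignored here.
        let _ = self.tx.send(Completion {
            execution_id: self.execution_id,
            action_id: self.action_id,
            outcome: self.outcome,
        });
    }
}

/// Runs `work` and guarantees a completion message on success or error.
pub fn run_action(
    tx: Sender<Completion>,
    execution_id: i64,
    action_id: i64,
    work: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    let mut guard = CompletionGuard::new(tx, execution_id, action_id);
    work()?; // an early return here still publishes `Failed` via `Drop`
    guard.set_outcome(Outcome::Succeeded);
    Ok(())
}
```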
+
+**Tests:**
+- Worker publishes on success
+- Worker publishes on failure
+- Worker publishes on timeout
+- Worker publishes on cancel
+
+### Step 6: Add Queue Stats API Endpoint (0.5 day)
+
+**Files to Modify:**
+- `crates/api/src/routes/actions.rs`
+
+**New Endpoint:**
+```
+GET /api/v1/actions/:ref/queue-stats
+
+Response:
+{
+  "action_id": 123,
+  "action_ref": "core.echo",
+  "queue_length": 5,
+  "active_count": 2,
+  "max_concurrent": 3,
+  "oldest_enqueued_at": "2025-01-15T10:30:00Z"
+}
+```
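
The response fields above map naturally onto a snapshot of one action's queue. The sketch below assumes a simplified `QueueEntry` that stores the enqueue time as Unix seconds (the plan uses `DateTime<Utc>`), and the `snapshot` function name is hypothetical. Resolving `action_ref` and serializing to JSON are left out.

```rust
use std::collections::VecDeque;

/// Simplified queue entry; the timestamp is Unix seconds to stay std-only.
pub struct QueueEntry {
    pub execution_id: i64,
    pub enqueued_at_unix: u64,
}

#[derive(Debug, PartialEq)]
pub struct QueueStats {
    pub action_id: i64,
    pub queue_length: usize,
    pub active_count: u32,
    pub max_concurrent: u32,
    pub oldest_enqueued_at_unix: Option<u64>,
}

/// Builds a point-in-time view of one action's queue.
pub fn snapshot(
    action_id: i64,
    queue: &VecDeque<QueueEntry>,
    active_count: u32,
    max_concurrent: u32,
) -> QueueStats {
    QueueStats {
        action_id,
        queue_length: queue.len(),
        active_count,
        max_concurrent,
        // In a FIFO queue the head is always the oldest entry,
        // so no scan is needed.
        oldest_enqueued_at_unix: queue.front().map(|e| e.enqueued_at_unix),
    }
}
```

An empty queue yields `None` for the oldest timestamp, which the endpoint could serialize as `null`.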
+
+**Tests:**
+- Endpoint returns correct stats
+- Queue stats update in real time
+- Non-existent action returns 404
+
+### Step 7: Integration Testing (1 day)
+
+**Test Scenarios:**
+1. **FIFO Ordering**: 10 executions, limit=1, verify order
+2. **Concurrent Actions**: Multiple actions don't interfere
+3. **High Concurrency**: 1000 simultaneous enqueues
+4. **Completion Handling**: Verify queue progresses on completion
+5. **Failure Scenarios**: Worker crash, timeout, cancel
+6. **Policy Integration**: Rate limit + queue interaction
+7. **API Stats**: Verify queue stats are accurate
+
+**Files:**
+- `crates/executor/tests/queue_ordering_test.rs`
+- `crates/executor/tests/queue_stress_test.rs`
+
+### Step 8: Documentation (0.5 day)
+
+**Files to Create/Update:**
+- `docs/queue-architecture.md` - Queue design and behavior
+- `docs/api-actions.md` - Add queue-stats endpoint
+- `README.md` - Mention queue ordering guarantees
+
+**Content:**
+- How queues work per action
+- FIFO guarantees
+- Monitoring queue stats
+- Performance characteristics
+- Troubleshooting queue issues
+
+## API Changes
+
+### New Endpoint
+- `GET /api/v1/actions/:ref/queue-stats` - View queue statistics
+
+### Message Types
+- `execution.completed` (new) - Worker notifies completion
+
+## Database Changes
+
+**None required** - All queue state is in-memory
+
+## Configuration
+
+Add to `ExecutorConfig`:
+```yaml
+executor:
+  queue:
+    max_queue_length: 10000  # Per-action queue limit
+    queue_timeout_seconds: 3600  # Max time in queue
+    enable_queue_metrics: true
+```
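
A rough sketch of how these settings might be enforced follows. The struct mirrors the YAML keys, while `QueueError`, `check_capacity`, and `check_deadline` are hypothetical names. Whether a full queue rejects the request or applies back-pressure is a design decision the plan does not spell out, so rejection here is an assumption.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone)]
pub struct QueueConfig {
    pub max_queue_length: usize,
    pub queue_timeout: Duration,
    pub enable_queue_metrics: bool,
}

impl Default for QueueConfig {
    fn default() -> Self {
        // Defaults taken from the YAML example.
        Self {
            max_queue_length: 10_000,
            queue_timeout: Duration::from_secs(3600),
            enable_queue_metrics: true,
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum QueueError {
    QueueFull { limit: usize },
    TimedOut { waited: Duration },
}

/// Rejects a new entry once the per-action queue has reached its limit.
pub fn check_capacity(cfg: &QueueConfig, current_len: usize) -> Result<(), QueueError> {
    if current_len >= cfg.max_queue_length {
        Err(QueueError::QueueFull { limit: cfg.max_queue_length })
    } else {
        Ok(())
    }
}

/// Fails a waiter that has been queued longer than the configured timeout.
pub fn check_deadline(cfg: &QueueConfig, enqueued_at: Instant, now: Instant) -> Result<(), QueueError> {
    let waited = now.saturating_duration_since(enqueued_at);
    if waited > cfg.queue_timeout {
        Err(QueueError::TimedOut { waited })
    } else {
        Ok(())
    }
}
```

Passing `now` in explicitly keeps the deadline check deterministic in tests.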
+
+## Performance Considerations
+
+1. **Memory Usage**: O(n) in the number of queued executions (one `QueueEntry` each)
+   - Mitigation: `max_queue_length` config
+   - Typical: 100-1000 queued per action
+
+2. **Lock Contention**: Keying the `DashMap` by `action_id` reduces contention
+   - Each action has an independent lock
+   - `Notify` suspends waiting tasks until they are woken, so there is no polling loop
+
+3. **Message Overhead**: One additional message per execution
+   - `execution.completed` is lightweight
+   - Published async, no blocking
+
+## Testing Strategy
+
+### Unit Tests
+- QueueManager FIFO behavior
+- Notify mechanism correctness
+- Queue stats accuracy
+- Cancellation handling
+
+### Integration Tests
+- End-to-end execution ordering
+- Multiple workers, one action
+- Concurrent actions independent
+- Stress test: 1000 concurrent enqueues
+
+### Performance Tests
+- Throughput with queuing enabled
+- Latency impact of queuing
+- Memory usage under load
+
+## Migration & Rollout
+
+### Phase 1: Deploy with Queue Disabled (Default)
+- Deploy code with queue feature
+- Queue disabled by default (concurrency_limit = None)
+- Monitor for issues
+
+### Phase 2: Enable for Select Actions
+- Enable queue for specific high-concurrency actions
+- Monitor ordering and performance
+- Gather metrics
+
+### Phase 3: Enable Globally
+- Set default concurrency limits
+- Enable queue for all actions
+- Document behavior change
+
+## Success Criteria
+
+- [ ] All tests pass (unit, integration, performance)
+- [ ] FIFO ordering guaranteed for same action
+- [ ] Completion notification releases queue slot
+- [ ] Queue stats API endpoint works
+- [ ] Documentation complete
+- [ ] No performance regression (< 5% latency increase)
+- [ ] Zero race conditions under stress test
+
+## Risks & Mitigations
+
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| Memory exhaustion | HIGH | max_queue_length config |
+| Deadlock in notify | CRITICAL | Timeout on queue wait |
+| Worker crash loses completion | MEDIUM | Executor timeout cleanup |
+| Race in queue state | HIGH | Careful lock ordering |
+| Performance regression | MEDIUM | Benchmark before/after |
+
+## Timeline
+
+- **Total Estimate**: 7.5 days (sum of the step estimates below)
+- **Step 1 (QueueManager)**: 2 days
+- **Step 2 (PolicyEnforcer)**: 1 day
+- **Step 3 (EnforcementProcessor)**: 1 day
+- **Step 4 (CompletionListener)**: 1 day
+- **Step 5 (Worker updates)**: 0.5 day
+- **Step 6 (API endpoint)**: 0.5 day
+- **Step 7 (Integration tests)**: 1 day
+- **Step 8 (Documentation)**: 0.5 day
+
+## Next Steps
+
+1. Review plan with team
+2. Create `queue_manager.rs` with core data structures
+3. Implement `enqueue_and_wait()` with tests
+4. Integrate with policy enforcer
+5. Continue with remaining steps
+
+---
+
+**Related Documents:**
+- `work-summary/TODO.md` - Phase 0.1 task list
+- `docs/architecture.md` - Overall system architecture
+- `crates/executor/src/policy_enforcer.rs` - Current policy implementation