attune/work-summary/phases/2025-01-policy-ordering-plan.md
2026-02-04 17:46:30 -06:00

Policy Execution Ordering Implementation Plan

Date: 2025-01-XX
Status: Planning
Priority: P0 - BLOCKING (Critical Correctness)

Problem Statement

Currently, when execution policies (concurrency limits, delays) are enforced, there is no guaranteed ordering for which executions proceed when slots become available. This leads to:

  1. Fairness Violations: Later requests can execute before earlier ones
  2. Non-deterministic Behavior: Same workflow produces different orders across runs
  3. Workflow Dependencies Break: Parent executions may proceed after children
  4. Poor User Experience: Unpredictable queue behavior

Current Flow (Broken)

Request A arrives → Policy blocks (concurrency=1, 1 running)
Request B arrives → Policy blocks (concurrency=1, 1 running)
Request C arrives → Policy blocks (concurrency=1, 1 running)
Running execution completes
→ A, B, or C might proceed (nondeterministic: whichever waiter tokio happens to wake first)

Desired Flow (FIFO)

Request A arrives → Enqueued at position 0
Request B arrives → Enqueued at position 1
Request C arrives → Enqueued at position 2
Running execution completes → Notify position 0 → A proceeds
A completes → Notify position 1 → B proceeds
B completes → Notify position 2 → C proceeds
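The desired handoff reduces to draining a FIFO: every completion releases the request that has waited longest, never an arbitrary waiter. A minimal sketch of that invariant (illustrative only; the real implementation waits asynchronously rather than draining eagerly):

```rust
use std::collections::VecDeque;

// Model the desired handoff: requests enter in arrival order, and each
// completion releases the head of the queue.
fn release_order(arrivals: &[&'static str]) -> Vec<&'static str> {
    let mut queue: VecDeque<&str> = arrivals.iter().copied().collect();
    let mut released = Vec::new();
    while let Some(next) = queue.pop_front() {
        released.push(next); // "notify position 0", then 1, then 2, ...
    }
    released
}
```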

Architecture Design

1. ExecutionQueueManager

A new component that manages FIFO queues per action and provides slot-based synchronization.

Key Features:

  • One queue per action_id (per-action concurrency control)
  • FIFO ordering guarantee using VecDeque
  • Tokio Notify for efficient async waiting
  • Thread-safe: DashMap keyed by action_id, with each queue behind an Arc<Mutex<ActionQueue>>
  • Queue statistics for monitoring

Data Structures:

use std::collections::VecDeque;
use std::sync::Arc;

use chrono::{DateTime, Utc};
use dashmap::DashMap;
use tokio::sync::{Mutex, Notify};

struct QueueEntry {
    execution_id: i64,
    enqueued_at: DateTime<Utc>,
    notifier: Arc<Notify>,
}

struct ActionQueue {
    queue: VecDeque<QueueEntry>,
    active_count: u32,
    max_concurrent: u32,
}

struct ExecutionQueueManager {
    queues: DashMap<i64, Arc<Mutex<ActionQueue>>>, // key: action_id
}

2. Integration Points

A. EnforcementProcessor

  • Before: Directly creates execution and publishes to scheduler
  • After: Calls queue_manager.enqueue_and_wait() before creating execution
  • Change: Async wait until queue allows execution

B. PolicyEnforcer

  • Before: wait_for_policy_compliance() polls every 1 second
  • After: enforce_and_wait() combines policy check + queue wait
  • Change: More efficient, guaranteed ordering

C. ExecutionScheduler

  • No Change: Receives ExecutionRequested messages as before
  • Note: Queue happens before scheduling, not during

D. Worker → Executor Completion

  • New: Worker publishes execution.completed message
  • New: Executor's CompletionListener consumes these messages
  • New: CompletionListener calls queue_manager.notify_completion(action_id)

3. Message Flow

┌─────────────────────────────────────────────────────────────────┐
│ EnforcementProcessor                                             │
│                                                                  │
│  1. Receive enforcement.created                                 │
│  2. queue_manager.enqueue_and_wait(action_id, execution_id)     │
│     ├─ Check policy compliance                                  │
│     ├─ Enqueue to action's FIFO queue                           │
│     ├─ Wait on notifier if queue full                           │
│     └─ Return when slot available                               │
│  3. Create execution record                                     │
│  4. Publish execution.requested                                 │
└─────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│ ExecutionScheduler                                              │
│                                                                  │
│  5. Receive execution.requested                                 │
│  6. Select worker                                               │
│  7. Publish to worker queue                                     │
└─────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│ Worker                                                          │
│                                                                  │
│  8. Execute action                                              │
│  9. Publish execution.completed (NEW)                           │
└─────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│ CompletionListener (NEW)                                        │
│                                                                  │
│ 10. Receive execution.completed                                 │
│ 11. queue_manager.notify_completion(action_id)                  │
│     └─ Notify next waiter in queue                              │
└─────────────────────────────────────────────────────────────────┘

Implementation Steps

Step 1: Create ExecutionQueueManager (2 days)

Files to Create:

  • crates/executor/src/queue_manager.rs

Implementation:

pub struct ExecutionQueueManager {
    queues: DashMap<i64, Arc<Mutex<ActionQueue>>>,
}

impl ExecutionQueueManager {
    /// Enqueue `execution_id` on the action's FIFO queue and wait
    /// until a slot is free under `max_concurrent`.
    pub async fn enqueue_and_wait(
        &self,
        action_id: i64,
        execution_id: i64,
        max_concurrent: u32,
    ) -> Result<()> {
        todo!()
    }

    /// Release a slot for `action_id` and wake the next waiter in FIFO order.
    pub async fn notify_completion(&self, action_id: i64) -> Result<()> {
        todo!()
    }

    /// Snapshot of queue length, active count, and oldest entry for monitoring.
    pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats {
        todo!()
    }

    /// Remove a still-pending execution from its queue.
    pub async fn cancel_execution(&self, execution_id: i64) -> Result<()> {
        todo!()
    }
}
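The FIFO-with-limit semantics behind enqueue_and_wait / notify_completion can be sketched as a blocking, std-only analogue for a single action, with std Condvar standing in for tokio's Notify (names here are illustrative, not the planned API):

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

// Blocking, std-only analogue of the planned async manager for a single
// action. The invariant: a waiter proceeds only when it is the queue head
// AND a concurrency slot is free.
struct ActionSlotQueue {
    queue: VecDeque<i64>, // pending execution_ids, FIFO
    active_count: u32,
    max_concurrent: u32,
}

struct QueueManager {
    inner: Mutex<ActionSlotQueue>,
    cv: Condvar,
}

impl QueueManager {
    fn new(max_concurrent: u32) -> Self {
        Self {
            inner: Mutex::new(ActionSlotQueue {
                queue: VecDeque::new(),
                active_count: 0,
                max_concurrent,
            }),
            cv: Condvar::new(),
        }
    }

    // Block until this execution is at the head of the queue and a
    // concurrency slot is free; only then claim the slot.
    fn enqueue_and_wait(&self, execution_id: i64) {
        let mut q = self.inner.lock().unwrap();
        q.queue.push_back(execution_id);
        while q.queue.front() != Some(&execution_id) || q.active_count >= q.max_concurrent {
            q = self.cv.wait(q).unwrap();
        }
        q.queue.pop_front();
        q.active_count += 1;
    }

    // Release a slot and wake all waiters; only the new head passes the
    // predicate above, which preserves FIFO even though everyone is woken.
    fn notify_completion(&self) {
        let mut q = self.inner.lock().unwrap();
        q.active_count = q.active_count.saturating_sub(1);
        drop(q);
        self.cv.notify_all();
    }
}
```

Waking all waiters and re-checking the head-of-queue predicate is what makes the ordering deterministic; the async version can achieve the same by storing a per-entry Notify and waking only the head.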

Tests:

  • FIFO ordering with 3 concurrent enqueues, limit=1
  • 1000 concurrent enqueues maintain order
  • Completion notification releases correct waiter
  • Multiple actions have independent queues
  • Cancel removes from queue correctly

Step 2: Integrate with PolicyEnforcer (1 day)

Files to Modify:

  • crates/executor/src/policy_enforcer.rs

Changes:

  • Add queue_manager: Arc<ExecutionQueueManager> field
  • Create enforce_and_wait() method that combines:
    1. Policy compliance check
    2. Queue enqueue and wait
  • Keep existing check_policies() for validation

Tests:

  • Policy violation prevents queue entry
  • Policy pass allows queue entry
  • Queue respects concurrency limits

Step 3: Update EnforcementProcessor (1 day)

Files to Modify:

  • crates/executor/src/enforcement_processor.rs

Changes:

  • Add queue_manager: Arc<ExecutionQueueManager> field
  • In create_execution(), before creating execution record:
    // Get action's concurrency limit from policy
    let concurrency_limit = policy_enforcer
        .get_concurrency_limit(rule.action)
        .unwrap_or(u32::MAX);
    
    // Wait for queue slot
    queue_manager
        .enqueue_and_wait(rule.action, enforcement.id, concurrency_limit)
        .await?;
    
    // Now create execution (we have a slot)
    let execution = ExecutionRepository::create(pool, execution_input).await?;
    

Tests:

  • Three executions with limit=1 execute in FIFO order
  • Queue blocks until slot available
  • Execution created only after queue allows

Step 4: Create CompletionListener (1 day)

Files to Create:

  • crates/executor/src/completion_listener.rs

Implementation:

  • New component that consumes execution.completed messages
  • Calls queue_manager.notify_completion(action_id)
  • Updates execution status in database (if needed)
  • Publishes notifications

Message Type:

// In attune_common/mq/messages.rs
pub struct ExecutionCompletedPayload {
    pub execution_id: i64,
    pub action_id: i64,
    pub status: ExecutionStatus,
    pub result: Option<JsonValue>,
}

Tests:

  • Completion message triggers queue notification
  • Correct action_id used for notification
  • Database status updated correctly

Step 5: Update Worker to Publish Completions (0.5 day)

Files to Modify:

  • crates/worker/src/executor.rs

Changes:

  • After execution completes (success or failure), publish execution.completed
  • Include action_id in message payload
  • Use reliable publishing (ensure message is sent)

Tests:

  • Worker publishes on success
  • Worker publishes on failure
  • Worker publishes on timeout
  • Worker publishes on cancel

Step 6: Add Queue Stats API Endpoint (0.5 day)

Files to Modify:

  • crates/api/src/routes/actions.rs

New Endpoint:

GET /api/v1/actions/:ref/queue-stats

Response:
{
  "action_id": 123,
  "action_ref": "core.echo",
  "queue_length": 5,
  "active_count": 2,
  "max_concurrent": 3,
  "oldest_enqueued_at": "2025-01-15T10:30:00Z"
}
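A hypothetical Rust shape for this payload, usable as the QueueStats type returned by get_queue_stats (field names mirror the JSON example; the real handler would presumably derive serde::Serialize, omitted here to keep the sketch dependency-free):

```rust
// Hypothetical queue-stats payload; field names mirror the JSON example.
pub struct QueueStats {
    pub action_id: i64,
    pub action_ref: String,
    pub queue_length: usize,
    pub active_count: u32,
    pub max_concurrent: u32,
    pub oldest_enqueued_at: Option<String>, // RFC 3339; None when nothing is queued
}
```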

Tests:

  • Endpoint returns correct stats
  • Queue stats update in real-time
  • Non-existent action returns 404

Step 7: Integration Testing (1 day)

Test Scenarios:

  1. FIFO Ordering: 10 executions, limit=1, verify order
  2. Concurrent Actions: Multiple actions don't interfere
  3. High Concurrency: 1000 simultaneous enqueues
  4. Completion Handling: Verify queue progresses on completion
  5. Failure Scenarios: Worker crash, timeout, cancel
  6. Policy Integration: Rate limit + queue interaction
  7. API Stats: Verify queue stats are accurate

Files:

  • crates/executor/tests/queue_ordering_test.rs
  • crates/executor/tests/queue_stress_test.rs

Step 8: Documentation (0.5 day)

Files to Create/Update:

  • docs/queue-architecture.md - Queue design and behavior
  • docs/api-actions.md - Add queue-stats endpoint
  • README.md - Mention queue ordering guarantees

Content:

  • How queues work per action
  • FIFO guarantees
  • Monitoring queue stats
  • Performance characteristics
  • Troubleshooting queue issues

API Changes

New Endpoint

  • GET /api/v1/actions/:ref/queue-stats - View queue statistics

Message Types

  • execution.completed (new) - Worker notifies completion

Database Changes

None required; all queue state is held in memory

Configuration

Add to ExecutorConfig:

executor:
  queue:
    max_queue_length: 10000  # Per-action queue limit
    queue_timeout_seconds: 3600  # Max time in queue
    enable_queue_metrics: true
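A hypothetical Rust mirror of this YAML (struct and field names are assumptions about how ExecutorConfig might carry these values; defaults match the example above):

```rust
// Assumed config struct; names and defaults track the YAML sketch above.
#[derive(Debug, Clone)]
pub struct QueueConfig {
    pub max_queue_length: usize,    // per-action cap on pending entries
    pub queue_timeout_seconds: u64, // max time an entry may wait in queue
    pub enable_queue_metrics: bool,
}

impl Default for QueueConfig {
    fn default() -> Self {
        Self {
            max_queue_length: 10_000,
            queue_timeout_seconds: 3_600,
            enable_queue_metrics: true,
        }
    }
}
```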

Performance Considerations

  1. Memory Usage: O(n) per queued execution

    • Mitigation: max_queue_length config
    • Typical: 100-1000 queued per action
  2. Lock Contention: DashMap per action reduces contention

    • Each action has independent lock
    • Notify uses efficient futex-based waiting
  3. Message Overhead: One additional message per execution

    • execution.completed is lightweight
    • Published async, no blocking

Testing Strategy

Unit Tests

  • QueueManager FIFO behavior
  • Notify mechanism correctness
  • Queue stats accuracy
  • Cancellation handling

Integration Tests

  • End-to-end execution ordering
  • Multiple workers, one action
  • Concurrent actions independent
  • Stress test: 1000 concurrent enqueues

Performance Tests

  • Throughput with queuing enabled
  • Latency impact of queuing
  • Memory usage under load

Migration & Rollout

Phase 1: Deploy with Queue Disabled (Default)

  • Deploy code with queue feature
  • Queue disabled by default (concurrency_limit = None)
  • Monitor for issues

Phase 2: Enable for Select Actions

  • Enable queue for specific high-concurrency actions
  • Monitor ordering and performance
  • Gather metrics

Phase 3: Enable Globally

  • Set default concurrency limits
  • Enable queue for all actions
  • Document behavior change

Success Criteria

  • All tests pass (unit, integration, performance)
  • FIFO ordering guaranteed for same action
  • Completion notification releases queue slot
  • Queue stats API endpoint works
  • Documentation complete
  • No performance regression (< 5% latency increase)
  • Zero race conditions under stress test

Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Memory exhaustion | HIGH | max_queue_length config |
| Deadlock in notify | CRITICAL | Timeout on queue wait |
| Worker crash loses completion | MEDIUM | Executor timeout cleanup |
| Race in queue state | HIGH | Careful lock ordering |
| Performance regression | MEDIUM | Benchmark before/after |

Timeline

  • Total Estimate: 7.5 days (sum of the steps below)
  • Step 1 (QueueManager): 2 days
  • Step 2 (PolicyEnforcer): 1 day
  • Step 3 (EnforcementProcessor): 1 day
  • Step 4 (CompletionListener): 1 day
  • Step 5 (Worker updates): 0.5 day
  • Step 6 (API endpoint): 0.5 day
  • Step 7 (Integration tests): 1 day
  • Step 8 (Documentation): 0.5 day

Next Steps

  1. Review plan with team
  2. Create queue_manager.rs with core data structures
  3. Implement enqueue_and_wait() with tests
  4. Integrate with policy enforcer
  5. Continue with remaining steps

Related Documents:

  • work-summary/TODO.md - Phase 0.1 task list
  • docs/architecture.md - Overall system architecture
  • crates/executor/src/policy_enforcer.rs - Current policy implementation