attune/work-summary/phases/2025-01-policy-ordering-plan.md
2026-02-04 17:46:30 -06:00

Policy Execution Ordering Implementation Plan

Date: 2025-01-XX
Status: Planning
Priority: P0 - BLOCKING (Critical Correctness)

Problem Statement

Currently, when execution policies (concurrency limits, delays) are enforced, there is no guaranteed ordering for which executions proceed when slots become available. This leads to:

  1. Fairness Violations: Later requests can execute before earlier ones
  2. Non-deterministic Behavior: Same workflow produces different orders across runs
  3. Workflow Dependencies Break: Parent executions may proceed after children
  4. Poor User Experience: Unpredictable queue behavior

Current Flow (Broken)

Request A arrives → Policy blocks (concurrency=1, 1 running)
Request B arrives → Policy blocks (concurrency=1, 1 running)
Request C arrives → Policy blocks (concurrency=1, 1 running)
Running execution completes
→ A, B, or C might proceed (nondeterministic: whichever waiter tokio happens to wake first)

Desired Flow (FIFO)

Request A arrives → Enqueued at position 0
Request B arrives → Enqueued at position 1
Request C arrives → Enqueued at position 2
Running execution completes → Notify position 0 → A proceeds
A completes → Notify position 1 → B proceeds
B completes → Notify position 2 → C proceeds
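The desired handoff reduces to draining a FIFO: every completion releases the request that has waited longest, never an arbitrary waiter. A minimal sketch of that invariant (illustrative only; the real implementation waits asynchronously rather than draining eagerly):

```rust
use std::collections::VecDeque;

// Model the desired handoff: requests enter in arrival order, and each
// completion releases the head of the queue.
fn release_order(arrivals: &[&'static str]) -> Vec<&'static str> {
    let mut queue: VecDeque<&str> = arrivals.iter().copied().collect();
    let mut released = Vec::new();
    while let Some(next) = queue.pop_front() {
        released.push(next); // "notify position 0", then 1, then 2, ...
    }
    released
}
```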

Architecture Design

1. ExecutionQueueManager

A new component that manages FIFO queues per action and provides slot-based synchronization.

Key Features:

  • One queue per action_id (per-action concurrency control)
  • FIFO ordering guarantee using VecDeque
  • Tokio Notify for efficient async waiting
  • Thread-safe: DashMap keyed by action_id, with each queue behind an Arc<Mutex<ActionQueue>>
  • Queue statistics for monitoring

Data Structures:

use std::collections::VecDeque;
use std::sync::Arc;

use chrono::{DateTime, Utc};
use dashmap::DashMap;
use tokio::sync::{Mutex, Notify};

struct QueueEntry {
    execution_id: i64,
    enqueued_at: DateTime<Utc>,
    notifier: Arc<Notify>,
}

struct ActionQueue {
    queue: VecDeque<QueueEntry>,
    active_count: u32,
    max_concurrent: u32,
}

struct ExecutionQueueManager {
    queues: DashMap<i64, Arc<Mutex<ActionQueue>>>, // key: action_id
}

2. Integration Points

A. EnforcementProcessor

  • Before: Directly creates execution and publishes to scheduler
  • After: Calls queue_manager.enqueue_and_wait() before creating execution
  • Change: Async wait until queue allows execution

B. PolicyEnforcer

  • Before: wait_for_policy_compliance() polls every 1 second
  • After: enforce_and_wait() combines policy check + queue wait
  • Change: More efficient, guaranteed ordering

C. ExecutionScheduler

  • No Change: Receives ExecutionRequested messages as before
  • Note: Queue happens before scheduling, not during

D. Worker → Executor Completion

  • New: Worker publishes execution.completed message
  • New: Executor's CompletionListener consumes these messages
  • New: CompletionListener calls queue_manager.notify_completion(action_id)

3. Message Flow

┌─────────────────────────────────────────────────────────────────┐
│ EnforcementProcessor                                             │
│                                                                  │
│  1. Receive enforcement.created                                 │
│  2. queue_manager.enqueue_and_wait(action_id, execution_id)     │
│     ├─ Check policy compliance                                  │
│     ├─ Enqueue to action's FIFO queue                           │
│     ├─ Wait on notifier if queue full                           │
│     └─ Return when slot available                               │
│  3. Create execution record                                     │
│  4. Publish execution.requested                                 │
└─────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│ ExecutionScheduler                                              │
│                                                                  │
│  5. Receive execution.requested                                 │
│  6. Select worker                                               │
│  7. Publish to worker queue                                     │
└─────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│ Worker                                                          │
│                                                                  │
│  8. Execute action                                              │
│  9. Publish execution.completed (NEW)                           │
└─────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│ CompletionListener (NEW)                                        │
│                                                                  │
│ 10. Receive execution.completed                                 │
│ 11. queue_manager.notify_completion(action_id)                  │
│     └─ Notify next waiter in queue                              │
└─────────────────────────────────────────────────────────────────┘

Implementation Steps

Step 1: Create ExecutionQueueManager (2 days)

Files to Create:

  • crates/executor/src/queue_manager.rs

Implementation:

pub struct ExecutionQueueManager {
    queues: DashMap<i64, Arc<Mutex<ActionQueue>>>,
}

impl ExecutionQueueManager {
    /// Enqueue `execution_id` on the action's FIFO queue and wait
    /// until a slot is free under `max_concurrent`.
    pub async fn enqueue_and_wait(
        &self,
        action_id: i64,
        execution_id: i64,
        max_concurrent: u32,
    ) -> Result<()> {
        todo!()
    }

    /// Release a slot for `action_id` and wake the next waiter in FIFO order.
    pub async fn notify_completion(&self, action_id: i64) -> Result<()> {
        todo!()
    }

    /// Snapshot of queue length, active count, and oldest entry for monitoring.
    pub async fn get_queue_stats(&self, action_id: i64) -> QueueStats {
        todo!()
    }

    /// Remove a still-pending execution from its queue.
    pub async fn cancel_execution(&self, execution_id: i64) -> Result<()> {
        todo!()
    }
}
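The FIFO-with-limit semantics behind enqueue_and_wait / notify_completion can be sketched as a blocking, std-only analogue for a single action, with std Condvar standing in for tokio's Notify (names here are illustrative, not the planned API):

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

// Blocking, std-only analogue of the planned async manager for a single
// action. The invariant: a waiter proceeds only when it is the queue head
// AND a concurrency slot is free.
struct ActionSlotQueue {
    queue: VecDeque<i64>, // pending execution_ids, FIFO
    active_count: u32,
    max_concurrent: u32,
}

struct QueueManager {
    inner: Mutex<ActionSlotQueue>,
    cv: Condvar,
}

impl QueueManager {
    fn new(max_concurrent: u32) -> Self {
        Self {
            inner: Mutex::new(ActionSlotQueue {
                queue: VecDeque::new(),
                active_count: 0,
                max_concurrent,
            }),
            cv: Condvar::new(),
        }
    }

    // Block until this execution is at the head of the queue and a
    // concurrency slot is free; only then claim the slot.
    fn enqueue_and_wait(&self, execution_id: i64) {
        let mut q = self.inner.lock().unwrap();
        q.queue.push_back(execution_id);
        while q.queue.front() != Some(&execution_id) || q.active_count >= q.max_concurrent {
            q = self.cv.wait(q).unwrap();
        }
        q.queue.pop_front();
        q.active_count += 1;
    }

    // Release a slot and wake all waiters; only the new head passes the
    // predicate above, which preserves FIFO even though everyone is woken.
    fn notify_completion(&self) {
        let mut q = self.inner.lock().unwrap();
        q.active_count = q.active_count.saturating_sub(1);
        drop(q);
        self.cv.notify_all();
    }
}
```

Waking all waiters and re-checking the head-of-queue predicate is what makes the ordering deterministic; the async version can achieve the same by storing a per-entry Notify and waking only the head.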

Tests:

  • FIFO ordering with 3 concurrent enqueues, limit=1
  • 1000 concurrent enqueues maintain order
  • Completion notification releases correct waiter
  • Multiple actions have independent queues
  • Cancel removes from queue correctly

Step 2: Integrate with PolicyEnforcer (1 day)

Files to Modify:

  • crates/executor/src/policy_enforcer.rs

Changes:

  • Add queue_manager: Arc<ExecutionQueueManager> field
  • Create enforce_and_wait() method that combines:
    1. Policy compliance check
    2. Queue enqueue and wait
  • Keep existing check_policies() for validation

Tests:

  • Policy violation prevents queue entry
  • Policy pass allows queue entry
  • Queue respects concurrency limits

Step 3: Update EnforcementProcessor (1 day)

Files to Modify:

  • crates/executor/src/enforcement_processor.rs

Changes:

  • Add queue_manager: Arc<ExecutionQueueManager> field
  • In create_execution(), before creating execution record:
    // Get action's concurrency limit from policy
    let concurrency_limit = policy_enforcer
        .get_concurrency_limit(rule.action)
        .unwrap_or(u32::MAX);
    
    // Wait for queue slot
    queue_manager
        .enqueue_and_wait(rule.action, enforcement.id, concurrency_limit)
        .await?;
    
    // Now create execution (we have a slot)
    let execution = ExecutionRepository::create(pool, execution_input).await?;
    

Tests:

  • Three executions with limit=1 execute in FIFO order
  • Queue blocks until slot available
  • Execution created only after queue allows

Step 4: Create CompletionListener (1 day)

Files to Create:

  • crates/executor/src/completion_listener.rs

Implementation:

  • New component that consumes execution.completed messages
  • Calls queue_manager.notify_completion(action_id)
  • Updates execution status in database (if needed)
  • Publishes notifications

Message Type:

// In attune_common/mq/messages.rs
pub struct ExecutionCompletedPayload {
    pub execution_id: i64,
    pub action_id: i64,
    pub status: ExecutionStatus,
    pub result: Option<JsonValue>,
}

Tests:

  • Completion message triggers queue notification
  • Correct action_id used for notification
  • Database status updated correctly

Step 5: Update Worker to Publish Completions (0.5 day)

Files to Modify:

  • crates/worker/src/executor.rs

Changes:

  • After execution completes (success or failure), publish execution.completed
  • Include action_id in message payload
  • Use reliable publishing (ensure message is sent)

Tests:

  • Worker publishes on success
  • Worker publishes on failure
  • Worker publishes on timeout
  • Worker publishes on cancel

Step 6: Add Queue Stats API Endpoint (0.5 day)

Files to Modify:

  • crates/api/src/routes/actions.rs

New Endpoint:

GET /api/v1/actions/:ref/queue-stats

Response:
{
  "action_id": 123,
  "action_ref": "core.echo",
  "queue_length": 5,
  "active_count": 2,
  "max_concurrent": 3,
  "oldest_enqueued_at": "2025-01-15T10:30:00Z"
}
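A hypothetical Rust shape for this payload, usable as the QueueStats type returned by get_queue_stats (field names mirror the JSON example; the real handler would presumably derive serde::Serialize, omitted here to keep the sketch dependency-free):

```rust
// Hypothetical queue-stats payload; field names mirror the JSON example.
pub struct QueueStats {
    pub action_id: i64,
    pub action_ref: String,
    pub queue_length: usize,
    pub active_count: u32,
    pub max_concurrent: u32,
    pub oldest_enqueued_at: Option<String>, // RFC 3339; None when nothing is queued
}
```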

Tests:

  • Endpoint returns correct stats
  • Queue stats update in real-time
  • Non-existent action returns 404

Step 7: Integration Testing (1 day)

Test Scenarios:

  1. FIFO Ordering: 10 executions, limit=1, verify order
  2. Concurrent Actions: Multiple actions don't interfere
  3. High Concurrency: 1000 simultaneous enqueues
  4. Completion Handling: Verify queue progresses on completion
  5. Failure Scenarios: Worker crash, timeout, cancel
  6. Policy Integration: Rate limit + queue interaction
  7. API Stats: Verify queue stats are accurate

Files:

  • crates/executor/tests/queue_ordering_test.rs
  • crates/executor/tests/queue_stress_test.rs

Step 8: Documentation (0.5 day)

Files to Create/Update:

  • docs/queue-architecture.md - Queue design and behavior
  • docs/api-actions.md - Add queue-stats endpoint
  • README.md - Mention queue ordering guarantees

Content:

  • How queues work per action
  • FIFO guarantees
  • Monitoring queue stats
  • Performance characteristics
  • Troubleshooting queue issues

API Changes

New Endpoint

  • GET /api/v1/actions/:ref/queue-stats - View queue statistics

Message Types

  • execution.completed (new) - Worker notifies completion

Database Changes

None required; all queue state is held in memory

Configuration

Add to ExecutorConfig:

executor:
  queue:
    max_queue_length: 10000  # Per-action queue limit
    queue_timeout_seconds: 3600  # Max time in queue
    enable_queue_metrics: true
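A hypothetical Rust mirror of this YAML (struct and field names are assumptions about how ExecutorConfig might carry these values; defaults match the example above):

```rust
// Assumed config struct; names and defaults track the YAML sketch above.
#[derive(Debug, Clone)]
pub struct QueueConfig {
    pub max_queue_length: usize,    // per-action cap on pending entries
    pub queue_timeout_seconds: u64, // max time an entry may wait in queue
    pub enable_queue_metrics: bool,
}

impl Default for QueueConfig {
    fn default() -> Self {
        Self {
            max_queue_length: 10_000,
            queue_timeout_seconds: 3_600,
            enable_queue_metrics: true,
        }
    }
}
```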

Performance Considerations

  1. Memory Usage: O(n) per queued execution

    • Mitigation: max_queue_length config
    • Typical: 100-1000 queued per action
  2. Lock Contention: DashMap per action reduces contention

    • Each action has independent lock
    • Notify uses efficient futex-based waiting
  3. Message Overhead: One additional message per execution

    • execution.completed is lightweight
    • Published async, no blocking

Testing Strategy

Unit Tests

  • QueueManager FIFO behavior
  • Notify mechanism correctness
  • Queue stats accuracy
  • Cancellation handling

Integration Tests

  • End-to-end execution ordering
  • Multiple workers, one action
  • Concurrent actions independent
  • Stress test: 1000 concurrent enqueues

Performance Tests

  • Throughput with queuing enabled
  • Latency impact of queuing
  • Memory usage under load

Migration & Rollout

Phase 1: Deploy with Queue Disabled (Default)

  • Deploy code with queue feature
  • Queue disabled by default (concurrency_limit = None)
  • Monitor for issues

Phase 2: Enable for Select Actions

  • Enable queue for specific high-concurrency actions
  • Monitor ordering and performance
  • Gather metrics

Phase 3: Enable Globally

  • Set default concurrency limits
  • Enable queue for all actions
  • Document behavior change

Success Criteria

  • All tests pass (unit, integration, performance)
  • FIFO ordering guaranteed for same action
  • Completion notification releases queue slot
  • Queue stats API endpoint works
  • Documentation complete
  • No performance regression (< 5% latency increase)
  • Zero race conditions under stress test

Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Memory exhaustion | HIGH | max_queue_length config |
| Deadlock in notify | CRITICAL | Timeout on queue wait |
| Worker crash loses completion | MEDIUM | Executor timeout cleanup |
| Race in queue state | HIGH | Careful lock ordering |
| Performance regression | MEDIUM | Benchmark before/after |

Timeline

  • Total Estimate: 7.5 days (sum of the steps below)
  • Step 1 (QueueManager): 2 days
  • Step 2 (PolicyEnforcer): 1 day
  • Step 3 (EnforcementProcessor): 1 day
  • Step 4 (CompletionListener): 1 day
  • Step 5 (Worker updates): 0.5 day
  • Step 6 (API endpoint): 0.5 day
  • Step 7 (Integration tests): 1 day
  • Step 8 (Documentation): 0.5 day

Next Steps

  1. Review plan with team
  2. Create queue_manager.rs with core data structures
  3. Implement enqueue_and_wait() with tests
  4. Integrate with policy enforcer
  5. Continue with remaining steps

Related Documents:

  • work-summary/TODO.md - Phase 0.1 task list
  • docs/architecture.md - Overall system architecture
  • crates/executor/src/policy_enforcer.rs - Current policy implementation