more internal polish, resilient workers

docs/ARCHITECTURE-execution-state-ownership.md (new file, 367 lines)

# Execution State Ownership Model

**Date**: 2026-02-09
**Status**: Implemented
**Related Issues**: Duplicate completion notifications, unnecessary database updates

## Overview

This document defines the **ownership model** for execution state management in Attune. It clarifies which service is responsible for updating execution records at each stage of the lifecycle, eliminating race conditions and redundant database writes.

## The Problem

Prior to this change, both the executor and worker were updating execution state in the database, causing:

1. **Race conditions** - unclear which service's update would happen first
2. **Redundant writes** - both services writing the same status value
3. **Architectural confusion** - no clear ownership boundaries
4. **Warning logs** - duplicate completion notifications

## The Solution: Lifecycle-Based Ownership

Execution state ownership is divided based on **lifecycle stage**, with a clear handoff point:

```
┌─────────────────────────────────────────────────────────────────┐
│                       EXECUTOR OWNERSHIP                        │
│                                                                 │
│   Requested → Scheduling → Scheduled                            │
│                                        │                        │
│   (includes cancellations/failures     │                        │
│    before execution.scheduled          │                        │
│    message is published)               │                        │
│                                        │                        │
│                         Handoff Point:                          │
│              execution.scheduled message PUBLISHED              │
│                                        ▼                        │
└─────────────────────────────────────────────────────────────────┘
                                         │
                                         │  Worker receives message
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                        WORKER OWNERSHIP                         │
│                                                                 │
│   Running → Completed / Failed / Cancelled / Timeout            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Executor Responsibilities

The **Executor Service** owns execution state from creation through scheduling:

- ✅ Creates execution records (`Requested`)
- ✅ Updates status during scheduling (`Scheduling`)
- ✅ Updates status when scheduled to worker (`Scheduled`)
- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
- ✅ Handles cancellations/failures BEFORE `execution.scheduled` is published
- ❌ Does NOT update status after `execution.scheduled` is published

**Lifecycle stages**: `Requested` → `Scheduling` → `Scheduled`

**Important**: If an execution is cancelled or fails before the executor publishes `execution.scheduled`, the executor is responsible for updating the status (e.g., to `Cancelled`). The worker never learns about executions that don't reach the handoff point.

### Worker Responsibilities

The **Worker Service** owns execution state after receiving the handoff:

- ✅ Receives `execution.scheduled` message **← TAKES OWNERSHIP**
- ✅ Updates status when execution starts (`Running`)
- ✅ Updates status when execution completes (`Completed`, `Failed`, etc.)
- ✅ Handles cancellations AFTER receiving `execution.scheduled`
- ✅ Updates execution result data
- ✅ Publishes `execution.status_changed` notifications
- ✅ Publishes `execution.completed` notifications
- ❌ Does NOT update status for executions it hasn't received

**Lifecycle stages**: `Running` → `Completed` / `Failed` / `Cancelled` / `Timeout`

**Important**: The worker only owns executions it has received via `execution.scheduled`. If a cancellation happens before this message is sent, the worker is never involved.

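The ownership rule above can be sketched as a small, self-contained Rust example. This is an illustration only; `Owner`, `may_update`, and `handoff_published` are hypothetical names, not part of the Attune codebase. The key point it encodes: ownership is decided by whether the `execution.scheduled` handoff message has been published, not by the status value itself.

```rust
// Hypothetical sketch of the ownership rule; names are illustrative.
#[derive(Debug, PartialEq)]
enum Owner {
    Executor,
    Worker,
}

struct Execution {
    // True once the executor has published execution.scheduled.
    handoff_published: bool,
}

impl Execution {
    // Before the handoff message is published, the executor owns all state
    // updates (including cancellations); afterwards, the worker does.
    fn owner(&self) -> Owner {
        if self.handoff_published {
            Owner::Worker
        } else {
            Owner::Executor
        }
    }

    fn may_update(&self, service: Owner) -> bool {
        self.owner() == service
    }
}

fn main() {
    let queued = Execution { handoff_published: false };
    let scheduled = Execution { handoff_published: true };

    // Pre-handoff cancellation is the executor's job; the worker never
    // learns about this execution.
    assert!(queued.may_update(Owner::Executor));
    assert!(!queued.may_update(Owner::Worker));

    // After execution.scheduled is published, only the worker writes state.
    assert!(scheduled.may_update(Owner::Worker));
    assert!(!scheduled.may_update(Owner::Executor));

    println!("ownership rule holds");
}
```

Note that a terminal status like `Cancelled` alone is ambiguous (either service can write it, depending on timing); only the handoff fact disambiguates, which is why the sketch keys off it.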
## Message Flow

### 1. Executor Creates and Schedules

```
Executor Service
├─> Creates execution (status: Requested)
├─> Updates status: Scheduling
├─> Selects worker
├─> Updates status: Scheduled
└─> Publishes: execution.scheduled → worker-specific queue
```

### 2. Worker Receives and Executes

```
Worker Service
├─> Receives: execution.scheduled
├─> Updates DB: Scheduled → Running
├─> Publishes: execution.status_changed (running)
├─> Executes action
├─> Updates DB: Running → Completed/Failed
├─> Publishes: execution.status_changed (completed/failed)
└─> Publishes: execution.completed
```

### 3. Executor Handles Orchestration

```
Executor Service (ExecutionManager)
├─> Receives: execution.status_changed
├─> Does NOT update database
├─> Handles orchestration logic:
│   ├─> Triggers workflow children (if parent completed)
│   ├─> Updates workflow state
│   └─> Manages parent-child relationships
└─> Logs event for monitoring
```

### 4. Queue Management

```
Executor Service (CompletionListener)
├─> Receives: execution.completed
├─> Releases queue slot
├─> Notifies waiting executions
└─> Updates queue statistics
```

## Database Update Rules

### Executor (Pre-Scheduling)

**File**: `crates/executor/src/scheduler.rs`

```rust
// ✅ Executor updates DB before scheduling
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;

// Publish to worker
Self::queue_to_worker(...).await?;
```

### Worker (Post-Scheduling)

**File**: `crates/worker/src/executor.rs`

```rust
// ✅ Worker updates DB when starting
async fn execute(&self, execution_id: i64) -> Result<ExecutionResult> {
    // Update status to running
    self.update_execution_status(execution_id, ExecutionStatus::Running).await?;

    // Execute action...
}

// ✅ Worker updates DB when completing
async fn handle_execution_success(&self, execution_id: i64, result: &ExecutionResult) -> Result<()> {
    let input = UpdateExecutionInput {
        status: Some(ExecutionStatus::Completed),
        result: Some(result_data),
        // ...
    };
    ExecutionRepository::update(&self.pool, execution_id, input).await?;
}
```

### Executor (Post-Scheduling)

**File**: `crates/executor/src/execution_manager.rs`

```rust
// ❌ Executor does NOT update DB after scheduling
async fn process_status_change(...) -> Result<()> {
    // Fetch execution (for orchestration logic only)
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Handle orchestration, but do NOT update DB
    match status {
        ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled => {
            Self::handle_completion(pool, publisher, &execution).await?;
        }
        _ => {}
    }

    Ok(())
}
```

## Benefits

### 1. Clear Ownership Boundaries

- No ambiguity about who updates what
- Easy to reason about system behavior
- Reduced cognitive load for developers

### 2. Eliminated Race Conditions

- Only one service updates each lifecycle stage
- No competing writes to the same fields
- Predictable state transitions

### 3. Better Performance

- No redundant database writes
- Reduced database contention
- Lower network overhead (fewer queries)

### 4. Cleaner Logs

Before:
```
executor | Updated execution 9061 status: Scheduled -> Running
executor | Updated execution 9061 status: Running -> Running
executor | Updated execution 9061 status: Completed -> Completed
executor | WARN: Completion notification for action 3 but active_count is 0
```

After:
```
executor | Execution 9061 scheduled to worker 29
worker   | Starting execution: 9061
worker   | Execution 9061 completed successfully in 142ms
executor | Execution 9061 reached terminal state: Completed, handling orchestration
```

### 5. Idempotent Message Handling

- Executor can safely receive duplicate status change messages
- Worker updates are authoritative
- No special logic needed for retries

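The skip-if-unchanged rule behind this idempotency can be illustrated with a small self-contained sketch, using an in-memory map as a stand-in for the executions table (names like `Store` are illustrative, not from the codebase):

```rust
// Hypothetical sketch: replaying the same status message is a no-op,
// so duplicate deliveries never cause extra database writes.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Status {
    Running,
    Completed,
}

struct Store {
    rows: HashMap<i64, Status>,
    writes: u32,
}

impl Store {
    fn apply(&mut self, id: i64, status: Status) {
        if self.rows.get(&id) == Some(&status) {
            return; // status unchanged: skip the write entirely
        }
        self.rows.insert(id, status);
        self.writes += 1;
    }
}

fn main() {
    let mut store = Store { rows: HashMap::new(), writes: 0 };
    store.apply(9061, Status::Running);
    store.apply(9061, Status::Completed);
    store.apply(9061, Status::Completed); // duplicate message: ignored
    assert_eq!(store.writes, 2);
    println!("writes = {}", store.writes);
}
```

Because the second `Completed` message changes nothing, a retry or redelivered message needs no dedup bookkeeping: the comparison against current state is the dedup.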
## Edge Cases & Error Handling

### Cancellation Before Handoff

**Scenario**: Execution is queued due to a concurrency policy; the user cancels before scheduling.

**Handling**:
- Execution is in the `Requested` or `Scheduling` state
- Executor updates the status to `Cancelled`
- Worker never receives `execution.scheduled`
- No worker resources consumed ✅

### Cancellation After Handoff

**Scenario**: Execution is already scheduled to a worker; the user cancels while it is running.

**Handling**:
- Worker has received `execution.scheduled` and owns the execution
- Worker updates status: `Running` → `Cancelled`
- Worker publishes a status change notification
- Executor handles orchestration (e.g., skips workflow children)

### Worker Crashes Before Updating Status

**Scenario**: Worker receives `execution.scheduled` but crashes before updating the status to `Running`.

**Handling**:
- Execution remains in the `Scheduled` state
- Worker owned the execution but failed to update it
- Executor's heartbeat monitoring detects stale scheduled executions
- After a timeout, the executor can reschedule to another worker or mark the execution as abandoned
- Idempotent: if the worker already started, duplicate scheduling is rejected
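The staleness check described above can be sketched as a self-contained Rust predicate (a hypothetical illustration; the field names and the timeout value are not taken from the codebase):

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    Scheduled,
    Running,
}

struct Execution {
    status: ExecutionStatus,
    scheduled_at: SystemTime,
}

// An execution is stale when it has sat in `Scheduled` longer than the
// heartbeat timeout without the worker ever moving it to `Running`.
fn is_stale(execution: &Execution, now: SystemTime, timeout: Duration) -> bool {
    execution.status == ExecutionStatus::Scheduled
        && now
            .duration_since(execution.scheduled_at)
            .map(|age| age > timeout)
            .unwrap_or(false)
}

fn main() {
    let now = SystemTime::now();
    let timeout = Duration::from_secs(300);

    let stale = Execution {
        status: ExecutionStatus::Scheduled,
        scheduled_at: now - Duration::from_secs(600),
    };
    let fresh = Execution {
        status: ExecutionStatus::Scheduled,
        scheduled_at: now,
    };
    // Once the worker has moved it to Running, it is no longer a candidate.
    let running = Execution {
        status: ExecutionStatus::Running,
        scheduled_at: now - Duration::from_secs(600),
    };

    assert!(is_stale(&stale, now, timeout));
    assert!(!is_stale(&fresh, now, timeout));
    assert!(!is_stale(&running, now, timeout));
}
```

A periodic task running this predicate over `Scheduled` rows is all the detection requires; the reschedule-or-abandon decision stays with the executor, consistent with the ownership model.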

### Message Delivery Delays

**Scenario**: Worker updates the DB but the `execution.status_changed` message is delayed.

**Handling**:
- Database reflects the correct state (source of truth)
- Executor eventually receives the notification and handles orchestration
- Orchestration logic is idempotent (safe to call multiple times)
- Critical: workflows may see a slight delay, but remain consistent

### Partial Failures

**Scenario**: Worker updates the DB successfully but fails to publish the notification.

**Handling**:
- Database has the correct state (worker succeeded)
- Executor won't trigger orchestration until the notification arrives
- Future enhancement: periodic executor polling for stale completions
- Workaround: worker retries message publishing with exponential backoff

## Migration Notes

### Changes Required

1. **Executor Service** (`execution_manager.rs`):
   - ✅ Removed database updates from `process_status_change()`
   - ✅ Changed to read-only orchestration handler
   - ✅ Updated logs to reflect observer role

2. **Worker Service** (`service.rs`):
   - ✅ Already updates DB directly (no changes needed)
   - ✅ Updated comment: "we'll update the database directly"

3. **Documentation**:
   - ✅ Updated module docs to reflect ownership model
   - ✅ Added ownership boundaries to architecture docs

### Backward Compatibility

- ✅ No breaking changes to external APIs
- ✅ Message formats unchanged
- ✅ Database schema unchanged
- ✅ Workflow behavior unchanged

## Testing Strategy

### Unit Tests

- ✅ Executor tests verify no DB updates after scheduling
- ✅ Worker tests verify DB updates at all lifecycle stages
- ✅ Message handler tests verify orchestration without DB writes

### Integration Tests

- Test full execution lifecycle end-to-end
- Verify status transitions in database
- Confirm orchestration logic (workflow children) still works
- Test failure scenarios (worker crashes, message delays)

### Monitoring

Monitor for:
- Executions stuck in `Scheduled` state (worker not picking up)
- Large delays between status changes (message queue lag)
- Workflow children not triggering (orchestration failure)

## Future Enhancements

### 1. Executor Polling for Stale Completions

If `execution.status_changed` messages are lost, the executor could periodically poll for completed executions that haven't triggered orchestration.

### 2. Worker Health Checks

More robust detection of worker failures before scheduled executions time out.

### 3. Explicit Handoff Messages

Consider adding an `execution.handoff` message to explicitly mark the ownership transfer point.

## References

- **Architecture Doc**: `docs/architecture/executor-service.md`
- **Work Summary**: `work-summary/2026-02-09-duplicate-completion-fix.md`
- **Bug Fix Doc**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
- **ExecutionManager**: `crates/executor/src/execution_manager.rs`
- **Worker Executor**: `crates/worker/src/executor.rs`
- **Worker Service**: `crates/worker/src/service.rs`

## Summary

The execution state ownership model provides **clear, lifecycle-based boundaries** for who updates execution records:

- **Executor**: Owns state from creation through scheduling (including pre-handoff cancellations)
- **Worker**: Owns state after receiving the `execution.scheduled` message
- **Handoff**: Occurs when the `execution.scheduled` message is **published to the worker**
- **Key Principle**: The worker only knows about executions it receives; pre-handoff cancellations are the executor's responsibility

This eliminates race conditions, reduces database load, and provides a clean architectural foundation for future enhancements.

docs/BUGFIX-duplicate-completion-2026-02-09.md (new file, 342 lines)

# Bug Fix: Duplicate Completion Notifications & Unnecessary Database Updates

**Date**: 2026-02-09
**Component**: Executor Service (ExecutionManager)
**Issue Type**: Performance & Correctness

## Overview

Fixed two related inefficiencies in the executor service:
1. **Duplicate completion notifications** causing queue manager warnings
2. **Unnecessary database updates** writing unchanged status values

---

## Problem 1: Duplicate Completion Notifications

### Symptom
```
WARN crates/executor/src/queue_manager.rs:320:
Completion notification for action 3 but active_count is 0
```

### Before Fix - Message Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                         Worker Service                          │
│                                                                 │
│  1. Completes action execution                                  │
│  2. Updates DB: status = "Completed"                            │
│  3. Publishes: execution.status_changed (status: "completed")   │
│  4. Publishes: execution.completed                              │
└───────────────┬───────────────────────────────┬─────────────────┘
                │ execution.status_changed      │ execution.completed
                ▼                               ▼
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│  ExecutionManager               │   │  CompletionListener             │
│                                 │   │                                 │
│  Receives:                      │   │  Receives: execution.completed  │
│  execution.status_changed       │   │                                 │
│                                 │   │  → notify_completion()          │
│  → handle_completion()          │   │  → Decrements active_count ✅   │
│  → publish_completion_notif()   │   └─────────────────────────────────┘
│                                 │
│  Publishes: execution.completed │
└───────────────┬─────────────────┘
                │
                ▼
┌─────────────────────────────────────────────┐
│  CompletionListener (again)                 │
│                                             │
│  Receives: execution.completed (2nd time!)  │
│  → notify_completion()                      │
│  → active_count already 0                   │
│  → ⚠️ WARNING LOGGED                        │
└─────────────────────────────────────────────┘

Result: 2x completion notifications, 1x warning
```

### After Fix - Message Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                         Worker Service                          │
│                                                                 │
│  1. Completes action execution                                  │
│  2. Updates DB: status = "Completed"                            │
│  3. Publishes: execution.status_changed (status: "completed")   │
│  4. Publishes: execution.completed                              │
└───────────────┬───────────────────────────────┬─────────────────┘
                │ execution.status_changed      │ execution.completed
                ▼                               ▼
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│  ExecutionManager               │   │  CompletionListener             │
│                                 │   │                                 │
│  Receives:                      │   │  Receives: execution.completed  │
│  execution.status_changed       │   │                                 │
│                                 │   │  → notify_completion()          │
│  → handle_completion()          │   │  → Decrements active_count ✅   │
│  → Handles workflow children    │   └─────────────────────────────────┘
│  → NO completion publish ✅     │
└─────────────────────────────────┘

Result: 1x completion notification, 0x warnings ✅
```

---

## Problem 2: Unnecessary Database Updates

### Symptom
```
INFO crates/executor/src/execution_manager.rs:108:
Updated execution 9061 status: Completed -> Completed
```

### Before Fix - Status Update Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                         Worker Service                          │
│                                                                 │
│  1. Completes action execution                                  │
│  2. ExecutionRepository::update()                               │
│     status: Running → Completed ✅                              │
│  3. Publishes: execution.status_changed (status: "completed")   │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 │  Message Queue
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        ExecutionManager                         │
│                                                                 │
│  1. Receives: execution.status_changed (status: "completed")    │
│  2. Fetches execution from DB                                   │
│     Current status: Completed                                   │
│  3. Sets: execution.status = Completed (same value)             │
│  4. ExecutionRepository::update()                               │
│     status: Completed → Completed ❌                            │
│  5. Logs: "Updated execution 9061 status:                       │
│     Completed -> Completed"                                     │
└─────────────────────────────────────────────────────────────────┘

Result: 2x database writes for same status value
```

### After Fix - Status Update Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                         Worker Service                          │
│                                                                 │
│  1. Completes action execution                                  │
│  2. ExecutionRepository::update()                               │
│     status: Running → Completed ✅                              │
│  3. Publishes: execution.status_changed (status: "completed")   │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 │  Message Queue
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        ExecutionManager                         │
│                                                                 │
│  1. Receives: execution.status_changed (status: "completed")    │
│  2. Fetches execution from DB                                   │
│     Current status: Completed                                   │
│  3. Compares: old_status (Completed) == new_status (Completed)  │
│  4. Skips database update ✅                                    │
│  5. Still handles orchestration (workflow children)             │
│  6. Logs: "Execution 9061 status unchanged, skipping update"    │
└─────────────────────────────────────────────────────────────────┘

Result: 1x database write (only when status changes) ✅
```

---

## Code Changes

### Change 1: Remove Duplicate Completion Publication

**File**: `crates/executor/src/execution_manager.rs`

```rust
// BEFORE
async fn handle_completion(...) -> Result<()> {
    // Handle workflow children...

    // Publish completion notification
    Self::publish_completion_notification(pool, publisher, execution).await?;
    // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    // DUPLICATE - worker already did this!

    Ok(())
}
```

```rust
// AFTER
async fn handle_completion(...) -> Result<()> {
    // Handle workflow children...

    // NOTE: Completion notification is published by the worker, not here.
    // This prevents duplicate execution.completed messages that would cause
    // the queue manager to decrement active_count twice.

    Ok(())
}

// Removed entire publish_completion_notification() method
```

### Change 2: Skip Unnecessary Database Updates

**File**: `crates/executor/src/execution_manager.rs`

```rust
// BEFORE
async fn process_status_change(...) -> Result<()> {
    let mut execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    let old_status = execution.status.clone();
    execution.status = status; // Always set, even if same

    ExecutionRepository::update(pool, execution.id, execution.clone().into()).await?;
    // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    // ALWAYS writes, even if unchanged!

    info!("Updated execution {} status: {:?} -> {:?}", execution_id, old_status, status);

    // Handle completion logic...
    Ok(())
}
```

```rust
// AFTER
async fn process_status_change(...) -> Result<()> {
    let mut execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    let old_status = execution.status.clone();

    // Skip update if status hasn't changed
    if old_status == status {
        debug!("Execution {} status unchanged ({:?}), skipping database update",
            execution_id, status);

        // Still handle completion logic for orchestration (e.g., workflow children)
        if matches!(status, ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled) {
            Self::handle_completion(pool, publisher, &execution).await?;
        }

        return Ok(()); // Early return - no DB write
    }

    execution.status = status;
    ExecutionRepository::update(pool, execution.id, execution.clone().into()).await?;

    info!("Updated execution {} status: {:?} -> {:?}", execution_id, old_status, status);

    // Handle completion logic...
    Ok(())
}
```

---

## Impact & Benefits

### Performance Improvements

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Completion messages per execution | 2 | 1 | **50% reduction** |
| Queue manager warnings | Frequent | None | **100% elimination** |
| Database writes (no status change) | Always | Never | **100% elimination** |
| Log noise | High | Low | **Significant reduction** |

### Typical Execution Flow

**Before fixes**:
- 1x execution completed
- 2x `execution.completed` messages published
- 1x unnecessary database write (Completed → Completed)
- 1x queue manager warning
- Noisy logs with redundant "status: Completed -> Completed" messages

**After fixes**:
- 1x execution completed
- 1x `execution.completed` message published (worker only)
- 0x unnecessary database writes
- 0x queue manager warnings
- Clean, informative logs

### High-Throughput Scenarios

At **1000 executions/minute**:

**Before**:
- 2000 completion messages/min
- ~1000 unnecessary DB writes/min
- ~1000 warning logs/min

**After**:
- 1000 completion messages/min (50% reduction)
- 0 unnecessary DB writes (100% reduction)
- 0 warning logs (100% reduction)

---

## Testing

- ✅ All 58 executor unit tests pass
- ✅ Zero compiler warnings
- ✅ No breaking changes to external behavior
- ✅ Orchestration logic (workflow children) still works correctly

---

## Architecture Clarifications

### Separation of Concerns

| Component | Responsibility |
|-----------|----------------|
| **Worker** | Authoritative source for execution completion, publishes completion notifications |
| **Executor** | Orchestration (workflows, child executions), NOT completion notifications |
| **CompletionListener** | Queue management (releases slots for queued executions) |

### Idempotency

The executor is now **idempotent** with respect to status change messages:
- Receiving the same status change multiple times has no effect after the first
- Database is only written when state actually changes
- Orchestration logic (workflows) runs correctly regardless

---

## Lessons Learned

1. **Message publishers should be explicit** - Only one component should publish a given message type
2. **Always check for actual changes** - Don't blindly write to the database without comparing old and new values
3. **Separate orchestration from notification** - Workflow logic shouldn't trigger duplicate notifications
4. **Log levels matter** - Changed redundant updates from INFO to DEBUG to reduce noise
5. **Trust the source** - The worker owns the execution lifecycle; the executor shouldn't second-guess it

---

## Related Documentation

- Work Summary: `attune/work-summary/2026-02-09-duplicate-completion-fix.md`
- Queue Manager: `attune/crates/executor/src/queue_manager.rs`
- Completion Listener: `attune/crates/executor/src/completion_listener.rs`
- Execution Manager: `attune/crates/executor/src/execution_manager.rs`

docs/QUICKREF-dotenv-shell-actions.md (new file, 337 lines)

# Quick Reference: DOTENV Shell Actions Pattern

**Purpose:** Standard pattern for writing portable shell actions without external dependencies like `jq`.

## Core Principles

1. **Use POSIX shell** (`#!/bin/sh`), not bash
2. **Read parameters in DOTENV format** from stdin
3. **No external JSON parsers** (jq, yq, etc.)
4. **Minimal dependencies** (only POSIX utilities + curl)

## Complete Template

```sh
#!/bin/sh
# Action Name - Core Pack
# Brief description of what this action does
#
# This script uses pure POSIX shell without external dependencies like jq.
# It reads parameters in DOTENV format from stdin until the delimiter.

set -e

# Initialize variables with defaults
param1=""
param2="default_value"
bool_param="false"
numeric_param="0"

# Read DOTENV-formatted parameters from stdin until delimiter
while IFS= read -r line; do
    # Check for parameter delimiter
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*)
            break
            ;;
    esac
    [ -z "$line" ] && continue

    key="${line%%=*}"
    value="${line#*=}"

    # Remove quotes if present (both single and double)
    case "$value" in
        \"*\")
            value="${value#\"}"
            value="${value%\"}"
            ;;
        \'*\')
            value="${value#\'}"
            value="${value%\'}"
            ;;
    esac

    # Process parameters
    case "$key" in
        param1)
            param1="$value"
            ;;
        param2)
            param2="$value"
            ;;
        bool_param)
            bool_param="$value"
            ;;
        numeric_param)
            numeric_param="$value"
            ;;
    esac
done

# Normalize boolean values
case "$bool_param" in
    true|True|TRUE|yes|Yes|YES|1) bool_param="true" ;;
    *) bool_param="false" ;;
esac

# Validate numeric parameters
case "$numeric_param" in
    ''|*[!0-9]*)
        echo "ERROR: numeric_param must be a positive integer" >&2
        exit 1
        ;;
esac

# Validate required parameters
if [ -z "$param1" ]; then
    echo "ERROR: param1 is required" >&2
    exit 1
fi

# Action logic goes here
echo "Processing with param1=$param1, param2=$param2"

# Exit successfully
exit 0
```
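To sanity-check the parsing loop locally, the core of the template can be run against a here-document standing in for the executor's stdin (a smoke-test sketch, not part of the pack; `param1` and `bool_param` are the template's own sample names):

```sh
#!/bin/sh
# Exercise the DOTENV parsing loop from the template against sample input.
param1=""
bool_param="false"
while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;
    esac
    [ -z "$line" ] && continue
    key="${line%%=*}"
    value="${line#*=}"
    case "$value" in
        \"*\") value="${value#\"}"; value="${value%\"}" ;;
    esac
    case "$key" in
        param1) param1="$value" ;;
        bool_param) bool_param="$value" ;;
    esac
done <<EOF
param1="hello"
bool_param=true
---ATTUNE_PARAMS_END---
ignored_after_delimiter=1
EOF
echo "param1=$param1 bool_param=$bool_param"
# Prints: param1=hello bool_param=true
```

Note that the line after the delimiter is never read: the `break` fires first, which is exactly the behavior actions rely on.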

## YAML Metadata Configuration

```yaml
ref: core.action_name
label: "Action Name"
description: "Brief description"
enabled: true
runner_type: shell
entry_point: action_name.sh

# IMPORTANT: Use dotenv format for POSIX shell compatibility
parameter_delivery: stdin
parameter_format: dotenv

# Output format (text or json)
output_format: text

parameters:
  type: object
  properties:
    param1:
      type: string
      description: "First parameter"
    param2:
      type: string
      description: "Second parameter"
      default: "default_value"
    bool_param:
      type: boolean
      description: "Boolean parameter"
      default: false
  required:
    - param1
```

## Common Patterns

### 1. Parameter Parsing

**Read until delimiter:**
```sh
while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;
    esac
done
```

**Extract key-value:**
```sh
key="${line%%=*}"    # Everything before first =
value="${line#*=}"   # Everything after first =
```

**Remove quotes:**
```sh
case "$value" in
    \"*\") value="${value#\"}"; value="${value%\"}" ;;
    \'*\') value="${value#\'}"; value="${value%\'}" ;;
esac
```

### 2. Boolean Normalization

```sh
case "$bool_param" in
    true|True|TRUE|yes|Yes|YES|1) bool_param="true" ;;
    *) bool_param="false" ;;
esac
```

### 3. Numeric Validation

```sh
case "$number" in
    ''|*[!0-9]*)
        echo "ERROR: must be a number" >&2
        exit 1
        ;;
esac
```

### 4. JSON Output (without jq)

**Escape special characters:**
```sh
escaped=$(printf '%s' "$value" | sed 's/\\/\\\\/g; s/"/\\"/g')
```

**Build JSON:**
```sh
cat <<EOF
{
  "field": "$escaped",
  "boolean": $bool_value,
  "number": $number
}
EOF
```
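Putting the two snippets together, a value containing quotes and backslashes round-trips into valid JSON like this (a standalone sketch; `message` is an arbitrary field name chosen for the example):

```sh
#!/bin/sh
# Escape, then interpolate into a heredoc - no jq required.
value='say "hi" to C:\temp'
escaped=$(printf '%s' "$value" | sed 's/\\/\\\\/g; s/"/\\"/g')
cat <<EOF
{"message": "$escaped"}
EOF
# Prints: {"message": "say \"hi\" to C:\\temp"}
```

The backslash substitution must run first; reversing the two `sed` expressions would double-escape the backslashes introduced for the quotes.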

### 5. Making HTTP Requests

**With curl and temp files:**
```sh
temp_response=$(mktemp)
cleanup() { rm -f "$temp_response"; }
trap cleanup EXIT

http_code=$(curl -X POST \
    -H "Content-Type: application/json" \
    ${api_token:+-H "Authorization: Bearer ${api_token}"} \
    -d "$request_body" \
    -s \
    -w "%{http_code}" \
    -o "$temp_response" \
    --max-time 60 \
    "${api_url}/api/v1/endpoint" 2>/dev/null || echo "000")

if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
    cat "$temp_response"
    exit 0
else
    echo "ERROR: API call failed (HTTP $http_code)" >&2
    exit 1
fi
```
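
When a temp file is inconvenient, the status code can instead be split off the end of the combined output, since `%{http_code}` is always exactly three characters. A sketch with a canned string standing in for real `curl -s -w "%{http_code}"` output (no `-o`):

```shell
#!/bin/sh
# Split "body + 3-digit status" using only POSIX parameter expansion.
# $raw stands in for the captured curl output.
raw='{"ok":true}200'

body=${raw%???}              # strip the trailing 3 characters
http_code=${raw#"$body"}     # what was stripped: the status itself

echo "code=$http_code"
echo "body=$body"
```

This only works if the body itself is not empty-or-ambiguous in a way that matters to you; the temp-file approach above is the more robust default.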

### 6. Extracting JSON Fields (simple cases)

**Extract field value:**
```sh
case "$response" in
    *'"field":'*)
        value=$(printf '%s' "$response" | sed -n 's/.*"field":[[:space:]]*"\([^"]*\)".*/\1/p')
        ;;
esac
```

**Note:** For complex JSON, consider having the API return the exact format needed. (`[[:space:]]` is used instead of the GNU-only `\s` so the pattern also works with busybox sed on Alpine.)
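
For example, against a small canned response the pattern pulls the field out directly:

```shell
#!/bin/sh
# Extract "field" from a flat JSON response without jq.
# The $response value is illustrative.
response='{"status": "ok", "field": "hello", "count": 3}'

value=""
case "$response" in
    *'"field":'*)
        value=$(printf '%s' "$response" | sed -n 's/.*"field":[[:space:]]*"\([^"]*\)".*/\1/p')
        ;;
esac

echo "field=$value"
```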

## Anti-Patterns (DO NOT DO)

❌ **Using jq:**
```sh
value=$(echo "$json" | jq -r '.field')  # NO!
```

❌ **Using bash-specific features:**
```sh
#!/bin/bash              # NO! Use #!/bin/sh
[[ "$var" == "value" ]]  # NO! Use [ "$var" = "value" ]
```

❌ **Reading JSON directly from stdin:**
```yaml
parameter_format: json  # NO! Use dotenv
```

❌ **Using Python/Node.js in core pack:**
```yaml
runner_type: python  # NO! Use shell for core pack
```

## Testing Checklist

- [ ] Script has `#!/bin/sh` shebang
- [ ] Script is executable (`chmod +x`)
- [ ] All parameters have defaults or validation
- [ ] Boolean values are normalized
- [ ] Numeric values are validated
- [ ] Required parameters are checked
- [ ] Error messages go to stderr (`>&2`)
- [ ] Successful output goes to stdout
- [ ] Temp files are cleaned up (trap handler)
- [ ] YAML has `parameter_format: dotenv`
- [ ] YAML has `runner_type: shell`
- [ ] No `jq`, `yq`, or bash-isms used
- [ ] Works on Alpine Linux (minimal environment)

## Examples from Core Pack

### Simple Action (echo.sh)
- Minimal parameter parsing
- Single string parameter
- Text output

### Complex Action (http_request.sh)
- Multiple parameters (headers, query params)
- HTTP client implementation
- JSON output construction
- Error handling

### API Wrapper (register_packs.sh)
- JSON request body construction
- API authentication
- Response parsing
- Structured error messages

## DOTENV Format Specification

**Format:** Each parameter on a new line as `key=value`

**Example:**
```
param1="string value"
param2=42
bool_param=true
---ATTUNE_PARAMS_END---
```

**Key Rules:**
- Parameters end with `---ATTUNE_PARAMS_END---` delimiter
- Values may be quoted (single or double quotes)
- Empty lines are skipped
- No multiline values (use base64 if needed)
- Array/object parameters passed as JSON strings
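
The multiline rule in practice: the caller base64-encodes the value so it travels on a single line, and the action decodes it (the `notes` variable is illustrative):

```shell
#!/bin/sh
# Multiline values are not allowed in dotenv, so they are base64-encoded
# into a single line by the caller and decoded inside the action.
multiline='line one
line two'

encoded=$(printf '%s' "$multiline" | base64 | tr -d '\n')   # one line on the wire
notes=$(printf '%s' "$encoded" | base64 -d)                 # decode in the action

echo "$notes"
```

`tr -d '\n'` strips the line wrapping some `base64` implementations add; `base64 -d` is supported by both GNU coreutils and busybox.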

## When to Use This Pattern

✅ **Use DOTENV shell pattern for:**
- Core pack actions
- Simple utility actions
- Actions that need maximum portability
- Actions that run in minimal containers
- Actions that don't need complex JSON parsing

❌ **Consider other runtimes if you need:**
- Complex JSON manipulation
- External libraries (AWS SDK, etc.)
- Advanced string processing
- Parallel processing
- Language-specific features

## Further Reading

- `packs/core/actions/echo.sh` - Simplest example
- `packs/core/actions/http_request.sh` - Complex example
- `packs/core/actions/register_packs.sh` - API wrapper example
- `docs/pack-structure.md` - Pack development guide
204
docs/QUICKREF-execution-state-ownership.md
Normal file
@@ -0,0 +1,204 @@
# Quick Reference: Execution State Ownership

**Last Updated**: 2026-02-09

## Ownership Model at a Glance

```
┌───────────────────────────────┬──────────────────────────┐
│         EXECUTOR OWNS         │       WORKER OWNS        │
│  Requested                    │  Running                 │
│  Scheduling                   │  Completed               │
│  Scheduled                    │  Failed                  │
│  (+ pre-handoff Cancelled)    │  (+ post-handoff         │
│                               │   Cancelled/Timeout/     │
│                               │   Abandoned)             │
└───────────────────────────────┴──────────────────────────┘
                │                          │
                └──────── HANDOFF ─────────┘
                 execution.scheduled PUBLISHED
```

## Who Updates the Database?

### Executor Updates (Pre-Handoff Only)
- ✅ Creates execution record
- ✅ Updates status: `Requested` → `Scheduling` → `Scheduled`
- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
- ✅ Handles cancellations/failures BEFORE handoff (worker never notified)
- ❌ NEVER updates after `execution.scheduled` is published

### Worker Updates (Post-Handoff Only)
- ✅ Receives `execution.scheduled` message (takes ownership)
- ✅ Updates status: `Scheduled` → `Running`
- ✅ Updates status: `Running` → `Completed`/`Failed`/`Cancelled`/etc.
- ✅ Handles cancellations/failures AFTER handoff
- ✅ Updates result data
- ✅ Writes for every status change after receiving handoff

## Who Publishes Messages?

### Executor Publishes
- `enforcement.created` (from rules)
- `execution.requested` (to scheduler)
- `execution.scheduled` (to worker) **← HANDOFF MESSAGE - OWNERSHIP TRANSFER**

### Worker Publishes
- `execution.status_changed` (for each status change after handoff)
- `execution.completed` (when done)

### Executor Receives (But Doesn't Update DB Post-Handoff)
- `execution.status_changed` → triggers orchestration logic (read-only)
- `execution.completed` → releases queue slots

## Code Locations

### Executor Updates DB
```rust
// crates/executor/src/scheduler.rs
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;
```

### Worker Updates DB
```rust
// crates/worker/src/executor.rs
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
// ...
ExecutionRepository::update(&self.pool, execution_id, input).await?;
```

### Executor Orchestrates (Read-Only)
```rust
// crates/executor/src/execution_manager.rs
async fn process_status_change(...) -> Result<()> {
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    // NO UPDATE - just orchestration logic
    Self::handle_completion(pool, publisher, &execution).await?;
}
```

## Decision Tree: Should I Update the DB?

```
Are you in the Executor?
├─ Have you published execution.scheduled for this execution?
│  ├─ NO → Update DB (you own it)
│  │      └─ Includes: Requested/Scheduling/Scheduled/pre-handoff Cancelled
│  └─ YES → Don't update DB (worker owns it now)
│          └─ Just orchestrate (trigger workflows, etc)
│
Are you in the Worker?
├─ Have you received execution.scheduled for this execution?
│  ├─ YES → Update DB for ALL status changes (you own it)
│  │       └─ Includes: Running/Completed/Failed/post-handoff Cancelled/etc.
│  └─ NO → Don't touch this execution (doesn't exist for you yet)
```

## Common Patterns

### ✅ DO: Worker Updates After Handoff
```rust
// Worker receives execution.scheduled
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
self.publish_status_update(execution_id, ExecutionStatus::Running).await?;
```

### ✅ DO: Executor Orchestrates Without DB Write
```rust
// Executor receives execution.status_changed
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
if status == ExecutionStatus::Completed {
    Self::trigger_child_executions(pool, publisher, &execution).await?;
}
```

### ❌ DON'T: Executor Updates After Handoff
```rust
// Executor receives execution.status_changed
execution.status = status;
ExecutionRepository::update(pool, execution.id, execution).await?; // ❌ WRONG!
```

### ❌ DON'T: Worker Updates Before Handoff
```rust
// Worker updates execution it hasn't received via execution.scheduled
ExecutionRepository::update(&self.pool, execution_id, input).await?; // ❌ WRONG!
```

### ✅ DO: Executor Handles Pre-Handoff Cancellation
```rust
// User cancels execution before it's scheduled to a worker.
// Execution is still in Requested/Scheduling state.
execution.status = ExecutionStatus::Cancelled;
ExecutionRepository::update(pool, execution_id, execution).await?; // ✅ CORRECT!
// Worker never receives execution.scheduled, never knows execution existed
```

### ✅ DO: Worker Handles Post-Handoff Cancellation
```rust
// Worker received execution.scheduled, now owns the execution.
// User cancels execution while it's running.
execution.status = ExecutionStatus::Cancelled;
ExecutionRepository::update(&self.pool, execution_id, execution).await?; // ✅ CORRECT!
self.publish_status_update(execution_id, ExecutionStatus::Cancelled).await?;
```

## Handoff Checklist

When an execution is scheduled:

**Executor Must**:
- [x] Update status to `Scheduled`
- [x] Write to database
- [x] Publish `execution.scheduled` message **← HANDOFF OCCURS HERE**
- [x] Stop updating this execution (ownership transferred)
- [x] Continue to handle orchestration (read-only)

**Worker Must**:
- [x] Receive `execution.scheduled` message **← OWNERSHIP RECEIVED**
- [x] Take ownership of execution state
- [x] Update DB for all future status changes
- [x] Handle any cancellations/failures after this point
- [x] Publish status notifications

**Important**: If an execution is cancelled BEFORE the executor publishes `execution.scheduled`, the executor updates the status to `Cancelled` and the worker never learns the execution existed.

## Benefits Summary

| Aspect | Benefit |
|--------|---------|
| **Race Conditions** | Eliminated - only one owner per stage |
| **DB Writes** | Reduced by ~50% - no duplicates |
| **Code Clarity** | Clear boundaries - easy to reason about |
| **Message Traffic** | Reduced - no duplicate completions |
| **Idempotency** | Safe to receive duplicate messages |

## Troubleshooting

### Execution Stuck in "Scheduled"
**Problem**: Worker not updating status to Running
**Check**: Was `execution.scheduled` published? Did the worker receive it? Is the worker healthy?

### Workflow Children Not Triggering
**Problem**: Orchestration not running
**Check**: Did the worker publish `execution.status_changed`? Is the message queue healthy?

### Duplicate Status Updates
**Problem**: Both services updating the DB
**Check**: The executor must NOT update after publishing `execution.scheduled`

### Execution Cancelled But Status Not Updated
**Problem**: Cancellation not reflected in the database
**Check**: Was it cancelled before or after handoff?
**Fix**: If before handoff → executor updates; if after handoff → worker updates

### Queue Warnings
**Problem**: Duplicate completion notifications
**Check**: Only the worker should publish `execution.completed`

## See Also

- **Full Architecture Doc**: `docs/ARCHITECTURE-execution-state-ownership.md`
- **Bug Fix Visualization**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
- **Work Summary**: `work-summary/2026-02-09-execution-state-ownership.md`
460
docs/QUICKREF-phase3-retry-health.md
Normal file
@@ -0,0 +1,460 @@
# Quick Reference: Phase 3 - Intelligent Retry & Worker Health

## Overview

Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.

**Key Features:**
- **Automatic Retry:** Failed executions automatically retry with exponential backoff
- **Health-Aware Scheduling:** Prefer healthy workers with low queue depth
- **Per-Action Configuration:** Custom timeouts and retry limits per action
- **Failure Classification:** Distinguish retriable vs non-retriable failures

## Quick Start

### Enable Retry for an Action

```yaml
# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120  # Custom timeout (overrides global 5 min)
max_retries: 3        # Retry up to 3 times on failure
parameters:
  url:
    type: string
    required: true
```

### Database Migration

```bash
# Apply Phase 3 schema changes
sqlx migrate run

# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql
```

### Check Worker Health

```bash
# View healthy workers
psql -c "SELECT * FROM healthy_workers;"

# Check specific worker health
psql -c "
SELECT
    name,
    capabilities->'health'->>'status' as health_status,
    capabilities->'health'->>'queue_depth' as queue_depth,
    capabilities->'health'->>'consecutive_failures' as failures
FROM worker
WHERE id = 1;
"
```

## Retry Behavior

### Retriable Failures

Executions are automatically retried for:
- ✓ Worker unavailable (`worker_unavailable`)
- ✓ Queue timeout/TTL expired (`queue_timeout`)
- ✓ Worker heartbeat stale (`worker_heartbeat_stale`)
- ✓ Transient errors (`transient_error`)
- ✓ Manual retry requested (`manual_retry`)

### Non-Retriable Failures

These failures are NOT retried:
- ✗ Validation errors
- ✗ Permission denied
- ✗ Action not found
- ✗ Invalid parameters
- ✗ Explicit action failure

### Retry Backoff

**Strategy:** Exponential backoff with jitter

```
Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300 seconds)
```

**Jitter:** ±20% randomization to avoid thundering herd
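
The schedule above can be sketched directly (jitter omitted so the output is deterministic; this is an illustration of the formula, not the executor's Rust implementation):

```shell
#!/bin/sh
# Exponential backoff schedule: min(base * 2^attempt, max).
base=1
max=300

attempt=0
while [ "$attempt" -le 10 ]; do
    backoff=$(( base << attempt ))               # base * 2^attempt
    [ "$backoff" -gt "$max" ] && backoff=$max    # cap at 5 minutes
    echo "attempt $attempt: ${backoff}s"
    attempt=$(( attempt + 1 ))
done
```

From attempt 9 onward the cap dominates and every delay is 300s.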

### Retry Configuration

```rust
// Default retry configuration
RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,    // 5 minutes max
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,       // 20% jitter
}
```

## Worker Health

### Health States

**Healthy:**
- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%

**Degraded:**
- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives tasks but deprioritized

**Unhealthy:**
- Heartbeat > 30 seconds old
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new tasks

### Health Metrics

Workers self-report health in capabilities:

```json
{
  "runtimes": ["shell", "python"],
  "health": {
    "status": "healthy",
    "last_check": "2026-02-09T12:00:00Z",
    "consecutive_failures": 0,
    "total_executions": 1000,
    "failed_executions": 20,
    "average_execution_time_ms": 1500,
    "queue_depth": 5
  }
}
```

### Worker Selection

**Selection Priority:**
1. Healthy workers (queue depth ascending)
2. Degraded workers (queue depth ascending)
3. Skip unhealthy workers

**Example:**
```
Worker A: Healthy,   queue=5   ← Selected first
Worker B: Healthy,   queue=20  ← Selected second
Worker C: Degraded,  queue=10  ← Selected third
Worker D: Unhealthy, queue=0   ← Never selected
```
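
The thresholds above condense into a small classifier. A sketch mirroring the documented limits (not the executor's actual Rust code; heartbeat staleness is checked separately, and failure rate is passed as a whole percentage):

```shell
#!/bin/sh
# Classify a worker from the documented thresholds:
#   unhealthy: failures >= 10, queue >= 100, or rate >= 70%
#   degraded:  failures >= 3,  queue >= 50,  or rate >= 30%
classify() {
    failures=$1; queue=$2; rate_pct=$3
    if [ "$failures" -ge 10 ] || [ "$queue" -ge 100 ] || [ "$rate_pct" -ge 70 ]; then
        echo unhealthy
    elif [ "$failures" -ge 3 ] || [ "$queue" -ge 50 ] || [ "$rate_pct" -ge 30 ]; then
        echo degraded
    else
        echo healthy
    fi
}

classify 0 5 2     # low on every axis
classify 4 10 5    # consecutive failures push it to degraded
classify 0 120 5   # queue depth alone makes it unhealthy
```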

## Database Schema

### Execution Retry Fields

```sql
-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)
```

### Action Configuration Fields

```sql
-- Added to action table
timeout_seconds INTEGER,        -- Per-action timeout override
max_retries INTEGER DEFAULT 0   -- Per-action retry limit
```

### Helper Functions

```sql
-- Check if execution can be retried
SELECT is_execution_retriable(123);

-- Get worker queue depth
SELECT get_worker_queue_depth(1);
```

### Views

```sql
-- Get all healthy workers
SELECT * FROM healthy_workers;
```

## Practical Examples

### Example 1: View Retry Chain

```sql
-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
    SELECT id, retry_count, retry_reason, original_execution, status
    FROM execution
    WHERE id = 100

    UNION ALL

    SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
    FROM execution e
    JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;
```

### Example 2: Analyze Retry Success Rate

```sql
-- Success rate of retries by reason
SELECT
    config->>'retry_reason' as reason,
    COUNT(*) as total_retries,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) as succeeded,
    ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) as success_rate
FROM execution
WHERE retry_count > 0
GROUP BY config->>'retry_reason'
ORDER BY total_retries DESC;
```

### Example 3: Find Workers by Health

```sql
-- Workers sorted by health and load
SELECT
    w.name,
    w.status,
    (w.capabilities->'health'->>'status')::TEXT as health,
    (w.capabilities->'health'->>'queue_depth')::INTEGER as queue,
    (w.capabilities->'health'->>'consecutive_failures')::INTEGER as failures,
    w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY
    CASE (w.capabilities->'health'->>'status')::TEXT
        WHEN 'healthy' THEN 1
        WHEN 'degraded' THEN 2
        WHEN 'unhealthy' THEN 3
        ELSE 4
    END,
    (w.capabilities->'health'->>'queue_depth')::INTEGER;
```

### Example 4: Manual Retry via API

```bash
# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "retry test"},
    "config": {
      "retry_of": 123,
      "retry_count": 1,
      "max_retries": 3,
      "retry_reason": "manual_retry",
      "original_execution": 123
    }
  }'
```

## Monitoring

### Key Metrics

**Retry Metrics:**
- Retry rate: % of executions that retry
- Retry success rate: % of retries that succeed
- Average retries per execution
- Retry reason distribution

**Health Metrics:**
- Healthy worker count
- Degraded worker count
- Unhealthy worker count
- Average queue depth per worker
- Average failure rate per worker

### SQL Queries

```sql
-- Retry rate over last hour
SELECT
    COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) as original_executions,
    COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) as retry_executions,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) /
          COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 2) as retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';

-- Worker health distribution
SELECT
    COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') as health_status,
    COUNT(*) as worker_count,
    AVG((capabilities->'health'->>'queue_depth')::INTEGER) as avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;
```

## Configuration

### Retry Configuration

```rust
// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
    enabled: true,
    base_backoff_secs: 1,
    max_backoff_secs: 300,
    backoff_multiplier: 2.0,
    jitter_factor: 0.2,
});
```

### Health Probe Configuration

```rust
// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
    enabled: true,
    heartbeat_max_age_secs: 30,
    degraded_threshold: 3,
    unhealthy_threshold: 10,
    queue_depth_degraded: 50,
    queue_depth_unhealthy: 100,
    failure_rate_degraded: 0.3,
    failure_rate_unhealthy: 0.7,
});
```

## Troubleshooting

### High Retry Rate

**Symptoms:** Many executions retrying repeatedly

**Causes:**
- Workers unstable or frequently restarting
- Network issues causing transient failures
- Actions not idempotent (retry makes things worse)

**Resolution:**
1. Check worker stability: `docker compose ps`
2. Review action idempotency
3. Adjust `max_retries` if retries are unhelpful
4. Investigate root cause of failures

### Retries Not Triggering

**Symptoms:** Failed executions not retrying despite `max_retries > 0`

**Causes:**
- Action doesn't have `max_retries` set
- Failure is non-retriable (validation error, etc.)
- Global retry disabled

**Resolution:**
1. Check action configuration: `SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';`
2. Check failure message for retriable patterns
3. Verify retry is enabled in executor config

### Workers Marked Unhealthy

**Symptoms:** Workers not receiving tasks

**Causes:**
- High queue depth (overloaded)
- Consecutive failures exceed threshold
- Heartbeat stale

**Resolution:**
1. Check worker logs: `docker compose logs -f worker-shell`
2. Verify heartbeat: `SELECT name, last_heartbeat FROM worker;`
3. Check queue depth in capabilities
4. Restart worker if stuck: `docker compose restart worker-shell`

### Retry Loops

**Symptoms:** Execution retries forever or retries excessively

**Causes:**
- Bug in retry reason detection
- Action failure always classified as retriable
- `max_retries` not being enforced

**Resolution:**
1. Check the retry chain: see Example 1 above
2. Verify `max_retries`: `SELECT config FROM execution WHERE id = 123;`
3. Fix retry reason classification if incorrect
4. Manually fail the execution if stuck

## Integration with Previous Phases

### Phase 1 + Phase 2 + Phase 3 Together

**Defense in Depth:**
1. **Phase 1 (Timeout Monitor):** Catches stuck SCHEDULED executions (30s-5min)
2. **Phase 2 (Queue TTL/DLQ):** Expires messages in worker queues (5min)
3. **Phase 3 (Intelligent Retry):** Retries retriable failures (1s-5min backoff)

**Failure Flow:**
```
Execution dispatched → Worker unavailable (Phase 2: 5min TTL)
                     → DLQ handler marks FAILED (Phase 2)
                     → Retry manager creates retry (Phase 3)
                     → Retry dispatched with backoff (Phase 3)
                     → Success or exhaust retries
```

**Backup Safety Net:**
If the Phase 3 retry manager fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.

## Best Practices

### Action Design for Retries

1. **Make actions idempotent:** Safe to run multiple times
2. **Set realistic timeouts:** Based on typical execution time
3. **Configure appropriate max_retries:**
   - Network calls: 3-5 retries
   - Database operations: 2-3 retries
   - External APIs: 3 retries
   - Local operations: 0-1 retries

### Worker Health Management

1. **Report queue depth regularly:** Update every heartbeat
2. **Track failure metrics:** Consecutive failures, total/failed counts
3. **Implement graceful degradation:** Continue working when degraded
4. **Fail fast when unhealthy:** Stop accepting work if overloaded

### Monitoring Strategy

1. **Alert on high retry rates:** > 20% of executions retrying
2. **Alert on unhealthy workers:** > 50% of workers unhealthy
3. **Track retry success rate:** Should be > 70%
4. **Monitor queue depths:** Average should stay < 20

## See Also

- **Architecture:** `docs/architecture/worker-availability-handling.md`
- **Phase 1 Guide:** `docs/QUICKREF-worker-availability-phase1.md`
- **Phase 2 Guide:** `docs/QUICKREF-worker-queue-ttl-dlq.md`
- **Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`
227
docs/QUICKREF-worker-heartbeat-monitoring.md
Normal file
@@ -0,0 +1,227 @@
# Quick Reference: Worker Heartbeat Monitoring

**Purpose**: Automatically detect and deactivate workers that have stopped sending heartbeats

## Overview

The executor service includes a background task that monitors worker heartbeats and automatically marks stale workers as inactive. This prevents the scheduler from attempting to assign work to workers that are no longer available.

## How It Works

### Background Monitor Task

- **Location**: `crates/executor/src/service.rs` → `worker_heartbeat_monitor_loop()`
- **Check Interval**: Every 60 seconds
- **Staleness Threshold**: 90 seconds (3x the expected 30-second heartbeat interval)

### Detection Logic

The monitor checks all workers with `status = 'active'`:

1. **No Heartbeat**: Workers with `last_heartbeat = NULL` → marked inactive
2. **Stale Heartbeat**: Workers with heartbeat older than 90 seconds → marked inactive
3. **Fresh Heartbeat**: Workers with heartbeat within 90 seconds → remain active

### Automatic Deactivation

When a stale worker is detected:
- Worker status updated to `inactive` in database
- Warning logged with worker name, ID, and heartbeat age
- Summary logged with count of deactivated workers

## Configuration

### Constants (in scheduler.rs and service.rs)

```rust
DEFAULT_HEARTBEAT_INTERVAL: 30 seconds  // Expected worker heartbeat frequency
HEARTBEAT_STALENESS_MULTIPLIER: 3       // Grace period multiplier
MAX_STALENESS: 90 seconds               // Calculated: 30 * 3
```
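
The staleness rule reduces to a single comparison on heartbeat age. A sketch with fixed epoch-second timestamps (the values are illustrative):

```shell
#!/bin/sh
# A worker is stale when its last heartbeat is older than
# interval * multiplier = 30 * 3 = 90 seconds.
interval=30
multiplier=3
max_staleness=$(( interval * multiplier ))

now=1000
last_heartbeat=880                 # 120 seconds ago
age=$(( now - last_heartbeat ))

if [ "$age" -gt "$max_staleness" ]; then
    echo "stale (${age}s old) - mark inactive"
else
    echo "fresh (${age}s old) - keep active"
fi
```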

### Check Interval

Currently hardcoded to 60 seconds. Configured when spawning the monitor task:

```rust
Self::worker_heartbeat_monitor_loop(worker_pool, 60).await;
```

## Worker Lifecycle

### Normal Operation

```
Worker Starts → Registers → Sends Heartbeats (30s) → Remains Active
```

### Graceful Shutdown

```
Worker Stops → No More Heartbeats → Monitor Detects (60s) → Marked Inactive
```

### Crash/Network Failure

```
Worker Crashes → Heartbeats Stop → Monitor Detects (60s) → Marked Inactive
```
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Check Active Workers
|
||||
|
||||
```sql
|
||||
SELECT name, worker_role, status, last_heartbeat
|
||||
FROM worker
|
||||
WHERE status = 'active'
|
||||
ORDER BY last_heartbeat DESC;
|
||||
```
|
||||
|
||||
### Check Recent Deactivations
|
||||
|
||||
```sql
|
||||
SELECT name, worker_role, status, last_heartbeat, updated
|
||||
FROM worker
|
||||
WHERE status = 'inactive'
|
||||
AND updated > NOW() - INTERVAL '5 minutes'
|
||||
ORDER BY updated DESC;
|
||||
```
|
||||
|
||||
### Count Workers by Status
|
||||
|
||||
```sql
|
||||
SELECT status, COUNT(*)
|
||||
FROM worker
|
||||
GROUP BY status;
|
||||
```
|
||||
|
||||
## Logs

### Monitor Startup

```
INFO: Starting worker heartbeat monitor...
INFO: Worker heartbeat monitor started (check interval: 60s, staleness threshold: 90s)
```

### Worker Deactivation

```
WARN: Worker sensor-77cd23b50478 (ID: 27) heartbeat is stale (1289s old), marking as inactive
INFO: Deactivated 5 worker(s) with stale heartbeats
```

### Error Handling

```
ERROR: Failed to deactivate worker worker-123 (stale heartbeat): <error details>
ERROR: Failed to query active workers for heartbeat check: <error details>
```
## Scheduler Integration

The scheduler already filters out stale workers during worker selection:

```rust
// Filter by heartbeat freshness
let fresh_workers: Vec<_> = active_workers
    .into_iter()
    .filter(|w| Self::is_worker_heartbeat_fresh(w))
    .collect();
```

**Before Heartbeat Monitor**: The scheduler filtered at selection time, but workers stayed "active" in the DB.

**After Heartbeat Monitor**: Workers are marked inactive in the DB, so the scheduler sees accurate state.
## Troubleshooting

### Workers Constantly Becoming Inactive

**Symptoms**: Active workers being marked inactive despite running
**Causes**:
- Worker heartbeat interval > 30 seconds
- Network issues preventing heartbeat messages
- Worker service crash loop

**Solutions**:
1. Check worker logs for heartbeat send attempts
2. Verify RabbitMQ connectivity
3. Check worker configuration for heartbeat interval

### Stale Workers Not Being Deactivated

**Symptoms**: Workers with old heartbeats remain active
**Causes**:
- Executor service not running
- Monitor task crashed

**Solutions**:
1. Check executor service logs
2. Verify the monitor task started: `grep "heartbeat monitor started" executor.log`
3. Restart the executor service

### Too Many Inactive Workers

**Symptoms**: Database has hundreds of inactive workers
**Causes**: Historical workers from development/testing

**Solutions**:
```sql
-- Delete inactive workers older than 7 days
DELETE FROM worker
WHERE status = 'inactive'
  AND updated < NOW() - INTERVAL '7 days';
```
## Best Practices
|
||||
|
||||
### Worker Registration
|
||||
|
||||
Workers should:
|
||||
- Set appropriate unique name (hostname-based)
|
||||
- Send heartbeat every 30 seconds
|
||||
- Handle graceful shutdown (optional: mark self inactive)
|
||||
|
||||
### Database Maintenance
|
||||
|
||||
- Periodically clean up old inactive workers
|
||||
- Monitor worker table growth
|
||||
- Index on `status` and `last_heartbeat` for efficient queries
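
Such an index could look like the following (a sketch: the index name and the `IF NOT EXISTS` guard are assumptions, so adapt it to your migration conventions):

```sql
-- Speeds up the monitor's staleness scan and the status counts above.
CREATE INDEX IF NOT EXISTS idx_worker_status_heartbeat
    ON worker (status, last_heartbeat);
```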

### Monitoring & Alerts

- Track the worker deactivation rate (should be low in production)
- Alert on sudden increases in deactivations (infrastructure issue)
- Monitor active worker count vs. expected

## Related Documentation

- `docs/architecture/worker-service.md` - Worker architecture
- `docs/architecture/executor-service.md` - Executor architecture
- `docs/deployment/ops-runbook-queues.md` - Operational procedures
- `AGENTS.md` - Project rules and conventions

## Implementation Notes

### Why 90 Seconds?

- Workers send a heartbeat every 30 seconds
- The 3x multiplier provides a grace period for:
  - Network latency
  - Brief load spikes
  - Temporary connectivity issues
- Balances responsiveness vs. false positives
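
The threshold arithmetic can be sketched as a pure function (the constant and function names here are illustrative, not taken from the codebase):

```rust
use std::time::Duration;

// Assumed defaults, mirroring the values documented above.
const HEARTBEAT_INTERVAL_SECS: u64 = 30;
const STALENESS_MULTIPLIER: u64 = 3;

/// A heartbeat is fresh while its age is below 30s * 3 = 90s.
fn is_heartbeat_fresh(age: Duration) -> bool {
    age < Duration::from_secs(HEARTBEAT_INTERVAL_SECS * STALENESS_MULTIPLIER)
}
```

One missed heartbeat (age up to roughly 60s) is still tolerated, while the 1289-second-old heartbeat from the deactivation log above is far past the threshold.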

### Why Check Every 60 Seconds?

- Two full heartbeat intervals elapse between checks
- Reduces database query frequency
- Adequate response time (stale workers are removed within about 2.5 minutes at worst)

### Thread Safety

- The monitor runs in a separate tokio task
- Uses the connection pool for database access
- No shared mutable state
- Safe to run multiple executor instances (each monitors independently)
322
docs/QUICKREF-worker-queue-ttl-dlq.md
Normal file
@@ -0,0 +1,322 @@
# Quick Reference: Worker Queue TTL and Dead Letter Queue (Phase 2)

## Overview

Phase 2 implements message TTL on worker queues and dead letter queue processing to automatically fail executions when workers are unavailable.

**Key Concept:** If a worker doesn't process an execution within 5 minutes, the message expires and the execution is automatically marked as FAILED.

## How It Works

```
Execution → Worker Queue (TTL: 5 min) → Worker Processing ✓
                    ↓ (if timeout)
            Dead Letter Exchange
                    ↓
            Dead Letter Queue
                    ↓
            DLQ Handler (in Executor)
                    ↓
            Execution marked FAILED
```

## Configuration

### Default Settings (All Environments)

```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours DLQ retention
```

### Tuning TTL

**Worker Queue TTL** (`worker_queue_ttl_ms`):
- **Default:** 300000 (5 minutes)
- **Purpose:** How long to wait before declaring a worker unavailable
- **Tuning:** Set to 2-5x your typical execution time
- **Too short:** Slow executions fail prematurely
- **Too long:** Delayed failure detection for unavailable workers

**DLQ Retention** (`dead_letter.ttl_ms`):
- **Default:** 86400000 (24 hours)
- **Purpose:** How long to keep expired messages for debugging
- **Tuning:** Based on your debugging/forensics needs
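
The 2-5x guideline can be written down as a tiny helper (hypothetical; not part of the codebase):

```rust
/// Suggested worker-queue TTL bounds in milliseconds, following the
/// "2-5x typical execution time" guideline above.
fn suggested_ttl_range_ms(typical_execution_ms: u64) -> (u64, u64) {
    (typical_execution_ms * 2, typical_execution_ms * 5)
}
```

For a typical 1-minute execution this yields 120000-300000 ms; the upper bound matches the 5-minute default.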

## Components

### 1. Worker Queue TTL

- Applied to all `worker.{id}.executions` queues
- Configured via the RabbitMQ queue argument `x-message-ttl`
- Messages expire if not consumed within the TTL
- Expired messages are routed to the dead letter exchange

### 2. Dead Letter Exchange (DLX)

- **Name:** `attune.dlx`
- **Type:** `direct`
- Receives all expired messages from worker queues
- Routes them to the dead letter queue

### 3. Dead Letter Queue (DLQ)

- **Name:** `attune.dlx.queue`
- Stores expired messages for processing
- Retains messages for 24 hours (configurable)
- Processed by the dead letter handler

### 4. Dead Letter Handler

- Runs in the executor service
- Consumes messages from the DLQ
- Updates executions to FAILED status
- Provides descriptive error messages

## Monitoring

### Key Metrics

```bash
# Check DLQ depth
rabbitmqadmin list queues name messages | grep attune.dlx.queue

# View DLQ rate
# Watch for sustained DLQ message rate > 10/min

# Check failed executions
curl http://localhost:8080/api/v1/executions?status=failed
```

### Health Checks

**Good:**
- DLQ depth: 0-10
- DLQ rate: < 5 messages/min
- Most executions complete successfully

**Warning:**
- DLQ depth: 10-100
- DLQ rate: 5-20 messages/min
- May indicate worker instability

**Critical:**
- DLQ depth: > 100
- DLQ rate: > 20 messages/min
- Workers likely down or overloaded
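
These thresholds can be folded into a small classifier for alerting (a sketch; the type and function names are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum DlqHealth {
    Good,
    Warning,
    Critical,
}

/// Classify DLQ state using the depth and rate thresholds above.
fn classify_dlq(depth: u64, rate_per_min: u64) -> DlqHealth {
    if depth > 100 || rate_per_min > 20 {
        DlqHealth::Critical
    } else if depth > 10 || rate_per_min >= 5 {
        DlqHealth::Warning
    } else {
        DlqHealth::Good
    }
}
```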

## Troubleshooting

### High DLQ Rate

**Symptoms:** Many executions failing via the DLQ

**Common Causes:**
1. Workers stopped or restarting
2. Workers overloaded (not consuming fast enough)
3. TTL too aggressive for your workload
4. Network connectivity issues

**Resolution:**
```bash
# 1. Check worker status
docker compose ps | grep worker
docker compose logs -f worker-shell

# 2. Verify worker heartbeats
psql -c "SELECT name, status, last_heartbeat FROM worker;"

# 3. Check worker queue depths
rabbitmqadmin list queues name messages | grep "worker\."

# 4. Consider increasing the TTL if executions are legitimately slow
# Edit config and restart executor:
# worker_queue_ttl_ms: 600000  # 10 minutes
```

### DLQ Not Processing

**Symptoms:** DLQ depth increasing, executions stuck

**Common Causes:**
1. Executor service not running
2. DLQ disabled in config
3. Database connection issues

**Resolution:**
```bash
# 1. Verify the executor is running
docker compose ps executor
docker compose logs -f executor | grep "dead letter"

# 2. Check configuration
grep -A 3 "dead_letter:" config.docker.yaml

# 3. Restart the executor if needed
docker compose restart executor
```

### Messages Not Expiring

**Symptoms:** Executions stuck in SCHEDULED, DLQ empty

**Common Causes:**
1. Worker queues not configured with a TTL
2. Worker queues not configured with a DLX
3. Infrastructure setup failed

**Resolution:**
```bash
# 1. Check queue properties
rabbitmqadmin show queue name=worker.1.executions

# Look for:
# - arguments.x-message-ttl: 300000
# - arguments.x-dead-letter-exchange: attune.dlx

# 2. Recreate infrastructure (safe, idempotent)
docker compose restart executor worker-shell
```

## Testing

### Manual Test: Verify TTL Expiration

```bash
# 1. Stop all workers
docker compose stop worker-shell worker-python worker-node

# 2. Create an execution
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"}
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Check execution status
curl http://localhost:8080/api/v1/executions/{id} | jq '.data.status'
# Should be "failed"

# 5. Check the error message
curl http://localhost:8080/api/v1/executions/{id} | jq '.data.result'
# Should contain "Worker queue TTL expired"

# 6. Verify the DLQ processed it
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```

## Relationship to Phase 1

**Phase 1 (Timeout Monitor):**
- Monitors executions in the SCHEDULED state
- Fails executions after a configured timeout
- Acts as a backup safety net

**Phase 2 (Queue TTL + DLQ):**
- Expires messages at the queue level
- More precise failure detection
- Provides better visibility (DLQ metrics)

**Together:** They provide defense in depth against worker unavailability.

## Common Operations

### View DLQ Messages

```bash
# Get messages from the DLQ (doesn't remove them)
rabbitmqadmin get queue=attune.dlx.queue count=10

# View the x-death header for expiration details
rabbitmqadmin get queue=attune.dlx.queue count=1 --format=long
```

### Manually Purge DLQ

```bash
# Use with caution - removes all messages
rabbitmqadmin purge queue name=attune.dlx.queue
```

### Temporarily Disable DLQ

```yaml
# config.docker.yaml
message_queue:
  rabbitmq:
    dead_letter:
      enabled: false  # Disables DLQ handler
```

**Note:** Messages will still expire but won't be processed.

### Adjust TTL Without Restart

Not possible: queue TTL is set at queue creation time. To change it:

```bash
# 1. Stop all services
docker compose down

# 2. Delete worker queues (forces recreation)
rabbitmqadmin delete queue name=worker.1.executions
# Repeat for all worker queues

# 3. Update config
# Edit worker_queue_ttl_ms

# 4. Restart services (queues recreated with new TTL)
docker compose up -d
```

## Key Files

### Configuration
- `config.docker.yaml` - Production settings
- `config.development.yaml` - Development settings

### Implementation
- `crates/common/src/mq/config.rs` - TTL configuration
- `crates/common/src/mq/connection.rs` - Queue setup with TTL
- `crates/executor/src/dead_letter_handler.rs` - DLQ processing
- `crates/executor/src/service.rs` - DLQ handler integration

### Documentation
- `docs/architecture/worker-queue-ttl-dlq.md` - Full architecture
- `docs/architecture/worker-availability-handling.md` - Phase 1 (backup)

## When to Use

**Enable DLQ (default):**
- Production environments
- Development with multiple workers
- Any environment requiring high reliability

**Disable DLQ:**
- Local development with a single worker
- Testing scenarios where you want manual control
- Debugging worker behavior

## Next Steps (Phase 3)

- **Health probes:** Proactive worker health checking
- **Intelligent retry:** Retry transient failures
- **Per-action TTL:** Custom timeouts per action type
- **DLQ analytics:** Aggregate failure statistics

## See Also

- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
@@ -339,7 +339,7 @@ Understanding the execution lifecycle helps with monitoring and debugging:
```
1. requested → Action execution requested
2. scheduling → Finding available worker
-3. scheduled → Assigned to worker, queued
+3. scheduled → Assigned to worker, queued [HANDOFF TO WORKER]
4. running → Currently executing
5. completed → Finished successfully
   OR
@@ -352,33 +352,78 @@ Understanding the execution lifecycle helps with monitoring and debugging:
   abandoned → Worker lost
```

### State Ownership Model

Execution state is owned by different services at different lifecycle stages:

**Executor Ownership (Pre-Handoff):**
- `requested` → `scheduling` → `scheduled`
- Executor creates and updates execution records
- Executor selects a worker and publishes `execution.scheduled`
- **Handles cancellations/failures BEFORE handoff** (before `execution.scheduled` is published)

**Handoff Point:**
- When the `execution.scheduled` message is **published to the worker**
- Before handoff: Executor owns and updates state
- After handoff: Worker owns and updates state

**Worker Ownership (Post-Handoff):**
- `running` → `completed` / `failed` / `cancelled` / `timeout` / `abandoned`
- Worker updates execution records directly
- Worker publishes status change notifications
- **Handles cancellations/failures AFTER handoff** (after receiving `execution.scheduled`)
- Worker only owns executions it has received

**Orchestration (Read-Only):**
- Executor receives status change notifications for orchestration
- Triggers workflow children, manages parent-child relationships
- Does NOT update execution state after handoff

### State Transitions

**Normal Flow:**
```
-requested → scheduling → scheduled → running → completed
+requested → scheduling → scheduled → [HANDOFF] → running → completed
+└─ Executor Updates ──────────────────────────┘  └─ Worker Updates ─┘
```

**Failure Flow:**
```
-requested → scheduling → scheduled → running → failed
+requested → scheduling → scheduled → [HANDOFF] → running → failed
+└─ Executor Updates ──────────────────────────┘  └─ Worker Updates ─┘
```

-**Cancellation:**
+**Cancellation (depends on handoff):**
```
-(any state) → canceling → cancelled
+Before handoff:
+  requested/scheduling/scheduled → cancelled
+  └─ Executor Updates (worker never notified) ─┘
+
+After handoff:
+  running → canceling → cancelled
+  └─ Worker Updates ──┘
```

**Timeout:**
```
-scheduled/running → timeout
+scheduled/running → [HANDOFF] → timeout
+                                └─ Worker Updates
```

**Abandonment:**
```
-scheduled/running → abandoned
+scheduled/running → [HANDOFF] → abandoned
+                                └─ Worker Updates
```

**Key Points:**
- Only one service updates each execution stage (no race conditions)
- Handoff occurs when `execution.scheduled` is **published**, not just when the status is set to `scheduled`
- If cancelled before handoff: Executor updates (the worker never knows the execution existed)
- If cancelled after handoff: Worker updates (the worker owns the execution)
- The worker is the authoritative source for execution state after receiving `execution.scheduled`
- Status changes are reflected in real time via notifications

---

## Data Fields
@@ -87,32 +87,47 @@ Execution Requested → Scheduler → Worker Selection → Execution Scheduled

### 3. Execution Manager

-**Purpose**: Manages execution lifecycle and status transitions.
+**Purpose**: Orchestrates execution workflows and handles lifecycle events.

**Responsibilities**:
-- Listens for `execution.status.*` messages from workers
-- Updates execution records with status changes
-- Handles execution completion (success, failure, cancellation)
-- Orchestrates workflow executions (parent-child relationships)
-- Publishes completion notifications for downstream consumers
+- **Does NOT update execution state** (worker owns state after scheduling)
+- Handles execution completion orchestration (triggering child executions)
+- Manages workflow executions (parent-child relationships)
+- Coordinates workflow state transitions

**Ownership Model**:
- **Executor owns**: Requested → Scheduling → Scheduled (updates DB)
  - Includes pre-handoff cancellations/failures (before `execution.scheduled` is published)
- **Worker owns**: Running → Completed/Failed/Cancelled (updates DB)
  - Includes post-handoff cancellations/failures (after receiving `execution.scheduled`)
- **Handoff Point**: When the `execution.scheduled` message is **published** to the worker
  - Before publish: Executor owns and updates state
  - After publish: Worker owns and updates state

**Message Flow**:
```
-Worker Status Update → Execution Manager → Database Update → Completion Handler
+Worker Status Update → Execution Manager → Orchestration Logic (Read-Only)
+                                         → Trigger Child Executions
```

**Status Lifecycle**:
```
-Requested → Scheduling → Scheduled → Running → Completed/Failed/Cancelled
-    │
-    └→ Child Executions (workflows)
+Requested → Scheduling → Scheduled → [HANDOFF: execution.scheduled published] → Running → Completed/Failed/Cancelled
+    │                        │                                                     │
+    └─ Executor Updates ─────┘                                                     └─ Worker Updates
+       (includes pre-handoff                                                          (includes post-handoff
+        Cancelled)                                                                     Cancelled/Timeout/Abandoned)
+    │
+    └→ Child Executions (workflows)
```

**Key Implementation Details**:
- Parses status strings to typed enums for type safety
- Receives status change notifications for orchestration purposes only
- Does not update execution state after handoff to the worker
- Handles workflow orchestration (parent-child execution chaining)
- Only triggers child executions on successful parent completion
- Publishes completion events for the notification service
- Read-only access to execution records for orchestration logic

## Message Queue Integration

@@ -123,12 +138,14 @@ The Executor consumes and produces several message types:

**Consumed**:
- `enforcement.created` - New enforcement from triggered rules
- `execution.requested` - Execution scheduling requests
-- `execution.status.*` - Status updates from workers
+- `execution.status.changed` - Status change notifications from workers (for orchestration)
+- `execution.completed` - Completion notifications from workers (for queue management)

**Published**:
- `execution.requested` - To scheduler (from enforcement processor)
-- `execution.scheduled` - To workers (from scheduler)
-- `execution.completed` - To notifier (from execution manager)
+- `execution.scheduled` - To workers (from scheduler) **← OWNERSHIP HANDOFF**

**Note**: The executor does NOT publish `execution.completed` messages. This is the worker's responsibility as the authoritative source of execution state after scheduling.

### Message Envelope Structure

@@ -186,11 +203,34 @@ use attune_common::repositories::{
};
```

### Database Update Ownership

**Executor updates execution state** from creation through handoff:
- Creates execution records (`Requested` status)
- Updates status during scheduling (`Scheduling` → `Scheduled`)
- Publishes the `execution.scheduled` message to the worker **← HANDOFF POINT**
- **Handles cancellations/failures BEFORE handoff** (before the message is published)
  - Example: A user cancels an execution while it is queued by a concurrency policy
  - The executor updates it to `Cancelled`; the worker never receives the message

**Worker updates execution state** after receiving handoff:
- Receives the `execution.scheduled` message (takes ownership)
- Updates status when the execution starts (`Running`)
- Updates status when the execution completes (`Completed`, `Failed`, etc.)
- **Handles cancellations/failures AFTER handoff** (after receiving the message)
- Updates result data and artifacts
- Worker only owns executions it has received

**Executor reads execution state** for orchestration after handoff:
- Receives status change notifications from workers
- Reads execution records to trigger workflow children
- Does NOT update execution state after publishing `execution.scheduled`

### Transaction Support

Future implementations will use database transactions for multi-step operations:
- Creating execution + publishing message (atomic)
- Status update + completion handling (atomic)
- Enforcement processing + execution creation (atomic)

## Configuration

557
docs/architecture/worker-availability-handling.md
Normal file
@@ -0,0 +1,557 @@
# Worker Availability Handling

**Status**: Implementation Gap Identified
**Priority**: High
**Date**: 2026-02-09

## Problem Statement

When workers are stopped or become unavailable, the executor continues attempting to schedule executions to them, resulting in:

1. **Stuck executions**: Executions remain in `SCHEDULING` or `SCHEDULED` status indefinitely
2. **Queue buildup**: Messages accumulate in worker-specific RabbitMQ queues
3. **No failure notification**: Users don't know their executions are stuck
4. **Resource waste**: System resources are consumed by queued messages and database records

## Current Architecture

### Heartbeat Mechanism

Workers send heartbeat updates to the database periodically (default: every 30 seconds).

```rust
// From crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
    // Worker is fresh if heartbeat < 90 seconds old
    let max_age = Duration::from_secs(
        DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
    );
    // ...
}
```

### Scheduling Flow

```
Execution Created (REQUESTED)
        ↓
Scheduler receives message
        ↓
Find compatible worker with fresh heartbeat
        ↓
Update execution to SCHEDULED
        ↓
Publish message to worker-specific queue
        ↓
Worker consumes and executes
```

### Failure Points

1. **Worker stops after heartbeat**: Worker has a fresh heartbeat but is actually down
2. **Worker crashes**: No graceful shutdown; the heartbeat appears fresh temporarily
3. **Network partition**: Worker is isolated but appears healthy
4. **Queue accumulation**: Messages sit in worker-specific queues indefinitely

## Current Mitigations (Insufficient)

### 1. Heartbeat Staleness Check

```rust
fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
    // Filter by active workers
    let active_workers: Vec<_> = workers
        .into_iter()
        .filter(|w| w.status == WorkerStatus::Active)
        .collect();

    // Filter by heartbeat freshness
    let fresh_workers: Vec<_> = active_workers
        .into_iter()
        .filter(|w| is_worker_heartbeat_fresh(w))
        .collect();

    if fresh_workers.is_empty() {
        return Err(anyhow!("No workers with fresh heartbeats"));
    }

    // Select first available worker
    Ok(fresh_workers.into_iter().next().unwrap())
}
```

**Gap**: Workers can stop within the 90-second staleness window.

### 2. Message Requeue on Error

```rust
// From crates/common/src/mq/consumer.rs
match handler(envelope.clone()).await {
    Err(e) => {
        let requeue = e.is_retriable();
        channel.basic_nack(delivery_tag, BasicNackOptions {
            requeue,
            multiple: false,
        }).await?;
    }
}
```

**Gap**: Only requeues on retriable errors (connection/timeout), not worker unavailability.

### 3. Message TTL Configuration

```rust
// From crates/common/src/config.rs
pub struct MessageQueueConfig {
    #[serde(default = "default_message_ttl")]
    pub message_ttl: u64,
}

fn default_message_ttl() -> u64 {
    3600 // 1 hour
}
```

**Gap**: The TTL is not currently applied to worker queues, and 1 hour is too long.
## Proposed Solutions

### Solution 1: Execution Timeout Mechanism (HIGH PRIORITY)

Add a background task that monitors scheduled executions and fails them if they don't start within a timeout.

**Implementation:**

```rust
// crates/executor/src/execution_timeout_monitor.rs

pub struct ExecutionTimeoutMonitor {
    pool: PgPool,
    publisher: Arc<Publisher>,
    check_interval: Duration,
    scheduled_timeout: Duration,
}

impl ExecutionTimeoutMonitor {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_stale_executions().await {
                error!("Error checking stale executions: {}", e);
            }
        }
    }

    async fn check_stale_executions(&self) -> Result<()> {
        let cutoff = Utc::now() - chrono::Duration::from_std(self.scheduled_timeout)?;

        // Find executions stuck in SCHEDULED status
        let stale_executions = sqlx::query_as::<_, Execution>(
            "SELECT * FROM execution
             WHERE status = 'scheduled'
               AND updated < $1",
        )
        .bind(cutoff)
        .fetch_all(&self.pool)
        .await?;

        for execution in stale_executions {
            warn!(
                "Execution {} has been scheduled for too long, marking as failed",
                execution.id
            );

            self.fail_execution(
                execution.id,
                "Execution timeout: worker did not pick up task within timeout",
            )
            .await?;
        }

        Ok(())
    }

    async fn fail_execution(&self, execution_id: i64, reason: &str) -> Result<()> {
        // Update execution status
        sqlx::query(
            "UPDATE execution
             SET status = 'failed',
                 result = $2,
                 updated = NOW()
             WHERE id = $1",
        )
        .bind(execution_id)
        .bind(serde_json::json!({
            "error": reason,
            "failed_by": "execution_timeout_monitor"
        }))
        .execute(&self.pool)
        .await?;

        // Publish completion notification
        let payload = ExecutionCompletedPayload {
            execution_id,
            status: ExecutionStatus::Failed,
            result: Some(serde_json::json!({"error": reason})),
        };

        self.publisher
            .publish_envelope(
                MessageType::ExecutionCompleted,
                payload,
                "attune.executions",
            )
            .await?;

        Ok(())
    }
}
```

**Configuration:**

```yaml
# config.yaml
executor:
  scheduled_timeout: 300      # 5 minutes (fail if not running within 5 min)
  timeout_check_interval: 60  # Check every minute
```
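
With these defaults, the worst case between an execution going stale and the monitor failing it is the full timeout plus up to one check interval (a sketch of the arithmetic; the function name is illustrative):

```rust
/// Worst-case detection latency in seconds: an execution can sit stale for
/// the whole scheduled timeout, then wait up to one check interval more.
fn worst_case_detection_secs(scheduled_timeout: u64, check_interval: u64) -> u64 {
    scheduled_timeout + check_interval
}
```

With the configuration above that is 300 + 60 = 360 seconds.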

### Solution 2: Worker Queue TTL and DLQ (MEDIUM PRIORITY)

Apply a message TTL to worker-specific queues, with a dead letter exchange.

**Implementation:**

```rust
// When declaring worker-specific queues
let mut queue_args = FieldTable::default();

// Set message TTL (5 minutes)
queue_args.insert(
    "x-message-ttl".into(),
    AMQPValue::LongInt(300_000), // 5 minutes in milliseconds
);

// Set dead letter exchange
queue_args.insert(
    "x-dead-letter-exchange".into(),
    AMQPValue::LongString("attune.executions.dlx".into()),
);

channel.queue_declare(
    &format!("attune.execution.worker.{}", worker_id),
    QueueDeclareOptions {
        durable: true,
        ..Default::default()
    },
    queue_args,
).await?;
```

**Dead Letter Handler:**

```rust
// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: PgPool,
    consumer: Arc<Consumer>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<()> {
        self.consumer
            .consume_with_handler(|envelope: MessageEnvelope<ExecutionScheduledPayload>| {
                let pool = self.pool.clone();

                async move {
                    warn!(
                        "Received dead letter for execution {}",
                        envelope.payload.execution_id
                    );

                    // Mark execution as failed
                    sqlx::query(
                        "UPDATE execution
                         SET status = 'failed',
                             result = $2,
                             updated = NOW()
                         WHERE id = $1 AND status = 'scheduled'",
                    )
                    .bind(envelope.payload.execution_id)
                    .bind(serde_json::json!({
                        "error": "Message expired in worker queue (worker unavailable)",
                        "failed_by": "dead_letter_handler"
                    }))
                    .execute(&pool)
                    .await?;

                    Ok(())
                }
            })
            .await
    }
}
```
### Solution 3: Worker Health Probes (LOW PRIORITY)

Add active health checking instead of relying solely on heartbeats.

**Implementation:**

```rust
// crates/executor/src/worker_health_checker.rs

pub struct WorkerHealthChecker {
    pool: PgPool,
    check_interval: Duration,
}

impl WorkerHealthChecker {
    pub async fn start(&self) -> Result<()> {
        let mut interval = tokio::time::interval(self.check_interval);

        loop {
            interval.tick().await;

            if let Err(e) = self.check_worker_health().await {
                error!("Error checking worker health: {}", e);
            }
        }
    }

    async fn check_worker_health(&self) -> Result<()> {
        let workers = WorkerRepository::find_action_workers(&self.pool).await?;

        for worker in workers {
            // Skip if heartbeat is very stale (worker is definitely down)
            if !is_heartbeat_recent(&worker) {
                continue;
            }

            // Attempt health check
            match self.ping_worker(&worker).await {
                Ok(true) => {
                    // Worker is healthy, ensure status is Active
                    if worker.status != Some(WorkerStatus::Active) {
                        self.update_worker_status(worker.id, WorkerStatus::Active).await?;
                    }
                }
                Ok(false) | Err(_) => {
                    // Worker is unhealthy, mark as inactive
                    warn!("Worker {} failed health check", worker.name);
                    self.update_worker_status(worker.id, WorkerStatus::Inactive).await?;
                }
            }
        }

        Ok(())
    }

    async fn ping_worker(&self, worker: &Worker) -> Result<bool> {
        // TODO: Implement health endpoint on worker
        // For now, check if worker's queue is being consumed
        Ok(true)
    }
}
```

### Solution 4: Graceful Worker Shutdown (MEDIUM PRIORITY)

Ensure workers mark themselves as inactive before shutdown.

**Implementation:**

```rust
// In worker service shutdown handler
impl WorkerService {
    pub async fn shutdown(&self) -> Result<()> {
        info!("Worker shutting down gracefully...");

        // Mark worker as inactive
        sqlx::query(
            "UPDATE worker SET status = 'inactive', updated = NOW() WHERE id = $1"
        )
        .bind(self.worker_id)
        .execute(&self.pool)
        .await?;

        // Stop accepting new tasks
        self.stop_consuming().await?;

        // Wait for in-flight tasks to complete (with timeout)
        let timeout = Duration::from_secs(30);
        tokio::time::timeout(timeout, self.wait_for_completion()).await?;

        info!("Worker shutdown complete");
        Ok(())
    }
}
```

**Docker Signal Handling:**

```yaml
# docker-compose.yaml
services:
  worker-shell:
    stop_grace_period: 45s  # Give worker time to finish tasks
```

## Implementation Priority

### Phase 1: Immediate (Week 1)
1. **Execution Timeout Monitor** - Prevents stuck executions
2. **Graceful Shutdown** - Marks workers inactive on stop

### Phase 2: Short-term (Week 2)
3. **Worker Queue TTL + DLQ** - Prevents message buildup
4. **Dead Letter Handler** - Fails expired executions

### Phase 3: Long-term (Month 1)
5. **Worker Health Probes** - Active availability verification
6. **Retry Logic** - Reschedule to different worker on failure

## Configuration

### Recommended Timeouts

```yaml
executor:
  # How long an execution can stay SCHEDULED before failing
  scheduled_timeout: 300  # 5 minutes

  # How often to check for stale executions
  timeout_check_interval: 60  # 1 minute

  # Message TTL in worker queues
  worker_queue_ttl: 300  # 5 minutes (match scheduled_timeout)

  # Worker health check interval
  health_check_interval: 30  # 30 seconds

worker:
  # How often to send heartbeats
  heartbeat_interval: 10  # 10 seconds (more frequent)

  # Grace period for shutdown
  shutdown_timeout: 30  # 30 seconds
```

### Staleness Calculation

```
Heartbeat Staleness Threshold = heartbeat_interval * 3
                              = 10 * 3 = 30 seconds

This means:
- Worker sends heartbeat every 10s
- If heartbeat is > 30s old, worker is considered stale
- Reduces the window where a stopped worker appears healthy from 90s to 30s
```

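The staleness rule above can be captured in a small helper; a sketch using plain `std::time` (the name `is_heartbeat_recent` matches the call in the health-checker example, but this signature is an assumption):

```rust
use std::time::{Duration, SystemTime};

/// Staleness threshold = heartbeat_interval * 3 (per the calculation above).
const STALENESS_MULTIPLIER: u32 = 3;

/// Hypothetical helper: is the last heartbeat recent enough for the
/// worker to be considered alive?
fn is_heartbeat_recent(last_heartbeat: SystemTime, heartbeat_interval: Duration) -> bool {
    let threshold = heartbeat_interval * STALENESS_MULTIPLIER; // 10s * 3 = 30s
    match SystemTime::now().duration_since(last_heartbeat) {
        Ok(age) => age <= threshold,
        // A heartbeat timestamped in the future (clock skew) counts as recent.
        Err(_) => true,
    }
}
```

With a 10s interval, a heartbeat 15s old passes the check and one 45s old does not.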
## Monitoring and Observability

### Metrics to Track

1. **Execution timeout rate**: Number of executions failed due to timeout
2. **Worker downtime**: Time between last heartbeat and status change
3. **Dead letter queue depth**: Number of expired messages
4. **Average scheduling latency**: Time from REQUESTED to RUNNING

### Alerts

```yaml
alerts:
  - name: high_execution_timeout_rate
    condition: execution_timeouts > 10 per minute
    severity: warning

  - name: no_active_workers
    condition: active_workers == 0
    severity: critical

  - name: dlq_buildup
    condition: dlq_depth > 100
    severity: warning

  - name: stale_executions
    condition: scheduled_executions_older_than_5min > 0
    severity: warning
```

## Testing

### Test Scenarios

1. **Worker stops mid-execution**: Should timeout and fail
2. **Worker never picks up task**: Should timeout after 5 minutes
3. **All workers down**: Should immediately fail with "no workers available"
4. **Worker stops gracefully**: Should mark inactive and not receive new tasks
5. **Message expires in queue**: Should be moved to DLQ and the execution failed

### Integration Test Example

```rust
#[tokio::test]
async fn test_execution_timeout_on_worker_down() {
    let pool = setup_test_db().await;
    let mq = setup_test_mq().await;

    // Create worker and execution
    let worker = create_test_worker(&pool).await;
    let execution = create_test_execution(&pool).await;

    // Schedule execution to worker
    schedule_execution(&pool, &mq, execution.id, worker.id).await;

    // Stop worker (simulate crash - no graceful shutdown)
    stop_worker(worker.id).await;

    // Wait for timeout
    tokio::time::sleep(Duration::from_secs(310)).await;

    // Verify execution is marked as failed
    let execution = get_execution(&pool, execution.id).await;
    assert_eq!(execution.status, ExecutionStatus::Failed);
    assert!(execution.result.unwrap()["error"]
        .as_str()
        .unwrap()
        .contains("timeout"));
}
```

## Migration Path

### Step 1: Add Monitoring (No Breaking Changes)
- Deploy execution timeout monitor
- Monitor logs for timeout events
- Tune timeout values based on actual workload

### Step 2: Add DLQ (Requires Queue Reconfiguration)
- Create dead letter exchange
- Update queue declarations with TTL and DLX
- Deploy dead letter handler
- Monitor DLQ depth

### Step 3: Graceful Shutdown (Worker Update)
- Add shutdown handler to worker
- Update Docker Compose stop_grace_period
- Test worker restarts

### Step 4: Health Probes (Future Enhancement)
- Add health endpoint to worker
- Deploy health checker service
- Transition from heartbeat-only to active probing

## Related Documentation

- [Queue Architecture](./queue-architecture.md)
- [Worker Service](./worker-service.md)
- [Executor Service](./executor-service.md)
- [RabbitMQ Queues Quick Reference](../docs/QUICKREF-rabbitmq-queues.md)

---

`docs/architecture/worker-queue-ttl-dlq.md` (new file, 493 lines)

# Worker Queue TTL and Dead Letter Queue (Phase 2)

## Overview

Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.

## Architecture

### Message Flow

```
┌─────────────┐
│  Executor   │
│  Scheduler  │
└──────┬──────┘
       │ Publishes ExecutionRequested
       │ routing_key: execution.dispatch.worker.{id}
       │
       ▼
┌──────────────────────────────────┐
│  worker.{id}.executions queue    │
│                                  │
│  Properties:                     │
│  - x-message-ttl: 300000ms (5m)  │
│  - x-dead-letter-exchange: dlx   │
└──────┬───────────────────┬───────┘
       │                   │
       │ Worker consumes   │ TTL expires
       │ (normal flow)     │ (worker unavailable)
       │                   │
       ▼                   ▼
┌──────────────┐   ┌──────────────────┐
│   Worker     │   │   attune.dlx     │
│   Service    │   │  (Dead Letter    │
│              │   │   Exchange)      │
└──────────────┘   └────────┬─────────┘
                            │
                            │ Routes to DLQ
                            │
                            ▼
                   ┌──────────────────────┐
                   │  attune.dlx.queue    │
                   │  (Dead Letter Queue) │
                   └────────┬─────────────┘
                            │
                            │ Consumes
                            │
                            ▼
                   ┌──────────────────────┐
                   │ Dead Letter Handler  │
                   │   (in Executor)      │
                   │                      │
                   │ - Identifies exec    │
                   │ - Marks as FAILED    │
                   │ - Logs failure       │
                   └──────────────────────┘
```

### Components

#### 1. Worker Queue TTL

**Configuration:**
- Default: 5 minutes (300,000 milliseconds)
- Configurable via `rabbitmq.worker_queue_ttl_ms`

**Implementation:**
- Applied during queue declaration in `Connection::setup_worker_infrastructure()`
- Uses RabbitMQ's `x-message-ttl` queue argument
- Only applies to worker-specific queues (`worker.{id}.executions`)

**Behavior:**
- When a message remains in the queue longer than the TTL, RabbitMQ automatically moves it to the configured dead letter exchange
- Original message properties and headers are preserved
- An `x-death` header is added with expiration details

#### 2. Dead Letter Exchange (DLX)

**Configuration:**
- Exchange name: `attune.dlx`
- Type: `direct`
- Durable: `true`

**Setup:**
- Created in `Connection::setup_common_infrastructure()`
- Bound to dead letter queue with routing key `#` (all messages)
- Shared across all services

#### 3. Dead Letter Queue

**Configuration:**
- Queue name: `attune.dlx.queue`
- Durable: `true`
- TTL: 24 hours (configurable via `rabbitmq.dead_letter.ttl_ms`)

**Properties:**
- Retains messages for debugging and analysis
- Messages auto-expire after the retention period
- No DLX on the DLQ itself (prevents infinite loops)

#### 4. Dead Letter Handler

**Location:** `crates/executor/src/dead_letter_handler.rs`

**Responsibilities:**
1. Consume messages from `attune.dlx.queue`
2. Deserialize the message envelope
3. Extract the execution ID from the payload
4. Verify the execution is in a non-terminal state
5. Update the execution to FAILED status
6. Add descriptive error information
7. Acknowledge the message (remove it from the DLQ)

**Error Handling:**
- Invalid messages: Acknowledged and discarded
- Missing executions: Acknowledged (already processed)
- Terminal-state executions: Acknowledged (no action needed)
- Database errors: Nacked with requeue (retry later)

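The acknowledgement policy in this list is effectively a decision table; a minimal sketch, where `DlqOutcome` and `AckDecision` are hypothetical names introduced for illustration:

```rust
/// Possible outcomes when processing a dead-lettered message (hypothetical names).
#[derive(Debug, PartialEq)]
enum DlqOutcome {
    Failed,           // execution transitioned to FAILED
    InvalidMessage,   // could not deserialize the envelope
    MissingExecution, // execution no longer exists
    AlreadyTerminal,  // execution already finished
    DatabaseError,    // transient infrastructure failure
}

#[derive(Debug, PartialEq)]
enum AckDecision {
    Ack,         // remove the message from the DLQ
    NackRequeue, // put it back and retry later
}

/// Only transient database errors are requeued; every other outcome is final.
fn decide(outcome: &DlqOutcome) -> AckDecision {
    match outcome {
        DlqOutcome::DatabaseError => AckDecision::NackRequeue,
        _ => AckDecision::Ack,
    }
}
```

Keeping the requeue case narrow matters: requeueing a message the handler can never process would loop forever.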
## Configuration

### RabbitMQ Configuration Structure

```yaml
message_queue:
  rabbitmq:
    # Worker queue TTL - how long messages wait before DLX
    worker_queue_ttl_ms: 300000  # 5 minutes (default)

    # Dead letter configuration
    dead_letter:
      enabled: true          # Enable DLQ system
      exchange: attune.dlx   # DLX name
      ttl_ms: 86400000       # DLQ retention (24 hours)
```

### Environment-Specific Settings

#### Development (`config.development.yaml`)
```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours
```

#### Production (`config.docker.yaml`)
```yaml
message_queue:
  rabbitmq:
    worker_queue_ttl_ms: 300000  # 5 minutes
    dead_letter:
      enabled: true
      exchange: attune.dlx
      ttl_ms: 86400000  # 24 hours
```

### Tuning Guidelines

**Worker Queue TTL (`worker_queue_ttl_ms`):**
- **Too short:** Legitimate slow workers may have executions failed prematurely
- **Too long:** Unavailable workers cause delayed failure detection
- **Recommendation:** 2-5x typical execution time, minimum 2 minutes
- **Default (5 min):** Good balance for most workloads

**DLQ Retention (`dead_letter.ttl_ms`):**
- Purpose: Debugging and forensics
- **Too short:** May lose data before analysis
- **Too long:** Accumulates stale data
- **Recommendation:** 24-48 hours in production
- **Default (24 hours):** Adequate for most troubleshooting

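The TTL recommendation above can be expressed as a helper; a sketch that picks the 3x midpoint of the 2-5x range and applies the 2-minute floor (the function name is illustrative, not part of the codebase):

```rust
use std::time::Duration;

/// Minimum worker queue TTL recommended by the tuning guidelines (2 minutes).
const MIN_WORKER_QUEUE_TTL: Duration = Duration::from_secs(120);

/// Recommended TTL: a multiple of typical execution time (2-5x per the
/// guidelines; 3x chosen here), never below the 2-minute floor.
fn recommended_worker_queue_ttl(typical_execution: Duration) -> Duration {
    (typical_execution * 3).max(MIN_WORKER_QUEUE_TTL)
}
```

For a 100-second typical execution this yields the 300-second (5-minute) default; for very short actions the floor keeps slow-but-healthy workers from being failed prematurely.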
## Code Structure

### Queue Declaration with TTL

```rust
// crates/common/src/mq/connection.rs

pub async fn declare_queue_with_dlx_and_ttl(
    &self,
    config: &QueueConfig,
    dlx_exchange: &str,
    ttl_ms: Option<u64>,
) -> MqResult<()> {
    let mut args = FieldTable::default();

    // Configure DLX
    args.insert(
        "x-dead-letter-exchange".into(),
        AMQPValue::LongString(dlx_exchange.into()),
    );

    // Configure TTL if specified (i64 value, hence LongLongInt)
    if let Some(ttl) = ttl_ms {
        args.insert(
            "x-message-ttl".into(),
            AMQPValue::LongLongInt(ttl as i64),
        );
    }

    // Declare queue with arguments
    channel.queue_declare(&config.name, options, args).await?;
    Ok(())
}
```

### Dead Letter Handler

```rust
// crates/executor/src/dead_letter_handler.rs

pub struct DeadLetterHandler {
    pool: Arc<PgPool>,
    consumer: Consumer,
    running: Arc<Mutex<bool>>,
}

impl DeadLetterHandler {
    pub async fn start(&self) -> Result<(), Error> {
        let pool = self.pool.clone();
        self.consumer.consume_with_handler(move |envelope| {
            let pool = pool.clone();
            async move {
                match envelope.message_type {
                    MessageType::ExecutionRequested => {
                        handle_execution_requested(&pool, &envelope).await
                    }
                    _ => {
                        // Unexpected message type - acknowledge and discard
                        Ok(())
                    }
                }
            }
        }).await
    }
}

async fn handle_execution_requested(
    pool: &PgPool,
    envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
    // Extract execution ID
    let execution_id = envelope.payload.get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or_else(|| /* error */)?;

    // Fetch current state
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Only fail if in non-terminal state
    if !execution.status.is_terminal() {
        ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
            status: Some(ExecutionStatus::Failed),
            result: Some(json!({
                "error": "Worker queue TTL expired",
                "message": "Worker did not process execution within configured TTL",
            })),
            ended: Some(Some(Utc::now())),
            ..Default::default()
        }).await?;
    }

    Ok(())
}
```

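The handler's guard relies on `execution.status.is_terminal()`; a sketch of such a predicate, with the variant set taken from the lifecycle described in this commit (the exact enum definition in the codebase may differ):

```rust
#[derive(Debug, Clone, Copy)]
enum ExecutionStatus {
    Requested,
    Scheduling,
    Scheduled,
    Running,
    Completed,
    Failed,
    Cancelled,
    Timeout,
}

impl ExecutionStatus {
    /// Terminal states never transition again, so the DLQ handler
    /// must not overwrite them with Failed.
    fn is_terminal(self) -> bool {
        matches!(
            self,
            ExecutionStatus::Completed
                | ExecutionStatus::Failed
                | ExecutionStatus::Cancelled
                | ExecutionStatus::Timeout
        )
    }
}
```

This check is what makes DLQ processing idempotent: a late-arriving dead letter for an execution the worker already finished is simply acknowledged.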
## Integration with Executor Service

The dead letter handler is started automatically by the executor service if the DLQ is enabled:

```rust
// crates/executor/src/service.rs

pub async fn start(&self) -> Result<()> {
    // ... other components ...

    // Start dead letter handler (if enabled)
    if self.inner.mq_config.rabbitmq.dead_letter.enabled {
        let dlq_name = format!("{}.queue",
            self.inner.mq_config.rabbitmq.dead_letter.exchange);
        let dlq_consumer = Consumer::new(
            &self.inner.mq_connection,
            create_dlq_consumer_config(&dlq_name, "executor.dlq"),
        ).await?;

        let dlq_handler = Arc::new(
            DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
        );

        handles.push(tokio::spawn(async move {
            dlq_handler.start().await
        }));
    }

    // ... wait for completion ...
}
```

## Operational Considerations

### Monitoring

**Key Metrics:**
- DLQ message rate (messages/sec entering the DLQ)
- DLQ queue depth (current messages in the DLQ)
- DLQ processing latency (time from DLX to handler)
- Failed execution count (executions failed via the DLQ)

**Alerting Thresholds:**
- DLQ rate > 10/min: Workers may be unhealthy or the TTL too aggressive
- DLQ depth > 100: Handler may be falling behind
- High failure rate: Systematic worker availability issues

### RabbitMQ Management

**View DLQ:**
```bash
# List messages in DLQ
rabbitmqadmin list queues name messages

# Get DLQ details
rabbitmqadmin show queue name=attune.dlx.queue

# Purge DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue
```

**View Dead Letters:**
```bash
# Get a message from the DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1

# Check message death history:
# look for the x-death header in the message properties
```

### Troubleshooting

#### High DLQ Rate

**Symptoms:** Many executions failing via the DLQ

**Causes:**
1. Workers down or restarting frequently
2. Worker queue TTL too aggressive
3. Worker overloaded (not consuming fast enough)
4. Network issues between executor and workers

**Resolution:**
1. Check worker health and logs
2. Verify worker heartbeats in the database
3. Consider increasing `worker_queue_ttl_ms`
4. Scale the worker fleet if overloaded

#### DLQ Handler Not Processing

**Symptoms:** DLQ depth increasing, executions stuck

**Causes:**
1. Executor service not running
2. DLQ disabled in configuration
3. Database connection issues
4. Handler crashed or deadlocked

**Resolution:**
1. Check executor service logs
2. Verify `dead_letter.enabled = true`
3. Check database connectivity
4. Restart the executor service if needed

#### Messages Not Reaching DLQ

**Symptoms:** Executions stuck, DLQ empty

**Causes:**
1. Worker queues not configured with a DLX
2. DLX exchange not created
3. DLQ not bound to the DLX
4. TTL not configured on worker queues

**Resolution:**
1. Restart services to recreate the infrastructure
2. Verify RabbitMQ configuration
3. Check queue properties in the RabbitMQ management UI

## Testing

### Unit Tests

```rust
#[tokio::test]
async fn test_expired_execution_handling() {
    let pool = setup_test_db().await;

    // Create execution in SCHEDULED state
    let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;

    // Simulate DLQ message
    let envelope = MessageEnvelope::new(
        MessageType::ExecutionRequested,
        json!({ "execution_id": execution.id }),
    );

    // Process message
    handle_execution_requested(&pool, &envelope).await.unwrap();

    // Verify execution failed
    let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
    assert_eq!(updated.status, ExecutionStatus::Failed);
    assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}
```

### Integration Tests

```bash
# 1. Start all services
docker compose up -d

# 2. Create an execution targeting a stopped worker
#    (worker_id 999 does not exist)
curl -X POST http://localhost:8080/api/v1/executions \
  -H "Content-Type: application/json" \
  -d '{
    "action_ref": "core.echo",
    "parameters": {"message": "test"},
    "worker_id": 999
  }'

# 3. Wait for TTL expiration (5+ minutes)
sleep 330

# 4. Verify the execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"

# 5. Check the DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```

## Relationship to Other Phases

### Phase 1 (Completed)
- Execution timeout monitor: Handles executions stuck in SCHEDULED
- Graceful shutdown: Prevents new tasks from being sent to stopping workers
- Reduced heartbeat interval: Faster stale-worker detection

**Interaction:** The Phase 1 timeout monitor acts as a backstop if DLQ processing fails

### Phase 2 (Current)
- Worker queue TTL: Automatic message expiration
- Dead letter queue: Captures expired messages
- Dead letter handler: Processes and fails expired executions

**Benefit:** More precise failure detection at the message queue level

### Phase 3 (Planned)
- Health probes: Proactive worker health checking
- Intelligent retry: Retry transient failures
- Load balancing: Distribute work across healthy workers

**Integration:** Phase 3 will use Phase 2 DLQ data to inform routing decisions

## Benefits

1. **Automatic Failure Detection:** No manual intervention needed for unavailable workers
2. **Precise Timing:** TTL provides an exact failure window (vs the polling-based Phase 1 monitor)
3. **Resource Efficiency:** Prevents message accumulation in worker queues
4. **Debugging Support:** DLQ retains messages for forensic analysis
5. **Graceful Degradation:** System continues functioning even with worker failures

## Limitations

1. **TTL Precision:** RabbitMQ TTL is approximate, not guaranteed to the millisecond
2. **Race Conditions:** A worker may start processing just as the TTL expires (rare)
3. **DLQ Capacity:** Very high failure rates may overwhelm the DLQ
4. **No Retry Logic:** Phase 2 always fails; Phase 3 will add intelligent retry

## Future Enhancements (Phase 3)

- **Conditional Retry:** Retry messages based on failure reason
- **Priority DLQ:** Prioritize critical execution failures
- **DLQ Analytics:** Aggregate statistics on failure patterns
- **Auto-scaling:** Scale workers based on DLQ rate
- **Custom TTL:** Per-action or per-execution TTL configuration

## References

- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`

---

### 4. Action Executor

**Purpose**: Orchestrate the complete execution flow for an action and own execution state after handoff.

**Execution Flow**:
```
1. Receive execution.scheduled message from executor
2. Load execution record from database
3. Update status to Running (owns state after handoff)
4. Load action definition by reference
5. Prepare execution context (parameters, env vars, timeout)
6. Select and execute in appropriate runtime
7. Capture results (stdout, stderr, return value)
8. Store artifacts (logs, results)
9. Update execution status (Completed/Failed) in database
10. Publish status change notifications
11. Publish completion notification for queue management
```

**Ownership Model**:
- **Worker owns execution state** after receiving `execution.scheduled`
- **Authoritative source** for all status updates: Running, Completed, Failed, Cancelled, etc.
- **Updates database directly** for all state changes
- **Publishes notifications** for orchestration and monitoring

**Responsibilities**:
- Coordinate execution lifecycle
- Load action and execution data from database
- **Update execution state in database** (after handoff from executor)
- Prepare execution context with parameters and environment
- Execute action via runtime registry
- Handle success and failure cases
- Store execution artifacts
- Publish status change notifications

**Key Implementation Details**:
- Parameters merged: action defaults + execution overrides

[...]

See `docs/secrets-management.md` for comprehensive documentation.

- Register worker in database
- Start heartbeat manager
- Consume execution messages from worker-specific queue
- **Own execution state** after receiving scheduled executions
- **Update execution status in database** (Running, Completed, Failed, etc.)
- Publish execution status change notifications
- Publish execution completion notifications
- Handle graceful shutdown

**Message Flow**:

[...]

### Error Propagation

- Runtime errors captured in `ExecutionResult.error`
- **Worker updates** execution status to Failed in database (owns state)
- Error published in status change notification message
- Error published in completion notification message
- Artifacts still stored for failed executions
- Logs preserved for debugging

---

`docs/examples/history-page-url-examples.md` (new file, 227 lines)

# History Page URL Query Parameter Examples

This document provides practical examples of using URL query parameters to deep-link to filtered views in the Attune web UI history pages.

## Executions Page Examples

### Basic Filtering

**Filter by action:**
```
http://localhost:3000/executions?action_ref=core.echo
```
Shows all executions of the `core.echo` action.

**Filter by rule:**
```
http://localhost:3000/executions?rule_ref=core.on_timer
```
Shows all executions triggered by the `core.on_timer` rule.

**Filter by status:**
```
http://localhost:3000/executions?status=failed
```
Shows all failed executions.

**Filter by pack:**
```
http://localhost:3000/executions?pack_name=core
```
Shows all executions from the `core` pack.

### Combined Filters

**Rule + Status:**
```
http://localhost:3000/executions?rule_ref=core.on_timer&status=completed
```
Shows completed executions from a specific rule.

**Action + Pack:**
```
http://localhost:3000/executions?action_ref=core.echo&pack_name=core
```
Shows executions of a specific action in a pack (useful when multiple packs have similarly named actions).

**Multiple Filters:**
```
http://localhost:3000/executions?pack_name=core&status=running&trigger_ref=core.webhook
```
Shows currently running executions from the core pack triggered by webhooks.

### Troubleshooting Scenarios

**Find all failed executions for an action:**
```
http://localhost:3000/executions?action_ref=mypack.problematic_action&status=failed
```

**Check running executions for a specific executor:**
```
http://localhost:3000/executions?executor=1&status=running
```

**View all webhook-triggered executions:**
```
http://localhost:3000/executions?trigger_ref=core.webhook
```

## Events Page Examples

### Basic Filtering

**Filter by trigger:**
```
http://localhost:3000/events?trigger_ref=core.webhook
```
Shows all webhook events.

**Timer events:**
```
http://localhost:3000/events?trigger_ref=core.timer
```
Shows all timer-based events.

**Custom trigger:**
```
http://localhost:3000/events?trigger_ref=mypack.custom_trigger
```
Shows events from a custom trigger.

## Enforcements Page Examples

### Basic Filtering

**Filter by rule:**
```
http://localhost:3000/enforcements?rule_ref=core.on_timer
```
Shows all enforcements (rule activations) for a specific rule.

**Filter by trigger:**
```
http://localhost:3000/enforcements?trigger_ref=core.webhook
```
Shows all enforcements triggered by webhook events.

**Filter by event:**
```
http://localhost:3000/enforcements?event=123
```
Shows the enforcement created by a specific event (useful for tracing the event → enforcement → execution flow).

**Filter by status:**
```
http://localhost:3000/enforcements?status=processed
```
Shows processed enforcements.

### Combined Filters

**Rule + Status:**
```
http://localhost:3000/enforcements?rule_ref=core.on_timer&status=processed
```
Shows successfully processed enforcements for a specific rule.

**Trigger + Event:**
```
http://localhost:3000/enforcements?trigger_ref=core.webhook&event=456
```
Shows enforcements from a specific webhook event.

## Practical Use Cases
|
||||
|
||||
### Debugging a Rule
|
||||
|
||||
1. **Check the event was created:**
|
||||
```
|
||||
http://localhost:3000/events?trigger_ref=core.timer
|
||||
```
|
||||
|
||||
2. **Check the enforcement was created:**
|
||||
```
|
||||
http://localhost:3000/enforcements?rule_ref=core.on_timer
|
||||
```
|
||||
|
||||
3. **Check the execution was triggered:**
|
||||
```
|
||||
http://localhost:3000/executions?rule_ref=core.on_timer
|
||||
```
|
||||
|
||||
### Monitoring Action Performance

**See all executions of an action:**

```
http://localhost:3000/executions?action_ref=core.http_request
```

**See failures:**

```
http://localhost:3000/executions?action_ref=core.http_request&status=failed
```

**See currently running executions:**

```
http://localhost:3000/executions?action_ref=core.http_request&status=running
```
### Auditing Webhook Activity

1. **View all webhook events:**

   ```
   http://localhost:3000/events?trigger_ref=core.webhook
   ```

2. **View enforcements from webhooks:**

   ```
   http://localhost:3000/enforcements?trigger_ref=core.webhook
   ```

3. **View executions triggered by webhooks:**

   ```
   http://localhost:3000/executions?trigger_ref=core.webhook
   ```
### Sharing Views with Team Members

**Share failed executions for investigation:**

```
http://localhost:3000/executions?action_ref=mypack.critical_action&status=failed
```

**Share rule activity for review:**

```
http://localhost:3000/enforcements?rule_ref=mypack.important_rule&status=processed
```
## Tips and Notes

1. **URL Encoding**: If your pack, action, rule, or trigger names contain special characters, the browser URL-encodes them automatically.

2. **Case Sensitivity**: Parameter names and values are case-sensitive. Use lowercase for status values (e.g., `status=failed`, not `status=Failed`).

3. **Invalid Values**: Invalid parameter values are silently ignored, and the filter defaults to empty (showing all results).

4. **Bookmarking**: Save frequently used URLs as browser bookmarks for quick access to common filtered views.

5. **Browser History**: The URL doesn't change as you modify filters in the UI, so the browser's back button won't undo filter changes within a page.

6. **Multiple Status Filters**: While the UI allows selecting multiple statuses, only one status can be specified via URL parameter. Use the UI to select additional statuses after the page loads.
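Building these filter URLs programmatically follows the same encoding rules as the browser (tip 1 above). A minimal Python sketch, where `history_url` is a hypothetical helper (not part of Attune) and the host matches the examples in this document; note `urlencode` encodes spaces as `+` rather than `%20`, which servers treat equivalently in query strings:

```python
from urllib.parse import urlencode

def history_url(page: str, **filters: str) -> str:
    """Build a filtered history-page URL, URL-encoding parameter values."""
    base = f"http://localhost:3000/{page}"
    return f"{base}?{urlencode(filters)}" if filters else base

# Plain refs pass through unchanged:
print(history_url("enforcements", rule_ref="core.on_timer", status="processed"))
# Special characters are percent-encoded, as the browser would do:
print(history_url("executions", action_ref="mypack.do it&more"))
```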
## Parameter Reference Quick Table

| Page | Parameter | Example Value |
|------|-----------|---------------|
| Executions | `action_ref` | `core.echo` |
| Executions | `rule_ref` | `core.on_timer` |
| Executions | `trigger_ref` | `core.webhook` |
| Executions | `pack_name` | `core` |
| Executions | `executor` | `1` |
| Executions | `status` | `failed`, `running`, `completed` |
| Events | `trigger_ref` | `core.webhook` |
| Enforcements | `rule_ref` | `core.on_timer` |
| Enforcements | `trigger_ref` | `core.webhook` |
| Enforcements | `event` | `123` |
| Enforcements | `status` | `processed`, `created`, `disabled` |
365
docs/parameters/dotenv-parameter-format.md
Normal file
@@ -0,0 +1,365 @@
# DOTENV Parameter Format

## Overview

The DOTENV parameter format is used to pass action parameters securely via stdin in a shell-compatible format. It is particularly useful for shell scripts that need to parse parameters without relying on external tools like `jq`.

## Format Specification

### Basic Format

Parameters are formatted as `key='value'` pairs, one per line:

```bash
url='https://example.com'
method='GET'
timeout='30'
verify_ssl='true'
```
### Nested Object Flattening

Nested JSON objects are automatically flattened using dot notation. This allows shell scripts to parse complex parameter structures without a JSON parser.

**Input JSON:**
```json
{
  "url": "https://example.com",
  "headers": {
    "Content-Type": "application/json",
    "Authorization": "Bearer token123"
  },
  "query_params": {
    "page": "1",
    "size": "10"
  }
}
```

**Output DOTENV:**
```bash
headers.Authorization='Bearer token123'
headers.Content-Type='application/json'
query_params.page='1'
query_params.size='10'
url='https://example.com'
```
### Empty Objects

Empty objects (`{}`) are omitted from the output entirely. They do not produce any dotenv entries.

**Input:**
```json
{
  "url": "https://example.com",
  "headers": {},
  "query_params": {}
}
```

**Output:**
```bash
url='https://example.com'
```
### Arrays

Arrays are serialized as JSON strings:

**Input:**
```json
{
  "tags": ["web", "api", "production"]
}
```

**Output:**
```bash
tags='["web","api","production"]'
```
### Special Characters

Single quotes in values are escaped using the shell-safe `'\''` pattern:

**Input:**
```json
{
  "message": "It's working!"
}
```

**Output:**
```bash
message='It'\''s working!'
```
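Taken together, the rules above (dot-notation flattening, omitted empty objects, JSON-serialized arrays, `'\''` escaping, sorted output) can be condensed into a short sketch. This is illustrative Python only — the actual implementation is the worker's Rust code noted under Implementation Details:

```python
import json

def to_dotenv(params: dict, prefix: str = "") -> list[str]:
    """Flatten a parameter dict into sorted key='value' dotenv lines."""
    lines = []
    for key, value in params.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects flatten with dot notation; empty ones emit nothing.
            lines.extend(to_dotenv(value, prefix=f"{full_key}."))
        else:
            if isinstance(value, list):
                # Arrays are serialized as compact JSON strings.
                value = json.dumps(value, separators=(",", ":"))
            # Escape single quotes with the shell-safe '\'' pattern.
            escaped = str(value).replace("'", "'\\''")
            lines.append(f"{full_key}='{escaped}'")
    return sorted(lines)

params = {
    "url": "https://example.com",
    "headers": {"Content-Type": "application/json"},
    "query_params": {},
    "tags": ["web", "api"],
}
print("\n".join(to_dotenv(params)))
```

Running this against the inputs above reproduces the documented outputs, which makes it a handy reference when writing a parser for the format.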
## Shell Script Parsing

### Basic Parameter Parsing

```bash
#!/bin/sh

# Read DOTENV-formatted parameters from stdin
while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;
    esac
    [ -z "$line" ] && continue

    key="${line%%=*}"
    value="${line#*=}"

    # Remove surrounding quotes
    case "$value" in
        \"*\") value="${value#\"}"; value="${value%\"}" ;;
        \'*\') value="${value#\'}"; value="${value%\'}" ;;
    esac

    # Process parameters
    case "$key" in
        url) url="$value" ;;
        method) method="$value" ;;
        timeout) timeout="$value" ;;
    esac
done
```
### Parsing Nested Objects

For flattened nested objects, use pattern matching on the key prefix:

```bash
# Create temporary files for nested data
headers_file=$(mktemp)
query_params_file=$(mktemp)

while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;
    esac
    [ -z "$line" ] && continue

    key="${line%%=*}"
    value="${line#*=}"

    # Remove surrounding quotes
    case "$value" in
        \'*\') value="${value#\'}"; value="${value%\'}" ;;
    esac

    # Process parameters
    case "$key" in
        url) url="$value" ;;
        method) method="$value" ;;
        headers.*)
            # Extract nested key (e.g., "Content-Type" from "headers.Content-Type")
            nested_key="${key#headers.}"
            printf '%s: %s\n' "$nested_key" "$value" >> "$headers_file"
            ;;
        query_params.*)
            nested_key="${key#query_params.}"
            printf '%s=%s\n' "$nested_key" "$value" >> "$query_params_file"
            ;;
    esac
done

# Use the parsed data. Build the argument list with `set --` rather than
# string concatenation, so header values containing spaces survive word splitting.
if [ -s "$headers_file" ]; then
    while IFS= read -r header; do
        set -- "$@" -H "$header"
    done < "$headers_file"
fi
# ...later: curl "$@" "$url"
```
## Configuration

### Action YAML Configuration

Specify the DOTENV format in your action YAML:

```yaml
ref: mypack.myaction
entry_point: myaction.sh
parameter_delivery: stdin
parameter_format: dotenv  # Use dotenv format
output_format: json
```

### Supported Formats

- `dotenv` - Shell-friendly `key='value'` format with nested object flattening
- `json` - Standard JSON format
- `yaml` - YAML format

### Supported Delivery Methods

- `stdin` - Parameters passed via stdin (recommended for security)
- `file` - Parameters written to a temporary file
## Security Considerations

### Why DOTENV + STDIN?

This combination provides several security benefits:

1. **No process list exposure**: Parameters don't appear in `ps aux` output
2. **No shell escaping issues**: Values are properly quoted
3. **Secret protection**: Sensitive values are passed via stdin, not environment variables
4. **No external dependencies**: Pure POSIX shell parsing without `jq` or other tools

### Secret Handling

Secrets are passed separately via stdin, after the parameters. They are never included in environment variables or parameter files.

```bash
# Parameters are sent first
url='https://api.example.com'
---ATTUNE_PARAMS_END---
# Then secrets (as JSON)
{"api_key":"secret123","password":"hunter2"}
```
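The receiving side of this stream layout — dotenv lines up to the delimiter, then a secrets JSON document — can be exercised end to end. A Python sketch for illustration (the shell examples earlier show the same parsing in POSIX sh; `read_stream` and its naive unquoting are assumptions of this sketch, not Attune API):

```python
import json
from io import StringIO

DELIM = "---ATTUNE_PARAMS_END---"

def read_stream(stream) -> tuple[dict, dict]:
    """Split a stdin-style stream into dotenv parameters and trailing secrets JSON."""
    params = {}
    for line in stream:
        line = line.rstrip("\n")
        if DELIM in line:
            break
        if not line:
            continue
        key, _, value = line.partition("=")
        params[key] = value.strip("'")  # naive unquoting; enough for this sketch
    rest = stream.read().strip()       # everything after the delimiter
    secrets = json.loads(rest) if rest else {}
    return params, secrets

stdin = StringIO(
    "url='https://api.example.com'\n"
    "---ATTUNE_PARAMS_END---\n"
    '{"api_key":"secret123","password":"hunter2"}\n'
)
params, secrets = read_stream(stdin)
print(params, secrets)
```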
## Examples

### Example 1: HTTP Request Action

**Action Configuration:**
```yaml
ref: core.http_request
parameter_delivery: stdin
parameter_format: dotenv
```

**Execution Parameters:**
```json
{
  "url": "https://api.example.com/users",
  "method": "POST",
  "headers": {
    "Content-Type": "application/json",
    "User-Agent": "Attune/1.0"
  },
  "query_params": {
    "page": "1",
    "limit": "10"
  }
}
```

**Stdin Input:**
```bash
headers.Content-Type='application/json'
headers.User-Agent='Attune/1.0'
method='POST'
query_params.limit='10'
query_params.page='1'
url='https://api.example.com/users'
---ATTUNE_PARAMS_END---
```
### Example 2: Simple Shell Action

**Action Configuration:**
```yaml
ref: mypack.greet
parameter_delivery: stdin
parameter_format: dotenv
```

**Execution Parameters:**
```json
{
  "name": "Alice",
  "greeting": "Hello"
}
```

**Stdin Input:**
```bash
greeting='Hello'
name='Alice'
---ATTUNE_PARAMS_END---
```
## Troubleshooting

### Issue: Parameters Not Received

**Symptom:** Action receives empty or incorrect parameter values.

**Solution:** Ensure you read until the `---ATTUNE_PARAMS_END---` delimiter:

```bash
while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;  # Important!
    esac
    # ... parse line
done
```
### Issue: Nested Objects Not Parsed

**Symptom:** Headers or query params are not being set correctly.

**Solution:** Use pattern matching to detect dotted keys:

```bash
case "$key" in
    headers.*)
        nested_key="${key#headers.}"
        # Process the nested key
        ;;
esac
```
### Issue: Special Characters Corrupted

**Symptom:** Values containing single quotes are malformed.

**Solution:** The worker automatically escapes single quotes using `'\''`. Make sure you strip the surrounding quotes correctly:

```bash
# Remove surrounding quotes (handles escaped quotes correctly)
case "$value" in
    \'*\') value="${value#\'}"; value="${value%\'}" ;;
esac
```
## Best Practices

1. **Always read until the delimiter**: Don't stop reading stdin early
2. **Handle empty objects**: Check that temp files are non-empty before processing
3. **Use temporary files**: For nested objects, write to temp files for easier processing
4. **Validate required parameters**: Check that required values are present
5. **Clean up temp files**: Use `trap` to ensure cleanup on exit

```bash
#!/bin/sh
set -e

# Set up cleanup; single quotes defer expansion until the trap fires,
# and the inner double quotes keep paths with spaces intact
headers_file=$(mktemp)
trap 'rm -f "$headers_file"' EXIT

# Parse parameters...
```
## Implementation Details

The parameter flattening is implemented in `crates/worker/src/runtime/parameter_passing.rs`:

- Nested objects are recursively flattened with dot notation
- Empty objects produce no output entries
- Arrays are JSON-serialized as strings
- Output is sorted alphabetically for consistency
- Single quotes are escaped using the shell-safe `'\''` pattern

## See Also

- [Action Parameter Schema](../packs/pack-structure.md#parameters)
- [Secrets Management](../authentication/secrets-management.md)
- [Shell Runtime](../architecture/worker-service.md#shell-runtime)
130
docs/web-ui/history-page-query-params.md
Normal file
@@ -0,0 +1,130 @@
# History Page URL Query Parameters

This document describes the URL query parameters supported by the history pages (Executions, Events, Enforcements) in the Attune web UI.

## Overview

All history pages support deep linking via URL query parameters. When navigating to a history page with query parameters, the page automatically initializes its filters with the provided values.
## Executions Page

**Path**: `/executions`

### Supported Query Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `action_ref` | Filter by action reference | `?action_ref=core.echo` |
| `rule_ref` | Filter by rule reference | `?rule_ref=core.on_timer` |
| `trigger_ref` | Filter by trigger reference | `?trigger_ref=core.webhook` |
| `pack_name` | Filter by pack name | `?pack_name=core` |
| `executor` | Filter by executor ID | `?executor=1` |
| `status` | Filter by execution status | `?status=running` |

### Valid Status Values

- `requested`
- `scheduling`
- `scheduled`
- `running`
- `completed`
- `failed`
- `canceling`
- `cancelled`
- `timeout`
- `abandoned`

### Examples

```
# Filter by action
http://localhost:3000/executions?action_ref=core.echo

# Filter by rule and status
http://localhost:3000/executions?rule_ref=core.on_timer&status=completed

# Multiple filters
http://localhost:3000/executions?pack_name=core&status=running&action_ref=core.echo
```
## Events Page

**Path**: `/events`

### Supported Query Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `trigger_ref` | Filter by trigger reference | `?trigger_ref=core.webhook` |

### Examples

```
# Filter by trigger
http://localhost:3000/events?trigger_ref=core.webhook

# Filter by timer trigger
http://localhost:3000/events?trigger_ref=core.timer
```
## Enforcements Page

**Path**: `/enforcements`

### Supported Query Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `rule_ref` | Filter by rule reference | `?rule_ref=core.on_timer` |
| `trigger_ref` | Filter by trigger reference | `?trigger_ref=core.webhook` |
| `event` | Filter by event ID | `?event=123` |
| `status` | Filter by enforcement status | `?status=processed` |

### Valid Status Values

- `created`
- `processed`
- `disabled`

### Examples

```
# Filter by rule
http://localhost:3000/enforcements?rule_ref=core.on_timer

# Filter by event
http://localhost:3000/enforcements?event=123

# Multiple filters
http://localhost:3000/enforcements?rule_ref=core.on_timer&status=processed
```
## Usage Patterns

### Deep Linking from Detail Pages

When viewing a specific execution, event, or enforcement detail page, you can click on related entities (actions, rules, triggers) to navigate to the history page with the appropriate filter pre-applied.

### Sharing Filtered Views

You can share URLs with query parameters to point others at specific filtered data sets:

```
# Share a view of all failed executions for a specific action
http://localhost:3000/executions?action_ref=core.http_request&status=failed

# Share enforcements for a specific rule
http://localhost:3000/enforcements?rule_ref=my_pack.important_rule
```

### Bookmarking

Save frequently used filter combinations as browser bookmarks for quick access.
## Implementation Notes

- Query parameters are read on page load and initialize the filter state
- Changing filters in the UI does **not** update the URL (stateless filtering)
- Multiple query parameters can be combined
- Invalid parameter values are ignored (filters default to empty)
- Parameter names match the API field names for consistency
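The load-time behavior in these notes can be sketched for the executions page. Illustrative Python only — the web UI does this client-side, and `initial_filters` is a hypothetical helper; the parameter names and status set come from the tables above:

```python
from urllib.parse import urlsplit, parse_qs

EXECUTION_STATUSES = {
    "requested", "scheduling", "scheduled", "running", "completed",
    "failed", "canceling", "cancelled", "timeout", "abandoned",
}

def initial_filters(url: str) -> dict:
    """Derive the initial executions-page filter state from a URL."""
    query = parse_qs(urlsplit(url).query)
    filters = {}
    for key in ("action_ref", "rule_ref", "trigger_ref", "pack_name", "executor", "status"):
        value = query.get(key, [""])[0]  # only one value per parameter is honored
        if key == "status" and value not in EXECUTION_STATUSES:
            value = ""  # invalid values are silently ignored
        if value:
            filters[key] = value
    return filters

print(initial_filters("http://localhost:3000/executions?action_ref=core.echo&status=Bogus"))
```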