more internal polish, resilient workers

2026-02-09 18:32:34 -06:00
parent 588b319fec
commit e31ecb781b
62 changed files with 9872 additions and 584 deletions


@@ -0,0 +1,367 @@
# Execution State Ownership Model
**Date**: 2026-02-09
**Status**: Implemented
**Related Issues**: Duplicate completion notifications, unnecessary database updates
## Overview
This document defines the **ownership model** for execution state management in Attune. It clarifies which service is responsible for updating execution records at each stage of the lifecycle, eliminating race conditions and redundant database writes.
## The Problem
Prior to this change, both the executor and worker were updating execution state in the database, causing:
1. **Race conditions** - unclear which service's update would happen first
2. **Redundant writes** - both services writing the same status value
3. **Architectural confusion** - no clear ownership boundaries
4. **Warning logs** - duplicate completion notifications
## The Solution: Lifecycle-Based Ownership
Execution state ownership is divided based on **lifecycle stage**, with a clear handoff point:
```
┌─────────────────────────────────────────────────────────────────┐
│ EXECUTOR OWNERSHIP │
│ │
│ Requested → Scheduling → Scheduled │
│ │ │
│ (includes cancellations/failures │ │
│ before execution.scheduled │ │
│ message is published) │ │
│ │ │
│ Handoff Point: │
│ execution.scheduled message PUBLISHED │
│ ▼ │
└─────────────────────────────────────────────────────────────────┘
│ Worker receives message
┌─────────────────────────────────────────────────────────────────┐
│ WORKER OWNERSHIP │
│ │
│ Running → Completed / Failed / Cancelled / Timeout │
│ │
└─────────────────────────────────────────────────────────────────┘
```
### Executor Responsibilities
The **Executor Service** owns execution state from creation through scheduling:
- ✅ Creates execution records (`Requested`)
- ✅ Updates status during scheduling (`Scheduling`)
- ✅ Updates status when scheduled to worker (`Scheduled`)
- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
- ✅ Handles cancellations/failures BEFORE `execution.scheduled` is published
- ❌ Does NOT update status after `execution.scheduled` is published
**Lifecycle stages**: `Requested` → `Scheduling` → `Scheduled`
**Important**: If an execution is cancelled or fails before the executor publishes `execution.scheduled`, the executor is responsible for updating the status (e.g., to `Cancelled`). The worker never learns about executions that don't reach the handoff point.
### Worker Responsibilities
The **Worker Service** owns execution state after receiving the handoff:
- ✅ Receives `execution.scheduled` message **← TAKES OWNERSHIP**
- ✅ Updates status when execution starts (`Running`)
- ✅ Updates status when execution completes (`Completed`, `Failed`, etc.)
- ✅ Handles cancellations AFTER receiving `execution.scheduled`
- ✅ Updates execution result data
- ✅ Publishes `execution.status_changed` notifications
- ✅ Publishes `execution.completed` notifications
- ❌ Does NOT update status for executions it hasn't received
**Lifecycle stages**: `Running` → `Completed` / `Failed` / `Cancelled` / `Timeout`
**Important**: The worker only owns executions it has received via `execution.scheduled`. If a cancellation happens before this message is sent, the worker is never involved.
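The handoff rule condenses to a single predicate. The sketch below is illustrative only — `Service` and `db_write_allowed` are not real Attune types — but it encodes the key principle: ownership flips exactly once, at `execution.scheduled`.

```rust
// Hypothetical sketch of the handoff rule; not actual Attune code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Service {
    Executor,
    Worker,
}

/// May `service` write this execution's database row?
/// Ownership hinges on a single event: whether the
/// execution.scheduled message has been published yet.
fn db_write_allowed(service: Service, scheduled_published: bool) -> bool {
    match service {
        Service::Executor => !scheduled_published,
        Service::Worker => scheduled_published,
    }
}

fn main() {
    // Executor owns the record until the handoff message goes out.
    assert!(db_write_allowed(Service::Executor, false));
    assert!(!db_write_allowed(Service::Executor, true));
    // Worker owns it only after the handoff.
    assert!(db_write_allowed(Service::Worker, true));
    assert!(!db_write_allowed(Service::Worker, false));
    println!("ownership checks pass");
}
```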
## Message Flow
### 1. Executor Creates and Schedules
```
Executor Service
├─> Creates execution (status: Requested)
├─> Updates status: Scheduling
├─> Selects worker
├─> Updates status: Scheduled
└─> Publishes: execution.scheduled → worker-specific queue
```
### 2. Worker Receives and Executes
```
Worker Service
├─> Receives: execution.scheduled
├─> Updates DB: Scheduled → Running
├─> Publishes: execution.status_changed (running)
├─> Executes action
├─> Updates DB: Running → Completed/Failed
├─> Publishes: execution.status_changed (completed/failed)
└─> Publishes: execution.completed
```
### 3. Executor Handles Orchestration
```
Executor Service (ExecutionManager)
├─> Receives: execution.status_changed
├─> Does NOT update database
├─> Handles orchestration logic:
│ ├─> Triggers workflow children (if parent completed)
│ ├─> Updates workflow state
│ └─> Manages parent-child relationships
└─> Logs event for monitoring
```
### 4. Queue Management
```
Executor Service (CompletionListener)
├─> Receives: execution.completed
├─> Releases queue slot
├─> Notifies waiting executions
└─> Updates queue statistics
```
## Database Update Rules
### Executor (Pre-Scheduling)
**File**: `crates/executor/src/scheduler.rs`
```rust
// ✅ Executor updates DB before scheduling
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;
// Publish to worker
Self::queue_to_worker(...).await?;
```
### Worker (Post-Scheduling)
**File**: `crates/worker/src/executor.rs`
```rust
// ✅ Worker updates DB when starting
async fn execute(&self, execution_id: i64) -> Result<ExecutionResult> {
    // Update status to running
    self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
    // Execute action...
}

// ✅ Worker updates DB when completing
async fn handle_execution_success(&self, execution_id: i64, result: &ExecutionResult) -> Result<()> {
    let input = UpdateExecutionInput {
        status: Some(ExecutionStatus::Completed),
        result: Some(result_data),
        // ...
    };
    ExecutionRepository::update(&self.pool, execution_id, input).await?;
    Ok(())
}
```
### Executor (Post-Scheduling)
**File**: `crates/executor/src/execution_manager.rs`
```rust
// ❌ Executor does NOT update DB after scheduling
async fn process_status_change(...) -> Result<()> {
    // Fetch execution (for orchestration logic only)
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;

    // Handle orchestration, but do NOT update DB
    match status {
        ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled => {
            Self::handle_completion(pool, publisher, &execution).await?;
        }
        _ => {}
    }
    Ok(())
}
```
## Benefits
### 1. Clear Ownership Boundaries
- No ambiguity about who updates what
- Easy to reason about system behavior
- Reduced cognitive load for developers
### 2. Eliminated Race Conditions
- Only one service updates each lifecycle stage
- No competing writes to same fields
- Predictable state transitions
### 3. Better Performance
- No redundant database writes
- Reduced database contention
- Lower network overhead (fewer queries)
### 4. Cleaner Logs
Before:
```
executor | Updated execution 9061 status: Scheduled -> Running
executor | Updated execution 9061 status: Running -> Running
executor | Updated execution 9061 status: Completed -> Completed
executor | WARN: Completion notification for action 3 but active_count is 0
```
After:
```
executor | Execution 9061 scheduled to worker 29
worker | Starting execution: 9061
worker | Execution 9061 completed successfully in 142ms
executor | Execution 9061 reached terminal state: Completed, handling orchestration
```
### 5. Idempotent Message Handling
- Executor can safely receive duplicate status change messages
- Worker updates are authoritative
- No special logic needed for retries
## Edge Cases & Error Handling
### Cancellation Before Handoff
**Scenario**: Execution is queued due to concurrency policy, user cancels before scheduling.
**Handling**:
- Execution in `Requested` or `Scheduling` state
- Executor updates status: → `Cancelled`
- Worker never receives `execution.scheduled`
- No worker resources consumed ✅
### Cancellation After Handoff
**Scenario**: Execution already scheduled to worker, user cancels while running.
**Handling**:
- Worker has received `execution.scheduled` and owns execution
- Worker updates status: `Running` → `Cancelled`
- Worker publishes status change notification
- Executor handles orchestration (e.g., skip workflow children)
### Worker Crashes Before Updating Status
**Scenario**: Worker receives `execution.scheduled` but crashes before updating status to `Running`.
**Handling**:
- Execution remains in `Scheduled` state
- Worker owned the execution but failed to update
- Executor's heartbeat monitoring detects stale scheduled executions
- After timeout, executor can reschedule to another worker or mark as abandoned
- Idempotent: If worker already started, duplicate scheduling is rejected
### Message Delivery Delays
**Scenario**: Worker updates DB but `execution.status_changed` message is delayed.
**Handling**:
- Database reflects correct state (source of truth)
- Executor eventually receives notification and handles orchestration
- Orchestration logic is idempotent (safe to call multiple times)
- Critical: Workflows may have slight delay, but remain consistent
### Partial Failures
**Scenario**: Worker updates DB successfully but fails to publish notification.
**Handling**:
- Database has correct state (worker succeeded)
- Executor won't trigger orchestration until notification arrives
- Future enhancement: Periodic executor polling for stale completions
- Workaround: Worker retries message publishing with exponential backoff
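The retry workaround typically follows a capped exponential schedule. A minimal sketch — the base and cap values are illustrative, not Attune's actual configuration:

```rust
use std::time::Duration;

/// Delay before retry `attempt` (0-based): base * 2^attempt, capped.
/// The values used by callers here are illustrative defaults.
fn backoff_delay(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    // Clamp the shift so the multiplier cannot overflow u64.
    let delay = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(delay.min(cap_ms))
}

fn main() {
    assert_eq!(backoff_delay(0, 100, 30_000).as_millis(), 100);
    assert_eq!(backoff_delay(1, 100, 30_000).as_millis(), 200);
    assert_eq!(backoff_delay(3, 100, 30_000).as_millis(), 800);
    // Large attempts hit the cap instead of growing without bound.
    assert_eq!(backoff_delay(20, 100, 30_000).as_millis(), 30_000);
    println!("backoff schedule ok");
}
```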
## Migration Notes
### Changes Required
1. **Executor Service** (`execution_manager.rs`):
- ✅ Removed database updates from `process_status_change()`
- ✅ Changed to read-only orchestration handler
- ✅ Updated logs to reflect observer role
2. **Worker Service** (`service.rs`):
- ✅ Already updates DB directly (no changes needed)
- ✅ Updated comment: "we'll update the database directly"
3. **Documentation**:
- ✅ Updated module docs to reflect ownership model
- ✅ Added ownership boundaries to architecture docs
### Backward Compatibility
- ✅ No breaking changes to external APIs
- ✅ Message formats unchanged
- ✅ Database schema unchanged
- ✅ Workflow behavior unchanged
## Testing Strategy
### Unit Tests
- ✅ Executor tests verify no DB updates after scheduling
- ✅ Worker tests verify DB updates at all lifecycle stages
- ✅ Message handler tests verify orchestration without DB writes
### Integration Tests
- Test full execution lifecycle end-to-end
- Verify status transitions in database
- Confirm orchestration logic (workflow children) still works
- Test failure scenarios (worker crashes, message delays)
### Monitoring
Monitor for:
- Executions stuck in `Scheduled` state (worker not picking up)
- Large delays between status changes (message queue lag)
- Workflow children not triggering (orchestration failure)
## Future Enhancements
### 1. Executor Polling for Stale Completions
If `execution.status_changed` messages are lost, executor could periodically poll for completed executions that haven't triggered orchestration.
### 2. Worker Health Checks
More robust detection of worker failures before scheduled executions time out.
### 3. Explicit Handoff Messages
Consider adding `execution.handoff` message to explicitly mark ownership transfer point.
## References
- **Architecture Doc**: `docs/architecture/executor-service.md`
- **Work Summary**: `work-summary/2026-02-09-duplicate-completion-fix.md`
- **Bug Fix Doc**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
- **ExecutionManager**: `crates/executor/src/execution_manager.rs`
- **Worker Executor**: `crates/worker/src/executor.rs`
- **Worker Service**: `crates/worker/src/service.rs`
## Summary
The execution state ownership model provides **clear, lifecycle-based boundaries** for who updates execution records:
- **Executor**: Owns state from creation through scheduling (including pre-handoff cancellations)
- **Worker**: Owns state after receiving `execution.scheduled` message
- **Handoff**: Occurs when `execution.scheduled` message is **published to worker**
- **Key Principle**: Worker only knows about executions it receives; pre-handoff cancellations are executor's responsibility
This eliminates race conditions, reduces database load, and provides a clean architectural foundation for future enhancements.


@@ -0,0 +1,342 @@
# Bug Fix: Duplicate Completion Notifications & Unnecessary Database Updates
**Date**: 2026-02-09
**Component**: Executor Service (ExecutionManager)
**Issue Type**: Performance & Correctness
## Overview
Fixed two related inefficiencies in the executor service:
1. **Duplicate completion notifications** causing queue manager warnings
2. **Unnecessary database updates** writing unchanged status values
---
## Problem 1: Duplicate Completion Notifications
### Symptom
```
WARN crates/executor/src/queue_manager.rs:320:
Completion notification for action 3 but active_count is 0
```
### Before Fix - Message Flow
```
┌─────────────────────────────────────────────────────────────────┐
│ Worker Service │
│ │
│ 1. Completes action execution │
│ 2. Updates DB: status = "Completed" │
│ 3. Publishes: execution.status_changed (status: "completed") │
│ 4. Publishes: execution.completed ────────────┐ │
└─────────────────────────────────────────────────┼───────────────┘
┌────────────────────────────────┼───────────────┐
│ │ │
▼ ▼ │
┌─────────────────────────────┐ ┌──────────────────────────────┤
│ ExecutionManager │ │ CompletionListener │
│ │ │ │
│ Receives: │ │ Receives: execution.completed│
│ execution.status_changed │ │ │
│ │ │ → notify_completion() │
│ → handle_completion() │ │ → Decrements active_count ✅ │
│ → publish_completion_notif()│ └──────────────────────────────┘
│ │
│ Publishes: execution.completed ───────┐
└─────────────────────────────┘ │
┌─────────────────────┘
┌────────────────────────────┐
│ CompletionListener (again) │
│ │
│ Receives: execution.completed (2nd time!)
│ │
│ → notify_completion() │
│ → active_count already 0 │
│ → ⚠️ WARNING LOGGED │
└────────────────────────────┘
Result: 2x completion notifications, 1x warning
```
### After Fix - Message Flow
```
┌─────────────────────────────────────────────────────────────────┐
│ Worker Service │
│ │
│ 1. Completes action execution │
│ 2. Updates DB: status = "Completed" │
│ 3. Publishes: execution.status_changed (status: "completed") │
│ 4. Publishes: execution.completed ────────────┐ │
└─────────────────────────────────────────────────┼───────────────┘
┌────────────────────────────────┼───────────────┐
│ │ │
▼ ▼ │
┌─────────────────────────────┐ ┌──────────────────────────────┤
│ ExecutionManager │ │ CompletionListener │
│ │ │ │
│ Receives: │ │ Receives: execution.completed│
│ execution.status_changed │ │ │
│ │ │ → notify_completion() │
│ → handle_completion() │ │ → Decrements active_count ✅ │
│ → Handles workflow children │ └──────────────────────────────┘
│ → NO completion publish ✅ │
└─────────────────────────────┘
Result: 1x completion notification, 0x warnings ✅
```
---
## Problem 2: Unnecessary Database Updates
### Symptom
```
INFO crates/executor/src/execution_manager.rs:108:
Updated execution 9061 status: Completed -> Completed
```
### Before Fix - Status Update Flow
```
┌─────────────────────────────────────────────────────────────────┐
│ Worker Service │
│ │
│ 1. Completes action execution │
│ 2. ExecutionRepository::update() │
│ status: Running → Completed ✅ │
│ 3. Publishes: execution.status_changed (status: "completed") │
└─────────────────────────────────┬───────────────────────────────┘
│ Message Queue
┌─────────────────────────────────────────────────────────────────┐
│ ExecutionManager │
│ │
│ 1. Receives: execution.status_changed (status: "completed") │
│ 2. Fetches execution from DB │
│ Current status: Completed │
│ 3. Sets: execution.status = Completed (same value) │
│ 4. ExecutionRepository::update() │
│ status: Completed → Completed ❌ │
│ 5. Logs: "Updated execution 9061 status: Completed -> Completed"
└─────────────────────────────────────────────────────────────────┘
Result: 2x database writes for same status value
```
### After Fix - Status Update Flow
```
┌─────────────────────────────────────────────────────────────────┐
│ Worker Service │
│ │
│ 1. Completes action execution │
│ 2. ExecutionRepository::update() │
│ status: Running → Completed ✅ │
│ 3. Publishes: execution.status_changed (status: "completed") │
└─────────────────────────────────────┬───────────────────────────┘
│ Message Queue
┌─────────────────────────────────────────────────────────────────┐
│ ExecutionManager │
│ │
│ 1. Receives: execution.status_changed (status: "completed") │
│ 2. Fetches execution from DB │
│ Current status: Completed │
│ 3. Compares: old_status (Completed) == new_status (Completed) │
│ 4. Skips database update ✅ │
│ 5. Still handles orchestration (workflow children) │
│ 6. Logs: "Execution 9061 status unchanged, skipping update" │
└─────────────────────────────────────────────────────────────────┘
Result: 1x database write (only when status changes) ✅
```
---
## Code Changes
### Change 1: Remove Duplicate Completion Publication
**File**: `crates/executor/src/execution_manager.rs`
```rust
// BEFORE
async fn handle_completion(...) -> Result<()> {
    // Handle workflow children...

    // Publish completion notification
    Self::publish_completion_notification(pool, publisher, execution).await?;
    // ^^^ DUPLICATE - worker already did this!
    Ok(())
}
```
```rust
// AFTER
async fn handle_completion(...) -> Result<()> {
    // Handle workflow children...

    // NOTE: Completion notification is published by the worker, not here.
    // This prevents duplicate execution.completed messages that would cause
    // the queue manager to decrement active_count twice.
    Ok(())
}

// Removed entire publish_completion_notification() method
```
### Change 2: Skip Unnecessary Database Updates
**File**: `crates/executor/src/execution_manager.rs`
```rust
// BEFORE
async fn process_status_change(...) -> Result<()> {
    let mut execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    let old_status = execution.status.clone();
    execution.status = status; // Always set, even if same
    ExecutionRepository::update(pool, execution.id, execution.clone().into()).await?;
    // ^^^ ALWAYS writes, even if unchanged!
    info!("Updated execution {} status: {:?} -> {:?}", execution_id, old_status, status);
    // Handle completion logic...
    Ok(())
}
```
```rust
// AFTER
async fn process_status_change(...) -> Result<()> {
    let mut execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    let old_status = execution.status.clone();

    // Skip update if status hasn't changed
    if old_status == status {
        debug!(
            "Execution {} status unchanged ({:?}), skipping database update",
            execution_id, status
        );
        // Still handle completion logic for orchestration (e.g., workflow children)
        if matches!(status, ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled) {
            Self::handle_completion(pool, publisher, &execution).await?;
        }
        return Ok(()); // Early return - no DB write
    }

    execution.status = status;
    ExecutionRepository::update(pool, execution.id, execution.clone().into()).await?;
    info!("Updated execution {} status: {:?} -> {:?}", execution_id, old_status, status);
    // Handle completion logic...
    Ok(())
}
```
---
## Impact & Benefits
### Performance Improvements
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Completion messages per execution | 2 | 1 | **50% reduction** |
| Queue manager warnings | Frequent | None | **100% elimination** |
| Database writes (no status change) | Always | Never | **100% elimination** |
| Log noise | High | Low | **Significant reduction** |
### Typical Execution Flow
**Before fixes**:
- 1x execution completed
- 2x `execution.completed` messages published
- 1x unnecessary database write (Completed → Completed)
- 1x queue manager warning
- Noisy logs with redundant "status: Completed -> Completed" messages
**After fixes**:
- 1x execution completed
- 1x `execution.completed` message published (worker only)
- 0x unnecessary database writes
- 0x queue manager warnings
- Clean, informative logs
### High-Throughput Scenarios
At **1000 executions/minute**:
**Before**:
- 2000 completion messages/min
- ~1000 unnecessary DB writes/min
- ~1000 warning logs/min
**After**:
- 1000 completion messages/min (50% reduction)
- 0 unnecessary DB writes (100% reduction)
- 0 warning logs (100% reduction)
---
## Testing
✅ All 58 executor unit tests pass
✅ Zero compiler warnings
✅ No breaking changes to external behavior
✅ Orchestration logic (workflow children) still works correctly
---
## Architecture Clarifications
### Separation of Concerns
| Component | Responsibility |
|-----------|----------------|
| **Worker** | Authoritative source for execution completion, publishes completion notifications |
| **Executor** | Orchestration (workflows, child executions), NOT completion notifications |
| **CompletionListener** | Queue management (releases slots for queued executions) |
### Idempotency
The executor is now **idempotent** with respect to status change messages:
- Receiving the same status change multiple times has no effect after the first
- Database is only written when state actually changes
- Orchestration logic (workflows) runs correctly regardless
---
## Lessons Learned
1. **Message publishers should be explicit** - Only one component should publish a given message type
2. **Always check for actual changes** - Don't blindly write to database without comparing old/new values
3. **Separate orchestration from notification** - Workflow logic shouldn't trigger duplicate notifications
4. **Log levels matter** - Changed redundant updates from INFO to DEBUG to reduce noise
5. **Trust the source** - Worker owns execution lifecycle; executor shouldn't second-guess it
---
## Related Documentation
- Work Summary: `attune/work-summary/2026-02-09-duplicate-completion-fix.md`
- Queue Manager: `attune/crates/executor/src/queue_manager.rs`
- Completion Listener: `attune/crates/executor/src/completion_listener.rs`
- Execution Manager: `attune/crates/executor/src/execution_manager.rs`


@@ -0,0 +1,337 @@
# Quick Reference: DOTENV Shell Actions Pattern
**Purpose:** Standard pattern for writing portable shell actions without external dependencies like `jq`.
## Core Principles
1. **Use POSIX shell** (`#!/bin/sh`), not bash
2. **Read parameters in DOTENV format** from stdin
3. **No external JSON parsers** (jq, yq, etc.)
4. **Minimal dependencies** (only POSIX utilities + curl)
## Complete Template
```sh
#!/bin/sh
# Action Name - Core Pack
# Brief description of what this action does
#
# This script uses pure POSIX shell without external dependencies like jq.
# It reads parameters in DOTENV format from stdin until the delimiter.
set -e

# Initialize variables with defaults
param1=""
param2="default_value"
bool_param="false"
numeric_param="0"

# Read DOTENV-formatted parameters from stdin until delimiter
while IFS= read -r line; do
    # Check for parameter delimiter
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*)
            break
            ;;
    esac

    [ -z "$line" ] && continue

    key="${line%%=*}"
    value="${line#*=}"

    # Remove quotes if present (both single and double)
    case "$value" in
        \"*\")
            value="${value#\"}"
            value="${value%\"}"
            ;;
        \'*\')
            value="${value#\'}"
            value="${value%\'}"
            ;;
    esac

    # Process parameters
    case "$key" in
        param1)
            param1="$value"
            ;;
        param2)
            param2="$value"
            ;;
        bool_param)
            bool_param="$value"
            ;;
        numeric_param)
            numeric_param="$value"
            ;;
    esac
done

# Normalize boolean values
case "$bool_param" in
    true|True|TRUE|yes|Yes|YES|1) bool_param="true" ;;
    *) bool_param="false" ;;
esac

# Validate numeric parameters
case "$numeric_param" in
    ''|*[!0-9]*)
        echo "ERROR: numeric_param must be a positive integer" >&2
        exit 1
        ;;
esac

# Validate required parameters
if [ -z "$param1" ]; then
    echo "ERROR: param1 is required" >&2
    exit 1
fi

# Action logic goes here
echo "Processing with param1=$param1, param2=$param2"

# Exit successfully
exit 0
```
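A quick way to exercise the pattern locally is to pipe a DOTENV block into a script exactly as the worker would. The snippet below writes a stripped-down action (parameter loop plus quote stripping only) to a hypothetical path `/tmp/demo_action.sh` and drives it from stdin:

```sh
#!/bin/sh
# Self-contained demo of the DOTENV parsing core; paths and parameter
# names are illustrative, not part of any real pack.
cat > /tmp/demo_action.sh <<'EOF'
#!/bin/sh
set -e
param1=""
while IFS= read -r line; do
    case "$line" in *"---ATTUNE_PARAMS_END---"*) break ;; esac
    [ -z "$line" ] && continue
    key="${line%%=*}"
    value="${line#*=}"
    case "$value" in
        \"*\") value="${value#\"}"; value="${value%\"}" ;;
    esac
    case "$key" in param1) param1="$value" ;; esac
done
echo "param1=$param1"
EOF
chmod +x /tmp/demo_action.sh

# Feed parameters the way the worker does: DOTENV lines, then delimiter.
printf '%s\n' 'param1="hello"' '---ATTUNE_PARAMS_END---' | sh /tmp/demo_action.sh
# prints: param1=hello
```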
## YAML Metadata Configuration
```yaml
ref: core.action_name
label: "Action Name"
description: "Brief description"
enabled: true
runner_type: shell
entry_point: action_name.sh

# IMPORTANT: Use dotenv format for POSIX shell compatibility
parameter_delivery: stdin
parameter_format: dotenv

# Output format (text or json)
output_format: text

parameters:
  type: object
  properties:
    param1:
      type: string
      description: "First parameter"
    param2:
      type: string
      description: "Second parameter"
      default: "default_value"
    bool_param:
      type: boolean
      description: "Boolean parameter"
      default: false
  required:
    - param1
```
## Common Patterns
### 1. Parameter Parsing
**Read until delimiter:**
```sh
while IFS= read -r line; do
    case "$line" in
        *"---ATTUNE_PARAMS_END---"*) break ;;
    esac
done
```
**Extract key-value:**
```sh
key="${line%%=*}" # Everything before first =
value="${line#*=}" # Everything after first =
```
**Remove quotes:**
```sh
case "$value" in
    \"*\") value="${value#\"}"; value="${value%\"}" ;;
    \'*\') value="${value#\'}"; value="${value%\'}" ;;
esac
```
### 2. Boolean Normalization
```sh
case "$bool_param" in
    true|True|TRUE|yes|Yes|YES|1) bool_param="true" ;;
    *) bool_param="false" ;;
esac
```
### 3. Numeric Validation
```sh
case "$number" in
    ''|*[!0-9]*)
        echo "ERROR: must be a number" >&2
        exit 1
        ;;
esac
```
### 4. JSON Output (without jq)
**Escape special characters:**
```sh
escaped=$(printf '%s' "$value" | sed 's/\\/\\\\/g; s/"/\\"/g')
```
**Build JSON:**
```sh
cat <<EOF
{
  "field": "$escaped",
  "boolean": $bool_value,
  "number": $number
}
EOF
```
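Put together, the escaping and construction steps make a small, dependency-free JSON emitter. The field names and sample value below are illustrative:

```sh
#!/bin/sh
# Emit a JSON object without jq. The value deliberately contains both
# quotes and a backslash to exercise the escaping.
value='say "hi" \now'
escaped=$(printf '%s' "$value" | sed 's/\\/\\\\/g; s/"/\\"/g')
printf '{"field":"%s","ok":%s,"count":%s}\n' "$escaped" true 3
# prints: {"field":"say \"hi\" \\now","ok":true,"count":3}
```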
### 5. Making HTTP Requests
**With curl and temp files:**
```sh
temp_response=$(mktemp)
cleanup() { rm -f "$temp_response"; }
trap cleanup EXIT
http_code=$(curl -X POST \
-H "Content-Type: application/json" \
${api_token:+-H "Authorization: Bearer ${api_token}"} \
-d "$request_body" \
-s \
-w "%{http_code}" \
-o "$temp_response" \
--max-time 60 \
"${api_url}/api/v1/endpoint" 2>/dev/null || echo "000")
if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
cat "$temp_response"
exit 0
else
echo "ERROR: API call failed (HTTP $http_code)" >&2
exit 1
fi
```
### 6. Extracting JSON Fields (simple cases)
**Extract field value:**
```sh
case "$response" in
    *'"field":'*)
        # Use [[:space:]] here: \s is a GNU sed extension, not POSIX.
        value=$(printf '%s' "$response" | sed -n 's/.*"field":[[:space:]]*"\([^"]*\)".*/\1/p')
        ;;
esac
```
**Note:** For complex JSON, consider having the API return the exact format needed.
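A concrete (hypothetical) run of the extraction pattern against a flat response, using the POSIX character class `[[:space:]]` so it works on busybox and BSD sed alike:

```sh
#!/bin/sh
# Pull one string field out of a flat JSON response with POSIX sed only.
# The `field` key and sample response are illustrative.
response='{"id":7,"field": "abc-123","other":true}'
value=$(printf '%s' "$response" | sed -n 's/.*"field":[[:space:]]*"\([^"]*\)".*/\1/p')
echo "$value"
# prints: abc-123
```

This only works for flat objects with unescaped string values; nested objects or values containing `"` need a real parser.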
## Anti-Patterns (DO NOT DO)
**Using jq:**
```sh
value=$(echo "$json" | jq -r '.field') # NO!
```
**Using bash-specific features:**
```sh
#!/bin/bash # NO! Use #!/bin/sh
[[ "$var" == "value" ]] # NO! Use [ "$var" = "value" ]
```
**Reading JSON directly from stdin:**
```yaml
parameter_format: json # NO! Use dotenv
```
**Using Python/Node.js in core pack:**
```yaml
runner_type: python # NO! Use shell for core pack
```
## Testing Checklist
- [ ] Script has `#!/bin/sh` shebang
- [ ] Script is executable (`chmod +x`)
- [ ] All parameters have defaults or validation
- [ ] Boolean values are normalized
- [ ] Numeric values are validated
- [ ] Required parameters are checked
- [ ] Error messages go to stderr (`>&2`)
- [ ] Successful output goes to stdout
- [ ] Temp files are cleaned up (trap handler)
- [ ] YAML has `parameter_format: dotenv`
- [ ] YAML has `runner_type: shell`
- [ ] No `jq`, `yq`, or bash-isms used
- [ ] Works on Alpine Linux (minimal environment)
## Examples from Core Pack
### Simple Action (echo.sh)
- Minimal parameter parsing
- Single string parameter
- Text output
### Complex Action (http_request.sh)
- Multiple parameters (headers, query params)
- HTTP client implementation
- JSON output construction
- Error handling
### API Wrapper (register_packs.sh)
- JSON request body construction
- API authentication
- Response parsing
- Structured error messages
## DOTENV Format Specification
**Format:** Each parameter on a new line as `key=value`
**Example:**
```
param1="string value"
param2=42
bool_param=true
---ATTUNE_PARAMS_END---
```
**Key Rules:**
- Parameters end with `---ATTUNE_PARAMS_END---` delimiter
- Values may be quoted (single or double quotes)
- Empty lines are skipped
- No multiline values (use base64 if needed)
- Array/object parameters passed as JSON strings
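Because multiline values are disallowed, a sender can base64-encode them into a single line and the action decodes on receipt. A sketch — `body_b64` is a hypothetical parameter name, and `base64 -d` assumes GNU or BusyBox coreutils (older BSD/macOS spells it `-D`):

```sh
#!/bin/sh
# Round-trip a multiline value through a single DOTENV-safe line.
body_b64=$(printf 'line one\nline two' | base64 | tr -d '\n')  # what the sender ships
body=$(printf '%s' "$body_b64" | base64 -d)                    # what the action recovers
printf '%s\n' "$body"
```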
## When to Use This Pattern
**Use DOTENV shell pattern for:**
- Core pack actions
- Simple utility actions
- Actions that need maximum portability
- Actions that run in minimal containers
- Actions that don't need complex JSON parsing
**Consider other runtimes if you need:**
- Complex JSON manipulation
- External libraries (AWS SDK, etc.)
- Advanced string processing
- Parallel processing
- Language-specific features
## Further Reading
- `packs/core/actions/echo.sh` - Simplest example
- `packs/core/actions/http_request.sh` - Complex example
- `packs/core/actions/register_packs.sh` - API wrapper example
- `docs/pack-structure.md` - Pack development guide


@@ -0,0 +1,204 @@
# Quick Reference: Execution State Ownership
**Last Updated**: 2026-02-09
## Ownership Model at a Glance
```
┌──────────────────────────────────────────────────────────┐
│ EXECUTOR OWNS │ WORKER OWNS │
│ Requested │ Running │
│ Scheduling │ Completed │
│ Scheduled │ Failed │
│ (+ pre-handoff Cancelled) │ (+ post-handoff │
│ │ Cancelled/Timeout/ │
│ │ Abandoned) │
└───────────────────────────────┴──────────────────────────┘
│ │
└─────── HANDOFF ──────────┘
execution.scheduled PUBLISHED
```
## Who Updates the Database?
### Executor Updates (Pre-Handoff Only)
- ✅ Creates execution record
- ✅ Updates status: `Requested` → `Scheduling` → `Scheduled`
- ✅ Publishes `execution.scheduled` message **← HANDOFF POINT**
- ✅ Handles cancellations/failures BEFORE handoff (worker never notified)
- ❌ NEVER updates after `execution.scheduled` is published
### Worker Updates (Post-Handoff Only)
- ✅ Receives `execution.scheduled` message (takes ownership)
- ✅ Updates status: `Scheduled` → `Running`
- ✅ Updates status: `Running` → `Completed`/`Failed`/`Cancelled`/etc.
- ✅ Handles cancellations/failures AFTER handoff
- ✅ Updates result data
- ✅ Writes for every status change after receiving handoff
## Who Publishes Messages?
### Executor Publishes
- `enforcement.created` (from rules)
- `execution.requested` (to scheduler)
- `execution.scheduled` (to worker) **← HANDOFF MESSAGE - OWNERSHIP TRANSFER**
### Worker Publishes
- `execution.status_changed` (for each status change after handoff)
- `execution.completed` (when done)
### Executor Receives (But Doesn't Update DB Post-Handoff)
- `execution.status_changed` → triggers orchestration logic (read-only)
- `execution.completed` → releases queue slots
## Code Locations
### Executor Updates DB
```rust
// crates/executor/src/scheduler.rs
execution.status = ExecutionStatus::Scheduled;
ExecutionRepository::update(pool, execution.id, execution.into()).await?;
```
### Worker Updates DB
```rust
// crates/worker/src/executor.rs
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
// ...
ExecutionRepository::update(&self.pool, execution_id, input).await?;
```
### Executor Orchestrates (Read-Only)
```rust
// crates/executor/src/execution_manager.rs
async fn process_status_change(...) -> Result<()> {
    let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
    // NO UPDATE - just orchestration logic
    Self::handle_completion(pool, publisher, &execution).await?;
    Ok(())
}
```
## Decision Tree: Should I Update the DB?
```
Are you in the Executor?
├─ Have you published execution.scheduled for this execution?
│ ├─ NO → Update DB (you own it)
│ │ └─ Includes: Requested/Scheduling/Scheduled/pre-handoff Cancelled
│ └─ YES → Don't update DB (worker owns it now)
│ └─ Just orchestrate (trigger workflows, etc)
Are you in the Worker?
├─ Have you received execution.scheduled for this execution?
│ ├─ YES → Update DB for ALL status changes (you own it)
│ │ └─ Includes: Running/Completed/Failed/post-handoff Cancelled/etc.
│ └─ NO → Don't touch this execution (doesn't exist for you yet)
```
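The tree above maps onto a small pure function. `Action` and `db_decision` are hypothetical names for illustration, not Attune code:

```rust
// Hypothetical encoding of the decision tree; not actual Attune code.
#[derive(Debug, PartialEq)]
enum Action {
    UpdateDb,
    OrchestrateOnly,
    Ignore,
}

/// `in_executor`: true for the executor, false for the worker.
/// `handoff_done`: executor has published (or worker has received)
/// execution.scheduled for this execution.
fn db_decision(in_executor: bool, handoff_done: bool) -> Action {
    match (in_executor, handoff_done) {
        (true, false) => Action::UpdateDb,       // executor still owns it
        (true, true) => Action::OrchestrateOnly, // worker owns it now
        (false, true) => Action::UpdateDb,       // worker owns it
        (false, false) => Action::Ignore,        // worker never saw it
    }
}

fn main() {
    assert_eq!(db_decision(true, false), Action::UpdateDb);
    assert_eq!(db_decision(true, true), Action::OrchestrateOnly);
    assert_eq!(db_decision(false, true), Action::UpdateDb);
    assert_eq!(db_decision(false, false), Action::Ignore);
    println!("decision tree ok");
}
```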
## Common Patterns
### ✅ DO: Worker Updates After Handoff
```rust
// Worker receives execution.scheduled
self.update_execution_status(execution_id, ExecutionStatus::Running).await?;
self.publish_status_update(execution_id, ExecutionStatus::Running).await?;
```
### ✅ DO: Executor Orchestrates Without DB Write
```rust
// Executor receives execution.status_changed
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
if status == ExecutionStatus::Completed {
Self::trigger_child_executions(pool, publisher, &execution).await?;
}
```
### ❌ DON'T: Executor Updates After Handoff
```rust
// Executor receives execution.status_changed
execution.status = status;
ExecutionRepository::update(pool, execution.id, execution).await?; // ❌ WRONG!
```
### ❌ DON'T: Worker Updates Before Handoff
```rust
// Worker updates execution it hasn't received via execution.scheduled
ExecutionRepository::update(&self.pool, execution_id, input).await?; // ❌ WRONG!
```
### ✅ DO: Executor Handles Pre-Handoff Cancellation
```rust
// User cancels execution before it's scheduled to worker
// Execution is still in Requested/Scheduling state
execution.status = ExecutionStatus::Cancelled;
ExecutionRepository::update(pool, execution_id, execution).await?; // ✅ CORRECT!
// Worker never receives execution.scheduled, never knows execution existed
```
### ✅ DO: Worker Handles Post-Handoff Cancellation
```rust
// Worker received execution.scheduled, now owns execution
// User cancels execution while it's running
execution.status = ExecutionStatus::Cancelled;
ExecutionRepository::update(&self.pool, execution_id, execution).await?; // ✅ CORRECT!
self.publish_status_update(execution_id, ExecutionStatus::Cancelled).await?;
```
## Handoff Checklist
When an execution is scheduled:
**Executor Must**:
- [x] Update status to `Scheduled`
- [x] Write to database
- [x] Publish `execution.scheduled` message **← HANDOFF OCCURS HERE**
- [x] Stop updating this execution (ownership transferred)
- [x] Continue to handle orchestration (read-only)
**Worker Must**:
- [x] Receive `execution.scheduled` message **← OWNERSHIP RECEIVED**
- [x] Take ownership of execution state
- [x] Update DB for all future status changes
- [x] Handle any cancellations/failures after this point
- [x] Publish status notifications
**Important**: If execution is cancelled BEFORE executor publishes `execution.scheduled`, the executor updates status to `Cancelled` and worker never learns about it.
## Benefits Summary
| Aspect | Benefit |
|--------|---------|
| **Race Conditions** | Eliminated - only one owner per stage |
| **DB Writes** | Reduced by ~50% - no duplicates |
| **Code Clarity** | Clear boundaries - easy to reason about |
| **Message Traffic** | Reduced - no duplicate completions |
| **Idempotency** | Safe to receive duplicate messages |
## Troubleshooting
### Execution Stuck in "Scheduled"
**Problem**: Worker not updating status to Running
**Check**: Was execution.scheduled published? Worker received it? Worker healthy?
### Workflow Children Not Triggering
**Problem**: Orchestration not running
**Check**: Worker published execution.status_changed? Message queue healthy?
### Duplicate Status Updates
**Problem**: Both services updating DB
**Check**: Executor should NOT update after publishing execution.scheduled
### Execution Cancelled But Status Not Updated
**Problem**: Cancellation not reflected in database
**Check**: Was it cancelled before or after handoff?
**Fix**: If before handoff → executor updates; if after handoff → worker updates
### Queue Warnings
**Problem**: Duplicate completion notifications
**Check**: Only worker should publish execution.completed
## See Also
- **Full Architecture Doc**: `docs/ARCHITECTURE-execution-state-ownership.md`
- **Bug Fix Visualization**: `docs/BUGFIX-duplicate-completion-2026-02-09.md`
- **Work Summary**: `work-summary/2026-02-09-execution-state-ownership.md`


# Quick Reference: Phase 3 - Intelligent Retry & Worker Health
## Overview
Phase 3 adds intelligent retry logic and proactive worker health monitoring to automatically recover from transient failures and optimize worker selection.
**Key Features:**
- **Automatic Retry:** Failed executions automatically retry with exponential backoff
- **Health-Aware Scheduling:** Prefer healthy workers with low queue depth
- **Per-Action Configuration:** Custom timeouts and retry limits per action
- **Failure Classification:** Distinguish retriable vs non-retriable failures
## Quick Start
### Enable Retry for an Action
```yaml
# packs/mypack/actions/flaky-api.yaml
name: flaky_api_call
runtime: python
entrypoint: actions/flaky_api.py
timeout_seconds: 120 # Custom timeout (overrides global 5 min)
max_retries: 3 # Retry up to 3 times on failure
parameters:
url:
type: string
required: true
```
### Database Migration
```bash
# Apply Phase 3 schema changes
sqlx migrate run
# Or via Docker Compose
docker compose exec postgres psql -U attune -d attune -f /migrations/20260209000000_phase3_retry_and_health.sql
```
### Check Worker Health
```bash
# View healthy workers
psql -c "SELECT * FROM healthy_workers;"
# Check specific worker health
psql -c "
SELECT
name,
capabilities->'health'->>'status' as health_status,
capabilities->'health'->>'queue_depth' as queue_depth,
capabilities->'health'->>'consecutive_failures' as failures
FROM worker
WHERE id = 1;
"
```
## Retry Behavior
### Retriable Failures
Executions are automatically retried for:
- ✓ Worker unavailable (`worker_unavailable`)
- ✓ Queue timeout/TTL expired (`queue_timeout`)
- ✓ Worker heartbeat stale (`worker_heartbeat_stale`)
- ✓ Transient errors (`transient_error`)
- ✓ Manual retry requested (`manual_retry`)
### Non-Retriable Failures
These failures are NOT retried:
- ✗ Validation errors
- ✗ Permission denied
- ✗ Action not found
- ✗ Invalid parameters
- ✗ Explicit action failure
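The two lists above amount to a classification over the reason identifiers shown in parentheses. A minimal sketch (the real classifier may inspect richer error data than a bare string):

```rust
/// Returns true for the retriable reason identifiers listed above;
/// everything else (validation errors, permission denied, etc.) is final.
fn is_retriable(reason: &str) -> bool {
    matches!(
        reason,
        "worker_unavailable"
            | "queue_timeout"
            | "worker_heartbeat_stale"
            | "transient_error"
            | "manual_retry"
    )
}

fn main() {
    assert!(is_retriable("queue_timeout"));
    assert!(!is_retriable("validation_error")); // never retried
}
```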
### Retry Backoff
**Strategy:** Exponential backoff with jitter
```
Attempt 0: ~1 second
Attempt 1: ~2 seconds
Attempt 2: ~4 seconds
Attempt 3: ~8 seconds
Attempt N: min(base * 2^N, 300 seconds)
```
**Jitter:** ±20% randomization to avoid thundering herd
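The backoff table reduces to the following calculation; jitter is omitted here for determinism (the real implementation randomizes the result by ±20%):

```rust
/// Deterministic core of the exponential backoff shown above:
/// min(base * 2^attempt, max), before jitter is applied.
fn backoff_secs(attempt: u32, base_secs: u64, max_secs: u64) -> u64 {
    let exp = attempt.min(32); // cap the shift to avoid overflow
    base_secs.saturating_mul(1u64 << exp).min(max_secs)
}

fn main() {
    // With base = 1s and max = 300s, attempts 0..4 yield 1, 2, 4, 8, 16.
    for attempt in 0..5 {
        println!("Attempt {attempt}: ~{}s", backoff_secs(attempt, 1, 300));
    }
}
```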
### Retry Configuration
```rust
// Default retry configuration
RetryConfig {
enabled: true,
base_backoff_secs: 1,
max_backoff_secs: 300, // 5 minutes max
backoff_multiplier: 2.0,
jitter_factor: 0.2, // 20% jitter
}
```
## Worker Health
### Health States
**Healthy:**
- Heartbeat < 30 seconds old
- Consecutive failures < 3
- Queue depth < 50
- Failure rate < 30%
**Degraded:**
- Consecutive failures: 3-9
- Queue depth: 50-99
- Failure rate: 30-69%
- Still receives tasks but deprioritized
**Unhealthy:**
- Heartbeat > 30 seconds old
- Consecutive failures ≥ 10
- Queue depth ≥ 100
- Failure rate ≥ 70%
- Does NOT receive new tasks
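The three states above follow directly from the listed thresholds. A sketch of the classification (field names are illustrative; unhealthy conditions are checked first so the worst matching state wins):

```rust
#[derive(Debug, PartialEq)]
enum Health {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Classify a worker from the thresholds listed above.
fn classify(
    heartbeat_age_secs: u64,
    consecutive_failures: u32,
    queue_depth: u32,
    failure_rate: f64,
) -> Health {
    if heartbeat_age_secs > 30 || consecutive_failures >= 10 || queue_depth >= 100 || failure_rate >= 0.7 {
        Health::Unhealthy
    } else if consecutive_failures >= 3 || queue_depth >= 50 || failure_rate >= 0.3 {
        Health::Degraded
    } else {
        Health::Healthy
    }
}

fn main() {
    assert_eq!(classify(5, 0, 5, 0.02), Health::Healthy);
    assert_eq!(classify(5, 4, 10, 0.10), Health::Degraded);
    assert_eq!(classify(45, 0, 0, 0.0), Health::Unhealthy);
}
```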
### Health Metrics
Workers self-report health in capabilities:
```json
{
"runtimes": ["shell", "python"],
"health": {
"status": "healthy",
"last_check": "2026-02-09T12:00:00Z",
"consecutive_failures": 0,
"total_executions": 1000,
"failed_executions": 20,
"average_execution_time_ms": 1500,
"queue_depth": 5
}
}
```
### Worker Selection
**Selection Priority:**
1. Healthy workers (queue depth ascending)
2. Degraded workers (queue depth ascending)
3. Skip unhealthy workers
**Example:**
```
Worker A: Healthy, queue=5 ← Selected first
Worker B: Healthy, queue=20 ← Selected second
Worker C: Degraded, queue=10 ← Selected third
Worker D: Unhealthy, queue=0 ← Never selected
```
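The selection priority can be expressed as a filter plus a two-key sort. A hypothetical sketch reproducing the example above (struct and function names are illustrative):

```rust
struct WorkerInfo {
    name: &'static str,
    health: &'static str, // "healthy" | "degraded" | "unhealthy"
    queue_depth: u32,
}

/// Order candidates per the priority above: drop unhealthy workers, then
/// prefer healthy over degraded, breaking ties by ascending queue depth.
fn selection_order(mut workers: Vec<WorkerInfo>) -> Vec<&'static str> {
    workers.retain(|w| w.health != "unhealthy");
    workers.sort_by_key(|w| (if w.health == "healthy" { 0 } else { 1 }, w.queue_depth));
    workers.into_iter().map(|w| w.name).collect()
}

fn main() {
    let order = selection_order(vec![
        WorkerInfo { name: "A", health: "healthy", queue_depth: 5 },
        WorkerInfo { name: "B", health: "healthy", queue_depth: 20 },
        WorkerInfo { name: "C", health: "degraded", queue_depth: 10 },
        WorkerInfo { name: "D", health: "unhealthy", queue_depth: 0 },
    ]);
    assert_eq!(order, vec!["A", "B", "C"]); // D is never selected
}
```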
## Database Schema
### Execution Retry Fields
```sql
-- Added to execution table
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER,
retry_reason TEXT,
original_execution BIGINT REFERENCES execution(id)
```
### Action Configuration Fields
```sql
-- Added to action table
timeout_seconds INTEGER, -- Per-action timeout override
max_retries INTEGER DEFAULT 0 -- Per-action retry limit
```
### Helper Functions
```sql
-- Check if execution can be retried
SELECT is_execution_retriable(123);
-- Get worker queue depth
SELECT get_worker_queue_depth(1);
```
### Views
```sql
-- Get all healthy workers
SELECT * FROM healthy_workers;
```
## Practical Examples
### Example 1: View Retry Chain
```sql
-- Find all retries for execution 100
WITH RECURSIVE retry_chain AS (
SELECT id, retry_count, retry_reason, original_execution, status
FROM execution
WHERE id = 100
UNION ALL
SELECT e.id, e.retry_count, e.retry_reason, e.original_execution, e.status
FROM execution e
JOIN retry_chain rc ON e.original_execution = rc.id
)
SELECT * FROM retry_chain ORDER BY retry_count;
```
### Example 2: Analyze Retry Success Rate
```sql
-- Success rate of retries by reason
SELECT
config->>'retry_reason' as reason,
COUNT(*) as total_retries,
COUNT(CASE WHEN status = 'completed' THEN 1 END) as succeeded,
ROUND(100.0 * COUNT(CASE WHEN status = 'completed' THEN 1 END) / COUNT(*), 2) as success_rate
FROM execution
WHERE retry_count > 0
GROUP BY config->>'retry_reason'
ORDER BY total_retries DESC;
```
### Example 3: Find Workers by Health
```sql
-- Workers sorted by health and load
SELECT
w.name,
w.status,
(w.capabilities->'health'->>'status')::TEXT as health,
(w.capabilities->'health'->>'queue_depth')::INTEGER as queue,
(w.capabilities->'health'->>'consecutive_failures')::INTEGER as failures,
w.last_heartbeat
FROM worker w
WHERE w.status = 'active'
ORDER BY
CASE (w.capabilities->'health'->>'status')::TEXT
WHEN 'healthy' THEN 1
WHEN 'degraded' THEN 2
WHEN 'unhealthy' THEN 3
ELSE 4
END,
(w.capabilities->'health'->>'queue_depth')::INTEGER;
```
### Example 4: Manual Retry via API
```bash
# Create retry execution
curl -X POST http://localhost:8080/api/v1/executions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"action_ref": "core.echo",
"parameters": {"message": "retry test"},
"config": {
"retry_of": 123,
"retry_count": 1,
"max_retries": 3,
"retry_reason": "manual_retry",
"original_execution": 123
}
}'
```
## Monitoring
### Key Metrics
**Retry Metrics:**
- Retry rate: % of executions that retry
- Retry success rate: % of retries that succeed
- Average retries per execution
- Retry reason distribution
**Health Metrics:**
- Healthy worker count
- Degraded worker count
- Unhealthy worker count
- Average queue depth per worker
- Average failure rate per worker
### SQL Queries
```sql
-- Retry rate over last hour
SELECT
COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END) as original_executions,
COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) as retry_executions,
  ROUND(100.0 * COUNT(DISTINCT CASE WHEN retry_count > 0 THEN id END) /
    NULLIF(COUNT(DISTINCT CASE WHEN retry_count = 0 THEN id END), 0), 2) as retry_rate
FROM execution
WHERE created > NOW() - INTERVAL '1 hour';
-- Worker health distribution
SELECT
COALESCE((capabilities->'health'->>'status')::TEXT, 'unknown') as health_status,
COUNT(*) as worker_count,
AVG((capabilities->'health'->>'queue_depth')::INTEGER) as avg_queue_depth
FROM worker
WHERE status = 'active'
GROUP BY health_status;
```
## Configuration
### Retry Configuration
```rust
// In executor service initialization
let retry_manager = RetryManager::new(pool.clone(), RetryConfig {
enabled: true,
base_backoff_secs: 1,
max_backoff_secs: 300,
backoff_multiplier: 2.0,
jitter_factor: 0.2,
});
```
### Health Probe Configuration
```rust
// In executor service initialization
let health_probe = WorkerHealthProbe::new(pool.clone(), HealthProbeConfig {
enabled: true,
heartbeat_max_age_secs: 30,
degraded_threshold: 3,
unhealthy_threshold: 10,
queue_depth_degraded: 50,
queue_depth_unhealthy: 100,
failure_rate_degraded: 0.3,
failure_rate_unhealthy: 0.7,
});
```
## Troubleshooting
### High Retry Rate
**Symptoms:** Many executions retrying repeatedly
**Causes:**
- Workers unstable or frequently restarting
- Network issues causing transient failures
- Actions not idempotent (retry makes things worse)
**Resolution:**
1. Check worker stability: `docker compose ps`
2. Review action idempotency
3. Adjust `max_retries` if retries are unhelpful
4. Investigate root cause of failures
### Retries Not Triggering
**Symptoms:** Failed executions not retrying despite max_retries > 0
**Causes:**
- Action doesn't have `max_retries` set
- Failure is non-retriable (validation error, etc.)
- Global retry disabled
**Resolution:**
1. Check action configuration: `SELECT timeout_seconds, max_retries FROM action WHERE ref = 'action.name';`
2. Check failure message for retriable patterns
3. Verify retry enabled in executor config
### Workers Marked Unhealthy
**Symptoms:** Workers not receiving tasks
**Causes:**
- High queue depth (overloaded)
- Consecutive failures exceed threshold
- Heartbeat stale
**Resolution:**
1. Check worker logs: `docker compose logs -f worker-shell`
2. Verify heartbeat: `SELECT name, last_heartbeat FROM worker;`
3. Check queue depth in capabilities
4. Restart worker if stuck: `docker compose restart worker-shell`
### Retry Loops
**Symptoms:** Execution retries forever or excessive retries
**Causes:**
- Bug in retry reason detection
- Action failure always classified as retriable
- max_retries not being enforced
**Resolution:**
1. Check retry chain: See Example 1 above
2. Verify max_retries: `SELECT config FROM execution WHERE id = 123;`
3. Fix retry reason classification if incorrect
4. Manually fail execution if stuck
## Integration with Previous Phases
### Phase 1 + Phase 2 + Phase 3 Together
**Defense in Depth:**
1. **Phase 1 (Timeout Monitor):** Catches stuck SCHEDULED executions (30s-5min)
2. **Phase 2 (Queue TTL/DLQ):** Expires messages in worker queues (5min)
3. **Phase 3 (Intelligent Retry):** Retries retriable failures (1s-5min backoff)
**Failure Flow:**
```
Execution dispatched → Worker unavailable (Phase 2: 5min TTL)
→ DLQ handler marks FAILED (Phase 2)
→ Retry manager creates retry (Phase 3)
→ Retry dispatched with backoff (Phase 3)
→ Success or exhaust retries
```
**Backup Safety Net:**
If Phase 3 fails to create a retry, the Phase 1 timeout monitor will still catch stuck executions.
## Best Practices
### Action Design for Retries
1. **Make actions idempotent:** Safe to run multiple times
2. **Set realistic timeouts:** Based on typical execution time
3. **Configure appropriate max_retries:**
- Network calls: 3-5 retries
- Database operations: 2-3 retries
- External APIs: 3 retries
- Local operations: 0-1 retries
### Worker Health Management
1. **Report queue depth regularly:** Update every heartbeat
2. **Track failure metrics:** Consecutive failures, total/failed counts
3. **Implement graceful degradation:** Continue working when degraded
4. **Fail fast when unhealthy:** Stop accepting work if overloaded
### Monitoring Strategy
1. **Alert on high retry rates:** > 20% of executions retrying
2. **Alert on unhealthy workers:** > 50% workers unhealthy
3. **Track retry success rate:** Should be > 70%
4. **Monitor queue depths:** Average should stay < 20
## See Also
- **Architecture:** `docs/architecture/worker-availability-handling.md`
- **Phase 1 Guide:** `docs/QUICKREF-worker-availability-phase1.md`
- **Phase 2 Guide:** `docs/QUICKREF-worker-queue-ttl-dlq.md`
- **Migration:** `migrations/20260209000000_phase3_retry_and_health.sql`

# Quick Reference: Worker Heartbeat Monitoring
**Purpose**: Automatically detect and deactivate workers that have stopped sending heartbeats
## Overview
The executor service includes a background task that monitors worker heartbeats and automatically marks stale workers as inactive. This prevents the scheduler from attempting to assign work to workers that are no longer available.
## How It Works
### Background Monitor Task
- **Location**: `crates/executor/src/service.rs` → `worker_heartbeat_monitor_loop()`
- **Check Interval**: Every 60 seconds
- **Staleness Threshold**: 90 seconds (3x the expected 30-second heartbeat interval)
### Detection Logic
The monitor checks all workers with `status = 'active'`:
1. **No Heartbeat**: Workers with `last_heartbeat = NULL` → marked inactive
2. **Stale Heartbeat**: Workers with heartbeat older than 90 seconds → marked inactive
3. **Fresh Heartbeat**: Workers with heartbeat within 90 seconds → remain active
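The three rules above can be sketched as a single staleness check using standard-library time types (function name and the `Option` shape are illustrative, not the actual implementation):

```rust
use std::time::{Duration, SystemTime};

/// 90 seconds = 30s heartbeat interval x 3 grace multiplier.
const MAX_STALENESS: Duration = Duration::from_secs(90);

/// A missing heartbeat, or one older than the threshold, marks the
/// worker for deactivation; anything fresher stays active.
fn is_stale(last_heartbeat: Option<SystemTime>, now: SystemTime) -> bool {
    match last_heartbeat {
        None => true,
        Some(hb) => now
            .duration_since(hb)
            .map_or(false, |age| age > MAX_STALENESS),
    }
}

fn main() {
    let now = SystemTime::now();
    assert!(is_stale(None, now));
    assert!(is_stale(Some(now - Duration::from_secs(120)), now));
    assert!(!is_stale(Some(now - Duration::from_secs(30)), now));
}
```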
### Automatic Deactivation
When a stale worker is detected:
- Worker status updated to `inactive` in database
- Warning logged with worker name, ID, and heartbeat age
- Summary logged with count of deactivated workers
## Configuration
### Constants (in scheduler.rs and service.rs)
```rust
DEFAULT_HEARTBEAT_INTERVAL: 30 seconds // Expected worker heartbeat frequency
HEARTBEAT_STALENESS_MULTIPLIER: 3 // Grace period multiplier
MAX_STALENESS: 90 seconds // Calculated: 30 * 3
```
### Check Interval
Currently hardcoded to 60 seconds. Configured when spawning the monitor task:
```rust
Self::worker_heartbeat_monitor_loop(worker_pool, 60).await;
```
## Worker Lifecycle
### Normal Operation
```
Worker Starts → Registers → Sends Heartbeats (30s) → Remains Active
```
### Graceful Shutdown
```
Worker Stops → No More Heartbeats → Monitor Detects (60s) → Marked Inactive
```
### Crash/Network Failure
```
Worker Crashes → Heartbeats Stop → Monitor Detects (60s) → Marked Inactive
```
## Monitoring
### Check Active Workers
```sql
SELECT name, worker_role, status, last_heartbeat
FROM worker
WHERE status = 'active'
ORDER BY last_heartbeat DESC;
```
### Check Recent Deactivations
```sql
SELECT name, worker_role, status, last_heartbeat, updated
FROM worker
WHERE status = 'inactive'
AND updated > NOW() - INTERVAL '5 minutes'
ORDER BY updated DESC;
```
### Count Workers by Status
```sql
SELECT status, COUNT(*)
FROM worker
GROUP BY status;
```
## Logs
### Monitor Startup
```
INFO: Starting worker heartbeat monitor...
INFO: Worker heartbeat monitor started (check interval: 60s, staleness threshold: 90s)
```
### Worker Deactivation
```
WARN: Worker sensor-77cd23b50478 (ID: 27) heartbeat is stale (1289s old), marking as inactive
INFO: Deactivated 5 worker(s) with stale heartbeats
```
### Error Handling
```
ERROR: Failed to deactivate worker worker-123 (stale heartbeat): <error details>
ERROR: Failed to query active workers for heartbeat check: <error details>
```
## Scheduler Integration
The scheduler already filters out stale workers during worker selection:
```rust
// Filter by heartbeat freshness
let fresh_workers: Vec<_> = active_workers
.into_iter()
.filter(|w| Self::is_worker_heartbeat_fresh(w))
.collect();
```
**Before Heartbeat Monitor**: Scheduler filtered at selection time, but workers stayed "active" in DB
**After Heartbeat Monitor**: Workers marked inactive in DB, scheduler sees accurate state
## Troubleshooting
### Workers Constantly Becoming Inactive
**Symptoms**: Active workers being marked inactive despite running
**Causes**:
- Worker heartbeat interval > 30 seconds
- Network issues preventing heartbeat messages
- Worker service crash loop
**Solutions**:
1. Check worker logs for heartbeat send attempts
2. Verify RabbitMQ connectivity
3. Check worker configuration for heartbeat interval
### Stale Workers Not Being Deactivated
**Symptoms**: Workers with old heartbeats remain active
**Causes**:
- Executor service not running
- Monitor task crashed
**Solutions**:
1. Check executor service logs
2. Verify monitor task started: `grep "heartbeat monitor started" executor.log`
3. Restart executor service
### Too Many Inactive Workers
**Symptoms**: Database has hundreds of inactive workers
**Causes**: Historical workers from development/testing
**Solutions**:
```sql
-- Delete inactive workers older than 7 days
DELETE FROM worker
WHERE status = 'inactive'
AND updated < NOW() - INTERVAL '7 days';
```
## Best Practices
### Worker Registration
Workers should:
- Set appropriate unique name (hostname-based)
- Send heartbeat every 30 seconds
- Handle graceful shutdown (optional: mark self inactive)
### Database Maintenance
- Periodically clean up old inactive workers
- Monitor worker table growth
- Index on `status` and `last_heartbeat` for efficient queries
### Monitoring & Alerts
- Track worker deactivation rate (should be low in production)
- Alert on sudden increase in deactivations (infrastructure issue)
- Monitor active worker count vs. expected
## Related Documentation
- `docs/architecture/worker-service.md` - Worker architecture
- `docs/architecture/executor-service.md` - Executor architecture
- `docs/deployment/ops-runbook-queues.md` - Operational procedures
- `AGENTS.md` - Project rules and conventions
## Implementation Notes
### Why 90 Seconds?
- Worker sends heartbeat every 30 seconds
- 3x multiplier provides grace period for:
- Network latency
- Brief load spikes
- Temporary connectivity issues
- Balances responsiveness vs. false positives
### Why Check Every 60 Seconds?
- Allows 1.5 heartbeat intervals between checks
- Reduces database query frequency
- Adequate response time (stale workers removed within ~2 minutes)
### Thread Safety
- Monitor runs in separate tokio task
- Uses connection pool for database access
- No shared mutable state
- Safe to run multiple executor instances (each monitors independently)

# Quick Reference: Worker Queue TTL and Dead Letter Queue (Phase 2)
## Overview
Phase 2 implements message TTL on worker queues and dead letter queue processing to automatically fail executions when workers are unavailable.
**Key Concept:** If a worker doesn't process an execution within 5 minutes, the message expires and the execution is automatically marked as FAILED.
## How It Works
```
Execution → Worker Queue (TTL: 5 min) → Worker Processing ✓
↓ (if timeout)
Dead Letter Exchange
Dead Letter Queue
DLQ Handler (in Executor)
Execution marked FAILED
```
## Configuration
### Default Settings (All Environments)
```yaml
message_queue:
rabbitmq:
worker_queue_ttl_ms: 300000 # 5 minutes
dead_letter:
enabled: true
exchange: attune.dlx
ttl_ms: 86400000 # 24 hours DLQ retention
```
### Tuning TTL
**Worker Queue TTL** (`worker_queue_ttl_ms`):
- **Default:** 300000 (5 minutes)
- **Purpose:** How long to wait before declaring worker unavailable
- **Tuning:** Set to 2-5x your typical execution time
- **Too short:** Slow executions fail prematurely
- **Too long:** Delayed failure detection for unavailable workers
**DLQ Retention** (`dead_letter.ttl_ms`):
- **Default:** 86400000 (24 hours)
- **Purpose:** How long to keep expired messages for debugging
- **Tuning:** Based on your debugging/forensics needs
## Components
### 1. Worker Queue TTL
- Applied to all `worker.{id}.executions` queues
- Configured via RabbitMQ queue argument `x-message-ttl`
- Messages expire if not consumed within TTL
- Expired messages routed to dead letter exchange
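Concretely, each worker queue carries arguments along these lines, using the default values shown earlier (rendered here as the JSON form RabbitMQ's management UI displays, for illustration):

```json
{
  "x-message-ttl": 300000,
  "x-dead-letter-exchange": "attune.dlx"
}
```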
### 2. Dead Letter Exchange (DLX)
- **Name:** `attune.dlx`
- **Type:** `direct`
- Receives all expired messages from worker queues
- Routes to dead letter queue
### 3. Dead Letter Queue (DLQ)
- **Name:** `attune.dlx.queue`
- Stores expired messages for processing
- Retains messages for 24 hours (configurable)
- Processed by dead letter handler
### 4. Dead Letter Handler
- Runs in executor service
- Consumes messages from DLQ
- Updates executions to FAILED status
- Provides descriptive error messages
## Monitoring
### Key Metrics
```bash
# Check DLQ depth
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# View DLQ rate
# Watch for sustained DLQ message rate > 10/min
# Check failed executions
curl http://localhost:8080/api/v1/executions?status=failed
```
### Health Checks
**Good:**
- DLQ depth: 0-10
- DLQ rate: < 5 messages/min
- Most executions complete successfully
**Warning:**
- DLQ depth: 10-100
- DLQ rate: 5-20 messages/min
- May indicate worker instability
**Critical:**
- DLQ depth: > 100
- DLQ rate: > 20 messages/min
- Workers likely down or overloaded
## Troubleshooting
### High DLQ Rate
**Symptoms:** Many executions failing via DLQ
**Common Causes:**
1. Workers stopped or restarting
2. Workers overloaded (not consuming fast enough)
3. TTL too aggressive for your workload
4. Network connectivity issues
**Resolution:**
```bash
# 1. Check worker status
docker compose ps | grep worker
docker compose logs -f worker-shell
# 2. Verify worker heartbeats
psql -c "SELECT name, status, last_heartbeat FROM worker;"
# 3. Check worker queue depths
rabbitmqadmin list queues name messages | grep "worker\."
# 4. Consider increasing TTL if legitimate slow executions
# Edit config and restart executor:
# worker_queue_ttl_ms: 600000 # 10 minutes
```
### DLQ Not Processing
**Symptoms:** DLQ depth increasing, executions stuck
**Common Causes:**
1. Executor service not running
2. DLQ disabled in config
3. Database connection issues
**Resolution:**
```bash
# 1. Verify executor is running
docker compose ps executor
docker compose logs -f executor | grep "dead letter"
# 2. Check configuration
grep -A 3 "dead_letter:" config.docker.yaml
# 3. Restart executor if needed
docker compose restart executor
```
### Messages Not Expiring
**Symptoms:** Executions stuck in SCHEDULED, DLQ empty
**Common Causes:**
1. Worker queues not configured with TTL
2. Worker queues not configured with DLX
3. Infrastructure setup failed
**Resolution:**
```bash
# 1. Check queue properties
rabbitmqadmin show queue name=worker.1.executions
# Look for:
# - arguments.x-message-ttl: 300000
# - arguments.x-dead-letter-exchange: attune.dlx
# 2. Recreate infrastructure (safe, idempotent)
docker compose restart executor worker-shell
```
## Testing
### Manual Test: Verify TTL Expiration
```bash
# 1. Stop all workers
docker compose stop worker-shell worker-python worker-node
# 2. Create execution
curl -X POST http://localhost:8080/api/v1/executions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"action_ref": "core.echo",
"parameters": {"message": "test"}
}'
# 3. Wait for TTL expiration (5+ minutes)
sleep 330
# 4. Check execution status
curl http://localhost:8080/api/v1/executions/{id} | jq '.data.status'
# Should be "failed"
# 5. Check error message
curl http://localhost:8080/api/v1/executions/{id} | jq '.data.result'
# Should contain "Worker queue TTL expired"
# 6. Verify DLQ processed it
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```
## Relationship to Phase 1
**Phase 1 (Timeout Monitor):**
- Monitors executions in SCHEDULED state
- Fails executions after configured timeout
- Acts as backup safety net
**Phase 2 (Queue TTL + DLQ):**
- Expires messages at queue level
- More precise failure detection
- Provides better visibility (DLQ metrics)
**Together:** Provide defense-in-depth for worker unavailability
## Common Operations
### View DLQ Messages
```bash
# Get messages from DLQ (doesn't remove)
rabbitmqadmin get queue=attune.dlx.queue count=10
# View x-death header for expiration details
rabbitmqadmin get queue=attune.dlx.queue count=1 --format=long
```
### Manually Purge DLQ
```bash
# Use with caution - removes all messages
rabbitmqadmin purge queue name=attune.dlx.queue
```
### Temporarily Disable DLQ
```yaml
# config.docker.yaml
message_queue:
rabbitmq:
dead_letter:
enabled: false # Disables DLQ handler
```
**Note:** Messages will still expire but won't be processed
### Adjust TTL Without Restart
Not possible - queue TTL is set at queue creation time. To change:
```bash
# 1. Stop all services
docker compose down
# 2. Delete worker queues (forces recreation)
rabbitmqadmin delete queue name=worker.1.executions
# Repeat for all worker queues
# 3. Update config
# Edit worker_queue_ttl_ms
# 4. Restart services (queues recreated with new TTL)
docker compose up -d
```
## Key Files
### Configuration
- `config.docker.yaml` - Production settings
- `config.development.yaml` - Development settings
### Implementation
- `crates/common/src/mq/config.rs` - TTL configuration
- `crates/common/src/mq/connection.rs` - Queue setup with TTL
- `crates/executor/src/dead_letter_handler.rs` - DLQ processing
- `crates/executor/src/service.rs` - DLQ handler integration
### Documentation
- `docs/architecture/worker-queue-ttl-dlq.md` - Full architecture
- `docs/architecture/worker-availability-handling.md` - Phase 1 (backup)
## When to Use
**Enable DLQ (default):**
- Production environments
- Development with multiple workers
- Any environment requiring high reliability
**Disable DLQ:**
- Local development with single worker
- Testing scenarios where you want manual control
- Debugging worker behavior
## Next Steps (Phase 3)
- **Health probes:** Proactive worker health checking
- **Intelligent retry:** Retry transient failures
- **Per-action TTL:** Custom timeouts per action type
- **DLQ analytics:** Aggregate failure statistics
## See Also
- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html

Understanding the execution lifecycle helps with monitoring and debugging:
```
1. requested → Action execution requested
2. scheduling → Finding available worker
3. scheduled → Assigned to worker, queued [HANDOFF TO WORKER]
4. running → Currently executing
5. completed → Finished successfully
OR
abandoned → Worker lost
```
### State Ownership Model
Execution state is owned by different services at different lifecycle stages:
**Executor Ownership (Pre-Handoff):**
- `requested``scheduling``scheduled`
- Executor creates and updates execution records
- Executor selects worker and publishes `execution.scheduled`
- **Handles cancellations/failures BEFORE handoff** (before `execution.scheduled` is published)
**Handoff Point:**
- When `execution.scheduled` message is **published to worker**
- Before handoff: Executor owns and updates state
- After handoff: Worker owns and updates state
**Worker Ownership (Post-Handoff):**
- `running``completed` / `failed` / `cancelled` / `timeout` / `abandoned`
- Worker updates execution records directly
- Worker publishes status change notifications
- **Handles cancellations/failures AFTER handoff** (after receiving `execution.scheduled`)
- Worker only owns executions it has received
**Orchestration (Read-Only):**
- Executor receives status change notifications for orchestration
- Triggers workflow children, manages parent-child relationships
- Does NOT update execution state after handoff
### State Transitions
**Normal Flow:**
```
requested → scheduling → scheduled → [HANDOFF] → running → completed
└─ Executor Updates ─────────┘ └─ Worker Updates ─┘
```
**Failure Flow:**
```
requested → scheduling → scheduled → [HANDOFF] → running → failed
└─ Executor Updates ─────────┘ └─ Worker Updates ──┘
```
**Cancellation (depends on handoff):**
```
Before handoff:
requested/scheduling/scheduled → cancelled
└─ Executor Updates (worker never notified) ──┘
After handoff:
running → canceling → cancelled
└─ Worker Updates ──┘
```
**Timeout:**
```
scheduled/running → [HANDOFF] → timeout
└─ Worker Updates
```
**Abandonment:**
```
scheduled/running → [HANDOFF] → abandoned
└─ Worker Updates
```
**Key Points:**
- Only one service updates each execution stage (no race conditions)
- Handoff occurs when `execution.scheduled` is **published**, not just when status is set to `scheduled`
- If cancelled before handoff: Executor updates (worker never knows execution existed)
- If cancelled after handoff: Worker updates (worker owns execution)
- Worker is authoritative source for execution state after receiving `execution.scheduled`
- Status changes are reflected in real-time via notifications
---
## Data Fields

### 3. Execution Manager
**Purpose**: Orchestrates execution workflows and handles lifecycle events.
**Responsibilities**:
- Listens for `execution.status.*` messages from workers
- **Does NOT update execution state** (worker owns state after scheduling)
- Handles execution completion orchestration (triggering child executions)
- Manages workflow executions (parent-child relationships)
- Coordinates workflow state transitions
**Ownership Model**:
- **Executor owns**: Requested → Scheduling → Scheduled (updates DB)
- Includes pre-handoff cancellations/failures (before `execution.scheduled` is published)
- **Worker owns**: Running → Completed/Failed/Cancelled (updates DB)
- Includes post-handoff cancellations/failures (after receiving `execution.scheduled`)
- **Handoff Point**: When `execution.scheduled` message is **published** to worker
- Before publish: Executor owns and updates state
- After publish: Worker owns and updates state
**Message Flow**:
```
Worker Status Update → Execution Manager → Orchestration Logic (Read-Only)
→ Trigger Child Executions
```
**Status Lifecycle**:
```
Requested → Scheduling → Scheduled → [HANDOFF: execution.scheduled published] → Running → Completed/Failed/Cancelled
│ │
└─ Executor Updates ───┘ └─ Worker Updates
│ (includes pre-handoff │ (includes post-handoff
│ Cancelled) │ Cancelled/Timeout/Abandoned)
└→ Child Executions (workflows)
```
**Key Implementation Details**:
- Parses status strings to typed enums for type safety
- Receives status change notifications for orchestration purposes only
- Does not update execution state after handoff to worker
- Handles workflow orchestration (parent-child execution chaining)
- Only triggers child executions on successful parent completion
- Publishes completion events for notification service
- Read-only access to execution records for orchestration logic
## Message Queue Integration
@@ -123,12 +138,14 @@ The Executor consumes and produces several message types:
**Consumed**:
- `enforcement.created` - New enforcement from triggered rules
- `execution.requested` - Execution scheduling requests
- `execution.status.changed` - Status change notifications from workers (for orchestration)
- `execution.completed` - Completion notifications from workers (for queue management)
**Published**:
- `execution.requested` - To scheduler (from enforcement processor)
- `execution.scheduled` - To workers (from scheduler) **← OWNERSHIP HANDOFF**
**Note**: The executor does NOT publish `execution.completed` messages. This is the worker's responsibility as the authoritative source of execution state after scheduling.
### Message Envelope Structure
@@ -186,11 +203,34 @@ use attune_common::repositories::{
};
```
### Database Update Ownership
**Executor updates execution state** from creation through handoff:
- Creates execution records (`Requested` status)
- Updates status during scheduling (`Scheduling` → `Scheduled`)
- Publishes `execution.scheduled` message to worker **← HANDOFF POINT**
- **Handles cancellations/failures BEFORE handoff** (before message is published)
- Example: User cancels execution while queued by concurrency policy
- Executor updates to `Cancelled`, worker never receives message
**Worker updates execution state** after receiving handoff:
- Receives `execution.scheduled` message (takes ownership)
- Updates status when execution starts (`Running`)
- Updates status when execution completes (`Completed`, `Failed`, etc.)
- **Handles cancellations/failures AFTER handoff** (after receiving message)
- Updates result data and artifacts
- Worker only owns executions it has received
**Executor reads execution state** for orchestration after handoff:
- Receives status change notifications from workers
- Reads execution records to trigger workflow children
- Does NOT update execution state after publishing `execution.scheduled`
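A minimal sketch of the executor-side guard this split implies: once the handoff message is out, any further executor write attempt is rejected. The struct and method names are assumptions for illustration, not the real implementation:

```rust
// Hypothetical executor-side guard: after `execution.scheduled` is
// published, the executor may only read; all writes are refused.
struct ExecutionHandle {
    status: String,
    handoff_published: bool,
}

impl ExecutionHandle {
    fn executor_update(&mut self, new_status: &str) -> Result<(), String> {
        if self.handoff_published {
            // Worker owns state now; the executor must not write.
            return Err(format!(
                "refusing executor write to '{new_status}': worker owns state after handoff"
            ));
        }
        self.status = new_status.to_string();
        Ok(())
    }

    /// Call immediately after `execution.scheduled` is successfully published.
    fn mark_handoff(&mut self) {
        self.handoff_published = true;
    }
}
```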
### Transaction Support
Future implementations will use database transactions for multi-step operations:
- Creating execution + publishing message (atomic)
- Status update + completion handling (atomic)
- Enforcement processing + execution creation (atomic)
## Configuration


@@ -0,0 +1,557 @@
# Worker Availability Handling
**Status**: Implementation Gap Identified
**Priority**: High
**Date**: 2026-02-09
## Problem Statement
When workers are stopped or become unavailable, the executor continues attempting to schedule executions to them, resulting in:
1. **Stuck executions**: Executions remain in `SCHEDULING` or `SCHEDULED` status indefinitely
2. **Queue buildup**: Messages accumulate in worker-specific RabbitMQ queues
3. **No failure notification**: Users don't know their executions are stuck
4. **Resource waste**: System resources consumed by queued messages and database records
## Current Architecture
### Heartbeat Mechanism
Workers send heartbeat updates to the database periodically (default: 30 seconds).
```rust
// From crates/executor/src/scheduler.rs
const DEFAULT_HEARTBEAT_INTERVAL: u64 = 30;
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;
fn is_worker_heartbeat_fresh(worker: &Worker) -> bool {
// Worker is fresh if heartbeat < 90 seconds old
let max_age = Duration::from_secs(
DEFAULT_HEARTBEAT_INTERVAL * HEARTBEAT_STALENESS_MULTIPLIER
);
// ...
}
```
### Scheduling Flow
```
Execution Created (REQUESTED)
Scheduler receives message
Find compatible worker with fresh heartbeat
Update execution to SCHEDULED
Publish message to worker-specific queue
Worker consumes and executes
```
### Failure Points
1. **Worker stops after heartbeat**: Worker has fresh heartbeat but is actually down
2. **Worker crashes**: No graceful shutdown, heartbeat appears fresh temporarily
3. **Network partition**: Worker isolated but appears healthy
4. **Queue accumulation**: Messages sit in worker-specific queues indefinitely
## Current Mitigations (Insufficient)
### 1. Heartbeat Staleness Check
```rust
fn select_worker(pool: &PgPool, action: &Action) -> Result<Worker> {
// Filter by active workers
let active_workers: Vec<_> = workers
.into_iter()
.filter(|w| w.status == WorkerStatus::Active)
.collect();
// Filter by heartbeat freshness
let fresh_workers: Vec<_> = active_workers
.into_iter()
.filter(|w| is_worker_heartbeat_fresh(w))
.collect();
if fresh_workers.is_empty() {
return Err(anyhow!("No workers with fresh heartbeats"));
}
// Select first available worker
Ok(fresh_workers.into_iter().next().unwrap())
}
```
**Gap**: Workers can stop within the 90-second staleness window.
### 2. Message Requeue on Error
```rust
// From crates/common/src/mq/consumer.rs
match handler(envelope.clone()).await {
Err(e) => {
let requeue = e.is_retriable();
channel.basic_nack(delivery_tag, BasicNackOptions {
requeue,
multiple: false,
}).await?;
}
}
```
**Gap**: Only requeues on retriable errors (connection/timeout), not worker unavailability.
### 3. Message TTL Configuration
```rust
// From crates/common/src/config.rs
pub struct MessageQueueConfig {
#[serde(default = "default_message_ttl")]
pub message_ttl: u64,
}
fn default_message_ttl() -> u64 {
3600 // 1 hour
}
```
**Gap**: TTL not currently applied to worker queues, and 1 hour is too long.
## Proposed Solutions
### Solution 1: Execution Timeout Mechanism (HIGH PRIORITY)
Add a background task that monitors scheduled executions and fails them if they don't start within a timeout.
**Implementation:**
```rust
// crates/executor/src/execution_timeout_monitor.rs
pub struct ExecutionTimeoutMonitor {
pool: PgPool,
publisher: Arc<Publisher>,
check_interval: Duration,
scheduled_timeout: Duration,
}
impl ExecutionTimeoutMonitor {
pub async fn start(&self) -> Result<()> {
let mut interval = tokio::time::interval(self.check_interval);
loop {
interval.tick().await;
if let Err(e) = self.check_stale_executions().await {
error!("Error checking stale executions: {}", e);
}
}
}
async fn check_stale_executions(&self) -> Result<()> {
let cutoff = Utc::now() - chrono::Duration::from_std(self.scheduled_timeout)?;
// Find executions stuck in SCHEDULED status
let stale_executions = sqlx::query_as::<_, Execution>(
"SELECT * FROM execution
WHERE status = 'scheduled'
AND updated < $1"
)
.bind(cutoff)
.fetch_all(&self.pool)
.await?;
for execution in stale_executions {
warn!(
"Execution {} has been scheduled for too long, marking as failed",
execution.id
);
self.fail_execution(
execution.id,
"Execution timeout: worker did not pick up task within timeout"
).await?;
}
Ok(())
}
async fn fail_execution(&self, execution_id: i64, reason: &str) -> Result<()> {
// Update execution status
sqlx::query(
"UPDATE execution
SET status = 'failed',
result = $2,
updated = NOW()
WHERE id = $1"
)
.bind(execution_id)
.bind(serde_json::json!({
"error": reason,
"failed_by": "execution_timeout_monitor"
}))
.execute(&self.pool)
.await?;
// Publish completion notification
let payload = ExecutionCompletedPayload {
execution_id,
status: ExecutionStatus::Failed,
result: Some(serde_json::json!({"error": reason})),
};
self.publisher
.publish_envelope(
MessageType::ExecutionCompleted,
payload,
"attune.executions",
)
.await?;
Ok(())
}
}
```
**Configuration:**
```yaml
# config.yaml
executor:
scheduled_timeout: 300 # 5 minutes (fail if not running within 5 min)
timeout_check_interval: 60 # Check every minute
```
### Solution 2: Worker Queue TTL and DLQ (MEDIUM PRIORITY)
Apply message TTL to worker-specific queues with dead letter exchange.
**Implementation:**
```rust
// When declaring worker-specific queues
let mut queue_args = FieldTable::default();
// Set message TTL (5 minutes)
queue_args.insert(
"x-message-ttl".into(),
AMQPValue::LongInt(300_000) // 5 minutes in milliseconds
);
// Set dead letter exchange
queue_args.insert(
"x-dead-letter-exchange".into(),
AMQPValue::LongString("attune.executions.dlx".into())
);
channel.queue_declare(
&format!("attune.execution.worker.{}", worker_id),
QueueDeclareOptions {
durable: true,
..Default::default()
},
queue_args,
).await?;
```
**Dead Letter Handler:**
```rust
// crates/executor/src/dead_letter_handler.rs
pub struct DeadLetterHandler {
pool: PgPool,
consumer: Arc<Consumer>,
}
impl DeadLetterHandler {
pub async fn start(&self) -> Result<()> {
self.consumer
.consume_with_handler(|envelope: MessageEnvelope<ExecutionScheduledPayload>| {
let pool = self.pool.clone();
async move {
warn!("Received dead letter for execution {}", envelope.payload.execution_id);
// Mark execution as failed
sqlx::query(
"UPDATE execution
SET status = 'failed',
result = $2,
updated = NOW()
WHERE id = $1 AND status = 'scheduled'"
)
.bind(envelope.payload.execution_id)
.bind(serde_json::json!({
"error": "Message expired in worker queue (worker unavailable)",
"failed_by": "dead_letter_handler"
}))
.execute(&pool)
.await?;
Ok(())
}
})
.await
}
}
```
### Solution 3: Worker Health Probes (LOW PRIORITY)
Add active health checking instead of relying solely on heartbeats.
**Implementation:**
```rust
// crates/executor/src/worker_health_checker.rs
pub struct WorkerHealthChecker {
pool: PgPool,
check_interval: Duration,
}
impl WorkerHealthChecker {
pub async fn start(&self) -> Result<()> {
let mut interval = tokio::time::interval(self.check_interval);
loop {
interval.tick().await;
if let Err(e) = self.check_worker_health().await {
error!("Error checking worker health: {}", e);
}
}
}
async fn check_worker_health(&self) -> Result<()> {
let workers = WorkerRepository::find_action_workers(&self.pool).await?;
for worker in workers {
// Skip if heartbeat is very stale (worker is definitely down)
if !is_heartbeat_recent(&worker) {
continue;
}
// Attempt health check
match self.ping_worker(&worker).await {
Ok(true) => {
// Worker is healthy, ensure status is Active
if worker.status != Some(WorkerStatus::Active) {
self.update_worker_status(worker.id, WorkerStatus::Active).await?;
}
}
Ok(false) | Err(_) => {
// Worker is unhealthy, mark as inactive
warn!("Worker {} failed health check", worker.name);
self.update_worker_status(worker.id, WorkerStatus::Inactive).await?;
}
}
}
Ok(())
}
async fn ping_worker(&self, worker: &Worker) -> Result<bool> {
// TODO: Implement health endpoint on worker
// For now, check if worker's queue is being consumed
Ok(true)
}
}
```
### Solution 4: Graceful Worker Shutdown (MEDIUM PRIORITY)
Ensure workers mark themselves as inactive before shutdown.
**Implementation:**
```rust
// In worker service shutdown handler
impl WorkerService {
pub async fn shutdown(&self) -> Result<()> {
info!("Worker shutting down gracefully...");
// Mark worker as inactive
sqlx::query(
"UPDATE worker SET status = 'inactive', updated = NOW() WHERE id = $1"
)
.bind(self.worker_id)
.execute(&self.pool)
.await?;
// Stop accepting new tasks
self.stop_consuming().await?;
// Wait for in-flight tasks to complete (with timeout)
let timeout = Duration::from_secs(30);
tokio::time::timeout(timeout, self.wait_for_completion()).await?;
info!("Worker shutdown complete");
Ok(())
}
}
```
**Docker Signal Handling:**
```yaml
# docker-compose.yaml
services:
worker-shell:
stop_grace_period: 45s # Give worker time to finish tasks
```
## Implementation Priority
### Phase 1: Immediate (Week 1)
1. **Execution Timeout Monitor** - Prevents stuck executions
2. **Graceful Shutdown** - Marks workers inactive on stop
### Phase 2: Short-term (Week 2)
3. **Worker Queue TTL + DLQ** - Prevents message buildup
4. **Dead Letter Handler** - Fails expired executions
### Phase 3: Long-term (Month 1)
5. **Worker Health Probes** - Active availability verification
6. **Retry Logic** - Reschedule to different worker on failure
## Configuration
### Recommended Timeouts
```yaml
executor:
# How long an execution can stay SCHEDULED before failing
scheduled_timeout: 300 # 5 minutes
# How often to check for stale executions
timeout_check_interval: 60 # 1 minute
# Message TTL in worker queues
worker_queue_ttl: 300 # 5 minutes (match scheduled_timeout)
# Worker health check interval
health_check_interval: 30 # 30 seconds
worker:
# How often to send heartbeats
heartbeat_interval: 10 # 10 seconds (more frequent)
# Grace period for shutdown
shutdown_timeout: 30 # 30 seconds
```
### Staleness Calculation
```
Heartbeat Staleness Threshold = heartbeat_interval * 3
= 10 * 3 = 30 seconds
This means:
- Worker sends heartbeat every 10s
- If heartbeat is > 30s old, worker is considered stale
- Reduces window where stopped worker appears healthy from 90s to 30s
```
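The calculation above is just the heartbeat interval times the staleness multiplier; a one-function sketch (mirroring the `HEARTBEAT_STALENESS_MULTIPLIER` constant from the scheduler excerpt earlier) makes the two configurations comparable:

```rust
use std::time::Duration;

// Mirrors HEARTBEAT_STALENESS_MULTIPLIER from the scheduler excerpt above.
const HEARTBEAT_STALENESS_MULTIPLIER: u64 = 3;

/// Staleness threshold derived from the worker's heartbeat interval.
fn staleness_threshold(heartbeat_interval_secs: u64) -> Duration {
    Duration::from_secs(heartbeat_interval_secs * HEARTBEAT_STALENESS_MULTIPLIER)
}
```

With the current 30-second default this yields a 90-second window; the proposed 10-second interval shrinks it to 30 seconds.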
## Monitoring and Observability
### Metrics to Track
1. **Execution timeout rate**: Number of executions failed due to timeout
2. **Worker downtime**: Time between last heartbeat and status change
3. **Dead letter queue depth**: Number of expired messages
4. **Average scheduling latency**: Time from REQUESTED to RUNNING
### Alerts
```yaml
alerts:
- name: high_execution_timeout_rate
condition: execution_timeouts > 10 per minute
severity: warning
- name: no_active_workers
condition: active_workers == 0
severity: critical
- name: dlq_buildup
condition: dlq_depth > 100
severity: warning
- name: stale_executions
condition: scheduled_executions_older_than_5min > 0
severity: warning
```
## Testing
### Test Scenarios
1. **Worker stops mid-execution**: Should timeout and fail
2. **Worker never picks up task**: Should timeout after 5 minutes
3. **All workers down**: Should immediately fail with "no workers available"
4. **Worker stops gracefully**: Should mark inactive and not receive new tasks
5. **Message expires in queue**: Should be moved to DLQ and execution failed
### Integration Test Example
```rust
#[tokio::test]
async fn test_execution_timeout_on_worker_down() {
let pool = setup_test_db().await;
let mq = setup_test_mq().await;
// Create worker and execution
let worker = create_test_worker(&pool).await;
let execution = create_test_execution(&pool).await;
// Schedule execution to worker
schedule_execution(&pool, &mq, execution.id, worker.id).await;
// Stop worker (simulate crash - no graceful shutdown)
stop_worker(worker.id).await;
// Wait for timeout
tokio::time::sleep(Duration::from_secs(310)).await;
// Verify execution is marked as failed
let execution = get_execution(&pool, execution.id).await;
assert_eq!(execution.status, ExecutionStatus::Failed);
assert!(execution.result.unwrap()["error"]
.as_str()
.unwrap()
.contains("timeout"));
}
```
## Migration Path
### Step 1: Add Monitoring (No Breaking Changes)
- Deploy execution timeout monitor
- Monitor logs for timeout events
- Tune timeout values based on actual workload
### Step 2: Add DLQ (Requires Queue Reconfiguration)
- Create dead letter exchange
- Update queue declarations with TTL and DLX
- Deploy dead letter handler
- Monitor DLQ depth
### Step 3: Graceful Shutdown (Worker Update)
- Add shutdown handler to worker
- Update Docker Compose stop_grace_period
- Test worker restarts
### Step 4: Health Probes (Future Enhancement)
- Add health endpoint to worker
- Deploy health checker service
- Transition from heartbeat-only to active probing
## Related Documentation
- [Queue Architecture](./queue-architecture.md)
- [Worker Service](./worker-service.md)
- [Executor Service](./executor-service.md)
- [RabbitMQ Queues Quick Reference](../docs/QUICKREF-rabbitmq-queues.md)


@@ -0,0 +1,493 @@
# Worker Queue TTL and Dead Letter Queue (Phase 2)
## Overview
Phase 2 of worker availability handling implements message TTL (time-to-live) on worker-specific queues and dead letter queue (DLQ) processing. This ensures that executions sent to unavailable workers are automatically failed instead of remaining stuck indefinitely.
## Architecture
### Message Flow
```
┌─────────────┐
│ Executor │
│ Scheduler │
└──────┬──────┘
│ Publishes ExecutionRequested
│ routing_key: execution.dispatch.worker.{id}
┌──────────────────────────────────┐
│ worker.{id}.executions queue │
│ │
│ Properties: │
│ - x-message-ttl: 300000ms (5m) │
│ - x-dead-letter-exchange: dlx │
└──────┬───────────────────┬───────┘
│ │
│ Worker consumes │ TTL expires
│ (normal flow) │ (worker unavailable)
│ │
▼ ▼
┌──────────────┐ ┌──────────────────┐
│ Worker │ │ attune.dlx │
│ Service │ │ (Dead Letter │
│ │ │ Exchange) │
└──────────────┘ └────────┬─────────┘
│ Routes to DLQ
┌──────────────────────┐
│ attune.dlx.queue │
│ (Dead Letter Queue) │
└────────┬─────────────┘
│ Consumes
┌──────────────────────┐
│ Dead Letter Handler │
│ (in Executor) │
│ │
│ - Identifies exec │
│ - Marks as FAILED │
│ - Logs failure │
└──────────────────────┘
```
### Components
#### 1. Worker Queue TTL
**Configuration:**
- Default: 5 minutes (300,000 milliseconds)
- Configurable via `rabbitmq.worker_queue_ttl_ms`
**Implementation:**
- Applied during queue declaration in `Connection::setup_worker_infrastructure()`
- Uses RabbitMQ's `x-message-ttl` queue argument
- Only applies to worker-specific queues (`worker.{id}.executions`)
**Behavior:**
- When a message remains in the queue longer than TTL
- RabbitMQ automatically moves it to the configured dead letter exchange
- Original message properties and headers are preserved
- Includes `x-death` header with expiration details
#### 2. Dead Letter Exchange (DLX)
**Configuration:**
- Exchange name: `attune.dlx`
- Type: `direct`
- Durable: `true`
**Setup:**
- Created in `Connection::setup_common_infrastructure()`
- Bound to dead letter queue with routing key `#` (all messages)
- Shared across all services
#### 3. Dead Letter Queue
**Configuration:**
- Queue name: `attune.dlx.queue`
- Durable: `true`
- TTL: 24 hours (configurable via `rabbitmq.dead_letter.ttl_ms`)
**Properties:**
- Retains messages for debugging and analysis
- Messages auto-expire after retention period
- No DLX on the DLQ itself (prevents infinite loops)
#### 4. Dead Letter Handler
**Location:** `crates/executor/src/dead_letter_handler.rs`
**Responsibilities:**
1. Consume messages from `attune.dlx.queue`
2. Deserialize message envelope
3. Extract execution ID from payload
4. Verify execution is in non-terminal state
5. Update execution to FAILED status
6. Add descriptive error information
7. Acknowledge message (remove from DLQ)
**Error Handling:**
- Invalid messages: Acknowledged and discarded
- Missing executions: Acknowledged (already processed)
- Terminal state executions: Acknowledged (no action needed)
- Database errors: Nacked with requeue (retry later)
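The ack/nack rules above reduce to one decision: only transient database errors are retried, everything else is final. A sketch of that decision (enum and function names are hypothetical):

```rust
// Sketch of the DLQ handler's ack/nack decision. Only transient database
// errors trigger a requeue; every other outcome is final.
#[derive(Debug, PartialEq)]
enum DlqAction {
    Ack,         // remove the message from the DLQ
    NackRequeue, // leave it for a later retry
}

enum DlqOutcome {
    MarkedFailed,    // execution updated to FAILED
    InvalidMessage,  // could not deserialize the envelope
    ExecutionMissing,
    AlreadyTerminal,
    DatabaseError,
}

fn dlq_action(outcome: DlqOutcome) -> DlqAction {
    match outcome {
        DlqOutcome::DatabaseError => DlqAction::NackRequeue,
        _ => DlqAction::Ack,
    }
}
```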
## Configuration
### RabbitMQ Configuration Structure
```yaml
message_queue:
rabbitmq:
# Worker queue TTL - how long messages wait before DLX
worker_queue_ttl_ms: 300000 # 5 minutes (default)
# Dead letter configuration
dead_letter:
enabled: true # Enable DLQ system
exchange: attune.dlx # DLX name
ttl_ms: 86400000 # DLQ retention (24 hours)
```
### Environment-Specific Settings
#### Development (`config.development.yaml`)
```yaml
message_queue:
rabbitmq:
worker_queue_ttl_ms: 300000 # 5 minutes
dead_letter:
enabled: true
exchange: attune.dlx
ttl_ms: 86400000 # 24 hours
```
#### Production (`config.docker.yaml`)
```yaml
message_queue:
rabbitmq:
worker_queue_ttl_ms: 300000 # 5 minutes
dead_letter:
enabled: true
exchange: attune.dlx
ttl_ms: 86400000 # 24 hours
```
### Tuning Guidelines
**Worker Queue TTL (`worker_queue_ttl_ms`):**
- **Too short:** Legitimate slow workers may have executions failed prematurely
- **Too long:** Unavailable workers cause delayed failure detection
- **Recommendation:** 2-5x typical execution time, minimum 2 minutes
- **Default (5 min):** Good balance for most workloads
**DLQ Retention (`dead_letter.ttl_ms`):**
- Purpose: Debugging and forensics
- **Too short:** May lose data before analysis
- **Too long:** Accumulates stale data
- **Recommendation:** 24-48 hours in production
- **Default (24 hours):** Adequate for most troubleshooting
## Code Structure
### Queue Declaration with TTL
```rust
// crates/common/src/mq/connection.rs
pub async fn declare_queue_with_dlx_and_ttl(
&self,
config: &QueueConfig,
dlx_exchange: &str,
ttl_ms: Option<u64>,
) -> MqResult<()> {
let mut args = FieldTable::default();
// Configure DLX
args.insert(
"x-dead-letter-exchange".into(),
AMQPValue::LongString(dlx_exchange.into()),
);
// Configure TTL if specified
if let Some(ttl) = ttl_ms {
args.insert(
"x-message-ttl".into(),
AMQPValue::LongInt(ttl as i64),
);
}
// Declare queue with arguments
channel.queue_declare(&config.name, options, args).await?;
Ok(())
}
```
### Dead Letter Handler
```rust
// crates/executor/src/dead_letter_handler.rs
pub struct DeadLetterHandler {
pool: Arc<PgPool>,
consumer: Consumer,
running: Arc<Mutex<bool>>,
}
impl DeadLetterHandler {
pub async fn start(&self) -> Result<(), Error> {
self.consumer.consume_with_handler(|envelope| {
match envelope.message_type {
MessageType::ExecutionRequested => {
handle_execution_requested(&pool, &envelope).await
}
_ => {
// Unexpected message type - acknowledge and discard
Ok(())
}
}
}).await
}
}
async fn handle_execution_requested(
pool: &PgPool,
envelope: &MessageEnvelope<Value>,
) -> MqResult<()> {
// Extract execution ID
let execution_id = envelope.payload.get("execution_id")
.and_then(|v| v.as_i64())
.ok_or_else(|| /* error */)?;
// Fetch current state
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
// Only fail if in non-terminal state
if !execution.status.is_terminal() {
ExecutionRepository::update(pool, execution_id, UpdateExecutionInput {
status: Some(ExecutionStatus::Failed),
result: Some(json!({
"error": "Worker queue TTL expired",
"message": "Worker did not process execution within configured TTL",
})),
ended: Some(Some(Utc::now())),
..Default::default()
}).await?;
}
Ok(())
}
```
## Integration with Executor Service
The dead letter handler is started automatically by the executor service if DLQ is enabled:
```rust
// crates/executor/src/service.rs
pub async fn start(&self) -> Result<()> {
// ... other components ...
// Start dead letter handler (if enabled)
if self.inner.mq_config.rabbitmq.dead_letter.enabled {
let dlq_name = format!("{}.queue",
self.inner.mq_config.rabbitmq.dead_letter.exchange);
let dlq_consumer = Consumer::new(
&self.inner.mq_connection,
create_dlq_consumer_config(&dlq_name, "executor.dlq"),
).await?;
let dlq_handler = Arc::new(
DeadLetterHandler::new(self.inner.pool.clone(), dlq_consumer).await?
);
handles.push(tokio::spawn(async move {
dlq_handler.start().await
}));
}
// ... wait for completion ...
}
```
## Operational Considerations
### Monitoring
**Key Metrics:**
- DLQ message rate (messages/sec entering DLQ)
- DLQ queue depth (current messages in DLQ)
- DLQ processing latency (time from DLX to handler)
- Failed execution count (executions failed via DLQ)
**Alerting Thresholds:**
- DLQ rate > 10/min: Workers may be unhealthy or TTL too aggressive
- DLQ depth > 100: Handler may be falling behind
- High failure rate: Systematic worker availability issues
### RabbitMQ Management
**View DLQ:**
```bash
# List messages in DLQ
rabbitmqadmin list queues name messages
# Get DLQ details
rabbitmqadmin show queue name=attune.dlx.queue
# Purge DLQ (use with caution)
rabbitmqadmin purge queue name=attune.dlx.queue
```
**View Dead Letters:**
```bash
# Get message from DLQ
rabbitmqadmin get queue=attune.dlx.queue count=1
# Check message death history
# Look for x-death header in message properties
```
### Troubleshooting
#### High DLQ Rate
**Symptoms:** Many executions failing via DLQ
**Causes:**
1. Workers down or restarting frequently
2. Worker queue TTL too aggressive
3. Worker overloaded (not consuming fast enough)
4. Network issues between executor and workers
**Resolution:**
1. Check worker health and logs
2. Verify worker heartbeats in database
3. Consider increasing `worker_queue_ttl_ms`
4. Scale worker fleet if overloaded
#### DLQ Handler Not Processing
**Symptoms:** DLQ depth increasing, executions stuck
**Causes:**
1. Executor service not running
2. DLQ disabled in configuration
3. Database connection issues
4. Handler crashed or deadlocked
**Resolution:**
1. Check executor service logs
2. Verify `dead_letter.enabled = true`
3. Check database connectivity
4. Restart executor service if needed
#### Messages Not Reaching DLQ
**Symptoms:** Executions stuck, DLQ empty
**Causes:**
1. Worker queues not configured with DLX
2. DLX exchange not created
3. DLQ not bound to DLX
4. TTL not configured on worker queues
**Resolution:**
1. Restart services to recreate infrastructure
2. Verify RabbitMQ configuration
3. Check queue properties in RabbitMQ management UI
## Testing
### Unit Tests
```rust
#[tokio::test]
async fn test_expired_execution_handling() {
let pool = setup_test_db().await;
// Create execution in SCHEDULED state
let execution = create_test_execution(&pool, ExecutionStatus::Scheduled).await;
// Simulate DLQ message
let envelope = MessageEnvelope::new(
MessageType::ExecutionRequested,
json!({ "execution_id": execution.id }),
);
// Process message
handle_execution_requested(&pool, &envelope).await.unwrap();
// Verify execution failed
let updated = ExecutionRepository::find_by_id(&pool, execution.id).await.unwrap();
assert_eq!(updated.status, ExecutionStatus::Failed);
assert!(updated.result.unwrap()["error"].as_str().unwrap().contains("TTL expired"));
}
```
### Integration Tests
```bash
# 1. Start all services
docker compose up -d
# 2. Create execution targeting stopped worker
curl -X POST http://localhost:8080/api/v1/executions \
-H "Content-Type: application/json" \
-d '{
"action_ref": "core.echo",
"parameters": {"message": "test"},
"worker_id": 999 # Non-existent worker
}'
# 3. Wait for TTL expiration (5+ minutes)
sleep 330
# 4. Verify execution failed
curl http://localhost:8080/api/v1/executions/{id}
# Should show status: "failed", error: "Worker queue TTL expired"
# 5. Check DLQ processed the message
rabbitmqadmin list queues name messages | grep attune.dlx.queue
# Should show 0 messages (processed and removed)
```
## Relationship to Other Phases
### Phase 1 (Completed)
- Execution timeout monitor: Handles executions stuck in SCHEDULED
- Graceful shutdown: Prevents new tasks from being sent to stopping workers
- Reduced heartbeat: Faster stale worker detection
**Interaction:** Phase 1 timeout monitor acts as a backstop if DLQ processing fails
### Phase 2 (Current)
- Worker queue TTL: Automatic message expiration
- Dead letter queue: Capture expired messages
- Dead letter handler: Process and fail expired executions
**Benefit:** More precise failure detection at the message queue level
### Phase 3 (Planned)
- Health probes: Proactive worker health checking
- Intelligent retry: Retry transient failures
- Load balancing: Distribute work across healthy workers
**Integration:** Phase 3 will use Phase 2 DLQ data to inform routing decisions
## Benefits
1. **Automatic Failure Detection:** No manual intervention needed for unavailable workers
2. **Precise Timing:** TTL provides exact failure window (vs polling-based Phase 1)
3. **Resource Efficiency:** Prevents message accumulation in worker queues
4. **Debugging Support:** DLQ retains messages for forensic analysis
5. **Graceful Degradation:** System continues functioning even with worker failures
## Limitations
1. **TTL Precision:** RabbitMQ TTL is approximate, not guaranteed to the millisecond
2. **Race Conditions:** Worker may start processing just as TTL expires (rare)
3. **DLQ Capacity:** Very high failure rates may overwhelm DLQ
4. **No Retry Logic:** Phase 2 always fails; Phase 3 will add intelligent retry
## Future Enhancements (Phase 3)
- **Conditional Retry:** Retry messages based on failure reason
- **Priority DLQ:** Prioritize critical execution failures
- **DLQ Analytics:** Aggregate statistics on failure patterns
- **Auto-scaling:** Scale workers based on DLQ rate
- **Custom TTL:** Per-action or per-execution TTL configuration
## References
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- RabbitMQ TTL: https://www.rabbitmq.com/ttl.html
- Phase 1 Documentation: `docs/architecture/worker-availability-handling.md`
- Queue Architecture: `docs/architecture/queue-architecture.md`


@@ -131,28 +131,38 @@ echo "Hello, $PARAM_NAME!"
### 4. Action Executor
**Purpose**: Orchestrate the complete execution flow for an action and own execution state after handoff.
**Execution Flow**:
```
1. Receive execution.scheduled message from executor
2. Load execution record from database
3. Update status to Running (owns state after handoff)
4. Load action definition by reference
5. Prepare execution context (parameters, env vars, timeout)
6. Select and execute in appropriate runtime
7. Capture results (stdout, stderr, return value)
8. Store artifacts (logs, results)
9. Update execution status (Completed/Failed) in database
10. Publish status change notifications
11. Publish completion notification for queue management
```
**Ownership Model**:
- **Worker owns execution state** after receiving `execution.scheduled`
- **Authoritative source** for all status updates: Running, Completed, Failed, Cancelled, etc.
- **Updates database directly** for all state changes
- **Publishes notifications** for orchestration and monitoring
**Responsibilities**:
- Coordinate execution lifecycle
- Load action and execution data from database
- **Update execution state in database** (after handoff from executor)
- Prepare execution context with parameters and environment
- Execute action via runtime registry
- Handle success and failure cases
- Store execution artifacts
- Publish status change notifications
**Key Implementation Details**:
- Parameters merged: action defaults + execution overrides
@@ -246,7 +256,10 @@ See `docs/secrets-management.md` for comprehensive documentation.
- Register worker in database
- Start heartbeat manager
- Consume execution messages from worker-specific queue
- Publish execution status updates
- **Own execution state** after receiving scheduled executions
- **Update execution status in database** (Running, Completed, Failed, etc.)
- Publish execution status change notifications
- Publish execution completion notifications
- Handle graceful shutdown
**Message Flow**:
@@ -407,8 +420,9 @@ pub struct ExecutionResult {
### Error Propagation
- Runtime errors captured in `ExecutionResult.error`
- Execution status updated to Failed in database
- Error published in status update message
- **Worker updates** execution status to Failed in database (owns state)
- Error published in status change notification message
- Error published in completion notification message
- Artifacts still stored for failed executions
- Logs preserved for debugging
@@ -0,0 +1,227 @@
# History Page URL Query Parameter Examples
This document provides practical examples of using URL query parameters to deep-link to filtered views in the Attune web UI history pages.
## Executions Page Examples
### Basic Filtering
**Filter by action:**
```
http://localhost:3000/executions?action_ref=core.echo
```
Shows all executions of the `core.echo` action.
**Filter by rule:**
```
http://localhost:3000/executions?rule_ref=core.on_timer
```
Shows all executions triggered by the `core.on_timer` rule.
**Filter by status:**
```
http://localhost:3000/executions?status=failed
```
Shows all failed executions.
**Filter by pack:**
```
http://localhost:3000/executions?pack_name=core
```
Shows all executions from the `core` pack.
### Combined Filters
**Rule + Status:**
```
http://localhost:3000/executions?rule_ref=core.on_timer&status=completed
```
Shows completed executions from a specific rule.
**Action + Pack:**
```
http://localhost:3000/executions?action_ref=core.echo&pack_name=core
```
Shows executions of a specific action in a pack (useful when multiple packs have similarly named actions).
**Multiple Filters:**
```
http://localhost:3000/executions?pack_name=core&status=running&trigger_ref=core.webhook
```
Shows currently running executions from the core pack triggered by webhooks.
### Troubleshooting Scenarios
**Find all failed executions for an action:**
```
http://localhost:3000/executions?action_ref=mypack.problematic_action&status=failed
```
**Check running executions for a specific executor:**
```
http://localhost:3000/executions?executor=1&status=running
```
**View all webhook-triggered executions:**
```
http://localhost:3000/executions?trigger_ref=core.webhook
```
## Events Page Examples
### Basic Filtering
**Filter by trigger:**
```
http://localhost:3000/events?trigger_ref=core.webhook
```
Shows all webhook events.
**Timer events:**
```
http://localhost:3000/events?trigger_ref=core.timer
```
Shows all timer-based events.
**Custom trigger:**
```
http://localhost:3000/events?trigger_ref=mypack.custom_trigger
```
Shows events from a custom trigger.
## Enforcements Page Examples
### Basic Filtering
**Filter by rule:**
```
http://localhost:3000/enforcements?rule_ref=core.on_timer
```
Shows all enforcements (rule activations) for a specific rule.
**Filter by trigger:**
```
http://localhost:3000/enforcements?trigger_ref=core.webhook
```
Shows all enforcements triggered by webhook events.
**Filter by event:**
```
http://localhost:3000/enforcements?event=123
```
Shows the enforcement created by a specific event (useful for tracing event → enforcement → execution flow).
**Filter by status:**
```
http://localhost:3000/enforcements?status=processed
```
Shows processed enforcements.
### Combined Filters
**Rule + Status:**
```
http://localhost:3000/enforcements?rule_ref=core.on_timer&status=processed
```
Shows successfully processed enforcements for a specific rule.
**Trigger + Event:**
```
http://localhost:3000/enforcements?trigger_ref=core.webhook&event=456
```
Shows enforcements from a specific webhook event.
## Practical Use Cases
### Debugging a Rule
1. **Check the event was created:**
```
http://localhost:3000/events?trigger_ref=core.timer
```
2. **Check the enforcement was created:**
```
http://localhost:3000/enforcements?rule_ref=core.on_timer
```
3. **Check the execution was triggered:**
```
http://localhost:3000/executions?rule_ref=core.on_timer
```
### Monitoring Action Performance
**See all executions of an action:**
```
http://localhost:3000/executions?action_ref=core.http_request
```
**See failures:**
```
http://localhost:3000/executions?action_ref=core.http_request&status=failed
```
**See currently running:**
```
http://localhost:3000/executions?action_ref=core.http_request&status=running
```
### Auditing Webhook Activity
1. **View all webhook events:**
```
http://localhost:3000/events?trigger_ref=core.webhook
```
2. **View enforcements from webhooks:**
```
http://localhost:3000/enforcements?trigger_ref=core.webhook
```
3. **View executions triggered by webhooks:**
```
http://localhost:3000/executions?trigger_ref=core.webhook
```
### Sharing Views with Team Members
**Share failed executions for investigation:**
```
http://localhost:3000/executions?action_ref=mypack.critical_action&status=failed
```
**Share rule activity for review:**
```
http://localhost:3000/enforcements?rule_ref=mypack.important_rule&status=processed
```
## Tips and Notes
1. **URL Encoding**: If your pack, action, rule, or trigger names contain special characters, they will be automatically URL-encoded by the browser.
2. **Case Sensitivity**: Parameter names and values are case-sensitive. Use lowercase for status values (e.g., `status=failed`, not `status=Failed`).
3. **Invalid Values**: Invalid parameter values are silently ignored, and the filter will default to empty (showing all results).
4. **Bookmarking**: Save frequently used URLs as browser bookmarks for quick access to common filtered views.
5. **Browser History**: The URL doesn't change as you modify filters in the UI, so the browser's back button won't undo filter changes within a page.
6. **Multiple Status Filters**: While the UI allows selecting multiple statuses, only one status can be specified via URL parameter. Use the UI to select multiple statuses after the page loads.
## Parameter Reference Quick Table
| Page | Parameter | Example Value |
|------|-----------|---------------|
| Executions | `action_ref` | `core.echo` |
| Executions | `rule_ref` | `core.on_timer` |
| Executions | `trigger_ref` | `core.webhook` |
| Executions | `pack_name` | `core` |
| Executions | `executor` | `1` |
| Executions | `status` | `failed`, `running`, `completed` |
| Events | `trigger_ref` | `core.webhook` |
| Enforcements | `rule_ref` | `core.on_timer` |
| Enforcements | `trigger_ref` | `core.webhook` |
| Enforcements | `event` | `123` |
| Enforcements | `status` | `processed`, `created`, `disabled` |
@@ -0,0 +1,365 @@
# DOTENV Parameter Format
## Overview
The DOTENV parameter format is used to pass action parameters securely via stdin in a shell-compatible format. This format is particularly useful for shell scripts that need to parse parameters without relying on external tools like `jq`.
## Format Specification
### Basic Format
Parameters are formatted as `key='value'` pairs, one per line:
```bash
url='https://example.com'
method='GET'
timeout='30'
verify_ssl='true'
```
### Nested Object Flattening
Nested JSON objects are automatically flattened using dot notation. This allows shell scripts to easily parse complex parameter structures.
**Input JSON:**
```json
{
"url": "https://example.com",
"headers": {
"Content-Type": "application/json",
"Authorization": "Bearer token123"
},
"query_params": {
"page": "1",
"size": "10"
}
}
```
**Output DOTENV:**
```bash
headers.Authorization='Bearer token123'
headers.Content-Type='application/json'
query_params.page='1'
query_params.size='10'
url='https://example.com'
```
### Empty Objects
Empty objects (`{}`) are omitted from the output entirely. They do not produce any dotenv entries.
**Input:**
```json
{
"url": "https://example.com",
"headers": {},
"query_params": {}
}
```
**Output:**
```bash
url='https://example.com'
```
### Arrays
Arrays are serialized as JSON strings:
**Input:**
```json
{
"tags": ["web", "api", "production"]
}
```
**Output:**
```bash
tags='["web","api","production"]'
```
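Because arrays arrive as JSON strings, a script that only needs simple string elements can split them without `jq`. A minimal sketch using the `tags` value above — this naive split is only safe when elements are plain strings with no commas, embedded quotes, or escapes:

```sh
#!/bin/sh
# Array parameter as delivered in dotenv form (example value from above)
tags='["web","api","production"]'

# Strip the brackets, split on commas, then strip the JSON quotes
list=${tags#\[}
list=${list%\]}
result=""
old_ifs=$IFS
IFS=','
for item in $list; do
    item=${item#\"}
    item=${item%\"}
    result="${result:+$result }$item"
done
IFS=$old_ifs
echo "$result"
```

For arrays that may contain commas or escaped characters, parse the JSON in the action's own language runtime instead.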
### Special Characters
Single quotes in values are escaped using the shell-safe `'\''` pattern:
**Input:**
```json
{
"message": "It's working!"
}
```
**Output:**
```bash
message='It'\''s working!'
```
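The same pattern can be reproduced and verified in a script. A small round-trip sketch (the `message` value and the `sed` escaping are illustrative — the worker performs this escaping for you):

```sh
#!/bin/sh
raw="It's working!"
# Escape embedded single quotes with the '\'' pattern
# (mirrors the escaping the worker applies to dotenv values)
escaped=$(printf '%s' "$raw" | sed "s/'/'\\\\''/g")
line="message='$escaped'"
echo "$line"
# Round trip: evaluating the line in a shell recovers the original value
eval "$line"
[ "$message" = "$raw" ] && echo "round-trip ok"
```

This is why the quote-removal snippets later in this document only strip the outermost quotes: the `'\''` sequences in the middle of a value are already shell-correct.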
## Shell Script Parsing
### Basic Parameter Parsing
```bash
#!/bin/sh
# Read DOTENV-formatted parameters from stdin
while IFS= read -r line; do
case "$line" in
*"---ATTUNE_PARAMS_END---"*) break ;;
esac
[ -z "$line" ] && continue
key="${line%%=*}"
value="${line#*=}"
# Remove quotes
case "$value" in
\"*\") value="${value#\"}"; value="${value%\"}" ;;
\'*\') value="${value#\'}"; value="${value%\'}" ;;
esac
# Process parameters
case "$key" in
url) url="$value" ;;
method) method="$value" ;;
timeout) timeout="$value" ;;
esac
done
```
### Parsing Nested Objects
For flattened nested objects, use pattern matching on the key prefix:
```bash
# Create temporary files for nested data
headers_file=$(mktemp)
query_params_file=$(mktemp)
while IFS= read -r line; do
case "$line" in
*"---ATTUNE_PARAMS_END---"*) break ;;
esac
[ -z "$line" ] && continue
key="${line%%=*}"
value="${line#*=}"
# Remove quotes
case "$value" in
\'*\') value="${value#\'}"; value="${value%\'}" ;;
esac
# Process parameters
case "$key" in
url) url="$value" ;;
method) method="$value" ;;
headers.*)
# Extract nested key (e.g., "Content-Type" from "headers.Content-Type")
nested_key="${key#headers.}"
printf '%s: %s\n' "$nested_key" "$value" >> "$headers_file"
;;
query_params.*)
nested_key="${key#query_params.}"
printf '%s=%s\n' "$nested_key" "$value" >> "$query_params_file"
;;
esac
done
# Use the parsed data
if [ -s "$headers_file" ]; then
while IFS= read -r header; do
curl_args="$curl_args -H '$header'"
done < "$headers_file"
fi
```
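Accumulating arguments in a string, as in `curl_args` above, requires a later `eval` and breaks if a header value contains quotes. A sketch of an alternative using `set --` to build the argument list positionally (the header values here are hypothetical):

```sh
#!/bin/sh
# Hypothetical headers file, one "Name: value" line per header
headers_file=$(mktemp)
printf '%s\n' 'Content-Type: application/json' 'User-Agent: Attune/1.0' > "$headers_file"

# Build the argument list positionally so header values with spaces
# or quotes survive intact, with no eval needed
set --
while IFS= read -r header; do
    set -- "$@" -H "$header"
done < "$headers_file"
rm -f "$headers_file"

echo "building curl with $# arguments"
```

The request can then be issued with `curl "$@" "$url"`, and each header is passed as its own argument regardless of its contents.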
## Configuration
### Action YAML Configuration
Specify DOTENV format in your action YAML:
```yaml
ref: mypack.myaction
entry_point: myaction.sh
parameter_delivery: stdin
parameter_format: dotenv # Use dotenv format
output_format: json
```
### Supported Formats
- `dotenv` - Shell-friendly key='value' format with nested object flattening
- `json` - Standard JSON format
- `yaml` - YAML format
### Supported Delivery Methods
- `stdin` - Parameters passed via stdin (recommended for security)
- `file` - Parameters written to a temporary file
## Security Considerations
### Why DOTENV + STDIN?
This combination provides several security benefits:
1. **No process list exposure**: Parameters don't appear in `ps aux` output
2. **No shell escaping issues**: Values are properly quoted
3. **Secret protection**: Sensitive values passed via stdin, not environment variables
4. **No external dependencies**: Pure POSIX shell parsing without `jq` or other tools
### Secret Handling
Secrets are passed separately via stdin after parameters. They are never included in environment variables or parameter files.
```bash
# Parameters are sent first
url='https://api.example.com'
---ATTUNE_PARAMS_END---
# Then secrets (as JSON)
{"api_key":"secret123","password":"hunter2"}
```
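A script consumes this stream in two phases: parse dotenv lines until the delimiter, then read the remainder of stdin as the secrets JSON. A sketch, with a heredoc standing in for the real stdin stream and hypothetical parameter/secret values:

```sh
#!/bin/sh
# Phase 1: parameters up to the delimiter; phase 2: secrets JSON.
# The redirected brace group keeps both phases on one stdin stream.
{
    while IFS= read -r line; do
        case "$line" in
            *"---ATTUNE_PARAMS_END---"*) break ;;
        esac
        [ -z "$line" ] && continue
        key="${line%%=*}"
        value="${line#*=}"
        value="${value#\'}"; value="${value%\'}"
        [ "$key" = "url" ] && url="$value"
    done
    # Everything left on stdin is the secrets JSON payload
    secrets_json=$(cat)
} <<'EOF'
url='https://api.example.com'
---ATTUNE_PARAMS_END---
{"api_key":"secret123","password":"hunter2"}
EOF

echo "url: $url"
echo "secrets received: yes"
```

Extracting individual secret values from the JSON is then up to the action; avoid echoing `secrets_json` itself into logs.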
## Examples
### Example 1: HTTP Request Action
**Action Configuration:**
```yaml
ref: core.http_request
parameter_delivery: stdin
parameter_format: dotenv
```
**Execution Parameters:**
```json
{
"url": "https://api.example.com/users",
"method": "POST",
"headers": {
"Content-Type": "application/json",
"User-Agent": "Attune/1.0"
},
"query_params": {
"page": "1",
"limit": "10"
}
}
```
**Stdin Input:**
```bash
headers.Content-Type='application/json'
headers.User-Agent='Attune/1.0'
method='POST'
query_params.limit='10'
query_params.page='1'
url='https://api.example.com/users'
---ATTUNE_PARAMS_END---
```
### Example 2: Simple Shell Action
**Action Configuration:**
```yaml
ref: mypack.greet
parameter_delivery: stdin
parameter_format: dotenv
```
**Execution Parameters:**
```json
{
"name": "Alice",
"greeting": "Hello"
}
```
**Stdin Input:**
```bash
greeting='Hello'
name='Alice'
---ATTUNE_PARAMS_END---
```
## Troubleshooting
### Issue: Parameters Not Received
**Symptom:** Action receives empty or incorrect parameter values.
**Solution:** Ensure you're reading until the `---ATTUNE_PARAMS_END---` delimiter:
```bash
while IFS= read -r line; do
case "$line" in
*"---ATTUNE_PARAMS_END---"*) break ;; # Important!
esac
# ... parse line
done
```
### Issue: Nested Objects Not Parsed
**Symptom:** Headers or query params not being set correctly.
**Solution:** Use pattern matching to detect dotted keys:
```bash
case "$key" in
headers.*)
nested_key="${key#headers.}"
# Process nested key
;;
esac
```
### Issue: Special Characters Corrupted
**Symptom:** Values with single quotes are malformed.
**Solution:** The worker automatically escapes single quotes using `'\''`. Make sure to remove quotes correctly:
```bash
# Remove quotes (handles escaped quotes correctly)
case "$value" in
\'*\') value="${value#\'}"; value="${value%\'}" ;;
esac
```
## Best Practices
1. **Always read until delimiter**: Don't stop reading stdin early
2. **Handle empty objects**: Check if files are empty before processing
3. **Use temporary files**: For nested objects, write to temp files for easier processing
4. **Validate required parameters**: Check that required values are present
5. **Clean up temp files**: Use `trap` to ensure cleanup on exit
```bash
#!/bin/sh
set -e
# Set up cleanup; single-quote the trap so the path expands when the
# trap fires, and quote it to survive spaces in the temp path
headers_file=$(mktemp)
trap 'rm -f "$headers_file"' EXIT
# Parse parameters...
```
## Implementation Details
The parameter flattening is implemented in `crates/worker/src/runtime/parameter_passing.rs`:
- Nested objects are recursively flattened with dot notation
- Empty objects produce no output entries
- Arrays are JSON-serialized as strings
- Output is sorted alphabetically for consistency
- Single quotes are escaped using shell-safe `'\''` pattern
## See Also
- [Action Parameter Schema](../packs/pack-structure.md#parameters)
- [Secrets Management](../authentication/secrets-management.md)
- [Shell Runtime](../architecture/worker-service.md#shell-runtime)
@@ -0,0 +1,130 @@
# History Page URL Query Parameters
This document describes the URL query parameters supported by the history pages (Executions, Events, Enforcements) in the Attune web UI.
## Overview
All history pages support deep linking via URL query parameters. When navigating to a history page with query parameters, the page will automatically initialize its filters with the provided values.
## Executions Page
**Path**: `/executions`
### Supported Query Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `action_ref` | Filter by action reference | `?action_ref=core.echo` |
| `rule_ref` | Filter by rule reference | `?rule_ref=core.on_timer` |
| `trigger_ref` | Filter by trigger reference | `?trigger_ref=core.webhook` |
| `pack_name` | Filter by pack name | `?pack_name=core` |
| `executor` | Filter by executor ID | `?executor=1` |
| `status` | Filter by execution status | `?status=running` |
### Valid Status Values
- `requested`
- `scheduling`
- `scheduled`
- `running`
- `completed`
- `failed`
- `canceling`
- `cancelled`
- `timeout`
- `abandoned`
### Examples
```
# Filter by action
http://localhost:3000/executions?action_ref=core.echo
# Filter by rule and status
http://localhost:3000/executions?rule_ref=core.on_timer&status=completed
# Multiple filters
http://localhost:3000/executions?pack_name=core&status=running&action_ref=core.echo
```
## Events Page
**Path**: `/events`
### Supported Query Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `trigger_ref` | Filter by trigger reference | `?trigger_ref=core.webhook` |
### Examples
```
# Filter by trigger
http://localhost:3000/events?trigger_ref=core.webhook
# Filter by timer trigger
http://localhost:3000/events?trigger_ref=core.timer
```
## Enforcements Page
**Path**: `/enforcements`
### Supported Query Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `rule_ref` | Filter by rule reference | `?rule_ref=core.on_timer` |
| `trigger_ref` | Filter by trigger reference | `?trigger_ref=core.webhook` |
| `event` | Filter by event ID | `?event=123` |
| `status` | Filter by enforcement status | `?status=processed` |
### Valid Status Values
- `created`
- `processed`
- `disabled`
### Examples
```
# Filter by rule
http://localhost:3000/enforcements?rule_ref=core.on_timer
# Filter by event
http://localhost:3000/enforcements?event=123
# Multiple filters
http://localhost:3000/enforcements?rule_ref=core.on_timer&status=processed
```
## Usage Patterns
### Deep Linking from Detail Pages
When viewing a specific execution, event, or enforcement detail page, you can click on related entities (actions, rules, triggers) to navigate to the history page with the appropriate filter pre-applied.
### Sharing Filtered Views
You can share URLs with query parameters to help others view specific filtered data sets:
```
# Share a view of all failed executions for a specific action
http://localhost:3000/executions?action_ref=core.http_request&status=failed
# Share enforcements for a specific rule
http://localhost:3000/enforcements?rule_ref=my_pack.important_rule
```
### Bookmarking
Save frequently used filter combinations as browser bookmarks for quick access.
## Implementation Notes
- Query parameters are read on page load and initialize the filter state
- Changing filters in the UI does **not** update the URL (stateless filtering)
- Multiple query parameters can be combined
- Invalid parameter values are ignored (filters default to empty)
- Parameter names match the API field names for consistency