docs/architecture/executor-service.md

# Executor Service Architecture

## Overview

The **Executor Service** is the core orchestration engine of the Attune automation platform. It is responsible for processing rule enforcements, scheduling executions to workers, managing the execution lifecycle, and orchestrating complex workflows.

## Service Architecture

The Executor is structured as a distributed microservice with three main processing components:

```
┌─────────────────────────────────────────────────────────────┐
│                      Executor Service                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────────┐      ┌──────────────────────┐     │
│   │     Enforcement     │      │      Execution       │     │
│   │     Processor       │      │      Scheduler       │     │
│   └─────────────────────┘      └──────────────────────┘     │
│             │                             │                 │
│             v                             v                 │
│   ┌─────────────────────────────────────────────┐           │
│   │              Execution Manager              │           │
│   └─────────────────────────────────────────────┘           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
          │                  │                  │
          v                  v                  v
     PostgreSQL          RabbitMQ            Workers
```

## Core Components

### 1. Enforcement Processor

**Purpose**: Processes triggered rules and creates execution requests.

**Responsibilities**:
- Listens for `enforcement.created` messages from triggered rules
- Fetches enforcement, rule, and event data from the database
- Evaluates rule conditions and policies
- Creates execution records in the database
- Publishes `execution.requested` messages to the scheduler

**Message Flow**:
```
Rule Triggered → Enforcement Created → Enforcement Processor → Execution Created
```

**Key Implementation Details**:
- Uses the `consume_with_handler` pattern for message consumption
- All processing methods are static so state can be shared across async handlers
- Validates that the rule is enabled before creating executions
- Links executions to enforcements for an audit trail

### 2. Execution Scheduler

**Purpose**: Routes execution requests to available workers.

**Responsibilities**:
- Listens for `execution.requested` messages
- Determines runtime requirements for the action
- Selects appropriate workers based on:
  - Runtime compatibility
  - Worker status (active only)
  - Load balancing (future: capacity, affinity, locality)
- Updates execution status to `Scheduled`
- Publishes `execution.scheduled` messages to worker queues

**Message Flow**:
```
Execution Requested → Scheduler → Worker Selection → Execution Scheduled → Worker
```

**Worker Selection Algorithm**:
1. Fetch all available workers
2. Filter by runtime compatibility (if the action specifies a runtime)
3. Filter by worker status (only active workers)
4. Apply the load-balancing strategy (currently: first available)
5. Future: consider capacity, affinity, and geographic locality
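
The selection steps above can be sketched as follows. This is a minimal illustration of the documented algorithm, not the service's Rust code; `Worker` and its fields are hypothetical stand-ins for the real repository types.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Worker:
    id: int
    runtime: str   # hypothetical, e.g. "python3", "container"
    status: str    # "active" | "inactive"

def select_worker(workers: list[Worker], required_runtime: Optional[str]) -> Optional[Worker]:
    """Mirror the documented steps: filter by runtime, then by status,
    then take the first available worker (the current load-balancing strategy)."""
    candidates = workers
    if required_runtime is not None:
        candidates = [w for w in candidates if w.runtime == required_runtime]
    candidates = [w for w in candidates if w.status == "active"]
    return candidates[0] if candidates else None
```

A later load-balancing strategy would only change the final selection step; the filtering stages stay the same.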

**Key Implementation Details**:
- Supports multiple worker types (local, remote, container)
- Handles worker unavailability with error responses
- Plans for intelligent scheduling based on worker capabilities

### 3. Execution Manager

**Purpose**: Manages execution lifecycle and status transitions.

**Responsibilities**:
- Listens for `execution.status.*` messages from workers
- Updates execution records with status changes
- Handles execution completion (success, failure, cancellation)
- Orchestrates workflow executions (parent-child relationships)
- Publishes completion notifications for downstream consumers

**Message Flow**:
```
Worker Status Update → Execution Manager → Database Update → Completion Handler
```

**Status Lifecycle**:
```
Requested → Scheduling → Scheduled → Running → Completed/Failed/Cancelled
                                                    │
                                                    └→ Child Executions (workflows)
```
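
The lifecycle above implies a transition table; a sketch of the validation it suggests (the table is illustrative — the real service parses statuses into typed Rust enums rather than strings):

```python
# Allowed transitions implied by the lifecycle diagram (illustrative).
TRANSITIONS = {
    "requested":  {"scheduling"},
    "scheduling": {"scheduled"},
    "scheduled":  {"running"},
    "running":    {"completed", "failed", "cancelled"},
    # Terminal states have no outgoing transitions.
    "completed": set(), "failed": set(), "cancelled": set(),
}

def can_transition(current: str, target: str) -> bool:
    """Reject out-of-order status updates, e.g. a late 'running' after 'completed'."""
    return target in TRANSITIONS.get(current, set())
```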

**Key Implementation Details**:
- Parses status strings to typed enums for type safety
- Handles workflow orchestration (parent-child execution chaining)
- Only triggers child executions on successful parent completion
- Publishes completion events for the notification service

## Message Queue Integration

### Message Types

The Executor consumes and produces several message types:

**Consumed**:
- `enforcement.created` - New enforcement from triggered rules
- `execution.requested` - Execution scheduling requests
- `execution.status.*` - Status updates from workers

**Published**:
- `execution.requested` - To the scheduler (from the enforcement processor)
- `execution.scheduled` - To workers (from the scheduler)
- `execution.completed` - To the notifier (from the execution manager)

### Message Envelope Structure

All messages use the standardized `MessageEnvelope<T>` structure:

```rust
MessageEnvelope {
    message_id: Uuid,
    message_type: MessageType,
    source: String,
    timestamp: DateTime<Utc>,
    correlation_id: Option<Uuid>,
    trace_id: Option<String>,
    payload: T,
    retry_count: u32,
}
```

### Consumer Handler Pattern

All processors use the `consume_with_handler` pattern for robust message consumption:

```rust
consumer.consume_with_handler(move |envelope: MessageEnvelope<PayloadType>| {
    // Clone shared state for the async block
    let pool = pool.clone();
    let publisher = publisher.clone();

    async move {
        // Process the message
        Self::process_message(&pool, &publisher, &envelope).await
            .map_err(|e| format!("Error: {}", e).into())
    }
}).await?;
```

**Benefits**:
- Automatic message acknowledgment on success
- Automatic nack with requeue on retriable errors
- Automatic dead-letter queue routing on non-retriable errors
- Built-in error handling and logging

## Database Integration

### Repository Pattern

All database access uses the repository layer:

```rust
use attune_common::repositories::{
    enforcement::EnforcementRepository,
    execution::ExecutionRepository,
    rule::RuleRepository,
    Create, FindById, Update, List,
};
```

### Transaction Support

Future implementations will use database transactions for multi-step operations:
- Creating an execution + publishing the message (atomic)
- Status update + completion handling (atomic)

## Configuration

The Executor service uses the standard Attune configuration system:

```yaml
# config.yaml
database:
  url: postgresql://localhost/attune
  max_connections: 20

message_queue:
  url: amqp://localhost
  exchange: attune.executions
  prefetch_count: 10
```

Environment variable overrides:
```bash
ATTUNE__DATABASE__URL=postgresql://prod-db/attune
ATTUNE__MESSAGE_QUEUE__URL=amqp://prod-mq
```
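
The `ATTUNE__SECTION__KEY` convention maps double underscores to nested configuration keys; a sketch of that mapping (the helper below is hypothetical, not the actual config crate):

```python
def apply_env_overrides(config: dict, env: dict, prefix: str = "ATTUNE__") -> dict:
    """Overlay ATTUNE__A__B=value onto config["a"]["b"] (illustrative)."""
    for key, value in env.items():
        if not key.startswith(prefix):
            continue
        # "ATTUNE__DATABASE__URL" -> ["database", "url"]
        path = [part.lower() for part in key[len(prefix):].split("__")]
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config

config = {"database": {"url": "postgresql://localhost/attune", "max_connections": 20}}
apply_env_overrides(config, {"ATTUNE__DATABASE__URL": "postgresql://prod-db/attune"})
```

Keys not present in the file are created; keys not overridden keep their file values.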

## Error Handling

### Error Types

The Executor handles several error categories:

1. **Database Errors**: Connection issues, query failures
2. **Message Queue Errors**: Connection drops, serialization failures
3. **Business Logic Errors**: Missing entities, invalid states
4. **Worker Errors**: No workers available, incompatible runtimes

### Retry Strategy

- **Retriable Errors**: Requeued for retry (connection issues, timeouts)
- **Non-Retriable Errors**: Sent to the dead-letter queue (invalid data, missing entities)
- **Retry Limits**: Configured per queue (future implementation)
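
The retry decision reduces to a small classifier; a sketch (the error-kind names are illustrative, not the service's actual error types):

```python
# Hypothetical error kinds grouped by the documented retry policy.
RETRIABLE = {"connection_refused", "timeout", "pool_exhausted"}

def handle_failure(error_kind: str) -> str:
    """Return the queue action for a failed message: requeue retriable
    errors, dead-letter everything else (invalid data, missing entities)."""
    if error_kind in RETRIABLE:
        return "nack_requeue"
    return "dead_letter"
```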

### Dead Letter Queues

Failed messages are automatically routed to dead-letter queues for investigation:
- `executor.enforcement.created.dlq`
- `executor.execution.requested.dlq`
- `executor.execution.status.dlq`

## Workflow Orchestration

### Parent-Child Executions

The Executor supports complex workflows through parent-child execution relationships:

```
Parent Execution (Completed)
    ├── Child Execution 1 (action_ref: "pack.action1")
    ├── Child Execution 2 (action_ref: "pack.action2")
    └── Child Execution 3 (action_ref: "pack.action3")
```

**Implementation**:
- The parent execution stores child action references
- On parent completion, the Execution Manager creates child executions
- Child executions inherit the parent's configuration
- Each child is independently scheduled and executed
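
On successful parent completion, the Execution Manager fans out child executions; a minimal sketch of that fan-out (the field names here are hypothetical, not the real schema):

```python
def create_child_executions(parent: dict) -> list[dict]:
    """Create one child record per stored action reference, inheriting the
    parent's configuration. Children are only created on successful completion."""
    if parent["status"] != "completed":
        return []
    return [
        {
            "action_ref": ref,
            "parent_execution_id": parent["id"],
            "config": dict(parent["config"]),  # inherited configuration (copied)
            "status": "requested",             # each child is scheduled independently
        }
        for ref in parent.get("child_action_refs", [])
    ]
```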

### Future Enhancements

- **Conditional Workflows**: Execute children based on the parent's result
- **Parallel vs Sequential**: Control execution order
- **Workflow DAGs**: Complex dependency graphs
- **Workflow Templates**: Reusable workflow definitions

## Policy Enforcement

### Planned Features

1. **Rate Limiting**: Limit executions per time window
2. **Concurrency Control**: Maximum concurrent executions per action/pack
3. **Priority Queuing**: High-priority executions jump the queue
4. **Resource Quotas**: Limit resource consumption per tenant
5. **Execution Windows**: Only execute during specified time periods

### Implementation Location

Policy enforcement will be implemented in:
- Enforcement Processor (pre-execution validation)
- Scheduler (runtime constraint checking)
- A new `PolicyEnforcer` module (future)

## Monitoring & Observability

### Metrics (Future)

- Executions per second (throughput)
- Average execution duration
- Queue depth and processing lag
- Worker utilization
- Error rates by type

### Logging

Structured logging at multiple levels:
- `INFO`: Successful operations, state transitions
- `WARN`: Degraded states, retry attempts
- `ERROR`: Failures requiring attention
- `DEBUG`: Detailed flow for troubleshooting

Example:
```
INFO Processing enforcement: 123
INFO Selected worker 45 for execution 789
INFO Execution 789 scheduled to worker 45
```

### Tracing

Message correlation and distributed tracing:
- `correlation_id`: Links related messages
- `trace_id`: End-to-end request tracing (future integration with OpenTelemetry)

## Running the Service

### Prerequisites

- PostgreSQL 14+ with the schema initialized
- RabbitMQ 3.12+ with exchanges and queues configured
- Environment variables or a config file set up

### Startup

```bash
# Using cargo
cd crates/executor
cargo run

# Or with environment overrides
ATTUNE__DATABASE__URL=postgresql://localhost/attune \
ATTUNE__MESSAGE_QUEUE__URL=amqp://localhost \
cargo run
```

### Graceful Shutdown

The service supports graceful shutdown via SIGTERM/SIGINT:
1. Stop accepting new messages
2. Finish processing in-flight messages
3. Close message queue connections
4. Close database connections
5. Exit cleanly
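
The five steps above can be sketched as an ordered teardown (illustrative only; the `stop`/`drain`/`close` method names are hypothetical stand-ins for the real async shutdown coordination):

```python
def graceful_shutdown(consumer, mq_conn, db_pool) -> list[str]:
    """Run the documented shutdown steps in order and record each one."""
    steps = []
    consumer.stop()          # 1. stop accepting new messages
    steps.append("stopped_consuming")
    consumer.drain()         # 2. finish processing in-flight messages
    steps.append("drained")
    mq_conn.close()          # 3. close message queue connections
    steps.append("mq_closed")
    db_pool.close()          # 4. close database connections
    steps.append("db_closed")
    steps.append("exited")   # 5. exit cleanly
    return steps
```

The ordering matters: draining before closing connections is what prevents in-flight messages from being lost or redelivered unnecessarily.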

## Testing

### Unit Tests

Each module includes unit tests for business logic:
- Rule evaluation
- Worker selection algorithms
- Status parsing
- Workflow creation

### Integration Tests

Integration tests require PostgreSQL and RabbitMQ:
- End-to-end enforcement → execution flow
- Message queue reliability
- Database consistency

### Running Tests

```bash
# Unit tests only
cargo test -p attune-executor --lib

# Integration tests (requires services)
cargo test -p attune-executor --test '*'
```

## Future Enhancements

### Phase 1: Core Functionality (Current)
- ✅ Enforcement processing
- ✅ Execution scheduling
- ✅ Lifecycle management
- ✅ Message queue integration

### Phase 2: Advanced Features (Next)
- Policy enforcement (rate limiting, concurrency)
- Advanced workflow orchestration
- Inquiry handling (human-in-the-loop)
- Retry and failure handling improvements

### Phase 3: Production Readiness
- Comprehensive monitoring and metrics
- Performance optimization
- High availability setup
- Load testing and tuning

### Phase 4: Enterprise Features
- Multi-tenancy isolation
- Advanced scheduling algorithms
- Resource quotas and limits
- Audit logging and compliance

## Troubleshooting

### Common Issues

**Problem**: Executions stuck in `Requested` status
- **Cause**: Scheduler not running or no workers available
- **Solution**: Verify the scheduler is running, check worker status

**Problem**: Messages not being consumed
- **Cause**: RabbitMQ connection issues or queue misconfiguration
- **Solution**: Check the MQ connection, verify queue bindings

**Problem**: Database connection errors
- **Cause**: Connection pool exhausted or database down
- **Solution**: Increase the pool size, check database health

### Debug Mode

Enable detailed logging:
```bash
RUST_LOG=attune_executor=debug,attune_common=debug cargo run
```

## Related Documentation

- [API - Executions](./api-executions.md)
- [API - Events & Enforcements](./api-events-enforcements.md)
- [API - Rules](./api-rules.md)
- [Configuration](./configuration.md)
- [Quick Start](./quick-start.md)

docs/architecture/notifier-service.md

# Notifier Service

The **Notifier Service** provides real-time notifications to clients via WebSocket connections. It listens for PostgreSQL NOTIFY events and broadcasts them to subscribed WebSocket clients based on their subscription filters.

## Overview

The Notifier Service acts as a bridge between the Attune backend services and frontend clients, enabling real-time updates for:

- **Execution status changes** - When executions start, succeed, fail, or time out
- **Inquiry creation and responses** - Human-in-the-loop approval workflows
- **Enforcement creation** - When rules are triggered
- **Event generation** - When sensors detect events
- **Workflow execution updates** - Workflow state transitions

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      Notifier Service                       │
│                                                             │
│  ┌──────────────────┐        ┌─────────────────────────┐    │
│  │   PostgreSQL     │        │   Subscriber Manager    │    │
│  │    Listener      │───────▶│   (Client Management)   │    │
│  │ (LISTEN/NOTIFY)  │        └─────────────────────────┘    │
│  └──────────────────┘                     │                 │
│           │                               ▼                 │
│           │                  ┌─────────────────────────┐    │
│           │                  │    WebSocket Server     │    │
│           │                  │   (HTTP + WS Upgrade)   │    │
│           │                  └─────────────────────────┘    │
│           │                               │                 │
│           └───────────────────────────────┘                 │
└─────────────────────────────────────────────────────────────┘
                               │
              ┌────────────────┴────────────────┐
              │                                 │
              ▼                                 ▼
      ┌──────────────┐                  ┌──────────────┐
      │  WebSocket   │                  │  WebSocket   │
      │   Client 1   │                  │   Client 2   │
      └──────────────┘                  └──────────────┘
```

## Components

### 1. PostgreSQL Listener

Connects to PostgreSQL and listens on multiple notification channels:

- `execution_status_changed`
- `execution_created`
- `inquiry_created`
- `inquiry_responded`
- `enforcement_created`
- `event_created`
- `workflow_execution_status_changed`

When a NOTIFY event is received, the listener parses the payload and forwards it to the Subscriber Manager for broadcast.

**Features:**
- Automatic reconnection on connection loss
- Error handling and retry logic
- Multiple channel subscription
- JSON payload parsing

### 2. Subscriber Manager

Manages WebSocket client connections and their subscriptions.

**Features:**
- Client registration/unregistration
- Subscription filter management
- Notification routing based on filters
- Automatic cleanup of disconnected clients

**Subscription Filters:**
- `all` - Receive all notifications
- `entity_type:TYPE` - Filter by entity type (e.g., `entity_type:execution`)
- `entity:TYPE:ID` - Filter by a specific entity (e.g., `entity:execution:123`)
- `user:ID` - Filter by user ID (e.g., `user:456`)
- `notification_type:TYPE` - Filter by notification type (e.g., `notification_type:execution_status_changed`)
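
Filter matching can be sketched directly from the grammar above (an illustrative re-implementation, not the service's Rust code):

```python
def matches(filter_str: str, n: dict) -> bool:
    """Check one notification against one subscription filter string."""
    if filter_str == "all":
        return True
    parts = filter_str.split(":")
    if parts[0] == "entity_type":
        return n["entity_type"] == parts[1]
    if parts[0] == "entity":
        return n["entity_type"] == parts[1] and str(n["entity_id"]) == parts[2]
    if parts[0] == "user":
        return str(n.get("user_id")) == parts[1]
    if parts[0] == "notification_type":
        return n["notification_type"] == parts[1]
    return False  # unknown filter kinds never match
```

A client holding several filters receives a notification if any one of them matches.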

### 3. WebSocket Server

HTTP server with WebSocket upgrade support.

**Endpoints:**
- `GET /ws` - WebSocket upgrade endpoint
- `GET /health` - Health check endpoint
- `GET /stats` - Service statistics (connected clients, subscriptions)

**Features:**
- CORS support for cross-origin requests
- Automatic ping/pong for connection keep-alive
- JSON message protocol
- Graceful connection handling

## Usage

### Starting the Service

```bash
# Using default configuration
cargo run --bin attune-notifier

# Using a custom configuration file
cargo run --bin attune-notifier -- --config /path/to/config.yaml

# With a custom log level
cargo run --bin attune-notifier -- --log-level debug
```

### Configuration

Create a `config.notifier.yaml` file:

```yaml
service_name: attune-notifier
environment: development

database:
  url: postgresql://postgres:postgres@localhost:5432/attune
  max_connections: 10

notifier:
  host: 0.0.0.0
  port: 8081
  max_connections: 10000

log:
  level: info
  format: json
  console: true
```

### Environment Variables

Configuration can be overridden with environment variables:

```bash
# Database URL
export ATTUNE__DATABASE__URL="postgresql://user:pass@host:5432/db"

# Notifier service settings
export ATTUNE__NOTIFIER__HOST="0.0.0.0"
export ATTUNE__NOTIFIER__PORT="8081"
export ATTUNE__NOTIFIER__MAX_CONNECTIONS="10000"

# Log level
export ATTUNE__LOG__LEVEL="debug"
```

## WebSocket Protocol

### Client Connection

Connect to the WebSocket endpoint:

```javascript
const ws = new WebSocket('ws://localhost:8081/ws');

ws.onopen = () => {
  console.log('Connected to Attune Notifier');
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  console.log('Received message:', message);
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('Disconnected from Attune Notifier');
};
```

### Welcome Message

Upon connection, the server sends a welcome message:

```json
{
  "type": "welcome",
  "client_id": "client_1",
  "message": "Connected to Attune Notifier"
}
```

### Subscribing to Notifications

Send a subscribe message:

```javascript
// Subscribe to all notifications
ws.send(JSON.stringify({
  "type": "subscribe",
  "filter": "all"
}));

// Subscribe to execution notifications only
ws.send(JSON.stringify({
  "type": "subscribe",
  "filter": "entity_type:execution"
}));

// Subscribe to a specific execution
ws.send(JSON.stringify({
  "type": "subscribe",
  "filter": "entity:execution:123"
}));

// Subscribe to your user's notifications
ws.send(JSON.stringify({
  "type": "subscribe",
  "filter": "user:456"
}));

// Subscribe to specific notification types
ws.send(JSON.stringify({
  "type": "subscribe",
  "filter": "notification_type:execution_status_changed"
}));
```

### Unsubscribing

Send an unsubscribe message:

```javascript
ws.send(JSON.stringify({
  "type": "unsubscribe",
  "filter": "entity_type:execution"
}));
```

### Receiving Notifications

Notifications are sent as JSON messages:

```json
{
  "notification_type": "execution_status_changed",
  "entity_type": "execution",
  "entity_id": 123,
  "user_id": 456,
  "payload": {
    "entity_type": "execution",
    "entity_id": 123,
    "status": "succeeded",
    "action": "core.echo",
    "result": {"output": "hello world"}
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
```

### Ping/Pong

Keep the connection alive by sending ping messages:

```javascript
// Send an application-level ping
ws.send(JSON.stringify({"type": "ping"}));

// Protocol-level pong frames are handled automatically by the WebSocket implementation
```

## Message Format

### Client → Server Messages

```typescript
// Subscribe to notifications
{
  "type": "subscribe",
  "filter": string  // Subscription filter string
}

// Unsubscribe from notifications
{
  "type": "unsubscribe",
  "filter": string  // Subscription filter string
}

// Ping
{
  "type": "ping"
}
```

### Server → Client Messages

```typescript
// Welcome message
{
  "type": "welcome",
  "client_id": string,
  "message": string
}

// Notification
{
  "notification_type": string,  // Type of notification
  "entity_type": string,        // Entity type (execution, inquiry, etc.)
  "entity_id": number,          // Entity ID
  "user_id": number | null,     // Optional user ID
  "payload": object,            // Notification payload (varies by type)
  "timestamp": string           // ISO 8601 timestamp
}

// Error (future)
{
  "type": "error",
  "message": string
}
```
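
A client can defensively distinguish incoming server messages using the shapes above; a minimal sketch of that classification (illustrative, not a mandated client API):

```python
def classify_message(msg: dict) -> str:
    """Classify a server-to-client message as 'welcome', 'error',
    'notification', or 'unknown', based on the documented shapes."""
    if msg.get("type") in ("welcome", "error"):
        return msg["type"]
    # Notifications carry no "type" field; identify them by their required keys.
    required = {"notification_type", "entity_type", "entity_id", "payload", "timestamp"}
    if required.issubset(msg):
        return "notification"
    return "unknown"
```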

## Notification Types

### Execution Status Changed

```json
{
  "notification_type": "execution_status_changed",
  "entity_type": "execution",
  "entity_id": 123,
  "user_id": 456,
  "payload": {
    "entity_type": "execution",
    "entity_id": 123,
    "status": "succeeded",
    "action": "slack.post_message",
    "result": {"message_id": "abc123"}
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
```

### Inquiry Created

```json
{
  "notification_type": "inquiry_created",
  "entity_type": "inquiry",
  "entity_id": 789,
  "user_id": 456,
  "payload": {
    "entity_type": "inquiry",
    "entity_id": 789,
    "execution_id": 123,
    "schema": {
      "type": "object",
      "properties": {
        "approve": {"type": "boolean"}
      }
    },
    "ttl": 3600
  },
  "timestamp": "2024-01-15T10:31:00Z"
}
```

### Workflow Execution Status Changed

```json
{
  "notification_type": "workflow_execution_status_changed",
  "entity_type": "workflow_execution",
  "entity_id": 456,
  "user_id": 123,
  "payload": {
    "entity_type": "workflow_execution",
    "entity_id": 456,
    "workflow_ref": "incident.response",
    "status": "running",
    "current_tasks": ["notify_team", "create_ticket"]
  },
  "timestamp": "2024-01-15T10:32:00Z"
}
```

## Example Client Implementations

### JavaScript/Browser

```javascript
class AttuneNotifier {
  constructor(url) {
    this.url = url;
    this.ws = null;
    this.handlers = new Map();
  }

  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      console.log('Connected to Attune Notifier');
    };

    this.ws.onmessage = (event) => {
      const message = JSON.parse(event.data);

      if (message.type === 'welcome') {
        console.log('Welcome:', message.message);
        return;
      }

      // Route the notification to any registered handler
      const type = message.notification_type;
      if (this.handlers.has(type)) {
        this.handlers.get(type)(message);
      }
    };

    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
    };

    this.ws.onclose = () => {
      console.log('Disconnected from Attune Notifier');
      // Implement reconnection logic here
    };
  }

  subscribe(filter) {
    this.ws.send(JSON.stringify({
      type: 'subscribe',
      filter: filter
    }));
  }

  unsubscribe(filter) {
    this.ws.send(JSON.stringify({
      type: 'unsubscribe',
      filter: filter
    }));
  }

  on(notificationType, handler) {
    this.handlers.set(notificationType, handler);
  }

  disconnect() {
    if (this.ws) {
      this.ws.close();
    }
  }
}

// Usage
const notifier = new AttuneNotifier('ws://localhost:8081/ws');
notifier.connect();

// Subscribe to execution updates
notifier.subscribe('entity_type:execution');

// Handle execution status changes
notifier.on('execution_status_changed', (notification) => {
  console.log('Execution updated:', notification.payload);
  // Update the UI with the new execution status
});
```

### Python

```python
import asyncio
import json

import websockets


async def notifier_client():
    uri = "ws://localhost:8081/ws"

    async with websockets.connect(uri) as websocket:
        # Wait for the welcome message
        welcome = await websocket.recv()
        print(f"Connected: {welcome}")

        # Subscribe to execution notifications
        await websocket.send(json.dumps({
            "type": "subscribe",
            "filter": "entity_type:execution"
        }))

        # Listen for notifications
        async for message in websocket:
            notification = json.loads(message)
            print(f"Received: {notification['notification_type']}")
            print(f"Payload: {notification['payload']}")


# Run the client
asyncio.run(notifier_client())
```
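
Both client examples leave reconnection to the reader ("Implement reconnection logic here"); exponential backoff with a cap is the usual approach. A sketch of the delay schedule (the base and cap values are illustrative choices, not service requirements):

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delays (seconds) before each successive reconnect attempt:
    base, 2*base, 4*base, ... capped at `cap`."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]
```

A client would sleep for `backoff_delays(...)[attempt]` before each retry, resetting the attempt counter once a connection succeeds; adding random jitter helps avoid reconnect stampedes after a server restart.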

## Monitoring and Statistics

### Health Check

```bash
curl http://localhost:8081/health
```

Response:
```json
{
  "status": "ok"
}
```

### Service Statistics

```bash
curl http://localhost:8081/stats
```

Response:
```json
{
  "connected_clients": 42,
  "total_subscriptions": 156
}
```

## Testing

### Unit Tests

Run the unit tests:

```bash
cargo test -p attune-notifier
```

All components have comprehensive unit tests:
- PostgreSQL listener notification parsing (4 tests)
- Subscription filter matching (4 tests)
- Subscriber management (6 tests)
- WebSocket message parsing (7 tests)

### Integration Testing

To test the notifier service:

1. **Start PostgreSQL** with the Attune database
2. **Run the notifier service**:
   ```bash
   cargo run --bin attune-notifier -- --log-level debug
   ```
3. **Connect a WebSocket client** (using the browser console or a tool like `websocat`)
4. **Trigger a notification** from PostgreSQL:
   ```sql
   NOTIFY execution_status_changed, '{"entity_type":"execution","entity_id":123,"status":"succeeded"}';
   ```
5. **Verify the client receives the notification**

### WebSocket Testing Tools

- **websocat**: `websocat ws://localhost:8081/ws`
- **wscat**: `wscat -c ws://localhost:8081/ws`
- **Browser DevTools**: Use the Console to test WebSocket connections

## Production Deployment

### Docker

Create a `Dockerfile`:

```dockerfile
FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release --bin attune-notifier

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y libssl3 ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/attune-notifier /usr/local/bin/
COPY config.notifier.yaml /etc/attune/config.yaml
CMD ["attune-notifier", "--config", "/etc/attune/config.yaml"]
```

### Docker Compose

Add to `docker-compose.yml`:

```yaml
services:
  notifier:
    build:
      context: .
      dockerfile: Dockerfile.notifier
    ports:
      - "8081:8081"
    environment:
      - ATTUNE__DATABASE__URL=postgresql://postgres:postgres@db:5432/attune
      - ATTUNE__NOTIFIER__PORT=8081
      - ATTUNE__LOG__LEVEL=info
    depends_on:
      - db
    restart: unless-stopped
```

### Systemd Service

Create `/etc/systemd/system/attune-notifier.service`:

```ini
[Unit]
Description=Attune Notifier Service
After=network.target postgresql.service

[Service]
Type=simple
User=attune
Group=attune
WorkingDirectory=/opt/attune
ExecStart=/opt/attune/bin/attune-notifier --config /etc/attune/config.notifier.yaml
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable attune-notifier
sudo systemctl start attune-notifier
sudo systemctl status attune-notifier
```
|
||||
|
||||
## Scaling Considerations

### Horizontal Scaling (Future Enhancement)

For high-availability deployments with multiple notifier instances:

1. **Use Redis Pub/Sub** for distributed notification broadcasting
2. **Load balance WebSocket connections** using a reverse proxy (nginx, HAProxy)
3. **Use sticky sessions** to keep clients pinned to the same instance

### Performance Tuning

- **max_connections**: Adjust based on expected concurrent clients
- **PostgreSQL connection pool**: Keep it small (10-20 connections)
- **Message buffer sizes**: Tune broadcast channel capacity for high-throughput scenarios

## Troubleshooting

### Clients Not Receiving Notifications

1. **Check client subscriptions**: Ensure filters are correct
2. **Verify PostgreSQL NOTIFY**: Test with `psql` and a manual `NOTIFY`
3. **Check logs**: Set the log level to `debug` for detailed information
4. **Network/Firewall**: Ensure the WebSocket port (8081) is accessible

### Connection Drops

1. **Implement reconnection logic** in clients
2. **Check network stability**
3. **Monitor PostgreSQL connection** health
4. **Increase ping/pong frequency** for keep-alive
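Point 1 deserves emphasis: clients should back off between reconnect attempts rather than retry in a tight loop. A minimal sketch of such a schedule (the 500 ms base and 30 s cap are illustrative values, not service defaults):

```rust
// Illustrative client-side reconnect schedule: exponential backoff with
// a cap, so a restarting notifier is not hammered with reconnects.
// BASE_MS and CAP_MS are example values, not Attune defaults.
fn backoff_ms(attempt: u32) -> u64 {
    const BASE_MS: u64 = 500;
    const CAP_MS: u64 = 30_000;
    // Clamp the shift so the multiplier cannot overflow u64.
    BASE_MS.saturating_mul(1u64 << attempt.min(16)).min(CAP_MS)
}

fn main() {
    let schedule: Vec<u64> = (0u32..7).map(backoff_ms).collect();
    assert_eq!(schedule, vec![500, 1_000, 2_000, 4_000, 8_000, 16_000, 30_000]);
    println!("reconnect delays (ms): {:?}", schedule);
}
```

A real client would sleep for `backoff_ms(attempt)` before each reconnect and reset `attempt` to 0 on a successful connection.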
### High Memory Usage

1. **Check the number of connected clients**: Use the `/stats` endpoint
2. **Limit max_connections** in the configuration
3. **Monitor subscription counts**: Too many filters per client inflate memory
4. **Check for memory leaks**: Monitor usage over time
## Security Considerations

### WebSocket Authentication (Future Enhancement)

Currently, WebSocket connections are unauthenticated. For production deployments:

1. **Implement JWT authentication** on WebSocket upgrade
2. **Validate tokens** before accepting connections
3. **Filter notifications** based on user permissions
4. **Rate limit** connections to prevent abuse

### TLS/SSL

Use a reverse proxy (nginx, Caddy) for TLS termination:

```nginx
server {
    listen 443 ssl;
    server_name notifier.example.com;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location /ws {
        proxy_pass http://localhost:8081/ws;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```
## Future Enhancements

- [ ] **Redis Pub/Sub support** for distributed deployments
- [ ] **WebSocket authentication** with JWT validation
- [ ] **Permission-based filtering** for secure multi-tenancy
- [ ] **Message persistence** for offline clients
- [ ] **Metrics and monitoring** (Prometheus, Grafana)
- [ ] **Admin API** for managing connections and subscriptions
- [ ] **Message acknowledgment** for guaranteed delivery
- [ ] **Binary protocol** for improved performance

## References

- [PostgreSQL LISTEN/NOTIFY Documentation](https://www.postgresql.org/docs/current/sql-notify.html)
- [WebSocket Protocol RFC 6455](https://datatracker.ietf.org/doc/html/rfc6455)
- [Axum WebSocket Guide](https://docs.rs/axum/latest/axum/extract/ws/index.html)
- [Tokio Broadcast Channels](https://docs.rs/tokio/latest/tokio/sync/broadcast/index.html)
327
docs/architecture/pack-management-architecture.md
Normal file
@@ -0,0 +1,327 @@
# Pack Management Architecture

**Last Updated**: 2026-01-19
**Status**: Architectural Guidelines

---

## Overview

Attune uses a **pack-based architecture** where most automation components (actions, sensors, triggers) are defined as code and bundled into packs. This document clarifies which entities are code-based vs. UI-configurable and explains the rationale behind this design.

---

## Core Concepts

### Pack-Based Components (Code-Defined)

Components that are **implemented as code** and registered when a pack is loaded/installed:

1. **Actions** - Executable tasks with entry points (Python/Node.js/Shell scripts)
2. **Sensors** - Event monitoring code with poll intervals and trigger generation
3. **Triggers** (Pack-Based) - Event type definitions associated with sensors

**Key Characteristics**:
- Defined in pack manifest/metadata files
- Implemented as executable code (Python, Node.js, Shell, etc.)
- Registered during pack installation/loading
- **Not editable** through the Web UI
- Managed through the pack lifecycle (install, update, uninstall)

### UI-Configurable Components (Data-Defined)

Components that are **configured through the UI** and stored as data:

1. **Rules** - Connect triggers to actions with criteria and parameters
2. **Packs (Ad-Hoc)** - User-created packs for custom automation
3. **Triggers (Ad-Hoc)** - Custom event type definitions for ad-hoc packs
4. **Workflows** - Multi-step automation sequences (future)

**Key Characteristics**:
- Defined through Web UI forms or API calls
- Stored as data in PostgreSQL
- Editable at runtime
- No code deployment required

---
## Pack Types

### 1. System Packs

**Definition**: Pre-built, standard packs that ship with Attune or are installed from a registry.

**Characteristics**:
- `system: true` flag in the database
- Contain code-based actions, sensors, and triggers
- Installed via pack management tools
- **Not editable** through the Web UI (code-based)
- Examples: `core`, `slack`, `aws`, `github`

**Components**:
- ✅ Actions (code-based)
- ✅ Sensors (code-based)
- ✅ Triggers (pack-defined)
- ❌ Rules (configured separately)

### 2. Ad-Hoc Packs

**Definition**: User-created packs for custom automation without deploying code.

**Characteristics**:
- `system: false` flag in the database
- Registered through the Web UI (`/packs/new`)
- May contain only triggers (no actions/sensors)
- Configuration schema for pack-level settings
- Examples: Custom webhook handlers, third-party integrations

**Components**:
- ✅ Triggers (UI-configurable)
- ❌ Actions (require code; use system pack actions)
- ❌ Sensors (require code; use system pack sensors)
- ❌ Rules (configured separately)

---
## Entity Management Matrix

| Entity Type | System Packs  | Ad-Hoc Packs   | Standalone | UI Editable    |
|-------------|---------------|----------------|------------|----------------|
| Pack        | Code Install  | ✅ UI Form     | N/A        | ✅ Ad-Hoc Only |
| Action      | Code Install  | ❌ Not Allowed | ❌ No      | ❌ No          |
| Sensor      | Code Install  | ❌ Not Allowed | ❌ No      | ❌ No          |
| Trigger     | Pack Manifest | ✅ UI Form     | ❌ No      | ✅ Ad-Hoc Only |
| Rule        | N/A           | N/A            | ✅ Yes     | ✅ Yes         |
| Workflow    | N/A           | N/A            | ✅ Yes     | ✅ Future      |

---
## Rationale

### Why Are Actions/Sensors Code-Based?

**Security**:
- Actions execute arbitrary code; UI-based code editing would be a security risk
- Sensors run continuously; code quality and safety are critical

**Complexity**:
- Actions may have complex dependencies (Python packages, Node modules)
- Sensors require event loop integration and error handling
- Runtime selection (Python vs. Node.js vs. Shell) requires proper sandboxing

**Testing and Quality**:
- Code-based components can be version-controlled
- Automated testing in CI/CD pipelines
- Code review processes before deployment

**Performance**:
- Compiled/optimized code runs faster
- Dependency management is cleaner (requirements.txt, package.json)

### Why Are Triggers Mixed?

**Pack-Based Triggers**:
- Tightly coupled to the sensors that generate them
- Schema definitions for event payloads
- Example: the `slack.message_received` trigger from the `slack` pack

**Ad-Hoc Triggers**:
- Allow custom event types for external systems
- Webhook handlers that generate custom events
- Integration with third-party services without writing code
- Example: `custom.payment_received` for Stripe webhooks

### Why Are Rules Always UI-Configurable?

**Purpose**:
- Rules are **glue logic** connecting triggers to actions
- Users need to configure conditions and parameters dynamically
- No executable code is required (just data mapping)

**Flexibility**:
- Business logic changes frequently
- Non-developers should be able to create rules
- Testing and iteration are easier with UI configuration

---
## Web UI Form Requirements

Based on this architecture, the Web UI should provide:

### ✅ Required Forms

1. **Rule Form** (`/rules/new`, `/rules/:id/edit`)
   - Select trigger (from any pack)
   - Define match criteria (JSON conditions)
   - Select action (from any pack)
   - Configure action parameters

2. **Pack Registration Form** (`/packs/new`, `/packs/:name/edit`)
   - Register ad-hoc pack
   - Define configuration schema (JSON Schema)
   - Set pack metadata

3. **Trigger Form** (`/triggers/new`, `/triggers/:id/edit`) - **Future**
   - Only for ad-hoc packs (`system: false`)
   - Define parameters schema
   - Define payload schema
   - Associate with an ad-hoc pack

4. **Workflow Form** (`/workflows/new`, `/workflows/:ref/edit`) - **Future**
   - Visual workflow editor (React Flow)
   - Configure workflow actions (a special type of action)
   - Define task dependencies and transitions

### ❌ NOT Required Forms

1. **Action Form** - Actions are code-based, registered via pack installation
2. **Sensor Form** - Sensors are code-based, registered via pack installation

---
## Pack Installation Process

### System Pack Installation (Future)

```bash
# Install from registry
attune pack install slack

# Install from local directory
attune pack install ./my-custom-pack

# Install from Git repository
attune pack install git+https://github.com/org/attune-pack-aws.git
```

**What Gets Registered**:
1. Pack metadata (name, version, description)
2. Actions (code files, entry points, parameter schemas)
3. Sensors (code files, poll intervals, trigger types)
4. Triggers (event type definitions, payload schemas)

### Ad-Hoc Pack Registration (Current)

```
Web UI: /packs/new
- Enter pack name
- Define config schema
- Save (no code required)
```

**What Gets Registered**:
1. Pack metadata (name, version, description)
2. Configuration schema (for pack-level settings)

**Then Add Triggers**:
```
Web UI: /triggers/new (Future)
- Select ad-hoc pack
- Define trigger name and schemas
- Save
```

---
## Example Workflows

### Scenario 1: Using System Packs

**Goal**: Send a Slack notification when an error event occurs

**Steps**:
1. Install the `core` pack (provides the `core.error_event` trigger)
2. Install the `slack` pack (provides the `slack.send_message` action)
3. Create a rule via the UI:
   - Trigger: `core.error_event`
   - Criteria: `{ "var": "payload.severity", ">=": 3 }`
   - Action: `slack.send_message`
   - Parameters: `{ "channel": "#alerts", "message": "..." }`

**No code required** - both packs are pre-built.

### Scenario 2: Custom Webhook Integration

**Goal**: Trigger automation from a Stripe webhook

**Steps**:
1. Register an ad-hoc pack via the UI (`/packs/new`):
   - Name: `stripe-integration`
   - Config schema: `{ "webhook_secret": { "type": "string" } }`
2. Create an ad-hoc trigger via the UI (`/triggers/new`):
   - Pack: `stripe-integration`
   - Name: `payment.succeeded`
   - Payload schema: `{ "amount": "number", "customer": "string" }`
3. Configure the webhook sensor (a system pack provides a generic webhook sensor)
4. Create a rule via the UI:
   - Trigger: `stripe-integration.payment.succeeded`
   - Action: `slack.send_message` (from the system pack)

**Minimal code** - leverage the existing webhook sensor; only the trigger schema needs defining.

### Scenario 3: Custom Action (Requires Code)

**Goal**: Custom Python action for a proprietary API

**Steps**:
1. Create the pack directory structure:
   ```
   my-company-pack/
   ├── pack.yaml
   ├── actions/
   │   └── send_alert.py
   └── requirements.txt
   ```
2. Install the pack: `attune pack install ./my-company-pack`
3. Create a rule via the UI using the `my-company.send_alert` action

**Code required** - custom business logic needs implementation.
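For orientation, a manifest for the layout above might look roughly like this. All field names here are illustrative guesses, not the actual pack schema:

```yaml
# Hypothetical pack.yaml for my-company-pack (field names are illustrative)
name: my-company
version: 0.1.0
description: Custom actions for a proprietary alerting API
actions:
  - name: send_alert
    runtime: python
    entry_point: actions/send_alert.py
    parameters_schema:
      message: { type: string, required: true }
      severity: { type: integer, default: 1 }
```

On `attune pack install`, a manifest like this is what would drive registration of the pack metadata and the `my-company.send_alert` action.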
---

## Future Enhancements

### Pack Registry (Phase 1)

- Central repository of Attune packs
- Version management and updates
- Pack discovery and browsing
- Dependency resolution

### Visual Workflow Editor (Phase 2)

- Drag-and-drop workflow designer
- Workflow actions (special configurable actions)
- Conditional logic and branching
- Sub-workflows and reusable components

### Pack Marketplace (Phase 3)

- Community-contributed packs
- Ratings and reviews
- Documentation and examples
- Automated testing and validation

---
## Summary

**Key Principles**:

1. **Code for execution** - Actions and sensors are implemented as code for security, performance, and maintainability
2. **Data for configuration** - Rules and workflows are UI-configurable for flexibility
3. **Hybrid for triggers** - Pack-based for sensors, ad-hoc for custom integrations
4. **Pack-centric design** - Components are bundled and versioned together
5. **Progressive enhancement** - Start with system packs, extend with ad-hoc components

This architecture balances **flexibility** (users can configure automation without code) with **safety** (executable code is version-controlled and reviewed).

---

## Related Documentation

- [Pack Management API](./api-packs.md)
- [Rule Management API](./api-rules.md)
- [Trigger and Sensor Architecture](./trigger-sensor-architecture.md)
- [Web UI Architecture](./web-ui-architecture.md)
564
docs/architecture/queue-architecture.md
Normal file
@@ -0,0 +1,564 @@
# Queue Architecture and FIFO Execution Ordering

**Status**: Production Ready (v0.1)
**Last Updated**: 2025-01-27

---

## Overview

Attune implements a **per-action FIFO queue system** to guarantee deterministic execution ordering when policy limits (concurrency, delays) are enforced. This ensures fairness, predictability, and correct workflow execution.

### Why Queue Ordering Matters

**Problem**: Without ordered queuing, executions blocked by policies proceed in **random order** based on tokio's task scheduling. This causes:

- ❌ **Fairness Violations**: Later requests execute before earlier ones
- ❌ **Non-determinism**: The same workflow produces different orders across runs
- ❌ **Broken Dependencies**: Parent executions may proceed after children
- ❌ **Poor UX**: Unpredictable queue behavior frustrates users

**Solution**: FIFO queues with async notification ensure executions proceed in strict request order.

---
## Architecture Components

### 1. ExecutionQueueManager

**Location**: `crates/executor/src/queue_manager.rs`

The central component managing all execution queues.

```rust
pub struct ExecutionQueueManager {
    queues: DashMap<i64, Arc<Mutex<ActionQueue>>>, // Key: action_id
    config: QueueConfig,
    db_pool: Option<PgPool>,
}
```

**Key Features**:
- **One queue per action**: Isolated FIFO queues prevent cross-action interference
- **Thread-safe**: Uses `DashMap` for lock-free map access
- **Async-friendly**: Uses `tokio::Notify` for efficient waiting
- **Observable**: Tracks statistics for monitoring

### 2. ActionQueue

Per-action queue structure with FIFO ordering guarantees.

```rust
struct ActionQueue {
    queue: VecDeque<QueueEntry>, // FIFO queue
    active_count: u32,           // Currently running
    max_concurrent: u32,         // Policy limit
    total_enqueued: u64,         // Lifetime counter
    total_completed: u64,        // Lifetime counter
}
```

### 3. QueueEntry

An individual execution waiting in the queue.

```rust
struct QueueEntry {
    execution_id: i64,
    enqueued_at: DateTime<Utc>,
    notifier: Arc<Notify>, // Async notification
}
```

**Notification Mechanism**:
- Each queued execution gets a `tokio::Notify` handle
- Worker completion triggers `notify.notify_one()` on the next waiter
- No polling required - efficient async waiting
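The notify-one handshake can be illustrated with a blocking stdlib analogue of `tokio::Notify`. This is a sketch only; the real code is async and lives in `queue_manager.rs`:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Blocking stand-in for tokio::Notify (illustrative only): a stored
// permit guarded by a Mutex plus a Condvar. notify_one() wakes at most
// one parked waiter; a permit stored before the wait begins is consumed
// immediately, so the notify/wait order does not matter.
struct Notify {
    signalled: Mutex<bool>,
    cond: Condvar,
}

impl Notify {
    fn new() -> Self {
        Notify { signalled: Mutex::new(false), cond: Condvar::new() }
    }

    fn notified(&self) {
        let mut permit = self.signalled.lock().unwrap();
        while !*permit {
            permit = self.cond.wait(permit).unwrap();
        }
        *permit = false; // consume the permit
    }

    fn notify_one(&self) {
        *self.signalled.lock().unwrap() = true;
        self.cond.notify_one();
    }
}

fn main() {
    let notify = Arc::new(Notify::new());
    let waiter = {
        let n = Arc::clone(&notify);
        thread::spawn(move || {
            n.notified(); // queued execution blocks here
            "slot acquired"
        })
    };
    notify.notify_one(); // completion handler frees a slot
    assert_eq!(waiter.join().unwrap(), "slot acquired");
    println!("waiter resumed");
}
```

The stored-permit detail is what makes the scheme race-free: if the completion lands before the waiter parks, the waiter still proceeds.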
---

## Execution Flow

### Normal Flow (With Capacity)

```
┌─────────────────────────────────────────────────────────────┐
│ 1. EnforcementProcessor receives enforcement.created        │
│    └─ Enforcement: rule fired, needs execution              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. PolicyEnforcer.check_policies(action_id)                 │
│    └─ Verify rate limits, quotas                            │
│    └─ Return: None (no violation)                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. QueueManager.enqueue_and_wait(action_id, exec_id, limit) │
│    └─ Check: active_count < max_concurrent?                 │
│       └─ YES: Increment active_count                        │
│       └─ Return immediately (no waiting)                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Create Execution record in database                      │
│    └─ Status: REQUESTED                                     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Publish execution.requested to scheduler                 │
│    └─ Scheduler selects worker and forwards                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Worker executes action                                   │
│    └─ Status: RUNNING → SUCCEEDED/FAILED                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 7. Worker publishes execution.completed                     │
│    └─ Payload: { execution_id, action_id, status, result }  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 8. CompletionListener receives message                      │
│    └─ QueueManager.notify_completion(action_id)             │
│       └─ Decrement active_count                             │
│       └─ Notify next waiter in queue (if any)               │
└─────────────────────────────────────────────────────────────┘
```

### Queued Flow (At Capacity)

```
┌─────────────────────────────────────────────────────────────┐
│ 1-2. Same as normal flow (enforcement, policy check)        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. QueueManager.enqueue_and_wait(action_id, exec_id, limit) │
│    └─ Check: active_count < max_concurrent?                 │
│       └─ NO: Queue is at capacity                           │
│       └─ Create QueueEntry with Notify handle               │
│       └─ Push to VecDeque (FIFO position)                   │
│       └─ await notifier.notified() ← BLOCKS HERE            │
└─────────────────────────────────────────────────────────────┘
                              │
                              │ (waits for notification)
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ WORKER COMPLETES EARLIER EXECUTION                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ CompletionListener.notify_completion(action_id)             │
│    └─ Lock queue                                            │
│    └─ Pop front QueueEntry (FIFO!)                          │
│    └─ Decrement active_count (was N)                        │
│    └─ entry.notifier.notify_one() ← WAKES WAITER            │
│    └─ Increment active_count (back to N)                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. (continued) enqueue_and_wait() resumes                   │
│    └─ Return Ok(()) - slot acquired                         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 4-8. Same as normal flow (create execution, execute, etc.)  │
└─────────────────────────────────────────────────────────────┘
```
## FIFO Guarantee

### How FIFO is Maintained

1. **Single Queue per Action**: Each action has an independent `VecDeque<QueueEntry>`
2. **Push Back, Pop Front**: New entries are added to the back; the next waiter comes from the front
3. **Locked Mutations**: All queue operations are protected by a `Mutex`
4. **No Reordering**: No priorities, no queue jumping - strict first-in-first-out

### Example Scenario

```
Action: core.http.get (max_concurrent = 2)

T=0: Exec A arrives → active_count=0 → proceeds immediately (active=1)
T=1: Exec B arrives → active_count=1 → proceeds immediately (active=2)
T=2: Exec C arrives → active_count=2 → QUEUED at position 0
T=3: Exec D arrives → active_count=2 → QUEUED at position 1
T=4: Exec E arrives → active_count=2 → QUEUED at position 2

Queue state: [C, D, E]

T=5: A completes → pop C from front → C proceeds (active=2, queue=[D, E])
T=6: B completes → pop D from front → D proceeds (active=2, queue=[E])
T=7: C completes → pop E from front → E proceeds (active=2, queue=[])
T=8: D completes → (queue empty, active=1)
T=9: E completes → (queue empty, active=0)

Result: Executions proceeded in exact order: A, B, C, D, E ✅
```
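The timeline above can be replayed with a minimal, single-threaded stdlib sketch. `ActionQueue`, `try_acquire`, and `complete` here are simplified stand-ins for the real (async, locked) types in `queue_manager.rs`:

```rust
use std::collections::VecDeque;

/// Simplified, single-threaded stand-in for the per-action queue.
struct ActionQueue {
    queue: VecDeque<&'static str>, // waiting executions (FIFO)
    active: u32,
    max_concurrent: u32,
}

impl ActionQueue {
    fn new(max_concurrent: u32) -> Self {
        ActionQueue { queue: VecDeque::new(), active: 0, max_concurrent }
    }

    /// True if the execution may proceed immediately; false if it was
    /// queued behind earlier arrivals.
    fn try_acquire(&mut self, exec: &'static str) -> bool {
        if self.active < self.max_concurrent {
            self.active += 1;
            true
        } else {
            self.queue.push_back(exec); // FIFO position
            false
        }
    }

    /// Called on execution.completed: frees a slot and, if anyone is
    /// waiting, promotes the FRONT entry (active_count stays at the limit).
    fn complete(&mut self) -> Option<&'static str> {
        self.active -= 1;
        let next = self.queue.pop_front();
        if next.is_some() {
            self.active += 1;
        }
        next
    }
}

fn main() {
    let mut q = ActionQueue::new(2);
    let mut started = Vec::new();
    for exec in ["A", "B", "C", "D", "E"] {
        if q.try_acquire(exec) {
            started.push(exec); // A and B start immediately
        }
    }
    // A then B complete; C then D are promoted in arrival order.
    for _ in 0..2 {
        if let Some(next) = q.complete() {
            started.push(next);
        }
    }
    assert_eq!(started, vec!["A", "B", "C", "D"]);
    println!("start order: {:?}", started);
}
```

Because promotion always pops the front of the `VecDeque`, the start order matches the arrival order no matter how completions interleave.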
---

## Queue Statistics

### Data Model

```rust
pub struct QueueStats {
    pub action_id: i64,
    pub queue_length: usize,     // Waiting count
    pub active_count: u32,       // Running count
    pub max_concurrent: u32,     // Policy limit
    pub oldest_enqueued_at: Option<DateTime<Utc>>,
    pub total_enqueued: u64,     // Lifetime counter
    pub total_completed: u64,    // Lifetime counter
}
```

### Persistence

Queue statistics are persisted to the `attune.queue_stats` table for:
- **API visibility**: Real-time queue monitoring
- **Historical tracking**: Execution patterns over time
- **Alerting**: Detecting stuck or growing queues

**Update Frequency**: On every queue state change (enqueue, dequeue, complete)

### Accessing Stats

**In-Memory** (Executor service):
```rust
let stats = queue_manager.get_queue_stats(action_id).await;
```

**Database** (Any service):
```rust
let stats = QueueStatsRepository::find_by_action(pool, action_id).await?;
```

**API Endpoint**:
```bash
GET /api/v1/actions/core.http.get/queue-stats
```
---

## Configuration

### Executor Configuration

```yaml
executor:
  queue:
    # Maximum executions per queue (prevents memory exhaustion)
    max_queue_length: 10000

    # Maximum time an execution can wait in queue (seconds)
    queue_timeout_seconds: 3600

    # Enable/disable queue metrics persistence
    enable_metrics: true
```

### Environment Variables

```bash
# Override via environment
export ATTUNE__EXECUTOR__QUEUE__MAX_QUEUE_LENGTH=5000
export ATTUNE__EXECUTOR__QUEUE__QUEUE_TIMEOUT_SECONDS=1800
```
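The override precedence can be sketched as follows. `effective_setting` is an illustrative helper, not the actual config-loading code:

```rust
/// Hypothetical helper mirroring the ATTUNE__ override convention:
/// an environment variable, when present and parseable, wins over
/// the value loaded from the YAML file.
fn effective_setting(env_value: Option<&str>, yaml_value: u64) -> u64 {
    env_value
        .and_then(|raw| raw.trim().parse().ok())
        .unwrap_or(yaml_value)
}

fn main() {
    // ATTUNE__EXECUTOR__QUEUE__MAX_QUEUE_LENGTH=5000 overrides the file's 10000
    assert_eq!(effective_setting(Some("5000"), 10_000), 5_000);
    // No override set: the YAML default stands
    assert_eq!(effective_setting(None, 3_600), 3_600);
    // An unparseable value falls back rather than taking effect
    assert_eq!(effective_setting(Some("not-a-number"), 3_600), 3_600);
    println!("override resolution ok");
}
```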
---

## Performance Characteristics

### Memory Usage

**Per Queue**: ~128 bytes (DashMap entry + Arc + Mutex overhead)
**Per Queued Execution**: ~80 bytes (QueueEntry + Arc<Notify>)

**Example**: 100 actions with 50 queued executions each:
- Queue overhead: 100 × 128 bytes = ~12.8 KB
- Entry overhead: 5,000 × 80 bytes = ~400 KB
- **Total**: ~412 KB (negligible)
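The arithmetic above generalizes to a one-line estimator (the per-queue and per-entry sizes are the approximations quoted in this document, not measured constants):

```rust
/// Back-of-envelope queue memory estimate using this document's
/// approximate per-queue (~128 B) and per-entry (~80 B) overheads.
fn queue_memory_bytes(num_actions: u64, queued_per_action: u64) -> u64 {
    const PER_QUEUE: u64 = 128; // DashMap entry + Arc + Mutex overhead
    const PER_ENTRY: u64 = 80;  // QueueEntry + Arc<Notify>
    num_actions * PER_QUEUE + num_actions * queued_per_action * PER_ENTRY
}

fn main() {
    // 100 actions × 50 queued each: 12,800 + 400,000 = 412,800 bytes
    let total = queue_memory_bytes(100, 50);
    assert_eq!(total, 412_800);
    println!("estimated queue memory: ~{} KB", total / 1000);
}
```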
### Latency

- **Enqueue (with capacity)**: < 1 μs (just increment a counter)
- **Enqueue (at capacity)**: O(1) to queue, then async wait
- **Dequeue (notify)**: < 10 μs (pop + notify)
- **Stats lookup**: < 1 μs (DashMap read)

### Throughput

**Measured Performance** (from stress tests):
- 1,000 executions (concurrency=5): **~200 exec/sec**
- 10,000 executions (concurrency=10): **~500 exec/sec**

**Bottleneck**: Database writes and worker execution time, not queue overhead

---
## Monitoring and Observability

### Health Indicators

**Healthy Queue**:
- ✅ `queue_length` is 0 or low (< 10% of max)
- ✅ `active_count` ≈ `max_concurrent` during load
- ✅ `oldest_enqueued_at` is recent (< 5 minutes)
- ✅ `total_completed` increases steadily

**Unhealthy Queue**:
- ⚠️ `queue_length` consistently high (> 50% of max)
- ⚠️ `oldest_enqueued_at` is old (> 30 minutes)
- 🚨 `queue_length` approaches `max_queue_length`
- 🚨 `active_count` < `max_concurrent` while executions wait (workers stuck)
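These checks can be folded into a small triage helper. The struct mirrors a subset of `QueueStats`, and `triage` is a hypothetical alerting function applying the thresholds above, not code from the executor:

```rust
/// Subset of QueueStats fields used by the health checklist.
struct QueueStats {
    queue_length: usize,
    active_count: u32,
    max_concurrent: u32,
    oldest_wait_minutes: i64,
}

/// Hypothetical triage helper applying the thresholds listed above.
fn triage(s: &QueueStats, max_queue_length: usize) -> &'static str {
    if s.queue_length * 10 >= max_queue_length * 9 {
        "critical: queue nearly full"
    } else if s.queue_length > 0 && s.active_count < s.max_concurrent {
        "critical: workers stuck (capacity unused while queue waits)"
    } else if s.oldest_wait_minutes > 30 || s.queue_length * 2 > max_queue_length {
        "warning: backlog growing"
    } else {
        "healthy"
    }
}

fn main() {
    let stuck = QueueStats {
        queue_length: 40, active_count: 1, max_concurrent: 5, oldest_wait_minutes: 12,
    };
    assert!(triage(&stuck, 10_000).starts_with("critical"));

    let ok = QueueStats {
        queue_length: 3, active_count: 5, max_concurrent: 5, oldest_wait_minutes: 1,
    };
    assert_eq!(triage(&ok, 10_000), "healthy");
    println!("stuck queue -> {}", triage(&stuck, 10_000));
}
```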
### Monitoring Queries

**Active queues**:
```sql
SELECT action_id, queue_length, active_count, max_concurrent,
       oldest_enqueued_at, last_updated
FROM attune.queue_stats
WHERE queue_length > 0 OR active_count > 0
ORDER BY queue_length DESC;
```

**Stuck queues** (not progressing):
```sql
SELECT a.ref, qs.queue_length, qs.active_count,
       qs.oldest_enqueued_at,
       NOW() - qs.last_updated AS stale_duration
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE (queue_length > 0 OR active_count > 0)
  AND last_updated < NOW() - INTERVAL '10 minutes';
```

**Queue throughput**:
```sql
SELECT a.ref, qs.total_completed, qs.total_enqueued,
       qs.total_completed::float / NULLIF(qs.total_enqueued, 0) * 100 AS completion_rate
FROM attune.queue_stats qs
JOIN attune.action a ON a.id = qs.action_id
WHERE total_enqueued > 0
ORDER BY total_enqueued DESC;
```

---
## Troubleshooting

### Queue Not Progressing

**Symptom**: `queue_length` stays constant; executions don't proceed

**Possible Causes**:
1. **Workers not completing**: Check worker logs for crashes/hangs
2. **Completion messages not publishing**: Check the worker's MQ connection
3. **CompletionListener not running**: Check executor service logs
4. **Database deadlock**: Check PostgreSQL logs

**Diagnosis**:
```bash
# Check active executions for this action
psql -c "SELECT id, status, created FROM attune.execution
         WHERE action = <action_id> AND status IN ('running', 'requested')
         ORDER BY created DESC LIMIT 10;"

# Check worker logs
tail -f /var/log/attune/worker.log | grep "execution_id"

# Check completion messages
rabbitmqctl list_queues name messages
```
### Queue Full Errors

**Symptom**: `Error: Queue full (max length: 10000)`

**Causes**:
- The action is overwhelmed with requests
- Workers are too slow or stuck
- `max_queue_length` is too low

**Solutions**:
1. **Increase the limit** (short-term):
   ```yaml
   executor:
     queue:
       max_queue_length: 20000
   ```

2. **Add more workers** (medium-term):
   - Scale the worker service horizontally
   - Increase worker concurrency

3. **Increase the concurrency limit** (if safe):
   - Adjust the action-specific policy
   - Higher `max_concurrent` = more parallel executions

4. **Rate limit at the API** (long-term):
   - Add API-level rate limiting
   - Reject requests before they enter the system
### Memory Exhaustion

**Symptom**: Executor OOM-killed, high memory usage

**Causes**:
- Too many queues with large queue lengths
- Memory leak in queue entries

**Diagnosis**:
```bash
# Check queue stats in the database
psql -c "SELECT SUM(queue_length) AS total_queued,
                COUNT(*) AS num_actions,
                MAX(queue_length) AS max_queue
         FROM attune.queue_stats;"

# Monitor executor memory
ps aux | grep attune-executor
```

**Solutions**:
- Reduce `max_queue_length`
- Clear old queues: `queue_manager.clear_all_queues()`
- Restart the executor service (queues rebuild from the DB)
### FIFO Violation (Critical Bug)

**Symptom**: Executions complete out of order

**This should NEVER happen.** It indicates a critical bug.

**Diagnosis**:

1. Enable detailed logging:

   ```rust
   // In queue_manager.rs
   tracing::debug!(
       "Enqueued exec {} at position {} for action {}",
       execution_id, queue.len(), action_id
   );
   ```

2. Check for race conditions:
   - Multiple threads modifying the same queue
   - Lock not held during the entire operation
   - Notify called before the entry is dequeued

**Report immediately** with:
- Executor logs with timestamps
- Database query showing execution order
- Queue stats at the time of violation
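
When investigating a suspected violation, one quick sanity check is to compare completion order against enqueue order before filing the report. A minimal, self-contained sketch (the tuple layout is an assumption for illustration, not the executor's actual schema):

```rust
/// Returns the first pair of executions that completed out of FIFO order,
/// or None if completion order matches enqueue order.
///
/// Each entry is (execution_id, enqueue_position), listed in completion order.
fn first_fifo_violation(completions: &[(u64, u64)]) -> Option<(u64, u64)> {
    completions
        .windows(2)
        .find(|w| w[0].1 > w[1].1) // a later enqueue finished first
        .map(|w| (w[0].0, w[1].0))
}

fn main() {
    // Completion order with enqueue positions 0, 1, 3, 2: a violation.
    let completions = [(101, 0), (102, 1), (104, 3), (103, 2)];
    match first_fifo_violation(&completions) {
        Some((a, b)) => println!("FIFO violation: exec {a} completed before exec {b}"),
        None => println!("FIFO order holds"),
    }
}
```

Feeding it the `(execution_id, position)` pairs from the debug logs above pinpoints exactly where ordering broke.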

---

## Best Practices

### For Operators

1. **Monitor queue depths**: Alert on `queue_length > 100`
2. **Set reasonable limits**: Don't set `max_queue_length` too high
3. **Scale workers**: Add workers when queues consistently fill
4. **Regular cleanup**: Run cleanup jobs to remove stale stats
5. **Test policies**: Validate concurrency limits in staging first

### For Developers

1. **Test with queues**: Always test actions with concurrency limits
2. **Handle timeouts**: Implement proper timeout handling in actions
3. **Idempotent actions**: Design actions to be safely retried
4. **Log execution order**: Log start/end times for debugging
5. **Monitor completion rate**: Track `total_completed / total_enqueued`

### For Action Authors

1. **Know your limits**: Understand your action's concurrency safety
2. **Fast completions**: Minimize action execution time
3. **Proper error handling**: Always complete (success or failure)
4. **No indefinite blocking**: Use timeouts on external calls
5. **Test at scale**: Stress-test with many concurrent requests

---

## Security Considerations

### Queue Exhaustion DoS

**Attack**: An attacker floods the system with action requests to fill queues

**Mitigations**:
- **Rate limiting**: API-level request throttling
- **Authentication**: Require auth for action triggers
- **Queue limits**: `max_queue_length` prevents unbounded growth
- **Queue timeouts**: `queue_timeout_seconds` evicts old entries
- **Monitoring**: Alert on sudden queue growth

### Priority Escalation

**Non-Issue**: FIFO prevents priority jumping; no user can skip the queue

### Information Disclosure

**Concern**: Queue stats reveal system load

**Mitigation**: Restrict the `/queue-stats` endpoint to authenticated users with appropriate RBAC

---
## Future Enhancements

### Planned Features

- [ ] **Priority queues**: Allow high-priority executions to jump the queue
- [ ] **Queue pausing**: Temporarily stop processing specific actions
- [ ] **Batch notifications**: Notify multiple waiters at once
- [ ] **Queue persistence**: Survive executor restarts
- [ ] **Cross-executor coordination**: Distributed queue management
- [ ] **Advanced metrics**: Latency percentiles, queue age histograms
- [ ] **Auto-scaling**: Automatically adjust `max_concurrent` based on load

---
## Related Documentation

- [Executor Service Architecture](./executor-service.md)
- [Policy Enforcement](./policy-enforcement.md)
- [Worker Service](./worker-service.md)
- [API: Actions - Queue Stats Endpoint](./api-actions.md#queue-statistics)
- [Operational Runbook](./ops-runbook.md)

---
## References

- Implementation: `crates/executor/src/queue_manager.rs`
- Tests: `crates/executor/tests/fifo_ordering_integration_test.rs`
- Implementation Plan: `work-summary/2025-01-policy-ordering-plan.md`
- Status: `work-summary/FIFO-ORDERING-STATUS.md`

---

**Version**: 1.0
**Status**: Production Ready
**Last Updated**: 2025-01-27
762
docs/architecture/sensor-service.md
Normal file
@@ -0,0 +1,762 @@
# Attune Sensor Service

## Overview

The **Sensor Service** is responsible for monitoring trigger conditions and generating events in the Attune automation platform. It bridges the gap between external systems and the rule-based automation engine.

## Architecture

### Core Responsibilities

1. **Sensor Lifecycle Management**: Load, start, stop, and restart sensors
2. **Event Monitoring**: Execute sensors to detect trigger conditions
3. **Event Generation**: Create event records when triggers fire
4. **Rule Matching**: Find matching rules and create enforcements
5. **Event Publishing**: Publish events to the message queue for processing

### Service Components

```
sensor/src/
├── main.rs               # Service entry point
├── service.rs            # Main service orchestrator
├── sensor_manager.rs     # Manage sensor instances
├── event_generator.rs    # Generate events from sensor data
├── rule_matcher.rs       # Match events to rules
├── monitors/             # Different trigger monitor types
│   ├── mod.rs
│   ├── custom.rs         # Execute custom sensor code
│   ├── timer.rs          # Cron/interval triggers
│   ├── webhook.rs        # HTTP webhook triggers
│   └── file.rs           # File watch triggers
└── runtime/              # Sensor runtime execution
    ├── mod.rs
    └── sensor_executor.rs  # Execute sensor code in runtime
```

## Event Flow

```
Sensor Poll → Condition Met → Generate Event → Match Rules → Create Enforcements
     ↓             ↓                ↓               ↓                  ↓
  Database    Sensor Code     attune.event     Rule Query     attune.enforcement
                                    ↓                                  ↓
                           EventCreated Msg               EnforcementCreated Msg
                                    ↓                                  ↓
                             (to Notifier)                      (to Executor)
```

## Database Schema

### Trigger Table

```sql
CREATE TABLE attune.trigger (
    id BIGSERIAL PRIMARY KEY,
    ref TEXT NOT NULL UNIQUE,        -- Format: pack.name
    pack BIGINT REFERENCES attune.pack(id),
    pack_ref TEXT,
    label TEXT NOT NULL,
    description TEXT,
    enabled BOOLEAN NOT NULL DEFAULT TRUE,
    param_schema JSONB,              -- Configuration schema
    out_schema JSONB,                -- Output payload schema
    created TIMESTAMPTZ NOT NULL,
    updated TIMESTAMPTZ NOT NULL
);
```

### Sensor Table

```sql
CREATE TABLE attune.sensor (
    id BIGSERIAL PRIMARY KEY,
    ref TEXT NOT NULL UNIQUE,        -- Format: pack.name
    pack BIGINT REFERENCES attune.pack(id),
    pack_ref TEXT,
    label TEXT NOT NULL,
    description TEXT NOT NULL,
    entrypoint TEXT NOT NULL,        -- Code entry point
    runtime BIGINT NOT NULL REFERENCES attune.runtime(id),
    runtime_ref TEXT NOT NULL,       -- e.g., "core.sensor.python3"
    trigger BIGINT NOT NULL REFERENCES attune.trigger(id),
    trigger_ref TEXT NOT NULL,       -- e.g., "core.webhook"
    enabled BOOLEAN NOT NULL,
    param_schema JSONB,              -- Sensor configuration schema
    created TIMESTAMPTZ NOT NULL,
    updated TIMESTAMPTZ NOT NULL
);
```

### Event Table

```sql
CREATE TABLE attune.event (
    id BIGSERIAL PRIMARY KEY,
    trigger BIGINT REFERENCES attune.trigger(id),
    trigger_ref TEXT NOT NULL,       -- Preserved even if trigger deleted
    config JSONB,                    -- Snapshot of trigger/sensor config
    payload JSONB,                   -- Event data
    source BIGINT REFERENCES attune.sensor(id),
    source_ref TEXT,                 -- Sensor that generated the event
    created TIMESTAMPTZ NOT NULL,
    updated TIMESTAMPTZ NOT NULL
);
```

## Sensor Types

### 1. Custom Sensors

Custom sensors execute user-defined code that polls for conditions:

```python
# Example: GitHub webhook sensor
def poll():
    # Check for new GitHub events
    events = check_github_api()

    for event in events:
        # Yield the event payload
        yield {
            "event_type": event.type,
            "repository": event.repo,
            "author": event.author,
            "data": event.data,
        }
```

**Features**:
- Support multiple runtimes (Python, Node.js)
- Poll on configurable intervals
- Handle failures and retries
- Restart on errors

### 2. Timer Triggers (Built-in)

Execute actions on a schedule:

```yaml
trigger:
  ref: core.timer
  type: cron
  schedule: "0 0 * * *"  # Daily at midnight
```

**Features**:
- Cron expressions
- Interval-based (every N seconds/minutes/hours)
- Timezone support

### 3. Webhook Triggers (Built-in)

HTTP endpoints for external systems:

```yaml
trigger:
  ref: core.webhook
  path: /webhook/github
  method: POST
  auth: bearer_token
```

**Features**:
- Dynamic webhook URL generation
- Authentication (API key, bearer token, HMAC)
- Payload validation
- Path parameters

### 4. File Watch Triggers (Future)

Monitor filesystem changes:

```yaml
trigger:
  ref: core.file_watch
  path: /var/log/app.log
  patterns: ["ERROR", "FATAL"]
```

## Configuration

### Service Configuration

```yaml
sensor:
  enabled: true
  poll_interval: 30            # Default poll interval (seconds)
  max_concurrent_sensors: 100  # Max sensors running concurrently
  sensor_timeout: 300          # Sensor execution timeout (seconds)
  restart_on_error: true       # Restart sensors on error
  max_restart_attempts: 3      # Max restart attempts before disabling

  # Webhook server (if enabled)
  webhook:
    enabled: false
    host: 0.0.0.0
    port: 8083
    base_path: /webhooks

  # Timer triggers (if enabled)
  timer:
    enabled: false
    tick_interval: 1           # Check timers every N seconds
```

### Environment Variables

```bash
# Override service settings
ATTUNE__SENSOR__ENABLED=true
ATTUNE__SENSOR__POLL_INTERVAL=30
ATTUNE__SENSOR__MAX_CONCURRENT_SENSORS=100
ATTUNE__SENSOR__WEBHOOK__ENABLED=true
ATTUNE__SENSOR__WEBHOOK__PORT=8083
```
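
The double underscore in these names maps one-to-one onto the nesting in the YAML above (`ATTUNE__SENSOR__WEBHOOK__PORT` overrides `sensor.webhook.port`). A minimal sketch of that mapping (the function is illustrative; the service's actual config loader is not shown in this document):

```rust
/// Converts an env var like "ATTUNE__SENSOR__WEBHOOK__PORT"
/// into the config key path ["sensor", "webhook", "port"].
/// Returns None for variables outside the ATTUNE__ namespace.
fn env_var_to_path(name: &str) -> Option<Vec<String>> {
    let rest = name.strip_prefix("ATTUNE__")?;
    Some(rest.split("__").map(|s| s.to_lowercase()).collect())
}

fn main() {
    let path = env_var_to_path("ATTUNE__SENSOR__WEBHOOK__PORT").unwrap();
    println!("{}", path.join(".")); // sensor.webhook.port
}
```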

## Message Queue Integration

### Consumes From

No messages are consumed initially (standalone operation).

### Publishes To

#### EventCreated Message

Published to the `attune.events` exchange with routing key `event.created`:

```json
{
  "message_id": "uuid",
  "correlation_id": "uuid",
  "message_type": "EventCreated",
  "timestamp": "2024-01-15T10:30:00Z",
  "payload": {
    "event_id": 123,
    "trigger_ref": "github.webhook",
    "trigger_id": 45,
    "sensor_ref": "github.listener",
    "sensor_id": 67,
    "payload": {
      "event_type": "push",
      "repository": "user/repo",
      "author": "johndoe"
    },
    "config": {
      "repo_url": "https://github.com/user/repo"
    }
  }
}
```

#### EnforcementCreated Message

Published to the `attune.events` exchange with routing key `enforcement.created`:

```json
{
  "message_id": "uuid",
  "correlation_id": "uuid",
  "message_type": "EnforcementCreated",
  "timestamp": "2024-01-15T10:30:00Z",
  "payload": {
    "enforcement_id": 456,
    "rule_id": 78,
    "rule_ref": "github.deploy_on_push",
    "event_id": 123,
    "trigger_ref": "github.webhook",
    "payload": {
      "event_type": "push",
      "repository": "user/repo",
      "branch": "main"
    }
  }
}
```

## Sensor Execution

### Sensor Manager

The `SensorManager` component:

1. **Loads Sensors**: Query the database for enabled sensors
2. **Starts Sensors**: Spawn an async task for each sensor
3. **Monitors Health**: Track sensor status and restarts
4. **Handles Errors**: Retry logic and failure tracking

```rust
pub struct SensorManager {
    sensors: Arc<RwLock<HashMap<i64, SensorInstance>>>,
    config: Arc<SensorConfig>,
    db: PgPool,
}

impl SensorManager {
    pub async fn start(&self) -> Result<()> {
        let sensors = self.load_enabled_sensors().await?;

        for sensor in sensors {
            self.start_sensor(sensor).await?;
        }

        Ok(())
    }

    async fn start_sensor(&self, sensor: Sensor) -> Result<()> {
        // Capture the id before `sensor` is moved into the instance.
        let sensor_id = sensor.id;
        let instance = SensorInstance::new(sensor, self.config.clone());
        instance.start().await?;

        self.sensors.write().await.insert(sensor_id, instance);

        Ok(())
    }
}
```

### Sensor Instance

Each sensor runs in its own async task:

```rust
// Cloning is cheap: the mutable state is shared behind an Arc,
// which lets the spawned task own a handle to the instance.
#[derive(Clone)]
pub struct SensorInstance {
    sensor: Sensor,
    runtime: RuntimeConfig,
    poll_interval: Duration,
    status: Arc<RwLock<SensorStatus>>,
}

impl SensorInstance {
    pub async fn start(&self) -> Result<()> {
        let me = self.clone();
        tokio::spawn(async move { me.run_loop().await });
        Ok(())
    }

    async fn run_loop(&self) {
        loop {
            match self.poll().await {
                Ok(events) => {
                    for event_data in events {
                        self.generate_event(event_data).await;
                    }
                }
                Err(e) => {
                    self.handle_error(e).await;
                }
            }

            tokio::time::sleep(self.poll_interval).await;
        }
    }

    async fn poll(&self) -> Result<Vec<JsonValue>> {
        // Execute sensor code in the runtime,
        // similar to the Worker's ActionExecutor
        todo!()
    }
}
```

### Runtime Execution

Sensors execute in runtimes (Python, Node.js) similar to actions:

```rust
pub struct SensorExecutor {
    runtime_manager: RuntimeManager,
}

impl SensorExecutor {
    pub async fn execute(&self, sensor: &Sensor, config: JsonValue) -> Result<Vec<JsonValue>> {
        // 1. Prepare the execution environment
        // 2. Inject sensor code and configuration
        // 3. Execute the sensor
        // 4. Collect yielded events
        // 5. Return the event data

        let runtime = self.runtime_manager.get_runtime(&sensor.runtime_ref)?;
        let result = runtime.execute_sensor(sensor, config).await?;

        Ok(result)
    }
}
```

## Event Generation

When a sensor detects a trigger condition:

1. **Create Event Record**: Insert into the `attune.event` table
2. **Snapshot Configuration**: Capture the trigger/sensor config at event time
3. **Store Payload**: Save the event data from the sensor
4. **Publish Message**: Send `EventCreated` to the message queue

```rust
pub struct EventGenerator {
    db: PgPool,
    mq: MessageQueue,
}

impl EventGenerator {
    pub async fn generate_event(
        &self,
        sensor: &Sensor,
        trigger: &Trigger,
        payload: JsonValue,
    ) -> Result<i64> {
        // Create the event record
        let event_id = sqlx::query_scalar!(
            r#"
            INSERT INTO attune.event
                (trigger, trigger_ref, config, payload, source, source_ref)
            VALUES ($1, $2, $3, $4, $5, $6)
            RETURNING id
            "#,
            Some(trigger.id),
            &trigger.r#ref,
            self.build_config_snapshot(trigger, sensor),
            &payload,
            Some(sensor.id),
            Some(&sensor.r#ref)
        )
        .fetch_one(&self.db)
        .await?;

        // Publish the EventCreated message
        self.publish_event_created(event_id, trigger, sensor, &payload).await?;

        Ok(event_id)
    }
}
```

## Rule Matching

After generating an event, find matching rules:

1. **Query Rules**: Find rules for the trigger
2. **Evaluate Conditions**: Check whether the event matches the rule conditions
3. **Create Enforcements**: Insert enforcement records
4. **Publish Messages**: Send `EnforcementCreated` to the executor

```rust
pub struct RuleMatcher {
    db: PgPool,
    mq: MessageQueue,
}

impl RuleMatcher {
    pub async fn match_rules(&self, event: &Event) -> Result<Vec<i64>> {
        // Find enabled rules for this trigger
        let rules = sqlx::query_as!(
            Rule,
            r#"
            SELECT * FROM attune.rule
            WHERE trigger_ref = $1 AND enabled = true
            "#,
            &event.trigger_ref
        )
        .fetch_all(&self.db)
        .await?;

        let mut enforcement_ids = Vec::new();

        for rule in rules {
            // Evaluate the rule conditions
            if self.evaluate_conditions(&rule, event).await? {
                let enforcement_id = self.create_enforcement(&rule, event).await?;
                enforcement_ids.push(enforcement_id);
            }
        }

        Ok(enforcement_ids)
    }

    async fn evaluate_conditions(&self, rule: &Rule, event: &Event) -> Result<bool> {
        // Evaluate JSON conditions against the event payload.
        // Simple implementation: check whether all conditions match.
        todo!()
    }

    async fn create_enforcement(&self, rule: &Rule, event: &Event) -> Result<i64> {
        let enforcement_id = sqlx::query_scalar!(
            r#"
            INSERT INTO attune.enforcement
                (rule, rule_ref, trigger_ref, event, status, payload, condition, conditions)
            VALUES ($1, $2, $3, $4, 'created', $5, $6, $7)
            RETURNING id
            "#,
            Some(rule.id),
            &rule.r#ref,
            &rule.trigger_ref,
            Some(event.id),
            event.payload.clone().unwrap_or_default(),
            rule.condition,
            &rule.conditions
        )
        .fetch_one(&self.db)
        .await?;

        // Publish the EnforcementCreated message
        self.publish_enforcement_created(enforcement_id, rule, event).await?;

        Ok(enforcement_id)
    }
}
```

## Condition Evaluation

Rule conditions are evaluated against event payloads:

### Condition Format

```json
{
  "conditions": [
    {
      "field": "payload.branch",
      "operator": "equals",
      "value": "main"
    },
    {
      "field": "payload.author",
      "operator": "not_equals",
      "value": "bot"
    }
  ],
  "condition": "all"  // or "any"
}
```

### Supported Operators

- `equals`: Exact match
- `not_equals`: Not equal
- `contains`: String contains
- `starts_with`: String prefix
- `ends_with`: String suffix
- `matches`: Regex match
- `greater_than`: Numeric comparison
- `less_than`: Numeric comparison
- `in`: Value in array
- `not_in`: Value not in array
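
The `todo!()` in `RuleMatcher::evaluate_conditions` would implement this operator dispatch. A simplified, self-contained sketch of the core logic, using a flat string field map instead of full JSON path lookup, and omitting the regex and array operators for brevity (the `Condition` struct and string-only value model are assumptions for illustration):

```rust
use std::collections::HashMap;

struct Condition {
    field: String,
    operator: String,
    value: String,
}

/// Evaluates one condition against a flat field map.
/// Numeric operators parse both sides as f64; anything that
/// doesn't parse, a missing field, or an unknown operator fails closed.
fn eval_condition(cond: &Condition, fields: &HashMap<String, String>) -> bool {
    let Some(actual) = fields.get(&cond.field) else {
        return false; // missing field never matches
    };
    match cond.operator.as_str() {
        "equals" => actual == &cond.value,
        "not_equals" => actual != &cond.value,
        "contains" => actual.contains(&cond.value),
        "starts_with" => actual.starts_with(&cond.value),
        "ends_with" => actual.ends_with(&cond.value),
        "greater_than" => matches_num(actual, &cond.value, |a, b| a > b),
        "less_than" => matches_num(actual, &cond.value, |a, b| a < b),
        _ => false, // unknown operator: fail closed
    }
}

fn matches_num(a: &str, b: &str, cmp: impl Fn(f64, f64) -> bool) -> bool {
    match (a.parse::<f64>(), b.parse::<f64>()) {
        (Ok(a), Ok(b)) => cmp(a, b),
        _ => false,
    }
}

/// "all" requires every condition to hold; "any" requires at least one.
fn eval_conditions(mode: &str, conds: &[Condition], fields: &HashMap<String, String>) -> bool {
    match mode {
        "any" => conds.iter().any(|c| eval_condition(c, fields)),
        _ => conds.iter().all(|c| eval_condition(c, fields)), // default: "all"
    }
}

fn main() {
    let mut fields = HashMap::new();
    fields.insert("payload.branch".to_string(), "main".to_string());
    let cond = Condition {
        field: "payload.branch".to_string(),
        operator: "equals".to_string(),
        value: "main".to_string(),
    };
    println!("matches: {}", eval_conditions("all", &[cond], &fields)); // matches: true
}
```

Failing closed on missing fields and unknown operators keeps a misconfigured rule from firing actions unexpectedly.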

## Error Handling

### Sensor Failures

When a sensor fails:

1. Log the error with context
2. Increment the failure count
3. Restart the sensor (if configured)
4. Disable the sensor after max retries
5. Create a notification

```rust
async fn handle_sensor_error(&self, sensor_id: i64, error: Error) {
    error!("Sensor {} failed: {}", sensor_id, error);

    let failure_count = self.increment_failure_count(sensor_id).await;

    if failure_count >= self.config.max_restart_attempts {
        warn!("Sensor {} exceeded max restart attempts, disabling", sensor_id);
        self.disable_sensor(sensor_id).await;
    } else {
        info!("Restarting sensor {} (attempt {})", sensor_id, failure_count);
        self.restart_sensor(sensor_id).await;
    }
}
```

### Event Generation Failures

If event generation fails:

1. Log the error
2. Retry with backoff
3. Create an alert notification
4. Continue sensor operation
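
The retry step can use capped exponential backoff so a flapping database or broker isn't hammered on every poll. A minimal sketch of the delay schedule (the base and cap values are assumptions, not service configuration):

```rust
use std::time::Duration;

/// Capped exponential backoff: base, 2*base, 4*base, ... up to `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let exp = base.saturating_mul(2u32.saturating_pow(attempt));
    exp.min(max)
}

fn main() {
    let base = Duration::from_millis(100);
    let max = Duration::from_secs(5);
    // 100ms, 200ms, 400ms, ... capped at 5s.
    for attempt in 0..8 {
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt, base, max));
    }
}
```

Adding jitter on top of this schedule is a common refinement when many sensors might retry at once.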

## Monitoring

### Metrics

- `sensors_active`: Number of active sensors
- `sensors_failed`: Number of failed sensors
- `events_generated_total`: Total events generated
- `events_generated_rate`: Events per second
- `enforcements_created_total`: Total enforcements created
- `sensor_poll_duration`: Time to poll a sensor
- `event_generation_duration`: Time to generate an event
- `rule_matching_duration`: Time to match rules

### Health Checks

```rust
pub struct HealthCheck {
    sensor_manager: Arc<SensorManager>,
}

impl HealthCheck {
    pub async fn check(&self) -> HealthStatus {
        let active_sensors = self.sensor_manager.active_count().await;
        let failed_sensors = self.sensor_manager.failed_count().await;

        if active_sensors == 0 {
            HealthStatus::Unhealthy("No active sensors".to_string())
        } else if failed_sensors > 10 {
            HealthStatus::Degraded(format!("{} sensors failed", failed_sensors))
        } else {
            HealthStatus::Healthy
        }
    }
}
```

## Testing

### Unit Tests

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_event_generation() {
        let generator = EventGenerator::new(db_pool(), mq_client());

        let event_id = generator.generate_event(
            &test_sensor(),
            &test_trigger(),
            json!({"test": "data"}),
        ).await.unwrap();

        assert!(event_id > 0);
    }

    #[tokio::test]
    async fn test_rule_matching() {
        let matcher = RuleMatcher::new(db_pool(), mq_client());

        let event = test_event_with_payload(json!({
            "branch": "main",
            "author": "alice"
        }));

        let enforcements = matcher.match_rules(&event).await.unwrap();
        assert_eq!(enforcements.len(), 1);
    }
}
```

### Integration Tests

```rust
#[tokio::test]
async fn test_sensor_to_enforcement_flow() {
    // 1. Create a sensor
    let sensor = create_test_sensor().await;

    // 2. Create a trigger
    let trigger = create_test_trigger().await;

    // 3. Create an action and a rule
    let action = create_test_action().await;
    let rule = create_test_rule(trigger.id, action.id).await;

    // 4. Start the sensor
    sensor_manager.start_sensor(sensor).await;

    // 5. Wait for an event
    let event = wait_for_event(trigger.id).await;

    // 6. Verify an enforcement was created
    let enforcement = wait_for_enforcement(rule.id).await;

    assert_eq!(enforcement.event, Some(event.id));
}
```

## Deployment

### Docker

```dockerfile
FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release --bin attune-sensor

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y \
    ca-certificates \
    python3 \
    nodejs \
    && rm -rf /var/lib/apt/lists/*

COPY --from=builder /app/target/release/attune-sensor /usr/local/bin/
CMD ["attune-sensor"]
```

### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: attune-sensor
spec:
  replicas: 2
  selector:
    matchLabels:
      app: attune-sensor
  template:
    metadata:
      labels:
        app: attune-sensor
    spec:
      containers:
        - name: sensor
          image: attune/sensor:latest
          env:
            - name: ATTUNE__DATABASE__URL
              valueFrom:
                secretKeyRef:
                  name: attune-db
                  key: url
            - name: ATTUNE__MESSAGE_QUEUE__URL
              valueFrom:
                secretKeyRef:
                  name: attune-mq
                  key: url
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```

## Security Considerations

1. **Sensor Code Isolation**: Execute sensor code in sandboxed environments
2. **Secret Management**: Use secrets for sensor authentication (API keys, tokens)
3. **Rate Limiting**: Limit sensor poll frequency to prevent abuse
4. **Input Validation**: Validate event payloads before storage
5. **Access Control**: Restrict sensor management to authorized users
6. **Audit Logging**: Log all sensor operations and events

## Future Enhancements

1. **Distributed Sensors**: Run sensors across multiple nodes
2. **Sensor Clustering**: Group related sensors for coordination
3. **Event Deduplication**: Prevent duplicate events
4. **Event Filtering**: Pre-filter events before rule matching
5. **Sensor Hot Reload**: Update sensor code without restart
6. **Advanced Scheduling**: Complex polling schedules
7. **Webhook Security**: HMAC validation, IP whitelisting
8. **Sensor Metrics Dashboard**: Real-time sensor monitoring UI

280
docs/architecture/trigger-sensor-architecture.md
Normal file
@@ -0,0 +1,280 @@
# Trigger and Sensor Architecture

## Overview

Attune uses a two-level architecture for event detection:

- **Triggers** define event types (templates/schemas)
- **Sensors** are configured instances that monitor for those events

This architecture was introduced in migration `20240103000002_restructure_timer_triggers.sql`.

---

## Key Concepts

### Trigger (Event Type Definition)

A **trigger** is a generic event type definition that specifies:

- What parameters are needed to configure monitoring (`param_schema`)
- What data will be in the event payload when it fires (`out_schema`)

**Example:** `core.intervaltimer` is a trigger type that defines how interval-based timers work.

### Sensor (Configured Instance)

A **sensor** is a specific instance of a trigger with actual configuration values:

- References a trigger type
- Provides concrete configuration values (conforming to the trigger's `param_schema`)
- Actually monitors and fires events

**Example:** `core.timer_10s_sensor` is a sensor instance configured to fire `core.intervaltimer` every 10 seconds.

### Rule (Event Handler)

A **rule** connects a trigger type to an action:

- References the trigger type (not the sensor instance)
- Specifies which action to execute when the trigger fires
- Can include parameter mappings from the event payload to action parameters

---

## Architecture Flow

```
Sensor Instance (with config)
    ↓ monitors and detects
Trigger Type (fires event)
    ↓ evaluated by
Rule (matches trigger type)
    ↓ creates
Enforcement (rule activation)
    ↓ schedules
Execution (action run)
```

---

## Core Timer Triggers

The core pack provides three generic timer trigger types:

### 1. Interval Timer (`core.intervaltimer`)

Fires at regular intervals.

**Param Schema:**

```json
{
  "unit": "seconds|minutes|hours",
  "interval": <integer>
}
```

**Example Sensor Config:**

```json
{
  "unit": "seconds",
  "interval": 10
}
```

**Event Payload:**

```json
{
  "type": "interval",
  "interval_seconds": 10,
  "fired_at": "2026-01-17T15:30:00Z"
}
```
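
Internally, a `{unit, interval}` pair resolves to a concrete tick duration. A minimal sketch of that conversion (the function is illustrative, not the actual sensor runtime API):

```rust
use std::time::Duration;

/// Converts an interval-timer config ({unit, interval}) to a Duration.
/// Returns None for units outside the param schema's
/// "seconds|minutes|hours" choices.
fn interval_to_duration(unit: &str, interval: u64) -> Option<Duration> {
    let seconds = match unit {
        "seconds" => interval,
        "minutes" => interval * 60,
        "hours" => interval * 3600,
        _ => return None,
    };
    Some(Duration::from_secs(seconds))
}

fn main() {
    // The example sensor config above: fire every 10 seconds.
    let tick = interval_to_duration("seconds", 10).unwrap();
    println!("tick every {:?}", tick); // tick every 10s
}
```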

### 2. Cron Timer (`core.crontimer`)

Fires based on cron schedule expressions.

**Param Schema:**

```json
{
  "expression": "<cron expression>"
}
```

**Example Sensor Config:**

```json
{
  "expression": "0 0 * * * *"
}
```

**Event Payload:**

```json
{
  "type": "cron",
  "fired_at": "2026-01-17T15:00:00Z",
  "scheduled_at": "2026-01-17T15:00:00Z"
}
```

### 3. Datetime Timer (`core.datetimetimer`)

Fires once at a specific date and time.

**Param Schema:**

```json
{
  "fire_at": "<ISO 8601 timestamp>"
}
```

**Example Sensor Config:**

```json
{
  "fire_at": "2026-12-31T23:59:59Z"
}
```

**Event Payload:**

```json
{
  "type": "one_shot",
  "fire_at": "2026-12-31T23:59:59Z",
  "fired_at": "2026-12-31T23:59:59Z"
}
```

---

## Creating a Complete Example

### Step 1: Trigger Type Already Exists

The `core.intervaltimer` trigger type is created by the seed script.

### Step 2: Create a Sensor Instance

```sql
INSERT INTO attune.sensor (
    ref, pack, pack_ref, label, description,
    entrypoint, runtime, runtime_ref,
    trigger, trigger_ref, enabled, config
)
VALUES (
    'mypack.every_30s_sensor',
    <pack_id>,
    'mypack',
    '30 Second Timer',
    'Fires every 30 seconds',
    'builtin:interval_timer',
    <sensor_runtime_id>,
    'core.sensor.builtin',
    <intervaltimer_trigger_id>,
    'core.intervaltimer',
    true,
    '{"unit": "seconds", "interval": 30}'::jsonb
);
```

### Step 3: Create a Rule

```sql
INSERT INTO attune.rule (
    ref, pack, pack_ref, label, description,
    action, action_ref,
    trigger, trigger_ref,
    conditions, action_params, enabled
)
VALUES (
    'mypack.my_rule',
    <pack_id>,
    'mypack',
    'My Rule',
    'Does something every 30 seconds',
    <action_id>,
    'mypack.my_action',
    <intervaltimer_trigger_id>,  -- References the trigger type, not the sensor
    'core.intervaltimer',
    '{}'::jsonb,
    '{"message": "Timer fired!"}'::jsonb,
    true
);
```

**Important:** The rule references the trigger type (`core.intervaltimer`), not the specific sensor instance. Any sensor that fires `core.intervaltimer` events will match this rule.

---

## Why This Architecture?

### Advantages

1. **Reusability:** One trigger type, many sensor instances with different configs
2. **Flexibility:** Multiple sensors can fire the same trigger type
3. **Separation of Concerns:**
   - Triggers define what events look like
   - Sensors handle how to detect them
   - Rules define what to do when they occur
4. **Consistency:** All events of a type have the same payload schema

### Example Use Cases

- **Multiple timers:** Create multiple sensor instances with different intervals, all using `core.intervaltimer`
- **Webhook triggers:** One webhook trigger type, multiple sensor instances for different endpoints
- **File watchers:** One file-change trigger type, multiple sensors watching different directories

---
## Migration from Old Architecture
|
||||
|
||||
The old architecture had specific triggers like `core.timer_10s`, `core.timer_1m`, etc. These were removed in migration `20240103000002` and replaced with:
|
||||
- Generic trigger types: `core.intervaltimer`, `core.crontimer`, `core.datetimetimer`
|
||||
- Sensor instances: `core.timer_10s_sensor`, etc., configured to use the generic types
|
||||
|
||||
If you have old rules referencing specific timer triggers, you'll need to:
|
||||
1. Update the rule to reference the appropriate generic trigger type
|
||||
2. Ensure a sensor instance exists with the desired configuration

---

## Database Schema

### Trigger Table

```sql
CREATE TABLE attune.trigger (
    id BIGSERIAL PRIMARY KEY,
    ref TEXT NOT NULL UNIQUE,
    pack BIGINT REFERENCES attune.pack(id),
    pack_ref TEXT NOT NULL,
    label TEXT NOT NULL,
    description TEXT,
    enabled BOOLEAN DEFAULT true,
    param_schema JSONB NOT NULL, -- Schema for sensor config
    out_schema JSONB NOT NULL    -- Schema for event payloads
);
```

### Sensor Table

```sql
CREATE TABLE attune.sensor (
    id BIGSERIAL PRIMARY KEY,
    ref TEXT NOT NULL UNIQUE,
    pack BIGINT REFERENCES attune.pack(id),
    pack_ref TEXT NOT NULL,
    trigger BIGINT REFERENCES attune.trigger(id),
    trigger_ref TEXT NOT NULL,
    runtime BIGINT REFERENCES attune.runtime(id),
    runtime_ref TEXT NOT NULL,
    config JSONB NOT NULL, -- Actual config values
    enabled BOOLEAN DEFAULT true
);
```

### Rule Table

```sql
CREATE TABLE attune.rule (
    id BIGSERIAL PRIMARY KEY,
    ref TEXT NOT NULL UNIQUE,
    pack BIGINT REFERENCES attune.pack(id),
    pack_ref TEXT NOT NULL,
    trigger BIGINT REFERENCES attune.trigger(id),
    trigger_ref TEXT NOT NULL, -- References trigger type
    action BIGINT REFERENCES attune.action(id),
    action_ref TEXT NOT NULL,
    action_params JSONB,
    enabled BOOLEAN DEFAULT true
);
```

---

## See Also

- `migrations/20240103000002_restructure_timer_triggers.sql` - Migration that introduced this architecture
- `scripts/seed_core_pack.sql` - Seeds the core trigger types and example sensors
- `docs/examples/rule-parameter-examples.md` - Examples of rules using triggers
1249
docs/architecture/web-ui-architecture.md
Normal file
File diff suppressed because it is too large
Load Diff
789
docs/architecture/webhook-system-architecture.md
Normal file
@@ -0,0 +1,789 @@
# Webhook System Architecture

**Last Updated**: 2026-01-20
**Status**: Phase 3 Complete - Advanced Security Features Implemented

---

## Overview

Attune provides built-in webhook support as a first-class feature of the trigger system. Any trigger can be webhook-enabled, allowing external systems to fire events by posting to a unique webhook URL. This eliminates the need for generic webhook triggers and provides better security and traceability.

---

## Core Concepts

### Webhook-Enabled Triggers

Any trigger in Attune can have webhooks enabled:

1. **Pack declares trigger** (e.g., `github.push`, `stripe.payment_succeeded`)
2. **User enables webhooks** via toggle in UI or API
3. **System generates unique webhook key** (secure random token)
4. **Webhook URL is provided** for external system configuration
5. **External system POSTs to webhook URL** with payload
6. **Attune creates event** from webhook payload
7. **Rules evaluate normally** against the event

### Key Benefits

- **Per-Trigger Security**: Each trigger has its own unique webhook key
- **No Generic Triggers**: Webhooks are a feature, not a trigger type
- **Better Traceability**: Clear association between webhook and trigger
- **Flexible Payloads**: Each trigger defines its own payload schema
- **Multi-Tenancy Ready**: Webhook keys can be scoped to identities/organizations

---

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     External Systems                        │
├─────────────────────────────────────────────────────────────┤
│  GitHub  │  Stripe  │  Slack  │  Custom Apps  │  etc.       │
└────┬──────────┬──────────┬──────────┬───────────────────────┘
     │          │          │          │
     │ POST     │ POST     │ POST     │ POST
     │          │          │          │
     ▼          ▼          ▼          ▼
┌─────────────────────────────────────────────────────────────┐
│          Attune API - Webhook Receiver Endpoint             │
│          POST /api/v1/webhooks/:webhook_key                 │
├─────────────────────────────────────────────────────────────┤
│  1. Validate webhook key                                    │
│  2. Look up associated trigger                              │
│  3. Parse and validate payload                              │
│  4. Create event in database                                │
│  5. Return 200 OK                                           │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                   PostgreSQL Database                       │
│   ┌────────┐    ┌───────┐    ┌───────┐                      │
│   │Trigger │───▶│ Event │───▶│ Rule  │                      │
│   │webhook │    │       │    │       │                      │
│   │enabled │    │       │    │       │                      │
│   │webhook │    │       │    │       │                      │
│   │  key   │    │       │    │       │                      │
│   └────────┘    └───────┘    └───────┘                      │
└─────────────────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    Rule Evaluation                          │
│                 Execution Scheduling                        │
└─────────────────────────────────────────────────────────────┘
```

---

## Database Schema

### Existing `attune.trigger` Table Extensions

Add webhook-related columns to the trigger table:

```sql
ALTER TABLE attune.trigger ADD COLUMN IF NOT EXISTS
    webhook_enabled BOOLEAN NOT NULL DEFAULT FALSE;

ALTER TABLE attune.trigger ADD COLUMN IF NOT EXISTS
    webhook_key VARCHAR(64) UNIQUE;

ALTER TABLE attune.trigger ADD COLUMN IF NOT EXISTS
    webhook_secret VARCHAR(128); -- For HMAC signature verification (optional)

-- Index for fast webhook key lookup
CREATE INDEX IF NOT EXISTS idx_trigger_webhook_key
    ON attune.trigger(webhook_key)
    WHERE webhook_key IS NOT NULL;
```

### Webhook Event Metadata

Events created from webhooks include additional metadata:

```json
{
  "source": "webhook",
  "webhook_key": "wh_abc123...",
  "webhook_metadata": {
    "received_at": "2024-01-20T12:00:00Z",
    "source_ip": "192.168.1.100",
    "user_agent": "GitHub-Hookshot/abc123",
    "headers": {
      "X-GitHub-Event": "push",
      "X-GitHub-Delivery": "12345-67890"
    }
  },
  "payload": {
    // Original webhook payload from external system
  }
}
```

---

## API Endpoints

### Webhook Receiver

**Receive Webhook Event**

```http
POST /api/v1/webhooks/:webhook_key
Content-Type: application/json

{
  "ref": "refs/heads/main",
  "commits": [...],
  "repository": {...}
}
```

**Response (Success)**
```http
HTTP/1.1 200 OK
Content-Type: application/json
```

**Response (Invalid Key)**
```http
HTTP/1.1 404 Not Found
Content-Type: application/json
```

**Response (Disabled)**
```http
HTTP/1.1 403 Forbidden
Content-Type: application/json
```

### Webhook Management

**Enable Webhooks for Trigger**

```http
POST /api/v1/triggers/:id/webhook/enable
Authorization: Bearer <token>
```

**Response**
```json
{
  "data": {
    "id": 123,
    "ref": "github.push",
    "webhook_enabled": true,
    "webhook_key": "wh_abc123xyz789...",
    "webhook_url": "https://attune.example.com/api/v1/webhooks/wh_abc123xyz789..."
  }
}
```

**Disable Webhooks for Trigger**

```http
POST /api/v1/triggers/:id/webhook/disable
Authorization: Bearer <token>
```

**Regenerate Webhook Key**

```http
POST /api/v1/triggers/:id/webhook/regenerate
Authorization: Bearer <token>
```

**Response**
```json
{
  "data": {
    "webhook_key": "wh_new_key_here...",
    "webhook_url": "https://attune.example.com/api/v1/webhooks/wh_new_key_here...",
    "previous_key_revoked": true
  }
}
```

**Get Webhook Info**

```http
GET /api/v1/triggers/:id/webhook
Authorization: Bearer <token>
```

**Response**
```json
{
  "data": {
    "enabled": true,
    "webhook_key": "wh_abc123xyz789...",
    "webhook_url": "https://attune.example.com/api/v1/webhooks/wh_abc123xyz789...",
    "created_at": "2024-01-20T10:00:00Z",
    "last_used_at": "2024-01-20T12:30:00Z",
    "total_events": 145
  }
}
```

---

## Webhook Key Format

Webhook keys use a recognizable prefix and secure random suffix:

```
wh_[32 random alphanumeric characters]
```

**Example**: `wh_k7j2n9p4m8q1r5w3x6z0a2b5c8d1e4f7`

**Generation (Rust)**:
```rust
use rand::Rng;
use rand::distributions::Alphanumeric;

fn generate_webhook_key() -> String {
    let random_part: String = rand::thread_rng()
        .sample_iter(&Alphanumeric)
        .take(32)
        .map(char::from)
        .collect();

    format!("wh_{}", random_part)
}
```

---

## Security Considerations

### 1. Webhook Key as Bearer Token

The webhook key acts as a bearer token - anyone with the key can post events. Therefore:

- Keys must be long and random (32+ characters)
- Keys must be stored securely
- Keys should be transmitted over HTTPS only
- Keys can be regenerated if compromised

### 2. Optional Signature Verification

For enhanced security, triggers can require HMAC signature verification:

```http
POST /api/v1/webhooks/:webhook_key
X-Webhook-Signature: sha256=abc123...
Content-Type: application/json

{...}
```

The signature is computed as:
```
HMAC-SHA256(webhook_secret, request_body)
```

This ensures payload integrity and authenticity. Note that HMAC alone does not prevent replay attacks; pairing the signature with a timestamp or nonce header is required for that.
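
The verification step can be sketched in Python (an illustrative helper, not the production code, which lives in `crates/api/src/webhook_security.rs`):

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Check an X-Webhook-Signature header of the form 'sha256=<hex>'."""
    # Strip the optional algorithm prefix, leaving the hex digest.
    expected_hex = header_value.removeprefix("sha256=")
    computed = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(computed, expected_hex)
```

`hmac.compare_digest` is the important detail: a naive `==` comparison leaks timing information about how many leading bytes matched.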

### 3. IP Whitelisting (Future)

Triggers can optionally restrict webhooks to specific IP ranges:

```json
{
  "webhook_enabled": true,
  "webhook_ip_whitelist": [
    "192.30.252.0/22",  // GitHub
    "185.199.108.0/22"  // GitHub
  ]
}
```
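
A whitelist check of this shape needs nothing beyond the standard `ipaddress` module (sketch; the actual check lives in the Phase 3 security module):

```python
import ipaddress


def ip_allowed(source_ip: str, whitelist: list) -> bool:
    """Return True if source_ip falls inside any whitelisted CIDR range."""
    addr = ipaddress.ip_address(source_ip)
    # ip_network handles both IPv4 and IPv6 CIDR notation.
    return any(addr in ipaddress.ip_network(cidr) for cidr in whitelist)
```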

### 4. Rate Limiting

Apply rate limits to prevent abuse:

- Per webhook key: 100 requests per minute
- Per IP address: 1000 requests per minute
- Global: 10,000 requests per minute
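
A minimal fixed-window counter illustrates the per-key limit (in-memory sketch for exposition only; Attune tracks windows in the `webhook_rate_limit` table):

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds, per key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        # Counter keyed by (webhook_key, window_index).
        self.counters = defaultdict(int)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))
        self.counters[bucket] += 1
        return self.counters[bucket] <= self.limit
```

Fixed windows are simple but allow up to 2x the limit across a window boundary; a sliding-window or token-bucket scheme smooths that out.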

### 5. Payload Size Limits

Limit webhook payload sizes:

- Maximum payload size: 1 MB
- Reject larger payloads with 413 Payload Too Large

---

## Event Creation from Webhooks

### Event Structure

```sql
INSERT INTO attune.event (
    trigger,
    trigger_ref,
    payload,
    metadata,
    source
) VALUES (
    <trigger_id>,
    <trigger_ref>,
    <webhook_payload>,
    jsonb_build_object(
        'source', 'webhook',
        'webhook_key', <webhook_key>,
        'received_at', NOW(),
        'source_ip', <client_ip>,
        'user_agent', <user_agent>,
        'headers', <selected_headers>
    ),
    'webhook'
);
```

### Payload Transformation

Webhooks can optionally transform payloads before creating events:

1. **Direct Pass-Through** (default): Entire webhook body becomes event payload
2. **JSONPath Extraction**: Extract specific fields from webhook payload
3. **Template Transformation**: Use templates to reshape payload

**Example (JSONPath)**:
```json
{
  "webhook_payload_mapping": {
    "commit_sha": "$.head_commit.id",
    "branch": "$.ref",
    "author": "$.head_commit.author.name"
  }
}
```
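
A simplified resolver for mappings of this form (supporting only dotted `$.a.b` paths, not full JSONPath filters or wildcards) might look like:

```python
def resolve(path, payload):
    """Resolve a '$.a.b' style path against a nested dict."""
    value = payload
    for part in path.lstrip("$").strip(".").split("."):
        value = value[part]
    return value


def apply_mapping(mapping, payload):
    """Build an event payload from a webhook_payload_mapping."""
    return {field: resolve(path, payload) for field, path in mapping.items()}
```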

---

## Web UI Integration

### Trigger Detail Page

Display webhook status for each trigger:

```
┌──────────────────────────────────────────────────────────┐
│ Trigger: github.push                                     │
├──────────────────────────────────────────────────────────┤
│                                                          │
│ Webhooks                              [Toggle: ● ON ]    │
│                                                          │
│ Webhook URL:                                             │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ https://attune.example.com/api/v1/webhooks/wh_k7j... │ │
│ └──────────────────────────────────────────────────────┘ │
│ [Copy URL]  [Show Key]  [Regenerate]                     │
│                                                          │
│ Stats:                                                   │
│ • Events received: 145                                   │
│ • Last event: 2 minutes ago                              │
│ • Created: 2024-01-15 10:30:00                           │
│                                                          │
│ Configuration:                                           │
│ □ Require signature verification                         │
│ □ Enable IP whitelisting                                 │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

### Webhook Key Display

Show webhook key with copy button and security warning:

```
┌──────────────────────────────────────────────────────────┐
│ Webhook Key                                              │
├──────────────────────────────────────────────────────────┤
│                                                          │
│ wh_k7j2n9p4m8q1r5w3x6z0a2b5c8d1e4f7g9h2                  │
│                                                          │
│ [Copy Key]  [Hide]                                       │
│                                                          │
│ ⚠️  Keep this key secret. Anyone with this key can       │
│     trigger events. If compromised, regenerate           │
│     immediately.                                         │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

---

## Implementation Status

### ✅ Phase 1: Database & Core (Complete)

1. ✅ Add webhook columns to `attune.trigger` table
2. ✅ Create migration with indexes
3. ✅ Add webhook key generation function
4. ✅ Update trigger repository with webhook methods
5. ✅ All integration tests passing (6/6)

### ✅ Phase 2: API Endpoints (Complete)

1. ✅ Webhook receiver endpoint: `POST /api/v1/webhooks/:webhook_key`
2. ✅ Webhook management endpoints:
   - `POST /api/v1/triggers/:ref/webhooks/enable`
   - `POST /api/v1/triggers/:ref/webhooks/disable`
   - `POST /api/v1/triggers/:ref/webhooks/regenerate`
3. ✅ Event creation logic with webhook metadata
4. ✅ Error handling and validation
5. ✅ OpenAPI documentation
6. ✅ Integration tests created

**Files Added/Modified:**
- `crates/api/src/routes/webhooks.rs` - Webhook routes implementation
- `crates/api/src/dto/webhook.rs` - Webhook DTOs
- `crates/api/src/dto/trigger.rs` - Added webhook fields to TriggerResponse
- `crates/api/src/openapi.rs` - Added webhook endpoints to OpenAPI spec
- `crates/api/tests/webhook_api_tests.rs` - Comprehensive integration tests

### ✅ Phase 3: Advanced Security Features (Complete)

1. ✅ HMAC signature verification (SHA256, SHA512, SHA1)
2. ✅ Rate limiting per webhook key with configurable windows
3. ✅ IP whitelist support with CIDR notation
4. ✅ Payload size limits (configurable per trigger)
5. ✅ Webhook event logging for audit and analytics
6. ✅ Database functions for security configuration
7. ✅ Repository methods for all Phase 3 features
8. ✅ Enhanced webhook receiver with security checks
9. ✅ Comprehensive error handling and logging

**Database Schema Extensions:**
- `webhook_hmac_enabled`, `webhook_hmac_secret`, `webhook_hmac_algorithm` columns
- `webhook_rate_limit_enabled`, `webhook_rate_limit_requests`, `webhook_rate_limit_window_seconds` columns
- `webhook_ip_whitelist_enabled`, `webhook_ip_whitelist` columns
- `webhook_payload_size_limit_kb` column
- `webhook_event_log` table for audit trail
- `webhook_rate_limit` table for rate limit tracking
- `webhook_stats_detailed` view for analytics

**Repository Methods Added:**
- `enable_webhook_hmac()` - Enable HMAC with secret generation
- `disable_webhook_hmac()` - Disable HMAC verification
- `configure_webhook_rate_limit()` - Configure rate limiting
- `configure_webhook_ip_whitelist()` - Configure IP whitelist
- `check_webhook_rate_limit()` - Check if request within limit
- `check_webhook_ip_whitelist()` - Verify IP against whitelist
- `log_webhook_event()` - Log webhook requests for analytics

**Security Module:**
- HMAC signature verification for SHA256, SHA512, SHA1
- Constant-time comparison for signatures
- CIDR notation support for IP whitelists (IPv4 and IPv6)
- Signature format: `sha256=<hex>` or just `<hex>`
- Headers: `X-Webhook-Signature` or `X-Hub-Signature-256`

**Webhook Receiver Enhancements:**
- Payload size limit enforcement (returns 413 if exceeded)
- IP whitelist validation (returns 403 if not allowed)
- Rate limit enforcement (returns 429 if exceeded)
- HMAC signature verification (returns 401 if invalid)
- Comprehensive event logging for all requests (success and failure)
- Processing time tracking
- Detailed error messages with proper HTTP status codes

**Files Added/Modified:**
- `attune/migrations/20260120000002_webhook_advanced_features.sql` (362 lines)
- `crates/common/src/models.rs` - Added Phase 3 fields and WebhookEventLog model
- `crates/common/src/repositories/trigger.rs` - Added Phase 3 methods (215 lines)
- `crates/api/src/webhook_security.rs` - HMAC and IP validation (274 lines)
- `crates/api/src/routes/webhooks.rs` - Enhanced receiver with security (350+ lines)
- `crates/api/src/middleware/error.rs` - Added TooManyRequests error type
- `crates/api/Cargo.toml` - Added hmac, sha1, sha2, hex dependencies

### 📋 Phase 4: Web UI Integration (In Progress)

1. ✅ Add webhook toggle to trigger detail page
2. ✅ Display webhook URL and key
3. ✅ Add copy-to-clipboard functionality
4. Show webhook statistics from `webhook_stats_detailed` view
5. Add regenerate key button with confirmation
6. HMAC configuration UI (enable/disable, view secret)
7. Rate limit configuration UI
8. IP whitelist management UI
9. Webhook event log viewer
10. Real-time webhook testing tool

### 📋 Phase 5: Additional Features (TODO)

1. Webhook retry on failure with exponential backoff
2. Payload transformation/mapping with JSONPath
3. Multiple webhook keys per trigger
4. Webhook health monitoring and alerts
5. Batch webhook processing
6. Webhook response validation
7. Custom header injection
8. Webhook forwarding/proxying

---

## Example Use Cases

### 1. GitHub Push Events

**Pack Definition:**
```yaml
# packs/github/triggers/push.yaml
name: push
ref: github.push
description: "Triggered when code is pushed to a repository"
type: webhook

payload_schema:
  type: object
  properties:
    ref:
      type: string
      description: "Git reference (branch/tag)"
    commits:
      type: array
      description: "Array of commits"
    repository:
      type: object
      description: "Repository information"
```

**User Workflow:**
1. Navigate to trigger `github.push` in UI
2. Enable webhooks (toggle ON)
3. Copy webhook URL
4. Configure in GitHub repository settings:
   - Payload URL: `https://attune.example.com/api/v1/webhooks/wh_abc123...`
   - Content type: `application/json`
   - Events: Just the push event
5. GitHub sends webhook on push
6. Attune creates event
7. Rules evaluate and trigger actions

### 2. Stripe Payment Events

**Pack Definition:**
```yaml
# packs/stripe/triggers/payment_succeeded.yaml
name: payment_succeeded
ref: stripe.payment_succeeded
description: "Triggered when a payment succeeds"
type: webhook

payload_schema:
  type: object
  properties:
    id:
      type: string
    amount:
      type: integer
    currency:
      type: string
    customer:
      type: string
```

**User Workflow:**
1. Enable webhooks for `stripe.payment_succeeded` trigger
2. Copy webhook URL
3. Configure in Stripe dashboard
4. Enable signature verification (recommended for Stripe)
5. Set webhook secret in Attune
6. Stripe sends webhook on successful payment
7. Attune verifies signature and creates event
8. Rules trigger actions (send receipt, update CRM, etc.)

### 3. Custom Application Events

**Pack Definition:**
```yaml
# packs/myapp/triggers/deployment_complete.yaml
name: deployment_complete
ref: myapp.deployment_complete
description: "Triggered when application deployment completes"
type: webhook

payload_schema:
  type: object
  properties:
    environment:
      type: string
      enum: [dev, staging, production]
    version:
      type: string
    deployed_by:
      type: string
    status:
      type: string
      enum: [success, failure]
```

**User Workflow:**
1. Enable webhooks for `myapp.deployment_complete` trigger
2. Get webhook URL
3. Add to CI/CD pipeline:
   ```bash
   curl -X POST https://attune.example.com/api/v1/webhooks/wh_xyz789... \
     -H "Content-Type: application/json" \
     -d '{
       "environment": "production",
       "version": "v2.1.0",
       "deployed_by": "jenkins",
       "status": "success"
     }'
   ```
4. Attune receives webhook and creates event
5. Rules trigger notifications, health checks, etc.

---

## Testing

### Manual Testing

```bash
# Enable webhooks for a trigger
curl -X POST http://localhost:8080/api/v1/triggers/123/webhook/enable \
  -H "Authorization: Bearer $TOKEN"

# Get webhook info
curl http://localhost:8080/api/v1/triggers/123/webhook \
  -H "Authorization: Bearer $TOKEN"

# Send test webhook
WEBHOOK_KEY="wh_k7j2n9p4m8q1r5w3x6z0a2b5c8d1e4f7"
curl -X POST http://localhost:8080/api/v1/webhooks/$WEBHOOK_KEY \
  -H "Content-Type: application/json" \
  -d '{"test": "payload", "value": 123}'

# Verify event was created
curl http://localhost:8080/api/v1/events?limit=1 \
  -H "Authorization: Bearer $TOKEN"
```

### Integration Tests

```rust
#[tokio::test]
async fn test_webhook_enable_disable() {
    // Create trigger
    // Enable webhooks
    // Verify webhook key generated
    // Disable webhooks
    // Verify key removed
}

#[tokio::test]
async fn test_webhook_event_creation() {
    // Enable webhooks for trigger
    // POST to webhook endpoint
    // Verify event created in database
    // Verify event has correct payload and metadata
}

#[tokio::test]
async fn test_webhook_key_regeneration() {
    // Enable webhooks
    // Save original key
    // Regenerate key
    // Verify new key is different
    // Verify old key no longer works
    // Verify new key works
}

#[tokio::test]
async fn test_webhook_invalid_key() {
    // POST to webhook endpoint with invalid key
    // Verify 404 response
    // Verify no event created
}

#[tokio::test]
async fn test_webhook_rate_limiting() {
    // Send 101 requests in 1 minute
    // Verify rate limit exceeded error
}
```

---

## Migration from Generic Webhooks

If Attune previously had generic webhook triggers, migration steps:

1. Create new webhook-enabled triggers for each webhook use case
2. Enable webhooks for new triggers
3. Provide mapping tool in UI to migrate old webhook URLs
4. Run migration script to update external systems
5. Deprecate generic webhook triggers

---

## Performance Considerations

### Webhook Endpoint Optimization

- Async processing: Return 200 OK immediately, process event async
- Connection pooling: Reuse database connections
- Caching: Cache webhook key lookups (with TTL)
- Bulk event creation: Batch multiple webhook events

### Database Indexes

```sql
-- Fast webhook key lookup
CREATE INDEX idx_trigger_webhook_key ON attune.trigger(webhook_key);

-- Webhook event queries
CREATE INDEX idx_event_source ON attune.event(source) WHERE source = 'webhook';
CREATE INDEX idx_event_webhook_key ON attune.event((metadata->>'webhook_key'));
```

---

## Related Documentation

- [Trigger and Sensor Architecture](./trigger-sensor-architecture.md)
- [Event System](./api-events-enforcements.md)
- [Pack Structure](./pack-structure.md)
- [Security Review](./security-review-2024-01-02.md)

---

## Conclusion

Built-in webhook support as a trigger feature provides:

- ✅ Better security with per-trigger webhook keys
- ✅ Clear association between webhooks and triggers
- ✅ Flexible payload handling per trigger type
- ✅ Easy external system integration
- ✅ Full audit trail and traceability

This design eliminates the need for generic webhook triggers while providing a more robust and maintainable webhook system.
548
docs/architecture/worker-service.md
Normal file
@@ -0,0 +1,548 @@
# Worker Service Architecture

## Overview

The **Worker Service** is responsible for executing automation actions in the Attune platform. It receives execution requests from the Executor service, runs actions in appropriate runtime environments (Python, Shell, Node.js, containers), and reports results back.

## Service Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       Worker Service                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────┐      ┌──────────────────────┐      │
│  │       Worker        │      │      Heartbeat       │      │
│  │    Registration     │      │       Manager        │      │
│  └─────────────────────┘      └──────────────────────┘      │
│            │                            │                   │
│            v                            v                   │
│  ┌─────────────────────────────────────────────┐            │
│  │              Action Executor                │            │
│  │  ┌─────────────────────────────────────┐    │            │
│  │  │          Runtime Registry           │    │            │
│  │  │  - Python Runtime                   │    │            │
│  │  │  - Shell Runtime                    │    │            │
│  │  │  - Local Runtime (Facade)           │    │            │
│  │  │  - Container Runtime (Future)       │    │            │
│  │  └─────────────────────────────────────┘    │            │
│  └─────────────────────────────────────────────┘            │
│            │                            │                   │
│            v                            v                   │
│  ┌─────────────────────┐      ┌──────────────────────┐      │
│  │      Artifact       │      │    Message Queue     │      │
│  │      Manager        │      │  Consumer/Publisher  │      │
│  └─────────────────────┘      └──────────────────────┘      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
         │                  │                    │
         v                  v                    v
    PostgreSQL          RabbitMQ         Local Filesystem
```

## Core Components

### 1. Worker Registration

**Purpose**: Register worker in the database and maintain worker metadata.

**Responsibilities**:
- Register worker on startup with name, type, capabilities
- Update existing worker records to active status on restart
- Deregister worker on shutdown (mark as inactive)
- Update worker capabilities dynamically

**Key Implementation Details**:
- Worker name defaults to hostname if not specified
- Capabilities include supported runtimes (python, shell, node)
- Worker type can be Local, Remote, or Container
- Uses direct SQL queries for registration (no repository pattern needed)

**Database Table**: `attune.worker`

### 2. Heartbeat Manager

**Purpose**: Keep worker status fresh in the database with periodic heartbeat updates.

**Responsibilities**:
- Send periodic heartbeat updates (default: every 30 seconds)
- Update `last_heartbeat` timestamp in database
- Run in background task until stopped
- Handle transient database errors gracefully

**Key Implementation Details**:
- Runs as a tokio background task with interval ticker
- Configurable heartbeat interval via worker config
- Logs errors but doesn't fail the worker on heartbeat issues
- Clean shutdown on service stop
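
The loop is equivalent to this sketch (Python rather than the service's Rust/tokio, with a hypothetical `update_heartbeat` callback standing in for the database update):

```python
import asyncio


async def heartbeat_loop(update_heartbeat, interval, stop):
    """Periodically invoke update_heartbeat until the `stop` event is set.

    Errors are logged and swallowed so a flaky database never
    takes the worker down.
    """
    while not stop.is_set():
        try:
            await update_heartbeat()
        except Exception as exc:  # transient DB errors are non-fatal
            print(f"heartbeat failed: {exc}")
        try:
            # Wake early if stop is signalled mid-interval (clean shutdown).
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
```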

### 3. Runtime System

**Purpose**: Abstraction layer for executing actions in different environments.

**Components**:

#### Runtime Trait
```rust
pub trait Runtime: Send + Sync {
    fn name(&self) -> &str;
    fn can_execute(&self, context: &ExecutionContext) -> bool;
    async fn execute(&self, context: ExecutionContext) -> RuntimeResult<ExecutionResult>;
    async fn setup(&self) -> RuntimeResult<()>;
    async fn cleanup(&self) -> RuntimeResult<()>;
    async fn validate(&self) -> RuntimeResult<()>;
}
```
|
||||
|
||||
#### Python Runtime
|
||||
- Executes Python scripts via subprocess
|
||||
- Generates wrapper script to inject parameters
|
||||
- Supports timeout, stdout/stderr capture
|
||||
- Parses JSON results from stdout
|
||||
- Default entry point: `run()` function
|
||||
|
||||
**Example Action**:
|
||||
```python
|
||||
def run(x, y):
|
||||
return x + y
|
||||
```
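
Wrapper generation might look like the following sketch (the exact template is an assumption; the real runtime would also prepend the action's own code and handle error serialization):

```rust
/// Builds a hypothetical Python wrapper that calls the entry point with
/// keyword arguments and prints the result as JSON on stdout.
fn build_wrapper(entry_point: &str, params_json: &str) -> String {
    format!(
        "import json\nparams = json.loads('''{}''')\nresult = {}(**params)\nprint(json.dumps(result))\n",
        params_json, entry_point
    )
}
```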

#### Shell Runtime
- Executes bash/shell scripts via subprocess
- Injects parameters as environment variables (`PARAM_*`)
- Supports timeout, output capture
- Executes with `set -e` for error propagation

**Example Action**:
```bash
echo "Hello, $PARAM_NAME!"
```
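
The parameter-to-environment mapping described above can be sketched as:

```rust
use std::collections::HashMap;

/// Maps action parameters to `PARAM_*` environment variables
/// by uppercasing each key.
fn params_to_env(params: &HashMap<String, String>) -> HashMap<String, String> {
    params
        .iter()
        .map(|(k, v)| (format!("PARAM_{}", k.to_uppercase()), v.clone()))
        .collect()
}
```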

#### Local Runtime
- Facade that delegates to Python or Shell runtime
- Selects runtime based on action metadata
- Currently supports Python and Shell
- Extensible for additional local runtimes

#### Runtime Registry
- Manages collection of registered runtimes
- Selects appropriate runtime for each action
- Handles runtime setup/cleanup lifecycle

### 4. Action Executor

**Purpose**: Orchestrate the complete execution flow for an action.

**Execution Flow**:
```
1. Load execution record from database
2. Update status to Running
3. Load action definition by reference
4. Prepare execution context (parameters, env vars, timeout)
5. Select and execute in appropriate runtime
6. Capture results (stdout, stderr, return value)
7. Store artifacts (logs, results)
8. Update execution status (Succeeded/Failed)
9. Publish status update messages
```

**Responsibilities**:
- Coordinate execution lifecycle
- Load action and execution data from database
- Prepare execution context with parameters and environment
- Execute action via runtime registry
- Handle success and failure cases
- Store execution artifacts

**Key Implementation Details**:
- Parameters merged: action defaults + execution overrides
- Environment variables include execution metadata
- Default timeout: 5 minutes (300 seconds)
- Errors captured and stored as execution result
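
The merge rule (execution overrides win over action defaults) can be sketched as:

```rust
use std::collections::HashMap;

/// Merges action default parameters with execution-level overrides;
/// an override with the same key replaces the default.
fn merge_params(
    defaults: &HashMap<String, String>,
    overrides: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = defaults.clone();
    for (k, v) in overrides {
        merged.insert(k.clone(), v.clone());
    }
    merged
}
```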

### 5. Artifact Manager

**Purpose**: Store and manage execution artifacts (logs, results, files).

**Artifact Types**:
- **Log**: stdout/stderr from execution
- **Result**: JSON result data from action
- **File**: Custom file outputs from actions
- **Trace**: Debug/trace information (future)

**Storage Structure**:
```
/tmp/attune/artifacts/{worker_name}/
└── execution_{id}/
    ├── stdout.log
    ├── stderr.log
    └── result.json
```

**Responsibilities**:
- Store logs (stdout/stderr) for each execution
- Store JSON result data
- Support custom file artifacts
- Clean up old artifacts (retention policy)
- Delete artifacts for specific executions

**Key Implementation Details**:
- Creates execution-specific directories
- Wraps all IO errors as Internal errors
- Configurable base directory per worker
- Retention policy based on file modification time
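
Path construction for the layout above can be sketched as:

```rust
use std::path::PathBuf;

/// Builds the path for one artifact of an execution, following
/// the layout `{base}/{worker_name}/execution_{id}/{file}`.
fn artifact_path(base: &str, worker_name: &str, execution_id: i64, file: &str) -> PathBuf {
    PathBuf::from(base)
        .join(worker_name)
        .join(format!("execution_{}", execution_id))
        .join(file)
}
```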

### 6. Secret Management

**Purpose**: Securely manage and inject secrets into action execution environments.

**Responsibilities**:
- Fetch secrets from database based on ownership hierarchy
- Decrypt encrypted secrets using AES-256-GCM
- Inject secrets as environment variables
- Clean up secrets after execution

**Secret Ownership Hierarchy**:
1. **System-level secrets** - Available to all actions
2. **Pack-level secrets** - Available to all actions in a pack
3. **Action-level secrets** - Available to specific action only

More specific secrets override less specific ones with the same name.

**Environment Variable Injection**:
- Secret names transformed: `api_key` → `SECRET_API_KEY`
- Prefix: `SECRET_`
- Uppercase with hyphens replaced by underscores
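
Both rules together (layered override plus `SECRET_` naming) can be sketched as:

```rust
use std::collections::HashMap;

/// Resolves secrets by layering system → pack → action (most specific wins),
/// then maps each name to a `SECRET_*` environment variable:
/// uppercased, hyphens replaced by underscores.
fn resolve_secret_env(
    system: &HashMap<String, String>,
    pack: &HashMap<String, String>,
    action: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut resolved = system.clone();
    resolved.extend(pack.clone());
    resolved.extend(action.clone());
    resolved
        .into_iter()
        .map(|(k, v)| (format!("SECRET_{}", k.to_uppercase().replace('-', "_")), v))
        .collect()
}
```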

**Encryption**:
- Algorithm: AES-256-GCM (authenticated encryption)
- Key derivation: SHA-256 hash of configured password
- Format: `nonce:ciphertext` (Base64-encoded)
- Random nonce per encryption operation

**Key Implementation Details**:
- Encryption key loaded from `security.encryption_key` config
- Key hash validation ensures correct decryption key
- Graceful handling of missing secrets (warning, not failure)
- Secrets never logged or exposed in artifacts
- Automatic injection during execution context preparation

**Configuration**:
```yaml
security:
  encryption_key: "your-secret-encryption-password"
```

**Database Table**: `attune.key`

See `docs/secrets-management.md` for comprehensive documentation.

### 7. Worker Service

**Purpose**: Main service orchestration and message queue integration.

**Responsibilities**:
- Initialize all service components
- Register worker in database
- Start heartbeat manager
- Consume execution messages from worker-specific queue
- Publish execution status updates
- Handle graceful shutdown

**Message Flow**:
```
Executor (Scheduler)
  → Publishes: execution.scheduled
  → Queue: worker.{worker_id}.executions
  → Worker consumes message
  → Executes action
  → Publishes: execution.status.running
  → Publishes: execution.status.succeeded/failed
```

**Message Types**:

**Consumed**:
- `execution.scheduled` - New execution assigned to this worker

**Published**:
- `execution.status.running` - Execution started
- `execution.status.succeeded` - Execution completed successfully
- `execution.status.failed` - Execution failed
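
The naming schemes from the flow above reduce to two small helpers (illustrative only):

```rust
/// Builds the per-worker queue name used for direct routing
/// from the scheduler: `worker.{worker_id}.executions`.
fn worker_queue_name(worker_id: i64) -> String {
    format!("worker.{}.executions", worker_id)
}

/// Builds a status routing key such as `execution.status.running`.
fn status_routing_key(status: &str) -> String {
    format!("execution.status.{}", status)
}
```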

**Key Implementation Details**:
- Worker-specific queues enable direct routing from scheduler
- Database and MQ connections initialized on startup
- Graceful shutdown deregisters worker
- Message handlers run async and report errors

## Configuration

The worker service uses the standard Attune configuration system:

```yaml
# config.yaml
database:
  url: postgresql://localhost/attune
  max_connections: 20

message_queue:
  url: amqp://localhost
  exchange: attune.executions

worker:
  name: worker-01            # Optional, defaults to hostname
  worker_type: Local         # Local, Remote, Container
  runtime_id: null           # Optional runtime association
  host: null                 # Optional, defaults to hostname
  port: null                 # Optional
  max_concurrent_tasks: 10   # Max parallel executions
  heartbeat_interval: 30     # Seconds between heartbeats
  task_timeout: 300          # Default task timeout (seconds)

security:
  encryption_key: "your-encryption-key"  # Required for encrypted secrets
```

Environment variable overrides:
```bash
ATTUNE__WORKER__NAME=my-worker
ATTUNE__WORKER__MAX_CONCURRENT_TASKS=20
ATTUNE__WORKER__HEARTBEAT_INTERVAL=60
```
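
The `ATTUNE__SECTION__KEY` convention maps each variable onto a nested config path; a minimal parser sketch (the real loader lives in the Attune config system, so this is illustrative):

```rust
/// Splits an `ATTUNE__WORKER__NAME`-style variable into its config path,
/// lowercased: ["worker", "name"]. Returns None for non-Attune variables.
fn env_override_path(var: &str) -> Option<Vec<String>> {
    let rest = var.strip_prefix("ATTUNE__")?;
    Some(rest.split("__").map(|s| s.to_lowercase()).collect())
}
```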

## Running the Service

### Prerequisites

- PostgreSQL 14+ with Attune schema initialized
- RabbitMQ 3.12+ with exchanges and queues configured
- Python 3.x and/or bash (for local runtimes)
- Environment variables or config file set up

### Startup

```bash
# Using cargo
cd crates/worker
cargo run

# With custom config
cargo run -- --config /path/to/config.yaml

# With custom worker name
cargo run -- --name worker-prod-01

# Or with environment overrides
ATTUNE__WORKER__NAME=worker-01 \
ATTUNE__WORKER__MAX_CONCURRENT_TASKS=20 \
cargo run
```

### Graceful Shutdown

The service supports graceful shutdown via SIGTERM/SIGINT (Ctrl+C):

1. Stop accepting new execution messages
2. Finish processing in-flight executions (future enhancement)
3. Stop heartbeat manager
4. Deregister worker (mark as inactive)
5. Close message queue connections
6. Close database connections
7. Exit cleanly

## Execution Context

The executor prepares a comprehensive execution context for each action:

```rust
pub struct ExecutionContext {
    pub execution_id: i64,
    pub action_ref: String,                     // "pack.action"
    pub parameters: HashMap<String, JsonValue>,
    pub env: HashMap<String, String>,           // Environment variables
    pub timeout: Option<u64>,                   // Timeout in seconds
    pub working_dir: Option<PathBuf>,           // Working directory
    pub entry_point: String,                    // Function/script entry point
    pub code: Option<String>,                   // Action code (inline)
    pub code_path: Option<PathBuf>,             // Action code (file path)
}
```

### Environment Variables

The executor injects these environment variables:
- `ATTUNE_EXECUTION_ID` - Execution ID
- `ATTUNE_ACTION` - Action reference (pack.action)
- `ATTUNE_RUNNER` - Runner type (if specified)
- `ATTUNE_CONTEXT_*` - Context data as environment variables

For shell actions, parameters are also injected as:
- `PARAM_{KEY}` - Each parameter as uppercase env var

## Execution Result

Actions return a standardized result:

```rust
pub struct ExecutionResult {
    pub exit_code: i32,             // 0 = success
    pub stdout: String,             // Standard output
    pub stderr: String,             // Standard error
    pub result: Option<JsonValue>,  // Parsed result data
    pub duration_ms: u64,           // Execution duration
    pub error: Option<String>,      // Error message if failed
}
```
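
Deriving the stored execution status from the result (matching the Succeeded/Failed statuses the executor records) could be as simple as:

```rust
/// Maps an exit code to the execution status recorded in the database.
fn status_from_exit_code(exit_code: i32) -> &'static str {
    if exit_code == 0 { "Succeeded" } else { "Failed" }
}
```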

## Error Handling

### Error Categories

1. **Setup Errors**: Runtime initialization failures
2. **Execution Errors**: Action execution failures
3. **Timeout Errors**: Execution exceeded timeout
4. **IO Errors**: File/network operations
5. **Database Errors**: Connection, query failures

### Error Propagation

- Runtime errors captured in `ExecutionResult.error`
- Execution status updated to Failed in database
- Error published in status update message
- Artifacts still stored for failed executions
- Logs preserved for debugging

## Testing

### Unit Tests

Each runtime includes unit tests:
- Simple execution
- Parameter passing
- Timeout handling
- Error handling

### Integration Tests

Integration tests require PostgreSQL and RabbitMQ:
- Worker registration and heartbeat
- End-to-end action execution
- Message queue integration
- Artifact storage

### Running Tests

```bash
# Unit tests only
cargo test -p attune-worker --lib

# Integration tests (requires services)
cargo test -p attune-worker --test '*'

# Specific runtime tests
cargo test -p attune-worker python_runtime
cargo test -p attune-worker shell_runtime
```

## Implementation Status

### Phase 5.1: Worker Foundation ✅ COMPLETE
- [x] Worker registration module
- [x] Heartbeat manager
- [x] Service initialization
- [x] Configuration loading

### Phase 5.2: Runtime System ✅ COMPLETE
- [x] Runtime trait abstraction
- [x] Python runtime implementation
- [x] Shell runtime implementation
- [x] Local runtime facade
- [x] Runtime registry

### Phase 5.3: Execution Logic ⏳ IN PROGRESS
- [x] Action executor module
- [x] Execution context preparation
- [ ] Fix data model mismatches
- [ ] Complete message queue integration
- [ ] Test end-to-end flow

### Phase 5.4: Artifact Management ✅ COMPLETE
- [x] Artifact manager module
- [x] Log storage (stdout/stderr)
- [x] Result storage (JSON)
- [x] File artifact storage
- [x] Cleanup/retention policies

### Phase 5.5: Testing 📋 TODO
- [x] Runtime unit tests (basic)
- [ ] Integration tests with database
- [ ] End-to-end execution tests
- [ ] Error handling tests

### Phase 5.6: Advanced Features 📋 TODO
- [ ] Container runtime (Docker)
- [ ] Remote worker support
- [ ] Concurrent execution limits
- [ ] Worker capacity management
- [ ] Execution queuing

## Known Issues

### Data Model Mismatches

The current implementation has several mismatches with the actual database schema:

1. **Execution.action**: Expected `String`, actual is `Option<i64>`
2. **Execution fields**: Missing `parameters`, `context`, `runner` fields
3. **Action fields**: `entry_point` → `entrypoint`, missing `timeout`
4. **Repository pattern**: Repositories don't have `::new()` constructors
5. **Error types**: `Error::BadRequest` and `Error::NotFound` have different signatures

### Required Fixes

1. Update executor to use `action_ref` field instead of `action`
2. Fix action loading to query by ID from execution
3. Update execution context preparation for actual schema
4. Fix repository usage patterns
5. Update error construction calls
6. Implement `From<MqError>` for `Error`

## Future Enhancements

### Phase 1: Core Improvements
- Concurrent execution management (max_concurrent_tasks)
- Worker capacity tracking and reporting
- Execution queuing when at capacity
- Retry logic for transient failures

### Phase 2: Advanced Runtimes
- Container runtime with Docker
- Container image management and caching
- Volume mounting for code injection
- Network isolation for security

### Phase 3: Remote Workers
- Remote worker registration
- Worker-to-worker communication
- Geographic distribution
- Load balancing strategies

### Phase 4: Monitoring & Observability
- Execution metrics (duration, success rate)
- Worker health metrics
- Runtime-specific metrics
- OpenTelemetry integration

### Phase 5: Security
- Execution sandboxing
- Resource limits (CPU, memory)
- Secret injection from key store
- Encrypted artifact storage

## Related Documentation

- [Executor Service](./executor-service.md)
- [API - Executions](./api-executions.md)
- [API - Actions](./api-actions.md)
- [Configuration](./configuration.md)
- [Quick Start](./quick-start.md)