attune/docs/sensors/sensor-interface.md

# Sensor Interface Specification
**Version:** 1.0
**Last Updated:** 2025-01-27
**Status:** Draft
## Overview
This document specifies the standard interface that all Attune sensors must implement. Sensors are lightweight, long-running daemon processes that monitor for events and emit them into the Attune platform. Each sensor type runs as exactly one process at a time; within that process, per-rule monitoring instances are created and torn down dynamically as rules change.
## Design Principles
1. **Single Process Per Sensor Type**: Each sensor type (e.g., timer, webhook, file_watcher) runs as a single daemon process
2. **Lightweight & Async**: Sensors should be event-driven and non-blocking
3. **Rule-Driven Behavior**: Sensors manage multiple concurrent "instances" based on active rules
4. **RabbitMQ Communication**: All control messages flow through RabbitMQ
5. **API Integration**: Sensors use the Attune API to emit events and fetch configuration
6. **Standard Authentication**: Sensors authenticate using transient API tokens
7. **Graceful Lifecycle**: Sensors handle startup, shutdown, and dynamic reconfiguration
## Sensor Lifecycle
### 1. Initialization
When a sensor starts, it must:
1. **Read Configuration** from environment variables or stdin
2. **Authenticate** with the Attune API using a transient token
3. **Connect to RabbitMQ** and declare/bind to its control queue
4. **Load Active Rules** from the API that use its trigger types
5. **Start Monitoring** for each active rule
6. **Signal Ready** (log startup completion)
### 2. Runtime Operation
During normal operation, a sensor:
1. **Listens to RabbitMQ** for rule lifecycle messages (`RuleCreated`, `RuleEnabled`, `RuleDisabled`, `RuleDeleted`)
2. **Monitors External Sources** (timers, webhooks, file systems, etc.) based on active rules
3. **Emits Events** to the Attune API when trigger conditions are met
4. **Handles Errors** gracefully without crashing
5. **Reports Health** (periodic heartbeat/metrics - future)
### 3. Shutdown
On shutdown (SIGTERM/SIGINT), a sensor must:
1. **Stop Accepting New Work** (stop listening to RabbitMQ)
2. **Cancel Active Monitors** (stop timers, close connections)
3. **Flush Pending Events** (send any buffered events to API)
4. **Close Connections** (RabbitMQ, HTTP clients)
5. **Exit Cleanly** with appropriate exit code
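The teardown order above can be sketched as a single driver function. `SensorState` and its step methods are illustrative stand-ins, not part of the spec; recording each step makes the required ordering easy to verify in tests:

```rust
// Illustrative shutdown driver; the step methods mirror the list above.
// `SensorState` is a hypothetical stand-in for a sensor's runtime state.
pub struct SensorState {
    pub steps_run: Vec<&'static str>, // records step order (useful in tests)
}

impl SensorState {
    pub fn new() -> Self {
        SensorState { steps_run: Vec::new() }
    }
    fn stop_mq_consumer(&mut self) { self.steps_run.push("stop_mq"); }
    fn cancel_monitors(&mut self) { self.steps_run.push("cancel_monitors"); }
    fn flush_pending_events(&mut self) { self.steps_run.push("flush_events"); }
    fn close_connections(&mut self) { self.steps_run.push("close_connections"); }
}

/// Runs the shutdown sequence in the order required by this spec and
/// returns the process exit code (0 = clean shutdown).
pub fn shutdown(state: &mut SensorState) -> i32 {
    state.stop_mq_consumer();     // 1. stop accepting new work
    state.cancel_monitors();      // 2. cancel active monitors
    state.flush_pending_events(); // 3. flush buffered events
    state.close_connections();    // 4. close RabbitMQ / HTTP connections
    0                             // 5. exit cleanly
}
```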
## Configuration
### Environment Variables
Sensors MUST accept the following environment variables:
| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| `ATTUNE_API_URL` | Yes | Base URL of Attune API | `http://localhost:8080` |
| `ATTUNE_API_TOKEN` | Yes | Transient API token for authentication | `sensor_abc123...` |
| `ATTUNE_SENSOR_REF` | Yes | Reference name of this sensor | `core.timer` |
| `ATTUNE_MQ_URL` | Yes | RabbitMQ connection URL | `amqp://localhost:5672` |
| `ATTUNE_MQ_EXCHANGE` | No | RabbitMQ exchange name | `attune` (default) |
| `ATTUNE_LOG_LEVEL` | No | Logging verbosity | `info` (default) |
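A minimal loader for this table might look like the following. The lookup function is injected so the logic can be exercised without touching the real process environment; the `Config` struct and its field names are assumptions for illustration, not part of the spec:

```rust
// Configuration loader for the variables in the table above. Required
// variables fail loudly; optional ones fall back to their defaults.
#[derive(Debug, PartialEq)]
pub struct Config {
    pub api_url: String,
    pub api_token: String,
    pub sensor_ref: String,
    pub mq_url: String,
    pub mq_exchange: String,
    pub log_level: String,
}

pub fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
    let required = |k: &str| get(k).ok_or_else(|| format!("missing required env var {k}"));
    Ok(Config {
        api_url: required("ATTUNE_API_URL")?,
        api_token: required("ATTUNE_API_TOKEN")?,
        sensor_ref: required("ATTUNE_SENSOR_REF")?,
        mq_url: required("ATTUNE_MQ_URL")?,
        mq_exchange: get("ATTUNE_MQ_EXCHANGE").unwrap_or_else(|| "attune".into()),
        log_level: get("ATTUNE_LOG_LEVEL").unwrap_or_else(|| "info".into()),
    })
}
```

In production the lookup would simply be `load_config(|k| std::env::var(k).ok())`.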
### Alternative: stdin Configuration
For containerized or orchestrated deployments, sensors MAY accept configuration as JSON on stdin:
```json
{
  "api_url": "http://localhost:8080",
  "api_token": "sensor_abc123...",
  "sensor_ref": "core.timer",
  "mq_url": "amqp://localhost:5672",
  "mq_exchange": "attune",
  "log_level": "info"
}
```
If stdin configuration is provided, it takes precedence over environment variables. The JSON must be a complete object, terminated by a newline or EOF.
## API Authentication: Transient Tokens
### Token Requirements
- **Type**: JWT with `service_account` identity type
- **Scope**: Limited to sensor operations (create events, read rules)
- **Lifetime**: Long-lived (90 days); expires automatically
- **Rotation**: Automatic refresh (sensor refreshes token when 80% of TTL elapsed)
- **Zero-Downtime**: Hot-reload new tokens without restart
### Token Format
Sensors receive a standard JWT that includes:
```json
{
  "sub": "sensor:core.timer",
  "jti": "abc123def456", // JWT ID for revocation tracking
  "identity_id": 123,
  "identity_type": "service_account",
  "scope": "sensor",
  "iat": 1738800000, // Issued at
  "exp": 1746576000, // Expiration, 90 days after issue (REQUIRED)
  "metadata": {
    "trigger_types": ["core.timer"] // Enforced by API
  }
}
```
### API Endpoints Used by Sensors
Sensors interact with the following API endpoints:
| Method | Endpoint | Purpose | Auth |
|--------|----------|---------|------|
| GET | `/rules?trigger_type={ref}` | Fetch active rules for this sensor's triggers | Required |
| GET | `/triggers/{ref}` | Fetch trigger metadata | Required |
| POST | `/events` | Create new event | Required |
| POST | `/auth/refresh` | Refresh token before expiration | Required |
| GET | `/health` | Verify API connectivity | Optional |
## RabbitMQ Integration
### Queue Naming
Each sensor binds to a dedicated queue for control messages:
- **Queue Name**: `sensor.{sensor_ref}` (e.g., `sensor.core.timer`)
- **Durable**: Yes
- **Auto-Delete**: No
- **Exclusive**: No
### Exchange Binding
Sensors bind their queue to the main exchange with routing keys:
- `rule.created` - New rule created
- `rule.enabled` - Existing rule enabled
- `rule.disabled` - Existing rule disabled
- `rule.deleted` - Rule deleted
### Message Format
All control messages follow this JSON schema:
```json
{
  "event_type": "RuleCreated | RuleEnabled | RuleDisabled | RuleDeleted",
  "rule_id": 123,
  "trigger_type": "core.timer",
  "trigger_params": {
    "interval_seconds": 5
  },
  "timestamp": "2025-01-27T12:34:56Z"
}
```
### Message Handling
Sensors MUST:
1. **Validate** messages against expected schema
2. **Filter** messages to only process rules for their trigger types (based on token's `metadata.trigger_types`)
3. **Acknowledge** messages after processing (or reject on unrecoverable error)
4. **Handle Duplicates** idempotently (same rule_id + event_type)
5. **Enforce Trigger Type Restrictions**: Only emit events for trigger types declared in the sensor's token metadata
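The handling rules above can be sketched as a pure function over a struct mirroring the message schema. A real sensor would deserialize the MQ payload (e.g. with `serde_json`), which is elided here; `ControlMessage` and `ActiveRules` are illustrative names:

```rust
use std::collections::HashMap;

// Mirrors the control message JSON schema (trigger_params elided).
#[derive(Debug, Clone)]
pub struct ControlMessage {
    pub event_type: String, // "RuleCreated" | "RuleEnabled" | ...
    pub rule_id: u64,
    pub trigger_type: String,
}

/// Active rules keyed by rule_id; in a real sensor the value would hold
/// trigger params, task handles, etc.
pub type ActiveRules = HashMap<u64, String>;

/// Returns true if the message was processed; false if it was filtered
/// out (wrong trigger type) or had an unknown event_type. Either way the
/// caller acknowledges the message.
pub fn handle_message(
    msg: &ControlMessage,
    allowed_trigger_types: &[&str],
    active: &mut ActiveRules,
) -> bool {
    // Filter: only process rules for this sensor's declared trigger types.
    if !allowed_trigger_types.contains(&msg.trigger_type.as_str()) {
        return false;
    }
    match msg.event_type.as_str() {
        // insert/remove are idempotent: re-delivery of the same message is a no-op.
        "RuleCreated" | "RuleEnabled" => {
            active.insert(msg.rule_id, msg.trigger_type.clone());
        }
        "RuleDisabled" | "RuleDeleted" => {
            active.remove(&msg.rule_id);
        }
        _ => return false, // unknown event_type
    }
    true
}
```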
## Event Emission
### Event Creation API
Sensors create events by POSTing to `/events`:
```http
POST /events
Authorization: Bearer {sensor_token}
Content-Type: application/json

{
  "trigger_type": "core.timer",
  "payload": {
    "timestamp": "2025-01-27T12:34:56Z",
    "scheduled_time": "2025-01-27T12:34:56Z"
  },
  "trigger_instance_id": "rule_123"
}
```
**Important**: Sensors can only emit events for trigger types declared in their token's `metadata.trigger_types`. The API will reject event creation requests for unauthorized trigger types with a `403 Forbidden` error.
### Event Payload Guidelines
- **Timestamp**: Always include event occurrence time
- **Context**: Include relevant context for rule evaluation
- **Size**: Keep payloads small (<1KB recommended, <10KB max)
- **Sensitive Data**: Never include passwords, tokens, or PII unless explicitly required
- **Trigger Type Match**: The `trigger_type` field must match one of the sensor's declared trigger types
### Error Handling
If event creation fails:
1. **Retry** with exponential backoff (3 attempts)
2. **Log Error** with full context
3. **Continue Operating** (don't crash on single event failure)
4. **Alert** if failure rate exceeds threshold (future)
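The retry policy can be sketched dependency-free. The one-second base delay and the cap are assumptions for illustration; the spec only mandates exponential backoff with 3 attempts:

```rust
use std::time::Duration;

// Base delay doubles per attempt: attempt 0 -> 1s, 1 -> 2s, 2 -> 4s.
// The cap (attempt 6, 64s) is an arbitrary safety bound.
pub fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs(1u64 << attempt.min(6))
}

/// Retries `op` up to `max_attempts` times. `on_wait` receives the delay
/// before each retry; a real sensor would sleep there, a test can record it.
pub fn retry<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut on_wait: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the error, let the caller log and
            // continue operating (don't crash on a single event failure).
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                on_wait(backoff_delay(attempt));
                attempt += 1;
            }
        }
    }
}
```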
## Sensor-Specific Behavior
Each sensor type implements trigger-specific logic. The sensor monitors external sources and translates them into Attune events.
### Example: Timer Sensor
**Trigger Type**: `core.timer`
**Parameters**:
```json
{
  "interval_seconds": 60
}
```
**Behavior**:
- Maintains a hash map of `rule_id -> tokio::task::JoinHandle`
- On `RuleCreated`/`RuleEnabled`: Start an async timer loop for the rule
- On `RuleDisabled`/`RuleDeleted`: Cancel the timer task for the rule
- Timer loop: Every interval, emit an event with current timestamp
**Event Payload**:
```json
{
  "timestamp": "2025-01-27T12:34:56Z",
  "scheduled_time": "2025-01-27T12:34:56Z"
}
```
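The instance map can be illustrated without an async runtime: std threads with a cancellation channel give the same `rule_id -> handle` bookkeeping that the reference sensor implements with `tokio` tasks. This is a sketch of the bookkeeping, not the reference implementation:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

pub struct TimerSensor {
    instances: HashMap<u64, Sender<()>>, // rule_id -> cancellation channel
}

impl TimerSensor {
    pub fn new() -> Self {
        TimerSensor { instances: HashMap::new() }
    }

    /// RuleCreated / RuleEnabled: start a timer loop for the rule.
    /// `emit` stands in for posting an event to the API.
    pub fn enable(&mut self, rule_id: u64, interval: Duration, emit: impl Fn(u64) + Send + 'static) {
        let (tx, rx) = mpsc::channel::<()>();
        self.instances.insert(rule_id, tx);
        thread::spawn(move || loop {
            // Wake every interval unless cancelled (signal or sender dropped).
            match rx.recv_timeout(interval) {
                Err(RecvTimeoutError::Timeout) => emit(rule_id), // fire the event
                _ => break, // cancelled
            }
        });
    }

    /// RuleDisabled / RuleDeleted: cancel the timer for the rule.
    pub fn disable(&mut self, rule_id: u64) {
        // Dropping the sender disconnects the channel, ending the loop.
        self.instances.remove(&rule_id);
    }

    pub fn active_count(&self) -> usize {
        self.instances.len()
    }
}
```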
### Example: Webhook Sensor
**Trigger Type**: `core.webhook`
**Parameters**:
```json
{
  "path": "/hooks/deployment",
  "method": "POST",
  "secret": "shared_secret_123"
}
```
**Behavior**:
- Runs an HTTP server listening on configured port
- On `RuleCreated`/`RuleEnabled`: Register a route handler for the webhook path
- On `RuleDisabled`/`RuleDeleted`: Unregister the route handler
- On incoming request: Validate secret, emit event with request body
**Event Payload**:
```json
{
  "timestamp": "2025-01-27T12:34:56Z",
  "method": "POST",
  "path": "/hooks/deployment",
  "headers": {"Content-Type": "application/json"},
  "body": {"status": "deployed"}
}
```
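Secret validation is a natural place for a constant-time comparison: a naive `==` on strings can short-circuit and leak the length of the matching prefix through timing. A dependency-free sketch (the length check still leaks length, which is generally acceptable for shared secrets):

```rust
// Constant-time comparison for the webhook shared secret: XOR-accumulate
// every byte instead of short-circuiting on the first mismatch.
pub fn secrets_match(provided: &str, expected: &str) -> bool {
    if provided.len() != expected.len() {
        return false;
    }
    provided
        .bytes()
        .zip(expected.bytes())
        .fold(0u8, |acc, (a, b)| acc | (a ^ b))
        == 0
}
```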
### Example: File Watcher Sensor
**Trigger Type**: `core.file_changed`
**Parameters**:
```json
{
  "path": "/var/log/app.log",
  "event_types": ["modified", "created"]
}
```
**Behavior**:
- Uses inotify/FSEvents/equivalent to watch file system
- On `RuleCreated`/`RuleEnabled`: Add watch for the specified path
- On `RuleDisabled`/`RuleDeleted`: Remove watch for the path
- On file system event: Emit event with file details
**Event Payload**:
```json
{
  "timestamp": "2025-01-27T12:34:56Z",
  "path": "/var/log/app.log",
  "event_type": "modified",
  "size": 12345
}
```
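Translating a raw watcher notification into this payload can be sketched with `format!`. A real sensor would use a watcher crate plus `serde_json`; this sketch assumes the field values contain no characters needing JSON escaping:

```rust
// Builds the file-watcher event payload shown above from the raw
// notification fields. Assumes values need no JSON escaping.
pub fn file_event_payload(timestamp: &str, path: &str, event_type: &str, size: u64) -> String {
    format!(
        r#"{{"timestamp":"{timestamp}","path":"{path}","event_type":"{event_type}","size":{size}}}"#
    )
}
```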
## Implementation Guidelines
### Language & Runtime
- **Recommended**: Rust (for consistency with Attune services)
- **Alternatives**: Python, Node.js, Go (if justified by use case)
- **Async I/O**: Required for scalability
### Dependencies
Sensors should use:
- **HTTP Client**: For API communication (e.g., `reqwest` in Rust)
- **RabbitMQ Client**: For message queue (e.g., `lapin` in Rust)
- **Async Runtime**: For concurrency (e.g., `tokio` in Rust)
- **JSON Parsing**: For message/event handling (e.g., `serde_json` in Rust)
- **Logging**: Structured logging (e.g., `tracing` in Rust)
### Error Handling
- **Panic/Crash**: Never panic on external input (messages, API responses)
- **Retry Logic**: Implement exponential backoff for transient failures
- **Circuit Breaker**: Consider circuit breaker for API calls (future)
- **Graceful Degradation**: Continue operating even if some rules fail
### Logging
Sensors MUST log:
- **Startup**: Configuration loaded, connections established
- **Rule Changes**: Rule added/removed/updated
- **Events Emitted**: Event type and rule_id (not full payload)
- **Errors**: All errors with context
- **Shutdown**: Graceful shutdown initiated and completed
Log format should be JSON for structured logging:
```json
{
  "timestamp": "2025-01-27T12:34:56Z",
  "level": "info",
  "sensor": "core.timer",
  "message": "Timer started for rule",
  "rule_id": 123,
  "interval_seconds": 5
}
```
### Testing
Sensors should include:
- **Unit Tests**: Test message parsing, event creation logic
- **Integration Tests**: Test against real RabbitMQ and API (test environment)
- **Mock Tests**: Test with mocked API/MQ for isolated testing
## Security Considerations
### Token Storage
- **Never Log Tokens**: Redact tokens in logs
- **Memory Only**: Keep tokens in memory, never write to disk
- **Automatic Refresh**: Refresh token when 80% of TTL elapsed (no restart required)
- **Hot-Reload**: Update in-memory token without interrupting operations
- **Refresh Failure Handling**: Log errors and retry with exponential backoff
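Redacting tokens before logging can be as simple as keeping a short prefix for correlation and masking the rest; the six-character prefix here is an arbitrary choice, not a spec requirement:

```rust
// Keeps the first six characters of a token for log correlation and
// masks the remainder. Very short tokens are fully masked.
pub fn redact_token(token: &str) -> String {
    if token.chars().count() <= 6 {
        return "***".to_string();
    }
    let prefix: String = token.chars().take(6).collect();
    format!("{prefix}***")
}
```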
### Input Validation
- **Validate All Inputs**: RabbitMQ messages, API responses
- **Sanitize Payloads**: Prevent injection attacks in event payloads
- **Rate Limiting**: Prevent resource exhaustion from malicious triggers
- **Trigger Type Enforcement**: API validates that sensor tokens can only create events for declared trigger types
### Network Security
- **TLS**: Use HTTPS for API calls in production
- **AMQPS**: Use TLS for RabbitMQ in production
- **Timeouts**: Set reasonable timeouts for all network calls
## Deployment
### Service Management
Sensors should be managed as system services:
- **systemd**: Linux deployments
- **launchd**: macOS deployments
- **Docker**: Container deployments
- **Kubernetes**: Orchestrated deployments (one pod per sensor type)
### Resource Limits
Recommended limits:
- **Memory**: 64-256 MB per sensor (depends on rule count)
- **CPU**: Minimal (<5% avg, spikes allowed)
- **Network**: Low bandwidth (<1 Mbps typical)
- **Disk**: Minimal (logs only)
### Monitoring
Sensors should expose metrics (future):
- **Rules Active**: Count of rules being monitored
- **Events Emitted**: Counter of events created
- **Errors**: Counter of errors by type
- **API Latency**: Histogram of API call durations
- **MQ Latency**: Histogram of message processing durations
## Compatibility
### Versioning
Sensors should:
- **Declare Version**: Include sensor version in logs and metrics
- **API Compatibility**: Support current API version
- **Message Compatibility**: Handle unknown fields gracefully
### Backwards Compatibility
When updating sensors:
- **Add Fields**: New message fields are optional
- **Deprecate Fields**: Old fields remain supported for 2+ versions
- **Breaking Changes**: Require major version bump and migration guide
## Appendix: Reference Implementation
See `attune/crates/sensor/` for the reference timer sensor implementation in Rust.
Key components:
- `src/main.rs` - Initialization and configuration
- `src/listener.rs` - RabbitMQ message handling
- `src/timer.rs` - Timer-specific logic
- `src/api_client.rs` - API communication
## Appendix: Message Queue Schema
### Rule Lifecycle Messages
**Exchange**: `attune` (topic exchange)
**RuleCreated**:
```json
{
  "event_type": "RuleCreated",
  "rule_id": 123,
  "rule_ref": "timer_every_5s",
  "trigger_type": "core.timer",
  "trigger_params": {"interval_seconds": 5},
  "enabled": true,
  "timestamp": "2025-01-27T12:34:56Z"
}
```
**RuleEnabled**:
```json
{
  "event_type": "RuleEnabled",
  "rule_id": 123,
  "trigger_type": "core.timer",
  "trigger_params": {"interval_seconds": 5},
  "timestamp": "2025-01-27T12:34:56Z"
}
```
**RuleDisabled**:
```json
{
  "event_type": "RuleDisabled",
  "rule_id": 123,
  "trigger_type": "core.timer",
  "timestamp": "2025-01-27T12:34:56Z"
}
```
**RuleDeleted**:
```json
{
  "event_type": "RuleDeleted",
  "rule_id": 123,
  "trigger_type": "core.timer",
  "timestamp": "2025-01-27T12:34:56Z"
}
```
## Appendix: API Token Management
### Creating Sensor Tokens
Tokens are created via the Attune API (admin only):
```http
POST /service-accounts
Authorization: Bearer {admin_token}
Content-Type: application/json

{
  "name": "sensor:core.timer",
  "description": "Timer sensor service account",
  "scope": "sensor",
  "ttl_days": 90
}
```
Response:
```json
{
  "identity_id": 123,
  "name": "sensor:core.timer",
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "expires_at": "2025-04-27T12:34:56Z"
}
```
### Token Scopes
| Scope | Permissions |
|-------|-------------|
| `sensor` | Create events, read rules/triggers |
| `action` | Read keys, update execution status (for action runners) |
| `admin` | Full access (for CLI, web UI) |
## Token Lifecycle Management
### Automatic Token Refresh
Sensors automatically refresh their own tokens without human intervention:
**Refresh Timing:**
- Tokens have 90-day TTL
- Sensors refresh when 80% of TTL elapsed (72 days)
- Calculation: `refresh_at = issued_at + (TTL * 0.8)`
**Refresh Process:**
1. Background task monitors token expiration
2. When refresh threshold reached, call `POST /auth/refresh` with current token
3. Receive new token with fresh 90-day TTL
4. Hot-load new token (update in-memory reference)
5. Old token remains valid until original expiration
6. Continue operations without interruption
**Implementation Pattern:**
```rust
// Calculate when to refresh (80% of TTL). Decode the token once and
// reuse the claims.
let claims = decode_jwt(&token)?;
let ttl_seconds = claims.exp - claims.iat;
let mut refresh_at = claims.iat + (ttl_seconds * 8 / 10);

// Spawn background refresh task
tokio::spawn(async move {
    loop {
        if current_timestamp() >= refresh_at {
            match api_client.refresh_token().await {
                Ok(new_token) => {
                    // Recompute the next refresh point from the new token
                    // so we don't refresh again on the next wake-up.
                    if let Ok(claims) = decode_jwt(&new_token) {
                        let ttl = claims.exp - claims.iat;
                        refresh_at = claims.iat + (ttl * 8 / 10);
                    }
                    update_token(new_token);
                    info!("Token refreshed successfully");
                }
                Err(e) => {
                    error!("Failed to refresh token: {}", e);
                    // Retry with exponential backoff
                }
            }
        }
        // Re-check hourly (std Duration has no from_hours constructor).
        sleep(Duration::from_secs(3600)).await;
    }
});
```
**Refresh Failure Handling:**
1. Log error with full context
2. Retry with exponential backoff (1min, 2min, 4min, 8min, max 1 hour)
3. Continue using old token (still valid until expiration)
4. Alert monitoring system after 3 consecutive failures
5. If old token expires before successful refresh, shut down gracefully
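The schedule in step 2 (1min, 2min, 4min, 8min, capped at 1 hour) maps directly to a small function:

```rust
use std::time::Duration;

// Refresh-failure backoff: delay doubles per consecutive failure,
// starting at one minute and capped at one hour.
pub fn refresh_backoff(consecutive_failures: u32) -> Duration {
    let minutes = 1u64 << consecutive_failures.min(10);
    Duration::from_secs((minutes * 60).min(3600))
}
```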
**Zero-Downtime:**
- Old token valid during refresh
- No service interruption
- Graceful degradation on failure
- No manual intervention required
### Token Expiration (Edge Case)
If automatic refresh fails and token expires:
1. API returns 401 Unauthorized
2. Sensor logs critical error
3. Sensor shuts down gracefully (stops accepting work, completes in-flight operations)
4. Operator must manually create new token and restart sensor
**This should rarely occur** if automatic refresh is working correctly.
## Future Enhancements
1. **Health Checks**: HTTP endpoint for liveness/readiness probes
2. **Metrics Export**: Prometheus-compatible metrics endpoint (including token refresh metrics)
3. **Dynamic Discovery**: Auto-discover available sensors from registry
4. **Sensor Scaling**: Support multiple instances per sensor type with work distribution
5. **Backpressure**: Handle event backlog when API is slow/unavailable
6. **Circuit Breaker**: Automatic failover when API is unreachable
7. **Sensor Plugins**: Dynamic loading of sensor implementations
8. **Configurable Refresh Threshold**: Allow custom refresh timing (e.g., 75%, 85%)
9. **Token Refresh Alerts**: Alert on refresh failures, not normal refresh events