re-uploading work

2026-02-04 17:46:30 -06:00
commit 3b14c65998
1388 changed files with 381262 additions and 0 deletions
--- a/docs/sensors/sensor-interface.md
+++ b/docs/sensors/sensor-interface.md
@@ -0,0 +1,607 @@
+# Sensor Interface Specification
+
+**Version:** 1.0  
+**Last Updated:** 2025-01-27  
+**Status:** Draft
+
+## Overview
+
+This document specifies the standard interface that all Attune sensors must implement. Sensors are lightweight, long-running daemon processes that monitor for events and emit them into the Attune platform. Each sensor type has exactly one process instance running at a time, and individual sensor instances are managed dynamically based on active rules.
+
+## Design Principles
+
+1. **Single Process Per Sensor Type**: Each sensor type (e.g., timer, webhook, file_watcher) runs as a single daemon process
+2. **Lightweight & Async**: Sensors should be event-driven and non-blocking
+3. **Rule-Driven Behavior**: Sensors manage multiple concurrent "instances" based on active rules
+4. **RabbitMQ Communication**: All control messages flow through RabbitMQ
+5. **API Integration**: Sensors use the Attune API to emit events and fetch configuration
+6. **Standard Authentication**: Sensors authenticate using transient API tokens
+7. **Graceful Lifecycle**: Sensors handle startup, shutdown, and dynamic reconfiguration
+
+## Sensor Lifecycle
+
+### 1. Initialization
+
+When a sensor starts, it must:
+
+1. **Read Configuration** from environment variables or stdin
+2. **Authenticate** with the Attune API using a transient token
+3. **Connect to RabbitMQ** and declare/bind to its control queue
+4. **Load Active Rules** from the API that use its trigger types
+5. **Start Monitoring** for each active rule
+6. **Signal Ready** (log startup completion)
+
+### 2. Runtime Operation
+
+During normal operation, a sensor:
+
+1. **Listens to RabbitMQ** for rule lifecycle messages (`RuleCreated`, `RuleEnabled`, `RuleDisabled`, `RuleDeleted`)
+2. **Monitors External Sources** (timers, webhooks, file systems, etc.) based on active rules
+3. **Emits Events** to the Attune API when trigger conditions are met
+4. **Handles Errors** gracefully without crashing
+5. **Reports Health** (periodic heartbeat/metrics - future)
+
+### 3. Shutdown
+
+On shutdown (SIGTERM/SIGINT), a sensor must:
+
+1. **Stop Accepting New Work** (stop listening to RabbitMQ)
+2. **Cancel Active Monitors** (stop timers, close connections)
+3. **Flush Pending Events** (send any buffered events to API)
+4. **Close Connections** (RabbitMQ, HTTP clients)
+5. **Exit Cleanly** with appropriate exit code
+
+## Configuration
+
+### Environment Variables
+
+Sensors MUST accept the following environment variables:
+
+| Variable | Required | Description | Example |
+|----------|----------|-------------|---------|
+| `ATTUNE_API_URL` | Yes | Base URL of Attune API | `http://localhost:8080` |
+| `ATTUNE_API_TOKEN` | Yes | Transient API token for authentication | `sensor_abc123...` |
+| `ATTUNE_SENSOR_REF` | Yes | Reference name of this sensor | `core.timer` |
+| `ATTUNE_MQ_URL` | Yes | RabbitMQ connection URL | `amqp://localhost:5672` |
+| `ATTUNE_MQ_EXCHANGE` | No | RabbitMQ exchange name | `attune` (default) |
+| `ATTUNE_LOG_LEVEL` | No | Logging verbosity | `info` (default) |
+
+### Alternative: stdin Configuration
+
+For containerized or orchestrated deployments, sensors MAY accept configuration as JSON on stdin:
+
+```json
+{
+  "api_url": "http://localhost:8080",
+  "api_token": "sensor_abc123...",
+  "sensor_ref": "core.timer",
+  "mq_url": "amqp://localhost:5672",
+  "mq_exchange": "attune",
+  "log_level": "info"
+}
+```
+
+If stdin is provided, it takes precedence over environment variables. The JSON must be a single line or complete object, followed by EOF or newline.
+
+## API Authentication: Transient Tokens
+
+### Token Requirements
+
+- **Type**: JWT with `service_account` identity type
+- **Scope**: Limited to sensor operations (create events, read rules)
+- **Lifetime**: Long-lived (90 days) and auto-expires
+- **Rotation**: Automatic refresh (sensor refreshes token when 80% of TTL elapsed)
+- **Zero-Downtime**: Hot-reload new tokens without restart
+
+### Token Format
+
+Sensors receive a standard JWT that includes:
+
+```json
+{
+  "sub": "sensor:core.timer",
+  "jti": "abc123def456",  // JWT ID for revocation tracking
+  "identity_id": 123,
+  "identity_type": "service_account",
+  "scope": "sensor",
+  "iat": 1738800000,  // Issued at
+  "exp": 1738886400,  // Expires in 24-72 hours (REQUIRED)
+  "metadata": {
+    "trigger_types": ["core.timer"]  // Enforced by API
+  }
+}
+```
+
+### API Endpoints Used by Sensors
+
+Sensors interact with the following API endpoints:
+
+| Method | Endpoint | Purpose | Auth |
+|--------|----------|---------|------|
+| GET | `/rules?trigger_type={ref}` | Fetch active rules for this sensor's triggers | Required |
+| GET | `/triggers/{ref}` | Fetch trigger metadata | Required |
+| POST | `/events` | Create new event | Required |
+| POST | `/auth/refresh` | Refresh token before expiration | Required |
+| GET | `/health` | Verify API connectivity | Optional |
+
+## RabbitMQ Integration
+
+### Queue Naming
+
+Each sensor binds to a dedicated queue for control messages:
+
+- **Queue Name**: `sensor.{sensor_ref}` (e.g., `sensor.core.timer`)
+- **Durable**: Yes
+- **Auto-Delete**: No
+- **Exclusive**: No
+
+### Exchange Binding
+
+Sensors bind their queue to the main exchange with routing keys:
+
+- `rule.created` - New rule created
+- `rule.enabled` - Existing rule enabled
+- `rule.disabled` - Existing rule disabled
+- `rule.deleted` - Rule deleted
+
+### Message Format
+
+All control messages follow this JSON schema:
+
+```json
+{
+  "event_type": "RuleCreated | RuleEnabled | RuleDisabled | RuleDeleted",
+  "rule_id": 123,
+  "trigger_type": "core.timer",
+  "trigger_params": {
+    "interval_seconds": 5
+  },
+  "timestamp": "2025-01-27T12:34:56Z"
+}
+```
+
+### Message Handling
+
+Sensors MUST:
+
+1. **Validate** messages against expected schema
+2. **Filter** messages to only process rules for their trigger types (based on token's `metadata.trigger_types`)
+3. **Acknowledge** messages after processing (or reject on unrecoverable error)
+4. **Handle Duplicates** idempotently (same rule_id + event_type)
+5. **Enforce Trigger Type Restrictions**: Only emit events for trigger types declared in the sensor's token metadata
+
+## Event Emission
+
+### Event Creation API
+
+Sensors create events by POSTing to `/events`:
+
+```http
+POST /events
+Authorization: Bearer {sensor_token}
+Content-Type: application/json
+
+{
+  "trigger_type": "core.timer",
+  "payload": {
+    "timestamp": "2025-01-27T12:34:56Z",
+    "scheduled_time": "2025-01-27T12:34:56Z"
+  },
+  "trigger_instance_id": "rule_123"
+}
+```
+
+**Important**: Sensors can only emit events for trigger types declared in their token's `metadata.trigger_types`. The API will reject event creation requests for unauthorized trigger types with a `403 Forbidden` error.
+
+### Event Payload Guidelines
+
+- **Timestamp**: Always include event occurrence time
+- **Context**: Include relevant context for rule evaluation
+- **Size**: Keep payloads small (<1KB recommended, <10KB max)
+- **Sensitive Data**: Never include passwords, tokens, or PII unless explicitly required
+- **Trigger Type Match**: The `trigger_type` field must match one of the sensor's declared trigger types
+
+### Error Handling
+
+If event creation fails:
+
+1. **Retry** with exponential backoff (3 attempts)
+2. **Log Error** with full context
+3. **Continue Operating** (don't crash on single event failure)
+4. **Alert** if failure rate exceeds threshold (future)
+
+## Sensor-Specific Behavior
+
+Each sensor type implements trigger-specific logic. The sensor monitors external sources and translates them into Attune events.
+
+### Example: Timer Sensor
+
+**Trigger Type**: `core.timer`
+
+**Parameters**:
+```json
+{
+  "interval_seconds": 60
+}
+```
+
+**Behavior**:
+- Maintains a hash map of `rule_id -> tokio::task::JoinHandle`
+- On `RuleCreated`/`RuleEnabled`: Start an async timer loop for the rule
+- On `RuleDisabled`/`RuleDeleted`: Cancel the timer task for the rule
+- Timer loop: Every interval, emit an event with current timestamp
+
+**Event Payload**:
+```json
+{
+  "timestamp": "2025-01-27T12:34:56Z",
+  "scheduled_time": "2025-01-27T12:34:56Z"
+}
+```
+
+### Example: Webhook Sensor
+
+**Trigger Type**: `core.webhook`
+
+**Parameters**:
+```json
+{
+  "path": "/hooks/deployment",
+  "method": "POST",
+  "secret": "shared_secret_123"
+}
+```
+
+**Behavior**:
+- Runs an HTTP server listening on configured port
+- On `RuleCreated`/`RuleEnabled`: Register a route handler for the webhook path
+- On `RuleDisabled`/`RuleDeleted`: Unregister the route handler
+- On incoming request: Validate secret, emit event with request body
+
+**Event Payload**:
+```json
+{
+  "timestamp": "2025-01-27T12:34:56Z",
+  "method": "POST",
+  "path": "/hooks/deployment",
+  "headers": {"Content-Type": "application/json"},
+  "body": {"status": "deployed"}
+}
+```
+
+### Example: File Watcher Sensor
+
+**Trigger Type**: `core.file_changed`
+
+**Parameters**:
+```json
+{
+  "path": "/var/log/app.log",
+  "event_types": ["modified", "created"]
+}
+```
+
+**Behavior**:
+- Uses inotify/FSEvents/equivalent to watch file system
+- On `RuleCreated`/`RuleEnabled`: Add watch for the specified path
+- On `RuleDisabled`/`RuleDeleted`: Remove watch for the path
+- On file system event: Emit event with file details
+
+**Event Payload**:
+```json
+{
+  "timestamp": "2025-01-27T12:34:56Z",
+  "path": "/var/log/app.log",
+  "event_type": "modified",
+  "size": 12345
+}
+```
+
+## Implementation Guidelines
+
+### Language & Runtime
+
+- **Recommended**: Rust (for consistency with Attune services)
+- **Alternatives**: Python, Node.js, Go (if justified by use case)
+- **Async I/O**: Required for scalability
+
+### Dependencies
+
+Sensors should use:
+
+- **HTTP Client**: For API communication (e.g., `reqwest` in Rust)
+- **RabbitMQ Client**: For message queue (e.g., `lapin` in Rust)
+- **Async Runtime**: For concurrency (e.g., `tokio` in Rust)
+- **JSON Parsing**: For message/event handling (e.g., `serde_json` in Rust)
+- **Logging**: Structured logging (e.g., `tracing` in Rust)
+
+### Error Handling
+
+- **Panic/Crash**: Never panic on external input (messages, API responses)
+- **Retry Logic**: Implement exponential backoff for transient failures
+- **Circuit Breaker**: Consider circuit breaker for API calls (future)
+- **Graceful Degradation**: Continue operating even if some rules fail
+
+### Logging
+
+Sensors MUST log:
+
+- **Startup**: Configuration loaded, connections established
+- **Rule Changes**: Rule added/removed/updated
+- **Events Emitted**: Event type and rule_id (not full payload)
+- **Errors**: All errors with context
+- **Shutdown**: Graceful shutdown initiated and completed
+
+Log format should be JSON for structured logging:
+
+```json
+{
+  "timestamp": "2025-01-27T12:34:56Z",
+  "level": "info",
+  "sensor": "core.timer",
+  "message": "Timer started for rule",
+  "rule_id": 123,
+  "interval_seconds": 5
+}
+```
+
+### Testing
+
+Sensors should include:
+
+- **Unit Tests**: Test message parsing, event creation logic
+- **Integration Tests**: Test against real RabbitMQ and API (test environment)
+- **Mock Tests**: Test with mocked API/MQ for isolated testing
+
+## Security Considerations
+
+### Token Storage
+
+- **Never Log Tokens**: Redact tokens in logs
+- **Memory Only**: Keep tokens in memory, never write to disk
+- **Automatic Refresh**: Refresh token when 80% of TTL elapsed (no restart required)
+- **Hot-Reload**: Update in-memory token without interrupting operations
+- **Refresh Failure Handling**: Log errors and retry with exponential backoff
+
+### Input Validation
+
+- **Validate All Inputs**: RabbitMQ messages, API responses
+- **Sanitize Payloads**: Prevent injection attacks in event payloads
+- **Rate Limiting**: Prevent resource exhaustion from malicious triggers
+- **Trigger Type Enforcement**: API validates that sensor tokens can only create events for declared trigger types
+
+### Network Security
+
+- **TLS**: Use HTTPS for API calls in production
+- **AMQPS**: Use TLS for RabbitMQ in production
+- **Timeouts**: Set reasonable timeouts for all network calls
+
+## Deployment
+
+### Service Management
+
+Sensors should be managed as system services:
+
+- **systemd**: Linux deployments
+- **launchd**: macOS deployments
+- **Docker**: Container deployments
+- **Kubernetes**: Orchestrated deployments (one pod per sensor type)
+
+### Resource Limits
+
+Recommended limits:
+
+- **Memory**: 64-256 MB per sensor (depends on rule count)
+- **CPU**: Minimal (<5% avg, spikes allowed)
+- **Network**: Low bandwidth (<1 Mbps typical)
+- **Disk**: Minimal (logs only)
+
+### Monitoring
+
+Sensors should expose metrics (future):
+
+- **Rules Active**: Count of rules being monitored
+- **Events Emitted**: Counter of events created
+- **Errors**: Counter of errors by type
+- **API Latency**: Histogram of API call durations
+- **MQ Latency**: Histogram of message processing durations
+
+## Compatibility
+
+### Versioning
+
+Sensors should:
+
+- **Declare Version**: Include sensor version in logs and metrics
+- **API Compatibility**: Support current API version
+- **Message Compatibility**: Handle unknown fields gracefully
+
+### Backwards Compatibility
+
+When updating sensors:
+
+- **Add Fields**: New message fields are optional
+- **Deprecate Fields**: Old fields remain supported for 2+ versions
+- **Breaking Changes**: Require major version bump and migration guide
+
+## Appendix: Reference Implementation
+
+See `attune/crates/sensor/` for the reference timer sensor implementation in Rust.
+
+Key components:
+
+- `src/main.rs` - Initialization and configuration
+- `src/listener.rs` - RabbitMQ message handling
+- `src/timer.rs` - Timer-specific logic
+- `src/api_client.rs` - API communication
+
+## Appendix: Message Queue Schema
+
+### Rule Lifecycle Messages
+
+**Exchange**: `attune` (topic exchange)
+
+**RuleCreated**:
+```json
+{
+  "event_type": "RuleCreated",
+  "rule_id": 123,
+  "rule_ref": "timer_every_5s",
+  "trigger_type": "core.timer",
+  "trigger_params": {"interval_seconds": 5},
+  "enabled": true,
+  "timestamp": "2025-01-27T12:34:56Z"
+}
+```
+
+**RuleEnabled**:
+```json
+{
+  "event_type": "RuleEnabled",
+  "rule_id": 123,
+  "trigger_type": "core.timer",
+  "trigger_params": {"interval_seconds": 5},
+  "timestamp": "2025-01-27T12:34:56Z"
+}
+```
+
+**RuleDisabled**:
+```json
+{
+  "event_type": "RuleDisabled",
+  "rule_id": 123,
+  "trigger_type": "core.timer",
+  "timestamp": "2025-01-27T12:34:56Z"
+}
+```
+
+**RuleDeleted**:
+```json
+{
+  "event_type": "RuleDeleted",
+  "rule_id": 123,
+  "trigger_type": "core.timer",
+  "timestamp": "2025-01-27T12:34:56Z"
+}
+```
+
+## Appendix: API Token Management
+
+### Creating Sensor Tokens
+
+Tokens are created via the Attune API (admin only):
+
+```http
+POST /service-accounts
+Authorization: Bearer {admin_token}
+Content-Type: application/json
+
+{
+  "name": "sensor:core.timer",
+  "description": "Timer sensor service account",
+  "scope": "sensor",
+  "ttl_days": 90
+}
+```
+
+Response:
+```json
+{
+  "identity_id": 123,
+  "name": "sensor:core.timer",
+  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
+  "expires_at": "2025-04-27T12:34:56Z"
+}
+```
+
+### Token Scopes
+
+| Scope | Permissions |
+|-------|-------------|
+| `sensor` | Create events, read rules/triggers |
+| `action` | Read keys, update execution status (for action runners) |
+| `admin` | Full access (for CLI, web UI) |
+
+## Token Lifecycle Management
+
+### Automatic Token Refresh
+
+Sensors automatically refresh their own tokens without human intervention:
+
+**Refresh Timing:**
+- Tokens have 90-day TTL
+- Sensors refresh when 80% of TTL elapsed (72 days)
+- Calculation: `refresh_at = issued_at + (TTL * 0.8)`
+
+**Refresh Process:**
+1. Background task monitors token expiration
+2. When refresh threshold reached, call `POST /auth/refresh` with current token
+3. Receive new token with fresh 90-day TTL
+4. Hot-load new token (update in-memory reference)
+5. Old token remains valid until original expiration
+6. Continue operations without interruption
+
+**Implementation Pattern:**
+```rust
+// Calculate when to refresh (80% of TTL)
+let token_exp = decode_jwt(&token)?.exp;
+let token_iat = decode_jwt(&token)?.iat;
+let ttl_seconds = token_exp - token_iat;
+let refresh_at = token_iat + (ttl_seconds * 8 / 10);
+
+// Spawn background refresh task
+tokio::spawn(async move {
+    loop {
+        let now = current_timestamp();
+        if now >= refresh_at {
+            match api_client.refresh_token().await {
+                Ok(new_token) => {
+                    update_token(new_token);
+                    info!("Token refreshed successfully");
+                }
+                Err(e) => {
+                    error!("Failed to refresh token: {}", e);
+                    // Retry with exponential backoff
+                }
+            }
+        }
+        sleep(Duration::from_hours(1)).await;
+    }
+});
+```
+
+**Refresh Failure Handling:**
+1. Log error with full context
+2. Retry with exponential backoff (1min, 2min, 4min, 8min, max 1 hour)
+3. Continue using old token (still valid until expiration)
+4. Alert monitoring system after 3 consecutive failures
+5. If old token expires before successful refresh, shut down gracefully
+
+**Zero-Downtime:**
+- Old token valid during refresh
+- No service interruption
+- Graceful degradation on failure
+- No manual intervention required
+
+### Token Expiration (Edge Case)
+
+If automatic refresh fails and token expires:
+
+1. API returns 401 Unauthorized
+2. Sensor logs critical error
+3. Sensor shuts down gracefully (stops accepting work, completes in-flight operations)
+4. Operator must manually create new token and restart sensor
+
+**This should rarely occur** if automatic refresh is working correctly.
+
+## Future Enhancements
+
+1. **Health Checks**: HTTP endpoint for liveness/readiness probes
+2. **Metrics Export**: Prometheus-compatible metrics endpoint (including token refresh metrics)
+3. **Dynamic Discovery**: Auto-discover available sensors from registry
+4. **Sensor Scaling**: Support multiple instances per sensor type with work distribution
+5. **Backpressure**: Handle event backlog when API is slow/unavailable
+6. **Circuit Breaker**: Automatic failover when API is unreachable
+7. **Sensor Plugins**: Dynamic loading of sensor implementations
+8. **Configurable Refresh Threshold**: Allow custom refresh timing (e.g., 75%, 85%)
+9. **Token Refresh Alerts**: Alert on refresh failures, not normal refresh events