attune/docs/sensors/sensor-interface.md

# Sensor Interface Specification
**Version:** 1.0
**Last Updated:** 2025-01-27
**Status:** Draft
## Overview
This document specifies the standard interface that all Attune sensors must implement. Sensors are lightweight, long-running daemon processes that monitor for events and emit them into the Attune platform. Each sensor type runs as exactly one process at a time; within that process, per-rule monitoring instances are created and torn down dynamically as rules change.
## Design Principles
1. **Single Process Per Sensor Type**: Each sensor type (e.g., timer, webhook, file_watcher) runs as a single daemon process
2. **Lightweight & Async**: Sensors should be event-driven and non-blocking
3. **Rule-Driven Behavior**: Sensors manage multiple concurrent "instances" based on active rules
4. **RabbitMQ Communication**: All control messages flow through RabbitMQ
5. **API Integration**: Sensors use the Attune API to emit events and fetch configuration
6. **Standard Authentication**: Sensors authenticate using transient API tokens
7. **Graceful Lifecycle**: Sensors handle startup, shutdown, and dynamic reconfiguration
## Sensor Lifecycle
### 1. Initialization
When a sensor starts, it must:
1. **Read Configuration** from environment variables or stdin
2. **Authenticate** with the Attune API using a transient token
3. **Connect to RabbitMQ** and declare/bind to its control queue
4. **Load Active Rules** from the API that use its trigger types
5. **Start Monitoring** for each active rule
6. **Signal Ready** (log startup completion)
### 2. Runtime Operation
During normal operation, a sensor:
1. **Listens to RabbitMQ** for rule lifecycle messages (`RuleCreated`, `RuleEnabled`, `RuleDisabled`, `RuleDeleted`)
2. **Monitors External Sources** (timers, webhooks, file systems, etc.) based on active rules
3. **Emits Events** to the Attune API when trigger conditions are met
4. **Handles Errors** gracefully without crashing
5. **Reports Health** (periodic heartbeat/metrics - future)
### 3. Shutdown
On shutdown (SIGTERM/SIGINT), a sensor must:
1. **Stop Accepting New Work** (stop listening to RabbitMQ)
2. **Cancel Active Monitors** (stop timers, close connections)
3. **Flush Pending Events** (send any buffered events to API)
4. **Close Connections** (RabbitMQ, HTTP clients)
5. **Exit Cleanly** with appropriate exit code
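The teardown order above can be sketched as a single driver function. `SensorState` and its step methods are illustrative stand-ins, not part of the spec; recording each step makes the required ordering easy to verify in tests:

```rust
// Illustrative shutdown driver; the step methods mirror the list above.
// `SensorState` is a hypothetical stand-in for a sensor's runtime state.
pub struct SensorState {
    pub steps_run: Vec<&'static str>, // records step order (useful in tests)
}

impl SensorState {
    pub fn new() -> Self {
        SensorState { steps_run: Vec::new() }
    }
    fn stop_mq_consumer(&mut self) { self.steps_run.push("stop_mq"); }
    fn cancel_monitors(&mut self) { self.steps_run.push("cancel_monitors"); }
    fn flush_pending_events(&mut self) { self.steps_run.push("flush_events"); }
    fn close_connections(&mut self) { self.steps_run.push("close_connections"); }
}

/// Runs the shutdown sequence in the order required by this spec and
/// returns the process exit code (0 = clean shutdown).
pub fn shutdown(state: &mut SensorState) -> i32 {
    state.stop_mq_consumer();     // 1. stop accepting new work
    state.cancel_monitors();      // 2. cancel active monitors
    state.flush_pending_events(); // 3. flush buffered events
    state.close_connections();    // 4. close RabbitMQ / HTTP connections
    0                             // 5. exit cleanly
}
```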
## Configuration
### Environment Variables
Sensors MUST accept the following environment variables:
| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| `ATTUNE_API_URL` | Yes | Base URL of Attune API | `http://localhost:8080` |
| `ATTUNE_API_TOKEN` | Yes | Transient API token for authentication | `sensor_abc123...` |
| `ATTUNE_SENSOR_REF` | Yes | Reference name of this sensor | `core.timer` |
| `ATTUNE_MQ_URL` | Yes | RabbitMQ connection URL | `amqp://localhost:5672` |
| `ATTUNE_MQ_EXCHANGE` | No | RabbitMQ exchange name | `attune` (default) |
| `ATTUNE_LOG_LEVEL` | No | Logging verbosity | `info` (default) |
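A minimal loader for this table might look like the following. The lookup function is injected so the logic can be exercised without touching the real process environment; the `Config` struct and its field names are assumptions for illustration, not part of the spec:

```rust
// Configuration loader for the variables in the table above. Required
// variables fail loudly; optional ones fall back to their defaults.
#[derive(Debug, PartialEq)]
pub struct Config {
    pub api_url: String,
    pub api_token: String,
    pub sensor_ref: String,
    pub mq_url: String,
    pub mq_exchange: String,
    pub log_level: String,
}

pub fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
    let required = |k: &str| get(k).ok_or_else(|| format!("missing required env var {k}"));
    Ok(Config {
        api_url: required("ATTUNE_API_URL")?,
        api_token: required("ATTUNE_API_TOKEN")?,
        sensor_ref: required("ATTUNE_SENSOR_REF")?,
        mq_url: required("ATTUNE_MQ_URL")?,
        mq_exchange: get("ATTUNE_MQ_EXCHANGE").unwrap_or_else(|| "attune".into()),
        log_level: get("ATTUNE_LOG_LEVEL").unwrap_or_else(|| "info".into()),
    })
}
```

In production the lookup would simply be `load_config(|k| std::env::var(k).ok())`.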
### Alternative: stdin Configuration
For containerized or orchestrated deployments, sensors MAY accept configuration as JSON on stdin:
```json
{
  "api_url": "http://localhost:8080",
  "api_token": "sensor_abc123...",
  "sensor_ref": "core.timer",
  "mq_url": "amqp://localhost:5672",
  "mq_exchange": "attune",
  "log_level": "info"
}
```
If stdin configuration is provided, it takes precedence over environment variables. The JSON must be a complete object, terminated by a newline or EOF.
## API Authentication: Transient Tokens
### Token Requirements
- **Type**: JWT with `service_account` identity type
- **Scope**: Limited to sensor operations (create events, read rules)
- **Lifetime**: Long-lived (90 days); expires automatically
- **Rotation**: Automatic refresh (sensor refreshes token when 80% of TTL elapsed)
- **Zero-Downtime**: Hot-reload new tokens without restart
### Token Format
Sensors receive a standard JWT that includes:
```json
{
  "sub": "sensor:core.timer",
  "jti": "abc123def456", // JWT ID for revocation tracking
  "identity_id": 123,
  "identity_type": "service_account",
  "scope": "sensor",
  "iat": 1738800000, // Issued at
  "exp": 1746576000, // Expiration, 90 days after issue (REQUIRED)
  "metadata": {
    "trigger_types": ["core.timer"] // Enforced by API
  }
}
```
### API Endpoints Used by Sensors
Sensors interact with the following API endpoints:
| Method | Endpoint | Purpose | Auth |
|--------|----------|---------|------|
| GET | `/rules?trigger_type={ref}` | Fetch active rules for this sensor's triggers | Required |
| GET | `/triggers/{ref}` | Fetch trigger metadata | Required |
| POST | `/events` | Create new event | Required |
| POST | `/auth/refresh` | Refresh token before expiration | Required |
| GET | `/health` | Verify API connectivity | Optional |
## RabbitMQ Integration
### Queue Naming
Each sensor binds to a dedicated queue for control messages:
- **Queue Name**: `sensor.{sensor_ref}` (e.g., `sensor.core.timer`)
- **Durable**: Yes
- **Auto-Delete**: No
- **Exclusive**: No
### Exchange Binding
Sensors bind their queue to the main exchange with routing keys:
- `rule.created` - New rule created
- `rule.enabled` - Existing rule enabled
- `rule.disabled` - Existing rule disabled
- `rule.deleted` - Rule deleted
### Message Format
All control messages follow this JSON schema:
```json
{
  "event_type": "RuleCreated | RuleEnabled | RuleDisabled | RuleDeleted",
  "rule_id": 123,
  "trigger_type": "core.timer",
  "trigger_params": {
    "interval_seconds": 5
  },
  "timestamp": "2025-01-27T12:34:56Z"
}
```
### Message Handling
Sensors MUST:
1. **Validate** messages against expected schema
2. **Filter** messages to only process rules for their trigger types (based on token's `metadata.trigger_types`)
3. **Acknowledge** messages after processing (or reject on unrecoverable error)
4. **Handle Duplicates** idempotently (same rule_id + event_type)
5. **Enforce Trigger Type Restrictions**: Only emit events for trigger types declared in the sensor's token metadata
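The handling rules above can be sketched as a pure function over a struct mirroring the message schema. A real sensor would deserialize the MQ payload (e.g. with `serde_json`), which is elided here; `ControlMessage` and `ActiveRules` are illustrative names:

```rust
use std::collections::HashMap;

// Mirrors the control message JSON schema (trigger_params elided).
#[derive(Debug, Clone)]
pub struct ControlMessage {
    pub event_type: String, // "RuleCreated" | "RuleEnabled" | ...
    pub rule_id: u64,
    pub trigger_type: String,
}

/// Active rules keyed by rule_id; in a real sensor the value would hold
/// trigger params, task handles, etc.
pub type ActiveRules = HashMap<u64, String>;

/// Returns true if the message was processed; false if it was filtered
/// out (wrong trigger type) or had an unknown event_type. Either way the
/// caller acknowledges the message.
pub fn handle_message(
    msg: &ControlMessage,
    allowed_trigger_types: &[&str],
    active: &mut ActiveRules,
) -> bool {
    // Filter: only process rules for this sensor's declared trigger types.
    if !allowed_trigger_types.contains(&msg.trigger_type.as_str()) {
        return false;
    }
    match msg.event_type.as_str() {
        // insert/remove are idempotent: re-delivery of the same message is a no-op.
        "RuleCreated" | "RuleEnabled" => {
            active.insert(msg.rule_id, msg.trigger_type.clone());
        }
        "RuleDisabled" | "RuleDeleted" => {
            active.remove(&msg.rule_id);
        }
        _ => return false, // unknown event_type
    }
    true
}
```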
## Event Emission
### Event Creation API
Sensors create events by POSTing to `/events`:
```http
POST /events
Authorization: Bearer {sensor_token}
Content-Type: application/json

{
  "trigger_type": "core.timer",
  "payload": {
    "timestamp": "2025-01-27T12:34:56Z",
    "scheduled_time": "2025-01-27T12:34:56Z"
  },
  "trigger_instance_id": "rule_123"
}
```
**Important**: Sensors can only emit events for trigger types declared in their token's `metadata.trigger_types`. The API will reject event creation requests for unauthorized trigger types with a `403 Forbidden` error.
### Event Payload Guidelines
- **Timestamp**: Always include event occurrence time
- **Context**: Include relevant context for rule evaluation
- **Size**: Keep payloads small (<1KB recommended, <10KB max)
- **Sensitive Data**: Never include passwords, tokens, or PII unless explicitly required
- **Trigger Type Match**: The `trigger_type` field must match one of the sensor's declared trigger types
### Error Handling
If event creation fails:
1. **Retry** with exponential backoff (3 attempts)
2. **Log Error** with full context
3. **Continue Operating** (don't crash on single event failure)
4. **Alert** if failure rate exceeds threshold (future)
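The retry policy can be sketched dependency-free. The one-second base delay and the cap are assumptions for illustration; the spec only mandates exponential backoff with 3 attempts:

```rust
use std::time::Duration;

// Base delay doubles per attempt: attempt 0 -> 1s, 1 -> 2s, 2 -> 4s.
// The cap (attempt 6, 64s) is an arbitrary safety bound.
pub fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs(1u64 << attempt.min(6))
}

/// Retries `op` up to `max_attempts` times. `on_wait` receives the delay
/// before each retry; a real sensor would sleep there, a test can record it.
pub fn retry<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut on_wait: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the error, let the caller log and
            // continue operating (don't crash on a single event failure).
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                on_wait(backoff_delay(attempt));
                attempt += 1;
            }
        }
    }
}
```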
## Sensor-Specific Behavior
Each sensor type implements trigger-specific logic. The sensor monitors external sources and translates them into Attune events.
### Example: Timer Sensor
**Trigger Type**: `core.timer`
**Parameters**:
```json
{
  "interval_seconds": 60
}
```
**Behavior**:
- Maintains a hash map of `rule_id -> tokio::task::JoinHandle`
- On `RuleCreated`/`RuleEnabled`: Start an async timer loop for the rule
- On `RuleDisabled`/`RuleDeleted`: Cancel the timer task for the rule
- Timer loop: Every interval, emit an event with current timestamp
**Event Payload**:
```json
{
  "timestamp": "2025-01-27T12:34:56Z",
  "scheduled_time": "2025-01-27T12:34:56Z"
}
```
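The instance map can be illustrated without an async runtime: std threads with a cancellation channel give the same `rule_id -> handle` bookkeeping that the reference sensor implements with `tokio` tasks. This is a sketch of the bookkeeping, not the reference implementation:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

pub struct TimerSensor {
    instances: HashMap<u64, Sender<()>>, // rule_id -> cancellation channel
}

impl TimerSensor {
    pub fn new() -> Self {
        TimerSensor { instances: HashMap::new() }
    }

    /// RuleCreated / RuleEnabled: start a timer loop for the rule.
    /// `emit` stands in for posting an event to the API.
    pub fn enable(&mut self, rule_id: u64, interval: Duration, emit: impl Fn(u64) + Send + 'static) {
        let (tx, rx) = mpsc::channel::<()>();
        self.instances.insert(rule_id, tx);
        thread::spawn(move || loop {
            // Wake every interval unless cancelled (signal or sender dropped).
            match rx.recv_timeout(interval) {
                Err(RecvTimeoutError::Timeout) => emit(rule_id), // fire the event
                _ => break, // cancelled
            }
        });
    }

    /// RuleDisabled / RuleDeleted: cancel the timer for the rule.
    pub fn disable(&mut self, rule_id: u64) {
        // Dropping the sender disconnects the channel, ending the loop.
        self.instances.remove(&rule_id);
    }

    pub fn active_count(&self) -> usize {
        self.instances.len()
    }
}
```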
### Example: Webhook Sensor
**Trigger Type**: `core.webhook`
**Parameters**:
```json
{
  "path": "/hooks/deployment",
  "method": "POST",
  "secret": "shared_secret_123"
}
```
**Behavior**:
- Runs an HTTP server listening on configured port
- On `RuleCreated`/`RuleEnabled`: Register a route handler for the webhook path
- On `RuleDisabled`/`RuleDeleted`: Unregister the route handler
- On incoming request: Validate secret, emit event with request body
**Event Payload**:
```json
{
  "timestamp": "2025-01-27T12:34:56Z",
  "method": "POST",
  "path": "/hooks/deployment",
  "headers": {"Content-Type": "application/json"},
  "body": {"status": "deployed"}
}
```
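Secret validation is a natural place for a constant-time comparison: a naive `==` on strings can short-circuit and leak the length of the matching prefix through timing. A dependency-free sketch (the length check still leaks length, which is generally acceptable for shared secrets):

```rust
// Constant-time comparison for the webhook shared secret: XOR-accumulate
// every byte instead of short-circuiting on the first mismatch.
pub fn secrets_match(provided: &str, expected: &str) -> bool {
    if provided.len() != expected.len() {
        return false;
    }
    provided
        .bytes()
        .zip(expected.bytes())
        .fold(0u8, |acc, (a, b)| acc | (a ^ b))
        == 0
}
```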
### Example: File Watcher Sensor
**Trigger Type**: `core.file_changed`
**Parameters**:
```json
{
  "path": "/var/log/app.log",
  "event_types": ["modified", "created"]
}
```
**Behavior**:
- Uses inotify/FSEvents/equivalent to watch file system
- On `RuleCreated`/`RuleEnabled`: Add watch for the specified path
- On `RuleDisabled`/`RuleDeleted`: Remove watch for the path
- On file system event: Emit event with file details
**Event Payload**:
```json
{
  "timestamp": "2025-01-27T12:34:56Z",
  "path": "/var/log/app.log",
  "event_type": "modified",
  "size": 12345
}
```
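Translating a raw watcher notification into this payload can be sketched with `format!`. A real sensor would use a watcher crate plus `serde_json`; this sketch assumes the field values contain no characters needing JSON escaping:

```rust
// Builds the file-watcher event payload shown above from the raw
// notification fields. Assumes values need no JSON escaping.
pub fn file_event_payload(timestamp: &str, path: &str, event_type: &str, size: u64) -> String {
    format!(
        r#"{{"timestamp":"{timestamp}","path":"{path}","event_type":"{event_type}","size":{size}}}"#
    )
}
```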
## Implementation Guidelines
### Language & Runtime
- **Recommended**: Rust (for consistency with Attune services)
- **Alternatives**: Python, Node.js, Go (if justified by use case)
- **Async I/O**: Required for scalability
### Dependencies
Sensors should use:
- **HTTP Client**: For API communication (e.g., `reqwest` in Rust)
- **RabbitMQ Client**: For message queue (e.g., `lapin` in Rust)
- **Async Runtime**: For concurrency (e.g., `tokio` in Rust)
- **JSON Parsing**: For message/event handling (e.g., `serde_json` in Rust)
- **Logging**: Structured logging (e.g., `tracing` in Rust)
### Error Handling
- **Panic/Crash**: Never panic on external input (messages, API responses)
- **Retry Logic**: Implement exponential backoff for transient failures
- **Circuit Breaker**: Consider circuit breaker for API calls (future)
- **Graceful Degradation**: Continue operating even if some rules fail
### Logging
Sensors MUST log:
- **Startup**: Configuration loaded, connections established
- **Rule Changes**: Rule added/removed/updated
- **Events Emitted**: Event type and rule_id (not full payload)
- **Errors**: All errors with context
- **Shutdown**: Graceful shutdown initiated and completed
Log format should be JSON for structured logging:
```json
{
  "timestamp": "2025-01-27T12:34:56Z",
  "level": "info",
  "sensor": "core.timer",
  "message": "Timer started for rule",
  "rule_id": 123,
  "interval_seconds": 5
}
```
### Testing
Sensors should include:
- **Unit Tests**: Test message parsing, event creation logic
- **Integration Tests**: Test against real RabbitMQ and API (test environment)
- **Mock Tests**: Test with mocked API/MQ for isolated testing
## Security Considerations
### Token Storage
- **Never Log Tokens**: Redact tokens in logs
- **Memory Only**: Keep tokens in memory, never write to disk
- **Automatic Refresh**: Refresh token when 80% of TTL elapsed (no restart required)
- **Hot-Reload**: Update in-memory token without interrupting operations
- **Refresh Failure Handling**: Log errors and retry with exponential backoff
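Redacting tokens before logging can be as simple as keeping a short prefix for correlation and masking the rest; the six-character prefix here is an arbitrary choice, not a spec requirement:

```rust
// Keeps the first six characters of a token for log correlation and
// masks the remainder. Very short tokens are fully masked.
pub fn redact_token(token: &str) -> String {
    if token.chars().count() <= 6 {
        return "***".to_string();
    }
    let prefix: String = token.chars().take(6).collect();
    format!("{prefix}***")
}
```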
### Input Validation
- **Validate All Inputs**: RabbitMQ messages, API responses
- **Sanitize Payloads**: Prevent injection attacks in event payloads
- **Rate Limiting**: Prevent resource exhaustion from malicious triggers
- **Trigger Type Enforcement**: API validates that sensor tokens can only create events for declared trigger types
### Network Security
- **TLS**: Use HTTPS for API calls in production
- **AMQPS**: Use TLS for RabbitMQ in production
- **Timeouts**: Set reasonable timeouts for all network calls
## Deployment
### Service Management
Sensors should be managed as system services:
- **systemd**: Linux deployments
- **launchd**: macOS deployments
- **Docker**: Container deployments
- **Kubernetes**: Orchestrated deployments (one pod per sensor type)
### Resource Limits
Recommended limits:
- **Memory**: 64-256 MB per sensor (depends on rule count)
- **CPU**: Minimal (<5% avg, spikes allowed)
- **Network**: Low bandwidth (<1 Mbps typical)
- **Disk**: Minimal (logs only)
### Monitoring
Sensors should expose metrics (future):
- **Rules Active**: Count of rules being monitored
- **Events Emitted**: Counter of events created
- **Errors**: Counter of errors by type
- **API Latency**: Histogram of API call durations
- **MQ Latency**: Histogram of message processing durations
## Compatibility
### Versioning
Sensors should:
- **Declare Version**: Include sensor version in logs and metrics
- **API Compatibility**: Support current API version
- **Message Compatibility**: Handle unknown fields gracefully
### Backwards Compatibility
When updating sensors:
- **Add Fields**: New message fields are optional
- **Deprecate Fields**: Old fields remain supported for 2+ versions
- **Breaking Changes**: Require major version bump and migration guide
## Appendix: Reference Implementation
See `attune/crates/sensor/` for the reference timer sensor implementation in Rust.
Key components:
- `src/main.rs` - Initialization and configuration
- `src/listener.rs` - RabbitMQ message handling
- `src/timer.rs` - Timer-specific logic
- `src/api_client.rs` - API communication
## Appendix: Message Queue Schema
### Rule Lifecycle Messages
**Exchange**: `attune` (topic exchange)
**RuleCreated**:
```json
{
  "event_type": "RuleCreated",
  "rule_id": 123,
  "rule_ref": "timer_every_5s",
  "trigger_type": "core.timer",
  "trigger_params": {"interval_seconds": 5},
  "enabled": true,
  "timestamp": "2025-01-27T12:34:56Z"
}
```
**RuleEnabled**:
```json
{
  "event_type": "RuleEnabled",
  "rule_id": 123,
  "trigger_type": "core.timer",
  "trigger_params": {"interval_seconds": 5},
  "timestamp": "2025-01-27T12:34:56Z"
}
```
**RuleDisabled**:
```json
{
  "event_type": "RuleDisabled",
  "rule_id": 123,
  "trigger_type": "core.timer",
  "timestamp": "2025-01-27T12:34:56Z"
}
```
**RuleDeleted**:
```json
{
  "event_type": "RuleDeleted",
  "rule_id": 123,
  "trigger_type": "core.timer",
  "timestamp": "2025-01-27T12:34:56Z"
}
```
## Appendix: API Token Management
### Creating Sensor Tokens
Tokens are created via the Attune API (admin only):
```http
POST /service-accounts
Authorization: Bearer {admin_token}
Content-Type: application/json

{
  "name": "sensor:core.timer",
  "description": "Timer sensor service account",
  "scope": "sensor",
  "ttl_days": 90
}
```
Response:
```json
{
  "identity_id": 123,
  "name": "sensor:core.timer",
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "expires_at": "2025-04-27T12:34:56Z"
}
```
### Token Scopes
| Scope | Permissions |
|-------|-------------|
| `sensor` | Create events, read rules/triggers |
| `action` | Read keys, update execution status (for action runners) |
| `admin` | Full access (for CLI, web UI) |
## Token Lifecycle Management
### Automatic Token Refresh
Sensors automatically refresh their own tokens without human intervention:
**Refresh Timing:**
- Tokens have 90-day TTL
- Sensors refresh when 80% of TTL elapsed (72 days)
- Calculation: `refresh_at = issued_at + (TTL * 0.8)`
**Refresh Process:**
1. Background task monitors token expiration
2. When refresh threshold reached, call `POST /auth/refresh` with current token
3. Receive new token with fresh 90-day TTL
4. Hot-load new token (update in-memory reference)
5. Old token remains valid until original expiration
6. Continue operations without interruption
**Implementation Pattern:**
```rust
// Calculate when to refresh (80% of TTL). Decode the token once and
// reuse the claims.
let claims = decode_jwt(&token)?;
let ttl_seconds = claims.exp - claims.iat;
let mut refresh_at = claims.iat + (ttl_seconds * 8 / 10);

// Spawn background refresh task
tokio::spawn(async move {
    loop {
        if current_timestamp() >= refresh_at {
            match api_client.refresh_token().await {
                Ok(new_token) => {
                    // Recompute the next refresh point from the new token
                    // so we don't refresh again on the next wake-up.
                    if let Ok(claims) = decode_jwt(&new_token) {
                        let ttl = claims.exp - claims.iat;
                        refresh_at = claims.iat + (ttl * 8 / 10);
                    }
                    update_token(new_token);
                    info!("Token refreshed successfully");
                }
                Err(e) => {
                    error!("Failed to refresh token: {}", e);
                    // Retry with exponential backoff
                }
            }
        }
        // Re-check hourly (std Duration has no from_hours constructor).
        sleep(Duration::from_secs(3600)).await;
    }
});
```
**Refresh Failure Handling:**
1. Log error with full context
2. Retry with exponential backoff (1min, 2min, 4min, 8min, max 1 hour)
3. Continue using old token (still valid until expiration)
4. Alert monitoring system after 3 consecutive failures
5. If old token expires before successful refresh, shut down gracefully
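The schedule in step 2 (1min, 2min, 4min, 8min, capped at 1 hour) maps directly to a small function:

```rust
use std::time::Duration;

// Refresh-failure backoff: delay doubles per consecutive failure,
// starting at one minute and capped at one hour.
pub fn refresh_backoff(consecutive_failures: u32) -> Duration {
    let minutes = 1u64 << consecutive_failures.min(10);
    Duration::from_secs((minutes * 60).min(3600))
}
```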
**Zero-Downtime:**
- Old token valid during refresh
- No service interruption
- Graceful degradation on failure
- No manual intervention required
### Token Expiration (Edge Case)
If automatic refresh fails and token expires:
1. API returns 401 Unauthorized
2. Sensor logs critical error
3. Sensor shuts down gracefully (stops accepting work, completes in-flight operations)
4. Operator must manually create new token and restart sensor
**This should rarely occur** if automatic refresh is working correctly.
## Future Enhancements
1. **Health Checks**: HTTP endpoint for liveness/readiness probes
2. **Metrics Export**: Prometheus-compatible metrics endpoint (including token refresh metrics)
3. **Dynamic Discovery**: Auto-discover available sensors from registry
4. **Sensor Scaling**: Support multiple instances per sensor type with work distribution
5. **Backpressure**: Handle event backlog when API is slow/unavailable
6. **Circuit Breaker**: Automatic failover when API is unreachable
7. **Sensor Plugins**: Dynamic loading of sensor implementations
8. **Configurable Refresh Threshold**: Allow custom refresh timing (e.g., 75%, 85%)
9. **Token Refresh Alerts**: Alert on refresh failures, not normal refresh events