attune/docs/authentication/service-accounts.md

# Service Accounts and Transient API Tokens

**Version:** 1.0
**Last Updated:** 2025-01-27
**Status:** Draft

## Overview

Service accounts provide programmatic access to the Attune API for sensors, action executions, and other automated processes. Unlike user accounts, service accounts:

- Have no password (token-based authentication only)
- Have limited scopes (principle of least privilege)
- Can be short-lived or long-lived depending on use case
- Are not tied to a human user
- Can be easily revoked without affecting user access

## Use Cases

1. **Sensors**: Long-lived tokens for sensor daemons to emit events
2. **Action Executions**: Short-lived tokens scoped to a single execution
3. **CLI Tools**: User-scoped tokens for command-line operations
4. **Webhooks**: Tokens for external systems to trigger actions
5. **Monitoring**: Tokens for health checks and metrics collection

## Token Types

### 1. Sensor Tokens

**Purpose**: Authentication for sensor daemon processes

**Characteristics**:
- **Lifetime**: Long-lived (90 days, auto-expires)
- **Scope**: `sensor`
- **Permissions**: Create events, read rules/triggers for specific trigger types
- **Revocable**: Yes (manual revocation via API)
- **Renewable**: Yes (automatic refresh via API, no restart required)
- **Rotation**: Automatic (the sensor refreshes its token once 80% of the TTL has elapsed)

**Example Usage**:
```bash
ATTUNE_API_TOKEN=sensor_abc123... ./attune-sensor --sensor-ref core.timer
```

### 2. Action Execution Tokens

**Purpose**: Authentication for action scripts during execution

**Characteristics**:
- **Lifetime**: Short-lived (matches execution timeout, typically 5-60 minutes)
- **Scope**: `action_execution`
- **Permissions**: Read keys, update execution status, limited to specific execution_id
- **Revocable**: Yes (auto-revoked on execution completion or timeout)
- **Renewable**: No (bound to a single execution; expires when the execution completes or times out)
- **Auto-Cleanup**: Token revocation records are auto-deleted after expiration

**Example Usage**:
```python
# Action script receives token via environment variable
import os
import requests

api_url = os.environ['ATTUNE_API_URL']
api_token = os.environ['ATTUNE_API_TOKEN']
execution_id = os.environ['ATTUNE_EXECUTION_ID']

# Fetch encrypted key
response = requests.get(
    f"{api_url}/keys/myapp.api_key",
    headers={"Authorization": f"Bearer {api_token}"},
    timeout=10,
)
response.raise_for_status()  # Fail on 401/403 (e.g. expired or out-of-scope token)
secret = response.json()['value']
```

### 3. User CLI Tokens

**Purpose**: Authentication for CLI tools on behalf of a user

**Characteristics**:
- **Lifetime**: Medium-lived (7-30 days)
- **Scope**: `user`
- **Permissions**: Full user permissions (RBAC-based)
- **Revocable**: Yes
- **Renewable**: Yes (via refresh token)

**Example Usage**:
```bash
attune auth login  # Stores token in ~/.attune/token
attune action execute core.echo --param message="Hello"
```

### 4. Webhook Tokens

**Purpose**: Authentication for external systems calling Attune webhooks

**Characteristics**:
- **Lifetime**: Long-lived (90-365 days, auto-expires)
- **Scope**: `webhook`
- **Permissions**: Trigger specific actions or create events
- **Revocable**: Yes
- **Renewable**: Yes (generate new token before expiration)
- **Rotation**: Recommended every 90 days

**Example Usage**:
```bash
curl -X POST https://attune.example.com/api/webhooks/deploy \
  -H "Authorization: Bearer webhook_xyz789..." \
  -H "Content-Type: application/json" \
  -d '{"status": "deployed"}'
```

## Token Scopes and Permissions

| Scope | Permissions | Use Case |
|-------|-------------|----------|
| `admin` | Full access to all resources | System administrators, web UI |
| `user` | RBAC-based permissions | CLI tools, user sessions |
| `sensor` | Create events, read rules/triggers | Sensor daemons |
| `action_execution` | Read keys, update execution (scoped to execution_id) | Action scripts |
| `webhook` | Create events, trigger actions | External integrations |
| `readonly` | Read-only access to all resources | Monitoring, auditing |

## Database Schema

### Identity Table

Service accounts are stored in the `identity` table with `identity_type = 'service_account'`:

```sql
CREATE TABLE identity (
    id BIGSERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE,
    identity_type identity_type NOT NULL,  -- 'user' or 'service_account'
    email VARCHAR(255),  -- NULL for service accounts
    password_hash VARCHAR(255),  -- NULL for service accounts
    metadata JSONB DEFAULT '{}',
    created TIMESTAMPTZ DEFAULT NOW(),
    updated TIMESTAMPTZ DEFAULT NOW()
);
```

Service account metadata includes:
```json
{
  "scope": "sensor",
  "description": "Timer sensor service account",
  "created_by": 1,  // identity_id of creator
  "expires_at": "2025-04-27T12:34:56Z",
  "trigger_types": ["core.timer"],  // For sensor scope
  "execution_id": 123  // For action_execution scope
}
```

### Token Storage

Tokens are **not** stored in the database (they are stateless JWTs). However, revocation is tracked:

```sql
CREATE TABLE token_revocation (
    id BIGSERIAL PRIMARY KEY,
    identity_id BIGINT NOT NULL REFERENCES identity(id) ON DELETE CASCADE,
    token_jti VARCHAR(255) NOT NULL,  -- JWT ID (jti claim)
    token_exp TIMESTAMPTZ NOT NULL,   -- Token expiration (from exp claim)
    revoked_at TIMESTAMPTZ DEFAULT NOW(),
    revoked_by BIGINT REFERENCES identity(id),
    reason VARCHAR(500),
    UNIQUE(token_jti)
);

CREATE INDEX idx_token_revocation_jti ON token_revocation(token_jti);
CREATE INDEX idx_token_revocation_identity ON token_revocation(identity_id);
CREATE INDEX idx_token_revocation_exp ON token_revocation(token_exp);  -- For cleanup queries
```

## JWT Token Format

### Claims

All service account tokens include these claims:

```json
{
  "sub": "sensor:core.timer",  // Subject: "type:name"
  "jti": "abc123...",  // JWT ID (for revocation)
  "iat": 1706356496,  // Issued at (Unix timestamp)
  "exp": 1714132496,  // Expires at (Unix timestamp)
  "identity_id": 123,
  "identity_type": "service_account",
  "scope": "sensor",
  "metadata": {
    "trigger_types": ["core.timer"]
  }
}
```
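
As a rough illustration of how these claims become a signed token, the sketch below mints an HS256 JWT using only the Python standard library. HS256 is inferred from the example token header shown elsewhere in this document (`eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9` decodes to `{"alg":"HS256","typ":"JWT"}`). The function name `mint_token` and its parameters are hypothetical, not part of the Attune codebase.

```python
import base64
import hashlib
import hmac
import json
import time
import uuid
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_token(secret: bytes, identity_id: int, subject: str, scope: str,
               ttl_seconds: int, metadata: dict,
               now: Optional[int] = None) -> str:
    """Build and sign a service account token (hypothetical helper)."""
    issued_at = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "sub": subject,
        "jti": uuid.uuid4().hex,          # unique ID, needed for revocation
        "iat": issued_at,
        "exp": issued_at + ttl_seconds,   # every token must expire
        "identity_id": identity_id,
        "identity_type": "service_account",
        "scope": scope,
        "metadata": metadata,
    }
    # Signing input is base64url(header) + "." + base64url(claims)
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"
```

A 90-day sensor token would be minted with `ttl_seconds=90 * 86400`. In production a maintained JWT library is usually a safer choice than hand-rolled signing; the sketch only shows which fields end up where.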

### Scope-Specific Claims

**Sensor tokens** (restricted to declared trigger types):
```json
{
  "scope": "sensor",
  "metadata": {
    "trigger_types": ["core.timer", "core.interval"]
  }
}
```

The API enforces that sensors can only create events for trigger types listed in `metadata.trigger_types`. Attempting to create an event for an unauthorized trigger type will result in a `403 Forbidden` error.

**Action execution tokens**:
```json
{
  "scope": "action_execution",
  "metadata": {
    "execution_id": 456,
    "action_ref": "core.echo",
    "workflow_id": 789  // Optional, if part of workflow
  }
}
```

**Webhook tokens**:
```json
{
  "scope": "webhook",
  "metadata": {
    "allowed_paths": ["/webhooks/deploy", "/webhooks/alert"],
    "ip_whitelist": ["203.0.113.0/24"]  // Optional
  }
}
```
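
The webhook claims above imply two checks at request time: the request path must be listed in `allowed_paths`, and, if `ip_whitelist` is present, the caller's address must fall inside one of the listed CIDR ranges. The sketch below shows one way to express that. The function name is hypothetical, and exact-match path comparison is an assumption (the document does not say whether prefixes or patterns are allowed).

```python
import ipaddress


def webhook_request_allowed(metadata: dict, path: str, client_ip: str) -> bool:
    """Return True if a webhook token's metadata permits this request."""
    # Assumption: paths must match an allowed entry exactly.
    if path not in metadata.get("allowed_paths", []):
        return False

    allowed_networks = metadata.get("ip_whitelist")
    if not allowed_networks:
        # ip_whitelist is optional; when absent, any source IP is accepted.
        return True

    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable client address: reject

    return any(address in ipaddress.ip_network(cidr)
               for cidr in allowed_networks)
```

For example, with the metadata shown above, a request to `/webhooks/deploy` from `203.0.113.7` passes, while the same request from `198.51.100.1` is rejected.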

## API Endpoints

### Create Service Account

**Admin only**

```http
POST /service-accounts
Authorization: Bearer {admin_token}
Content-Type: application/json

{
  "name": "sensor:core.timer",
  "scope": "sensor",
  "description": "Timer sensor service account",
  "ttl_days": 90,  // Sensor tokens: 90 days, auto-refresh before expiration
  "metadata": {
    "trigger_types": ["core.timer"]
  }
}
```

**Response**:
```json
{
  "identity_id": 123,
  "name": "sensor:core.timer",
  "scope": "sensor",
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "expires_at": "2025-04-27T12:34:56Z"  // 90 days from now
}
```

**Important**: The token is only shown once. Store it securely.
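
The relationship between `ttl_days` in the request and `expires_at` in the response can be sketched as follows. The helper name `compute_expires_at` is hypothetical; the sketch assumes the expiration is simply the creation time plus `ttl_days`, expressed in UTC.

```python
from datetime import datetime, timedelta, timezone


def compute_expires_at(created: datetime, ttl_days: int) -> str:
    """Return the ISO 8601 UTC expiration for a token created at `created`.

    `created` must be timezone-aware; naive datetimes would be
    interpreted as local time by astimezone().
    """
    expires = created.astimezone(timezone.utc) + timedelta(days=ttl_days)
    return expires.strftime("%Y-%m-%dT%H:%M:%SZ")


created = datetime(2025, 1, 27, 12, 34, 56, tzinfo=timezone.utc)
print(compute_expires_at(created, 90))  # 2025-04-27T12:34:56Z
```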

### List Service Accounts

**Admin only**

```http
GET /service-accounts
Authorization: Bearer {admin_token}
```

**Response**:
```json
{
  "data": [
    {
      "identity_id": 123,
      "name": "sensor:core.timer",
      "scope": "sensor",
      "created_at": "2025-01-27T12:34:56Z",
      "expires_at": "2025-04-27T12:34:56Z",
      "metadata": {
        "trigger_types": ["core.timer"]
      }
    }
  ]
}
```

### Refresh Token (Self-Service)

**Sensor/User tokens can refresh themselves**

```http
POST /auth/refresh
Authorization: Bearer {current_token}
Content-Type: application/json

{}
```

**Response**:
```json
{
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "expires_at": "2025-04-27T12:34:56Z"
}
```

**Notes**:
- Current token must be valid (not expired, not revoked)
- New token has same scope and metadata as current token
- New token has same TTL as original token type (e.g., 90 days for sensors)
- Old token remains valid until its original expiration (allows zero-downtime refresh)
- Only `sensor` and `user` scopes can refresh (not `action_execution` or `webhook`)
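
The refresh rules in the notes above can be sketched as a pure function over the token's claims. The names `refresh_claims` and `RefreshDenied` are hypothetical. Deriving the new TTL from the old token's `exp - iat` is an assumption standing in for "same TTL as the original token type".

```python
import copy
import uuid
from typing import Optional, Set

REFRESHABLE_SCOPES = {"sensor", "user"}


class RefreshDenied(Exception):
    """Raised when the current token is not allowed to refresh."""


def refresh_claims(claims: dict, revoked_jtis: Set[str], now: int,
                   new_jti: Optional[str] = None) -> dict:
    """Return claims for the replacement token; the input is not modified."""
    if claims["exp"] < now:
        raise RefreshDenied("current token has expired")
    if claims["jti"] in revoked_jtis:
        raise RefreshDenied("current token has been revoked")
    if claims["scope"] not in REFRESHABLE_SCOPES:
        raise RefreshDenied("scope cannot refresh")

    # Assumption: the original TTL is recovered from the old token itself.
    ttl_seconds = claims["exp"] - claims["iat"]

    refreshed = copy.deepcopy(claims)     # same scope and metadata
    refreshed["jti"] = new_jti or uuid.uuid4().hex
    refreshed["iat"] = now
    refreshed["exp"] = now + ttl_seconds
    return refreshed
```

The old token is deliberately left untouched and unrevoked, which is what gives the overlap period described above.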

### Revoke Service Account Token

**Admin only**

```http
DELETE /service-accounts/{identity_id}
Authorization: Bearer {admin_token}
Content-Type: application/json

{
  "reason": "Token compromised"
}
```

**Response**:
```json
{
  "message": "Service account revoked",
  "identity_id": 123
}
```

### Create Execution Token (Internal)

**Called by executor service, not exposed in API**

```rust
// In executor service
let execution_timeout_minutes = get_action_timeout(action_ref); // e.g., 30 minutes
let token = create_execution_token(
    execution_id,
    action_ref,
    execution_timeout_minutes, // ttl_minutes: token TTL matches the action timeout
)?;
```

This token is passed to the worker service, which injects it into the action's environment.
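
The injection step could look roughly like the sketch below, which launches an action as a subprocess with the three environment variables the action examples in this document read (`ATTUNE_API_URL`, `ATTUNE_API_TOKEN`, `ATTUNE_EXECUTION_ID`). The function `run_action` is hypothetical, and the actual worker is not necessarily written in Python.

```python
import os
import subprocess
import sys
from typing import List


def run_action(command: List[str], api_url: str, token: str,
               execution_id: int,
               timeout_seconds: float) -> subprocess.CompletedProcess:
    """Run an action with its execution token injected into the environment."""
    # Copy the worker's environment so the worker's own os.environ is never
    # modified. A stricter worker might start from a minimal allow-list to
    # avoid leaking its own secrets to the action.
    env = dict(os.environ)
    env.update({
        "ATTUNE_API_URL": api_url,
        "ATTUNE_API_TOKEN": token,
        "ATTUNE_EXECUTION_ID": str(execution_id),
    })
    return subprocess.run(command, env=env, capture_output=True,
                          text=True, timeout=timeout_seconds)


if __name__ == "__main__":
    result = run_action(
        [sys.executable, "-c",
         "import os; print(os.environ['ATTUNE_EXECUTION_ID'])"],
        api_url="https://attune.example.com/api",
        token="example-token",
        execution_id=456,
        timeout_seconds=30,
    )
    print(result.stdout.strip())
```

Passing the timeout to `subprocess.run` keeps the process lifetime aligned with the token TTL, since both derive from the action timeout.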

## Token Creation Workflow

### 1. Sensor Token Creation

```
Admin → POST /service-accounts (scope=sensor) → API
API → Create identity record → Database
API → Generate JWT with sensor scope → Response
Admin → Store token in secure config → Sensor deployment
Sensor → Use token for API calls → Event emission
```

### 2. Execution Token Creation

```
Rule fires → Executor creates enforcement → Executor
Executor → Schedule execution → Database
Executor → Create execution token (internal) → JWT library
Executor → Send execution request to worker → RabbitMQ
Worker → Receive message with token → Action runner
Action → Use token to fetch keys → API
Execution completes → Token auto-revoked (or expires via TTL) → Automatic cleanup
```

## Token Validation

### Middleware (API Service)

```rust
// In API service
pub async fn validate_token(
    token: &str,
    required_scope: Option<&str>
) -> Result<Claims> {
    // 1. Verify JWT signature
    let claims = decode_jwt(token)?;

    // 2. Check expiration (JWT library handles this, but explicit check for clarity)
    if claims.exp < now() {
        return Err(Error::TokenExpired);
    }

    // 3. Check revocation (only check non-expired tokens)
    if is_revoked(&claims.jti, claims.exp).await? {
        return Err(Error::TokenRevoked);
    }

    // 4. Check scope
    if let Some(scope) = required_scope {
        if claims.scope != scope {
            return Err(Error::InsufficientPermissions);
        }
    }

    Ok(claims)
}
```
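
For readers who want to experiment with the validation order outside the Rust service, here is a standard-library Python sketch of the same four steps. It assumes HS256 signing and uses an in-memory set in place of the `token_revocation` lookup. `TokenError` and this `validate_token` signature are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional, Set


class TokenError(Exception):
    """Raised when a token fails any validation step."""


def _b64url_decode(segment: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def validate_token(token: str, secret: bytes, revoked_jtis: Set[str],
                   required_scope: Optional[str] = None,
                   now: Optional[int] = None) -> dict:
    current = int(time.time()) if now is None else now

    # 1. Verify the signature before trusting any claim
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            raise TokenError("invalid signature")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError as exc:
        raise TokenError("malformed token") from exc
    if header.get("alg") != "HS256":
        raise TokenError("unexpected algorithm")

    # 2. Check expiration
    exp = claims.get("exp")
    if not isinstance(exp, int) or exp < current:
        raise TokenError("token expired")

    # 3. Check revocation (only reached for non-expired tokens)
    if claims.get("jti") in revoked_jtis:
        raise TokenError("token revoked")

    # 4. Check scope
    if required_scope is not None and claims.get("scope") != required_scope:
        raise TokenError("insufficient permissions")

    return claims
```

Because expiration is checked before revocation, deleting revocation records for expired tokens cannot make a rejected token valid again.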

### Scope-Based Authorization

```rust
// Execution-scoped token can only access its own execution
if claims.scope == "action_execution" {
    let allowed_execution_id = claims.metadata
        .get("execution_id")
        .and_then(|v| v.as_i64())
        .ok_or(Error::InvalidToken)?;

    if execution_id != allowed_execution_id {
        return Err(Error::InsufficientPermissions);
    }
}

// Sensor-scoped token can only create events for declared trigger types
if claims.scope == "sensor" {
    let allowed_trigger_types = claims.metadata
        .get("trigger_types")
        .and_then(|v| v.as_array())
        .ok_or(Error::InvalidToken)?;

    let allowed_types: Vec<String> = allowed_trigger_types
        .iter()
        .filter_map(|v| v.as_str().map(String::from))
        .collect();

    if !allowed_types.contains(&trigger_type) {
        return Err(Error::InsufficientPermissions);
    }
}
```

## Security Best Practices

### Token Generation

1. **Use Strong Secrets**: JWT signing key must be 256+ bits, randomly generated
2. **Include JTI**: Always include `jti` claim for revocation support
3. **Required Expiration**: All tokens MUST have an `exp` claim, no exceptions
   - Sensor tokens: 90 days (auto-refresh before expiration)
   - Action execution tokens: Match execution timeout (5-60 minutes)
   - User CLI tokens: 7-30 days (auto-refresh before expiration)
   - Webhook tokens: 90-365 days (manual rotation)
4. **Minimal Scope**: Grant least privilege necessary
5. **Restrict Trigger Types**: For sensor tokens, only include necessary trigger types in metadata

### Token Storage

1. **Environment Variables**: Preferred method for sensors and actions
2. **Never Log**: Redact tokens from logs (show only last 4 chars)
3. **Never Commit**: Don't commit tokens to version control
4. **Secure Config**: Store in encrypted config management (Vault, k8s secrets)
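
The "show only last 4 chars" rule can be sketched as a small helper. The name `redact_token` and the `***` prefix are illustrative choices. Very short strings are fully masked so that the visible suffix never reveals most of the value.

```python
def redact_token(token: str, visible: int = 4) -> str:
    """Mask a token for logging, keeping only the last `visible` characters."""
    if len(token) <= visible * 2:
        # Too short to show a suffix without exposing most of the secret.
        return "***"
    return "***" + token[-visible:]


print(redact_token("sensor_abc123xyz9"))  # ***xyz9
```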

### Token Transmission

1. **HTTPS Only**: Never send tokens over unencrypted connections
2. **Authorization Header**: Use `Authorization: Bearer {token}` header
3. **No Query Params**: Don't pass tokens in URL query parameters
4. **No Cookies**: For service accounts, avoid cookie-based auth

### Token Revocation

1. **Immediate Revocation**: Check revocation list on every request
2. **Audit Trail**: Log who revoked, when, and why
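3. **Cascade Delete**: Revoke all tokens when a service account is deleted. Because tokens are stateless and `token_revocation` rows cascade-delete with the identity, validation should also reject tokens whose `identity_id` no longer exists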
4. **Automatic Cleanup**: Delete revocation records for expired tokens (run hourly)
   - Query: `DELETE FROM token_revocation WHERE token_exp < NOW()`
   - Prevents indefinite table bloat
   - Expired tokens are already invalid, no need to track revocation
5. **Validate Permissions**: Enforce trigger type restrictions for sensor tokens on event creation

## Implementation Checklist

- [ ] Add `identity_type` enum to database schema
- [ ] Add `token_revocation` table (with `token_exp` column)
- [ ] Create `POST /service-accounts` endpoint
- [ ] Create `GET /service-accounts` endpoint
- [ ] Create `DELETE /service-accounts/{id}` endpoint
- [ ] Create `POST /auth/refresh` endpoint (for automatic token refresh)
- [ ] Add scope validation middleware
- [ ] Add token revocation check middleware (skip check for expired tokens)
- [ ] Implement execution token creation in executor (TTL = action timeout)
- [ ] Pass execution token to worker via RabbitMQ
- [ ] Inject execution token into action environment
- [ ] Add CLI commands: `attune service-account create/list/revoke`
- [ ] Document token creation for sensor deployment
- [ ] Implement automatic token refresh in sensors (refresh at 80% of TTL)
- [ ] Implement cleanup job for expired token revocations (hourly cron)

## Migration Path

### Phase 1: Database Schema

```sql
-- Add identity_type enum if not exists
DO $$ BEGIN
    CREATE TYPE identity_type AS ENUM ('user', 'service_account');
EXCEPTION
    WHEN duplicate_object THEN null;
END $$;

-- Add identity_type column to identity table (existing rows default to 'user')
ALTER TABLE identity
    ADD COLUMN IF NOT EXISTS identity_type identity_type NOT NULL DEFAULT 'user';

-- Create token_revocation table
CREATE TABLE IF NOT EXISTS token_revocation (
    id BIGSERIAL PRIMARY KEY,
    identity_id BIGINT NOT NULL REFERENCES identity(id) ON DELETE CASCADE,
    token_jti VARCHAR(255) NOT NULL,
    token_exp TIMESTAMPTZ NOT NULL,  -- For cleanup queries
    revoked_at TIMESTAMPTZ DEFAULT NOW(),
    revoked_by BIGINT REFERENCES identity(id),
    reason VARCHAR(500),
    UNIQUE(token_jti)
);

CREATE INDEX IF NOT EXISTS idx_token_revocation_jti ON token_revocation(token_jti);
CREATE INDEX IF NOT EXISTS idx_token_revocation_identity ON token_revocation(identity_id);
CREATE INDEX IF NOT EXISTS idx_token_revocation_exp ON token_revocation(token_exp);
```

### Phase 2: API Implementation

1. Add service account repository
2. Add JWT utilities for scope-based tokens
3. Implement service account CRUD endpoints
4. Add middleware for token validation and revocation

### Phase 3: Integration

1. Update executor to create execution tokens
2. Update worker to receive and use execution tokens
3. Update sensor to accept and use sensor tokens
4. Update CLI to support service account management

## Examples

### Python Action Using Execution Token

```python
#!/usr/bin/env python3
import os
import requests
import sys

# Token is injected by worker
api_url = os.environ['ATTUNE_API_URL']
api_token = os.environ['ATTUNE_API_TOKEN']
execution_id = os.environ['ATTUNE_EXECUTION_ID']

# Fetch encrypted secret
response = requests.get(
    f"{api_url}/keys/myapp.database_password",
    headers={"Authorization": f"Bearer {api_token}"}
)

if response.status_code != 200:
    print(f"Failed to fetch key: {response.text}", file=sys.stderr)
    sys.exit(1)

db_password = response.json()['value']

# Use the secret...
print("Successfully connected to database")
```

### Sensor Using Sensor Token

```rust
// In sensor initialization
use std::env;

let api_token = env::var("ATTUNE_API_TOKEN")?;
let api_url = env::var("ATTUNE_API_URL")?;

let client = reqwest::Client::new();

// Fetch active rules
let response = client
    .get(format!("{}/rules?trigger_type=core.timer", api_url))
    .header("Authorization", format!("Bearer {}", api_token))
    .send()
    .await?
    .error_for_status()?; // Fail fast on 4xx/5xx (e.g. revoked or expired token)

let rules: Vec<Rule> = response.json().await?;
```

## Token Lifecycle Management

### Expiration Strategy

**All tokens MUST expire** to prevent indefinite revocation table bloat and reduce attack surface:

| Token Type | Expiration | Rationale |
|------------|------------|-----------|
| Sensor | 90 days | Perpetually running service, auto-refresh before expiration |
| Action Execution | 5-60 minutes | Matches action timeout, auto-cleanup on completion |
| User CLI | 7-30 days | Balance between convenience and security, auto-refresh |
| Webhook | 90-365 days | External integration, manual rotation required |

### Revocation Table Cleanup

Cleanup job runs hourly to prevent table bloat:

```sql
-- Delete revocation records for expired tokens
DELETE FROM token_revocation
WHERE token_exp < NOW();
```

**Why this works:**
- Expired tokens are already invalid (enforced by JWT `exp` claim)
- No need to track revocation status for invalid tokens
- Keeps revocation table small and queries fast
- Typical size: <1000 rows instead of millions

### Sensor Token Refresh

Sensors automatically refresh their own tokens without human intervention:

**Automatic Process:**
1. Sensor starts with 90-day token
2. Background task monitors token expiration
3. When 80% of the TTL has elapsed (day 72), the sensor requests a new token via `POST /auth/refresh`
4. New token is hot-loaded without restart
5. Old token remains valid until original expiration
6. Process repeats indefinitely

**Refresh Timing Example:**
- Token issued: Day 0, expires Day 90
- Refresh trigger: Day 72 (80% of 90 days)
- New token issued: Day 72, expires Day 162
- Old token still valid: Day 72-90 (overlap period)
- Next refresh: Day 144 (72 days into the new token's 90-day TTL)

**Zero-Downtime:**
- No service interruption during refresh
- Old token valid during transition
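- Graceful fallback on refresh failure: the sensor keeps using the old token, which stays valid until its original expiration, and can retry the refresh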
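
The refresh timing arithmetic above can be reproduced with a few lines of integer math. The helper names are hypothetical. The threshold is expressed as a whole percentage to avoid floating-point rounding.

```python
DAY = 86400  # seconds


def refresh_at(iat: int, exp: int, threshold_percent: int = 80) -> int:
    """Unix time at which `threshold_percent` of the token's TTL has elapsed."""
    return iat + (exp - iat) * threshold_percent // 100


def seconds_until_refresh(now: int, iat: int, exp: int) -> int:
    """How long a background task should sleep before refreshing."""
    return max(0, refresh_at(iat, exp) - now)


# First token: issued Day 0, expires Day 90
first_refresh = refresh_at(0, 90 * DAY)
print(first_refresh // DAY)   # 72

# Second token: issued Day 72, expires Day 162
second_refresh = refresh_at(first_refresh, first_refresh + 90 * DAY)
print(second_refresh // DAY)  # 144
```

Reading `iat` and `exp` from the token's own claims means the sensor needs no separate configuration to know when to refresh.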

## Cleanup Job Implementation

### Purpose

Prevent indefinite growth of the `token_revocation` table by removing revocation records for expired tokens.

### Why Cleanup Is Safe

- Expired tokens are already invalid (enforced by JWT `exp` claim)
- Token validation checks expiration before checking revocation
- No security risk in deleting expired token revocations
- Significantly reduces table size and improves query performance

### Implementation

**Frequency**: Hourly cron job or background task

**SQL Query**:
```sql
DELETE FROM token_revocation
WHERE token_exp < NOW();
```

**Expected Impact**:
- Typical table size: <1,000 rows instead of millions over time
- Fast revocation checks (indexed queries on small dataset)
- Reduced storage and backup costs

### Rust Implementation Example

```rust
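use sqlx::PgPool;
use tokio::time::{interval, Duration};
use tracing::{error, info}; // Assumes the `tracing` crate; the `log` crate's macros work the same way

/// Background task to clean up expired token revocations
pub async fn start_revocation_cleanup_task(db: PgPool) {
    let mut ticker = interval(Duration::from_secs(3600)); // Every hour; the first tick fires immediately

    loop {
        ticker.tick().await;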

        match cleanup_expired_revocations(&db).await {
            Ok(count) => {
                info!("Cleaned up {} expired token revocations", count);
            }
            Err(e) => {
                error!("Failed to clean up expired token revocations: {}", e);
            }
        }
    }
}

/// Delete token revocation records for expired tokens
async fn cleanup_expired_revocations(db: &PgPool) -> Result<u64, sqlx::Error> {
    let result = sqlx::query!(
        "DELETE FROM token_revocation WHERE token_exp < NOW()"
    )
    .execute(db)
    .await?;

    Ok(result.rows_affected())
}
```

### Monitoring

Track cleanup job metrics:
- Number of records deleted per run
- Job execution time
- Job failures (alert if consecutive failures)

**Prometheus Metrics Example**:
```rust
// Define metrics
lazy_static! {
    static ref REVOCATION_CLEANUP_COUNT: IntCounter = register_int_counter!(
        "attune_revocation_cleanup_total",
        "Total number of expired token revocations cleaned up"
    ).unwrap();

    static ref REVOCATION_CLEANUP_DURATION: Histogram = register_histogram!(
        "attune_revocation_cleanup_duration_seconds",
        "Duration of token revocation cleanup job"
    ).unwrap();
}

// In cleanup function
let timer = REVOCATION_CLEANUP_DURATION.start_timer();
let count = cleanup_expired_revocations(&db).await?;
REVOCATION_CLEANUP_COUNT.inc_by(count);
timer.observe_duration();
```

### Alternative: Database Trigger

For automatic cleanup without application code:

```sql
-- Create function to delete old revocations
CREATE OR REPLACE FUNCTION cleanup_expired_token_revocations()
RETURNS trigger AS $$
BEGIN
    DELETE FROM token_revocation WHERE token_exp < NOW() - INTERVAL '1 hour';
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Trigger on insert (runs once per INSERT statement when new revocations are added)
CREATE TRIGGER trigger_cleanup_expired_revocations
    AFTER INSERT ON token_revocation
    FOR EACH STATEMENT
    EXECUTE FUNCTION cleanup_expired_token_revocations();
```

**Note**: Application-level cleanup is preferred for better observability and control.

## Future Enhancements

1. **Rate Limiting**: Per-token rate limits to prevent abuse
2. **Audit Logging**: Comprehensive audit trail of token usage and refresh events
3. **OAuth 2.0**: Support OAuth 2.0 client credentials flow
4. **mTLS**: Mutual TLS authentication for high-security deployments
5. **Token Introspection**: RFC 7662-compliant token introspection endpoint
6. **Scope Hierarchies**: More granular permission scopes
7. **IP Whitelisting**: Restrict token usage to specific IP ranges
8. **Configurable Refresh Timing**: Allow custom refresh thresholds per token type
9. **Token Lineage Tracking**: Track token refresh chains for security audits
10. **Refresh Failure Alerts**: Notify operators when automatic refresh fails