Files

David Culbreth 3b14c65998 re-uploading work

2026-02-04 17:46:30 -06:00

8.6 KiB

Raw Blame History

Work Summary: Execution Completion Issue Fix

Date: January 29, 2026
Issue: Executions getting stuck in "running" state and never completing
Status: ✅ Resolved

Problem Statement

Executions submitted via the web UI or CLI were being scheduled and transitioning to "running" status, but never completing (transitioning to "completed" or "failed"). The executions remained stuck in "running" state indefinitely.

Symptoms

Executions created via attune run core.echo would show status "requested" → "running" but never "completed"
Database showed multiple executions stuck in "running" status
No error messages visible in user-facing logs
Worker service was running and appeared to be processing messages

Root Causes Identified

Investigation revealed four distinct issues that compounded to prevent execution completion:

1. Hardcoded Schema Prefixes in SQL Queries

Location: attune/crates/worker/src/executor.rs, attune/crates/worker/src/secrets.rs

Problem: The worker service had hardcoded schema prefixes (attune.action, attune.key, attune.runtime) in SQL queries, but the development environment uses the public schema.

Error: relation "attune.action" does not exist

Files affected:

executor.rs lines 133, 159-160, 262 (action and runtime queries)
secrets.rs lines 104, 120 (key/secrets queries)

Fix: Removed all hardcoded schema prefixes and rely on PostgreSQL's search_path mechanism instead.

// Before:
sqlx::query_as::<_, Action>("SELECT * FROM attune.action WHERE id = $1")

// After:
sqlx::query_as::<_, Action>("SELECT * FROM action WHERE id = $1")

2. Runtime Name Case Sensitivity Mismatch

Location: attune/crates/worker/src/executor.rs line 272

Problem: The database stores runtime names with capitalization (e.g., "Shell", "Python"), but the code was performing case-sensitive string comparisons with lowercase values ("shell").

Impact: Runtime selection failed, preventing the worker from choosing the correct runtime to execute the action.

Fix: Convert runtime names to lowercase after loading from database:

// Before:
Some(runtime.name)

// After:
Some(runtime.name.to_lowercase())

3. Missing Pack File Loading Mechanism

Location: attune/crates/worker/src/executor.rs, attune/crates/worker/src/service.rs

Problem: The worker had no mechanism to locate and load pack action script files from disk. It only had the entrypoint field from the database (e.g., "echo.sh"), which is just a filename, not a file path or script content.

Impact: The worker tried to execute the string "echo.sh" as a shell command directly, which failed with "command not found".

Fix:

Added packs_base_dir field to ActionExecutor struct
Implemented file path resolution: {packs_base_dir}/{pack_ref}/actions/{entrypoint}
Set code_path in execution context when action file exists
Updated configuration to include packs_base_dir: ./packs for development

Files modified:

executor.rs: Added packs_base_dir field and path construction logic (lines 28, 296-317)
service.rs: Pass packs_base_dir from config to ActionExecutor (line 191)
config.development.yaml: Added packs_base_dir: ./packs configuration

4. Missing Parameter Passing to Shell Scripts

Location: attune/crates/worker/src/runtime/shell.rs

Problem: When executing action files directly via code_path, parameters were stored in context.parameters but not converted to environment variables. Shell scripts expect parameters as environment variables with ATTUNE_ACTION_ prefix.

Impact: Actions executed but ignored all parameters, using default values instead.

Fix: Convert parameters to environment variables before executing shell files:

// Merge parameters into environment variables with ATTUNE_ACTION_ prefix
let mut env = context.env.clone();
for (key, value) in &context.parameters {
    env.insert(format!("ATTUNE_ACTION_{}", key.to_uppercase()), value_str);
}

Changes Made

Files Modified

attune/crates/worker/src/executor.rs
- Removed hardcoded schema prefixes (3 locations)
- Added packs_base_dir: PathBuf field to ActionExecutor struct
- Implemented pack action file path resolution logic
- Convert runtime names to lowercase for comparison
- Added std::path::PathBuf import
attune/crates/worker/src/secrets.rs
- Removed hardcoded schema prefixes from key queries (2 locations)
attune/crates/worker/src/service.rs
- Extract packs_base_dir from config
- Pass packs_base_dir to ActionExecutor constructor
attune/crates/worker/src/runtime/shell.rs
- Added parameter-to-environment-variable conversion for file execution
- Parameters exported with ATTUNE_ACTION_ prefix
attune/config.development.yaml
- Added packs_base_dir: ./packs configuration
attune/.rules
- Updated "Common Pitfalls" section with schema prefix warning
- Updated "Key Tools & Libraries" with pack file loading documentation
- Updated project status to reflect completed worker service

Testing & Validation

Test Cases Executed

Basic execution without parameters:

attune run core.echo
# Expected: "Hello, World!"
# Result: ✅ Success - execution completed, correct output

Execution with parameters:

attune run core.echo --param message="Testing parameters!" --param uppercase=true
# Expected: "TESTING PARAMETERS!"
# Result: ✅ Success - execution completed, parameters applied correctly

Database verification:

SELECT id, status, result->>'stdout' as output FROM execution WHERE id = 16;
-- Result: status='completed', output='TESTING PARAMETERS!\n'

Execution Flow Verified

Complete end-to-end execution flow confirmed working:

CLI/API → Execution Created (requested) 
       → Executor Schedules (scheduling → scheduled)
       → Worker Receives Message
       → Worker Updates Status (running)
       → Worker Loads Action from DB
       → Worker Resolves Pack File Path
       → Worker Executes Action with Parameters
       → Worker Publishes Status (completed/failed)
       → Executor Updates Database

Previous Stuck Executions

Executions 7, 8, and 9 remain in "running" status as they were stuck before the fix was applied. These are effectively abandoned executions and would require manual cleanup or a background job to mark as "abandoned".

Performance Impact

Execution speed: Shell actions complete in 1-3ms (measured)
No performance degradation: File path resolution is minimal overhead
Schema-agnostic queries: Better compatibility across environments

Lessons Learned

Always use PostgreSQL search_path: Never hardcode schema prefixes in queries
Case-insensitive comparisons: When dealing with user-configured data like runtime names
Complete file path resolution: Don't assume filenames alone are sufficient
Environment variable conventions: Document and consistently apply naming conventions (e.g., ATTUNE_ACTION_ prefix)
Test across schema configurations: Ensure code works in both public and custom schemas

Follow-up Items

Recommended (Not Blocking)

Abandoned execution cleanup: Implement background job to mark stuck executions as "abandoned" after timeout
Pack validation on startup: Verify pack files exist and are readable when worker starts
Better error messages: Add specific error messages for missing pack files
Python runtime testing: Verify similar fixes work for Python actions (not tested in this session)
Integration tests: Add end-to-end tests for pack file loading and parameter passing

Future Enhancements

Pack hot-reloading: Allow pack updates without worker restart
Action code caching: Cache frequently-used action files in memory
Execution retry mechanism: Auto-retry failed executions with backoff
Execution timeout handling: Implement proper timeout for long-running actions

Conclusion

The execution completion issue has been fully resolved. All four root causes have been addressed:

✅ Schema prefixes removed
✅ Runtime name comparison fixed
✅ Pack file loading implemented
✅ Parameter passing corrected

The system now successfully executes actions from pack files with proper parameter handling, and executions complete correctly with appropriate status transitions.

Verification: Executions 13, 14, and 16 all completed successfully with correct output, demonstrating that the fix is working as expected.

8.6 KiB Raw Blame History