
David Culbreth 3b14c65998 re-uploading work

2026-02-04 17:46:30 -06:00


Fix: Execution Failure Detection and Error Capture

Date: 2026-01-30
Issue: Executions occasionally fail with "Execution failed during preparation" error even though stdout.log shows the action ran successfully
Status: Fixed

Problem Description

Users reported occasional execution failures with the following characteristics:

- Error message: "Execution failed during preparation"
- Result JSON shows "succeeded": false
- The stdout.log file exists and contains output from the action
- The action appears to have run, but the system failed to capture the success

Example Error

{
  "error": "Execution failed during preparation",
  "stdout_log": "/tmp/attune/artifacts/execution_10172/stdout.log",
  "succeeded": false
}

Root Cause Analysis

The issue was identified in the worker's execution flow, specifically in how runtime errors are handled:

1. Process Wait Failures

In shell.rs (execute_with_streaming method), a child.wait() failure was turned into a hard error, even when the process had already started and written output:

Ok(Err(e)) => {
    return Err(RuntimeError::ProcessError(format!(
        "Process wait failed: {}",
        e
    )));
}

This returns an Err even though:

- The child process had already run
- Output was captured to stdout/stderr
- The process may have completed normally

2. Stdin Write Failures

Writing secrets to stdin could fail after the process spawned:

let secrets_json = serde_json::to_string(secrets)?;
stdin.write_all(secrets_json.as_bytes()).await?;

The ? operator would propagate the error up, discarding captured output.

3. Error Propagation in Executor

In executor.rs, when execute_action() returns an Err:

let result = match self.execute_action(context).await {
    Ok(result) => result,
    Err(e) => {
        error!("Action execution failed: {}", e);
        self.handle_execution_failure(execution_id, None).await?;  // None = no result
        return Err(e);
    }
};

Passing None to handle_execution_failure triggers the "Execution failed during preparation" message, even though logs exist.

4. Poor Error Messages

When the exit code was non-zero, the entire stderr was used as the error message, which could be very long and bury the actual cause.

Solution Implemented

Changes to `shell.rs`

1. Graceful Stdin Write Handling

let stdin_write_error = if let Some(mut stdin) = child.stdin.take() {
    match serde_json::to_string(secrets) {
        Ok(secrets_json) => {
            if let Err(e) = stdin.write_all(secrets_json.as_bytes()).await {
                Some(format!("Failed to write secrets to stdin: {}", e))
            } else if let Err(e) = stdin.write_all(b"\n").await {
                Some(format!("Failed to write newline to stdin: {}", e))
            } else {
                drop(stdin);
                None
            }
        }
        Err(e) => Some(format!("Failed to serialize secrets: {}", e)),
    }
} else {
    None
};

- Capture stdin write errors instead of propagating them
- Continue execution to capture output
- Include the error in ExecutionResult
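The capture-instead-of-propagate pattern can be sketched without the async runtime. This is a minimal illustration using only std::io::Write; write_secrets and ClosedPipe are hypothetical names rather than the worker's actual code, and the payload is assumed to be an already-serialized string.

```rust
use std::io::{self, Write};

/// Hypothetical helper: writes a payload plus a trailing newline and
/// returns a description of the first failure instead of propagating it.
fn write_secrets<W: Write>(stdin: &mut W, payload: &str) -> Option<String> {
    if let Err(e) = stdin.write_all(payload.as_bytes()) {
        Some(format!("Failed to write secrets to stdin: {}", e))
    } else if let Err(e) = stdin.write_all(b"\n") {
        Some(format!("Failed to write newline to stdin: {}", e))
    } else {
        None
    }
}

/// Hypothetical writer that behaves like a pipe whose reader has gone away.
struct ClosedPipe;

impl Write for ClosedPipe {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::BrokenPipe, "broken pipe"))
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Because the helper returns Option<String> rather than Result, the caller cannot use ? on it by accident and is nudged toward carrying the message forward alongside whatever output the process produced.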

2. Process Wait Error Recovery

let (exit_code, process_error) = match wait_result {
    Ok(Ok(status)) => (status.code().unwrap_or(-1), None),
    Ok(Err(e)) => {
        // Process wait failed, but we have the output - return it with an error
        warn!("Process wait failed but captured output: {}", e);
        (-1, Some(format!("Process wait failed: {}", e)))
    }
    Err(_) => {
        // Timeout occurred - return captured output
        return Ok(ExecutionResult {
            exit_code: -1,
            stdout: stdout_result.content.clone(),
            stderr: stderr_result.content.clone(),
            // ... include truncation info
        });
    }
};

- Always return Ok(ExecutionResult) when we have captured output
- Include process wait errors in the result's error field
- Preserve stdout/stderr even on timeout
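The recovery rule can be modeled with plain std types to make the three outcomes explicit. In this sketch TimedOut, finish, and the reduced ExecutionResult are hypothetical stand-ins, Option<i32> models ExitStatus::code(), and the timeout message is an assumption, since the original snippet elides those fields.

```rust
use std::io;

/// Hypothetical stand-in for a timeout error.
#[derive(Debug)]
struct TimedOut;

/// Simplified, hypothetical result type; the real struct has more fields.
#[derive(Debug, PartialEq)]
struct ExecutionResult {
    exit_code: i32,
    stdout: String,
    stderr: String,
    error: Option<String>,
}

/// Folds every wait outcome into an ExecutionResult so that captured
/// output is never dropped on the error paths.
fn finish(
    wait_result: Result<io::Result<Option<i32>>, TimedOut>,
    stdout: String,
    stderr: String,
) -> ExecutionResult {
    let (exit_code, error) = match wait_result {
        // Normal exit; a missing code (e.g. killed by a signal) maps to -1.
        Ok(Ok(code)) => (code.unwrap_or(-1), None),
        // wait() itself failed, but the output is still worth returning.
        Ok(Err(e)) => (-1, Some(format!("Process wait failed: {}", e))),
        // Assumed message; the source does not show the timeout error text.
        Err(TimedOut) => (-1, Some("Execution timed out".to_string())),
    };
    ExecutionResult { exit_code, stdout, stderr, error }
}
```

The function has no Err return at all, which is the point of the change: every path that has output hands it back to the executor.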

3. Improved Error Messages

let error = if let Some(proc_err) = process_error {
    Some(proc_err)
} else if let Some(stdin_err) = stdin_write_error {
    Some(stdin_err)
} else if exit_code != 0 {
    Some(if stderr_result.content.is_empty() {
        format!("Command exited with code {}", exit_code)
    } else {
        // Use last line of stderr as error, or full stderr if short
        if stderr_result.content.lines().count() > 5 {
            stderr_result.content.lines().last().unwrap_or("").to_string()
        } else {
            stderr_result.content.clone()
        }
    })
} else {
    None
};

- Prioritize specific error sources
- Use last line of stderr for concise error messages
- Full stderr only if short (≤5 lines)
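The priority order is easier to test when pulled out into a free function. The following is a sketch under that assumption; select_error is a hypothetical name, and it takes stderr as a plain &str instead of the worker's stderr_result.

```rust
/// Hypothetical free-function version of the selection logic:
/// process error first, then stdin error, then an exit-code summary.
fn select_error(
    process_error: Option<String>,
    stdin_write_error: Option<String>,
    exit_code: i32,
    stderr: &str,
) -> Option<String> {
    process_error.or(stdin_write_error).or_else(|| {
        if exit_code == 0 {
            None
        } else if stderr.is_empty() {
            Some(format!("Command exited with code {}", exit_code))
        } else if stderr.lines().count() > 5 {
            // Long stderr: keep only the last line as the summary.
            Some(stderr.lines().last().unwrap_or("").to_string())
        } else {
            // Short stderr (5 lines or fewer): keep it whole.
            Some(stderr.to_string())
        }
    })
}
```

Note that a stdin write error is reported even when the exit code is 0, and that stderr content alone never produces an error on a successful exit.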

Changes to `executor.rs`

1. Better Documentation

// Note: execute_action should rarely return Err - most failures should be
// captured in ExecutionResult with non-zero exit codes
let result = match self.execute_action(context).await {
    Ok(result) => result,
    Err(e) => {
        error!("Action execution failed catastrophically: {}", e);
        // This should only happen for unrecoverable errors like runtime not found

Clarified that returning Err should be rare.

2. Enhanced Failure Handling

When result is None (early failure), the failure handler now attempts to read logs from disk:

// Check if stdout log exists from artifact storage
let stdout_path = exec_dir.join("stdout.log");
if stdout_path.exists() {
    result_data["stdout_log"] = serde_json::json!(stdout_path.to_string_lossy());
    // Try to read a preview if file exists
    if let Ok(contents) = tokio::fs::read_to_string(&stdout_path).await {
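        let preview = if contents.len() > 1000 {
            // Back off to a char boundary; slicing mid-character would panic
            let mut end = 1000;
            while !contents.is_char_boundary(end) {
                end -= 1;
            }
            format!("{}...", &contents[..end])
        } else {
            contents
        };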
        result_data["stdout"] = serde_json::json!(preview);
    }
}

This provides better diagnostics even for catastrophic failures.

3. Truncation Metadata

Added truncation information to failure results:

if exec_result.stdout_truncated {
    result_data["stdout_truncated"] = serde_json::json!(true);
    result_data["stdout_bytes_truncated"] = 
        serde_json::json!(exec_result.stdout_bytes_truncated);
}
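For context, here is one way the two truncation fields could be produced on the capture side. This is only a sketch: BoundedLog is a hypothetical type whose field names mirror the result keys above, and the worker's real log writer may track truncation differently.

```rust
/// Hypothetical bounded buffer that records how many bytes it dropped.
struct BoundedLog {
    content: String,
    max_bytes: usize,
    bytes_truncated: usize,
}

impl BoundedLog {
    fn new(max_bytes: usize) -> Self {
        BoundedLog { content: String::new(), max_bytes, bytes_truncated: 0 }
    }

    /// Appends as much of `chunk` as fits and counts the rest as truncated.
    fn push(&mut self, chunk: &str) {
        let room = self.max_bytes.saturating_sub(self.content.len());
        // Back off to a char boundary so a multi-byte character is never split.
        let mut take = room.min(chunk.len());
        while !chunk.is_char_boundary(take) {
            take -= 1;
        }
        self.content.push_str(&chunk[..take]);
        self.bytes_truncated += chunk.len() - take;
    }

    fn truncated(&self) -> bool {
        self.bytes_truncated > 0
    }
}
```

Reporting the dropped byte count alongside the flag lets a reader of the result JSON judge how much of the log is missing before opening the file on disk.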

Impact

Before

- Intermittent "preparation" failures even when actions ran successfully
- Lost output from partially-completed executions
- Verbose error messages (entire stderr dump)
- Difficult debugging due to missing context

After

- Output is always captured when the process runs, regardless of wait() status
- Specific error messages identifying the actual failure point
- Concise error summaries (last line of stderr)
- Better diagnostics with truncation metadata
- Graceful degradation for stdin write failures

Testing Recommendations

Process Termination Scenarios
- Actions that crash or are killed
- Zombie processes
- Processes that exit before we can wait()
Resource Exhaustion
- Very large stdout/stderr (test truncation)
- Many concurrent executions
- Slow process cleanup
Stdin Write Failures
- Processes that close stdin immediately
- Broken pipe scenarios
- Large secret payloads
Edge Cases
- Timeout with partial output
- Exit code 0 but stderr present
- No output but successful exit
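A few of these scenarios can be reproduced with nothing but std::process. The helper below is a sketch, not part of the worker's test suite: run_sh is a hypothetical name, and it assumes a Unix-like host with sh on PATH.

```rust
use std::process::{Command, Stdio};

/// Runs a shell snippet and returns (exit code or -1, stdout, stderr).
/// Assumes a Unix-like host with `sh` on PATH.
fn run_sh(script: &str) -> (i32, String, String) {
    let out = Command::new("sh")
        .arg("-c")
        .arg(script)
        .stdin(Stdio::null())
        .output()
        .expect("failed to spawn sh");
    (
        // code() is None when the process was killed by a signal.
        out.status.code().unwrap_or(-1),
        String::from_utf8_lossy(&out.stdout).into_owned(),
        String::from_utf8_lossy(&out.stderr).into_owned(),
    )
}
```

For example, run_sh("echo warn >&2; exit 0") exercises the "exit code 0 but stderr present" case, and run_sh("echo partial; kill -9 $$") produces a killed process with partial output, where the missing exit code maps to -1 in the same way as status.code().unwrap_or(-1) in shell.rs.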

Files Modified

- attune/crates/worker/src/runtime/shell.rs - Improved error handling and output capture
- attune/crates/worker/src/executor.rs - Enhanced failure diagnostics

Notes

- This fix makes the system more resilient to transient process management issues
- The "Execution failed during preparation" error should now be extremely rare
- When it does occur, the result will include any available logs
- Error messages are now more actionable and concise
- All changes are backward compatible; existing executions are unaffected

Related Documentation

- attune/docs/worker-service.md - Worker architecture
- attune/docs/running-tests.md - Testing guidelines
