# Fix: Execution Failure Detection and Error Capture **Date:** 2026-01-30 **Issue:** Executions occasionally fail with "Execution failed during preparation" error even though stdout.log shows the action ran successfully **Status:** Fixed --- ## Problem Description Users reported occasional execution failures with the following characteristics: - Error message: `"Execution failed during preparation"` - Result JSON shows `"succeeded": false` - The `stdout.log` file exists and contains output from the action - The action appears to have run, but the system failed to capture the success ### Example Error ```json { "error": "Execution failed during preparation", "stdout_log": "/tmp/attune/artifacts/execution_10172/stdout.log", "succeeded": false } ``` --- ## Root Cause Analysis The issue was identified in the worker's execution flow, specifically in how runtime errors are handled: ### 1. **Process Wait Failures** In `shell.rs` (`execute_with_streaming` method), if `child.wait()` fails after the process has already started and written output: ```rust Ok(Err(e)) => { return Err(RuntimeError::ProcessError(format!( "Process wait failed: {}", e ))); } ``` This returns an `Err` even though: - The child process ran successfully - Output was captured to stdout/stderr - The process may have completed normally ### 2. **Stdin Write Failures** Writing secrets to stdin could fail after the process spawned: ```rust let secrets_json = serde_json::to_string(secrets)?; stdin.write_all(secrets_json.as_bytes()).await?; ``` The `?` operator would propagate the error up, discarding captured output. ### 3. **Error Propagation in Executor** In `executor.rs`, when `execute_action()` returns an `Err`: ```rust let result = match self.execute_action(context).await { Ok(result) => result, Err(e) => { error!("Action execution failed: {}", e); self.handle_execution_failure(execution_id, None).await?; // None = no result return Err(e); } }; ``` Passing `None` to `handle_execution_failure` triggers the "Execution failed during preparation" message, even though logs exist. ### 4. **Poor Error Messages** When exit code was non-zero, the entire stderr was used as the error message, which could be very long and unhelpful. --- ## Solution Implemented ### Changes to `shell.rs` #### 1. **Graceful Stdin Write Handling** ```rust let stdin_write_error = if let Some(mut stdin) = child.stdin.take() { match serde_json::to_string(secrets) { Ok(secrets_json) => { if let Err(e) = stdin.write_all(secrets_json.as_bytes()).await { Some(format!("Failed to write secrets to stdin: {}", e)) } else if let Err(e) = stdin.write_all(b"\n").await { Some(format!("Failed to write newline to stdin: {}", e)) } else { drop(stdin); None } } Err(e) => Some(format!("Failed to serialize secrets: {}", e)), } } else { None }; ``` - Capture stdin write errors instead of propagating them - Continue execution to capture output - Include error in ExecutionResult #### 2. **Process Wait Error Recovery** ```rust let (exit_code, process_error) = match wait_result { Ok(Ok(status)) => (status.code().unwrap_or(-1), None), Ok(Err(e)) => { // Process wait failed, but we have the output - return it with an error warn!("Process wait failed but captured output: {}", e); (-1, Some(format!("Process wait failed: {}", e))) } Err(_) => { // Timeout occurred - return captured output return Ok(ExecutionResult { exit_code: -1, stdout: stdout_result.content.clone(), stderr: stderr_result.content.clone(), // ... include truncation info }); } }; ``` - Always return `Ok(ExecutionResult)` when we have captured output - Include process wait errors in the result's `error` field - Preserve stdout/stderr even on timeout #### 3. **Improved Error Messages** ```rust let error = if let Some(proc_err) = process_error { Some(proc_err) } else if let Some(stdin_err) = stdin_write_error { Some(stdin_err) } else if exit_code != 0 { Some(if stderr_result.content.is_empty() { format!("Command exited with code {}", exit_code) } else { // Use last line of stderr as error, or full stderr if short if stderr_result.content.lines().count() > 5 { stderr_result.content.lines().last().unwrap_or("").to_string() } else { stderr_result.content.clone() } }) } else { None }; ``` - Prioritize specific error sources - Use last line of stderr for concise error messages - Full stderr only if short (≤5 lines) ### Changes to `executor.rs` #### 1. **Better Documentation** ```rust // Note: execute_action should rarely return Err - most failures should be // captured in ExecutionResult with non-zero exit codes let result = match self.execute_action(context).await { Ok(result) => result, Err(e) => { error!("Action execution failed catastrophically: {}", e); // This should only happen for unrecoverable errors like runtime not found ``` Clarified that returning `Err` should be rare. #### 2. **Enhanced Failure Handling** When `result` is `None` (early failure), now attempts to read logs from disk: ```rust // Check if stdout log exists from artifact storage let stdout_path = exec_dir.join("stdout.log"); if stdout_path.exists() { result_data["stdout_log"] = serde_json::json!(stdout_path.to_string_lossy()); // Try to read a preview if file exists if let Ok(contents) = tokio::fs::read_to_string(&stdout_path).await { let preview = if contents.len() > 1000 { format!("{}...", &contents[..1000]) } else { contents }; result_data["stdout"] = serde_json::json!(preview); } } ``` This provides better diagnostics even for catastrophic failures. #### 3. **Truncation Metadata** Added truncation information to failure results: ```rust if exec_result.stdout_truncated { result_data["stdout_truncated"] = serde_json::json!(true); result_data["stdout_bytes_truncated"] = serde_json::json!(exec_result.stdout_bytes_truncated); } ``` --- ## Impact ### Before - **Intermittent "preparation" failures** even when actions ran successfully - **Lost output** from partially-completed executions - **Verbose error messages** (entire stderr dump) - **Difficult debugging** due to missing context ### After - **Always capture output** when process runs, regardless of wait() status - **Specific error messages** identifying the actual failure point - **Concise error summaries** (last line of stderr) - **Better diagnostics** with truncation metadata - **Graceful degradation** for stdin write failures --- ## Testing Recommendations 1. **Process Termination Scenarios** - Actions that crash or are killed - Zombie processes - Process that exit before we can wait() 2. **Resource Exhaustion** - Very large stdout/stderr (test truncation) - Many concurrent executions - Slow process cleanup 3. **Stdin Write Failures** - Processes that close stdin immediately - Broken pipe scenarios - Large secret payloads 4. **Edge Cases** - Timeout with partial output - Exit code 0 but stderr present - No output but successful exit --- ## Files Modified - `attune/crates/worker/src/runtime/shell.rs` - Improved error handling and output capture - `attune/crates/worker/src/executor.rs` - Enhanced failure diagnostics --- ## Notes - This fix makes the system more resilient to transient process management issues - The "Execution failed during preparation" error should now be extremely rare - When it does occur, the result will include any available logs - Error messages are now more actionable and concise - All changes are backward compatible - existing executions unaffected --- ## Related Documentation - `attune/docs/worker-service.md` - Worker architecture - `attune/docs/running-tests.md` - Testing guidelines