Files
attune/work-summary/sessions/2026-01-30-fix-execution-failure-capture.md
2026-02-04 17:46:30 -06:00

267 lines
8.0 KiB
Markdown

# Fix: Execution Failure Detection and Error Capture
**Date:** 2026-01-30
**Issue:** Executions occasionally fail with "Execution failed during preparation" error even though stdout.log shows the action ran successfully
**Status:** Fixed
---
## Problem Description
Users reported occasional execution failures with the following characteristics:
- Error message: `"Execution failed during preparation"`
- Result JSON shows `"succeeded": false`
- The `stdout.log` file exists and contains output from the action
- The action appears to have run, but the system failed to capture the success
### Example Error
```json
{
"error": "Execution failed during preparation",
"stdout_log": "/tmp/attune/artifacts/execution_10172/stdout.log",
"succeeded": false
}
```
---
## Root Cause Analysis
The issue was identified in the worker's execution flow, specifically in how runtime errors are handled:
### 1. **Process Wait Failures**
In `shell.rs` (`execute_with_streaming` method), if `child.wait()` fails after the process has already started and written output:
```rust
Ok(Err(e)) => {
return Err(RuntimeError::ProcessError(format!(
"Process wait failed: {}",
e
)));
}
```
This returns an `Err` even though:
- The child process ran successfully
- Output was captured to stdout/stderr
- The process may have completed normally
### 2. **Stdin Write Failures**
Writing secrets to stdin could fail after the process spawned:
```rust
let secrets_json = serde_json::to_string(secrets)?;
stdin.write_all(secrets_json.as_bytes()).await?;
```
The `?` operator would propagate the error up, discarding captured output.
### 3. **Error Propagation in Executor**
In `executor.rs`, when `execute_action()` returns an `Err`:
```rust
let result = match self.execute_action(context).await {
Ok(result) => result,
Err(e) => {
error!("Action execution failed: {}", e);
self.handle_execution_failure(execution_id, None).await?; // None = no result
return Err(e);
}
};
```
Passing `None` to `handle_execution_failure` triggers the "Execution failed during preparation" message, even though logs exist.
### 4. **Poor Error Messages**
When exit code was non-zero, the entire stderr was used as the error message, which could be very long and unhelpful.
---
## Solution Implemented
### Changes to `shell.rs`
#### 1. **Graceful Stdin Write Handling**
```rust
let stdin_write_error = if let Some(mut stdin) = child.stdin.take() {
match serde_json::to_string(secrets) {
Ok(secrets_json) => {
if let Err(e) = stdin.write_all(secrets_json.as_bytes()).await {
Some(format!("Failed to write secrets to stdin: {}", e))
} else if let Err(e) = stdin.write_all(b"\n").await {
Some(format!("Failed to write newline to stdin: {}", e))
} else {
drop(stdin);
None
}
}
Err(e) => Some(format!("Failed to serialize secrets: {}", e)),
}
} else {
None
};
```
- Capture stdin write errors instead of propagating them
- Continue execution to capture output
- Include error in ExecutionResult
#### 2. **Process Wait Error Recovery**
```rust
let (exit_code, process_error) = match wait_result {
Ok(Ok(status)) => (status.code().unwrap_or(-1), None),
Ok(Err(e)) => {
// Process wait failed, but we have the output - return it with an error
warn!("Process wait failed but captured output: {}", e);
(-1, Some(format!("Process wait failed: {}", e)))
}
Err(_) => {
// Timeout occurred - return captured output
return Ok(ExecutionResult {
exit_code: -1,
stdout: stdout_result.content.clone(),
stderr: stderr_result.content.clone(),
// ... include truncation info
});
}
};
```
- Always return `Ok(ExecutionResult)` when we have captured output
- Include process wait errors in the result's `error` field
- Preserve stdout/stderr even on timeout
#### 3. **Improved Error Messages**
```rust
let error = if let Some(proc_err) = process_error {
Some(proc_err)
} else if let Some(stdin_err) = stdin_write_error {
Some(stdin_err)
} else if exit_code != 0 {
Some(if stderr_result.content.is_empty() {
format!("Command exited with code {}", exit_code)
} else {
// Use last line of stderr as error, or full stderr if short
if stderr_result.content.lines().count() > 5 {
stderr_result.content.lines().last().unwrap_or("").to_string()
} else {
stderr_result.content.clone()
}
})
} else {
None
};
```
- Prioritize specific error sources
- Use last line of stderr for concise error messages
- Full stderr only if short (≤5 lines)
### Changes to `executor.rs`
#### 1. **Better Documentation**
```rust
// Note: execute_action should rarely return Err - most failures should be
// captured in ExecutionResult with non-zero exit codes
let result = match self.execute_action(context).await {
Ok(result) => result,
Err(e) => {
error!("Action execution failed catastrophically: {}", e);
// This should only happen for unrecoverable errors like runtime not found
```
Clarified that returning `Err` should be rare.
#### 2. **Enhanced Failure Handling**
When `result` is `None` (early failure), now attempts to read logs from disk:
```rust
// Check if stdout log exists from artifact storage
let stdout_path = exec_dir.join("stdout.log");
if stdout_path.exists() {
result_data["stdout_log"] = serde_json::json!(stdout_path.to_string_lossy());
// Try to read a preview if file exists
if let Ok(contents) = tokio::fs::read_to_string(&stdout_path).await {
let preview = if contents.len() > 1000 {
format!("{}...", &contents[..1000])
} else {
contents
};
result_data["stdout"] = serde_json::json!(preview);
}
}
```
This provides better diagnostics even for catastrophic failures.
#### 3. **Truncation Metadata**
Added truncation information to failure results:
```rust
if exec_result.stdout_truncated {
result_data["stdout_truncated"] = serde_json::json!(true);
result_data["stdout_bytes_truncated"] =
serde_json::json!(exec_result.stdout_bytes_truncated);
}
```
---
## Impact
### Before
- **Intermittent "preparation" failures** even when actions ran successfully
- **Lost output** from partially-completed executions
- **Verbose error messages** (entire stderr dump)
- **Difficult debugging** due to missing context
### After
- **Always capture output** when process runs, regardless of wait() status
- **Specific error messages** identifying the actual failure point
- **Concise error summaries** (last line of stderr)
- **Better diagnostics** with truncation metadata
- **Graceful degradation** for stdin write failures
---
## Testing Recommendations
1. **Process Termination Scenarios**
- Actions that crash or are killed
- Zombie processes
- Process that exit before we can wait()
2. **Resource Exhaustion**
- Very large stdout/stderr (test truncation)
- Many concurrent executions
- Slow process cleanup
3. **Stdin Write Failures**
- Processes that close stdin immediately
- Broken pipe scenarios
- Large secret payloads
4. **Edge Cases**
- Timeout with partial output
- Exit code 0 but stderr present
- No output but successful exit
---
## Files Modified
- `attune/crates/worker/src/runtime/shell.rs` - Improved error handling and output capture
- `attune/crates/worker/src/executor.rs` - Enhanced failure diagnostics
---
## Notes
- This fix makes the system more resilient to transient process management issues
- The "Execution failed during preparation" error should now be extremely rare
- When it does occur, the result will include any available logs
- Error messages are now more actionable and concise
- All changes are backward compatible - existing executions unaffected
---
## Related Documentation
- `attune/docs/worker-service.md` - Worker architecture
- `attune/docs/running-tests.md` - Testing guidelines