343 lines
16 KiB
Markdown
343 lines
16 KiB
Markdown
# Bug Fix: Duplicate Completion Notifications & Unnecessary Database Updates
|
|
|
|
**Date**: 2026-02-09
|
|
**Component**: Executor Service (ExecutionManager)
|
|
**Issue Type**: Performance & Correctness
|
|
|
|
## Overview
|
|
|
|
Fixed two related inefficiencies in the executor service:
|
|
1. **Duplicate completion notifications** causing queue manager warnings
|
|
2. **Unnecessary database updates** writing unchanged status values
|
|
|
|
---
|
|
|
|
## Problem 1: Duplicate Completion Notifications
|
|
|
|
### Symptom
|
|
```
|
|
WARN crates/executor/src/queue_manager.rs:320:
|
|
Completion notification for action 3 but active_count is 0
|
|
```
|
|
|
|
### Before Fix - Message Flow
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────────────┐
|
|
│ Worker Service │
|
|
│ │
|
|
│ 1. Completes action execution │
|
|
│ 2. Updates DB: status = "Completed" │
|
|
│ 3. Publishes: execution.status_changed (status: "completed") │
|
|
│ 4. Publishes: execution.completed ────────────┐ │
|
|
└─────────────────────────────────────────────────┼───────────────┘
|
|
│
|
|
┌────────────────────────────────┼───────────────┐
|
|
│ │ │
|
|
▼ ▼ │
|
|
┌─────────────────────────────┐ ┌──────────────────────────────┤
|
|
│ ExecutionManager │ │ CompletionListener │
|
|
│ │ │ │
|
|
│ Receives: │ │ Receives: execution.completed│
|
|
│ execution.status_changed │ │ │
|
|
│ │ │ → notify_completion() │
|
|
│ → handle_completion() │ │ → Decrements active_count ✅ │
|
|
│ → publish_completion_notif()│ └──────────────────────────────┘
|
|
│ │
|
|
│ Publishes: execution.completed ───────┐
|
|
└─────────────────────────────┘ │
|
|
│
|
|
┌─────────────────────┘
|
|
│
|
|
▼
|
|
┌────────────────────────────┐
|
|
│ CompletionListener (again) │
|
|
│ │
|
|
│ Receives: execution.completed (2nd time!)
|
|
│ │
|
|
│ → notify_completion() │
|
|
│ → active_count already 0 │
|
|
│ → ⚠️ WARNING LOGGED │
|
|
└────────────────────────────┘
|
|
|
|
Result: 2x completion notifications, 1x warning
|
|
```
|
|
|
|
### After Fix - Message Flow
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────────────┐
|
|
│ Worker Service │
|
|
│ │
|
|
│ 1. Completes action execution │
|
|
│ 2. Updates DB: status = "Completed" │
|
|
│ 3. Publishes: execution.status_changed (status: "completed") │
|
|
│ 4. Publishes: execution.completed ────────────┐ │
|
|
└─────────────────────────────────────────────────┼───────────────┘
|
|
│
|
|
┌────────────────────────────────┼───────────────┐
|
|
│ │ │
|
|
▼ ▼ │
|
|
┌─────────────────────────────┐ ┌──────────────────────────────┤
|
|
│ ExecutionManager │ │ CompletionListener │
|
|
│ │ │ │
|
|
│ Receives: │ │ Receives: execution.completed│
|
|
│ execution.status_changed │ │ │
|
|
│ │ │ → notify_completion() │
|
|
│ → handle_completion() │ │ → Decrements active_count ✅ │
|
|
│ → Handles workflow children │ └──────────────────────────────┘
|
|
│ → NO completion publish ✅ │
|
|
└─────────────────────────────┘
|
|
|
|
Result: 1x completion notification, 0x warnings ✅
|
|
```
|
|
|
|
---
|
|
|
|
## Problem 2: Unnecessary Database Updates
|
|
|
|
### Symptom
|
|
```
|
|
INFO crates/executor/src/execution_manager.rs:108:
|
|
Updated execution 9061 status: Completed -> Completed
|
|
```
|
|
|
|
### Before Fix - Status Update Flow
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────────────┐
|
|
│ Worker Service │
|
|
│ │
|
|
│ 1. Completes action execution │
|
|
│ 2. ExecutionRepository::update() │
|
|
│ status: Running → Completed ✅ │
|
|
│ 3. Publishes: execution.status_changed (status: "completed") │
|
|
└─────────────────────────────────┬───────────────────────────────┘
|
|
│
|
|
│ Message Queue
|
|
│
|
|
▼
|
|
┌─────────────────────────────────────────────────────────────────┐
|
|
│ ExecutionManager │
|
|
│ │
|
|
│ 1. Receives: execution.status_changed (status: "completed") │
|
|
│ 2. Fetches execution from DB │
|
|
│ Current status: Completed │
|
|
│ 3. Sets: execution.status = Completed (same value) │
|
|
│ 4. ExecutionRepository::update() │
|
|
│ status: Completed → Completed ❌ │
|
|
│ 5. Logs: "Updated execution 9061 status: Completed -> Completed"
|
|
└─────────────────────────────────────────────────────────────────┘
|
|
|
|
Result: 2x database writes for same status value
|
|
```
|
|
|
|
### After Fix - Status Update Flow
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────────────┐
|
|
│ Worker Service │
|
|
│ │
|
|
│ 1. Completes action execution │
|
|
│ 2. ExecutionRepository::update() │
|
|
│ status: Running → Completed ✅ │
|
|
│ 3. Publishes: execution.status_changed (status: "completed") │
|
|
└─────────────────────────────────────┬───────────────────────────┘
|
|
│
|
|
│ Message Queue
|
|
│
|
|
▼
|
|
┌─────────────────────────────────────────────────────────────────┐
|
|
│ ExecutionManager │
|
|
│ │
|
|
│ 1. Receives: execution.status_changed (status: "completed") │
|
|
│ 2. Fetches execution from DB │
|
|
│ Current status: Completed │
|
|
│ 3. Compares: old_status (Completed) == new_status (Completed) │
|
|
│ 4. Skips database update ✅ │
|
|
│ 5. Still handles orchestration (workflow children) │
|
|
│ 6. Logs: "Execution 9061 status unchanged, skipping update" │
|
|
└─────────────────────────────────────────────────────────────────┘
|
|
|
|
Result: 1x database write (only when status changes) ✅
|
|
```
|
|
|
|
---
|
|
|
|
## Code Changes
|
|
|
|
### Change 1: Remove Duplicate Completion Publication
|
|
|
|
**File**: `crates/executor/src/execution_manager.rs`
|
|
|
|
```rust
|
|
// BEFORE
|
|
async fn handle_completion(...) -> Result<()> {
|
|
// Handle workflow children...
|
|
|
|
// Publish completion notification
|
|
Self::publish_completion_notification(pool, publisher, execution).await?;
|
|
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
// DUPLICATE - worker already did this!
|
|
Ok(())
|
|
}
|
|
```
|
|
|
|
```rust
|
|
// AFTER
|
|
async fn handle_completion(...) -> Result<()> {
|
|
// Handle workflow children...
|
|
|
|
// NOTE: Completion notification is published by the worker, not here.
|
|
// This prevents duplicate execution.completed messages that would cause
|
|
// the queue manager to decrement active_count twice.
|
|
|
|
Ok(())
|
|
}
|
|
|
|
// Removed entire publish_completion_notification() method
|
|
```
|
|
|
|
### Change 2: Skip Unnecessary Database Updates
|
|
|
|
**File**: `crates/executor/src/execution_manager.rs`
|
|
|
|
```rust
|
|
// BEFORE
|
|
async fn process_status_change(...) -> Result<()> {
|
|
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
|
|
|
|
let old_status = execution.status.clone();
|
|
execution.status = status; // Always set, even if same
|
|
|
|
ExecutionRepository::update(pool, execution.id, execution.clone().into()).await?;
|
|
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
// ALWAYS writes, even if unchanged!
|
|
|
|
info!("Updated execution {} status: {:?} -> {:?}", execution_id, old_status, status);
|
|
|
|
// Handle completion logic...
|
|
Ok(())
|
|
}
|
|
```
|
|
|
|
```rust
|
|
// AFTER
|
|
async fn process_status_change(...) -> Result<()> {
|
|
let execution = ExecutionRepository::find_by_id(pool, execution_id).await?;
|
|
|
|
let old_status = execution.status.clone();
|
|
|
|
// Skip update if status hasn't changed
|
|
if old_status == status {
|
|
debug!("Execution {} status unchanged ({:?}), skipping database update",
|
|
execution_id, status);
|
|
|
|
// Still handle completion logic for orchestration (e.g., workflow children)
|
|
if matches!(status, ExecutionStatus::Completed | ExecutionStatus::Failed | ExecutionStatus::Cancelled) {
|
|
Self::handle_completion(pool, publisher, &execution).await?;
|
|
}
|
|
|
|
return Ok(()); // Early return - no DB write
|
|
}
|
|
|
|
execution.status = status;
|
|
ExecutionRepository::update(pool, execution.id, execution.clone().into()).await?;
|
|
|
|
info!("Updated execution {} status: {:?} -> {:?}", execution_id, old_status, status);
|
|
|
|
// Handle completion logic...
|
|
Ok(())
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Impact & Benefits
|
|
|
|
### Performance Improvements
|
|
|
|
| Metric | Before | After | Improvement |
|
|
|--------|--------|-------|-------------|
|
|
| Completion messages per execution | 2 | 1 | **50% reduction** |
|
|
| Queue manager warnings | Frequent | None | **100% elimination** |
|
|
| Database writes (no status change) | Always | Never | **100% elimination** |
|
|
| Log noise | High | Low | **Significant reduction** |
|
|
|
|
### Typical Execution Flow
|
|
|
|
**Before fixes**:
|
|
- 1x execution completed
|
|
- 2x `execution.completed` messages published
|
|
- 1x unnecessary database write (Completed → Completed)
|
|
- 1x queue manager warning
|
|
- Noisy logs with redundant "status: Completed -> Completed" messages
|
|
|
|
**After fixes**:
|
|
- 1x execution completed
|
|
- 1x `execution.completed` message published (worker only)
|
|
- 0x unnecessary database writes
|
|
- 0x queue manager warnings
|
|
- Clean, informative logs
|
|
|
|
### High-Throughput Scenarios
|
|
|
|
At **1000 executions/minute**:
|
|
|
|
**Before**:
|
|
- 2000 completion messages/min
|
|
- ~1000 unnecessary DB writes/min
|
|
- ~1000 warning logs/min
|
|
|
|
**After**:
|
|
- 1000 completion messages/min (50% reduction)
|
|
- 0 unnecessary DB writes (100% reduction)
|
|
- 0 warning logs (100% reduction)
|
|
|
|
---
|
|
|
|
## Testing
|
|
|
|
✅ All 58 executor unit tests pass
|
|
✅ Zero compiler warnings
|
|
✅ No breaking changes to external behavior
|
|
✅ Orchestration logic (workflow children) still works correctly
|
|
|
|
---
|
|
|
|
## Architecture Clarifications
|
|
|
|
### Separation of Concerns
|
|
|
|
| Component | Responsibility |
|
|
|-----------|----------------|
|
|
| **Worker** | Authoritative source for execution completion, publishes completion notifications |
|
|
| **Executor** | Orchestration (workflows, child executions), NOT completion notifications |
|
|
| **CompletionListener** | Queue management (releases slots for queued executions) |
|
|
|
|
### Idempotency
|
|
|
|
The executor is now **idempotent** with respect to status change messages:
|
|
- Receiving the same status change multiple times has no effect after the first
|
|
- Database is only written when state actually changes
|
|
- Orchestration logic (workflows) runs correctly regardless
|
|
|
|
---
|
|
|
|
## Lessons Learned
|
|
|
|
1. **Message publishers should be explicit** - Only one component should publish a given message type
|
|
2. **Always check for actual changes** - Don't blindly write to database without comparing old/new values
|
|
3. **Separate orchestration from notification** - Workflow logic shouldn't trigger duplicate notifications
|
|
4. **Log levels matter** - Changed redundant updates from INFO to DEBUG to reduce noise
|
|
5. **Trust the source** - Worker owns execution lifecycle; executor shouldn't second-guess it
|
|
|
|
---
|
|
|
|
## Related Documentation
|
|
|
|
- Work Summary: `attune/work-summary/2026-02-09-duplicate-completion-fix.md`
|
|
- Queue Manager: `attune/crates/executor/src/queue_manager.rs`
|
|
- Completion Listener: `attune/crates/executor/src/completion_listener.rs`
|
|
- Execution Manager: `attune/crates/executor/src/execution_manager.rs`
|