fix: add retry logic for resume race condition with pause persistence#3577
fix: add retry logic for resume race condition with pause persistence#3577guoyangzhen wants to merge 2 commits intosimstudioai:mainfrom
Conversation
PR SummaryMedium Risk Overview Only the specific "Paused execution not found" case is retried (3 attempts with a 200ms delay); all other validation/queueing errors continue to fail fast, preserving existing behavior. Written by Cursor Bugbot for commit d16d7bd. This will update automatically on new commits. Configure here. |
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
Greptile SummaryThis PR addresses a documented race condition in the human-in-the-loop workflow: when a resume API call arrives within milliseconds of Key changes:
Issues found:
Confidence Score: 3/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant W as Workflow Executor
participant DB as PostgreSQL
participant API as Resume API
participant M as PauseResumeManager
W->>DB: persistPauseResult (INSERT pausedExecutions)
note over W,DB: ~10-50ms window before INSERT commits
API->>M: enqueueOrStartResume(executionId)
M->>DB: SELECT ... FOR UPDATE (attempt 0)
DB-->>M: row not found (race condition)
M->>M: isNotFound=true, wait 200ms
W->>DB: INSERT commits (row now visible)
M->>DB: SELECT ... FOR UPDATE (attempt 1)
DB-->>M: pausedExecution row returned
alt activeResume exists
M->>DB: INSERT resumeQueue (status=pending)
M->>DB: UPDATE pausePoints resumeStatus=queued
M-->>API: { status: "queued", queuePosition }
else no active resume
M->>DB: INSERT resumeQueue (status=claimed)
M->>DB: UPDATE pausePoints resumeStatus=resuming
M-->>API: { status: "starting", resumeEntryId, ... }
end
Last reviewed commit: 278daf6 |
| const MAX_RETRIES = 3 | ||
| const RETRY_DELAY_MS = 200 | ||
| let lastError: Error | null = null | ||
|
|
||
| for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) { |
There was a problem hiding this comment.
Off-by-one: 4 attempts, not 3
The loop condition attempt <= MAX_RETRIES with MAX_RETRIES = 3 results in 4 total attempts (attempts 0, 1, 2, 3), not 3 as stated in the PR description and comments. The variable name MAX_RETRIES typically means the number of retries after the initial try, so 3 retries would mean 4 total attempts. However, the comment says "3 attempts" which implies only 3 total. This discrepancy can lead to confusion and unexpected behavior (an extra 200ms delay on the third retry).
If the intent is 3 total attempts:
| const MAX_RETRIES = 3 | |
| const RETRY_DELAY_MS = 200 | |
| let lastError: Error | null = null | |
| for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) { | |
| const MAX_ATTEMPTS = 3 | |
| const RETRY_DELAY_MS = 200 | |
| let lastError: Error | null = null | |
| for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) { |
If the intent is 3 retries (4 total attempts), update the comment to say "4 attempts (1 initial + 3 retries)".
| } catch (err: any) { | ||
| lastError = err | ||
| const isNotFound = err.message?.includes('Paused execution not found') | ||
| const isLastAttempt = attempt === MAX_RETRIES | ||
|
|
||
| if (!isNotFound || isLastAttempt) { | ||
| throw err | ||
| } |
There was a problem hiding this comment.
Fragile string-based error matching
The retry selector err.message?.includes('Paused execution not found') couples the retry logic to the exact wording of the error message. If the message on line 187 is ever changed, refactored, or translated, the retry silently stops working — without any compile-time or test-time warning.
A more robust approach is to use a typed custom error class so the check is structural rather than textual:
class PausedExecutionNotFoundError extends Error {
constructor() {
super('Paused execution not found or already resumed')
this.name = 'PausedExecutionNotFoundError'
}
}
// At throw site (line 187):
throw new PausedExecutionNotFoundError()
// In the catch block:
const isNotFound = err instanceof PausedExecutionNotFoundErrorThis makes the intent explicit and refactor-safe.
- Change <= to < so MAX_RETRIES=3 means 3 attempts, not 4 - Update isLastAttempt check accordingly
Problem
When a resume API request arrives within milliseconds of a workflow pausing,
persistPauseResult's INSERT may not yet be committed, causingenqueueOrStartResumeto throwPaused execution not found.Race window (~10-50ms):
persistPauseResultcallsdb.insert(pausedExecutions)SELECT ... FOR UPDATEfinds nothingFix
Retry logic (3 attempts, 200ms delay) in
enqueueOrStartResumethat catchesPaused execution not founderrors. Other errors (invalid pause point, already resumed, snapshot not ready) are NOT retried.Retry is in the manager layer, not the API route, so all callers benefit.
Why not just wrap in a transaction?
Wrapping
persistPauseResultin a transaction doesn't help — the race is between the method being called and itsawait db.insert()returning. PostgreSQL autocommit means INSERT is atomic and visible immediately afterawaitreturns. The retry handles the narrow window beforepersistPauseResulteven starts.Fixes #3081