Skip to content

fix: add retry logic for resume race condition with pause persistence#3577

Open
guoyangzhen wants to merge 2 commits intosimstudioai:mainfrom
guoyangzhen:fix/resume-race-condition
Open

fix: add retry logic for resume race condition with pause persistence#3577
guoyangzhen wants to merge 2 commits intosimstudioai:mainfrom
guoyangzhen:fix/resume-race-condition

Conversation

@guoyangzhen
Copy link

Problem

When a resume API request arrives within milliseconds of a workflow pausing, persistPauseResult's INSERT may not yet be committed, causing enqueueOrStartResume to throw Paused execution not found.

Race window (~10-50ms):

  1. Workflow reaches pause point
  2. persistPauseResult calls db.insert(pausedExecutions)
  3. INSERT commits to DB
  4. Meanwhile, resume API hits and SELECT ... FOR UPDATE finds nothing

Fix

Retry logic (3 attempts, 200ms delay) in enqueueOrStartResume that catches Paused execution not found errors. Other errors (invalid pause point, already resumed, snapshot not ready) are NOT retried.

Retry is in the manager layer, not the API route, so all callers benefit.

Why not just wrap in a transaction?

Wrapping persistPauseResult in a transaction doesn't help — the race is between the method being called and its await db.insert() returning. PostgreSQL autocommit means INSERT is atomic and visible immediately after await returns. The retry handles the narrow window before persistPauseResult even starts.

Fixes #3081

@cursor
Copy link

cursor bot commented Mar 14, 2026

PR Summary

Medium Risk
Touches pause/resume orchestration and DB transaction flow; while the change is small, incorrect retry conditions or delays could mask real errors or add latency to resumes.

Overview
Adds bounded retry logic to PauseResumeManager.enqueueOrStartResume to handle a race where a resume request arrives before the paused_executions row is visible/committed.

Only the specific "Paused execution not found" case is retried (3 attempts with a 200ms delay); all other validation/queueing errors continue to fail fast, preserving existing behavior.

Written by Cursor Bugbot for commit d16d7bd. This will update automatically on new commits. Configure here.

@vercel
Copy link

vercel bot commented Mar 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Mar 14, 2026 1:02pm

Request Review

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 14, 2026

Greptile Summary

This PR addresses a documented race condition in the human-in-the-loop workflow: when a resume API call arrives within milliseconds of persistPauseResult committing its INSERT, enqueueOrStartResume finds no row and throws Paused execution not found. The fix wraps the existing db.transaction call in a bounded retry loop (up to 4 attempts, 200 ms apart) and only retries on the "not found" error, letting all other errors surface immediately.

Key changes:

  • enqueueOrStartResume now retries up to MAX_RETRIES + 1 (effectively 4) times before giving up, with a 200 ms pause between attempts.
  • Only errors whose message contains 'Paused execution not found' are retried; errors for invalid pause points, already-resumed executions, or unready snapshots propagate immediately.

Issues found:

  • The loop condition attempt <= MAX_RETRIES with MAX_RETRIES = 3 yields 4 total attempts, not 3 as stated in the PR description and inline comments. This is at minimum a documentation mismatch and may be an unintentional extra retry.
  • Error discrimination uses err.message?.includes('Paused execution not found') — a fragile substring check that silently breaks if the error text is ever changed. A custom error class or error code would be more robust and refactor-safe.

Confidence Score: 3/5

  • The fix is logically sound and addresses the stated race condition, but an off-by-one in the retry count and fragile string-based error matching should be resolved before merging.
  • The core approach (bounded retry inside the manager layer, not the API route) is correct and the happy path works as expected. However, the loop executes one more attempt than documented (4 vs. 3), and the retry guard couples to a hard-coded substring that could silently degrade if the error message is ever refactored.
  • apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts — review the retry loop bounds and error discriminator.

Important Files Changed

Filename Overview
apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts Adds a retry loop (up to 4 total attempts, 200ms apart) inside enqueueOrStartResume to handle the race window between persistPauseResult committing a row and a concurrent resume API call reading it. Off-by-one exists between MAX_RETRIES=3 and the loop condition <=, yielding 4 actual attempts instead of the stated 3; error discrimination relies on a fragile substring match rather than a typed error class.

Sequence Diagram

sequenceDiagram
    participant W as Workflow Executor
    participant DB as PostgreSQL
    participant API as Resume API
    participant M as PauseResumeManager

    W->>DB: persistPauseResult (INSERT pausedExecutions)
    note over W,DB: ~10-50ms window before INSERT commits

    API->>M: enqueueOrStartResume(executionId)
    M->>DB: SELECT ... FOR UPDATE (attempt 0)
    DB-->>M: row not found (race condition)
    M->>M: isNotFound=true, wait 200ms

    W->>DB: INSERT commits (row now visible)

    M->>DB: SELECT ... FOR UPDATE (attempt 1)
    DB-->>M: pausedExecution row returned

    alt activeResume exists
        M->>DB: INSERT resumeQueue (status=pending)
        M->>DB: UPDATE pausePoints resumeStatus=queued
        M-->>API: { status: "queued", queuePosition }
    else no active resume
        M->>DB: INSERT resumeQueue (status=claimed)
        M->>DB: UPDATE pausePoints resumeStatus=resuming
        M-->>API: { status: "starting", resumeEntryId, ... }
    end
Loading

Last reviewed commit: 278daf6

Comment on lines +171 to +175
const MAX_RETRIES = 3
const RETRY_DELAY_MS = 200
let lastError: Error | null = null

for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off-by-one: 4 attempts, not 3

The loop condition attempt <= MAX_RETRIES with MAX_RETRIES = 3 results in 4 total attempts (attempts 0, 1, 2, 3), not 3 as stated in the PR description and comments. The variable name MAX_RETRIES typically means the number of retries after the initial try, so 3 retries would mean 4 total attempts. However, the comment says "3 attempts" which implies only 3 total. This discrepancy can lead to confusion and unexpected behavior (an extra 200ms delay on the third retry).

If the intent is 3 total attempts:

Suggested change
const MAX_RETRIES = 3
const RETRY_DELAY_MS = 200
let lastError: Error | null = null
for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
const MAX_ATTEMPTS = 3
const RETRY_DELAY_MS = 200
let lastError: Error | null = null
for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {

If the intent is 3 retries (4 total attempts), update the comment to say "4 attempts (1 initial + 3 retries)".

Comment on lines +291 to +298
} catch (err: any) {
lastError = err
const isNotFound = err.message?.includes('Paused execution not found')
const isLastAttempt = attempt === MAX_RETRIES

if (!isNotFound || isLastAttempt) {
throw err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile string-based error matching

The retry selector err.message?.includes('Paused execution not found') couples the retry logic to the exact wording of the error message. If the message on line 187 is ever changed, refactored, or translated, the retry silently stops working — without any compile-time or test-time warning.

A more robust approach is to use a typed custom error class so the check is structural rather than textual:

class PausedExecutionNotFoundError extends Error {
  constructor() {
    super('Paused execution not found or already resumed')
    this.name = 'PausedExecutionNotFoundError'
  }
}

// At throw site (line 187):
throw new PausedExecutionNotFoundError()

// In the catch block:
const isNotFound = err instanceof PausedExecutionNotFoundError

This makes the intent explicit and refactor-safe.

- Change <= to < so MAX_RETRIES=3 means 3 attempts, not 4
- Update isLastAttempt check accordingly
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Race Condition: Resume request fails with "Paused execution not found" during pause persistence

1 participant