Add worker crash recovery to ProcessParallelController #395

jhewers-pf · 2026-02-05T13:55:57Z

When a child process in the ProcessPoolExecutor crashes (OOM, segfault, etc.), Python raises BrokenExecutor (for process pools, its subclass BrokenProcessPool) and the pool becomes unusable. Previously, this was caught as a generic Exception and logged, so the evolution process failed silently.
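
To illustrate why the generic handler hid the problem, here is a minimal sketch of telling a broken pool apart from an ordinary iteration error. The function name `classify_failure` and its return labels are hypothetical and not taken from this PR; the only thing relied on is that the standard library's `BrokenProcessPool` is a subclass of `BrokenExecutor`.

```python
from concurrent.futures import BrokenExecutor
from concurrent.futures.process import BrokenProcessPool


def classify_failure(exc: Exception) -> str:
    """Hypothetical helper: decide how a failure should be handled."""
    try:
        raise exc
    except BrokenExecutor:
        # Must come before the generic handler, otherwise a dead pool
        # is treated like any other per-iteration error.
        return "pool-broken"
    except Exception:
        return "iteration-error"


if __name__ == "__main__":
    print(classify_failure(BrokenProcessPool("worker died")))
    print(classify_failure(ValueError("bad program")))
```

The ordering of the `except` clauses is the point: the more specific `BrokenExecutor` clause has to precede `Exception` for the crash to be recognized at all.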

Changes

Add explicit BrokenExecutor exception handling in run_evolution()
Add _recover_process_pool() method that gracefully shuts down the broken executor, runs garbage collection, waits briefly for system stabilization, and recreates the pool
Re-queue all pending iterations after recovery
Track recovery attempts with a limit of 3 consecutive failures to prevent infinite loops
Reset recovery counter after successful iterations (only consecutive crashes count toward the limit)
Propagate BrokenExecutor from _submit_iteration() for centralized handling
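
The recovery step described above could look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: the standalone function `recover_process_pool`, the `make_pool` factory, and the `settle_seconds` parameter are hypothetical names, and the real `_recover_process_pool()` is a controller method whose body is not shown here.

```python
import gc
import time
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Optional


def recover_process_pool(
    broken: Optional[Executor],
    make_pool: Callable[[], Executor],
    settle_seconds: float = 2.0,
) -> Executor:
    """Hypothetical helper: replace a broken executor with a fresh one."""
    if broken is not None:
        try:
            # Do not wait for workers: in a broken pool they may be dead.
            # cancel_futures requires Python 3.9 or newer.
            broken.shutdown(wait=False, cancel_futures=True)
        except Exception:
            # Shutting down an already-broken pool is best effort.
            pass
    gc.collect()                # release references held by the old pool
    time.sleep(settle_seconds)  # brief pause before starting new workers
    return make_pool()


if __name__ == "__main__":
    pool = ProcessPoolExecutor(max_workers=2)
    pool = recover_process_pool(pool, lambda: ProcessPoolExecutor(max_workers=2))
    pool.shutdown()
```

Passing the pool factory in as an argument keeps the helper easy to exercise in tests without starting real worker processes.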

Behavior
When a worker crashes:

Detect BrokenExecutor exception
Collect all pending iteration numbers
Shut down broken pool, run GC, wait 2s
Recreate fresh pool
Re-queue the collected pending iterations
Continue evolution

If 3 crashes occur without any successful iterations in between, evolution stops gracefully.
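
The consecutive-failure rule can be sketched as a small sequential loop. This is a simplified illustration, not the PR's `run_evolution()`: the names `run_with_recovery`, `run_one`, and `recover` are hypothetical, iterations run one at a time instead of in parallel, and whether the real code attempts a recovery on the final crash is an assumption (here it does not).

```python
from concurrent.futures import BrokenExecutor
from typing import Callable, Iterable, List

MAX_CONSECUTIVE_RECOVERIES = 3  # limit stated in the PR description


def run_with_recovery(
    iterations: Iterable[int],
    run_one: Callable[[int], None],
    recover: Callable[[], None],
    max_recoveries: int = MAX_CONSECUTIVE_RECOVERIES,
) -> List[int]:
    """Run iterations, re-queueing after crashes; return the completed ones."""
    pending = list(iterations)
    completed: List[int] = []
    consecutive = 0
    while pending:
        iteration = pending.pop(0)
        try:
            run_one(iteration)
        except BrokenExecutor:
            consecutive += 1
            if consecutive >= max_recoveries:
                break  # too many crashes in a row: stop gracefully
            recover()
            pending.insert(0, iteration)  # re-queue the interrupted iteration
            continue
        completed.append(iteration)
        consecutive = 0  # any success resets the counter
    return completed
```

Because the counter resets on every success, a long run that crashes occasionally keeps going, while a pool that cannot complete any iteration stops after the third crash in a row.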

CLAassistant · 2026-02-05T13:56:04Z

All committers have signed the CLA.

jhewers-pf added 2 commits February 5, 2026 13:54

fix: Add worker crash recovery to ProcessParallelController

2e7f739

chore: Test worker crash recovery

fb286f0
