@jhewers-pf

When a worker process in the ProcessPoolExecutor dies abruptly (OOM kill, segfault, etc.), concurrent.futures raises BrokenProcessPool, a subclass of BrokenExecutor, and the pool becomes unusable. Previously, this was caught as a generic Exception and only logged, so the evolution process failed silently.

Changes

  • Add explicit BrokenExecutor exception handling in run_evolution()
  • Add _recover_process_pool() method that gracefully shuts down the broken executor, runs garbage
    collection, waits briefly for system stabilization, and recreates the pool
  • Re-queue all pending iterations after recovery
  • Track recovery attempts with a limit of 3 consecutive failures to prevent infinite loops
  • Reset recovery counter after successful iterations (only consecutive crashes count toward the limit)
  • Propagate BrokenExecutor from _submit_iteration() for centralized handling
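
For readers who have not opened the diff, the recovery method described in the list above might look roughly like the sketch below. It is not the code from this PR: the `PoolOwner` class, the `pool_factory` parameter, and the `settle_seconds` name are assumptions made for illustration, and only the shutdown, garbage collection, wait, and recreate sequence is taken from the description.

```python
import gc
import time
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Optional


class PoolOwner:
    """Hypothetical holder for an executor; not the class used in this PR."""

    def __init__(
        self,
        pool_factory: Callable[[], Executor] = ProcessPoolExecutor,
        settle_seconds: float = 2.0,  # the description mentions a 2s wait
    ) -> None:
        self._pool_factory = pool_factory
        self._settle_seconds = settle_seconds
        self.executor: Optional[Executor] = pool_factory()

    def _recover_process_pool(self) -> None:
        """Replace a broken executor with a fresh one."""
        old = self.executor
        self.executor = None
        if old is not None:
            try:
                # Do not block on dead workers; drop anything still queued.
                old.shutdown(wait=False, cancel_futures=True)
            except Exception:
                # A broken pool may fail while shutting down; recovery
                # should proceed regardless.
                pass
        # Release objects that still reference the old pool, then give
        # the system a moment to settle before starting new workers.
        gc.collect()
        time.sleep(self._settle_seconds)
        self.executor = self._pool_factory()
```

Passing the factory in as a parameter is a choice made for this sketch so that the recovery path can be exercised without killing a real worker process. `cancel_futures` requires Python 3.9 or later.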

Behavior
When a worker crashes:

  1. Detect BrokenExecutor exception
  2. Collect all pending iteration numbers
  3. Shut down broken pool, run GC, wait 2s
  4. Recreate fresh pool
  5. Re-queue failed iterations
  6. Continue evolution

If 3 crashes occur without any successful iterations in between, evolution stops gracefully.
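
The control flow above, including the consecutive-crash limit, can be sketched as a standalone loop. This is a simplified illustration rather than the actual `run_evolution()` code: `run_with_recovery`, `pool_factory`, and `max_crashes` are hypothetical names, results are collected one at a time in submission order, and the GC and 2s wait steps are left out. It also assumes that "3 consecutive failures" means the run stops on the third crash.

```python
from collections import deque
from concurrent.futures import BrokenExecutor, Executor, Future
from typing import Callable, Deque, Dict, Iterable, List, Tuple

MAX_CONSECUTIVE_CRASHES = 3  # limit stated in the description


def run_with_recovery(
    iterations: Iterable[int],
    task: Callable[[int], object],
    pool_factory: Callable[[], Executor],
    max_crashes: int = MAX_CONSECUTIVE_CRASHES,
) -> Tuple[List[int], List[int]]:
    """Run task(n) for each iteration; return (completed, unfinished)."""
    queue: Deque[int] = deque(iterations)
    pending: Dict[int, Future] = {}  # iteration number -> in-flight future
    completed: List[int] = []
    crashes = 0  # consecutive crashes since the last success
    executor = pool_factory()
    try:
        while queue or pending:
            try:
                # Submit everything that is waiting. An iteration leaves
                # the queue only after its submit call succeeded.
                while queue:
                    number = queue[0]
                    pending[number] = executor.submit(task, number)
                    queue.popleft()
                # Wait for the oldest pending iteration.
                number = next(iter(pending))
                pending[number].result()
                del pending[number]
                completed.append(number)
                crashes = 0  # a success resets the counter
            except BrokenExecutor:
                crashes += 1
                if crashes >= max_crashes:
                    break  # stop gracefully; unfinished work is returned
                # Re-queue all pending iterations in their original order.
                queue.extendleft(reversed(list(pending)))
                pending.clear()
                executor.shutdown(wait=False, cancel_futures=True)
                executor = pool_factory()
    finally:
        executor.shutdown(wait=False, cancel_futures=True)
    return completed, list(pending) + list(queue)
```

Note that this sketch re-queues every pending iteration after a crash, including any that had already finished but had not yet been collected, so some work may run twice. Exceptions raised by `task` itself are not treated as pool failures and propagate to the caller.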

@CLAassistant

CLAassistant commented Feb 5, 2026

CLA assistant check
All committers have signed the CLA.
