-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: race condition in SpillPool caused by buffered stream #20067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Ahmed hossam <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes an indeterministic test failure in SpillPool caused by a race condition between the writer and reader coordination logic when using a buffered stream.
Changes:
- Removed the
spawn_bufferedwrapper fromread_spill_as_streamto return an unbuffered stream - Removed the unused import of
spawn_bufferedfromspill_manager.rs
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
martin-g
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is not too complex maybe add a test case verifying that there is no racing any more
| max_record_batch_memory, | ||
| ))); | ||
|
|
||
| Ok(spawn_buffered(stream, self.batch_read_buffer_capacity)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems batch_read_buffer_capacity is no more used.
It could be deprecated or maybe even removed.
https://github.com/dekuu5/datafusion/blob/1b8ef43fdd1424a3e4fe2db213fec4e7228788b0/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L276 is a no-op
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay sure i will look into that and the tests also
Signed-off-by: Ahmed hossam <[email protected]>
Signed-off-by: Ahmed hossam <[email protected]>
Which issue does this PR close?
SpillPool#20027Rationale for this change
The SpillPool test spill::spill_pool::channel was failing indeterministically due to a race condition between SpillFile's coordination logic and the spawn_buffered background task used in the stream reader.
What changes are included in this PR?
read_spill_as_stream now returns a normal stream not a buffered stream
Are these changes tested?
No, the fix is for an existing test mentioned in the issue. I wasn't able to find the bug initially, but I found it by stress-testing the original test 100 times in parallel.
Are there any user-facing changes?
no
i still think we should add a buffer stream that is aware of the coordination of the writer and reader but i can't warp my head around it yet.