[AURON #2045] Fix IndexOutOfBoundsException in UDAF #2046
Flyangz wants to merge 1 commit into apache:master from
Conversation
Pull request overview
This pull request fixes an IndexOutOfBoundsException that occurs when passing sliced Arrow arrays from Rust to Java via FFI in Spark UDAFs. The issue arose because the Java side expects Arrow arrays with 0-based offsets, but Rust was passing sliced arrays that retained their original buffer with non-zero offsets.
Changes:
- Added logic to detect "clean" arrays (arrays with 0-based offsets) and pass them through unchanged
- For sliced arrays, creates new zero-based arrays by reconstructing offsets, value buffers, and null buffers
- Added necessary imports for array reconstruction (ArrayData, BooleanBufferBuilder, Buffer)
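To make the "zero-based reconstruction" step concrete, here is a minimal sketch of the offset-rebasing idea in plain Rust (no arrow dependency; `rebase_offsets` is an illustrative helper, not the PR's actual code). A sliced Arrow `BinaryArray` keeps its parent's offsets buffer, so its first value offset is generally non-zero; rebasing subtracts that first offset from every entry and narrows the values buffer to the matching window:

```rust
/// Rebase a slice of i32 value offsets so the first entry becomes 0.
/// Returns the rebased offsets plus the (start, len) window into the
/// old values buffer that the rebased offsets now refer to.
fn rebase_offsets(offsets: &[i32]) -> (Vec<i32>, (usize, usize)) {
    let first = offsets[0];
    let last = *offsets.last().unwrap();
    // Shift every offset so the sliced view starts at 0.
    let rebased = offsets.iter().map(|o| o - first).collect();
    (rebased, (first as usize, (last - first) as usize))
}

fn main() {
    // Offsets of a sliced array: its values live in bytes 3..9 of the
    // parent's values buffer.
    let (rebased, window) = rebase_offsets(&[3, 5, 9]);
    println!("{rebased:?} {window:?}"); // [0, 2, 6] (3, 6)
}
```

After rebasing, the Java side can interpret the array with a plain 0-based offset, which is what the FFI import expects.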
```rust
let new_null_buf = if binary_array.null_count() == 0 {
    None
} else {
    let mut buffer_builder = BooleanBufferBuilder::new(len);
    for i in 0..len {
        buffer_builder.append(binary_array.is_valid(i));
    }
    Some(buffer_builder.finish().into_inner())
};
```
The null buffer reconstruction iterates through all elements calling is_valid, which can be inefficient for large arrays. Consider using the existing null buffer from array_data.nulls() and slicing it appropriately, similar to how the value buffer is sliced. If slicing the null buffer directly is not straightforward due to bit alignment, the current approach is acceptable but could be optimized.
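The "bit alignment" caveat is worth unpacking: Arrow packs validity bits LSB-first, so a slice starting at logical index `offset` generally begins mid-byte and cannot be carved out with a plain byte-range copy. A small dependency-free sketch (the `is_valid` helper is illustrative, not arrow's API) shows the bit arithmetic involved:

```rust
/// Read validity bit `offset + i` out of an LSB-first packed bitmap.
fn is_valid(bitmap: &[u8], offset: usize, i: usize) -> bool {
    let bit = offset + i;
    (bitmap[bit / 8] >> (bit % 8)) & 1 == 1
}

fn main() {
    // One byte, 0b1011_0101: validity pattern 1,0,1,0,1,1,0,1 (LSB first).
    let bitmap = [0b1011_0101u8];
    // View a slice starting at logical offset 3, length 4: bits 3..7.
    // The slice starts mid-byte, which is why a byte copy can't express it.
    let slice: Vec<bool> = (0..4).map(|i| is_valid(&bitmap, 3, i)).collect();
    println!("{slice:?}"); // [false, true, true, false]
}
```

This is why slicing a validity bitmap to a new 0-based buffer either requires bit-shifting every byte or, as the PR does, rebuilding the bitmap bit by bit.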
Suggested change:

```diff
-let new_null_buf = if binary_array.null_count() == 0 {
-    None
-} else {
-    let mut buffer_builder = BooleanBufferBuilder::new(len);
-    for i in 0..len {
-        buffer_builder.append(binary_array.is_valid(i));
-    }
-    Some(buffer_builder.finish().into_inner())
-};
+let new_null_buf = array_data.nulls().cloned();
```
```rust
let ffi_imported_rows = FFI_ArrowArray::new(&array_to_export.to_data());
let rows = jni_call!(SparkUDAFWrapperContext(self.jcontext.as_obj())
    .importRows(&ffi_imported_rows as *const FFI_ArrowArray as i64) -> JObject)?;
self.obj = jni_new_global_ref!(rows.as_obj())?;
```
The assertion that validated the number of imported rows matches the expected count has been removed. Consider adding back a validation check after the importRows call to ensure the Java side processed the correct number of rows. You can use array.len() (or binary_array.len() since it's still in scope) to validate against self.num_records() after line 454.
Suggested change:

```diff
 self.obj = jni_new_global_ref!(rows.as_obj())?;
+let expected_rows = array_to_export.len();
+let imported_rows = self.num_records();
+assert_eq!(
+    imported_rows,
+    expected_rows,
+    "SparkUDAFBufferRowsColumn::unfreeze_from_arrays: Java imported {imported_rows} rows, expected {expected_rows}"
+);
```
```rust
    data_builder = data_builder.null_bit_buffer(Some(buf));
}

let data = unsafe { data_builder.build_unchecked() };
```
Using build_unchecked bypasses ArrayData validation, which could lead to undefined behavior if the constructed data is invalid. Consider using the safe builder API pattern instead. You can replace unsafe { data_builder.build_unchecked() } with data_builder.build()? to ensure the ArrayData is valid before proceeding.
Suggested change:

```diff
-let data = unsafe { data_builder.build_unchecked() };
+let data = data_builder.build()?;
```
```diff
-let ffi_imported_rows = FFI_ArrowArray::new(&arrays[0].to_data());
+let array = &arrays[0];
+let binary_array = downcast_any!(array, BinaryArray)?;
```
The condition checks both binary_array.offset() == 0 and binary_array.value_offsets()[0] == 0, but these checks serve different purposes. The offset() check verifies the array itself is not sliced at the array level, while value_offsets()[0] == 0 checks if the values buffer starts at offset 0. Consider adding a comment explaining this dual check to clarify why both conditions are necessary for determining if the array is "clean".
| // "Clean" means neither the logical array nor its values buffer are sliced: | |
| // - offset() == 0 ensures the array itself has no logical offset (not sliced at array level) | |
| // - value_offsets()[0] == 0 ensures the values buffer also starts at 0 (not sliced in the buffer) |
Which issue does this PR close?
Closes #2045
What changes are included in this PR?
Create new zero-based offsets for the exported array.
Are there any user-facing changes?
No
How was this patch tested?