
[AURON #2045] Fix IndexOutOfBoundsException in UDAF #2046

Open

Flyangz wants to merge 1 commit into apache:master from Flyangz:bugfix/udaf-iob

Conversation

@Flyangz (Contributor) commented Feb 27, 2026

Which issue does this PR close?

Closes #2045

What changes are included in this PR?

Create new zero-based offsets for the exported array.

Are there any user-facing changes?

No

How was this patch tested?

Copilot AI (Contributor) left a comment

Pull request overview

This pull request fixes an IndexOutOfBoundsException that occurs when passing sliced Arrow arrays from Rust to Java via FFI in Spark UDAFs. The issue arose because the Java side expects Arrow arrays with 0-based offsets, but Rust was passing sliced arrays that retained their original buffer with non-zero offsets.

Changes:

  • Added logic to detect "clean" arrays (arrays with 0-based offsets) and pass them through unchanged
  • Added reconstruction of zero-based arrays for sliced inputs, rebuilding the offsets, value buffer, and null buffer
  • Added the imports needed for array reconstruction (ArrayData, BooleanBufferBuilder, Buffer)
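The rebasing step can be illustrated with a standalone sketch of the idea in plain Rust (no Arrow dependency; the function name and shapes are illustrative, not the PR's actual code): subtract the slice's first offset from every offset so they become zero-based, and copy only the bytes of the value buffer that the slice actually references.

```rust
/// Illustrative sketch: rebase the offsets of a sliced variable-length array
/// so they start at zero, and trim the value buffer accordingly.
/// `offsets` has len + 1 entries, matching Arrow's variable-size layout.
fn rebase_sliced_array(offsets: &[i32], values: &[u8]) -> (Vec<i32>, Vec<u8>) {
    let first = offsets[0];
    let last = *offsets.last().unwrap();
    // New offsets are zero-based: subtract the slice's starting offset.
    let new_offsets: Vec<i32> = offsets.iter().map(|o| o - first).collect();
    // Keep only the bytes actually referenced by the slice.
    let new_values = values[first as usize..last as usize].to_vec();
    (new_offsets, new_values)
}

fn main() {
    // A slice covering two elements: bytes 3..5 and 5..9 of the parent buffer.
    let (offsets, values) = rebase_sliced_array(&[3, 5, 9], b"abcdefghij");
    assert_eq!(offsets, vec![0, 2, 6]);
    assert_eq!(values, b"defghi".to_vec());
    println!("ok");
}
```

After this transformation the importing side can read the array starting at index 0, which is what the Java FFI import expects.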


Comment on lines +428 to +436

```rust
let new_null_buf = if binary_array.null_count() == 0 {
    None
} else {
    let mut buffer_builder = BooleanBufferBuilder::new(len);
    for i in 0..len {
        buffer_builder.append(binary_array.is_valid(i));
    }
    Some(buffer_builder.finish().into_inner())
};
```
Copilot AI commented Feb 27, 2026

The null buffer reconstruction iterates through all elements calling is_valid, which can be inefficient for large arrays. Consider using the existing null buffer from array_data.nulls() and slicing it appropriately, similar to how the value buffer is sliced. If slicing the null buffer directly is not straightforward due to bit alignment, the current approach is acceptable but could be optimized.

Suggested change

```diff
-let new_null_buf = if binary_array.null_count() == 0 {
-    None
-} else {
-    let mut buffer_builder = BooleanBufferBuilder::new(len);
-    for i in 0..len {
-        buffer_builder.append(binary_array.is_valid(i));
-    }
-    Some(buffer_builder.finish().into_inner())
-};
+let new_null_buf = array_data.nulls().cloned();
```

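The reviewer's bit-alignment caveat can be made concrete with a small dependency-free sketch (plain Rust, no Arrow types; the function name is illustrative): validity bitmaps are packed 8 values per byte in LSB order, so slicing at an offset that is not a multiple of 8 requires bit-level copying rather than a simple byte-range slice.

```rust
/// Illustrative sketch: extract `len` validity bits starting at `offset` from
/// an LSB-packed bitmap into a new zero-aligned bitmap. Because the slice
/// boundary need not fall on a byte boundary, each bit is read and re-written
/// individually; arrow-rs's NullBuffer instead tracks a bit offset to stay
/// zero-copy when slicing.
fn slice_bitmap(bitmap: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src = offset + i;
        if (bitmap[src / 8] >> (src % 8)) & 1 == 1 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

fn main() {
    // Bits of 0b1010_1100 read LSB-first: 0,0,1,1,0,1,0,1.
    // Slicing 4 bits starting at offset 2 yields 1,1,0,1 -> 0b0000_1011.
    assert_eq!(slice_bitmap(&[0b1010_1100], 2, 4), vec![0b0000_1011]);
    println!("ok");
}
```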
```rust
let ffi_imported_rows = FFI_ArrowArray::new(&array_to_export.to_data());
let rows = jni_call!(SparkUDAFWrapperContext(self.jcontext.as_obj())
    .importRows(&ffi_imported_rows as *const FFI_ArrowArray as i64) -> JObject)?;
self.obj = jni_new_global_ref!(rows.as_obj())?;
```
Copilot AI commented Feb 27, 2026

The assertion that validated the number of imported rows matches the expected count has been removed. Consider adding back a validation check after the importRows call to ensure the Java side processed the correct number of rows. You can use array.len() (or binary_array.len() since it's still in scope) to validate against self.num_records() after line 454.

Suggested change

```diff
 self.obj = jni_new_global_ref!(rows.as_obj())?;
+let expected_rows = array_to_export.len();
+let imported_rows = self.num_records();
+assert_eq!(
+    imported_rows,
+    expected_rows,
+    "SparkUDAFBufferRowsColumn::unfreeze_from_arrays: Java imported {imported_rows} rows, expected {expected_rows}"
+);
```

```rust
    data_builder = data_builder.null_bit_buffer(Some(buf));
}

let data = unsafe { data_builder.build_unchecked() };
```
Copilot AI commented Feb 27, 2026

Using build_unchecked bypasses ArrayData validation, which could lead to undefined behavior if the constructed data is invalid. Consider using the safe builder API pattern instead. You can replace unsafe { data_builder.build_unchecked() } with data_builder.build()? to ensure the ArrayData is valid before proceeding.

Suggested change

```diff
-let data = unsafe { data_builder.build_unchecked() };
+let data = data_builder.build()?;
```

```rust
let ffi_imported_rows = FFI_ArrowArray::new(&arrays[0].to_data());
let array = &arrays[0];
let binary_array = downcast_any!(array, BinaryArray)?;
```

Copilot AI commented Feb 27, 2026

The condition checks both binary_array.offset() == 0 and binary_array.value_offsets()[0] == 0, but these checks serve different purposes. The offset() check verifies the array itself is not sliced at the array level, while value_offsets()[0] == 0 checks if the values buffer starts at offset 0. Consider adding a comment explaining this dual check to clarify why both conditions are necessary for determining if the array is "clean".

Suggested change

```diff
+// "Clean" means neither the logical array nor its values buffer are sliced:
+// - offset() == 0 ensures the array itself has no logical offset (not sliced at array level)
+// - value_offsets()[0] == 0 ensures the values buffer also starts at 0 (not sliced in the buffer)
```

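The dual condition the reviewer describes can be distilled into a tiny predicate (a dependency-free sketch with illustrative names; the actual code checks `binary_array.offset()` and `binary_array.value_offsets()` directly):

```rust
/// Illustrative sketch of the "clean array" test described above: an array is
/// clean only if it has no logical slice offset AND its first value offset is 0.
fn is_clean(array_offset: usize, value_offsets: &[i32]) -> bool {
    // Either condition alone is insufficient: an array can carry a logical
    // offset over an unsliced values buffer, or be rebuilt at array offset 0
    // on top of a values buffer whose offsets do not start at zero.
    array_offset == 0 && value_offsets.first() == Some(&0)
}

fn main() {
    assert!(is_clean(0, &[0, 2, 6]));  // unsliced: pass through unchanged
    assert!(!is_clean(1, &[0, 2, 6])); // sliced at the array level
    assert!(!is_clean(0, &[3, 5, 9])); // values buffer doesn't start at 0
    println!("ok");
}
```

Only arrays failing this predicate need the zero-based reconstruction before export.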


Development

Successfully merging this pull request may close these issues.

IndexOutOfBoundsException in UDAF

2 participants