Skip to content

GH-49410: [C++] Fix if_else null-scalar fast paths for sliced BaseBinary arrays#49443

Open
Ebraam-Ashraf wants to merge 9 commits intoapache:mainfrom
Ebraam-Ashraf:GH-49410-fix-if-else-sliced-string-chunks
Open

GH-49410: [C++] Fix if_else null-scalar fast paths for sliced BaseBinary arrays#49443
Ebraam-Ashraf wants to merge 9 commits intoapache:mainfrom
Ebraam-Ashraf:GH-49410-fix-if-else-sliced-string-chunks

Conversation

@Ebraam-Ashraf
Copy link

@Ebraam-Ashraf Ebraam-Ashraf commented Mar 3, 2026

Rationale for this change

if_else with a null scalar and a sliced BaseBinary array (offset != 0) produces invalid output. The ASA and AAS shortcut paths in scalar_if_else.cc copy offsets without adjusting for the slice offset, and copy data from byte 0 instead of data + offsets[0].

What changes are included in this PR?

Fix (scalar_if_else.cc): In both the ASA and AAS fast paths:

  • Replace the single memcpy for the offsets buffer with a loop that normalizes each offset by subtracting offsets[0] as a base, so output offsets always start at 0.
  • Change the data copy to start from data + base instead of data, so only the bytes belonging to the slice are copied.

Regression test (scalar_if_else_test.cc): Added TYPED_TEST(TestIfElseBaseBinary, IfElseBaseBinarySliced) covering utf8, binary, large_utf8, and large_binary types, testing both the ASA path and AAS path with a sliced array.

Are these changes tested?

Yes. The new IfElseBaseBinarySliced typed test reproduces the bug and passes after the fix. All 416 existing tests continue to pass.

Are there any user-facing changes?

Yes. The bug causes incorrect/invalid data to be produced when if_else is called with a null scalar and a sliced BaseBinary array.

@github-actions
Copy link

github-actions bot commented Mar 3, 2026

⚠️ GitHub issue #49410 has been automatically assigned in GitHub to PR creator.

@Ebraam-Ashraf
Copy link
Author

hi @kou
thanks for your guidance
as requested here's the test that reproduces the bug
Looking forward to any feedback also I'll add the fix in a follow up commit once you confirm the test is fine

{ArrayFromJSON(int64(), "[-1]"), ArrayFromJSON(int32(), "[0]")}));
}

TEST_F(TestIfElseKernel, IfElseBaseBinarySlicedChunk) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this TestIfEleseKernel test to the location where other TestIfEleseKernel tests exist instead of adding it after the TestChooseKernel test?

Can we use better name than IfElseBaseBinarySlicedChunk like other existing test names?

}

TEST_F(TestIfElseKernel, IfElseBaseBinarySlicedChunk) {
for (auto type : {utf8(), binary(), large_utf8(), large_binary()}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to test all these types?

If so, TYPED_TEST(TestIfElseBaseBinary, ...) may be better than TEST_F(TestifEleseKernel, ...).

Comment on lines +3787 to +3809
auto cond_asa = ArrayFromJSON(boolean(), "[true, false, false]");
ASSERT_OK_AND_ASSIGN(auto result_asa,
CallFunction("if_else", {cond_asa, MakeNullScalar(type), chunk1}));
ASSERT_OK(result_asa.make_array()->ValidateFull());
AssertArraysEqual(*ArrayFromJSON(type, R"([null, "x", "x"])"),
*result_asa.make_array(), true);

auto cond_aas = ArrayFromJSON(boolean(), "[false, true, true]");
ASSERT_OK_AND_ASSIGN(auto result_aas,
CallFunction("if_else", {cond_aas, chunk1, MakeNullScalar(type)}));
ASSERT_OK(result_aas.make_array()->ValidateFull());
AssertArraysEqual(*ArrayFromJSON(type, R"([null, "x", "x"])"),
*result_aas.make_array(), true);

auto arr1 = std::make_shared<ChunkedArray>(ArrayVector{chunk0, chunk1});
auto mask = *CallFunction("is_null", {arr1});
ASSERT_OK_AND_ASSIGN(auto arr2_datum,
CallFunction("if_else", {Datum(true), *Concatenate(arr1->chunks()), arr1}));
ASSERT_OK(arr2_datum.chunked_array()->ValidateFull());
ASSERT_OK_AND_ASSIGN(auto arr3_datum,
CallFunction("if_else", {mask, MakeNullScalar(type), arr2_datum}));
ASSERT_OK(arr3_datum.chunked_array()->ValidateFull());
AssertDatumsEqual(Datum(arr1), arr3_datum);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can reproduce this problem without chunked array, right?

If so, we don't need to test chunked array.

@Ebraam-Ashraf
Copy link
Author

thanks for the feedback @kou

I switched from TEST_F(TestIfElseKernel, ...) to TYPED_TEST(TestIfElseBaseBinary, IfElseBaseBinarySliced)
and moved the test to its right place
also removed the chunked array portion

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (125 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] TestIfElseBaseBinary/0.IfElseBaseBinarySliced, where TypeParam = arrow::BinaryType
[  FAILED  ] TestIfElseBaseBinary/1.IfElseBaseBinarySliced, where TypeParam = arrow::LargeBinaryType
[  FAILED  ] TestIfElseBaseBinary/2.IfElseBaseBinarySliced, where TypeParam = arrow::StringType
[  FAILED  ] TestIfElseBaseBinary/3.IfElseBaseBinarySliced, where TypeParam = arrow::LargeStringType

Offset invariant failure: offset for slot 2 out of bounds: 3 > 2

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Could you push a fix for this problem to this branch?

Comment on lines +614 to +615
auto full_arr = ArrayFromJSON(type, R"([null, "x", "x", null, "x", "x"])");
auto sliced = full_arr->Slice(3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify test data?

Suggested change
auto full_arr = ArrayFromJSON(type, R"([null, "x", "x", null, "x", "x"])");
auto sliced = full_arr->Slice(3);
auto full_arr = ArrayFromJSON(type, R"(["not used", null, "x", "x"])");
auto sliced = full_arr->Slice(1);

@Ebraam-Ashraf
Copy link
Author

Tests before fix

[   RUN    ] TestIfElseBaseBinary/0.IfElseBaseBinarySliced
[  FAILED  ] TestIfElseBaseBinary/0.IfElseBaseBinarySliced (20 ms)
[  FAILED  ] TestIfElseBaseBinary/1.IfElseBaseBinarySliced (1 ms)
[  FAILED  ] TestIfElseBaseBinary/2.IfElseBaseBinarySliced (1 ms)
[  FAILED  ] TestIfElseBaseBinary/3.IfElseBaseBinarySliced (0 ms)
[  PASSED  ] 0 tests.
[  FAILED  ] 4 tests.

Invalid: Offset invariant failure: offset for slot 1 out of bounds: 8 > 2

What the fix does

Replaced the single memcpy for the offsets buffer with a loop that subtracts offsets[0] as a base from each value
so the output offsets are always normalized to start at 0.

Then changed the data copy to start from data + base instead of data
so only the bytes belonging to the slice are copied.

The same two changes are applied symmetrically in both the ASA and AAS paths.

Tests after fix

[==========] 416 tests from 132 test suites ran. (31991 ms total)
[  PASSED  ] 416 tests.

@Ebraam-Ashraf Ebraam-Ashraf requested a review from kou March 6, 2026 07:13
@Ebraam-Ashraf
Copy link
Author

@kou

@kou
Copy link
Member

kou commented Mar 7, 2026

Could you fix the lint failure and update the PR title and description?

@Ebraam-Ashraf Ebraam-Ashraf changed the title GH-49410: [C++] Add regression test for if_else with sliced BaseBinary chunks GH-49410: [C++] Fix if_else null-scalar fast paths for sliced BaseBinary arrays Mar 7, 2026
@Ebraam-Ashraf
Copy link
Author

thanks @kou
I really appreciate your patience and guidance throughout this PR
looking forward to your review 🙏

Comment on lines +746 to +753
std::memcpy(out_data->buffers[1]->mutable_data(), right_offsets, offset_length);

auto right_data_length = right_offsets[right.length] - right_offsets[0];
OffsetType base = right_offsets[0];
auto* out_offsets =
reinterpret_cast<OffsetType*>(out_data->buffers[1]->mutable_data());
for (int64_t i = 0; i <= cond.length; ++i) {
out_offsets[i] = right_offsets[i] - base;
}
Copy link
Member

@kou kou Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change always (1) copies right offsets and (2) recomputes right offsets, right?

Can we use do the following?

  1. If right array isn't sliced, we use only (1)
  2. If right array is sliced, we use only (2)

@Ebraam-Ashraf Ebraam-Ashraf requested a review from kou March 8, 2026 14:17
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes if_else fast paths for BaseBinary arrays when one input is a null scalar and the other input is a sliced (non-zero offset) binary/utf8 array, which previously could produce invalid offsets/data.

Changes:

  • Adjust ASA/AAS fast paths in scalar_if_else.cc to normalize copied offsets and copy only the sliced data range.
  • Add a typed regression test covering sliced utf8, binary, large_utf8, and large_binary for both ASA and AAS paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
cpp/src/arrow/compute/kernels/scalar_if_else.cc Fixes offset normalization and data copy base for null-scalar fast paths with sliced BaseBinary arrays.
cpp/src/arrow/compute/kernels/scalar_if_else_test.cc Adds regression coverage for sliced BaseBinary arrays in both ASA and AAS shortcut paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +747 to +755
if (right.offset == 0) {
std::memcpy(out_data->buffers[1]->mutable_data(), right_offsets, offset_length);
} else {
OffsetType base = right_offsets[0];
auto* out_offsets =
reinterpret_cast<OffsetType*>(out_data->buffers[1]->mutable_data());
for (int64_t i = 0; i <= cond.length; ++i) {
out_offsets[i] = right_offsets[i] - base;
}
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision to memcpy offsets based on right.offset == 0 is insufficient. Binary arrays can be valid with offset == 0 but right_offsets[0] != 0 (ValidateFull only requires bounds/monotonicity). In that case this branch would copy non-zero-based offsets while the data copy starts at right_data + right_offsets[0], producing invalid offsets relative to the copied data. Consider keying this optimization off right_offsets[0] == 0 (or always normalizing by subtracting right_offsets[0]) so the output offsets are consistent with the copied data in all valid inputs.

Copilot uses AI. Check for mistakes.
Comment on lines +798 to +806
if (left.offset == 0) {
std::memcpy(out_data->buffers[1]->mutable_data(), left_offsets, offset_length);
} else {
OffsetType base = left_offsets[0];
auto* out_offsets =
reinterpret_cast<OffsetType*>(out_data->buffers[1]->mutable_data());
for (int64_t i = 0; i <= cond.length; ++i) {
out_offsets[i] = left_offsets[i] - base;
}
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as ASA path: using left.offset == 0 to decide whether to memcpy offsets can produce invalid output if left.offset == 0 but left_offsets[0] != 0. Since the data is copied starting at left_data + left_offsets[0], the output offsets should be normalized whenever left_offsets[0] is non-zero (or unconditionally), not only when the array has a non-zero element offset.

Copilot uses AI. Check for mistakes.
@Ebraam-Ashraf
Copy link
Author

@kou
regarding copilot's feedback applied it this is more correct since the data copy already keys on offsets[0] so the branch condition should match
I initially thought offset == 0 with offsets[0] != 0 was impossible but after checking the spec and ValidateFull() both allow it so it's possible to craft such an array manually
do you think a test is needed for this edge case or is the existing sufficient?

@kou
Copy link
Member

kou commented Mar 10, 2026

Can we create such test data easily? If it's difficult/complex, the current test is enough.

@Ebraam-Ashraf
Copy link
Author

It's doable but requires manually crafting raw buffers via ArrayData::Make() like

std::vector<OffsetType> raw_offsets = {8, 8, 9, 10};
std::string raw_data(8, 'x');
raw_data += "xx";
auto array_data = ArrayData::Make(type, 3,
    {nullptr, Buffer::Wrap(raw_offsets), data_buf}, 1, 0);

I can add it if you think it's worth covering but I think adding it not a problem

@kou
Copy link
Member

kou commented Mar 10, 2026

OK. Could you add it?

@Ebraam-Ashraf
Copy link
Author

added the edge case test
before fix

[  FAILED  ] TestIfElseBaseBinary/0.IfElseBaseBinarySliced, where TypeParam = arrow::BinaryType
[  FAILED  ] TestIfElseBaseBinary/1.IfElseBaseBinarySliced, where TypeParam = arrow::LargeBinaryType
[  FAILED  ] TestIfElseBaseBinary/2.IfElseBaseBinarySliced, where TypeParam = arrow::StringType
[  FAILED  ] TestIfElseBaseBinary/3.IfElseBaseBinarySliced, where TypeParam = arrow::LargeStringType

Invalid: Offset invariant failure: offset for slot 1 out of bounds: 8 > 2

after fix

[  PASSED  ] TestIfElseBaseBinary/0.IfElseBaseBinarySliced
[  PASSED  ] TestIfElseBaseBinary/1.IfElseBaseBinarySliced
[  PASSED  ] TestIfElseBaseBinary/2.IfElseBaseBinarySliced
[  PASSED  ] TestIfElseBaseBinary/3.IfElseBaseBinarySliced
[==========] 4 tests passed.

@kou

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants