perf: add SIMD-accelerated UTF-8 validation to core arrow crates #9495

Open
lyang24 wants to merge 1 commit into apache:main from lyang24:simdutf8

lyang24 (Contributor) commented Mar 1, 2026

Which issue does this PR close?

Rationale for this change

Add simdutf8 for fast UTF-8 validation in arrow-data, arrow-array, arrow-row, and arrow-csv. A shared check_utf8() utility in arrow-data uses SIMD on the happy path and falls back to std::str::from_utf8 on error for detailed Utf8Error. The feature is default-enabled in the arrow umbrella crate.
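For readers following along, here is a minimal sketch of the fast-path/fallback pattern described above. Only the signature of `check_utf8` is visible in this PR excerpt, so the function bodies below are an assumption about how it could be written, not the PR's actual code. The Cargo feature name `simdutf8` is also assumed, and `describe` is a hypothetical helper added purely for illustration. `simdutf8::basic::from_utf8` is the crate's basic API, whose error type carries no position information, which is why a second pass through `std::str::from_utf8` is needed to recover details.

```rust
use std::str::Utf8Error;

/// Fast path (assumed feature name `simdutf8`): validate with the SIMD crate.
/// Its basic error type has no position info, so on failure we re-validate
/// with the standard library to obtain a detailed `Utf8Error`.
#[cfg(feature = "simdutf8")]
pub fn check_utf8(val: &[u8]) -> Result<&str, Utf8Error> {
    match simdutf8::basic::from_utf8(val) {
        Ok(s) => Ok(s),
        // Invalid input is expected to be rare, so a second pass is acceptable.
        Err(_) => std::str::from_utf8(val),
    }
}

/// Without the feature, defer to the standard library directly.
#[cfg(not(feature = "simdutf8"))]
pub fn check_utf8(val: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(val)
}

/// Hypothetical helper (not part of the PR): shows what the detailed error
/// from the fallback path makes available to callers.
pub fn describe(val: &[u8]) -> String {
    match check_utf8(val) {
        Ok(s) => format!("valid: {}", s),
        Err(e) => format!("invalid after {} valid bytes", e.valid_up_to()),
    }
}
```

With either configuration, `check_utf8(b"ab\xFFcd")` returns an error whose `valid_up_to()` is 2, because the caller-visible error always comes from `std::str::from_utf8`. The trade-off in this sketch is that invalid input is scanned twice, which is usually acceptable when errors are uncommon.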

What changes are included in this PR?

Replaces the standard library's UTF-8 validation with a SIMD implementation (simdutf8).

Are these changes tested?

All tests passed.

Are there any user-facing changes?

No.

github-actions bot added the arrow label (Changes to the arrow crate) on Mar 1, 2026
/// on the happy path for improved throughput. Falls back to `std::str::from_utf8`
/// on the error path to provide a detailed [`std::str::Utf8Error`].
#[inline(always)]
pub fn check_utf8(val: &[u8]) -> Result<&str, std::str::Utf8Error> {
Contributor

Can we unify this with the existing utf8 check?

Contributor Author

lyang24 commented Mar 8, 2026

Hi, do you mean unifying it with the one in the parquet folder?

Contributor

Yeah

github-actions bot added the parquet label (Changes to the parquet crate) on Mar 8, 2026
lyang24 requested a review from Dandandan on March 8, 2026 21:39
lyang24 marked this pull request as ready for review on March 8, 2026 21:39
github-actions bot removed the parquet label (Changes to the parquet crate) on Mar 8, 2026