perf: DictionaryArray passthrough optimization for Parquet writer #9520

Open

asuresh8 wants to merge 3 commits into apache:main from
Conversation
When writing DictionaryArray columns with dictionary encoding enabled, use a lazy remap table to avoid re-hashing values that share the same dictionary key. This reduces hash operations from O(N) to O(D), where D is the number of unique dictionary values referenced in the batch. The optimization activates automatically when D <= N/2 (low-cardinality dictionaries); high-cardinality inputs fall through to the existing path unchanged.

Benchmark results (string_dictionary_low_cardinality, 4,096 rows, 15 unique values):

- default: -39% time (+67% throughput)
- bloom_filter: -40% time (+70% throughput)
- parquet_2: -38% time (+64% throughput)
- zstd: -36% time (+56% throughput)

Existing string_dictionary benchmark (high-cardinality): zero regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
run benchmark arrow_writer
Dandandan reviewed Mar 7, 2026
```
fresh
    }
};
self.indices.push(interned);
```
Contributor
Can use `self.indices.extend` instead, by using `indices.map` and returning `interned` inside.
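The reviewer's suggestion can be illustrated with a small sketch. This is a toy stand-in, not the actual arrow-rs code: `intern_demo` and its `HashMap`-based interner are hypothetical, and only the push-loop vs. `extend`+`map` contrast matters.

```rust
use std::collections::HashMap;

// Toy interner: map each distinct string to a stable id, collecting
// the per-row ids. Demonstrates replacing `reserve` + loop + `push`
// with a single `extend` over a mapping iterator.
fn intern_demo(values: &[&str]) -> Vec<usize> {
    let mut table: HashMap<String, usize> = HashMap::new();
    let mut indices = Vec::new();

    // Before (as reviewed):
    //   indices.reserve(values.len());
    //   for v in values {
    //       let interned = /* look up or insert */;
    //       indices.push(interned);
    //   }

    // After: `extend` consumes an iterator; the closure does the
    // interning and returns the id, so no explicit loop is needed.
    indices.extend(values.iter().map(|v| {
        let next = table.len();
        *table.entry((*v).to_string()).or_insert(next)
    }));
    indices
}
```

`extend` also lets `Vec` use the iterator's size hint, so the separate `reserve` call becomes unnecessary.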
- Replace reserve+loop+push with extend+map per reviewer feedback
- Fix no-default-features compilation (add #![cfg(feature = "arrow")] to test)
- Fix clippy warnings (&Box<DataType> -> &DataType, vec![] -> array literals)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🤖: Benchmark completed
Contributor
Ah @asuresh8 could you add the new benchmarks in a separate PR?
Which issue does this PR close?
Related to the optimization opportunity noted in #2322 (specifically this comment by @tustvold suggesting that DictionaryArray inputs could avoid re-hashing dictionary values).
Rationale for this change
When writing a `DictionaryArray` column with dictionary encoding enabled, the current `ByteArrayEncoder` interns every row's string value individually via `self.interner.intern(value.as_ref())`. This performs O(N) hash operations, where N is the number of rows, even though the input `DictionaryArray` already has all unique values pre-indexed with only D unique dictionary entries (D << N for low-cardinality data).

This is wasteful for the common case of categorical columns (status codes, country codes, enum-like strings, etc.) where D might be 5-50 while N is thousands or millions.
Why fewer hashes matters: each `intern()` call performs a full ahash computation of the byte slice plus a hashbrown table probe. For a column with 15 unique values across 4,096 rows, the existing code performs 4,096 hash+lookup operations. With this optimization, it performs only 15 hash+lookup operations plus 4,096 simple `Vec` index lookups, which are effectively free compared to hashing.

The cost scales linearly with N. At millions of rows (common in analytics workloads), the redundant hashing becomes the dominant cost of dictionary encoding for low-cardinality string columns.
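The lazy-remap idea can be sketched in miniature. This is an illustrative toy under stated assumptions, not the actual arrow-rs implementation: `Interner` is a `HashMap` stand-in with a hash-operation counter added for demonstration, and the function signature is invented for the example.

```rust
use std::collections::HashMap;

/// Toy stand-in for the real interner: one hash per `intern` call.
struct Interner {
    ids: HashMap<String, u64>,
    hashes: usize, // count of hash+probe operations, for illustration only
}

impl Interner {
    fn intern(&mut self, value: &str) -> u64 {
        self.hashes += 1;
        let next = self.ids.len() as u64;
        *self.ids.entry(value.to_string()).or_insert(next)
    }
}

/// Translate dictionary keys to interned ids, hashing each distinct
/// dictionary entry at most once. `dict_values` has D entries; `keys`
/// has N entries indexing into it.
fn encode_with_remap(
    interner: &mut Interner,
    dict_values: &[&str],
    keys: &[usize],
) -> Vec<u64> {
    // Lazy remap table of size O(D): None = not yet interned.
    let mut remap: Vec<Option<u64>> = vec![None; dict_values.len()];
    keys.iter()
        .map(|&k| {
            // First sight of key k pays one hash; repeats are a Vec lookup.
            *remap[k].get_or_insert_with(|| interner.intern(dict_values[k]))
        })
        .collect()
}
```

With 2 distinct values across 5 rows, the interner hashes twice rather than five times; the remaining lookups are plain indexing.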
What changes are included in this PR?
A new `encode_with_remap` method on `DictEncoder` that builds a lazy remap table of size O(D):

- a `Vec<Option<u64>>` of length D (the dictionary size)
- `remap[key]`: if `Some(interned_id)`, use the cached value (no hash). If `None`, intern the value once (one hash) and cache the result in `remap[key]`

The optimization activates in `write_gather` when all of these hold:

- the input is a `DictionaryArray` (checked via `data_type()`)
- dictionary encoding is active (`self.dict_encoder.is_some()`)
- D <= N/2 (the remap table overhead is not worthwhile when most keys are unique)

When any condition is not met, the code falls through to the existing `encode` path with zero overhead.

Output is byte-identical to the existing path. The remap table produces the same interned IDs in the same order; it is purely a caching optimization that avoids redundant hash operations.
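The gating condition described in the PR can be expressed as a small predicate. This is a hedged sketch only: `should_use_remap` and its parameters are hypothetical names for illustration, not the exact code in `write_gather`.

```rust
/// Decide whether the remap fast path is worthwhile, per the PR's
/// stated conditions. The DictionaryArray type check is assumed to
/// have happened already; here it is folded into the boolean inputs.
fn should_use_remap(dict_len: usize, num_rows: usize, dict_encoder_active: bool) -> bool {
    // D <= N/2: only when most keys repeat does caching interned ids
    // save more hashing than the O(D) table costs to build.
    dict_encoder_active && dict_len * 2 <= num_rows
}
```

For the benchmark shape above (D = 15, N = 4,096) the predicate holds easily; a dictionary with more than N/2 distinct entries falls back to the plain `encode` path.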
Benchmark results
New benchmark `string_dictionary_low_cardinality` -- 4,096 rows, 15 unique string values (simulating categorical columns). Results are in the summary above (-36% to -40% time, depending on configuration).

The existing `string_dictionary` benchmark (high-cardinality, random data) shows no change in performance (p > 0.05 for all configurations), confirming zero regression on inputs where the optimization does not activate.

Are these changes tested?
Unit tests (in `byte_array.rs`):

- `test_dict_passthrough_roundtrip` -- basic low-cardinality DictionaryArray write+read
- `test_dict_passthrough_roundtrip_to_plain` -- DictionaryArray input read back as plain StringArray
- `test_dict_passthrough_data_equivalence` -- byte-identical output between Dict and plain paths
- `test_dict_passthrough_null_keys` -- DictionaryArray with null keys
- `test_dict_passthrough_mixed_batches` -- DictionaryArray then StringArray for same column writer
- `test_dict_passthrough_multiple_row_groups` -- multiple row groups with separate dictionaries
- `test_dict_passthrough_statistics_correctness` -- min/max statistics match between Dict and plain paths
- `test_dict_passthrough_high_cardinality` -- high-cardinality dict with small page size limit (fallback path)

Integration tests (new file `parquet/tests/arrow_writer_dictionary.rs`):

- `dictionary_roundtrip_low_cardinality` -- 4,096-row write+read roundtrip through public API
- `dictionary_and_plain_columns_roundtrip` -- mixed DictionaryArray + StringArray columns in same batch
- `dictionary_statistics_match_plain` -- statistics from Dict path match plain StringArray path
- `dictionary_multi_row_group_roundtrip` -- multi-row-group write+read with DictionaryArray
- `dictionary_with_nulls_roundtrip` -- DictionaryArray with null values through public API

Are there any user-facing changes?
No API changes. This is a transparent performance improvement for any user writing `DictionaryArray` columns with dictionary encoding enabled (the default). The optimization activates automatically for low-cardinality dictionaries and produces byte-identical output.