
perf: DictionaryArray passthrough optimization for Parquet writer #9520

Open
asuresh8 wants to merge 3 commits into apache:main from asuresh8:dict_array_optimization

Conversation

@asuresh8 commented Mar 6, 2026

Which issue does this PR close?

Related to the optimization opportunity noted in #2322 (specifically this comment by @tustvold suggesting that DictionaryArray inputs could avoid re-hashing dictionary values).

Rationale for this change

When writing a DictionaryArray column with dictionary encoding enabled, the current ByteArrayEncoder interns every row's string value individually via self.interner.intern(value.as_ref()). This performs O(N) hash operations where N is the number of rows -- even though the input DictionaryArray already has all unique values pre-indexed with only D unique dictionary entries (D << N for low-cardinality data).

This is wasteful for the common case of categorical columns (status codes, country codes, enum-like strings, etc.) where D might be 5-50 while N is thousands or millions.

Why fewer hashes matter: Each intern() call performs a full ahash computation of the byte slice plus a hashbrown table probe. For a column with 15 unique values across 4096 rows, the existing code performs 4,096 hash+lookup operations. With this optimization, it performs only 15 hash+lookup operations plus 4,096 simple Vec index lookups -- effectively free compared to hashing.

The cost scales linearly with N. At millions of rows (common in analytics workloads), the redundant hashing becomes the dominant cost of dictionary encoding for low-cardinality string columns.

What changes are included in this PR?

A new encode_with_remap method on DictEncoder that builds a lazy remap table of size O(D):

  1. Allocate a Vec<Option<u64>> of length D (the dictionary size)
  2. For each row, extract the dictionary key (a simple integer index lookup)
  3. Check remap[key]: if Some(interned_id), use the cached value (no hash). If None, intern the value once (one hash), cache the result in remap[key]
  4. Each unique dictionary value is interned exactly once
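The steps above can be sketched as a standalone program. This is a minimal illustration, not the PR's actual code: `Interner` here is a HashMap-based stand-in for the encoder's internal interner, and `encode_with_remap`'s signature is simplified to plain slices.

```rust
use std::collections::HashMap;

/// Stand-in interner: one hash + table probe per call (illustrative
/// substitute for the real encoder's interner).
struct Interner {
    ids: HashMap<Vec<u8>, u64>,
}

impl Interner {
    fn new() -> Self {
        Self { ids: HashMap::new() }
    }

    /// Returns a stable id for `value`, assigning the next id on first sight.
    fn intern(&mut self, value: &[u8]) -> u64 {
        let next = self.ids.len() as u64;
        *self.ids.entry(value.to_vec()).or_insert(next)
    }
}

/// Lazy remap: intern each unique dictionary value at most once, then
/// resolve every row key with a plain Vec index lookup (no hashing).
fn encode_with_remap(
    dict_values: &[&[u8]], // the D unique dictionary values
    keys: &[usize],        // the N row keys into dict_values
    interner: &mut Interner,
) -> Vec<u64> {
    // Step 1: remap table of size D, lazily filled.
    let mut remap: Vec<Option<u64>> = vec![None; dict_values.len()];
    keys.iter()
        .map(|&key| {
            // Steps 2-4: cached id if present, otherwise intern once.
            *remap[key].get_or_insert_with(|| interner.intern(dict_values[key]))
        })
        .collect()
}

fn main() {
    let dict: Vec<&[u8]> = vec![b"US", b"DE", b"JP"];
    let keys = [0, 1, 0, 2, 1, 0];
    let mut interner = Interner::new();
    let indices = encode_with_remap(&dict, &keys, &mut interner);
    assert_eq!(indices, vec![0, 1, 0, 2, 1, 0]);
    // Only 3 hash+probe operations happened, despite 6 rows.
    assert_eq!(interner.ids.len(), 3);
}
```

Because remap entries are filled in first-occurrence order of the keys, the interned ids come out in the same order the row-by-row path would produce them, which is what makes the output byte-identical.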

The optimization activates in write_gather when all of these hold:

  • The input is a DictionaryArray (checked via data_type())
  • Dictionary encoding is enabled (self.dict_encoder.is_some())
  • The dictionary is low-cardinality: D <= N/2 (the remap table overhead is not worthwhile when most keys are unique)
  • No geo statistics accumulator is active (geo stats need per-value processing)

When any condition is not met, the code falls through to the existing encode path with zero overhead.
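The gate reduces to a cheap predicate. A sketch with hypothetical names (the real check lives in write_gather, and the DictionaryArray type check is assumed done by the caller):

```rust
/// Hypothetical sketch of the activation check described above.
/// `d` = dictionary length, `n` = number of rows in the batch.
fn should_use_remap(d: usize, n: usize, dict_encoding_enabled: bool, has_geo_stats: bool) -> bool {
    dict_encoding_enabled
        && !has_geo_stats
        // Low cardinality: D <= N/2, written as 2D <= N to avoid
        // integer-division edge cases.
        && d * 2 <= n
}

fn main() {
    // 15 unique values over 4096 rows: the optimization activates.
    assert!(should_use_remap(15, 4096, true, false));
    // Mostly-unique dictionary: falls through to the existing path.
    assert!(!should_use_remap(4000, 4096, true, false));
    // Dictionary encoding disabled: never activates.
    assert!(!should_use_remap(15, 4096, false, false));
}
```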

Output is byte-identical to the existing path. The remap table produces the same interned IDs in the same order -- it is purely a caching optimization that avoids redundant hash operations.

Benchmark results

New benchmark string_dictionary_low_cardinality -- 4,096 rows, 15 unique string values (simulating categorical columns):

Configuration    Before     After      Change
default          53.1 us    32.3 us    -39% time, +67% throughput
bloom_filter     85.3 us    50.9 us    -40% time, +70% throughput
parquet_2        54.8 us    34.1 us    -38% time, +64% throughput
zstd             58.4 us    37.6 us    -36% time, +56% throughput
zstd_parquet_2   59.3 us    38.9 us    -35% time, +53% throughput

Existing string_dictionary benchmark (high-cardinality, random data): no change in performance detected (p > 0.05 for all configurations), confirming zero regression on inputs where the optimization does not activate.

Are these changes tested?

Unit tests (in byte_array.rs):

  • test_dict_passthrough_roundtrip -- basic low-cardinality DictionaryArray write+read
  • test_dict_passthrough_roundtrip_to_plain -- DictionaryArray input read back as plain StringArray
  • test_dict_passthrough_data_equivalence -- byte-identical output between Dict and plain paths
  • test_dict_passthrough_null_keys -- DictionaryArray with null keys
  • test_dict_passthrough_mixed_batches -- DictionaryArray then StringArray for same column writer
  • test_dict_passthrough_multiple_row_groups -- multiple row groups with separate dictionaries
  • test_dict_passthrough_statistics_correctness -- min/max statistics match between Dict and plain paths
  • test_dict_passthrough_high_cardinality -- high-cardinality dict with small page size limit (fallback path)

Integration tests (new file parquet/tests/arrow_writer_dictionary.rs):

  • dictionary_roundtrip_low_cardinality -- 4,096-row write+read roundtrip through public API
  • dictionary_and_plain_columns_roundtrip -- mixed DictionaryArray + StringArray columns in same batch
  • dictionary_statistics_match_plain -- statistics from Dict path match plain StringArray path
  • dictionary_multi_row_group_roundtrip -- multi-row-group write+read with DictionaryArray
  • dictionary_with_nulls_roundtrip -- DictionaryArray with null values through public API

Are there any user-facing changes?

No API changes. This is a transparent performance improvement for any user writing DictionaryArray columns with dictionary encoding enabled (the default). The optimization activates automatically for low-cardinality dictionaries and produces byte-identical output.

When writing DictionaryArray columns with dictionary encoding enabled,
use a lazy remap table to avoid re-hashing values that share the same
dictionary key. Reduces hash operations from O(N) to O(D) where D is
the number of unique dictionary values referenced in the batch.

Activates automatically when D <= N/2 (low-cardinality dictionaries).
High-cardinality inputs fall through to the existing path unchanged.

Benchmark results (string_dictionary_low_cardinality, 4096 rows, 15 unique values):
- default:      -39% time (+67% throughput)
- bloom_filter: -40% time (+70% throughput)
- parquet_2:    -38% time (+64% throughput)
- zstd:         -36% time (+56% throughput)

Existing string_dictionary benchmark (high-cardinality): zero regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 6, 2026
@Dandandan (Contributor) commented:

run benchmark arrow_writer

Review comment (Contributor) on:

    }
    };
    self.indices.push(interned);
Can use self.indices.extend instead by using indices.map and returning interned inside.

- Replace reserve+loop+push with extend+map per reviewer feedback
- Fix no-default-features compilation (add #![cfg(feature = "arrow")] to test)
- Fix clippy warnings (&Box<DataType> -> &DataType, vec![] -> array literals)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing dict_array_optimization (6874ee0) to 5ba4515
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=dict_array_optimization
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group                                               dict_array_optimization                main
-----                                               -----------------------                ----
bool/bloom_filter                                   1.01    123.1±0.86µs     8.6 MB/sec    1.00    122.1±2.14µs     8.7 MB/sec
bool/default                                        1.02     54.8±0.85µs    19.3 MB/sec    1.00     53.7±0.59µs    19.8 MB/sec
bool/parquet_2                                      1.03     70.3±0.41µs    15.1 MB/sec    1.00     68.2±1.11µs    15.6 MB/sec
bool/zstd                                           1.01     66.8±1.25µs    15.9 MB/sec    1.00     66.3±0.69µs    16.0 MB/sec
bool/zstd_parquet_2                                 1.03     81.2±1.58µs    13.1 MB/sec    1.00     79.1±0.68µs    13.4 MB/sec
bool_non_null/bloom_filter                          1.00     99.6±0.87µs     5.7 MB/sec    1.00     99.9±0.39µs     5.7 MB/sec
bool_non_null/default                               1.00     21.6±0.28µs    26.5 MB/sec    1.02     21.9±0.15µs    26.1 MB/sec
bool_non_null/parquet_2                             1.00     36.9±0.15µs    15.5 MB/sec    1.07     39.7±0.22µs    14.4 MB/sec
bool_non_null/zstd                                  1.00     31.0±0.12µs    18.5 MB/sec    1.02     31.6±0.12µs    18.1 MB/sec
bool_non_null/zstd_parquet_2                        1.00     47.5±1.81µs    12.0 MB/sec    1.06     50.3±1.54µs    11.4 MB/sec
float_with_nans/bloom_filter                        1.01   863.5±11.62µs    63.6 MB/sec    1.00    856.9±7.86µs    64.1 MB/sec
float_with_nans/default                             1.01    506.2±5.47µs   108.6 MB/sec    1.00    502.6±8.60µs   109.4 MB/sec
float_with_nans/parquet_2                           1.00    742.5±8.45µs    74.0 MB/sec    1.00    741.4±7.19µs    74.1 MB/sec
float_with_nans/zstd                                1.00    671.5±8.59µs    81.8 MB/sec    1.00   670.1±11.54µs    82.0 MB/sec
float_with_nans/zstd_parquet_2                      1.00   910.4±10.06µs    60.4 MB/sec    1.00   910.1±10.52µs    60.4 MB/sec
list_primitive/bloom_filter                         1.00      2.4±0.03ms   876.1 MB/sec    1.00      2.4±0.03ms   880.2 MB/sec
list_primitive/default                              1.02  1756.5±15.13µs  1213.7 MB/sec    1.00  1716.2±39.21µs  1242.2 MB/sec
list_primitive/parquet_2                            1.00  1803.3±14.14µs  1182.2 MB/sec    1.00   1799.0±8.44µs  1185.0 MB/sec
list_primitive/zstd                                 1.01      2.9±0.01ms   725.9 MB/sec    1.00      2.9±0.02ms   735.7 MB/sec
list_primitive/zstd_parquet_2                       1.01      3.0±0.02ms   714.9 MB/sec    1.00      3.0±0.02ms   721.1 MB/sec
list_primitive_non_null/bloom_filter                1.00      2.8±0.03ms   767.2 MB/sec    1.01      2.8±0.02ms   759.3 MB/sec
list_primitive_non_null/default                     1.02  1821.5±52.37µs  1167.9 MB/sec    1.00  1779.0±10.00µs  1195.8 MB/sec
list_primitive_non_null/parquet_2                   1.00  1952.7±15.65µs  1089.4 MB/sec    1.02  1991.7±13.59µs  1068.1 MB/sec
list_primitive_non_null/zstd                        1.00      3.9±0.04ms   551.7 MB/sec    1.01      3.9±0.04ms   546.9 MB/sec
list_primitive_non_null/zstd_parquet_2              1.00      4.0±0.04ms   535.5 MB/sec    1.00      4.0±0.03ms   533.7 MB/sec
primitive/bloom_filter                              1.07      4.6±0.15ms    38.4 MB/sec    1.00      4.3±0.10ms    41.0 MB/sec
primitive/default                                   1.02    779.4±5.96µs   225.7 MB/sec    1.00    763.6±3.48µs   230.4 MB/sec
primitive/parquet_2                                 1.00    783.2±6.13µs   224.6 MB/sec    1.05   824.1±74.58µs   213.5 MB/sec
primitive/zstd                                      1.03  1095.2±28.45µs   160.6 MB/sec    1.00   1066.5±7.37µs   165.0 MB/sec
primitive/zstd_parquet_2                            1.01  1034.4±11.93µs   170.1 MB/sec    1.00   1022.4±5.53µs   172.1 MB/sec
primitive_non_null/bloom_filter                     1.02  1686.9±50.03µs   102.3 MB/sec    1.00  1655.8±28.45µs   104.2 MB/sec
primitive_non_null/default                          1.00    605.1±3.00µs   285.1 MB/sec    1.02    615.8±4.76µs   280.1 MB/sec
primitive_non_null/parquet_2                        1.00    610.6±5.18µs   282.6 MB/sec    1.02    623.2±8.66µs   276.8 MB/sec
primitive_non_null/zstd                             1.00    890.6±6.14µs   193.7 MB/sec    1.00    893.6±7.56µs   193.0 MB/sec
primitive_non_null/zstd_parquet_2                   1.00   899.4±11.92µs   191.8 MB/sec    1.01   910.5±11.04µs   189.5 MB/sec
string/bloom_filter                                 1.00  1281.9±14.82µs  1597.8 MB/sec    1.07  1366.7±21.09µs  1498.5 MB/sec
string/default                                      1.00    786.9±8.45µs     2.5 GB/sec    1.08   847.8±11.76µs     2.4 GB/sec
string/parquet_2                                    1.00   793.0±18.98µs     2.5 GB/sec    1.07    848.5±7.29µs     2.4 GB/sec
string/zstd                                         1.00      2.2±0.01ms   927.4 MB/sec    1.03      2.3±0.01ms   901.9 MB/sec
string/zstd_parquet_2                               1.00      2.2±0.01ms   926.2 MB/sec    1.04      2.3±0.02ms   889.4 MB/sec
string_and_binary_view/bloom_filter                 1.02    618.2±7.43µs   204.1 MB/sec    1.00    605.2±5.11µs   208.5 MB/sec
string_and_binary_view/default                      1.02    372.5±3.12µs   338.8 MB/sec    1.00    365.6±2.24µs   345.2 MB/sec
string_and_binary_view/parquet_2                    1.02    377.0±8.25µs   334.7 MB/sec    1.00    369.3±9.11µs   341.7 MB/sec
string_and_binary_view/zstd                         1.01   607.2±11.42µs   207.8 MB/sec    1.00   599.9±16.80µs   210.3 MB/sec
string_and_binary_view/zstd_parquet_2               1.01    592.6±8.58µs   212.9 MB/sec    1.00   585.6±10.38µs   215.5 MB/sec
string_dictionary/bloom_filter                      1.00    614.6±6.98µs  1679.2 MB/sec    1.07   655.7±13.46µs  1574.0 MB/sec
string_dictionary/default                           1.00    393.1±4.88µs     2.6 GB/sec    1.07    420.3±5.95µs     2.4 GB/sec
string_dictionary/parquet_2                         1.00    391.9±5.04µs     2.6 GB/sec    1.07    420.6±7.04µs     2.4 GB/sec
string_dictionary/zstd                              1.00   1103.2±9.72µs   935.5 MB/sec    1.04  1142.1±27.19µs   903.6 MB/sec
string_dictionary/zstd_parquet_2                    1.00   1103.1±4.51µs   935.5 MB/sec    1.03  1136.8±40.31µs   907.8 MB/sec
string_dictionary_low_cardinality/bloom_filter      1.00    107.3±1.13µs   193.4 MB/sec  
string_dictionary_low_cardinality/default           1.00     63.9±0.17µs   325.1 MB/sec  
string_dictionary_low_cardinality/parquet_2         1.00     64.4±0.23µs   322.6 MB/sec  
string_dictionary_low_cardinality/zstd              1.00     75.5±0.68µs   275.0 MB/sec  
string_dictionary_low_cardinality/zstd_parquet_2    1.00     75.8±0.46µs   274.1 MB/sec  
string_non_null/bloom_filter                        1.00  1779.8±22.62µs  1150.2 MB/sec    1.02  1819.2±24.37µs  1125.3 MB/sec
string_non_null/default                             1.00  1130.6±11.20µs  1810.6 MB/sec    1.04  1180.3±21.30µs  1734.4 MB/sec
string_non_null/parquet_2                           1.00   1146.0±8.16µs  1786.3 MB/sec    1.03   1180.1±9.63µs  1734.7 MB/sec
string_non_null/zstd                                1.00      3.0±0.02ms   673.9 MB/sec    1.03      3.1±0.05ms   656.9 MB/sec
string_non_null/zstd_parquet_2                      1.00      3.1±0.05ms   664.1 MB/sec    1.02      3.1±0.08ms   650.7 MB/sec

@Dandandan (Contributor) commented:

Ah @asuresh8 could you add the new benchmarks in a separate PR?
