
perf: DictionaryArray passthrough optimization for Parquet writer #9520

Open
asuresh8 wants to merge 3 commits into apache:main from asuresh8:dict_array_optimization

Conversation

@asuresh8 commented Mar 6, 2026

Which issue does this PR close?

Related to the optimization opportunity noted in #2322 (specifically this comment by @tustvold suggesting that DictionaryArray inputs could avoid re-hashing dictionary values).

Rationale for this change

When writing a DictionaryArray column with dictionary encoding enabled, the current ByteArrayEncoder interns every row's string value individually via self.interner.intern(value.as_ref()). This performs O(N) hash operations where N is the number of rows -- even though the input DictionaryArray already has all unique values pre-indexed with only D unique dictionary entries (D << N for low-cardinality data).

This is wasteful for the common case of categorical columns (status codes, country codes, enum-like strings, etc.) where D might be 5-50 while N is thousands or millions.

Why fewer hashes matter: Each intern() call performs a full ahash computation of the byte slice plus a hashbrown table probe. For a column with 15 unique values across 4096 rows, the existing code performs 4,096 hash+lookup operations. With this optimization, it performs only 15 hash+lookup operations plus 4,096 simple Vec index lookups -- effectively free compared to hashing.

The cost scales linearly with N. At millions of rows (common in analytics workloads), the redundant hashing becomes the dominant cost of dictionary encoding for low-cardinality string columns.

What changes are included in this PR?

A new encode_with_remap method on DictEncoder that builds a lazy remap table of size O(D):

  1. Allocate a Vec<Option<u64>> of length D (the dictionary size)
  2. For each row, extract the dictionary key (a simple integer index lookup)
  3. Check remap[key]: if Some(interned_id), use the cached value (no hash). If None, intern the value once (one hash), cache the result in remap[key]
  4. Each unique dictionary value is interned exactly once
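The steps above can be sketched as a standalone program. This is a minimal illustration, not the PR's actual code: `Interner` here is a HashMap-based stand-in for the encoder's internal interner, and `encode_with_remap`'s signature is simplified to plain slices.

```rust
use std::collections::HashMap;

/// Stand-in interner: one hash + table probe per call (illustrative
/// substitute for the real encoder's interner).
struct Interner {
    ids: HashMap<Vec<u8>, u64>,
}

impl Interner {
    fn new() -> Self {
        Self { ids: HashMap::new() }
    }

    /// Returns a stable id for `value`, assigning the next id on first sight.
    fn intern(&mut self, value: &[u8]) -> u64 {
        let next = self.ids.len() as u64;
        *self.ids.entry(value.to_vec()).or_insert(next)
    }
}

/// Lazy remap: intern each unique dictionary value at most once, then
/// resolve every row key with a plain Vec index lookup (no hashing).
fn encode_with_remap(
    dict_values: &[&[u8]], // the D unique dictionary values
    keys: &[usize],        // the N row keys into dict_values
    interner: &mut Interner,
) -> Vec<u64> {
    // Step 1: remap table of size D, lazily filled.
    let mut remap: Vec<Option<u64>> = vec![None; dict_values.len()];
    keys.iter()
        .map(|&key| {
            // Steps 2-4: cached id if present, otherwise intern once.
            *remap[key].get_or_insert_with(|| interner.intern(dict_values[key]))
        })
        .collect()
}

fn main() {
    let dict: Vec<&[u8]> = vec![b"US", b"DE", b"JP"];
    let keys = [0, 1, 0, 2, 1, 0];
    let mut interner = Interner::new();
    let indices = encode_with_remap(&dict, &keys, &mut interner);
    assert_eq!(indices, vec![0, 1, 0, 2, 1, 0]);
    // Only 3 hash+probe operations happened, despite 6 rows.
    assert_eq!(interner.ids.len(), 3);
}
```

Because remap entries are filled in first-occurrence order of the keys, the interned ids come out in the same order the row-by-row path would produce them, which is what makes the output byte-identical.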

The optimization activates in write_gather when all of these hold:

  • The input is a DictionaryArray (checked via data_type())
  • Dictionary encoding is enabled (self.dict_encoder.is_some())
  • The dictionary is low-cardinality: D <= N/2 (the remap table overhead is not worthwhile when most keys are unique)
  • No geo statistics accumulator is active (geo stats need per-value processing)

When any condition is not met, the code falls through to the existing encode path with zero overhead.
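The gate reduces to a cheap predicate. A sketch with hypothetical names (the real check lives in write_gather, and the DictionaryArray type check is assumed done by the caller):

```rust
/// Hypothetical sketch of the activation check described above.
/// `d` = dictionary length, `n` = number of rows in the batch.
fn should_use_remap(d: usize, n: usize, dict_encoding_enabled: bool, has_geo_stats: bool) -> bool {
    dict_encoding_enabled
        && !has_geo_stats
        // Low cardinality: D <= N/2, written as 2D <= N to avoid
        // integer-division edge cases.
        && d * 2 <= n
}

fn main() {
    // 15 unique values over 4096 rows: the optimization activates.
    assert!(should_use_remap(15, 4096, true, false));
    // Mostly-unique dictionary: falls through to the existing path.
    assert!(!should_use_remap(4000, 4096, true, false));
    // Dictionary encoding disabled: never activates.
    assert!(!should_use_remap(15, 4096, false, false));
}
```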

Output is byte-identical to the existing path. The remap table produces the same interned IDs in the same order -- it is purely a caching optimization that avoids redundant hash operations.

Benchmark results

New benchmark string_dictionary_low_cardinality -- 4,096 rows, 15 unique string values (simulating categorical columns):

Configuration    Before     After      Change
default          53.1 us    32.3 us    -39% time, +67% throughput
bloom_filter     85.3 us    50.9 us    -40% time, +70% throughput
parquet_2        54.8 us    34.1 us    -38% time, +64% throughput
zstd             58.4 us    37.6 us    -36% time, +56% throughput
zstd_parquet_2   59.3 us    38.9 us    -35% time, +53% throughput

Existing string_dictionary benchmark (high-cardinality, random data): no change in performance detected (p > 0.05 for all configurations), confirming zero regression on inputs where the optimization does not activate.

Are these changes tested?

Unit tests (in byte_array.rs):

  • test_dict_passthrough_roundtrip -- basic low-cardinality DictionaryArray write+read
  • test_dict_passthrough_roundtrip_to_plain -- DictionaryArray input read back as plain StringArray
  • test_dict_passthrough_data_equivalence -- byte-identical output between Dict and plain paths
  • test_dict_passthrough_null_keys -- DictionaryArray with null keys
  • test_dict_passthrough_mixed_batches -- DictionaryArray then StringArray for same column writer
  • test_dict_passthrough_multiple_row_groups -- multiple row groups with separate dictionaries
  • test_dict_passthrough_statistics_correctness -- min/max statistics match between Dict and plain paths
  • test_dict_passthrough_high_cardinality -- high-cardinality dict with small page size limit (fallback path)

Integration tests (new file parquet/tests/arrow_writer_dictionary.rs):

  • dictionary_roundtrip_low_cardinality -- 4,096-row write+read roundtrip through public API
  • dictionary_and_plain_columns_roundtrip -- mixed DictionaryArray + StringArray columns in same batch
  • dictionary_statistics_match_plain -- statistics from Dict path match plain StringArray path
  • dictionary_multi_row_group_roundtrip -- multi-row-group write+read with DictionaryArray
  • dictionary_with_nulls_roundtrip -- DictionaryArray with null values through public API

Are there any user-facing changes?

No API changes. This is a transparent performance improvement for any user writing DictionaryArray columns with dictionary encoding enabled (the default). The optimization activates automatically for low-cardinality dictionaries and produces byte-identical output.

When writing DictionaryArray columns with dictionary encoding enabled,
use a lazy remap table to avoid re-hashing values that share the same
dictionary key. Reduces hash operations from O(N) to O(D) where D is
the number of unique dictionary values referenced in the batch.

Activates automatically when D <= N/2 (low-cardinality dictionaries).
High-cardinality inputs fall through to the existing path unchanged.

Benchmark results (string_dictionary_low_cardinality, 4096 rows, 15 unique values):
- default:      -39% time (+67% throughput)
- bloom_filter: -40% time (+70% throughput)
- parquet_2:    -38% time (+64% throughput)
- zstd:         -36% time (+56% throughput)

Existing string_dictionary benchmark (high-cardinality): zero regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 6, 2026
@Dandandan (Contributor) commented:

run benchmark arrow_writer

Review comment (Contributor) on:

    }
    };
    self.indices.push(interned);
Can use self.indices.extend instead by using indices.map and returning interned inside.

- Replace reserve+loop+push with extend+map per reviewer feedback
- Fix no-default-features compilation (add #![cfg(feature = "arrow")] to test)
- Fix clippy warnings (&Box<DataType> -> &DataType, vec![] -> array literals)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing dict_array_optimization (6874ee0) to 5ba4515
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=dict_array_optimization
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group                                               dict_array_optimization                main
-----                                               -----------------------                ----
bool/bloom_filter                                   1.01    123.1±0.86µs     8.6 MB/sec    1.00    122.1±2.14µs     8.7 MB/sec
bool/default                                        1.02     54.8±0.85µs    19.3 MB/sec    1.00     53.7±0.59µs    19.8 MB/sec
bool/parquet_2                                      1.03     70.3±0.41µs    15.1 MB/sec    1.00     68.2±1.11µs    15.6 MB/sec
bool/zstd                                           1.01     66.8±1.25µs    15.9 MB/sec    1.00     66.3±0.69µs    16.0 MB/sec
bool/zstd_parquet_2                                 1.03     81.2±1.58µs    13.1 MB/sec    1.00     79.1±0.68µs    13.4 MB/sec
bool_non_null/bloom_filter                          1.00     99.6±0.87µs     5.7 MB/sec    1.00     99.9±0.39µs     5.7 MB/sec
bool_non_null/default                               1.00     21.6±0.28µs    26.5 MB/sec    1.02     21.9±0.15µs    26.1 MB/sec
bool_non_null/parquet_2                             1.00     36.9±0.15µs    15.5 MB/sec    1.07     39.7±0.22µs    14.4 MB/sec
bool_non_null/zstd                                  1.00     31.0±0.12µs    18.5 MB/sec    1.02     31.6±0.12µs    18.1 MB/sec
bool_non_null/zstd_parquet_2                        1.00     47.5±1.81µs    12.0 MB/sec    1.06     50.3±1.54µs    11.4 MB/sec
float_with_nans/bloom_filter                        1.01   863.5±11.62µs    63.6 MB/sec    1.00    856.9±7.86µs    64.1 MB/sec
float_with_nans/default                             1.01    506.2±5.47µs   108.6 MB/sec    1.00    502.6±8.60µs   109.4 MB/sec
float_with_nans/parquet_2                           1.00    742.5±8.45µs    74.0 MB/sec    1.00    741.4±7.19µs    74.1 MB/sec
float_with_nans/zstd                                1.00    671.5±8.59µs    81.8 MB/sec    1.00   670.1±11.54µs    82.0 MB/sec
float_with_nans/zstd_parquet_2                      1.00   910.4±10.06µs    60.4 MB/sec    1.00   910.1±10.52µs    60.4 MB/sec
list_primitive/bloom_filter                         1.00      2.4±0.03ms   876.1 MB/sec    1.00      2.4±0.03ms   880.2 MB/sec
list_primitive/default                              1.02  1756.5±15.13µs  1213.7 MB/sec    1.00  1716.2±39.21µs  1242.2 MB/sec
list_primitive/parquet_2                            1.00  1803.3±14.14µs  1182.2 MB/sec    1.00   1799.0±8.44µs  1185.0 MB/sec
list_primitive/zstd                                 1.01      2.9±0.01ms   725.9 MB/sec    1.00      2.9±0.02ms   735.7 MB/sec
list_primitive/zstd_parquet_2                       1.01      3.0±0.02ms   714.9 MB/sec    1.00      3.0±0.02ms   721.1 MB/sec
list_primitive_non_null/bloom_filter                1.00      2.8±0.03ms   767.2 MB/sec    1.01      2.8±0.02ms   759.3 MB/sec
list_primitive_non_null/default                     1.02  1821.5±52.37µs  1167.9 MB/sec    1.00  1779.0±10.00µs  1195.8 MB/sec
list_primitive_non_null/parquet_2                   1.00  1952.7±15.65µs  1089.4 MB/sec    1.02  1991.7±13.59µs  1068.1 MB/sec
list_primitive_non_null/zstd                        1.00      3.9±0.04ms   551.7 MB/sec    1.01      3.9±0.04ms   546.9 MB/sec
list_primitive_non_null/zstd_parquet_2              1.00      4.0±0.04ms   535.5 MB/sec    1.00      4.0±0.03ms   533.7 MB/sec
primitive/bloom_filter                              1.07      4.6±0.15ms    38.4 MB/sec    1.00      4.3±0.10ms    41.0 MB/sec
primitive/default                                   1.02    779.4±5.96µs   225.7 MB/sec    1.00    763.6±3.48µs   230.4 MB/sec
primitive/parquet_2                                 1.00    783.2±6.13µs   224.6 MB/sec    1.05   824.1±74.58µs   213.5 MB/sec
primitive/zstd                                      1.03  1095.2±28.45µs   160.6 MB/sec    1.00   1066.5±7.37µs   165.0 MB/sec
primitive/zstd_parquet_2                            1.01  1034.4±11.93µs   170.1 MB/sec    1.00   1022.4±5.53µs   172.1 MB/sec
primitive_non_null/bloom_filter                     1.02  1686.9±50.03µs   102.3 MB/sec    1.00  1655.8±28.45µs   104.2 MB/sec
primitive_non_null/default                          1.00    605.1±3.00µs   285.1 MB/sec    1.02    615.8±4.76µs   280.1 MB/sec
primitive_non_null/parquet_2                        1.00    610.6±5.18µs   282.6 MB/sec    1.02    623.2±8.66µs   276.8 MB/sec
primitive_non_null/zstd                             1.00    890.6±6.14µs   193.7 MB/sec    1.00    893.6±7.56µs   193.0 MB/sec
primitive_non_null/zstd_parquet_2                   1.00   899.4±11.92µs   191.8 MB/sec    1.01   910.5±11.04µs   189.5 MB/sec
string/bloom_filter                                 1.00  1281.9±14.82µs  1597.8 MB/sec    1.07  1366.7±21.09µs  1498.5 MB/sec
string/default                                      1.00    786.9±8.45µs     2.5 GB/sec    1.08   847.8±11.76µs     2.4 GB/sec
string/parquet_2                                    1.00   793.0±18.98µs     2.5 GB/sec    1.07    848.5±7.29µs     2.4 GB/sec
string/zstd                                         1.00      2.2±0.01ms   927.4 MB/sec    1.03      2.3±0.01ms   901.9 MB/sec
string/zstd_parquet_2                               1.00      2.2±0.01ms   926.2 MB/sec    1.04      2.3±0.02ms   889.4 MB/sec
string_and_binary_view/bloom_filter                 1.02    618.2±7.43µs   204.1 MB/sec    1.00    605.2±5.11µs   208.5 MB/sec
string_and_binary_view/default                      1.02    372.5±3.12µs   338.8 MB/sec    1.00    365.6±2.24µs   345.2 MB/sec
string_and_binary_view/parquet_2                    1.02    377.0±8.25µs   334.7 MB/sec    1.00    369.3±9.11µs   341.7 MB/sec
string_and_binary_view/zstd                         1.01   607.2±11.42µs   207.8 MB/sec    1.00   599.9±16.80µs   210.3 MB/sec
string_and_binary_view/zstd_parquet_2               1.01    592.6±8.58µs   212.9 MB/sec    1.00   585.6±10.38µs   215.5 MB/sec
string_dictionary/bloom_filter                      1.00    614.6±6.98µs  1679.2 MB/sec    1.07   655.7±13.46µs  1574.0 MB/sec
string_dictionary/default                           1.00    393.1±4.88µs     2.6 GB/sec    1.07    420.3±5.95µs     2.4 GB/sec
string_dictionary/parquet_2                         1.00    391.9±5.04µs     2.6 GB/sec    1.07    420.6±7.04µs     2.4 GB/sec
string_dictionary/zstd                              1.00   1103.2±9.72µs   935.5 MB/sec    1.04  1142.1±27.19µs   903.6 MB/sec
string_dictionary/zstd_parquet_2                    1.00   1103.1±4.51µs   935.5 MB/sec    1.03  1136.8±40.31µs   907.8 MB/sec
string_dictionary_low_cardinality/bloom_filter      1.00    107.3±1.13µs   193.4 MB/sec  
string_dictionary_low_cardinality/default           1.00     63.9±0.17µs   325.1 MB/sec  
string_dictionary_low_cardinality/parquet_2         1.00     64.4±0.23µs   322.6 MB/sec  
string_dictionary_low_cardinality/zstd              1.00     75.5±0.68µs   275.0 MB/sec  
string_dictionary_low_cardinality/zstd_parquet_2    1.00     75.8±0.46µs   274.1 MB/sec  
string_non_null/bloom_filter                        1.00  1779.8±22.62µs  1150.2 MB/sec    1.02  1819.2±24.37µs  1125.3 MB/sec
string_non_null/default                             1.00  1130.6±11.20µs  1810.6 MB/sec    1.04  1180.3±21.30µs  1734.4 MB/sec
string_non_null/parquet_2                           1.00   1146.0±8.16µs  1786.3 MB/sec    1.03   1180.1±9.63µs  1734.7 MB/sec
string_non_null/zstd                                1.00      3.0±0.02ms   673.9 MB/sec    1.03      3.1±0.05ms   656.9 MB/sec
string_non_null/zstd_parquet_2                      1.00      3.1±0.05ms   664.1 MB/sec    1.02      3.1±0.08ms   650.7 MB/sec

@Dandandan (Contributor) commented:

Ah @asuresh8 could you add the new benchmarks in a separate PR?
