perf: sort multi-column runs via row format and coalesce them#23221
perf: sort multi-column runs via row format and coalesce them#23221Dandandan wants to merge 1 commit into
Conversation
Follow-up to apache#23202. For multi-column runs whose leading key is an expensive (string/dict/binary) comparison, sort via the Arrow row format (encode once + memcmp argsort) instead of `lexsort_to_indices`, and coalesce all runs into larger ones (not just single-column). Removes the lexicographic-comparator cost that made coalescing multi-column runs regress, so 1.2-1.8x faster on multi-column SortExec benchmarks. Single-column and primitive-leading keys keep `lexsort`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
run benchmark sort_tpch sort_tpch10 tpcds tpch10 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/sort-rowformat-multicol-runs (1b99d3e) to 3bb9314 (merge-base) diff using: sort_tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/sort-rowformat-multicol-runs (1b99d3e) to 3bb9314 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/sort-rowformat-multicol-runs (1b99d3e) to 3bb9314 (merge-base) diff using: tpch10 File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/sort-rowformat-multicol-runs (1b99d3e) to 3bb9314 (merge-base) diff using: sort_tpch10 File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagesort_tpch — base (merge-base)
sort_tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch10 — base (merge-base)
tpch10 — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagesort_tpch10 — base (merge-base)
sort_tpch10 — branch
File an issue against this benchmark runner |
Which issue does this PR close?
Follow-up to #23202 (coalesce single-column sort runs). No specific issue.
Rationale for this change
#23202 coalesces buffered runs into fewer, larger runs to cut merge fan-in, but
only for single-column sorts. Multi-column runs were deliberately left one
run per batch, because coalescing them and sorting the larger run with
lexsort_to_indicesregresses:lexsort's comparator walks the key columnsone at a time with per-comparison dynamic dispatch (and heap touches for
strings/dictionaries), and that cost grows super-linearly with run size.
The fix is to change how a multi-column run is sorted, not just its size: for
keys whose leading column is an expensive (variable-length or dictionary)
comparison, encode the key once into the Arrow row format and argsort the
row indices with a cheap
memcmp. This is the same comparison the streamingmerge already uses (
RowCursorStream), so it is a natural fit — and it removesexactly the comparator cost that made coalescing regress. With that in place we
can coalesce all multi-column runs.
A microbenchmark (
sort_indices, added here) isolates the index-computationcost —
lexsort_to_indicesvs. row-encode+argsort — by key shape and run size(speedup = lexsort / rowsort; >1 means row format is faster):
This is what motivates the gate: row format wins big for string/dict-leading
keys (and more as runs grow), but loses for single-column keys and for
primitive-leading keys (where
lexsortshort-circuits on the cheap firstcolumn). So the row-format path is gated to multi-column keys with an expensive
leading column; everything else keeps
lexsort.End-to-end (
cargo bench --bench sort, theSortExeccases, vs. current main):String/dict-leading tuples win via coalesce + row-format sort; the
primitive-leading
mixed tuplewins ~1.4× purely from coalesce +lexsortonlarger runs (it now also gets coalesced). Single-column sorts are unchanged.
What changes are included in this PR?
sorted_indices()insorts/stream.rs: computes the sorted index permutation,choosing the Arrow row format for multi-column keys with an expensive leading
column (
use_row_format_sort) andlexsort_to_indicesotherwise (singlecolumn, primitive-leading, or
fetch/top-k).IncrementalSortIterator(the single point every in-memory sort run flowsthrough) now calls
sorted_indices(). No merge/cursor changes.in_mem_sort_streamnow coalesces all runs, not just single-column ones.datafusion/core/benches/sort_indices.rs.Are these changes tested?
Existing
sorts::unit tests pass (NULLs, sort options, multi-column merge,spill). Correctness of the row-format ordering is covered by the existing
multi-column sort tests; the new microbenchmark documents the perf rationale.
Are there any user-facing changes?
No (internal performance change only).