Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip Decode, Skip I/O#186
Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip Decode, Skip I/O#186zhuqi-lucas wants to merge 15 commits into
Conversation
A walkthrough of the sort pushdown work landed and in flight on Apache DataFusion. Covers: - Why SortExec is expensive and what `Exact` / `Inexact` ordering mean at runtime (static `fetch` vs `TopK` dynamic filter). - Phase 1 (#19064): the `PushdownSort` rule + reverse row-group case. - Phase 2 (#21182): statistics-based file sort that upgrades `Unsupported` to `Exact`, eliminating the `SortExec` on non-overlapping ASC scans. Includes the BufferExec compensation (#21426) so the SPM above doesn't lose its implicit memory buffer. - Reverse scans: today's row-group reverse (Inexact, #18817) and the community decision to wait for arrow-rs page-level reverse (#9937) before pursuing Exact reverse, after memory-profile pushback on the original row-group-level proposal. - Benchmarks: 2.1×-49× on the ASC-LIMIT sort_pushdown suite. - What's next: the dynamic / TopK-driven path (#21351 merged, #21733, #21712, #21956, #21580) including the precise RG-pruning vs mid-stream-early-return distinction, and the EnsureRequirements unification (#21976). - Links into the prior dynamic filters and limit pruning posts so the series reads as a coherent thread.
The post previously only mentioned #21956 in passing. The PR is landing the full mechanism — `try_pushdown_sort` decision tree, two flags on `ParquetSource`, three composable runtime steps (file reorder + RG reorder + reverse), and a `sort_prefix`- preserving short-circuit — so cover it as a dedicated phase between Phase 2 and the existing reverse-scan section. - TL;DR: add a Phase 3 bullet alongside Phase 1 and Phase 2. - Phase 1: replace the in-flight `#21956` aside with a cross-link to the new section. - Phase 2: keep the caveat about function-wrapped sorts but note that #21956's `Inexact` path now covers them via monotonicity inference. - New `## Phase 3` section with two SVG diagrams: a decision tree for the three protocol outcomes, and a three-step pipeline for the `Inexact` runtime. Covers the two-flag design, the nested file/RG layers, when EXPLAIN surfaces each flag, and four scenarios where Phase 3 does not fire (aggregations, multi- column secondary keys, function-wrapped sorts without a declared ordering, source declares a forward prefix of the request). - "What's Next": rename the old "Phase 3 — filter + sort" bullet to "Filtered reverse TopK end-to-end" so the label doesn't clash with the new section, and add a follow-up bullet referencing #22198 for multi-column / function-wrapped reorder.
Add an ### Empirical note subsection inside "Reverse Scans for `ORDER BY ... DESC`" that records what we found running an in-house RG-level `Exact` reverse against upstream `Inexact` + `TopK`: - `LIMIT N` does not propagate as a static stop signal in the Inexact path. The dynamic filter pushdown can stats-prune *subsequent* row groups once the threshold tightens, but inside the row group `TopK` is currently reading the sort column has to be fully decoded so the filter can be evaluated row by row. `LIMIT 10` on a 1M-row row group is still ~1M sort-column decodes regardless of N. LIMIT only saves work on non-sort columns inside that row group and on whole subsequent row groups the threshold prunes. - `SortExec` stays on top of `Inexact`, so the final ordering pass and per-row heap maintenance are both extra costs the `Exact` path (which deletes `SortExec`) does not pay. Then explain why we run RG-level `Exact` in production but did not upstream it: parquet does not allow partial row-group reads, so any RG-level `Exact` implementation peaks at one whole row group (~128 MB) of decoded data in memory — the same constraint that closed `#18817`. Our runtime advantage comes from skipping heap / filter / `SortExec` overhead, not from decoding less. Frame the page-level `Exact` reverse work in arrow-rs `#9937` as the path that keeps the runtime win we measured while bringing peak memory back into the streaming regime via `OffsetIndex` page-level seek.
…tion
Two corrections in the empirical-note / reverse-scans section:
1. The internal RG-level `Exact` reverse path was incorrectly
described as "walks pages from the back, decodes each batch,
reverses row-wise, stops the moment K rows have been delivered."
That is actually the page-level `Exact` shape (arrow-rs #9937),
not the in-house RG-level implementation. Parquet does not allow
partial row-group reads, so the in-house path has to decode the
entire target row group, reverse the buffer in memory, take the
first K rows, and stop — same memory profile as #18817's
proposal. The runtime advantage over `Inexact` + `TopK` comes
from removing the per-row heap maintenance, dynamic-filter
evaluation, and `SortExec` final ordering pass, not from
decoding less sort-column data. Sort col decode on the target
row group is the same on both paths.
2. The arrow-rs #9937 paragraph I previously added duplicated the
technical detail already present in the long-standing
"That primitive is the page-level reverse traversal..."
paragraph. Replaced with a one-sentence bridge ("The fix that
keeps both the runtime win and a streaming memory profile is
page-level `Exact` reverse via arrow-rs #9937, described next.")
so the existing paragraph carries the explanation without
repetition.
All three sort-pushdown PRs have now landed, so the chronological 'Phase 1/2/3' framing is less useful for readers than a capability breakdown. Sections are now: - The PushdownSort Rule (#19064) - Sort Elimination via Statistics (#21182) - Runtime Reorder for TopK Convergence (#21956) - Reverse Scans for ORDER BY ... DESC (unchanged) In-body Phase references replaced with PR numbers or capability names; anchor links updated; references section restructured.
…ction' The previous wording 'nest by construction' could be read as a code- enforced property. It's actually a logical consequence of using the same sort key (min) at both file and row-group level: a file's min(col) is just the minimum over its row groups' min(col) values, so the most-promising file contains the most-promising row group. The rewritten paragraph spells that out and ties it to why TopK's dynamic filter tightens fast.
Match the dynamic-filter blog's style: narrative talks about capabilities/mechanisms, not 'PR #21956 did X / PR #19064 introduced Y'. The 81 inline PR/issue references in the body were dropping the reader out of the narrative; they belong in a single Issues-and-PRs list at the end. Changes: - TL;DR: drop 6 inline PR refs from the 4 capability bullets - Body sections (PushdownSort Rule, Sort Elimination, Runtime Reorder, Reverse Scans, Empirical Note): drop ~30 inline refs to historical PRs; replace with capability names or 'the rule' / 'the runtime reorder path' style descriptions - What's Next: switch from [#NNNNN] format to named markdown links (matching dynamic-filter's Future Work style) - Issues for new contributors: same conversion - References section: rewrite using full URLs (no link-ref indirection); split into 'Landed' vs 'In flight / open' for clarity Net: ~90 lines removed, all PR/issue numbers now consolidated at the bottom of the post.
The previous draft mixed merged work, in-flight work, and runtime-cost analysis into a single 'Reverse Scans' section and a sprawling 'What's Next' section. Reorganize so the post answers three clear questions in sequence: 1. What's merged today? (Sort Elimination via Statistics + benchmark, Runtime Reorder for TopK Convergence, Reverse Scans for DESC) — unchanged content, just kept tight. 2. Where do those merged features still leave performance on the table? New 'Current Bottlenecks' section with three explicitly numbered bottlenecks: SortExec stays / sort column fully decoded inside open RG / file-granular scheduling can't close the tap mid-file. Pulls in the runtime-cost content that used to be buried in an 'Empirical note' subsection. 3. How does each next-step optimization remove a specific bottleneck? New 'Roadmap' section maps page-level Exact reverse to bottlenecks 1+2, row-group-level dynamic early termination to bottleneck 3, and shows the in-flight 17x-60x pipeline benchmark as a preview of what stacking these mechanisms can deliver. Smaller follow-ups (EnsureRequirements, OFFSET pushdown, multi-column reorder) collected at the end of the roadmap section as a short 'Other follow-ups' bullet list.
- 'Why not full Exact reverse' paragraph: cut reviewer attribution and forward-pointers that were already in the bottleneck/roadmap sections that follow. - TL;DR: trim Runtime Reorder + Reverse Scans bullets to capability and impact; drop implementation mechanics like 'stamps two flags' and 'three-step pipeline'. - 'The PushdownSort Rule' section: cut three paragraphs of 'covered in X below' forward-references that were repeating the section TOC. - Function-wrapped parenthetical in Sort Elimination: 4 lines to 2. - Single-partition vs multi-partition edge case: drop the trailing 'which is why the example is drawn that way' tangent. - 'What this change does not affect' note: trimmed redundant prose. - Remove all references to the limit-pruning blog (intro mention, link definition, References section bullet) — that work is about static LIMIT as an I/O optimization, separate problem from sort ordering.
…mmetry explicit Per code in datasource-parquet/src/source.rs:849-870, the reversed- satisfies branch is 'strictly more powerful than the column-in-schema check' — it runs the request through EquivalenceProperties's full reasoning machinery and handles function monotonicity, constants from filters, equivalence relationships, and multi-column composite orderings. The blog had been treating reverse as just step 3 of the runtime pipeline, which undersold its standalone reach. Structural changes: - Drop the standalone 'Reverse Scans for ORDER BY DESC' H2 section; reverse is now a case of the Inexact runtime reorder path. - Rename Runtime Reorder section to 'Runtime Reorder for TopK and DESC Queries'; intro now lists three classes that fall outside Exact (unsorted, overlapping, DESC). - 'try_pushdown_sort' subsection rewritten as 'Two independent triggers for Inexact', describing column-in-schema vs reversed- satisfies as separate signals with the latter being strictly more powerful. - 'Three runtime steps' subsection: step 3 now explicitly notes when steps 1-2 are skipped and only the iteration reverse runs. - New 'ORDER BY DESC in practice' subsection right after the 3-step pipeline, explaining the RGs-descending-x-rows-ascending stream. - Move reverse-scan.svg from the deleted Reverse Scans section into the Roadmap > Page-level Exact reverse subsection where it illustrates the 128 MB vs 1 MB peak comparison directly. Accuracy fix: - 'Multi-column reorder follow-ups' bullet was inaccurate — said the reorder 'only fires on plain columns'. The reverse path does handle function-wrapped and multi-column cases via EquivalenceProperties; only the stats reorder step is restricted. Updated wording to scope the limitation correctly.
…shdown EnsureRequirements (#21976) is a rule-unification effort for EnforceDistribution+EnforceSorting. Touches the same area but isn't a sort pushdown optimization. OFFSET pushdown (#21828) is about LIMIT/OFFSET pruning. Same kind of tangent as the limit-pruning blog reference removed earlier — it's LIMIT optimization, not sort pushdown. The remaining 'Multi-column and function-wrapped reorder follow-ups' bullet is actually directly about sort pushdown's reorder step (#22198), so it stays. With the other two removed, 'Other follow-ups' collapsed to a single point — promoted to its own subsection 'Extending the stats reorder step' for clarity. Also dropped the corresponding entries from the References section.
… (k-way merge) The 'switch work queue from PartitionedFile to RG descriptors' fix is sufficient for non-overlapping ranges (post-reorder), where the first file globally has the smallest values and subsequent files are already stats-pruned. For overlapping ranges, the next smallest value could sit in any of several open files — matching the non-overlap efficiency requires explicit k-way merge across open files' next-RG mins. The dynamic filter does this implicitly (RGs with max < threshold are dropped), but explicit comparison closes the tap earlier when the filter tightens slowly.
970b6d6 to
03d1e79
Compare
Restructure the post so the merged Runtime Row-Group Dynamic Pruning (#22450) is the centerpiece, matching the community-talk narrative: - TL;DR now leads with the three-layer pruning stack (file + RG + row) and the topk_tpch headline (5/11 queries 3-4x, 0 regressions, -44% total). - Exact path keeps credit attribution explicit: EnforceSorting already handled the simple on-disk-sorted case; Phase 2 closes the wrong-on-disk-order gap. BufferExec story expanded with the SPM stall diagnosis. - Inexact path split into Tier 1 (file-level, already had early stop) vs Tier 2 (RG-level, no early stop until #22450) so the gap that #22450 closes is named explicitly. - New section: #22450 mechanics - architecture-3-eras, transition() drain/decide/drive, RowGroupPruner watch/rebuild/prune, cascading prune walkthrough. - New section: Three-layer pruning stack with the same DynamicFilter driving all three layers, plus the topk_tpch benchmark table. - Old "Current Bottlenecks" + "Roadmap" merged into "Future Directions": A) page-level Exact reverse (arrow-rs #9937), B) page-level dynamic prune at RG boundary (#23216). - Updated landed-PRs list to include #20839 (push-based decoder, the prerequisite for #22450) and #22450 itself. Pulls in 9 PNGs from the community-call deck for the #22450-specific diagrams (architecture, transition, pruner, cascade, three-layer stack, topk_tpch results, page-level future).
03d1e79 to
fb489ff
Compare
There was a problem hiding this comment.
Pull request overview
Adds a new long-form blog post documenting Apache DataFusion’s multi-release “sort pushdown” work (v52→v55+), centered on runtime row-group dynamic pruning (#22450), and introduces a set of new diagrams (SVGs + referenced PNGs) to support the narrative.
Changes:
- Added a new blog post:
Sort Pushdown in DataFusion: Skip Sorts, Skip Decode, Skip I/O(scheduled-dated 2026-07-05). - Added multiple new SVG diagrams under
content/images/sort-pushdown/illustrating sort pushdown phases, decision flow, runtime pipeline, buffering, and benchmarks. - Wired the post to those assets via
/blog/images/sort-pushdown/...image references.
Reviewed changes
Copilot reviewed 1 out of 19 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| content/blog/2026-07-05-sort-pushdown.md | New blog post covering sort pushdown and runtime pruning, referencing the accompanying diagrams/assets. |
| content/images/sort-pushdown/reverse-scan.svg | New diagram contrasting row-group reverse vs page-level reverse scan. |
| content/images/sort-pushdown/pr21956-runtime-pipeline.svg | New diagram of the runtime reorder pipeline driven by ParquetSource flags. |
| content/images/sort-pushdown/pr21956-decision.svg | New decision-tree diagram for try_pushdown_sort outcomes. |
| content/images/sort-pushdown/plan-diff.svg | New before/after EXPLAIN diagram showing sort elimination. |
| content/images/sort-pushdown/phase2-stats-overlap.svg | New diagram explaining non-overlap vs overlap using min/max stats. |
| content/images/sort-pushdown/phase1-file-reorder.svg | New diagram showing file reorder by min/max stats to enable sort elimination. |
| content/images/sort-pushdown/buffer-exec.svg | New diagram explaining BufferExec as a bounded per-partition buffer. |
| content/images/sort-pushdown/buffer-exec-stall.svg | New diagram illustrating SPM stalls without buffering vs behavior with BufferExec. |
| content/images/sort-pushdown/benchmark.svg | New benchmark bar-chart diagram for the sort_pushdown suite results. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@zhuqi-lucas is this blog ready for review? |
|
sort-pushdown-community-call.pdf
Thanks @alamb , it should be ready, and mostly i copied content from my recent China Meetup slide above. |
|
Starting to check this one out |
alamb
left a comment
There was a problem hiding this comment.
Thanks for this blog @zhuqi-lucas - I got about half way thought it in detail and I think it could be published as is so I am approving it.
However, I think we could potentially make it a much stronger narrative so I spent quite a while providing a bunch of stylistic feedback for your review
Suggestion: Make the writing more concise
There are several places in this blog (I highlighted some of them) where I think you can get across the same idea with about 1/3 of the text -- this will make the blog much easier to read and understand in my opinion.
LLMs are pretty good at helping reduce the content length if you give them that as an explicit goal and help identify unnecessary content (I left several inline suggestions)
Suggestion: reframe as a general Database technology
My biggest suggestion is to change how the blog presents the information (the information is great).
I think this post is currently structured as a "what did we do" narrative arc and present the way the work unfolded over time. However, I think that may not be as interesting to people as a post that teaches them some general principles that may be applicable to their own systems (using our work on DataFusion as an example)
So concretely that would mean something like
- Explain the problem (data is often partly sorted and the query engine doesn't know about the existing sortedness )
- Explain techniques to take advantage of pre-existing sort orders
- present results about how well it works
Suggestion: any time you refer to the internals of DataFUsion, link to the docs
I think most readers will not know what a SortPreservingMergeExec is. So I recommend either
- refer to this in more general terms ("sorted merge opertator")
- link to the docs.rs or source code so it is clear what you are talking about
Some links not working
for some reason links with ``` are not working. I think this is due to the asf pelican rendering system
| [Apache Parquet]: https://parquet.apache.org/ | ||
|
|
||
| [Apache DataFusion] has always been able to skip the sort in the | ||
| *trivial* case: when the catalog declares an ordering (`WITH ORDER` |
There was a problem hiding this comment.
I am not sure I would call the code for the existing sort orderings "trivial" 😆 -- I would probably use the term "exact" -- so like "been able to skip the sort in the exact case"
Maybe you could also link to this blog from @akurmustafa that explains what the existing function is : https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/
There was a problem hiding this comment.
I also recommend linking to the docs about WITH ORDER as it isn't SQL standard (it is a datafusion specific thing)
| or parquet `sorting_columns`) **and** the on-disk file listing | ||
| already matches that order, the `EnsureRequirements` rule sees that | ||
| the scan's `output_ordering` already satisfies the request and |
There was a problem hiding this comment.
This is kind of a lot of detail -- I know what it means but many readers might not -- it might help to write this in non-datafusion technical terms. So instead of "when the catalog declares ..." something more like
"when the data on disk is perfectly sorted and table definition includes the ordering, DataFusion can avoid redundant sorts, as explained in the blog ...."
| or parquet `sorting_columns`) **and** the on-disk file listing | ||
| already matches that order, the `EnsureRequirements` rule sees that | ||
| the scan's `output_ordering` already satisfies the request and | ||
| **removes the redundant `SortExec`**. The hard cases were |
There was a problem hiding this comment.
And then in this next sentence, rather than calling it "hard" maybe call it "another" case and describe in terms of user visible behavior: "This blog is about taking advantage of sorting even when the files may not be perfectly sorted and DataFusion is not told about any ordering up front"
|
|
||
| *Qi Zhu, [Massive](https://www.massive.com/)* | ||
|
|
||
| Many [Apache Parquet] datasets are already sorted on disk. Time-series |
There was a problem hiding this comment.
I suggest you start this blog with a 2-3 sentence abstract / summary of its contents, so that a potential reader can quickly decide if they want to read it or not
Maybe something like
"DataFusion now automatically takes advantage of sortedess in the data, even when the data is only partially sorted and DataFusion does not know the sort order ahead of time. This blog explains why this is an important usecase and how Datafusion achieves this, through a combination of pushing down sort information, and adaptive runtime reordering and pruning"
| *Qi Zhu, [Massive](https://www.massive.com/)* | ||
|
|
||
| Many [Apache Parquet] datasets are already sorted on disk. Time-series | ||
| files are usually written in ingestion-time order. Event logs are sharded |
There was a problem hiding this comment.
Here are a few other motivating observations that might be worth working in
- Data often has a partial sort order that is either not entirely sorted (e.g. ingestion order vs data order) or else the database doesn't know about the existing ordering
- Modern data lakes based on Apache Icerberg and others often have to work with data as it is / was written and don't have the luxury of making a second copy, or resorting -- so being able to automatically exploit existing sortedness is important
| [#19064]: https://github.com/apache/datafusion/pull/19064 | ||
| [#21182]: https://github.com/apache/datafusion/pull/21182 | ||
|
|
||
| * Three files: `a.parquet`, `b.parquet`, `c.parquet`. |
| **strips the ordering entirely**. From the optimizer's point of view | ||
| the scan now has no declared ordering. Physical planning translates | ||
| the user's `ORDER BY` into a `SortExec`, and the earlier pipeline | ||
| rule [`EnsureRequirements`] (which consolidates the old |
| in the **wrong order** on disk (e.g. alphabetical by name, not by | ||
| time). | ||
|
|
||
| `validated_output_ordering()` looks at this, sees that the |
There was a problem hiding this comment.
I think this is a lot of implementation detail that will not mean much to anyone else -- just saying "previously DataFusion could not recognize the sort order, but now it can" is probably enough, with a link to the code for anyone who wants more details
|
|
||
| [`EnsureRequirements`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/ensure_requirements/mod.rs | ||
|
|
||
| Stats-based sort elimination fixes this in `PushdownSort`, which |
There was a problem hiding this comment.
This is also full of implementation details -- I suggest you focus on just explaining the conditions under which the transformation can be done (non overlapping min/max boundaries)
| sorts via monotonicity inference — but stats-based sort elimination | ||
| still needs a plain column.) | ||
|
|
||
| <img src="/blog/images/sort-pushdown/phase2-stats-overlap.svg" alt="Detecting non-overlapping ranges via min/max statistics" width="100%" class="img-fluid"/> |


Summary
Blog post on the multi-release sort pushdown effort in Apache DataFusion (v52 → v55+), restructured around the newly-merged apache/datafusion#22450 as the centerpiece.
The post now covers three landed capabilities plus two future directions, all built on top of the Dynamic Filters primitive:
Exactpath —PushdownSortdeletesSortExecwhen statistics prove the scan is already ordered.BufferExeckeeps the multi-partition path from regressing on SPM stalls.Inexactpath — runtime reorder forTopKandDESCqueries. File-level early stop already worked; row-group-level early stop was the gap.TopKthreshold drives a freshPruningPredicate, and matching row groups are physically removed from the openRecordBatchstream viainto_builder() → with_row_groups() → build(). Zero I/O, zero decompress, zero decode, not even the filter column.Together these compose into a three-layer pruning stack (file + row-group + row), all driven by the same
TopKdynamic filter.Benchmarks documented
sort_pushdown(Exact path)ORDER BY ... LIMITruns 27× / 49× faster; fullORDER BYruns ~2× fastertopk_tpch(#22450, TPC-H SF1,LIMIT 100)What changed since the original draft
transition()· drain · decide · driveRowGroupPruner· watch · rebuild · prunetopk_tpchbenchmarkExactPath with honest credit: the trivial case (declared ordering + matching on-disk file list) has always worked viaEnsureRequirements; Phase 2 (#21182) is what closes the wrong-on-disk-order gap.Bottlenecks + Roadmapframing with a tight Future Directions section: A) page-level Exact reverse (gated on arrow-rs#9937); B) page-level dynamic prune at RG boundary (#23216, pure DataFusion — no new arrow-rs API needed).apache/datafusionandapache/arrow-rssources. Dropped names that no longer exist (e.g.EnforceSorting→EnsureRequirements,PagePruningPredicate→PagePruningAccessPlanFilter).transition()LOC framing).2026-07-05-sort-pushdown.mdto better align with publish timing.Test plan
apache/datafusionandapache/arrow-rssource.🤖 Generated with Claude Code