Skip to content

Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip Decode, Skip I/O#186

Open
zhuqi-lucas wants to merge 15 commits into
apache:mainfrom
zhuqi-lucas:blog-sort-pushdown
Open

Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip Decode, Skip I/O#186
zhuqi-lucas wants to merge 15 commits into
apache:mainfrom
zhuqi-lucas:blog-sort-pushdown

Conversation

@zhuqi-lucas

@zhuqi-lucas zhuqi-lucas commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Blog post on the multi-release sort pushdown effort in Apache DataFusion (v52 → v55+), restructured around the newly-merged apache/datafusion#22450 as the centerpiece.

The post now covers three landed capabilities plus two future directions, all built on top of the Dynamic Filters primitive:

  1. Exact pathPushdownSort deletes SortExec when statistics prove the scan is already ordered. BufferExec keeps the multi-partition path from regressing on SPM stalls.
  2. Inexact path — runtime reorder for TopK and DESC queries. File-level early stop already worked; row-group-level early stop was the gap.
  3. Runtime row-group dynamic pruning (#22450) — the new piece. At every row-group boundary inside an open file, the live TopK threshold drives a fresh PruningPredicate, and matching row groups are physically removed from the open RecordBatch stream via into_builder() → with_row_groups() → build(). Zero I/O, zero decompress, zero decode, not even the filter column.

Together these compose into a three-layer pruning stack (file + row-group + row), all driven by the same TopK dynamic filter.

Benchmarks documented

Suite Headline
sort_pushdown (Exact path) ORDER BY ... LIMIT runs 27× / 49× faster; full ORDER BY runs ~2× faster
topk_tpch (#22450, TPC-H SF1, LIMIT 100) 5 of 11 queries get 3–4× faster, 0 regressions, total runtime drops −44% (248.8 ms → 139.1 ms)

What changed since the original draft

  • Restructured around #22450 (just merged) as the post's centerpiece, with new sections on:
    • Architecture · three eras of who drives the I/O + decode loop (pre-#20839 / #20839 / #22450)
    • transition() · drain · decide · drive
    • RowGroupPruner · watch · rebuild · prune
    • Cascading prune · how the heap eats row groups
    • Three-layer pruning stack (file + RG + row, stacked)
    • topk_tpch benchmark
  • Reframed the Exact Path with honest credit: the trivial case (declared ordering + matching on-disk file list) has always worked via EnsureRequirements; Phase 2 (#21182) is what closes the wrong-on-disk-order gap.
  • Replaced the old Bottlenecks + Roadmap framing with a tight Future Directions section: A) page-level Exact reverse (gated on arrow-rs#9937); B) page-level dynamic prune at RG boundary (#23216, pure DataFusion — no new arrow-rs API needed).
  • Added explicit dependency acknowledgement: none of the runtime parts of this post are possible without TopK's dynamic filter pushdown — the prior Dynamic Filters post is recommended background.
  • Added a reference to the umbrella EPIC issue #23036 for phase-by-phase tracking of all the related PRs and follow-ups.
  • Cross-checked every code identifier (type names, function names, file paths) against the actual apache/datafusion and apache/arrow-rs sources. Dropped names that no longer exist (e.g. EnforceSortingEnsureRequirements, PagePruningPredicatePagePruningAccessPlanFilter).
  • Pulled 9 PNGs from the community-call deck for the #22450-specific diagrams; corrected two whose text overlays no longer matched the body (cascade threshold wording; transition() LOC framing).
  • Date filename moved to 2026-07-05-sort-pushdown.md to better align with publish timing.

Test plan

  • Rendered locally with the Pelican Docker image — all 18 images render, the TOC builds, all internal links resolve.
  • All PR / issue / blog-post links checked against current state (merged vs in-flight); closed PRs (#18817, #21712) and closed issues (#17348) removed from "in flight / open".
  • All type/function names grep-verified against current apache/datafusion and apache/arrow-rs source.
  • PR moves to ready-for-review when reviewers are ready to look.

🤖 Generated with Claude Code

A walkthrough of the sort pushdown work landed and in flight on Apache
DataFusion. Covers:

- Why SortExec is expensive and what `Exact` / `Inexact` ordering mean
  at runtime (static `fetch` vs `TopK` dynamic filter).
- Phase 1 (#19064): the `PushdownSort` rule + reverse row-group case.
- Phase 2 (#21182): statistics-based file sort that upgrades
  `Unsupported` to `Exact`, eliminating the `SortExec` on
  non-overlapping ASC scans. Includes the BufferExec compensation
  (#21426) so the SPM above doesn't lose its implicit memory buffer.
- Reverse scans: today's row-group reverse (Inexact, #18817) and the
  community decision to wait for arrow-rs page-level reverse (#9937)
  before pursuing Exact reverse, after memory-profile pushback on the
  original row-group-level proposal.
- Benchmarks: 2.1×-49× on the ASC-LIMIT sort_pushdown suite.
- What's next: the dynamic / TopK-driven path (#21351 merged, #21733,
  #21712, #21956, #21580) including the precise RG-pruning vs
  mid-stream-early-return distinction, and the EnsureRequirements
  unification (#21976).
- Links into the prior dynamic filters and limit pruning posts so the
  series reads as a coherent thread.
The post previously only mentioned #21956 in passing. The PR is
landing the full mechanism — `try_pushdown_sort` decision tree,
two flags on `ParquetSource`, three composable runtime steps
(file reorder + RG reorder + reverse), and a `sort_prefix`-
preserving short-circuit — so cover it as a dedicated phase
between Phase 2 and the existing reverse-scan section.

- TL;DR: add a Phase 3 bullet alongside Phase 1 and Phase 2.
- Phase 1: replace the in-flight `#21956` aside with a cross-link
  to the new section.
- Phase 2: keep the caveat about function-wrapped sorts but note
  that #21956's `Inexact` path now covers them via monotonicity
  inference.
- New `## Phase 3` section with two SVG diagrams: a decision tree
  for the three protocol outcomes, and a three-step pipeline for
  the `Inexact` runtime. Covers the two-flag design, the nested
  file/RG layers, when EXPLAIN surfaces each flag, and four
  scenarios where Phase 3 does not fire (aggregations, multi-
  column secondary keys, function-wrapped sorts without a
  declared ordering, source declares a forward prefix of the
  request).
- "What's Next": rename the old "Phase 3 — filter + sort" bullet
  to "Filtered reverse TopK end-to-end" so the label doesn't
  clash with the new section, and add a follow-up bullet
  referencing #22198 for multi-column / function-wrapped reorder.
Add an ### Empirical note subsection inside "Reverse Scans for
`ORDER BY ... DESC`" that records what we found running an in-house
RG-level `Exact` reverse against upstream `Inexact` + `TopK`:

- `LIMIT N` does not propagate as a static stop signal in the
  Inexact path. The dynamic filter pushdown can stats-prune
  *subsequent* row groups once the threshold tightens, but inside
  the row group `TopK` is currently reading the sort column has to
  be fully decoded so the filter can be evaluated row by row.
  `LIMIT 10` on a 1M-row row group is still ~1M sort-column
  decodes regardless of N. LIMIT only saves work on non-sort
  columns inside that row group and on whole subsequent row
  groups the threshold prunes.
- `SortExec` stays on top of `Inexact`, so the final ordering
  pass and per-row heap maintenance are both extra costs the
  `Exact` path (which deletes `SortExec`) does not pay.

Then explain why we run RG-level `Exact` in production but did
not upstream it: parquet does not allow partial row-group reads,
so any RG-level `Exact` implementation peaks at one whole row
group (~128 MB) of decoded data in memory — the same constraint
that closed `#18817`. Our runtime advantage comes from skipping
heap / filter / `SortExec` overhead, not from decoding less.

Frame the page-level `Exact` reverse work in arrow-rs `#9937` as
the path that keeps the runtime win we measured while bringing
peak memory back into the streaming regime via `OffsetIndex`
page-level seek.
…tion

Two corrections in the empirical-note / reverse-scans section:

1. The internal RG-level `Exact` reverse path was incorrectly
   described as "walks pages from the back, decodes each batch,
   reverses row-wise, stops the moment K rows have been delivered."
   That is actually the page-level `Exact` shape (arrow-rs #9937),
   not the in-house RG-level implementation. Parquet does not allow
   partial row-group reads, so the in-house path has to decode the
   entire target row group, reverse the buffer in memory, take the
   first K rows, and stop — same memory profile as #18817's
   proposal. The runtime advantage over `Inexact` + `TopK` comes
   from removing the per-row heap maintenance, dynamic-filter
   evaluation, and `SortExec` final ordering pass, not from
   decoding less sort-column data. Sort col decode on the target
   row group is the same on both paths.

2. The arrow-rs #9937 paragraph I previously added duplicated the
   technical detail already present in the long-standing
   "That primitive is the page-level reverse traversal..."
   paragraph. Replaced with a one-sentence bridge ("The fix that
   keeps both the runtime win and a streaming memory profile is
   page-level `Exact` reverse via arrow-rs #9937, described next.")
   so the existing paragraph carries the explanation without
   repetition.
All three sort-pushdown PRs have now landed, so the chronological
'Phase 1/2/3' framing is less useful for readers than a capability
breakdown. Sections are now:
- The PushdownSort Rule (#19064)
- Sort Elimination via Statistics (#21182)
- Runtime Reorder for TopK Convergence (#21956)
- Reverse Scans for ORDER BY ... DESC (unchanged)

In-body Phase references replaced with PR numbers or capability names;
anchor links updated; references section restructured.
…ction'

The previous wording 'nest by construction' could be read as a code-
enforced property. It's actually a logical consequence of using the
same sort key (min) at both file and row-group level: a file's
min(col) is just the minimum over its row groups' min(col) values, so
the most-promising file contains the most-promising row group. The
rewritten paragraph spells that out and ties it to why TopK's dynamic
filter tightens fast.
Match the dynamic-filter blog's style: narrative talks about
capabilities/mechanisms, not 'PR #21956 did X / PR #19064 introduced
Y'. The 81 inline PR/issue references in the body were dropping the
reader out of the narrative; they belong in a single Issues-and-PRs
list at the end.

Changes:
- TL;DR: drop 6 inline PR refs from the 4 capability bullets
- Body sections (PushdownSort Rule, Sort Elimination, Runtime Reorder,
  Reverse Scans, Empirical Note): drop ~30 inline refs to historical
  PRs; replace with capability names or 'the rule' / 'the runtime
  reorder path' style descriptions
- What's Next: switch from [#NNNNN] format to named markdown links
  (matching dynamic-filter's Future Work style)
- Issues for new contributors: same conversion
- References section: rewrite using full URLs (no link-ref
  indirection); split into 'Landed' vs 'In flight / open' for
  clarity

Net: ~90 lines removed, all PR/issue numbers now consolidated at
the bottom of the post.
The previous draft mixed merged work, in-flight work, and runtime-cost
analysis into a single 'Reverse Scans' section and a sprawling 'What's
Next' section. Reorganize so the post answers three clear questions in
sequence:

1. What's merged today? (Sort Elimination via Statistics + benchmark,
   Runtime Reorder for TopK Convergence, Reverse Scans for DESC) —
   unchanged content, just kept tight.
2. Where do those merged features still leave performance on the
   table? New 'Current Bottlenecks' section with three explicitly
   numbered bottlenecks: SortExec stays / sort column fully decoded
   inside open RG / file-granular scheduling can't close the tap
   mid-file. Pulls in the runtime-cost content that used to be buried
   in an 'Empirical note' subsection.
3. How does each next-step optimization remove a specific bottleneck?
   New 'Roadmap' section maps page-level Exact reverse to bottlenecks
   1+2, row-group-level dynamic early termination to bottleneck 3, and
   shows the in-flight 17x-60x pipeline benchmark as a preview of what
   stacking these mechanisms can deliver.

Smaller follow-ups (EnsureRequirements, OFFSET pushdown, multi-column
reorder) collected at the end of the roadmap section as a short
'Other follow-ups' bullet list.
- 'Why not full Exact reverse' paragraph: cut reviewer attribution and
  forward-pointers that were already in the bottleneck/roadmap
  sections that follow.
- TL;DR: trim Runtime Reorder + Reverse Scans bullets to capability
  and impact; drop implementation mechanics like 'stamps two flags'
  and 'three-step pipeline'.
- 'The PushdownSort Rule' section: cut three paragraphs of
  'covered in X below' forward-references that were repeating
  the section TOC.
- Function-wrapped parenthetical in Sort Elimination: 4 lines to 2.
- Single-partition vs multi-partition edge case: drop the trailing
  'which is why the example is drawn that way' tangent.
- 'What this change does not affect' note: trimmed redundant prose.
- Remove all references to the limit-pruning blog (intro mention,
  link definition, References section bullet) — that work is about
  static LIMIT as an I/O optimization, separate problem from sort
  ordering.
…mmetry explicit

Per code in datasource-parquet/src/source.rs:849-870, the reversed-
satisfies branch is 'strictly more powerful than the column-in-schema
check' — it runs the request through EquivalenceProperties's full
reasoning machinery and handles function monotonicity, constants from
filters, equivalence relationships, and multi-column composite
orderings. The blog had been treating reverse as just step 3 of the
runtime pipeline, which undersold its standalone reach.

Structural changes:
- Drop the standalone 'Reverse Scans for ORDER BY DESC' H2 section;
  reverse is now a case of the Inexact runtime reorder path.
- Rename Runtime Reorder section to 'Runtime Reorder for TopK and
  DESC Queries'; intro now lists three classes that fall outside
  Exact (unsorted, overlapping, DESC).
- 'try_pushdown_sort' subsection rewritten as 'Two independent
  triggers for Inexact', describing column-in-schema vs reversed-
  satisfies as separate signals with the latter being strictly more
  powerful.
- 'Three runtime steps' subsection: step 3 now explicitly notes when
  steps 1-2 are skipped and only the iteration reverse runs.
- New 'ORDER BY DESC in practice' subsection right after the 3-step
  pipeline, explaining the RGs-descending-x-rows-ascending stream.
- Move reverse-scan.svg from the deleted Reverse Scans section into
  the Roadmap > Page-level Exact reverse subsection where it
  illustrates the 128 MB vs 1 MB peak comparison directly.

Accuracy fix:
- 'Multi-column reorder follow-ups' bullet was inaccurate — said the
  reorder 'only fires on plain columns'. The reverse path does
  handle function-wrapped and multi-column cases via
  EquivalenceProperties; only the stats reorder step is restricted.
  Updated wording to scope the limitation correctly.
…shdown

EnsureRequirements (#21976) is a rule-unification effort for
EnforceDistribution+EnforceSorting. Touches the same area but isn't
a sort pushdown optimization.

OFFSET pushdown (#21828) is about LIMIT/OFFSET pruning. Same kind of
tangent as the limit-pruning blog reference removed earlier — it's
LIMIT optimization, not sort pushdown.

The remaining 'Multi-column and function-wrapped reorder follow-ups'
bullet is actually directly about sort pushdown's reorder step
(#22198), so it stays. With the other two removed, 'Other follow-ups'
collapsed to a single point — promoted to its own subsection
'Extending the stats reorder step' for clarity.

Also dropped the corresponding entries from the References section.
… (k-way merge)

The 'switch work queue from PartitionedFile to RG descriptors' fix is
sufficient for non-overlapping ranges (post-reorder), where the first
file globally has the smallest values and subsequent files are
already stats-pruned. For overlapping ranges, the next smallest value
could sit in any of several open files — matching the non-overlap
efficiency requires explicit k-way merge across open files' next-RG
mins. The dynamic filter does this implicitly (RGs with max < threshold
are dropped), but explicit comparison closes the tap earlier when the
filter tightens slowly.
@zhuqi-lucas zhuqi-lucas changed the title Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip Row Groups, Skip I/O Jun 29, 2026
Restructure the post so the merged Runtime Row-Group Dynamic Pruning
(#22450) is the centerpiece, matching the community-talk narrative:

- TL;DR now leads with the three-layer pruning stack (file + RG + row)
  and the topk_tpch headline (5/11 queries 3-4x, 0 regressions, -44%
  total).
- Exact path keeps credit attribution explicit: EnforceSorting already
  handled the simple on-disk-sorted case; Phase 2 closes the
  wrong-on-disk-order gap. BufferExec story expanded with the SPM
  stall diagnosis.
- Inexact path split into Tier 1 (file-level, already had early stop)
  vs Tier 2 (RG-level, no early stop until #22450) so the gap that
  #22450 closes is named explicitly.
- New section: #22450 mechanics - architecture-3-eras, transition()
  drain/decide/drive, RowGroupPruner watch/rebuild/prune, cascading
  prune walkthrough.
- New section: Three-layer pruning stack with the same DynamicFilter
  driving all three layers, plus the topk_tpch benchmark table.
- Old "Current Bottlenecks" + "Roadmap" merged into "Future Directions":
  A) page-level Exact reverse (arrow-rs #9937), B) page-level dynamic
  prune at RG boundary (#23216).
- Updated landed-PRs list to include #20839 (push-based decoder, the
  prerequisite for #22450) and #22450 itself.

Pulls in 9 PNGs from the community-call deck for the #22450-specific
diagrams (architecture, transition, pruner, cascade, three-layer stack,
topk_tpch results, page-level future).
@zhuqi-lucas zhuqi-lucas changed the title Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip Row Groups, Skip I/O Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip Decode, Skip I/O Jun 29, 2026
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review June 29, 2026 02:31

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new long-form blog post documenting Apache DataFusion’s multi-release “sort pushdown” work (v52→v55+), centered on runtime row-group dynamic pruning (#22450), and introduces a set of new diagrams (SVGs + referenced PNGs) to support the narrative.

Changes:

  • Added a new blog post: Sort Pushdown in DataFusion: Skip Sorts, Skip Decode, Skip I/O (scheduled-dated 2026-07-05).
  • Added multiple new SVG diagrams under content/images/sort-pushdown/ illustrating sort pushdown phases, decision flow, runtime pipeline, buffering, and benchmarks.
  • Wired the post to those assets via /blog/images/sort-pushdown/... image references.

Reviewed changes

Copilot reviewed 1 out of 19 changed files in this pull request and generated no comments.

Show a summary per file
File Description
content/blog/2026-07-05-sort-pushdown.md New blog post covering sort pushdown and runtime pruning, referencing the accompanying diagrams/assets.
content/images/sort-pushdown/reverse-scan.svg New diagram contrasting row-group reverse vs page-level reverse scan.
content/images/sort-pushdown/pr21956-runtime-pipeline.svg New diagram of the runtime reorder pipeline driven by ParquetSource flags.
content/images/sort-pushdown/pr21956-decision.svg New decision-tree diagram for try_pushdown_sort outcomes.
content/images/sort-pushdown/plan-diff.svg New before/after EXPLAIN diagram showing sort elimination.
content/images/sort-pushdown/phase2-stats-overlap.svg New diagram explaining non-overlap vs overlap using min/max stats.
content/images/sort-pushdown/phase1-file-reorder.svg New diagram showing file reorder by min/max stats to enable sort elimination.
content/images/sort-pushdown/buffer-exec.svg New diagram explaining BufferExec as a bounded per-partition buffer.
content/images/sort-pushdown/buffer-exec-stall.svg New diagram illustrating SPM stalls without buffering vs behavior with BufferExec.
content/images/sort-pushdown/benchmark.svg New benchmark bar-chart diagram for the sort_pushdown suite results.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@alamb

alamb commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

@zhuqi-lucas is this blog ready for review?

@zhuqi-lucas

Copy link
Copy Markdown
Contributor Author

sort-pushdown-community-call.pdf

@zhuqi-lucas is this blog ready for review?

Thanks @alamb , it should be ready, and mostly i copied content from my recent China Meetup slide above.

@alamb

alamb commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Starting to check this one out

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this blog @zhuqi-lucas - I got about half way thought it in detail and I think it could be published as is so I am approving it.

However, I think we could potentially make it a much stronger narrative so I spent quite a while providing a bunch of stylistic feedback for your review

Suggestion: Make the writing more concise

There are several places in this blog (I highlighted some of them) where I think you can get across the same idea with about 1/3 of the text -- this will make the blog much easier to read and understand in my opinion.

LLMs are pretty good at helping reduce the content length if you give them that as an explicit goal and help identify unnecessary content (I left several inline suggestions)

Suggestion: reframe as a general Database technology

My biggest suggestion is to change how the blog presents the information (the information is great).

I think this post is currently structured as a "what did we do" narrative arc and present the way the work unfolded over time. However, I think that may not be as interesting to people as a post that teaches them some general principles that may be applicable to their own systems (using our work on DataFusion as an example)

So concretely that would mean something like

  1. Explain the problem (data is often partly sorted and the query engine doesn't know about the existing sortedness )
  2. Explain techniques to take advantage of pre-existing sort orders
  3. present results about how well it works

Suggestion: any time you refer to the internals of DataFUsion, link to the docs

I think most readers will not know what a SortPreservingMergeExec is. So I recommend either

  1. refer to this in more general terms ("sorted merge opertator")
  2. link to the docs.rs or source code so it is clear what you are talking about

Some links not working

for some reason links with ``` are not working. I think this is due to the asf pelican rendering system

Image

[Apache Parquet]: https://parquet.apache.org/

[Apache DataFusion] has always been able to skip the sort in the
*trivial* case: when the catalog declares an ordering (`WITH ORDER`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I would call the code for the existing sort orderings "trivial" 😆 -- I would probably use the term "exact" -- so like "been able to skip the sort in the exact case"

Maybe you could also link to this blog from @akurmustafa that explains what the existing function is : https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also recommend linking to the docs about WITH ORDER as it isn't SQL standard (it is a datafusion specific thing)

Comment on lines +42 to +44
or parquet `sorting_columns`) **and** the on-disk file listing
already matches that order, the `EnsureRequirements` rule sees that
the scan's `output_ordering` already satisfies the request and

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of a lot of detail -- I know what it means but many readers might not -- it might help to write this in non-datafusion technical terms. So instead of "when the catalog declares ..." something more like

"when the data on disk is perfectly sorted and table definition includes the ordering, DataFusion can avoid redundant sorts, as explained in the blog ...."

or parquet `sorting_columns`) **and** the on-disk file listing
already matches that order, the `EnsureRequirements` rule sees that
the scan's `output_ordering` already satisfies the request and
**removes the redundant `SortExec`**. The hard cases were

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then in this next sentence, rather than calling it "hard" maybe call it "another" case and describe in terms of user visible behavior: "This blog is about taking advantage of sorting even when the files may not be perfectly sorted and DataFusion is not told about any ordering up front"


*Qi Zhu, [Massive](https://www.massive.com/)*

Many [Apache Parquet] datasets are already sorted on disk. Time-series

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest you start this blog with a 2-3 sentence abstract / summary of its contents, so that a potential reader can quickly decide if they want to read it or not

Maybe something like

"DataFusion now automatically takes advantage of sortedess in the data, even when the data is only partially sorted and DataFusion does not know the sort order ahead of time. This blog explains why this is an important usecase and how Datafusion achieves this, through a combination of pushing down sort information, and adaptive runtime reordering and pruning"

*Qi Zhu, [Massive](https://www.massive.com/)*

Many [Apache Parquet] datasets are already sorted on disk. Time-series
files are usually written in ingestion-time order. Event logs are sharded

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few other motivating observations that might be worth working in

  1. Data often has a partial sort order that is either not entirely sorted (e.g. ingestion order vs data order) or else the database doesn't know about the existing ordering
  2. Modern data lakes based on Apache Icerberg and others often have to work with data as it is / was written and don't have the luxury of making a second copy, or resorting -- so being able to automatically exploit existing sortedness is important

[#19064]: https://github.com/apache/datafusion/pull/19064
[#21182]: https://github.com/apache/datafusion/pull/21182

* Three files: `a.parquet`, `b.parquet`, `c.parquet`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good example

**strips the ordering entirely**. From the optimizer's point of view
the scan now has no declared ordering. Physical planning translates
the user's `ORDER BY` into a `SortExec`, and the earlier pipeline
rule [`EnsureRequirements`] (which consolidates the old

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW this doesnt' seem to be a link

Image

in the **wrong order** on disk (e.g. alphabetical by name, not by
time).

`validated_output_ordering()` looks at this, sees that the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a lot of implementation detail that will not mean much to anyone else -- just saying "previously DataFusion could not recognize the sort order, but now it can" is probably enough, with a link to the code for anyone who wants more details


[`EnsureRequirements`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/ensure_requirements/mod.rs

Stats-based sort elimination fixes this in `PushdownSort`, which

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also full of implementation details -- I suggest you focus on just explaining the conditions under which the transformation can be done (non overlapping min/max boundaries)

sorts via monotonicity inference — but stats-based sort elimination
still needs a plain column.)

<img src="/blog/images/sort-pushdown/phase2-stats-overlap.svg" alt="Detecting non-overlapping ranges via min/max statistics" width="100%" class="img-fluid"/>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The image is great -- I recommend adding a figure label to explain it to someone who is just skimming (not reading hte article linerly)

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants