

+
+The interesting backstory is that **DataFusion didn't actually own
+this loop until recently**. Three eras:
+
+* **Pre-[#20839]**: arrow-rs owned the I/O + decode loop as a black
+ box; DataFusion only called `.next()` and served byte ranges. The
+ row-group list was frozen at construction, so once the loop started,
+ no mid-stream decisions were possible.
+* **[#20839]**: the push-based parquet decoder moved the loop into
+ DataFusion. The capability to insert a decision mid-loop now
+ existed — but the loop went from `drain` straight to `drive`, with
+ no decision point.
+* **[#22450]**: adds the missing decision point. At every row-group
+ boundary, the loop pauses to ask the runtime pruner whether the
+ remaining row groups are still worth reading.
+
+### The loop, and the decision point [#22450] adds
+
+


+
+A common question at this point: "if [#22450] prunes whole row
+groups, does that replace the `RowFilter` row-level prune that the
+`Inexact` path was already using?" **No** — the three layers stack,
+and they're driven by the **same** `TopK` dynamic filter. (The
+"Tier 1 / Tier 2" framing earlier maps to "Layer 0 / Layer A"
+below — same partition, different lens. Layer B is what runs on
+each row group after Layer A keeps it.)
+
+* **Layer 0 · file-level** (`file_pruner` + `EarlyStoppingStream`).
+ Cuts dead files before they're opened. The only layer that skips
+ parquet metadata I/O entirely. Already shipped before [#22450] —
+ this is Tier 1.
+* **Layer A · row-group-level** ([#22450]). Cuts dead row groups
+ inside open files at every row-group boundary. Bytes never
+ fetched, filter column never decoded. **This is the layer that
+ fills the Tier 2 gap** ("× no early stop yet" pre-[#22450]).
+* **Layer B · row-level** (`RowFilter`). For row groups that
+ survive Layer A, the filter is still evaluated row-by-row to
+ build a `RowSelection`. Rows that fail the predicate get their
+ *projection* columns short-circuited via arrow-rs's
+ `selects_any()`, but the *filter* column is necessarily read.
+ This layer has the highest residual cost (the filter column),
+ but also the finest granularity.
+
+The same dynamic filter drives all three. A single insertion into
+the `TopK` heap becomes a new threshold that Layer B applies
+per-row immediately (in the currently-open row group), and Layer A
+re-applies to remaining row groups at the next boundary. No layer
+subsumes another — Layer A prunes on metadata alone (never touching
+the filter column), while Layer B is finer-grained but has to read
+the filter column to decide.
+
+### Benchmark · `topk_tpch` (TPC-H SF1, `LIMIT 100`)
+
+
+
+The [`topk_tpch`](https://github.com/apache/datafusion/blob/main/benchmarks/src/sort_tpch.rs) benchmark runs 11 TPC-H SF1 queries, all of the
+shape `ORDER BY ... LIMIT 100`, comparing `main` against the same
+branch with [#22450] enabled. Headline numbers:
+
+| Metric | Value |
+| ----------------------------------- | ---------------------------------- |
+| Total wall-clock (sum of 11 queries) | 248.8 ms → 139.1 ms (**−44%**) |
+| Queries with ≥2× speedup vs main | **5 of 11** (Q2, Q4, Q8, Q9, Q10) |
+| Queries with regression vs main | **0** |
+| Best single-query speedup | **~4×** |
+
+The five queries with significant speedups all use `l_orderkey`
+as the **leading** sort key — lineitem's physical sort key, a
+`BIGINT` with ~1.5M distinct values per SF1, so per-RG `min/max`
+ranges are cleanly disjoint and `Layer A` can cascade-prune
+aggressively. The non-winners (Q1, Q3, Q5, Q6, Q7, Q11) lead with
+`l_linenumber` (cardinality 7), `l_comment`, or `l_shipmode` —
+columns whose per-RG ranges overlap heavily because they're not the
+physical sort order. (Q5–Q7 still *include* `l_orderkey`, but only
+as a third-key tie-breaker — the leading key is what controls RG-level
+disjointness.) A tighter threshold doesn't translate into clean
+RG-level boundaries to prune at, so `Layer B` (row-level) still does
+its share of the work.
+
+The takeaway isn't "5 out of 11", it's "**zero regressions and
+no-op when the data doesn't help, 3–4× when it does**". The sweet
+spot — sort key aligned with the physical layout — is the common
+case for time-series, partitioned tables, and ingestion-ordered
+event logs.
+
+## Future Directions
+
+Two complementary directions are open. The first needs an upstream
+arrow-rs primitive; the second is pure DataFusion plumbing on top
+of [#22450]:
+
+### A · Page-level `Exact` reverse · arrow-rs [#9937]
+
+
+
+[#22450] prunes whole row groups at row-group boundaries. The
+finer-grained extension prunes whole **pages** within a surviving
+row group. The signal is the same dynamic filter, just re-applied
+at page granularity — for any page whose `max(col)` is already
+below the threshold, the filter column's bytes for that page can be
+skipped along with the projection columns.
+
+Today's page-level pruning runs once at file open using the static
+query predicate. Future B extends [#22450]'s "refresh at RG
+boundary" pattern to also rebuild the page-level filter with the
+live threshold, so upcoming row groups get tighter page selections
+mid-scan. Same arrow-rs API [#22450] already uses — no new
+primitive needed. Tracked in [apache/datafusion#23216].
+
+[apache/datafusion#23216]: https://github.com/apache/datafusion/issues/23216
+
+Conceptually this is the same idea as [#22450] stepped down one
+level: every level of the parquet hierarchy gets to chip off its
+share of the residue from the level above.
+
+## Acknowledgements
+
+Thank you to [@adriangb], [@alamb], [@xudong963], [@2010YOUY01], and
+[@Dandandan] for reviewing the design and the patches across many
+iterations. The DataFusion community's willingness to engage deeply
+with optimizer changes — including the ones that touch foundational
+invariants like who-drives-the-decode-loop — is what made this work
+possible.
+
+[@alamb]: https://github.com/alamb
+[@adriangb]: https://github.com/adriangb
+[@xudong963]: https://github.com/xudong963
+[@2010YOUY01]: https://github.com/2010YOUY01
+[@Dandandan]: https://github.com/Dandandan
+
+## References
+
+Umbrella issue tracking the entire effort:
+
+* **[EPIC] Sort Pushdown · skip sorts and skip IO for ORDER BY / TopK queries: [apache/datafusion#23036](https://github.com/apache/datafusion/issues/23036)** — phase-by-phase status of all the PRs and follow-ups.
+
+Prior post this work builds on:
+
+* [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries][dyn-filters-blog] — the dynamic filter primitive `TopK` uses.
+
+Landed PRs that make up the merged work:
+
+* `MinMaxStatistics` foundation: [apache/datafusion#9593](https://github.com/apache/datafusion/pull/9593)
+* `PushdownSort` rule + row-group reverse: [apache/datafusion#19064](https://github.com/apache/datafusion/pull/19064)
+* Reverse-output redesign: [apache/datafusion#19446](https://github.com/apache/datafusion/pull/19446), [apache/datafusion#19557](https://github.com/apache/datafusion/pull/19557)
+* Sort elimination via statistics: [apache/datafusion#21182](https://github.com/apache/datafusion/pull/21182)
+* `BufferExec` capacity for sort elimination: [apache/datafusion#21426](https://github.com/apache/datafusion/pull/21426)
+* Push-based parquet decoder (DataFusion owns the loop): [apache/datafusion#20839](https://github.com/apache/datafusion/pull/20839)
+* Morsel-style work scheduling: [apache/datafusion#21351](https://github.com/apache/datafusion/pull/21351)
+* Runtime reorder for `TopK` convergence: [apache/datafusion#21956](https://github.com/apache/datafusion/pull/21956)
+* **Runtime row-group dynamic pruning ([#22450])** — the centerpiece of this post.
+
+In flight / open:
+
+* Page-level reverse (arrow-rs): [apache/arrow-rs#9937](https://github.com/apache/arrow-rs/pull/9937), discussion in [apache/arrow-rs#9934](https://github.com/apache/arrow-rs/issues/9934)
+* `peek_next_row_group` API for per-RG `fully_matched` RowFilter skip (arrow-rs): [apache/arrow-rs#10158](https://github.com/apache/arrow-rs/pull/10158)
+* Page-level dynamic prune at RG boundary (Future B): [apache/datafusion#23216](https://github.com/apache/datafusion/issues/23216)
+* Per-RG `fully_matched` RowFilter skip on top of [#22450] (blocked on arrow-rs#10158): [apache/datafusion#23067](https://github.com/apache/datafusion/issues/23067)
+* Multi-column / function-wrapped stats reorder follow-ups: [apache/datafusion#22198](https://github.com/apache/datafusion/issues/22198)
+
+Concretely useful issues for new contributors:
+
+* [Add more `ExecutionPlan` impls to support sort pushdown][more-impls-issue].
+
+[more-impls-issue]: https://github.com/apache/datafusion/issues/19394
+
+Benchmark suites: [sort_pushdown](https://github.com/apache/datafusion/tree/main/benchmarks/queries/sort_pushdown), [topk_tpch](https://github.com/apache/datafusion/blob/main/benchmarks/src/sort_tpch.rs).
diff --git a/content/images/sort-pushdown/arch_one_glance.png b/content/images/sort-pushdown/arch_one_glance.png
new file mode 100644
index 0000000000000000000000000000000000000000..7e8dafc3a15c85f423ac2d43a77a23360350f433
GIT binary patch
literal 132456
zcmeFY^;cU#*Z3PsftDhL;@(mmLUFfZ#oe{I6bnTH!Ac9o9g0I~@Is3RcXtc!?oJ@c
zO`mtY_qq2Uxa2jf@SlaO{|LR?C22vYS1AXxbK>*Scmb@mKNsZGL~IA~5+V;ZMPm9yQ{9
zqPY#99JPK1zx8xuWfoCqxBYvIB}Y!hd{rFOwpu4IrJKZg$lRO -P{$t3~G>Ntn#^8e`o@}b`6H<
zXfr%Vul93lyY39LNaMZ3-TAk6>}16@uKyvB2dCqgmKZ(P8Md>V?Si~5WV5j$3;W;N
z$xrtR9D~|*GrafwEdXT090DfEvUqN1pM;QaCr|$Nv%?srWb{wvJaaMRYq4*PF;E}V
zsOe1&&RWGWL>T-uH66zg;uZOHfMxB-@-b(mN+y**qHSF>C6{hGKzX=0%KU$;sf%xw
z5ZkT!be@km4T^FSTRQx1)6i(vuXk#*l42bPUswpX`|04T@AwEMbOe-oEJynVY$Q7G
zWBjIaj67g80Utn3Qv~lzPPGTY7HB8)2iVzKq<`=mQHjB#j7}7*s!OIZ)w24ngO_Vo
zvXJQ)$CL_9;~-F %cnK(5J{GLkrNCIw)lNaL03e$NCl7l6-4&D#YAS!#y6}cC~
zNC-lu`acZB8Qbnt3ooFORz+@Zc#}C)X}~!CHsAxFq+S21LQo4PL?MSsZ*{Q7>2gH{
zde-H?>wtlP(6F*V+ol8r9nf+P)*#+MKy@aY>7eM?zdw(}fxXwyE^TH;)NH^!AIAWj
zDd{=sb25fg*FQN-s(OMhaBVlkQuDq51H4p?>u5_RpQaj&
z^l3r-gNqD9t04Pw&DyvSNYTO8@$?9do(uR2l$w>XP)Z4=+eQ*ao|x~+_7MqU3bCr9
zE3`>KIT&|8t2trkv>D12dkm&be=4--v_2gNBy!?P0Kz0UIB<9bS$BQ}PZ9qKe!^WU
z+UXNm|08mm%{c2t86J7Hf(0qImOM}ItJEsYInJ+R&|UxERK1D-@?dX$hwfx7lV=f2
z4GN`fwz&8|Vn^*tG|V&5*szOiEM<1B>6T|^6u*VFtcL8}w=g+u1otIv>9^QI;*~kr
z%CREd7vj4wUef!Z*{C`Ql%=
zj0?Y0jE
fiy`E0`{Ar`)wt(McP{4j7ejX1v;2
zdA}c=!rCM=86rmQ_zZyzY}6o$R%}>PlupH2Gyy(sug;5BAQ2HQHQ~qR@5z78xnC9=
zoLfpWmGR3YF;4tlbu*1P!-FOPl;#+xY05L>vokN6XZa*|8
z_bNV#xT9wR`~Fw5^;B_dJlCY!KF2w^j*n<0