Skip to content

fix: Batch size limit in re-spill compounds#23286

Open
EmilyMatt wants to merge 3 commits into
apache:mainfrom
EmilyMatt:sort-respill-perf
Open

fix: Batch size limit in re-spill compounds#23286
EmilyMatt wants to merge 3 commits into
apache:mainfrom
EmilyMatt:sort-respill-perf

Conversation

@EmilyMatt

@EmilyMatt EmilyMatt commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

This fixes an issue in my previous design for the re-spill in sort - whenever we half a stream, we half self.batch_size, even if its not the same stream being halved, so for each re-spill we get a smaller and smaller output batch, which really hurts performance in the merge.
We can avoid it by just keeping the batch_size_limit in the spilled file struct. then we just use the lowest limit whenever we merge N streams, making it dynamic and much more robust.

Has tests.

Breaks API, by adding a new public field to SortedSpillFile

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Jul 2, 2026
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  43.782s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.128s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  37.108s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.134s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.623s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field SortedSpillFile.batch_size_limit in /home/runner/work/datafusion/datafusion/datafusion/physical-plan/src/sorts/streaming_merge.rs:74

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  82.935s] datafusion-physical-plan

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants