Skip to content

Aggregations Support Partitioning::Range#23239

Open
gene-bordegaray wants to merge 1 commit into
apache:mainfrom
gene-bordegaray:gene.bordegaray/2026/06/range-partitioned-aggregations-main
Open

Aggregations Support Partitioning::Range#23239
gene-bordegaray wants to merge 1 commit into
apache:mainfrom
gene-bordegaray:gene.bordegaray/2026/06/range-partitioned-aggregations-main

Conversation

@gene-bordegaray

@gene-bordegaray gene-bordegaray commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

Range partitioning can satisfy aggregate hash partitioning: equal group keys are already partitioned, even though the partitioning is not hash-based.

This is the first unary-operator implementation from the range partitioning discussion before making broader public API changes around HashPartitioned / KeyPartitioned.

What changes are included in this PR?

  • Let compatible range partitioning satisfy aggregate hash distribution requirements in EnforceDistribution
  • Keep this private to aggregate planning for now to not make public API changes to Distribution enum variants yet until more operators are supported

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes. Range-partitioned aggregate plans can now avoid hash repartitioning.

@github-actions github-actions Bot added physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jun 29, 2026
@gene-bordegaray gene-bordegaray changed the title Support range partitioned aggregations Aggregations Support Partitioning::Range Jun 29, 2026
plan.is::<RepartitionExec>()
}

/// Temporary check while `HashPartitioned` is being migrated to `KeyPartitioned`

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont know if these are worth to be shared in the crate if we are gong to eventually remove them,

@gene-bordegaray gene-bordegaray marked this pull request as ready for review June 29, 2026 14:00
@gene-bordegaray

Copy link
Copy Markdown
Contributor Author

cc: @2010YOUY01 @gabotechs @stuhood

Comment on lines +768 to +770
AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[]
AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[]
DataSourceExec: file_groups={4 groups: [[p0], [p1], [p2], [p3]]}, projection=[a, b, c, d, e], output_partitioning=Range([a@0 ASC], [(10), (20), (30)], 4), file_type=parquet

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I think we don't need the Partial there right? we should be fine with just a FinalPartitioned aggregation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this happens in a later optimizer rule that collapses these partial -> finals where approacpriate: datafusion/physical-optimizer/src/combine_partial_final_agg.rs

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is why it shows up correctly in the slt tests 👍

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move the unit-test coverage added in this file to end-to-end SLT tests instead?

It seems the same test goal can still be achieved at the SLT level, and those sql tests should be more stable across optimizer refactors.

For example, whether the initial physical plan uses a two-stage aggregation or a single-stage aggregation feels implementation-specific. A future refactor might legitimately change that plan shape, which would require updating these unit tests. At that point, it may be harder to recover the original intent of each assertion, and some coverage could accidentally be lost during the refactor. (while SLT behavior won't change a lot even after aggressive refactors)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, tweaking the plan like this:

    let plan = CombinePartialFinalAggregate::new().optimize(plan, &ConfigOptions::new())?;

Produces the following plan:

    AggregateExec: mode=SinglePartitioned, gby=[a@0 as a], aggr=[]
      DataSourceExec: file_groups={4 groups: [[p0], [p1], [p2], [p3]]}, projection=[a, b, c, d, e], output_partitioning=Range([a@0 ASC], [(10), (20), (30)], 4), file_type=parquet

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion whether porting this to SLT tests or also leaving them here. If there are things that are covered here but not in SLT, it's probably worth also covering them in SLT.

@2010YOUY01 2010YOUY01 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! The high-level shape of the PR LGTM. I just need a bit more time to understand what EnforceDistribution is doing before finishing the review — that code looks a little intimidating 😅

Comment on lines +768 to +770
AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[]
AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[]
DataSourceExec: file_groups={4 groups: [[p0], [p1], [p2], [p3]]}, projection=[a, b, c, d, e], output_partitioning=Range([a@0 ASC], [(10), (20), (30)], 4), file_type=parquet

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move the unit-test coverage added in this file to end-to-end SLT tests instead?

It seems the same test goal can still be achieved at the SLT level, and those sql tests should be more stable across optimizer refactors.

For example, whether the initial physical plan uses a two-stage aggregation or a single-stage aggregation feels implementation-specific. A future refactor might legitimately change that plan shape, which would require updating these unit tests. At that point, it may be harder to recover the original intent of each assertion, and some coverage could accidentally be lost during the refactor. (while SLT behavior won't change a lot even after aggressive refactors)

.is_satisfied()
{
.is_satisfied();
let range_satisfies_aggregate_distribution =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this PR is already treating HashPartitioned as KeyPartitioned logically, but the formal renaming PR is left to be done in a future PR 🤔

If that's the case, should we directly implement this logic into range_partitioning.satisfaction()? Otherwise we still have to do it after the formal renaming.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following on our discussion in #23236, I think we'll maintain the two HashPartitioned(deprecated) and KeyPartitioned and treat them as equal.

Probably that can be done in a preliminary PR.

@gabotechs gabotechs left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks good to me, besides a bit more test coverage and other suggestions, I think this is good.

Comment on lines +1212 to +1215
let should_add_hash_repartition = hash_necessary
&& needs_hash_repartition
&& !range_satisfied_for_aggregate;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bool threading in this file is pretty mind-bending, although it seems like it's just how it is, I cannot think of a way of simplifying it...

This PR actually manages that complexity relatively well. I'll think a bit more about it and see if there's a way we can make it easier, but I don't think this is a blocker as long as we have good test coverage.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know... this rule was the first optimizer rule I spent time in and man, its a hard one to digest. It is on my bucket list to clean this guy up more too

Comment on lines 30 to 35
# TEST 1: Aggregate on Range Partition Column
# Scanning range_key preserves source Range partitioning metadata.
# Planning still inserts Hash repartitioning today; later optimizer PRs can
# use this baseline to show when the repartition is removed.
# Planning does not need Hash repartitioning because Range partitioning
# colocates rows with equal range_key values.
##########

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we can add a bit more coverage to this file?

For example, adding some positive and negative tests that play with subset satisfaction + range partitioning, and trying to get all the boolean code paths in enforce_distribution.rs to execute.

.is_satisfied()
{
.is_satisfied();
let range_satisfies_aggregate_distribution =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following on our discussion in #23236, I think we'll maintain the two HashPartitioned(deprecated) and KeyPartitioned and treat them as equal.

Probably that can be done in a preliminary PR.

@gene-bordegaray

Copy link
Copy Markdown
Contributor Author

Approach looks good to me, besides a bit more test coverage and other suggestions, I think this is good.

@gabotechs awesome, going to get that preliminary one in first and rebase it, thank you 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate optimizer Optimizer rules physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow Range partitioning to satisfy grouped aggregation requirements

3 participants