Skip to content

fix: coalesce the merged key of RIGHT/FULL USING/NATURAL joins#22998

Open
Phoenix500526 wants to merge 10 commits into
apache:mainfrom
Phoenix500526:issue/22881
Open

fix: coalesce the merged key of RIGHT/FULL USING/NATURAL joins#22998
Phoenix500526 wants to merge 10 commits into
apache:mainfrom
Phoenix500526:issue/22881

Conversation

@Phoenix500526

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

A USING / NATURAL join exposes its join key as a single merged column whose value, per the SQL standard, is COALESCE(left.key, right.key). DataFusion resolved an unqualified reference to that merged key to the left column unconditionally.

For RIGHT and FULL joins the left key is NULL-padded on rows that exist only on the right, so the merged key came out NULL instead of the value that is actually present. The wrong NULL is silent and propagates into GROUP BY, ORDER BY (changing row order) and SELECT *, corrupting results; WHERE on the merged key additionally failed with an "ambiguous reference" error. INNER / LEFT are unaffected, since their left key is never NULL-padded.

create table a(k int, x int) as values (1, 10), (2, 20), (3, 30);
create table b(k int, y int) as values (2, 200), (3, 300), (4, 400);

select k from a right join b using (k) order by k nulls last;
-- before: 2, 3, NULL      after: 2, 3, 4

What changes are included in this PR?

The unqualified merged key of a RIGHT / FULL`` USING / NATURAL join now resolves to COALESCE(left, right) everywhere it can be referenced:

  • column normalization (SELECT, ORDER BY, GROUP BY, HAVING, QUALIFY);
  • WHERE predicates — previously rejected as ambiguous; they now resolve against the join's real USING columns;
  • wildcard expansion (SELECT *).

Mechanics:

  • a new LogicalPlan::outer_using_key_pairs() returns the (left, right) key pairs of RIGHT / FULL USING / NATURAL joins;
  • the merged key is materialized as CASE WHEN left IS NOT NULL THEN left ELSE right END — the exact form coalesce is simplified to, so it can be built in datafusion-expr without depending on the functions crate — aliased to the key name so the output column keeps its name.
    INNER / LEFT joins and qualified access (a.k / b.k) are left unchanged.

Are these changes tested?

Yes. A new sqllogictest file join_using_merged_key.slt covers the merged key for RIGHT / FULL / NATURAL joins across SELECT, SELECT *, WHERE and ORDER BY, with INNER / LEFT, qualified a.k / b.k access, and the explicit coalesce(a.k, b.k) ... ON form as regression guards.

The full sqllogictest suite and the datafusion-common /datafusion-expr / datafusion-sql unit tests pass with no regressions.

Are there any user-facing changes?

Yes. For RIGHT / FULL USING / NATURAL joins, the merged join key now returns COALESCE(left, right) (the value from whichever side is present) instead of NULL for rows that exist only on the right, and referencing the merged key in WHERE no longer raises an ambiguous-reference error. Queries that do not use RIGHT / FULL USING / NATURAL joins are unaffected.

No breaking public API changes (the change only adds new public items).

@neilconway neilconway left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sending this PR! Overall I agree that the current behavior is wrong and should be fixed.

I think the current approach produces an error for queries like SELECT k FROM a FULL JOIN b USING(k) ORDER BY a.k, because the sort introduces a hidden projection column.

Comment thread datafusion/expr/src/expr_rewriter/mod.rs Outdated
/// For these join types the left key is NULL-padded on rows that exist only
/// on the right, so an unqualified reference to the merged key must resolve
/// to `COALESCE(left, right)` rather than to the left column alone.
pub fn outer_using_key_pairs(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function name could be improved, since we handle some outer joins (right, full) but not others (left).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename outer_using_key_pairs to right_or_full_using_key_pairs

/// Like [`normalize_col_with_schemas_and_ambiguity_check`], but additionally
/// resolves the merged key of a `RIGHT` / `FULL` `USING` / `NATURAL` join (given
/// as `(left, right)` column pairs) to `COALESCE(left, right)`.
pub fn normalize_col_with_schemas_ambiguity_and_outer_using(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unfortunate to leak an implementation detail ("USING needs special handling for outer joins") into the public API surface.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I drop this function and resolve the merged key in the SQL planner instead, so the USING-outer-join detail no longer leaks into the generic expr public API.

@Phoenix500526

Copy link
Copy Markdown
Contributor Author

Thanks for sending this PR! Overall I agree that the current behavior is wrong and should be fixed.

I think the current approach produces an error for queries like SELECT k FROM a FULL JOIN b USING(k) ORDER BY a.k, because the sort introduces a hidden projection column.

Hi, @neilconway I dug into this one. The failure is a schema clash: ORDER BY a.k drags a.k into the same projection as the merged key k, and for a FULL join k is COALESCE(a.k, b.k) (unqualified). DFSchema won't allow an unqualified k next to a qualified a.k — that's the "would be ambiguous" error.

I found two ways out, neither perfectly clean.

One is to give the merged key a fake qualifier, e.g. COALESCE(a.k, b.k) AS __join_virtual_table__.k, so it's qualified and stops clashing. That actually fixes both ORDER BY a.k and SELECT k, a.k. The catch is the fake table leaks: you see it in EXPLAIN, and unparse falls over because the qualifier points at nothing real — plan_to_sql emits SQL referencing a table that doesn't exist and it won't re-plan:

SELECT __join_virtual_table__.order_id FROM (
  SELECT CASE WHEN o1.order_id IS NOT NULL THEN o1.order_id ELSE o2.order_id END AS order_id, o1.order_id
  FROM orders o1 FULL JOIN orders o2 USING(order_id) ORDER BY o1.order_id)
-- re-plan: qualified field name o1.order_id and unqualified field name order_id would be ambiguous

The other is to push the Sort below the merged-key projection so it reads a.k straight off the join:

Projection: COALESCE(a.k, b.k) AS k
  Sort: a.k
    Full Join: Using a.k = b.k

Clean plan, fixes the ORDER BY case. But now the Sort sits directly on the join with no projection, so unparse falls back to SELECT *, * (one star per side), which expands the merged key twice and again won't re-plan:

SELECT ... FROM (SELECT *, * FROM orders o1 FULL JOIN orders o2 USING(order_id) ORDER BY o1.order_id)
-- re-plan: Projections require unique expression names ... order_id ... order_id

So both fix execution but both trip up unparse — a FULL USING merged key is a COALESCE the unparser doesn't turn back into USING. Neither feels clean enough to me — do you have a better approach in mind for this case?

@neilconway

Copy link
Copy Markdown
Contributor

@Phoenix500526 Thanks for revising this! The FULL case needs some more consideration -- I'll dig into it and try to respond on Monday.

@Phoenix500526

Phoenix500526 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

@Phoenix500526 Thanks for revising this! The FULL case needs some more consideration -- I'll dig into it and try to respond on Monday.

@neilconway Thanks so much for offering to dig into the FULL case! Good news though: I kept poking at it and found a third approach that fixes FULL and survives the unparse round-trip, so you're off the hook 😄 I'd still love to hear what you think, of course.

Quick rundown:

Idea: rename the merged key to a reserved internal name, not a fake qualifier.

Both of my earlier attempts died at unparse because a FULL USING merged key is a COALESCE the unparser can't turn back into USING. So instead of dodging the { k, a.k } clash with a phantom qualifier, I dodge it with a plain reserved name.

During sort push-down, when a qualified sort column (a.k) is about to be folded into a projection that already exposes the unqualified merged key k, I rename that merged key to __datafusion_merged_key_k first, then restore the original name in a wrapper projection:

Projection: __datafusion_merged_key_k AS k                    -- wrapper restores `k`
  Sort: a.k DESC
    Projection: CASE WHEN a.k IS NOT NULL THEN a.k ELSE b.k END AS __datafusion_merged_key_k, a.k
      Full Join: Using a.k = b.k

The folded schema is now { __datafusion_merged_key_k, a.k }— distinct names, no ambiguity. If the sort itself also references the merged key (ORDER BY a.k, k), I rewrite that reference to the same internal name so the multi-key sort keeps resolving. (ORDER BY a.k, k was legal before the merged-key work, so this keeps it from regressing.)

I've added round-trip tests (USING and NATURAL FULL) plus execution coverage for ORDER BY a.k and ORDER BY a.k, k.

One loose end I should flag: the internal name leaks into EXPLAIN. The result column is k and the unparsed SQL is clean, but the intermediate plan nodes still show __datafusion_merged_key_k:

logical_plan
01)Projection: __datafusion_merged_key_k AS k
02)--Sort: a.k DESC NULLS LAST
03)----Projection: CASE WHEN a.k IS NOT NULL THEN a.k ELSE b.k END AS __datafusion_merged_key_k, a.k

It's purely cosmetic — no effect on results or round-trip SQL — but it's not the prettiest. Hiding it completely would mean touching the generic plan Display, which felt like overkill, so I left it for now. If you spot a cleaner way, I'm all ears.

@neilconway

Copy link
Copy Markdown
Contributor

@Phoenix500526 thanks for the update on this! I'm on vacation until July 8; happy to look at this when I get back.

match join_type {
// The left key is NULL-padded on right-only rows; the right key is
// always present, so the merged key is just the right column.
JoinType::Right => return Ok(Expr::Column(r.clone())),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes SELECT k, b.k FROM a RIGHT JOIN b USING(k) fail because both select items become b.k. Could we alias the merged key here, e.g. Expr::Column(r.clone()).alias(col.name.clone()), so k stays distinct from an explicit b.k?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Add coverage that the merged key of a USING / NATURAL join is
COALESCE(left, right) across SELECT, WHERE, ORDER BY and wildcard
expansion, including RIGHT / FULL joins where the left key is NULL-padded.

Refs apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A USING / NATURAL join exposes its join key as a single merged column
whose value, per the SQL standard, is COALESCE(left, right). DataFusion
resolved an unqualified reference to that merged key to the left column
unconditionally.

For RIGHT and FULL joins the left key is NULL-padded on rows that exist
only on the right, so the merged key came out NULL instead of the value
that is actually present. That wrong NULL is silent and propagates into
ORDER BY and other uses of the key, changing query results.

Resolve the unqualified merged key of RIGHT / FULL joins to
COALESCE(left, right) so it carries the value from whichever side is
present, matching the explicit `coalesce(a.k, b.k) ... ON a.k = b.k`
form. INNER / LEFT are unaffected, since their left key is never
NULL-padded.

Refs apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A USING / NATURAL merged key can be referenced unqualified anywhere a
column can, including in WHERE. But WHERE resolved unqualified references
against only the columns the predicate itself mentions, not the join's
USING columns, so an unqualified merged-key reference matched both the
left and right key with no USING context and was rejected as ambiguous.

Resolve WHERE predicates the same way as the SELECT list -- against the
join's real USING columns -- so the merged key is recognized and, for
RIGHT / FULL joins, takes the value from whichever side is present
(COALESCE(left, right)) instead of erroring.

Refs apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A wildcard (`SELECT *`) over a USING / NATURAL join collapses the
duplicated join key to a single merged-key column. That surviving column
was the left key, which is NULL-padded on rows present only on the right
of a RIGHT / FULL join, so `SELECT *` showed NULL for the merged key on
those rows.

Expand the merged key in a wildcard as COALESCE(left, right) for
RIGHT / FULL joins so it shows the value from whichever side is present,
consistent with how an explicit reference to the key resolves and with
the SQL standard.

Closes apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Resolve an unqualified reference to the merged key of a USING / NATURAL
join to the side that is never NULL-padded, addressing review feedback
that always coalescing handled LEFT and RIGHT asymmetrically: the right
key for RIGHT, COALESCE(left, right) for FULL, and the left key
(unchanged) for INNER / LEFT.

For RIGHT this yields a plain column instead of a computed COALESCE, so
the merged key no longer collides with a qualified key in the same
projection schema -- queries like `ORDER BY a.k` or `SELECT k, a.k` over
a RIGHT join now plan cleanly. FULL still needs the computed COALESCE and
keeps that limitation.

Also drop the public normalize_col_with_schemas_ambiguity_and_outer_using
and resolve the merged key in the SQL planner instead, so the
USING-outer-join detail no longer leaks into the generic expr public API.
Rename outer_using_key_pairs to right_or_full_using_key_pairs (it never
covered LEFT) and carry the join type. Correct two stale integration
snapshots to the right-side merged key.

Refs apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A FULL USING / NATURAL join exposes its merged key as an unqualified
COALESCE(left, right) column. Ordering by a qualified key of the same
join (`ORDER BY a.k`) folds that qualified `a.k` into the projection that
already carries the unqualified merged `k`, building an illegal
`{ k, a.k }` schema, so such queries errored at plan time.

During sort push-down, rename a colliding merged key to a reserved
internal name before folding the qualified sort column, and restore the
original name in the wrapper projection. References to the merged key in
the sort itself (`ORDER BY a.k, k`) are rewritten to the same internal
name so multi-key sorts keep resolving; that shape was legal before the
merged-key work, so this avoids a regression.

Teach the unparser to flatten a wrapper projection that purely renames
an inner column over a Sort, so the rewritten plan still round-trips
through plan -> SQL -> plan.

Refs apache#22881

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A USING / NATURAL join exposes one merged key for a shared join column,
ut the original side keys remain addressable through qualified
references. Selecting the merged key next to a qualified side key,
such as `SELECT k, a.k FROM a LEFT JOIN b USING (k)`, previously built
a projection whose two output expressions resolved to the same qualified
name and failed DataFusion's unique-field check.

Detect direct unaliased projection items that select both the merged key
and a qualified side key of the same USING / NATURAL join. Internally
qualify only the merged-key output with the reserved merged-key qualifier
so the projection schema remains distinct while preserving the user-visible
column name.

Teach the unparser to render that reserved merged-key qualifier back as the
original unqualified key, keeping plan -> SQL -> plan round trips valid.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Resolve unqualified USING / NATURAL merged-key references to an
internally qualified merged-key expression. The expression uses the left
key for left-preserving joins, the right key for right-preserving joins,
and CASE / COALESCE semantics for FULL joins, while preserving the
user-visible key name.

This removes late merged-key rewrite handling from SELECT and sort
push-down paths because the merged key now carries a distinct internal
identity during normalization.

Refs apache#22881

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Store USING and NATURAL merged keys under one reserved internal
qualifier instead of embedding the column name in the qualifier. This
keeps merged keys distinct from side columns while simplifying unparser
recognition.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
@github-actions github-actions Bot added the optimizer Optimizer rules label Jun 29, 2026
Keep wildcard output schemas on their visible input columns while still
evaluating RIGHT and FULL USING keys with merged-key semantics. Also
unwrap the reserved merged-key alias for GROUP BY planning so invalid
ORDER BY keys report the aggregate validation error instead of a schema
lookup failure.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RIGHT/FULL/NATURAL JOIN ... USING(k) does not coalesce the join key (returns NULL for right-only rows)

3 participants