fix: outer reference confusing calcite optimizations#964
Conversation
|
this seems to require a |
nielspardon
left a comment
There was a problem hiding this comment.
Nice catch on the crash this fixes. When a Calcite optimizer only partially decorrelates a plan (the owning Correlate is rewritten into a join, but $corN.col is left in a Filter condition whose getVariablesSet() is now empty), the CorrelationId was never registered, the access was dropped from fieldAccessDepthMap, and RexExpressionConverter.visitFieldAccess then unboxed a null depth into an int → NPE. Registering the id in visit(Filter) removes that NPE, and for genuinely-nested references (depth ≥ 1) it produces correct outer references.
One concern with the result: for the common single-reference case the pre-scan registers the correlation at depth 0, so the converter emits newRootStructOuterReference(..., stepsOut=0). That is an invalid Substrait plan, not just an odd one — the proto is explicit that steps_out must be >= 1:
OuterReference.steps_out: "Number of subquery boundaries to traverse up for this field's reference. Must be >= 1."
— https://github.com/substrait-io/substrait/blob/8675ba395ce966e593ca66575930f59dc359b419/proto/substrait/algebra.proto#L1649-L1653
Consistently, FieldReference.isOuterReference() is outerReferenceStepsOut().orElse(0) > 0, and on the reverse path a 0 reference satisfies neither isSimpleRootReference() nor isOuterReference() and falls through both branches of ExpressionRexConverter.visit(FieldReference). (Emitting a plain non-outer reference instead is also wrong: the field index is in the correlated relation's coordinate system, not the Filter's input — e.g. $cor0.SS_ITEM_SK is index 2 in STORE_SALES but index 2 in ITEM is I_REC_START_DATE.)
So I'd suggest pairing the resolver fix with a guard so the unrepresentable depth-0 case fails loudly instead of producing an invalid plan.
Proposal 1 — guard the converter (isthmus/.../expression/RexExpressionConverter.java, visitFieldAccess, CORREL_VARIABLE case; that file isn't in this diff, hence here rather than inline):
case CORREL_VARIABLE:
{
Integer stepsOut = relVisitor.getFieldAccessDepth(fieldAccess);
// Substrait requires steps_out >= 1 for an outer reference
// (proto/substrait/algebra.proto, OuterReference.steps_out: "Must be >= 1.").
// A null/0 depth means the correlation does not step out of any subquery,
// typically a plan that was only partially decorrelated (the owning Correlate
// was rewritten into a join but the $cor reference remains in this Filter).
// We can't represent that with steps_out, so fail clearly rather than emit an
// invalid steps_out=0 reference. Id-based resolution (substrait#1031) would
// handle this; tracked in #869.
if (stepsOut == null || stepsOut == 0) {
throw new UnsupportedOperationException(
String.format(
"Cannot convert outer reference %s: it does not step out of any "
+ "subquery (steps_out=%s). This usually indicates a plan that was "
+ "only partially decorrelated; Substrait requires steps_out >= 1.",
fieldAccess, stepsOut));
}
return FieldReference.newRootStructOuterReference(
fieldAccess.getField().getIndex(),
typeConverter.toSubstrait(fieldAccess.getType()),
stepsOut);
}This also subsumes the original NPE (the null case) under one clear message.
Follow-up: the proper long-term fix — id-based outer-reference resolution from substrait-io/substrait#1031, which would let these partially-decorrelated / DAG plans actually convert — is already tracked under #869 (Update to Substrait v0.89.0). No new issue needed; just point the guard comment at #869.
Signed-off-by: matthew brian white <whitemat@uk.ibm.com>
Signed-off-by: matthew brian white <whitemat@uk.ibm.com>
Signed-off-by: matthew brian white <whitemat@uk.ibm.com>
fix: outer reference confusing post calcite optimizations