-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29375: FULL OUTER JOIN is failing with Unexpected hash table key type DATE #6239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Explanation: In
But the Line 821 in c237510
For
|
|
Will address sonar issues post review comments. I',m willing to move the if-else pattern + old switch style with jdk21 'switch expressions'. Reviewers can let me know. will file separate jira to migrate especially in vectorization. |
|
CC @zabetak , can you please help with the review? |
|
Hey @Aggarwal-Raghav, I am on holidays till Dec 29, with intermittent and not stable internet connection. Not sure if I will find time to check this before then. |
No worries. Enjoy !! 😅 |
|
Hi @Aggarwal-Raghav ,
Also I could see there are many UT test cases, may be you can add more test cases related to DATE HashTableKeyType. For example MapJoinTestConfig.java |
Thanks for the thorough review @mdayakar , i checked the above places you mentioned. For [2] and [3], it is already handled in this PR. For [1] and |
|
zabetak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM! I left some comments for minor improvements and potentially few unit tests more.
(The Sonar issues are not worth fixing; at least not now)
| -- Test timestamp column | ||
| create table tbl3 (id int, event_date timestamp); | ||
| create table tbl4 (id int, event_date timestamp); | ||
|
|
||
| insert into tbl3 values (1, '2025-12-17 10:20:30'), (2, '2025-12-17 11:20:30'); | ||
| insert into tbl4 values (2, '2025-12-17 11:20:30'), (3, '2025-12-17 09:20:30'); | ||
|
|
||
| select tbl3.id, tbl3.event_date from tbl3 full outer join tbl4 on tbl3.event_date = tbl4.event_date order by tbl3.id; | ||
|
|
||
| -- Test Double column | ||
| create table tbl5 (id int, val double); | ||
| create table tbl6 (id int, val double); | ||
|
|
||
| insert into tbl5 values (1, 5.6D), (2, 3.2D); | ||
| insert into tbl6 values (2, 3.2D), (3, 7.2D); | ||
|
|
||
| select tbl5.id, tbl5.val from tbl5 full outer join tbl6 on tbl5.val = tbl6.val order by tbl5.id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we adding tests for TIMESTAMP and DOUBLE types? They don't seem to be in the same code path with DATE. Are we fixing anything with respect to those data types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not fixing anything for DOUBLE and TIMESTAMP types, just added as they were no covered in vector_full_outer_join.q or vector_full_outer_join2.q. Will remove it.
| insert into tbl1 values (1, '2023-01-01'), (2, '2023-01-02'), (3, '2023-01-03'); | ||
| insert into tbl2 values (2, '2023-01-02'), (3, '2023-01-04'), (4, '2023-01-05'); | ||
|
|
||
| select tbl1.id, tbl1.event_date from tbl1 full outer join tbl2 on tbl1.event_date = tbl2.event_date order by tbl1.id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are performing a join it would be nice to SELECT also columns from tbl2 otherwise we can't tell if the result is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will use select *
| if (!keyBinarySortableDeserializeRead.readNextField()) { | ||
| return false; | ||
| } | ||
| switch (hashMap.hashTableKeyType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here seems very similar to org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashUtil#deserializeLongKey. Should we use this method instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, it's doable as well. IMO it's better to move class from
package org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast => package org.apache.hadoop.hive.ql.exec.vector.mapjoin
and rename it from VectorMapJoinFastLongHashUtil => VectorMapJoinLongHashUtil
As there is clear segregation between fast and optimized flow , importing org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashUtil#deserializeLongKey in org.apache.hadoop.hive.ql.exec.vector.mapjoin.optimized.VectorMapJoinOptimizedLongHashMap can make it confusing.
Let me know on this.
| insert into tbl1 values (1, '2023-01-01'), (2, '2023-01-02'), (3, '2023-01-03'); | ||
| insert into tbl2 values (2, '2023-01-02'), (3, '2023-01-04'), (4, '2023-01-05'); | ||
|
|
||
| select tbl1.id, tbl1.event_date from tbl1 full outer join tbl2 on tbl1.event_date = tbl2.event_date order by tbl1.id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also print the plan using explain vectorization detail in order to ensure that we are indeed using the expected vectorized operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes.
| random, | ||
| VectorRandomRowSource.SupportedTypes.ALL, | ||
| 4, | ||
| /* allowNulls */ false, /* isUnicodeOk */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Drop redundant comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
| HashTableKeyType.DATE, | ||
| verifyTable, | ||
| new String[] {"date"}, | ||
| /* doClipping */ false, /* useExactBytes */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Drop redundant comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
| case DATE: | ||
| hashTableKeyType = HashTableKeyType.DATE; | ||
| break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have unit tests exploiting this config? Do we need to add something in TestMapJoinOperator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add a testcase testDate0 in TestMapJoinOperator. As testString0 makes use of the DATE type, but it does so as a Value column, not as a Join Key.
|
Thanks for the review @zabetak , I'll accomodate the suggestions. |
|
Will push changes based on your input on #6239 (comment) |



What changes were proposed in this pull request?
Check HIVE-29375 for repro and stacktrace
Why are the changes needed?
To make
DATEtype support for full outer join, which is, getting converted to map join because ofhive.optimize.dynamic.partition.hashjoin=true;Does this PR introduce any user-facing change?
No
How was this patch tested?
Wrote a new q file and will see CI output