Skip to content

Conversation

@Aggarwal-Raghav
Copy link
Contributor

What changes were proposed in this pull request?

Check HIVE-29375 for repro and stacktrace

Why are the changes needed?

To make DATE type support for full outer join, which is, getting converted to map join because of hive.optimize.dynamic.partition.hashjoin=true;

Does this PR introduce any user-facing change?

No

How was this patch tested?

Wrote a new q file and will see CI output

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=vector_full_outer_join_date.q -Drat.skip -Dtest.output.overwrite -Pitests -pl itests/qtest

@Aggarwal-Raghav
Copy link
Contributor Author

Explanation:

In Vectorizer.java the hashTableKeyTypeis getting set as DATE for DATE column type

hashTableKeyType = HashTableKeyType.DATE;

But the DATE type is not present VectorMapJoinOuterGenerateResultOperator in :

For Double and Timestamp Columns, they are working without the patch as well because the default hashTableKeyType is MULTI_KEY

hashTableKeyType = HashTableKeyType.MULTI_KEY;

@Aggarwal-Raghav
Copy link
Contributor Author

Will address sonar issues post review comments. I',m willing to move the if-else pattern + old switch style with jdk21 'switch expressions'. Reviewers can let me know. will file separate jira to migrate especially in vectorization.

@Aggarwal-Raghav
Copy link
Contributor Author

CC @zabetak , can you please help with the review?

@zabetak
Copy link
Member

zabetak commented Dec 19, 2025

Hey @Aggarwal-Raghav, I am on holidays till Dec 29, with intermittent and not stable internet connection. Not sure if I will find time to check this before then.

@Aggarwal-Raghav
Copy link
Contributor Author

Hey @Aggarwal-Raghav, I am on holidays till Dec 29, with intermittent and not stable internet connection. Not sure if I will find time to check this before then.

No worries. Enjoy !! 😅

@mdayakar
Copy link
Contributor

mdayakar commented Dec 19, 2025

Hi @Aggarwal-Raghav ,
I am not a vectorization feature expert but as per code changes I feel you missed below places adding DATE HashTableKeyType, please check.

  1. CheckFastRowHashMap.java
  2. VectorMapJoinOptimizedLongHashMap.java
  3. VectorMapJoinOptimizedLongHashMap.java

Also I could see there are many UT test cases, may be you can add more test cases related to DATE HashTableKeyType. For example MapJoinTestConfig.java

@Aggarwal-Raghav
Copy link
Contributor Author

Hi @Aggarwal-Raghav , I am not a vectorization feature expert but as per code changes I feel you missed below places adding DATE HashTableKeyType, please check.

  1. CheckFastRowHashMap.java
  2. VectorMapJoinOptimizedLongHashMap.java
  3. VectorMapJoinOptimizedLongHashMap.java

Also I could see there are many UT test cases, may be you can add more test cases related to DATE HashTableKeyType. For example MapJoinTestConfig.java

Thanks for the thorough review @mdayakar , i checked the above places you mentioned. For [2] and [3], it is already handled in this PR. For [1] and MapJoinTestConfig.java which are test classes, I'll evaluate on adding more test cases for DATE type and update in sometime.

@sonarqubecloud
Copy link

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM! I left some comments for minor improvements and potentially few unit tests more.

(The Sonar issues are not worth fixing; at least not now)

Comment on lines +13 to +29
-- Test timestamp column
create table tbl3 (id int, event_date timestamp);
create table tbl4 (id int, event_date timestamp);

insert into tbl3 values (1, '2025-12-17 10:20:30'), (2, '2025-12-17 11:20:30');
insert into tbl4 values (2, '2025-12-17 11:20:30'), (3, '2025-12-17 09:20:30');

select tbl3.id, tbl3.event_date from tbl3 full outer join tbl4 on tbl3.event_date = tbl4.event_date order by tbl3.id;

-- Test Double column
create table tbl5 (id int, val double);
create table tbl6 (id int, val double);

insert into tbl5 values (1, 5.6D), (2, 3.2D);
insert into tbl6 values (2, 3.2D), (3, 7.2D);

select tbl5.id, tbl5.val from tbl5 full outer join tbl6 on tbl5.val = tbl6.val order by tbl5.id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we adding tests for TIMESTAMP and DOUBLE types? They don't seem to be in the same code path with DATE. Are we fixing anything with respect to those data types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fixing anything for DOUBLE and TIMESTAMP types, just added as they were no covered in vector_full_outer_join.q or vector_full_outer_join2.q. Will remove it.

insert into tbl1 values (1, '2023-01-01'), (2, '2023-01-02'), (3, '2023-01-03');
insert into tbl2 values (2, '2023-01-02'), (3, '2023-01-04'), (4, '2023-01-05');

select tbl1.id, tbl1.event_date from tbl1 full outer join tbl2 on tbl1.event_date = tbl2.event_date order by tbl1.id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are performing a join it would be nice to SELECT also columns from tbl2 otherwise we can't tell if the result is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will use select *

if (!keyBinarySortableDeserializeRead.readNextField()) {
return false;
}
switch (hashMap.hashTableKeyType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here seems very similar to org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashUtil#deserializeLongKey. Should we use this method instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, it's doable as well. IMO it's better to move class from

package org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast => package org.apache.hadoop.hive.ql.exec.vector.mapjoin

and rename it from VectorMapJoinFastLongHashUtil => VectorMapJoinLongHashUtil

As there is clear segregation between fast and optimized flow , importing org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashUtil#deserializeLongKey in org.apache.hadoop.hive.ql.exec.vector.mapjoin.optimized.VectorMapJoinOptimizedLongHashMap can make it confusing.

Let me know on this.

insert into tbl1 values (1, '2023-01-01'), (2, '2023-01-02'), (3, '2023-01-03');
insert into tbl2 values (2, '2023-01-02'), (3, '2023-01-04'), (4, '2023-01-05');

select tbl1.id, tbl1.event_date from tbl1 full outer join tbl2 on tbl1.event_date = tbl2.event_date order by tbl1.id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also print the plan using explain vectorization detail in order to ensure that we are indeed using the expected vectorized operator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.

random,
VectorRandomRowSource.SupportedTypes.ALL,
4,
/* allowNulls */ false, /* isUnicodeOk */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Drop redundant comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

HashTableKeyType.DATE,
verifyTable,
new String[] {"date"},
/* doClipping */ false, /* useExactBytes */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Drop redundant comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

Comment on lines +397 to +399
case DATE:
hashTableKeyType = HashTableKeyType.DATE;
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have unit tests exploiting this config? Do we need to add something in TestMapJoinOperator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add a testcase testDate0 in TestMapJoinOperator. As testString0 makes use of the DATE type, but it does so as a Value column, not as a Join Key.

@Aggarwal-Raghav
Copy link
Contributor Author

Thanks for the review @zabetak , I'll accomodate the suggestions.

@Aggarwal-Raghav
Copy link
Contributor Author

Aggarwal-Raghav commented Jan 4, 2026

Will push changes based on your input on #6239 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants