Skip to content

Conversation

@sarutak
Copy link
Member

@sarutak sarutak commented Dec 6, 2025

What changes were proposed in this pull request?

This PR aims to check offset and length on copying in UTF8String#reverse.
For details, see https://lists.apache.org/thread/d9pvkh3jbsq8lc33v75kmwq5wg57422h (Only PMC members can read with login).
To avoid performance regression, this PR choose to check offset and length rather than validate the input UTF-8 string.

Why are the changes needed?

For safety.

Does this PR introduce any user-facing change?

Yes, but doesn't break compatibility.

How was this patch tested?

Example queries mentioned in this thread works even though the results are broken.
All the operation defined in UTF8String are expected to work correctly with valid UTF-8 strings so the broken results with invalid UTF-8 strings should be reasonable.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Dec 6, 2025
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @sarutak .

Merged to master/4.1 for Apache Spark 4.1.0.

dongjoon-hyun pushed a commit that referenced this pull request Dec 6, 2025
…on copying

### What changes were proposed in this pull request?
This PR aims to check offset and length on copying in `UTF8String#reverse`.
For details, see https://lists.apache.org/thread/d9pvkh3jbsq8lc33v75kmwq5wg57422h (Only PMC members can read with login).
To avoid performance regression, this PR choose to check offset and length rather than validate the input UTF-8 string.

### Why are the changes needed?
For safety.

### Does this PR introduce _any_ user-facing change?
Yes, but doesn't break compatibility.

### How was this patch tested?
Example queries mentioned in [this thread](https://lists.apache.org/thread/d9pvkh3jbsq8lc33v75kmwq5wg57422h) works even though the results are broken.
All the operation defined in `UTF8String` are expected to work correctly with valid UTF-8 strings so the broken results with invalid UTF-8 strings should be reasonable.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #53366 from sarutak/fix-utf8-reverse.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e2722b8)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants