feat: Add Spark-compatible encode function to datafusion-spark#21331
feat: Add Spark-compatible encode function to datafusion-spark#21331JeelRajodiya wants to merge 14 commits into
encode function to datafusion-spark#21331Conversation
|
run benchmarks |
|
Hi @Zeel-e6x, thanks for the request (#21331 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
xanderbailey
left a comment
There was a problem hiding this comment.
Looks good to me but you’ll need a committer to Approve also! Thanks for the PR!
|
Hey @xanderbailey, Do I need to mention the maintainers for review? if yes please suggest whom incase you know. |
|
They will normally pick it up within a week or so. If not we can ping them here. |
|
Thanks @xanderbailey and @JeelRajodiya -- the PR load is pretty intense! I started the CI for this PR |
22a2705 to
bf46433
Compare
|
I pushed the fixes for clippy errors. @alamb Can you rerun the checks please? |
|
Please rerun the checks |
| } | ||
| Ok(bytes) | ||
| } | ||
| _ => exec_err!( |
There was a problem hiding this comment.
Spark also supports UTF-32. It would be worth adding a comment here explaining why this isn't or can't be supported.
arguments = """
Arguments:
* str - a string expression
* charset - one of the charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32' to encode `str` into a BINARY. It is case insensitive.
""",
There was a problem hiding this comment.
I missed adding support for UTF-32, I've added it now with respective tests.
55f4694 to
835ae8d
Compare
Implements `encode(string_or_binary, charset)` that converts a string or binary value into binary using the specified character encoding, matching Apache Spark's behavior.
In ANSI mode (default), encoding a character that cannot be represented in the target charset (e.g. non-ASCII char in US-ASCII) returns an error. In legacy mode, unmappable characters are silently replaced with '?'.
835ae8d to
6cb99a7
Compare
|
Thanks for iterating on this @JeelRajodiya. One issue I noticed: This PR returns Once challenge for this PR is that there is different behavior across Spark versions for the |
dd0ad0e to
151ac23
Compare
|
Hey @andygrove, I realized that I shouldn't be using Moreover we should target Spark 3.5 which is more permissive and doesn't return errors when null inputs are passed. it simply replaces it with Below are the references to the spark definitions protected override def nullSafeEval(input1: Any, input2: Any): Any = {
input1.asInstanceOf[UTF8String].toString.getBytes(toCharset)
}Just calls Java's Spark 4.1's case class Encode(str, charset, legacyCharsets: Boolean, legacyErrorAction: Boolean)
def this(value, charset) =
this(value, charset, SQLConf.get.legacyJavaCharsets, SQLConf.get.legacyCodingErrorAction)
These Let me know if we need to iterate on this further. P.S I've fixed the BOM issue |
Spark 3.5 and 4.1 both emit UTF-32 as UTF-32BE without a BOM. Our
previous implementation prepended a 0000FEFF BOM, which didn't match
any Spark version. Fix this so encode('A', 'UTF-32') produces
00000041 (4 bytes), matching Spark.
Also add a doc comment clarifying:
- Target Spark version (3.5 charset behavior, accepts aliases)
- UTF-32 semantics (alias for UTF-32BE)
- ANSI mode mapping to Spark 3.5 vs 4.0 unmappable-char behavior
151ac23 to
701850d
Compare
|
@andygrove Can you reivew the PR? |
…ry types and use the binary arm in encode_dispatch
Rationale
The
datafusion-sparkcrate is missing theencodefunction. Spark'sencode(expr, charset)converts a string or binary value into binary using a specified character encoding — commonly used in Spark SQL workloads and needed by engines built on DataFusion that target Spark compatibility.What changes are included in this PR?
Adds
SparkEncodetodatafusion-spark's string functions, emulating Spark 3.5 semantics. It supports US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE, including common aliases (UTF8,LATIN1, etc.) and case-insensitive matching. The charset can be a constant or a per-row column. Binary input is decoded as lossy UTF-8 (invalid bytes → U+FFFD) before re-encoding, and unmappable characters are silently replaced with?, matching Spark.Are these changes tested?
Yes. Coverage lives in
encode.slt(sqllogictest) and exercises all charsets and aliases, case-insensitive matching, null value/charset handling, per-row charsets, binary input (Binary/LargeBinary/BinaryView) with lossy UTF-8,Utf8Viewinput, and the unsupported-charset error. A Rust unit test covers return-field nullability.Are there any user-facing changes?
New
encodescalar function available when usingdatafusion-spark.