feat: Add Spark-compatible `encode` function to datafusion-spark by JeelRajodiya · Pull Request #21331 · apache/datafusion

JeelRajodiya · 2026-04-03T06:46:27Z

Rationale

The datafusion-spark crate is missing the encode function. Spark's encode(expr, charset) converts a string or binary value into binary using a specified character encoding — commonly used in Spark SQL workloads and needed by engines built on DataFusion that target Spark compatibility.

What changes are included in this PR?

Adds SparkEncode to datafusion-spark's string functions, emulating Spark 3.5 semantics. It supports US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE, including common aliases (UTF8, LATIN1, etc.) and case-insensitive matching. The charset can be a constant or a per-row column. Binary input is decoded as lossy UTF-8 (invalid bytes → U+FFFD) before re-encoding, and unmappable characters are silently replaced with ?, matching Spark.
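The charset lookup described above (canonical names, aliases, case-insensitive matching) could look roughly like the following. This is a minimal sketch, not the PR's code: the `Charset` enum and `parse_charset` function are hypothetical names, and the alias list is illustrative (only `UTF8` and `LATIN1` are named in the description; `ASCII` is an assumption).

```rust
/// Hypothetical charset enum; the PR's actual types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Charset {
    UsAscii,
    Iso88591,
    Utf8,
    Utf16,
    Utf16Be,
    Utf16Le,
    Utf32,
    Utf32Be,
    Utf32Le,
}

/// Resolve a charset name case-insensitively, accepting a few aliases.
/// The alias set is illustrative, not the PR's exact list.
fn parse_charset(name: &str) -> Result<Charset, String> {
    match name.to_ascii_uppercase().as_str() {
        "US-ASCII" | "ASCII" => Ok(Charset::UsAscii),
        "ISO-8859-1" | "LATIN1" => Ok(Charset::Iso88591),
        "UTF-8" | "UTF8" => Ok(Charset::Utf8),
        "UTF-16" => Ok(Charset::Utf16),
        "UTF-16BE" => Ok(Charset::Utf16Be),
        "UTF-16LE" => Ok(Charset::Utf16Le),
        "UTF-32" => Ok(Charset::Utf32),
        "UTF-32BE" => Ok(Charset::Utf32Be),
        "UTF-32LE" => Ok(Charset::Utf32Le),
        other => Err(format!("unsupported charset: {other}")),
    }
}

fn main() {
    assert_eq!(parse_charset("utf8"), Ok(Charset::Utf8));
    assert_eq!(parse_charset("Latin1"), Ok(Charset::Iso88591));
    assert!(parse_charset("EBCDIC").is_err());
}
```

Resolving the name once into an enum keeps the per-row work to a simple match, which matters when the charset comes from a column rather than a constant.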

Are these changes tested?

Yes. Coverage lives in encode.slt (sqllogictest) and exercises all charsets and aliases, case-insensitive matching, null value/charset handling, per-row charsets, binary input (Binary/LargeBinary/BinaryView) with lossy UTF-8, Utf8View input, and the unsupported-charset error. A Rust unit test covers return-field nullability.
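To illustrate the binary-input path that these tests exercise, here is a small sketch assuming the behavior stated in the description: bytes are decoded as lossy UTF-8 and then re-encoded. The function name `binary_to_utf16be` is hypothetical and UTF-16BE is chosen only as an example target.

```rust
/// Decode bytes as UTF-8, replacing invalid sequences with U+FFFD,
/// then re-encode the text as UTF-16 big-endian without a BOM.
fn binary_to_utf16be(input: &[u8]) -> Vec<u8> {
    let text = String::from_utf8_lossy(input);
    text.encode_utf16().flat_map(|unit| unit.to_be_bytes()).collect()
}

fn main() {
    // 0xFF can never appear in valid UTF-8, so it becomes U+FFFD.
    let out = binary_to_utf16be(&[b'A', 0xFF]);
    assert_eq!(out, vec![0x00, 0x41, 0xFF, 0xFD]);
}
```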

Are there any user-facing changes?

New encode scalar function available when using datafusion-spark.

Zeel-e6x · 2026-04-03T10:49:13Z

run benchmarks

adriangbot · 2026-04-03T10:49:15Z

Hi @Zeel-e6x, thanks for the request (#21331 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas.

xanderbailey

Looks good to me but you’ll need a committer to Approve also! Thanks for the PR!

JeelRajodiya · 2026-04-07T15:48:14Z

Hey @xanderbailey, do I need to mention the maintainers for review? If yes, please suggest whom, in case you know.
I'm planning to open more PRs for implementing other functions but I'm waiting for this PR to get merged.

xanderbailey · 2026-04-07T16:19:42Z

They will normally pick it up within a week or so. If not we can ping them here.

alamb · 2026-04-07T18:13:13Z

Thanks @xanderbailey and @JeelRajodiya -- the PR load is pretty intense! I started the CI for this PR

JeelRajodiya · 2026-04-07T18:55:56Z

I pushed the fixes for clippy errors. @alamb Can you rerun the checks please?

JeelRajodiya · 2026-04-08T18:20:25Z

Please rerun the checks

andygrove · 2026-04-12T15:10:15Z

+            }
+            Ok(bytes)
+        }
+        _ => exec_err!(


Spark also supports UTF-32. It would be worth adding a comment here explaining why this isn't or can't be supported.

arguments = """
  Arguments:
    * str - a string expression
    * charset - one of the charsets 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32' to encode `str` into a BINARY. It is case insensitive.
""",

I missed adding support for UTF-32, I've added it now with respective tests.

andygrove · 2026-04-17T14:33:55Z

Thanks for iterating on this @JeelRajodiya. One issue I noticed:

spark-sql> SELECT hex(encode('A', 'UTF-32'));
00000041

This PR returns 0000FEFF00000041, with a BOM. Both Spark 3.5 and Spark 4.1 return the four-byte form.

One challenge for this PR is that the encode expression behaves differently across Spark versions. Which Spark version is this PR targeting? It would be good to document that.

JeelRajodiya · 2026-04-19T12:27:55Z

Hey @andygrove, I realized that I shouldn't be using the enable_ansi_mode flag inside the encode function. Spark's definition does not bind ANSI mode to the encode function.

Moreover, we should target Spark 3.5, which is more permissive and doesn't return errors when unmappable characters are encountered; it simply replaces them with ?. I've also added a TODO in the doc comment pointing at the two real Spark 4.1 configs so a follow-up PR can wire them up properly.

Below are the references to the spark definitions
Spark 3.5's Encode.scala:

protected override def nullSafeEval(input1: Any, input2: Any): Any = {
  input1.asInstanceOf[UTF8String].toString.getBytes(toCharset)
}

Just calls Java's String.getBytes, which replaces unmappable chars with the charset's default byte (?). No legacyErrorAction, no config, no exception.
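A rough Rust equivalent of that replacement behavior for the two single-byte charsets is sketched below. The helper name `encode_single_byte` and its `max_code_point` parameter are hypothetical and not taken from the PR; the sketch relies on US-ASCII covering U+0000 to U+007F and ISO-8859-1 covering U+0000 to U+00FF.

```rust
/// Encode into a single-byte charset whose code points run from 0 to
/// `max_code_point`, replacing anything outside that range with b'?'.
fn encode_single_byte(s: &str, max_code_point: u32) -> Vec<u8> {
    s.chars()
        .map(|c| {
            let cp = c as u32;
            if cp <= max_code_point { cp as u8 } else { b'?' }
        })
        .collect()
}

fn main() {
    let text = "caf\u{e9}"; // "café" with a precomposed U+00E9
    // US-ASCII: U+00E9 is unmappable and becomes '?'.
    assert_eq!(encode_single_byte(text, 0x7F), b"caf?".to_vec());
    // ISO-8859-1: U+00E9 maps directly to byte 0xE9.
    assert_eq!(encode_single_byte(text, 0xFF), vec![b'c', b'a', b'f', 0xE9]);
}
```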

Spark 4.1's Encode.scala added two new configs for the strict behavior:

case class Encode(str, charset, legacyCharsets: Boolean, legacyErrorAction: Boolean)
  def this(value, charset) =
    this(value, charset, SQLConf.get.legacyJavaCharsets, SQLConf.get.legacyCodingErrorAction)

Setting legacyErrorAction=true restores the Spark 3.5 ? behavior.
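For a follow-up that wires in the strict behavior, the switch between the two modes might be sketched as below. This is only an illustration: the function `encode_ascii`, its `legacy_error_action` parameter, and the error text are hypothetical, and Spark 4.1's actual error class is not reproduced here.

```rust
/// Encode to US-ASCII. With `legacy_error_action` set, unmappable
/// characters become b'?'; otherwise the first one yields an error.
fn encode_ascii(s: &str, legacy_error_action: bool) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(s.len());
    for c in s.chars() {
        if c.is_ascii() {
            out.push(c as u8);
        } else if legacy_error_action {
            out.push(b'?');
        } else {
            return Err(format!("character {c:?} cannot be encoded as US-ASCII"));
        }
    }
    Ok(out)
}

fn main() {
    let text = "caf\u{e9}";
    // Legacy (Spark 3.5 style): silent replacement.
    assert_eq!(encode_ascii(text, true), Ok(b"caf?".to_vec()));
    // Strict: the unmappable character is reported as an error.
    assert!(encode_ascii(text, false).is_err());
}
```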

The spark.sql.legacy.javaCharsets and spark.sql.legacy.codingErrorAction configs are supported in version 4.1 and can be left for a future PR. Currently the PR targets Spark 3.5; I've mentioned this in the doc comment as well.

Let me know if we need to iterate on this further.

P.S. I've fixed the BOM issue.

Spark 3.5 and 4.1 both emit UTF-32 as UTF-32BE without a BOM. Our previous implementation prepended a 0000FEFF BOM, which didn't match any Spark version. Fix this so encode('A', 'UTF-32') produces 00000041 (4 bytes), matching Spark.

Also add a doc comment clarifying:
- Target Spark version (3.5 charset behavior, accepts aliases)
- UTF-32 semantics (alias for UTF-32BE)
- ANSI mode mapping to Spark 3.5 vs 4.0 unmappable-char behavior
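The BOM-free UTF-32 output described in this commit can be reproduced with a few lines of Rust. This is a sketch under the assumption that plain UTF-32 is treated as UTF-32BE; the helper names `encode_utf32be` and `to_hex` are hypothetical and not taken from the PR.

```rust
/// Encode as UTF-32 big-endian: four bytes per code point, no BOM.
fn encode_utf32be(s: &str) -> Vec<u8> {
    s.chars().flat_map(|c| (c as u32).to_be_bytes()).collect()
}

/// Render bytes as uppercase hex, similar to Spark's hex().
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02X}")).collect()
}

fn main() {
    // Matches the spark-sql output quoted in the review: no leading 0000FEFF.
    assert_eq!(to_hex(&encode_utf32be("A")), "00000041");
}
```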

JeelRajodiya · 2026-05-10T19:16:37Z

@andygrove Can you review the PR?

github-actions Bot added the spark label Apr 3, 2026

xanderbailey reviewed Apr 3, 2026

Comment thread datafusion/spark/src/function/string/encode.rs Outdated

Comment thread datafusion/spark/src/function/string/encode.rs

Comment thread datafusion/spark/src/function/string/encode.rs

xanderbailey approved these changes Apr 4, 2026

JeelRajodiya force-pushed the feat/spark-encode-function branch from 22a2705 to bf46433 Compare April 7, 2026 18:54

andygrove reviewed Apr 12, 2026

Comment thread datafusion/spark/src/function/string/encode.rs

github-actions Bot added the core Core DataFusion crate label Apr 15, 2026

JeelRajodiya force-pushed the feat/spark-encode-function branch from 55f4694 to 835ae8d Compare April 15, 2026 05:32

github-actions Bot removed the core Core DataFusion crate label Apr 15, 2026

JeelRajodiya added 8 commits April 15, 2026 11:10

feat: Add Spark-compatible encode function to datafusion-spark

2709f2b

Implements `encode(string_or_binary, charset)` that converts a string or binary value into binary using the specified character encoding, matching Apache Spark's behavior.

test: Add LargeBinary, BinaryView, and emoji surrogate pair tests

fbf8a07

refactor: Make extract_charset private

91222a3

fix: Inline format args to satisfy clippy::uninlined_format_args

0a434fe

fix remaining clippy error

295416f

fix the fmt error

db519be

feat: Add UTF-32 charset support

93f1ec9

feat: Add ANSI mode support for unmappable characters

6cb99a7

In ANSI mode (default), encoding a character that cannot be represented in the target charset (e.g. non-ASCII char in US-ASCII) returns an error. In legacy mode, unmappable characters are silently replaced with '?'.

JeelRajodiya force-pushed the feat/spark-encode-function branch from 835ae8d to 6cb99a7 Compare April 15, 2026 05:40

JeelRajodiya requested a review from andygrove April 15, 2026 05:43

move tests to slt file

b6dcd45

github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Apr 16, 2026

andygrove mentioned this pull request Apr 17, 2026

test: add SQL tests documenting Spark encode behavior apache/datafusion-comet#3975

Closed

JeelRajodiya force-pushed the feat/spark-encode-function branch 3 times, most recently from dd0ad0e to 151ac23 Compare April 19, 2026 12:23

JeelRajodiya force-pushed the feat/spark-encode-function branch from 151ac23 to 701850d Compare April 19, 2026 12:29

Merge branch 'main' into feat/spark-encode-function

9900e96

Jefffrey reviewed Jun 18, 2026

JeelRajodiya added 3 commits June 29, 2026 11:21

Merge branch 'main' into feat/spark-encode-function

ee22456

move the tests to slt file, added coerce args so we preserve the bina…

ac43977

…ry types and use the binary arm in encode_dispatch

add support for charset per rows

7cb0731

JeelRajodiya requested a review from Jefffrey June 29, 2026 11:17

7 participants
