Spark DataSource V2 read and write benchmarks #13955
-
@leesf tagging you in case you have some old context to add/capture here.
-
From my point of view, the main stages are the following: …
-
For queries that read and write the same table, combining a V1 write with a V2 read could be tricky:

```sql
-- 1st example
INSERT INTO hudi_tbl
SELECT * FROM hudi_tbl WHERE ...

-- 2nd example
UPDATE hudi_tbl t
SET somecol = somecol + 100
WHERE EXISTS (
  SELECT 1
  FROM hudi_tbl s
  WHERE s.id = t.id
    AND s.anothercol > 100
);
```

I suppose we could shift the focus to full support of DataSource V2 (read and write) without a performance drop, instead of trying to support a V1 write and a V2 read simultaneously. In that case we would also only have to resolve compatibility issues from the V1 >> V2 migration point of view, rather than deal with a complex hybrid migration with a lot of edge cases.
-
For a start, I will benchmark a read from a Kafka topic (8 partitions) with a direct write to a Hudi table (MOR, upsert, bucket index, 16 buckets). The PySpark script will be run on a local PC, which acts as the driver and submits the job to a remote Spark cluster (Spark 3.5.7) with 8 executors (3 CPUs, 8 GB of memory each). The data in the Kafka topic is … The write scenario is 4 commits in total.
The Hudi table, the Spark event log directory, and the SQL warehouse directory are placed on a separate HDFS cluster to prevent any data transfer to the driver. For Hudi 1.1.0 (the V1 path is used), the total time is about 17 minutes.
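A minimal sketch of what such a benchmark job could look like, assuming hypothetical names (topic `events`, record key `id`, precombine field `ts`, HDFS path) and a simplified payload schema, none of which are from this thread; the Hudi write options match the scenario above (MOR, upsert, bucket index with 16 buckets), and the session settings are the ones recommended in the Hudi docs:

```python
# Sketch of the benchmark job: batch-read a Kafka topic and upsert into a Hudi MOR
# table. Topic name, schema, record key, and paths are hypothetical placeholders.
# Assumes the spark-sql-kafka and hudi-spark bundle jars are on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("hudi-dsv2-benchmark")
    # Session settings recommended in the Hudi docs.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
    .getOrCreate()
)

# Assumed payload schema of the Kafka messages.
schema = StructType([
    StructField("id", LongType()),
    StructField("ts", LongType()),
    StructField("somecol", LongType()),
    StructField("anothercol", LongType()),
])

# Batch read of the whole topic (8 partitions).
src = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder
    .option("subscribe", "events")                    # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("v"))
    .select("v.*")
)

# MOR table, upsert operation, bucket index with 16 buckets, as in the scenario above.
hudi_options = {
    "hoodie.table.name": "hudi_tbl",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "16",
}

# One commit per save(); the real scenario repeats this for 4 commits.
src.write.format("hudi").options(**hudi_options).mode("append").save("hdfs:///tables/hudi_tbl")
```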

-
Integration of Spark DataSource V2 was done in RFC-38. However, there were multiple issues with advertising a Hudi table as V2 without actually implementing certain APIs, and with using a custom relation rule to fall back to the V1 API. As a result, to address the performance regressions, the current implementation of `HoodieCatalog` and `Spark3DefaultSource` returns a `V1Table` instead of `HoodieInternalV2Table`.
The performance issues were not revealed in the initial PR because there was no proper benchmarking for such changes. Therefore, to restart this work, it is important first to decide how to benchmark the changes. Among other things, DataSource V1 allows custom logic, such as the use of Hudi indexes, which is not straightforward to implement in DataSource V2, so we need to cover cases like this in the benchmarking scenarios.
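As a heuristic sketch for validating a benchmark setup (not anything from the Hudi codebase), the physical plan shows which read path is actually taken: in Spark 3, a DataSource V2 read appears as a `BatchScan` node, while the V1 fallback appears as a relation-based scan (for example a `FileScan`). The table name `hudi_tbl` is a placeholder:

```python
from pyspark.sql import SparkSession

# Heuristic check of the read path for a Hudi table ("hudi_tbl" is a placeholder):
# a DataSource V2 read shows a BatchScan node in the physical plan, while the
# V1 fallback shows a relation-based scan such as FileScan.
spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT * FROM hudi_tbl").explain()
```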
If anybody has already gone down this path, please share your insights. Any suggestions about scenarios that should be considered are also welcome.