Spark DataSource V2 read and write benchmarks #13955
-
@leesf tagging you in case you have some old context to add/capture here.
-
From my point of view, the main stages are the following: …
-
For queries that read and write the same table, combining a V1 write with a V2 read could be tricky:

```sql
-- 1st example
INSERT INTO hudi_tbl
SELECT * FROM hudi_tbl WHERE ...

-- 2nd example
UPDATE hudi_tbl t
SET somecol = somecol + 100
WHERE EXISTS (
  SELECT 1
  FROM hudi_tbl s
  WHERE s.id = t.id
    AND s.anothercol > 100
);
```

I suppose we could shift the focus to full support of DataSource V2 (read and write) without a performance drop, instead of trying to support a V1 write and a V2 read simultaneously. In that case we would also only have to resolve compatibility issues from the V1 >> V2 migration point of view, rather than deal with a complex hybrid migration with a lot of edge cases.
-
For a start, I will benchmark a read from a Kafka topic (8 partitions) with a direct write to a Hudi table (MOR, upsert, bucket index, 16 buckets). The PySpark script will be run on a local PC, which acts as the driver and submits the job to a remote Spark cluster (Spark 3.5.7) with 8 executors (3 CPUs, 8 GB of memory each). The data in the Kafka topic is … The write scenario is 4 commits in total.
The Hudi table, the Spark event log directory, and the SQL warehouse directory are placed on a separate HDFS cluster to prevent any data transfer to the driver. For Hudi 1.1.0 (the V1 path is used), the total time is about 17 minutes.
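A minimal sketch of what such a benchmark job could look like, assuming hypothetical names (topic `events`, record key `id`, precombine field `ts`, HDFS path) and a simplified payload schema, none of which are from this thread; the Hudi write options match the scenario above (MOR, upsert, bucket index with 16 buckets), and the session settings are the ones recommended in the Hudi docs:

```python
# Sketch of the benchmark job: batch-read a Kafka topic and upsert into a Hudi MOR
# table. Topic name, schema, record key, and paths are hypothetical placeholders.
# Assumes the spark-sql-kafka and hudi-spark bundle jars are on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("hudi-dsv2-benchmark")
    # Session settings recommended in the Hudi docs.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
    .getOrCreate()
)

# Assumed payload schema of the Kafka messages.
schema = StructType([
    StructField("id", LongType()),
    StructField("ts", LongType()),
    StructField("somecol", LongType()),
    StructField("anothercol", LongType()),
])

# Batch read of the whole topic (8 partitions).
src = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder
    .option("subscribe", "events")                    # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("v"))
    .select("v.*")
)

# MOR table, upsert operation, bucket index with 16 buckets, as in the scenario above.
hudi_options = {
    "hoodie.table.name": "hudi_tbl",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "16",
}

# One commit per save(); the real scenario repeats this for 4 commits.
src.write.format("hudi").options(**hudi_options).mode("append").save("hdfs:///tables/hudi_tbl")
```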

-
Integration of Spark DataSource V2 was done in RFC-38. However, there were multiple issues with advertising a Hudi table as V2 without actually implementing certain APIs, and with using a custom relation rule to fall back to the V1 API. As a result, to address the performance regressions, the current implementation of `HoodieCatalog` and `Spark3DefaultSource` returns a `V1Table` instead of `HoodieInternalV2Table`.
The performance issues were not revealed in the initial PR because there was no proper benchmarking for such changes. Therefore, to restart this work, it is important first to decide how to benchmark the changes. Among other things, DataSource V1 allows custom logic, such as the use of Hudi indexes, which is not straightforward to implement in DataSource V2, so we need to cover cases like this in the benchmarking scenarios.
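As a heuristic sketch for validating a benchmark setup (not anything from the Hudi codebase), the physical plan shows which read path is actually taken: in Spark 3, a DataSource V2 read appears as a `BatchScan` node, while the V1 fallback appears as a relation-based scan (for example a `FileScan`). The table name `hudi_tbl` is a placeholder:

```python
from pyspark.sql import SparkSession

# Heuristic check of the read path for a Hudi table ("hudi_tbl" is a placeholder):
# a DataSource V2 read shows a BatchScan node in the physical plan, while the
# V1 fallback shows a relation-based scan such as FileScan.
spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT * FROM hudi_tbl").explain()
```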
If anybody has already gone down this path, please share your insights. Any suggestions about scenarios that should be considered are also welcome.