Add TableSink operator with Java/Spark implementations #665
harrygav wants to merge 8 commits into apache:main
Conversation
…ations and simple tests
Thanks @harrygav, this is great! Could we make
Thank you - just to get the tests running, how about mocking the JDBC layer? Wrap
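For illustration, one way to mock the JDBC layer without a live database is a dynamic proxy that records every SQL string the sink issues, so tests can assert on the statements. This is a minimal sketch with hypothetical names, not code from the PR:

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Minimal JDBC stub: proxies for Connection and Statement that record
// every SQL string passed to execute(), so tests need no real database.
class RecordingJdbc {
    final List<String> executedSql = new ArrayList<>();

    Connection connection() {
        return (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[]{Connection.class},
                (proxy, method, args) -> {
                    if ("createStatement".equals(method.getName())) {
                        return statement();
                    }
                    return defaultValue(method.getReturnType());
                });
    }

    private Statement statement() {
        return (Statement) Proxy.newProxyInstance(
                Statement.class.getClassLoader(),
                new Class<?>[]{Statement.class},
                (proxy, method, args) -> {
                    if ("execute".equals(method.getName()) && args != null) {
                        executedSql.add((String) args[0]);
                        return true;
                    }
                    return defaultValue(method.getReturnType());
                });
    }

    // Proxies must return a value matching the declared return type.
    private static Object defaultValue(Class<?> type) {
        if (type == boolean.class) return false;
        if (type == int.class) return 0;
        if (type == long.class) return 0L;
        return null;
    }
}
```

A test could then run the sink against `RecordingJdbc.connection()` and assert on `executedSql`. A sink that uses `PreparedStatement` would need the proxy extended accordingly.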
I will take a look and update the PR to continue the discussion!
Hey @harrygav, any news on this? Apparently a table sink is crucial for many things and we would like to start using it already.
Also, another question: wouldn't it make sense to also have a sink for a database? Now you have implemented one execution operator for Java and one for Spark, but why not for a database?
Hi @zkaoudi, nice to hear that the PR will be useful, I will follow up on this by the end of the week! Thanks for your input, I think there are many things to be clarified for the sink operator, but I guess we will figure them out once we know more about the targeted use cases we want to cover. With the current implementation, you could do the ETL pipeline you mention through the Java or Spark platforms: e.g., Source(Java/Spark from DBMS1)->ETL(Java/Spark)->Sink(Java/Spark to DBMS2). Or were you thinking of writing from DBMS1 into DBMS2 directly, without any intermediate Java/Spark platform step? That would also be interesting for some use cases (improved performance) but becomes cumbersome to maintain in terms of interoperability. Let me know what you think!
Yes, I was thinking about directly writing from DBMS1 to DBMS2. For example, if you have two tables in two DBMSs and you want to join them and write the result into DBMS2 without doing the join in Spark or Java. What do you mean by issues of interoperability?
But on second thought, what I described above as a scenario is more like a conversion operator in addition to a sink. You would ideally want to create a temp table to do the join and then persist the result.
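For illustration, the "join on the DBMS and persist" step described above could, in principle, be pushed down as a single statement executed on DBMS2, assuming DBMS2 can already see both tables (e.g. via a foreign-data wrapper or linked table). All names below are made up; this sketches the SQL such a DBMS->DBMS operator might issue:

```java
// Illustrative only: compose the statement a hypothetical DBMS->DBMS
// conversion/sink operator could run on DBMS2 to join and persist in place.
class PushdownSqlSketch {
    static String joinAndPersist(String targetTable, String left, String right, String joinKey) {
        return "CREATE TABLE " + targetTable + " AS "
                + "SELECT * FROM " + left + " l "
                + "JOIN " + right + " r ON l." + joinKey + " = r." + joinKey;
    }
}
```

Whether to model this as a temp-table plus sink, or as one `CREATE TABLE ... AS SELECT`, is exactly the kind of design question raised above.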
Introduce generic type support and dialect-aware SQL mapping using Calcite. Add extensive H2 integration tests for Java and Spark covering various edge cases.
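To make the dialect-aware mapping concrete, here is a hand-rolled stand-in for what a Calcite-based mapping does: choose a DDL type name per JDBC type and target dialect. The dialects and mappings shown are illustrative, not the PR's actual table:

```java
import java.sql.Types;

// Toy stand-in for dialect-aware type mapping: the real PR delegates this
// to Calcite, but the shape of the problem is a (JDBC type, dialect) -> DDL
// type-name function. Coverage here is deliberately minimal.
class DialectTypeSketch {
    enum Dialect { POSTGRESQL, H2, MYSQL }

    static String sqlType(int jdbcType, Dialect dialect) {
        switch (jdbcType) {
            case Types.INTEGER:
                return "INTEGER";
            case Types.DOUBLE:
                // PostgreSQL spells it DOUBLE PRECISION; H2 and MySQL accept DOUBLE.
                return dialect == Dialect.POSTGRESQL ? "DOUBLE PRECISION" : "DOUBLE";
            case Types.VARCHAR:
                // MySQL requires an explicit length for VARCHAR columns.
                return dialect == Dialect.MYSQL ? "VARCHAR(255)" : "VARCHAR";
            default:
                throw new IllegalArgumentException("unmapped JDBC type: " + jdbcType);
        }
    }
}
```

The edge cases the H2 tests cover presumably live exactly in these per-dialect differences.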
Hi all, picking up this one again. I just pushed a commit with the update:
I think it would be wise to add a couple of DBMSes for the tests, which would also be useful for the source operators or for supporting JDBC platforms themselves. This could be done either through their embedded versions or through Maven Testcontainers. Then, we could add support for sinks on other platforms, e.g. the JDBC platform, to also support DBMS->DBMS workloads. Let me know what you think about the PR, and whether we want to do some of the next steps (e.g., testing) here or in another PR.
.asf.yaml (Outdated)

```diff
- description: Apache Wayang is the first cross-platform data processing system.
- homepage: https://wayang.apache.org/
+ description: Apache Wayang(incubating) is the first cross-platform data processing system.
+ homepage: https://wayang.incubator.apache.org/
```
Do not modify this file, it seems it is an old version.
Sorry, this file change mistakenly sneaked in after the rebase.
Thanks a lot for your contribution, Harry!
Summary
This PR introduces a new `TableSink` operator for writing `Record` data into a database table via JDBC, with implementations for the Java and Spark platforms.

Opening as Draft to start discussion on the operator design and expected behavior.
Changes

New operator: `TableSink` (in wayang-basic)
- A `UnarySink<Record>` that targets a table name and accepts JDBC connection `Properties`, a `mode` (e.g. overwrite), and optional column names
Java platform: `JavaTableSink` (in wayang-java)
- Implements `overwrite` by dropping the target table first
Spark platform: `SparkTableSink` (in wayang-spark)
- Implements the `TableSink` operator

Notes / open questions
- … VARCHARs)
- The `mode` behavior (overwrite vs append, etc.) should be agreed on and formalized.
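To make the `mode` discussion concrete, here is a rough sketch of the statement sequence an overwrite-mode sink might emit: drop, recreate, then a parameterized insert. All names are made up for the example; this is a discussion aid, not the PR's actual implementation:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical overwrite-mode statement plan for a JDBC table sink.
class OverwriteSqlSketch {
    // columnDefs e.g. ["id INTEGER", "name VARCHAR"]; column names are
    // derived from the first token of each definition.
    static List<String> overwriteStatements(String table, List<String> columnDefs) {
        List<String> names = columnDefs.stream()
                .map(def -> def.split("\\s+")[0])
                .collect(Collectors.toList());
        String placeholders = String.join(", ", Collections.nCopies(names.size(), "?"));
        return Arrays.asList(
                "DROP TABLE IF EXISTS " + table,
                "CREATE TABLE " + table + " (" + String.join(", ", columnDefs) + ")",
                "INSERT INTO " + table + " (" + String.join(", ", names) + ") VALUES (" + placeholders + ")");
    }
}
```

An append mode would presumably skip the first two statements, which is part of what needs formalizing.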
How to use / test

To run end-to-end locally, you currently need an external PostgreSQL instance available and must provide the JDBC connection details (driver/url/user/password) in the test setup/environment.
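As a sketch of that setup, the connection details could be bundled into a `java.util.Properties` object like the one the sink accepts. The property keys and values below are hypothetical; check the operator's actual configuration keys before using them:

```java
import java.util.Properties;

// Illustrative local test configuration for an external PostgreSQL
// instance. All keys, credentials, and the URL are placeholders.
class SinkTestConfig {
    static Properties postgresProps() {
        Properties props = new Properties();
        props.setProperty("driver", "org.postgresql.Driver");
        props.setProperty("url", "jdbc:postgresql://localhost:5432/wayang_test");
        props.setProperty("user", "wayang");
        props.setProperty("password", "wayang");
        return props;
    }
}
```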