Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@ Coverage Diff @@
##             main      #62       +/-   ##
===========================================
+ Coverage   52.41%   91.91%   +39.49%
===========================================
  Files          58       54        -4
  Lines       12936     4837     -8099
===========================================
- Hits         6781     4446     -2335
+ Misses       6155      391     -5764
```
For rollback behavior, can't we just make a copy of the original dataframe to do the next operation so we have something to fall back on?
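A minimal sketch of that copy-based rollback idea. The function name, the list-of-dicts stand-in for a dataframe, and the `operation` callable are all illustrative, not the PR's actual API; the point is only the copy-then-restore pattern.

```python
import copy


def apply_with_rollback(table, operation):
    """Run `operation` on `table`; fall back to an untouched copy on failure.

    `table` here is a plain list of row dicts standing in for whatever
    dataframe chronify would use (hypothetical example).
    """
    backup = copy.deepcopy(table)
    try:
        return operation(table)
    except Exception:
        return backup  # rollback: the pre-operation data survives


rows = [{"t": 1, "v": 10.0}, {"t": 2, "v": 20.0}]
ok = apply_with_rollback(rows, lambda r: [{**row, "v": row["v"] * 2} for row in r])
bad = apply_with_rollback(rows, lambda r: 1 / 0)  # operation fails; backup returned
```

The trade-off is memory: a deep copy of a large table can be expensive, which is why database-backed code usually prefers transactions over in-memory copies.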
lixiliu left a comment:
I am fine with switching to Ibis for its ability to avoid the ways we have to handle different backends now. There are fewer changes to the code structure than I expected.
I think Claude is overcomplicating the mapping logic a bit and could use some more iteration there.
```python
"""Convert time columns with from_schema to to_schema configuration."""

def _ensure_mapping_types_match_source(
```
Good call, but it is redundant with the data type handling later in the mapping process.
```python
df_mapping = _ensure_mapping_types_match_source(df_mapping, from_schema, backend)

# Debug: Print mapping DF around target time
try:
```
```python
if left_type is None or right_type is None:
    return left_col == right_col

left_is_unknown = not hasattr(left_type, "is_timestamp") or str(left_type).startswith(
```
Feels incomplete and inconsistent.
Generally we only want to change the right_table key because left_table is the input data, right_table is the mapping table. IMO it's safer to have the right key conform to the left key only.
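A minimal sketch of that "right conforms to left" rule. The helper name and the plain string type names are hypothetical stand-ins for whatever type objects Ibis would expose; only the mapping (right) table ever gets cast.

```python
def plan_right_key_casts(left_types, right_types, keys):
    """Return {column: target_type} casts to apply to the right table only.

    left_types/right_types are {column: type-name} dicts (illustrative).
    The left table is the input data and is never modified.
    """
    casts = {}
    for key in keys:
        left_t = left_types.get(key)
        right_t = right_types.get(key)
        if left_t is not None and right_t is not None and left_t != right_t:
            casts[key] = left_t  # right key must conform to the left key's type
    return casts


casts = plan_right_key_casts(
    {"timestamp": "timestamp", "id": "int64"},
    {"timestamp": "string", "id": "int64"},
    ["timestamp", "id"],
)
```

In Ibis terms, each entry would then become a `right_table[col].cast(target)` before the join, leaving the input data untouched.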
```python
if resampling_operation:
    query = query.group_by(*groupby_stmt)

predicates = _build_join_predicates(left_table, right_table, keys)
joined = left_table.join(right_table, predicates)
```
These join operations are clean, but the rest seems unnecessarily complicated.
I think Claude tried to preserve the existing way of handling potential column conflicts, which is to split the final columns into left_columns, right_columns, and joined_columns, but it also wrote additional code to handle potential conflicts all over again. That's why _build_select_columns and _build_query take so many input variables and are so complicated.
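The split-and-reconcile logic described above could likely collapse into a single pass over the joined columns. A sketch of that simplification (function and suffix naming are illustrative, not the PR's code):

```python
def resolve_output_columns(left_cols, right_cols, keys):
    """Choose final column names after a join, handling conflicts in one place.

    Join keys appear once (taken from the left table); any other right-side
    column that collides with a left-side name gets a suffix. All names here
    are hypothetical examples.
    """
    out = list(left_cols)
    for col in right_cols:
        if col in keys:
            continue  # the join key is already present from the left table
        out.append(f"{col}_right" if col in left_cols else col)
    return out


cols = resolve_output_columns(
    ["timestamp", "value"], ["timestamp", "value", "mapped_time"], ["timestamp"]
)
```

Centralizing the conflict rule this way would let _build_select_columns and _build_query shrink to a couple of arguments each.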
This is a prototype, mostly generated by Claude. The goal is to see if we can simplify interaction with dsgrid when running Spark jobs. The current code, based on SQLAlchemy, requires a separate Spark session: dsgrid creates a session with pyspark, and chronify relies on an Apache Thrift Server (Hive).
We would get the following benefits by migrating:
We would lose this functionality in SQLAlchemy:
Outstanding work: