fix(rust/sedona-query-planner): Ensure user provided RS_EnsureLoaded call preserves metadata#969
Conversation
| /// Logical optimizer rule that wraps async scalar UDF calls with | ||
| /// `sd_restore_metadata` to preserve field metadata stripped at the | ||
| /// physical layer (DF-22662). | ||
| #[derive(Default, Debug)] | ||
| pub struct WrapAsyncUdfRule; |
There was a problem hiding this comment.
This is the real addition...we wrap AsyncUdfs with a UDF that applies extra metadata to the output (based on expr.to_field(), which is correct for an async udf at the logical level.
| /// Registry/UDF name of the metadata restoration helper. | ||
| pub const RESTORE_METADATA_NAME: &str = "sd_restore_metadata"; | ||
|
|
||
| /// Build the sync metadata-restore UDF with the given metadata map. The rule | ||
| /// constructs one per rewrite, embedding the metadata that should be stamped | ||
| /// onto the output field. `Stable` volatility so CSE can deduplicate identical | ||
| /// `sd_restore_metadata(...)` subtrees. | ||
| pub fn restore_metadata_udf(metadata: HashMap<String, String>) -> Arc<ScalarUDF> { | ||
| Arc::new(ScalarUDF::new_from_impl(RestoreMetadata::new(metadata))) | ||
| } |
There was a problem hiding this comment.
This was extracted out of the previous "reraster" (since it could in theory do things that aren't raster)
| def test_rs_ensureloaded(con, sedona_testing): | ||
| path = sedona_testing / "data/raster/sentinel2.tif" | ||
| t = con.sql("SELECT RS_FromPath($1) AS raster", params=(str(path),)) | ||
| tab = t.select(raster=t.raster.funcs.rs_ensureloaded()).to_arrow_table() | ||
| r = tab["raster"][0].as_py() | ||
| assert r.height == 512 | ||
| assert r.width == 512 | ||
|
|
||
| assert len(r.bands) == 1 | ||
| b = r.bands[0] | ||
| assert b.shape == (512, 512) | ||
| arr = b.to_numpy() | ||
| assert arr.shape == (512, 512) | ||
| assert arr.dtype == "uint16" | ||
| assert arr[0, 0] == 2324 |
There was a problem hiding this comment.
Pull request overview
This PR fixes raster metadata loss when users explicitly call RS_EnsureLoaded(...) by moving the DataFusion async-UDF metadata workaround into its own logical optimizer rule (WrapAsyncUdfRule) and adding targeted tests for both the rule and the Python-facing behavior.
Changes:
- Simplifies
RS_EnsureLoadedasync invocation argument handling to preserve metadata for user-provided calls. - Introduces
sd_restore_metadata+WrapAsyncUdfRuleto wrap async scalar UDFs and re-stampreturn_fieldmetadata (DF-22662 workaround), and wires it into the optimizer pipeline. - Updates Python raster function tests and raster band buffer access to reflect/validate the restored metadata behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/sedona-raster-functions/src/rs_ensure_loaded.rs | Adjusts async invocation to use to_array(...) to better preserve metadata. |
| rust/sedona-query-planner/src/wrap_async_udf.rs | New optimizer rule to wrap async scalar UDFs with sd_restore_metadata (plus unit tests). |
| rust/sedona-query-planner/src/restore_metadata.rs | New sync identity UDF that stamps stored metadata onto its output field (plus unit tests). |
| rust/sedona-query-planner/src/optimizer.rs | Registers WrapAsyncUdfRule alongside ensure-loaded rule ahead of CSE. |
| rust/sedona-query-planner/src/lib.rs | Exposes new modules; removes old reraster workaround module. |
| rust/sedona-query-planner/src/ensure_loaded.rs | Removes the old reraster wrapper; now emits rs_ensureloaded(arg) directly; updates tests and rule ordering assertions. |
| rust/sedona-query-planner/src/ensure_loaded_reraster.rs | Deletes the previous DF-22662 workaround implementation. |
| python/sedonadb/tests/functions/test_raster_functions.py | Refactors raster function tests and adds a regression test for rs_ensureloaded() behavior. |
| python/sedonadb/tests/functions/conftest.py | Removes the now-unneeded raster_con fixture. |
| python/sedonadb/python/sedonadb/raster.py | Allows band .data access after loading when outdb_uri exists but bytes are present. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """The band data as a typed, shaped memoryview.""" | ||
| if self.outdb_uri is not None: | ||
| source_data = self.source_data | ||
| if self.outdb_uri is not None and len(source_data) == 0: |
There was a problem hiding this comment.
you should also make sure that this is not a 0 byte raster. if hte product of the shape is 0 then it is always indb and [] is the byte array
| /// `sd_restore_metadata` to preserve field metadata stripped at the | ||
| /// physical layer (DF-22662). | ||
| #[derive(Default, Debug)] | ||
| pub struct WrapAsyncUdfRule; |
There was a problem hiding this comment.
can we add an idempotency test for this rule? we want to make sure we dont rewrap
| let Ok((_, return_field)) = expr.to_field(schema) else { | ||
| // Can't determine return field; skip wrapping. | ||
| return Ok(Transformed::no(expr)); | ||
| }; |
There was a problem hiding this comment.
should we throw an internal error in this case? I dont think it should happen but if it does, its probably an internal sedona db bug
There was a problem hiding this comment.
actually maybe this will tend to be caused by the user, like an ambiguous column name in the join result
We can't show off the fact that we can load pixels if we can't call it!
In #886 we added an optimizer rule that wraps pixel-requiring functions in RS_EnsureLoaded AND an extra function that works around async UDFs dropping metadata at the physical plan level. This doesn't handle normal calls to RS_EnsureLoaded, which are useful for testing the loaders or generally moving back into Python land.
This PR separates separates out the rule that wraps async udfs in the metadata-fixing function and tests it separately.