feat(raster): transparent OutDb byte loading via a GDAL loader hook#849
Draft
james-willis wants to merge 3 commits into
Draft
feat(raster): transparent OutDb byte loading via a GDAL loader hook#849james-willis wants to merge 3 commits into
james-willis wants to merge 3 commits into
Conversation
9695f7e to
05e1af8
Compare
05e1af8 to
a9cb62b
Compare
Adds a process-wide OutDb loader hook (OnceLock<fn>) in sedona-raster so
non-GDAL consumers (Arrow FFI, numeric kernels) can resolve pixels from
schema-OutDb bands transparently. BandRefImpl's three byte-access methods
(data, nd_buffer, contiguous_data) route through one private
source_bytes() helper that yields either the zero-copy Arrow column
(schema-InDb) or loader-resolved bytes anchored in a per-band OnceCell
(schema-OutDb). No new trait, no caller-visible API change.
Reframes is_indb() as a schema-level discriminator ("does the Arrow data
column carry these bytes inline?") rather than a runtime byte-availability
check; trait-default doc updated and the structural fix to drop the
default is tracked as DB-21. BandRefImpl already overrides correctly by
inspecting the column directly.
The post-load check_view_buffer_bounds invocation against loader output
mirrors the construction-time check the InDb path runs; a mismatched
source_shape vs bytes-length combination errors cleanly instead of
panicking inside materialize_strided.
9 new unit tests in outdb_loader.rs use a single URI-dispatched mock
loader (OnceLock allows exactly one registration per process) to cover:
loader bytes flowing through nd_buffer / contiguous_data / data, is_indb
staying false post-load, OnceCell caching across calls, missing outdb_uri,
loader errors, undersized loader output (bound-check trip), and the
data() infallible shim collapsing errors to &[].
Implements the GDAL backend for the OutDb loader hook introduced in the previous commit. `register_outdb_loader()` installs `gdal_load` as the process-wide handler; SedonaContext::new() calls this once per process via the next commit. gdal_load parses the `#band=N` URI fragment with the existing `parse_outdb_source`, retrieves the source dataset through the thread-local GDALDatasetCache (so the underlying GDAL Dataset is opened at most once per (thread, URI)), and reads the requested band as row-major bytes via `read_as_bytes`. 2-D only for v1; higher-rank reads need MDArray support and are out of scope. Required visibility bumps: - `GDALDatasetCache::get_or_create_outdb_source` → pub(crate) - `mod source_uri;` un-gated from `#[cfg(test)]` - new `arrow-schema` dep so the loader can build `ArrowError`s `#[allow(dead_code)]` on `gdal_common` / `gdal_dataset_provider` is kept; the loader uses a subset (with_gdal, convert_gdal_err, thread_local_cache, get_or_create_outdb_source) but the broader VRT machinery still awaits its consumers (issue apache#804). Three integration tests using temp-file GeoTIFFs cover the happy path, intra-band cache reuse on repeated calls, and `#band=N` selection on multi-band sources.
Wires sedona_raster_gdal::register_outdb_loader() into SedonaContext::new_from_context() after the existing raster-functions registration so every session has the OutDb byte path available without explicit setup. The registration is idempotent (`OnceLock::set` is first-write-wins), so repeated SedonaContext::new() calls in one process are safe.
a9cb62b to
7ad99b7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make
BandRef::nd_buffer(),contiguous_data(), anddata()Just Work for OutDb bands (bands whose Arrowdatacolumn is empty and whoseoutdb_uripoints elsewhere). Today those methods returnNotYetImplemented, blocking every UDF and downstream consumer from reading OutDb pixel bytes — the entire reason OutDb references exist as a schema feature is moot without a working byte path.Approach: a single statically-typed function-pointer hook in
sedona-raster, populated bysedona-raster-gdalat session bootstrap. No trait, no HashMap-keyed registry, no async — "compiled in, not pluggable". The only abstraction introduced is the one the crate-boundary forces (sedona-raster-gdaldepends onsedona-raster, not the reverse, so a direct call fromBandRefImplto GDAL is impossible).Changes
rust/sedona-raster/src/outdb_loader.rs—OutDbLoadRequest,OutDbBandLoaderfn-pointer type,OnceLock,set_outdb_band_loader, internalload_outdbdispatcher.rust/sedona-raster/src/array.rs—BandRefImplgainsoutdb_loaded: OnceCell<Vec<u8>>(lifetime anchor, not a cross-band cache).nd_buffer(),contiguous_data(),data()route through a newsource_bytes()helper that zero-copies from the Arrowdatacolumn for schema-InDb bands and falls back to the loader hook (anchored inoutdb_loaded) for schema-OutDb bands. Overflow-safe bounds check (checked_mul+try_into::<usize>). The legacydata()accessor logs the error vialog::error!before collapsing to&[]so a missing loader registration produces a diagnostic instead of a silent empty buffer.rust/sedona-raster/src/traits.rs—BandRef::is_indb()no longer has a default impl: the previous default (derived fromnd_buffer()emptiness) misclassifies OutDb bands as InDb once a loader is installed. All concrete impls now provide an explicit empty-column check. Adds a Sync caveat doc note ("implementations are not required to beSync").rust/sedona-raster/src/lib.rs— re-exportsset_outdb_band_loader,OutDbBandLoader,OutDbLoadRequest.rust/sedona-raster-gdal/src/outdb_loader.rs—gdal_loadfn +register_outdb_loader. 2-D[y,x]enforcement, pre-read pixel-type check (errors cleanly if the file's GDAL pixel type differs from the band metadata's claim), usesGDALDatasetCache::get_or_create_outdb_source.rust/sedona/src/context.rs— callssedona_raster_gdal::register_outdb_loader()fromSedonaContext::new_from_context()next to the existing function-set registration.Tests
sedona-raster(mock loader): 9 unit tests covering loader registration, round-trip viand_buffer()/contiguous_data()/data(), per-band cache reuse, missing-uri error, undersized-loader-output error, loader-failure propagation,is_indbinvariant after loader runs, anddata()empty-on-error fallback.sedona-raster(process-isolated integration):tests/no_loader_registered.rs— separate test binary soOUTDB_BAND_LOADERstays unset; asserts the "no OutDb loader registered" diagnostic message verbatim.sedona-raster-gdal(real GDAL): 7 integration tests writing real GeoTIFFs to temp dirs and verifying end-to-end:loads_outdb_band_bytes_from_geotiff— 4×3 single-band tiff,band.nd_buffer().buffermatches the file bytes.second_call_on_same_band_reuses_cache— verifiesBandRefImpl.outdb_loadedreuse on a secondnd_buffer()call.band_fragment_selects_correct_band— two-band tiff with#band=Nfragment selects the correct band.rejects_non_yx_dim_names— band metadata withdim_names=["x","y"]is rejected with a clear error instead of transposed-read corruption.rejects_pixel_type_mismatch— UInt16 GeoTIFF + band metadata claiming UInt8 is rejected at the loader boundary naming both types.errors_on_missing_file— clean error surface for unresolvable URIs.errors_on_band_index_out_of_range— clean error when#band=Nexceeds the file's band count.End-to-end SQL coverage
A SQL-layer test along the lines of
SELECT RS_Value(raster, x, y) FROM outdb_geotiff_tablewould also exercise this path, but no byte-readingRS_*function exists yet — the currentRS_*surface (RS_BandPath,RS_NumBands,RS_BandPixelType,RS_Envelope, …) is all metadata, none of them callband.nd_buffer(). SQL-layer coverage is deferred until a byte-reading kernel lands. The 3 real-GDAL integration tests are the right test layer until then.Relationship to PR #813 (view machinery)
This PR is independent of PR-D (#813 — view machinery /
materializedcell / strided walk) and based directly onmain. The view-composition path remains rejected at construction inRasterRefImpl::band(), so the OutDb byte path here only needs to handle the identity-view case.If PR-D lands first, this PR will need a small follow-up to integrate OutDb bytes with the strided walk in
data()(the byte-access surface is the soft conflict — both PRs rewritedata()/nd_buffer()/contiguous_data()). If this PR lands first, PR-D folds the secondOnceCellinto its existing materialization pattern. Either order is manageable; there is no hard duplication.Test plan
cargo test -p sedona-raster --lib(74 passing)cargo test -p sedona-raster --test no_loader_registered(process-isolated, 1 passing)cargo test -p sedona-raster-gdal --lib outdb_loader(7 passing)cargo clippy --all-targets -p sedona-raster -p sedona-raster-gdal -p sedona -- -D warningscargo fmt --all -- --check