Skip to content

feat(raster): transparent OutDb byte loading via a GDAL loader hook#849

Draft
james-willis wants to merge 3 commits into
apache:mainfrom
james-willis:jw/outdb-loader
Draft

feat(raster): transparent OutDb byte loading via a GDAL loader hook#849
james-willis wants to merge 3 commits into
apache:mainfrom
james-willis:jw/outdb-loader

Conversation

@james-willis
Copy link
Copy Markdown
Contributor

@james-willis james-willis commented May 15, 2026

Summary

Make BandRef::nd_buffer(), contiguous_data(), and data() Just Work for OutDb bands (bands whose Arrow data column is empty and whose outdb_uri points elsewhere). Today those methods return NotYetImplemented, blocking every UDF and downstream consumer from reading OutDb pixel bytes — the entire reason OutDb references exist as a schema feature is moot without a working byte path.

Approach: a single statically-typed function-pointer hook in sedona-raster, populated by sedona-raster-gdal at session bootstrap. No trait, no HashMap-keyed registry, no async — "compiled in, not pluggable". The only abstraction introduced is the one the crate-boundary forces (sedona-raster-gdal depends on sedona-raster, not the reverse, so a direct call from BandRefImpl to GDAL is impossible).

Changes

  • New rust/sedona-raster/src/outdb_loader.rsOutDbLoadRequest, OutDbBandLoader fn-pointer type, OnceLock, set_outdb_band_loader, internal load_outdb dispatcher.
  • Modified rust/sedona-raster/src/array.rsBandRefImpl gains outdb_loaded: OnceCell<Vec<u8>> (lifetime anchor, not a cross-band cache). nd_buffer(), contiguous_data(), data() route through a new source_bytes() helper that zero-copies from the Arrow data column for schema-InDb bands and falls back to the loader hook (anchored in outdb_loaded) for schema-OutDb bands. Overflow-safe bounds check (checked_mul + try_into::<usize>). The legacy data() accessor logs the error via log::error! before collapsing to &[] so a missing loader registration produces a diagnostic instead of a silent empty buffer.
  • Modified rust/sedona-raster/src/traits.rsBandRef::is_indb() no longer has a default impl: the previous default (derived from nd_buffer() emptiness) misclassifies OutDb bands as InDb once a loader is installed. All concrete impls now provide an explicit empty-column check. Adds a Sync caveat doc note ("implementations are not required to be Sync").
  • Modified rust/sedona-raster/src/lib.rs — re-exports set_outdb_band_loader, OutDbBandLoader, OutDbLoadRequest.
  • New rust/sedona-raster-gdal/src/outdb_loader.rsgdal_load fn + register_outdb_loader. 2-D [y,x] enforcement, pre-read pixel-type check (errors cleanly if the file's GDAL pixel type differs from the band metadata's claim), uses GDALDatasetCache::get_or_create_outdb_source.
  • Modified rust/sedona/src/context.rs — calls sedona_raster_gdal::register_outdb_loader() from SedonaContext::new_from_context() next to the existing function-set registration.

Tests

sedona-raster (mock loader): 9 unit tests covering loader registration, round-trip via nd_buffer() / contiguous_data() / data(), per-band cache reuse, missing-uri error, undersized-loader-output error, loader-failure propagation, is_indb invariant after loader runs, and data() empty-on-error fallback.

sedona-raster (process-isolated integration): tests/no_loader_registered.rs — separate test binary so OUTDB_BAND_LOADER stays unset; asserts the "no OutDb loader registered" diagnostic message verbatim.

sedona-raster-gdal (real GDAL): 7 integration tests writing real GeoTIFFs to temp dirs and verifying end-to-end:

  • loads_outdb_band_bytes_from_geotiff — 4×3 single-band tiff, band.nd_buffer().buffer matches the file bytes.
  • second_call_on_same_band_reuses_cache — verifies BandRefImpl.outdb_loaded reuse on a second nd_buffer() call.
  • band_fragment_selects_correct_band — two-band tiff with #band=N fragment selects the correct band.
  • rejects_non_yx_dim_names — band metadata with dim_names=["x","y"] is rejected with a clear error instead of transposed-read corruption.
  • rejects_pixel_type_mismatch — UInt16 GeoTIFF + band metadata claiming UInt8 is rejected at the loader boundary naming both types.
  • errors_on_missing_file — clean error surface for unresolvable URIs.
  • errors_on_band_index_out_of_range — clean error when #band=N exceeds the file's band count.

End-to-end SQL coverage

A SQL-layer test along the lines of SELECT RS_Value(raster, x, y) FROM outdb_geotiff_table would also exercise this path, but no byte-reading RS_* function exists yet — the current RS_* surface (RS_BandPath, RS_NumBands, RS_BandPixelType, RS_Envelope, …) is all metadata, none of them call band.nd_buffer(). SQL-layer coverage is deferred until a byte-reading kernel lands. The 3 real-GDAL integration tests are the right test layer until then.

Relationship to PR #813 (view machinery)

This PR is independent of PR-D (#813 — view machinery / materialized cell / strided walk) and based directly on main. The view-composition path remains rejected at construction in RasterRefImpl::band(), so the OutDb byte path here only needs to handle the identity-view case.

If PR-D lands first, this PR will need a small follow-up to integrate OutDb bytes with the strided walk in data() (the byte-access surface is the soft conflict — both PRs rewrite data() / nd_buffer() / contiguous_data()). If this PR lands first, PR-D folds the second OnceCell into its existing materialization pattern. Either order is manageable; there is no hard duplication.

Test plan

  • cargo test -p sedona-raster --lib (74 passing)
  • cargo test -p sedona-raster --test no_loader_registered (process-isolated, 1 passing)
  • cargo test -p sedona-raster-gdal --lib outdb_loader (7 passing)
  • cargo clippy --all-targets -p sedona-raster -p sedona-raster-gdal -p sedona -- -D warnings
  • cargo fmt --all -- --check

@james-willis james-willis force-pushed the jw/outdb-loader branch 3 times, most recently from 9695f7e to 05e1af8 Compare May 15, 2026 21:33
@github-actions github-actions Bot requested a review from paleolimbot May 15, 2026 21:39
claude added 3 commits May 15, 2026 15:31
Adds a process-wide OutDb loader hook (OnceLock<fn>) in sedona-raster so
non-GDAL consumers (Arrow FFI, numeric kernels) can resolve pixels from
schema-OutDb bands transparently. BandRefImpl's three byte-access methods
(data, nd_buffer, contiguous_data) route through one private
source_bytes() helper that yields either the zero-copy Arrow column
(schema-InDb) or loader-resolved bytes anchored in a per-band OnceCell
(schema-OutDb). No new trait, no caller-visible API change.

Reframes is_indb() as a schema-level discriminator ("does the Arrow data
column carry these bytes inline?") rather than a runtime byte-availability
check; trait-default doc updated and the structural fix to drop the
default is tracked as DB-21. BandRefImpl already overrides correctly by
inspecting the column directly.

The post-load check_view_buffer_bounds invocation against loader output
mirrors the construction-time check the InDb path runs; a mismatched
source_shape vs bytes-length combination errors cleanly instead of
panicking inside materialize_strided.

9 new unit tests in outdb_loader.rs use a single URI-dispatched mock
loader (OnceLock allows exactly one registration per process) to cover:
loader bytes flowing through nd_buffer / contiguous_data / data, is_indb
staying false post-load, OnceCell caching across calls, missing outdb_uri,
loader errors, undersized loader output (bound-check trip), and the
data() infallible shim collapsing errors to &[].
Implements the GDAL backend for the OutDb loader hook introduced in
the previous commit. `register_outdb_loader()` installs `gdal_load` as
the process-wide handler; SedonaContext::new() calls this once per
process via the next commit.

gdal_load parses the `#band=N` URI fragment with the existing
`parse_outdb_source`, retrieves the source dataset through the
thread-local GDALDatasetCache (so the underlying GDAL Dataset is
opened at most once per (thread, URI)), and reads the requested band
as row-major bytes via `read_as_bytes`. 2-D only for v1; higher-rank
reads need MDArray support and are out of scope.

Required visibility bumps:
  - `GDALDatasetCache::get_or_create_outdb_source` → pub(crate)
  - `mod source_uri;` un-gated from `#[cfg(test)]`
  - new `arrow-schema` dep so the loader can build `ArrowError`s

`#[allow(dead_code)]` on `gdal_common` / `gdal_dataset_provider` is
kept; the loader uses a subset (with_gdal, convert_gdal_err,
thread_local_cache, get_or_create_outdb_source) but the broader VRT
machinery still awaits its consumers (issue apache#804).

Three integration tests using temp-file GeoTIFFs cover the happy path,
intra-band cache reuse on repeated calls, and `#band=N` selection on
multi-band sources.
Wires sedona_raster_gdal::register_outdb_loader() into
SedonaContext::new_from_context() after the existing raster-functions
registration so every session has the OutDb byte path available
without explicit setup.

The registration is idempotent (`OnceLock::set` is first-write-wins),
so repeated SedonaContext::new() calls in one process are safe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants