[WIP] feat(raster): view machinery for non-identity band views#813
[WIP] feat(raster): view machinery for non-identity band views#813james-willis wants to merge 8 commits into
Conversation
349c957 to
91966ed
Compare
`raster_ref_to_gdal_mem` previously returned a `Result<Dataset>` and
guarded against `BandRef::contiguous_data()` returning `Cow::Owned`
with a runtime tripwire ("Internal: contiguous_data must be borrowed
for is_2d bands; got owned"). The check was correct — handing GDAL a
pointer into a `Vec<u8>` that drops at the end of the iteration would
dangle — but it ties an internal invariant ("`is_2d` ⇒ Borrowed") to
incidental properties of today's reader. Any future copy path in the
reader (compression, BinaryView block-boundary stitching, alignment
fix-up, sliced/broadcast/transposed views from apache#813 / apache#750) would
detonate the tripwire on perfectly valid 2-D rasters.
Change: return `Result<(Dataset, Vec<Vec<u8>>)>`. On `Cow::Borrowed`
the GDAL band still points directly at the StructArray buffer
(zero-copy). On `Cow::Owned` we move the `Vec<u8>` out of the Cow
without copying — the reader's existing materialization is the only
allocation — and stash it in the returned vector. The caller (the
provider in `gdal_dataset_provider.rs`) parks it in a new
`RasterDataset::_owned_band_bytes` field that lives as long as the
MEM dataset that holds the pointers.
`raster_ref_to_gdal_empty` discards the always-empty vector.
9e5ae44 to
7f2f79a
Compare
d27d8d1 to
e890f90
Compare
paleolimbot
left a comment
There was a problem hiding this comment.
This is very cool!
I think I got there by the end in the inline comments, but I think a struct ViewEntries with associated functions in src/view_entries.rs where the non serialize-to-arrow and deserialize-from-arrow logic behind this lives will make this clearer.
The main issue I have with the implementation here is that it relies on hidden Vec<u8>s that aren't reusable. To manage memory properly we are going to have to register some scratch spaces with the session's memory pool and reuse them between iterations of a loop. Hiding them within various structures for convenience is going to make it tricky to implement a correct pattern going forward.
| /// with no allocation or copy. Non-identity views must materialize a | ||
| /// row-major copy on first byte access. |
There was a problem hiding this comment.
| /// with no allocation or copy. Non-identity views must materialize a | |
| /// row-major copy on first byte access. | |
| /// with no allocation or copy. Non-identity views must materialize a | |
| /// row-major copy on first byte access when using these functions | |
| /// (use `nd_buffer()` to guarantee that a copy is never produced). |
| ) -> Result<Vec<ViewEntry>, ArrowError> { | ||
| let mut out = Vec::with_capacity(next_view.len()); | ||
| for (k, next) in next_view.iter().enumerate() { | ||
| if next.source_axis < 0 || (next.source_axis as usize) >= input_view.len() { |
There was a problem hiding this comment.
Should source_axis be a usize? I'm wary of using as here given that this is sensitive territory (will panic on out of bounds) although it is probably fine.
| pub(crate) fn visible_shape_from_view(view: &[ViewEntry]) -> Vec<u64> { | ||
| view.iter().map(|v| v.steps as u64).collect() | ||
| } |
There was a problem hiding this comment.
These should probably be all i64s unless there is a compelling reason to use the unsigned version to minimize the number of times we have to as between signed and unsigned types.
There was a problem hiding this comment.
I cleaned out some of the u64s. These are the ones I left in the raster ecosystem:
- BandRef::width() -> u64 / height() -> u64 (and the RasterMetadata shim equivalents) — 2D-compat accessors. Changing these is API-breaking across sedona-functions, sedona-raster-gdal, callers in tests, etc.
- BandRef::raw_source_shape() -> &[u64] and dim_size() -> Option — same reason; they sit on the public trait.
- source_shape, spatial_shape Arrow columns type → could be done as a follow up.
| #[allow(clippy::too_many_arguments)] | ||
| pub(crate) fn start_band_with_view( |
There was a problem hiding this comment.
Clippy has a good point here. Define struct StartBandWithViewArgs<'a> { ... } with all pub fields, which will force callers to write start_band_with_view(StartBandWithViewArgs { <named params> }).
You could also expose more than one function or have a BandBuilder of some kind to separate these things into more than one call.
| #[allow(clippy::too_many_arguments)] | ||
| pub fn with_view( |
There was a problem hiding this comment.
Here as well (you can probably use a common Args struct)
| let view = [ | ||
| ViewEntry { | ||
| source_axis: 0, |
There was a problem hiding this comment.
A helper struct to manage Vec<ViewEntry> would make it more likely that other people will write reasonable tests. Or a macro (view_entries![0:4, 1:5]; use Python slice syntax since that's what you'd type in numpy to do this).
| None, | ||
| &["y", "x"], | ||
| &[3, 3], // 3x3 source | ||
| &view, | ||
| BandDataType::Float32, | ||
| None, | ||
| None, | ||
| None, |
There was a problem hiding this comment.
A BandBuilder may help you quite a bit here since there are a lot of None that a default constructor could fill in
| #[test] | ||
| fn test_nd_buffer_permutation_and_slice_combined() { | ||
| // 2D source [Y=4, X=3]. View permutes (visible order [X, Y]) and | ||
| // slices Y from 1, step 2, steps 2. Expected: | ||
| // visible_shape = [3, 2] | ||
| // byte_strides = [step_X * stride_X_src, step_Y * stride_Y_src] | ||
| // = [1 * 1, 2 * 3] = [1, 6] | ||
| // byte_offset = start_X * stride_X_src + start_Y * stride_Y_src | ||
| // = 0 * 1 + 1 * 3 = 3 |
There was a problem hiding this comment.
These kinds of tests you probably don't want in this file. This file is about serializing stuff into Arrow and it's tricky to see what's going on...I'm not sure what the perfect abstraction is here but it should probably live in src/views.rs with tests and some pub functions that can be benchmarked.
Proabably struct ViewEntries { inner: Vec<ViewEntry> } with associated methods would make this a bit cleaner than the &[ViewEntry] + free functions.
|
Somewhat offtopic but are you going to have the same concern with lazy outdb loading as well? (Outdb lazy loading + zarr loader is going to be higher on my priority list next week than this PR)
|
|
Possibly? I know it seems like a lot of comments but what I'm getting at is: let mut scratch = Vec::new();
for raster in rasters {
let data_slice = raster.band(0)?.contiguous_data(&mut scratch)?;
do_stuff(data_slice);
}Or on the day we need to track memory: let mut max_capacity = 0;
for raster in rasters {
max_capacity = max_capacity.max(raster.band(0)?.contiguous_data_alloc_size());
}
if max_capacity > config_options.raster.max_scratch_alloc {
// error
}
// If this ends up being a lot, frequently, we may also want to use a MemoryReservation to track it
// Probably this helps contiguous_data be faster since in theory there is only one heap allocation.
// You could probably use some of the unsafe modifiers to also ensure that there are no bounds
// checks in your materializer slowing things down
let mut scratch = Vec::with_capacity(max_capacity);
for raster in rasters {
let data_slice = raster.band(0)?.contiguous_data(&mut scratch)?;
do_stuff(data_slice);
}I'm not sure exactly how loading works yet but you might want something similar (with the hiccup that you need some number of async workers doing IO so you also need the same number of scratch buffers). (I have the same comment for pretty much everything that heap allocates in a loop, which is why, for example, our geometry writers write WKB directly into the Arrow output instead of returning |
b81d5f1 to
9267b04
Compare
31878b5 to
4564e8b
Compare
Adds the view-machinery layer that lets bands carry non-identity view entries — slices, broadcasts, axis permutations — and decode them correctly through the byte-access surface. Schema and trait surface: - validate_view() and visible_shape_from_view() helpers in traits.rs. Both pub(crate); validation runs implicitly inside start_band_with_view on the writer and RasterRef::band on the reader, so external callers don't need to invoke them directly. Builder: - start_band_with_view() writes a band with an explicit view column. Reader: - RasterRefImpl::band composes view -> byte strides + byte offset using checked arithmetic. Malformed views and arithmetic overflows route through sedona_internal_datafusion_err! so they surface with the standardised "SedonaDB internal error" framing. - nd_buffer() returns the source buffer plus shape/strides/offset (zero-allocation strided access for callers that want to walk pixels themselves). - contiguous_data() delegates to nd_buffer() for the OutDb check and identity-view fast path, then walks the strided copy via a shared materialize_strided helper. - BandRef::data() on BandRefImpl: identity views borrow from the Arrow column directly; non-identity views lazy-materialize a row-major copy into an internal OnceCell on first access and reuse it on subsequent calls. Repeat data() calls on the same band don't re-walk strides. GDAL bridge: - raster_ref_to_gdal_mem now returns (Dataset, Vec<Vec<u8>>) so the Vec backing GDAL's pointer for view-materialized (Cow::Owned) bands outlives the dataset. RasterDataset carries the matching _owned_band_bytes field. Tests (~50 new): validate_view rejection cases, builder-side view constructions (slice, broadcast, transpose, negative-step, multi-band mixed views, Arrow IPC round-trip), reader-side decoding of malformed view rows, OnceCell materialization correctness.
Addresses the should-fix and nit items from the PR-D review pass: * `BandRefImpl::byte_offset` is now `i64` throughout — the construction arithmetic, the field, and `materialize_strided`'s parameter all agree. The single `i64 -> u64` cast lives at the public `NdBuffer::offset` boundary, where it's provably safe (band() asserts `>= 0` before storing). Eliminates the previous unchecked round-trip cast inside `materialize_strided`. * Factor identity-view detection out of the byte-stride composition loop into a `pub(crate) is_identity_view(view, source_shape)` helper next to `visible_shape_from_view`. Six new unit tests including the permuted-axes-with-identity-step negative case. * New `check_view_buffer_bounds` precheck runs at `RasterRef::band()` construction for InDb bands: verifies that every byte the view can address lies within the data column AND that every stride × index product (and their accumulations) fits in i64. A corrupt Arrow row whose source_shape exceeds the bytes available now surfaces as a clean internal error at the read boundary. New test covers the corruption path; one stale builder test fixture (1024x1024 raster with 100 bytes of data) was tightened to 10x10 to satisfy the precheck. * `materialize_strided` is now infallible. With `check_view_buffer_bounds` load-bearing at the read boundary, the strided walk's per-step bound checks and `checked_mul`/`checked_add` calls are provably redundant and have been removed. Return type drops to `Vec<u8>`; the OnceCell init in `BandRef::data()` no longer needs a `panic!`-on-Err arm. Defense-in-depth lives at the precheck only; if a future refactor removes it, Rust's bound-checked slice indexing inside `materialize_strided` still catches buffer overruns (just with a less informative panic message). * Empty (non-null, zero-length) `view` rows are now read equivalently to NULL `view` rows (synthesize identity), for forward compatibility with producers that emit one form versus the other. Schema docstring updated; new reader test covers it. * `materialize_strided`'s outer-index increment uses `for k in (0..inner).rev()` with `break` instead of the `while k > 0; k -= 1` pattern. * GDAL bridge: explicit-index pattern replaces `last().unwrap()` when recording owned band bytes; a SAFETY comment documents the `Vec<Vec<u8>>` heap-pointer stability invariant that GDAL's raw pointer relies on. `raster_ref_to_gdal_empty` uses a hard `assert!` instead of `debug_assert!` so any future leak of GDAL-referenced bytes from an empty band list fails loudly in release builds too. * `BandRef::nd_buffer` docstring now states that the returned `shape`/`strides`/`offset` are guaranteed in-bounds by the read-boundary precheck and validate_view, so stride-aware consumers don't need to re-check against `buffer.len()`. * New 3-D cyclic-permutation test exercises the strided walk through every outer-axis combination plus a non-trivial inner step (`row_bytes_contiguous == false` path).
…g bands Adds the canonical user-facing API for "create a new view into an existing raster band." UDFs that slice / permute / broadcast can now express their operation in terms of the input's *visible* axes without knowing whether the input itself already carries a view — the builder composes views internally. Surface: * `pub fn RasterBuilder::with_view(name, dim_names, input, view, nodata, outdb_uri, outdb_format)` — takes an `&dyn BandRef` and a view spec relative to the input's visible axes; writes an output band layered on the input's source bytes (InDb) or pointing at the same external resource (OutDb). The supplied view composes with the input's existing view via the new internal `compose_view` helper. * `pub(crate) fn compose_view(input, next) -> Result<Vec<ViewEntry>, _>` in traits.rs. Walks `next` through `input` to produce a single source-space view. Checked arithmetic for stride × index products and start accumulation. Six unit tests cover identity pass-through, slice-of-slice collapse, source_axis preservation through permutation, step multiplication, source_axis-out-of-range rejection, and step-product overflow rejection. * `RasterBuilder::start_band_with_view` is dropped from the public API (visibility downgraded to `pub(crate)`). External callers should use `with_view`; the helper stays reachable from internal tests in the same crate. Storage semantics: * InDb input → InDb output. Source bytes are copied verbatim into the output's data column. Arrow `BinaryView` buffer-sharing is a future optimisation. * OutDb input → OutDb output. Data column stays empty; `outdb_uri` and `outdb_format` are inherited from the input (or overridden by the explicit arguments). Bytes are never fetched at construction — the output's composed view rides alongside the same external pointer, loading deferred to whoever reads the visible bytes. Tests added in builder.rs: * `with_view_over_identity_input_produces_expected_visible_bytes` * `with_view_chained_composes_into_single_view` — proves slice-of-slice collapses into a single source-space view layer end-to-end. * `with_view_on_outdb_input_produces_outdb_output_with_composed_view`
Per Dewey's review: every `source_axis`/`source_shape`/`visible_shape` in the view helpers should be `i64` to match `ViewEntry`'s signed fields and the stride arithmetic, instead of round-tripping through `u64` and sprinkling `as` casts through helper bodies. Reworked to make i64 the canonical view-machinery type — no more u64↔i64 conversions in `data()`, `nd_buffer()`, `contiguous_data()`, or the band-construction storage step. Public surface changes: - `BandRef::shape() -> &[i64]` (was `&[u64]`). Production callers outside the raster crate's own tests don't exist; tests use array literals which infer the new element type. `width()` and `height()` stay `u64` (conventional non-negative scalar counts) and pick up a single `as u64` cast in the default `dim_size()` impl. - `BandRef::raw_source_shape() -> &[u64]` unchanged — this is a direct slice into the Arrow `UInt64Array` column, changing the return type would force an alloc per call. - `NdBuffer.shape: Vec<i64>`, `NdBuffer.offset: i64` — internal struct, no out-of-crate consumers. Internals now `i64`: - `validate_view(view, source_shape: &[i64])`, with an added `source_shape[sa] >= 0` check at the boundary. - `is_identity_view(view, source_shape: &[i64])`. - `visible_shape_from_view(view) -> Vec<i64>`. - `BandRefImpl::visible_shape: Vec<i64>` (matches the trait return). - `materialize_strided` and `check_view_buffer_bounds` take `visible_shape: &[i64]`. Conversions concentrated at the two unavoidable boundaries — both are Arrow column reads: - `RasterRefImpl::band()` converts `raw_source_shape() -> &[u64]` into `Vec<i64>` once with a `try_from` overflow check. - `RasterBuilder::start_band_with_view` does the same conversion at its `source_shape: &[u64]` entry point before calling `validate_view`. Eliminates the source-stride `i64::try_from` cast that used to live inside `band()` (line ~527 pre-refactor). Tests pass: sedona-raster 131, sedona-raster-functions 143, sedona-raster-gdal 53. Arrow column storage and the public `width()`/`height()` API unchanged.
…ies.rs Per Dewey's review: the view machinery should live behind a named type with associated methods, not a `&[ViewEntry]` + free-function soup spread across `traits.rs`, `builder.rs`, and `array.rs`. New module `rust/sedona-raster/src/view_entries.rs`: - `ViewEntry` struct (moved from traits.rs; re-exported from there so existing import paths keep working). - `ViewEntries(Vec<ViewEntry>)` newtype with the full surface as methods: `validate`, `is_identity`, `visible_shape`, `compose`, plus `identity_for_shape` constructor and the usual `len`/ `as_slice`/`iter`/`Index`/`IntoIterator` ergonomics. - All view-machinery tests (validate, is_identity, compose, visible_shape) live in this file's `#[cfg(test)]` module. - `view_entries![start;stop, ...]` macro for Python-slice-style test construction. `;` rather than `:` because `:` isn't valid Rust macro syntax for the `start:stop` form, but `start;stop` is a pragmatic stand-in that reads close enough at the call site. Migrated call sites: - `array.rs`: `RasterRefImpl::band` builds a `ViewEntries` (either via `identity_for_shape` for the NULL/empty-encoded identity case or via `ViewEntries::new` for the decoded-from-Arrow case) and uses `.validate(...)`, `.visible_shape()`, `.is_identity(...)`. `BandRefImpl::view_entries: ViewEntries` (was `Vec<ViewEntry>`). - `builder.rs`: `start_band_with_view` validates via `ViewEntries::validate`; `with_view` composes via `ViewEntries::compose`. - `traits.rs`: lost `validate_view`/`is_identity_view`/ `visible_shape_from_view`/`compose_view` free functions and ~250 lines of their tests. `ViewEntry` is re-exported from `view_entries`. Tests pass: sedona-raster 134 (was 131; +3 from the new tests exercising `identity_for_shape`, `visible_shape`, and the macro), sedona-raster-functions 143, sedona-raster-gdal 53.
…ecomp Three polish items on the view-machinery review: 1. Identity-view encoding: the `view` list column is now identity iff the row is NULL. Empty (zero-length, non-null) lists are rejected by the reader as malformed. The schema doc on `RasterSchema::view_type` documents the contract. 2. `start_band_with_view` now takes a `StartBandWithViewArgs` struct instead of eight positional arguments. Test call sites migrated to the struct-literal form. 3. `RasterRefImpl::band()` decomposed into named helpers — read source shape, resolve data type, read view entries, compose byte strides — so the orchestrating function reads top-down without inlining Arrow-buffer arithmetic and checked-stride composition together.
423b516 to
23d87ac
Compare
The view_entries![start:stop, ...] macro existed but no tests used it, so the example had no readers and the syntax wasn't being exercised in practice. Migrate every test site whose view-list happens to be in the macro's expressible subset (sequential source_axis 0..n, step = 1, and steps = stop - start). Mixed-permutation, broadcast (step = 0), negative-step, and step != 1 sites stay on the struct-literal form, as the macro intentionally doesn't try to express those. Also: shift `axis += 1;` to before the push and start axis at -1 so the final assignment is always read by the next push. This kills the unused_assignments warning on single-entry macro uses without needing an inner #[allow].
…ntics After the ViewEntries extraction commit, every view-rule (validate / compose / is_identity / visible_shape / broadcast / negative-step / permutation) is exhaustively covered by pure-semantic tests in view_entries.rs. The build+read integration tests in array.rs and builder.rs had drifted into re-verifying those same rules through the Arrow round-trip — useful when the rules were being designed, wasteful now that the abstraction is settled. Pruned 12 tests that re-asserted view-rule semantics through the build+read pipeline. What remains is the set that uniquely tests integration-layer concerns: - Arrow encoding parity (NULL field, identity-via-explicit-view vs identity-via-NULL, IPC round-trip) - Reader-side corruption guards (source-stride overflow, view step × stride overflow, data column shorter than view, empty source_shape, empty non-null view list, unknown data_type discriminant) - Cow::Borrowed vs Cow::Owned semantics on contiguous_data() - materialize_strided byte-correctness across distinct paths (outer slice / strided inner fallback / negative step / broadcast / permutation / empty axis / float32 fast path) - builder + reader validate-wiring smoke (1 each) - with_view API (identity, chained, OutDb) Specifically removed: - builder.rs: test_view_slice_nd_buffer_and_contiguous_data, test_view_broadcast, test_view_permutation_transpose, test_view_empty_axis, test_view_validation_rejects_out_of_range_start, test_view_validation_rejects_duplicate_source_axis, test_contiguous_data_negative_step_full_reverse, test_nd_buffer_steps_one_view, test_nd_buffer_multidim_non_zero_starts - array.rs: band_returns_none_when_view_has_negative_steps, band_returns_none_when_view_source_axis_out_of_range, band_returns_none_when_view_has_duplicate_source_axis 134 → 122 sedona-raster tests; all other crates unchanged.
Summary
Adds the view-machinery layer on top of the N-D raster type. Bands can now be constructed and read with non-identity
viewentries — slices, broadcasts, axis permutations, and reverse-iteration — and the byte-access surface (data/contiguous_data/nd_buffer) handles the strided materialization transparently. The slice/manipulation functions in #750 depend on this layer.What's in this PR
Single commit, five files. Three layers:
Schema / trait surface (
rust/sedona-raster/src/traits.rs)validate_view()+ 22 unit tests (length checks, axis range, negative step, broadcast / step=0, i64 overflow guards, etc.).visible_shape_from_view()helper.pub(crate)— validation runs implicitly insidestart_band_with_view(writer) andRasterRef::band(reader), so external callers don't need to invoke them.Builder (
rust/sedona-raster/src/builder.rs)start_band_with_view()API.Reader (
rust/sedona-raster/src/array.rs)RasterRefImpl::bandcomposes view → byte strides + byte offset using checked arithmetic; malformed views and arithmetic overflows route throughsedona_internal_datafusion_err!so they surface with the standard "SedonaDB internal error" framing.nd_buffer()returns the source buffer + shape/strides/offset (zero-allocation strided access).contiguous_data()delegates tond_buffer()for the OutDb check + identity-view fast path, then walks the strided copy via a sharedmaterialize_stridedhelper.BandRef::data()onBandRefImpl: identity views borrow from the Arrow column directly (zero-allocation); non-identity views lazy-materialize a row-major copy into an internalOnceCellon first access and reuse it on subsequent calls.GDAL bridge (
rust/sedona-raster-gdal/src/{gdal_common,gdal_dataset_provider}.rs)raster_ref_to_gdal_memreturns(Dataset, Vec<Vec<u8>>)so theVecbacking GDAL's pointer for view-materialized (Cow::Owned) bands outlives the dataset.RasterDatasetcarries the matching_owned_band_bytesfield.Behavior change vs main
viewrow now decodes via the composition path (previously rejected as a read error).BandRef::data()on a non-identity view returns the materialized row-major visible bytes via an internalOnceCellcache, rather than the raw source-column bytes. Repeat calls reuse the cache.Test plan
cargo build -p sedona-schema -p sedona-raster -p sedona-raster-gdal -p sedona-raster-functions -p sedona-testingcargo test -p sedona-raster— 113 tests pass.cargo test -p sedona-raster-gdal— 53 tests pass.cargo clippy --all-targets -- -D warningscargo fmt --all --check