fix(typed-store): make iterator bounds consistent and add prefix scans#11865
fix(typed-store): make iterator bounds consistent and add prefix scans#11865chriscandless wants to merge 5 commits into
Conversation
cec50cf to
e4adf79
Compare
6f393f2 to
ae44d0a
Compare
|
I found important bug - I've accidentally changed values of cursors from "dynamic remainder" to "entire key". I now fixing it. |
28787d8 to
8df9d58
Compare
| match &self.db.storage { | ||
| Storage::Rocks(db) => { | ||
| let readopts = | ||
| rocks_util::apply_range_bounds(self.opts.readopts(), lower_bound, None); |
There was a problem hiding this comment.
I don't quite understand why not set the upper bound here too (it is set in iter_forward_raw)? That way rocksdb would know that the range is bounded from both sides.
There was a problem hiding this comment.
Both ways are equivalent from the point of view of logic:
- current variant uses
seek_for_prev +prev` to land on the first included element from the top - the same effect can be achieved by using readopts (
iterate_upper_bound) +seek_to_last
There are two reasons it is done the current way:
- semantical reason: readops is a regular way to achieve getting "invalid" element when reaching that position. There is no need in setting it in reverse func if we only descend from it.
- historical reason:
iterate_upper_boundis designed for forward iteration, and its correctness for reverse seek (seek_for_prev) wasn't guaranteed - the seek could still land on a key at/above the bound. The original reverse iterator avoided relying on it too.
There was a problem hiding this comment.
Oh, I see this ticket was even closed as not planned so maybe that "historical" bug even still there and actually not "historical" at all lol: facebook/rocksdb#9904)
There was a problem hiding this comment.
I see that forward and reverse iterators have slightly different initialization ways, and None for upper bound works here. Thought it would just be nicer and symmetric with upper bound set to the provided value, especially that we already have one.
There was a problem hiding this comment.
If we add it there just for symmetric logic, it may confuse someone, because the upper bound is already implemented using seek_for_prev. So additionally setting the upper bound via readopts may look strange and confusing. To make it less confusing we'd have to remove the seek_for_prevlogic - but its necessity was explained in the previous comment.
There was a problem hiding this comment.
Let's remove this check. In all the code paths is_max is being checked already, doing it here is useless.
There was a problem hiding this comment.
This one is tricky.
In Rust when making addition it is expected to specify in the name of the function what would happen in case of overflow: saturating_add (checks for max/min), wrapping_add (MAX goes to MIN, kind-a C++), checked_add / overflowing_add (return Option/Result/bool).
Such a plain name as big_endian_add_one does not specify anything, which can be treated as "caller checks for max". Practically this means "undefined behaviour in case of overflow", and that is kind-a not Rust-style. So we need to do at least anything about it.
In theory we can super easily check in the loop that max value was reached and return error. But that is waaay too intrusive for this code - nobody actually cares about it, everyone already checks for is_max.
So what I did: I put debug_assert!(!is_max(v)) at the start. And yeah, I know that is ineffective - it can be done more effectively just my modifying the addition loop. But I wanted to exclude that code in prod, so I decided to go with that ineffective double-iteration over value to be able to use lean debug_assert.
There was a problem hiding this comment.
debug_assert! is one way to do that. Another way would be to put unreachable! after the loop and instead of breaking the loop in the else branch -- just return;.
There was a problem hiding this comment.
You know what? That is actually more justified than what I did. My code has a chance to return wrong key in prod, which is not good. Panicking is to be on a safer side, which is good, especially considering it is super easy to do here. So switched to unreachable!
| iter.db_iter.seek_for_prev(&bound); | ||
| // The bound is exclusive: move to the previous key if the | ||
| // excluded key actually exists in the db. | ||
| if iter.db_iter.valid() && iter.db_iter.key() == Some(bound.as_slice()) { | ||
| iter.db_iter.prev(); | ||
| } |
There was a problem hiding this comment.
Shouldn't seek_for_prev already return an entry with the key strictly less that bound?
There was a problem hiding this comment.
Nah.
The name is really confusing, I also was googling it. seek_for_prev may land on a given target if that element is present. So it's like "inclusive upper bound". That's the reason why that conditional prev is there.
(see the doc - it has this example with Prev: https://github.com/facebook/rocksdb/wiki/SeekForPrev)
5fc489e to
4ac5720
Compare
4ac5720 to
30e5b39
Compare
safe_iter_with_bounds and reversed_safe_iter_with_bounds computed their raw rocksdb bounds independently, so the same logical bounds covered different key sets in the two directions ([lo, hi) forward vs [lo, hi] reversed). Route all bound computation through one range engine (iterator_bounds_with_range) and add safe_range_iter_reversed, which by construction yields exactly the keys of safe_range_iter in descending order; the Option-based wrappers keep their documented semantics as thin presets on top of it. SafeRevIter now takes an exclusive upper boundary and steps off the boundary key when the initial seek lands on it, making exclusive upper bounds work in reverse as well. Composite-key scans had to construct artificial maximum keys (e.g. ObjectKey::max_for_id) to bound the iteration. Add safe_iter_with_prefix and safe_iter_with_prefix_reversed, which derive the upper bound by incrementing the serialized prefix bytes instead, and migrate AuthorityStore::get_latest_marker to the reversed prefix scan as the first consumer. Part of #10929. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… of safe_range_iter_reversed The Option-based reverse iterator hard-coded inclusive bounds, which is non-standard (the forward APIs treat the upper bound as exclusive) and invisible at the call site. Migrate all callers to safe_range_iter_reversed, where the inclusivity is spelled out by the range syntax itself, and drop the wrapper: - (None, None) -> .. - (None, Some(hi)) -> ..=hi - (Some(lo), Some(hi)) -> lo..=hi The translation is semantics-preserving: the range forms construct exactly the same Bound::Included pairs the wrapper produced internally. This also drops the wrapper's vestigial Result, removing ?/expect noise at the call sites. Note for upstream ports: Sui still has reversed_safe_iter_with_bounds; ported code touching these call sites should apply the mapping above. Part of #10929. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Several scans over composite-key tables bounded the iteration with hand-built maximum keys (ObjectKey::max_for_id, usize::MAX tuples, SequenceNumber::MAX_VALID_EXCL + ObjectDigest::MAX) to express "all entries under a leading key". Replace the pure prefix scans with safe_iter_with_prefix / safe_iter_with_prefix_reversed, which derive the upper bound from the serialized prefix bytes instead: - get_events_by_events_digest: all events of one digest - have_deleted_owned_object_at_version_or_after: latest marker under (epoch, object id) - get_latest_live_version_for_object_id: latest live-marker ObjectRef - get_latest_object_ref_or_tombstone / get_latest_object_or_tombstone / try_get_object: latest object version - count_object_versions (msim-only): all versions of an object id Sites where a bound carries a real value (cursors, version cutoffs, range_iter_live_object_set spanning multiple ids) keep the explicit range form. Besides the version-tail change below, a prefix scan also tightens the lower bound (from unbounded / min_for_id to the prefix start). That is safe: every migrated reader already re-checks the leading key (epoch / object id) of the first row it reads, so a narrower lower bound cannot change the result. The replaced ..=max_for_id bounds capped the version component at MAX_VALID_EXCL while a prefix scan covers the full u64 space; this is equivalent because sentinel versions (CANCELLED_READ, CONGESTED_*) are scheduling-time artifacts and are never persisted as object or marker table keys. Part of #10929. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rop MAX bounds in cursor scans Paginated prefix scans bounded the iteration with an artificial maximum key for the upper end (`(prefix, MAX)`) while the lower end carried a real cursor. Add `safe_iter_with_prefix_from(&prefix, &cursor)` -- the rocksdb `prefix_iterator_from` idiom: the lower bound is `prefix ++ cursor` (the `cursor` is the remainder of the key after the prefix, not a full key) and the upper bound is derived by incrementing the serialized prefix bytes, so these scans no longer need a synthetic maximum key. Cursor-paginated scans, where the cursor is the inclusive lower bound and the caller keeps its `.skip(cursor.is_some())` to make it exclusive: - IndexStore::get_dynamic_fields_iterator (jsonrpc) - GrpcIndexesStore dynamic_field_iter / package_versions_iter AuthorityPerpetualTables::get_newer_object_keys is migrated too, but it has no cursor: its lower bound is object.1.next() (version + 1), an exclusive-by-construction start, and it uses no `.skip`. The prefix scan covers the whole prefix, including an entry sitting exactly at the type's maximum tail that the old exclusive `(prefix, MAX)` upper bound dropped. This differs from the old behavior only for such a max-tail key, which is never a real object version or marker; for dynamic-field ids it is a corrected edge. Equivalent for all real data otherwise. Reversed pagination and the hash-bounded owner_iter keep explicit ranges -- their bounds are not pure prefixes. Part of #10929. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…of ending iteration `SafeIter::next` mapped a failed key/value deserialization to `None` via `.ok()`, which is indistinguishable from "end of data": a single undeserializable row silently truncated the rest of the scan, despite the iterator item being `Result<(K, V), TypedStoreError>` precisely so such errors can be reported. Return `Some(Err(TypedStoreError::Serialization(..)))` for a row that fails to deserialize, matching how the in-memory backend (`memstore`) and `DBMap::get`/`multi_get` already handle deserialization failures. Callers that `?` now propagate the error instead of seeing a short read; callers that `.unwrap()` fail loudly instead of losing data. Adds a regression test that reopens a column family with a wider value type and asserts iteration yields `Some(Err(..))` rather than silently ending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
30e5b39 to
62fb793
Compare
Description of change
Make the typed-store iterator wrapper produce consistent key ranges in both directions and remove the need for hand-built maximum keys in composite-key scans.
safe_iter_with_bounds(lo, hi)andreversed_safe_iter_with_bounds(lo, hi)computed their raw rocksdb bounds independently: the forward iterator covered[lo, hi)while the reversed one covered[lo, hi], so reversing a scan did not yield the same key set. Separately, scanning "all entries under a leading key" of a composite key required constructing artificial maximum keys (ObjectKey::max_for_id,usize::MAXtuple tails, etc.).Those issues are solved with two new sets of methods:
safe_range_iter_reversed- reversed version of pre-existingsafe_range_iter. It is a better version ofreversed_safe_iter_with_bounds, because it explicitly defines type of bounds (included, excluded). Alsosafe_range_iterandsafe_range_iter_reversedare now using same underlying logic for bound calculation, so it is impossible that they ever become asymmetric.safe_iter_with_prefix/safe_iter_with_prefix_reversed- enable clear iteration by prefix, which is what magicalMAXvalues were mostly used for. This removes most current and future uses ofMAXvalues. Asafe_iter_with_prefix_fromvariant covers cursor-paginated prefix scans (start from a cursor, noMAXupper bound).The change is split into five self-contained commits (the latter four are independently revertable):
fix(typed-store): unify iterator bound handling and add prefix scans— route all bound computation through one range engine (iterator_bounds_with_range) and addsafe_range_iter_reversed, which by construction yields exactly the keys ofsafe_range_iterin descending order.SafeRevIternow takes an exclusive upper boundary and steps off the boundary key when the initial seek lands on it, so exclusive upper bounds work in reverse as well. Addsafe_iter_with_prefix/safe_iter_with_prefix_reversed, which derive the upper bound by incrementing the serialized prefix bytes, and migrateAuthorityStore::get_latest_markeras the first consumer. The Option-based wrappers keep their existing documented semantics as thin presets over the range engine.refactor(typed-store): remove reversed_safe_iter_with_bounds in favor of safe_range_iter_reversed— the Option-based reverse iterator hard-coded inclusive bounds, invisible at the call site. Migrate all callers to the range form, where inclusivity is spelled out by the range syntax, and drop the wrapper together with its vestigialResult. The translation is semantics-preserving:(None, None)..(None, Some(hi))..=hi(Some(lo), Some(hi))lo..=hiNote for upstream ports: Sui still has
reversed_safe_iter_with_bounds; ported code touching these call sites should apply the mapping above.refactor(iota-core): use prefix iterators instead of artificial MAX keys— replace the pure prefix scans (get_events_by_events_digest,have_deleted_owned_object_at_version_or_after,get_latest_live_version_for_object_id,get_latest_object_ref_or_tombstone,get_latest_object_or_tombstone,try_get_object, and the#[cfg(msim)]count_object_versions) with prefix iterators. Sites where a bound carries a real value (cursors, version cutoffs,range_iter_live_object_setspanning multiple ids) keep the explicit range form. Two behavior deltas, both safe: (a) a prefix scan also tightens the lower bound (from unbounded /min_for_idto the prefix start) — safe because every migrated reader re-checks the leading key (epoch / object id) of the first row, so a narrower lower bound cannot change the result; (b) the replaced..=max_for_idupper bounds capped the version component atMAX_VALID_EXCLwhile a prefix scan covers the fullu64space — equivalent because sentinel versions (CANCELLED_READ,CONGESTED_*,MAX_VALID_EXCL) are scheduling-time artifacts, never persisted as object or marker table keys (a write-path invariant; verified by inspecting the write paths, not unit-tested, since a read-side test could only re-demonstrate the inclusion rather than prove the invariant).refactor(typed-store,iota-core): add safe_iter_with_prefix_from and drop MAX bounds in cursor scans— addsafe_iter_with_prefix_from(&prefix, &cursor)(the rocksdbprefix_iterator_fromidiom: lower boundprefix ++ cursor, upper bound derived by incrementing the serialized prefix bytes).cursoris the remainder of the key after the prefix, not a full key, so the prefix is never duplicated and the lower bound is a prefix of the key by construction. Migrates the cursor-paginated prefix scans (get_dynamic_fields_iterator,dynamic_field_iter,package_versions_iter) off their(prefix, MAX)upper bounds; those callers keep their.skip(cursor.is_some())(the cursor stays the inclusive lower bound).get_newer_object_keysis migrated too but has no cursor — its lower bound isobject.1.next()(version + 1, an exclusive-by-construction start) and it uses no.skip. The prefix scan covers the whole prefix, including a key sitting exactly at the type's maximum tail that the old exclusive(prefix, MAX)bound dropped — never a real version/marker, a corrected edge for dynamic-field ids, equivalent for all real data. Reversed-pagination scans and the hash-boundedowner_iterkeep explicit ranges — their bounds are not pure prefixes.fix(typed-store): surface deserialization errors in SafeIter instead of ending iteration—SafeIter::nextmapped a failed key/value deserialization toNonevia.ok(), indistinguishable from end-of-data, so one undeserializable row silently truncated the rest of the scan (the item type isResult<(K, V), _>precisely so the error can be reported; upstream returns it, the IOTA port had regressed to swallowing it with no test on the path). Now returnsSome(Err(TypedStoreError::Serialization(..))), matching how the in-memory backend (memstore) andDBMap::get/multi_getalready handle deserialization failures, with a regression test that reopens a column family with a wider value type and asserts iteration yieldsSome(Err(..)). Found in review of this PR.The forward
safe_iter_with_boundsis intentionally kept: its[lo, hi)contract matches the conventional half-open semantics, it is part of theMaptrait shared with upstream, and it is now implemented as a thin preset oversafe_range_iterwith the contract documented on the trait.Intentional runtime-behavior changes (no consumer relies on the old behavior):
(prefix, MAX)bound excluded (see commits 3-4); only matters for a key exactly at that tail, which the migrated tables never hold.SafeIter::nextnow surfaces a deserialization failure asSome(Err(..))instead of mapping it toNone(silently truncating the scan); every iterator consumer is affected —?/collect propagate the error,.next()-style readers return it. No consumer was relying on silent skip-on-bad-row.Edge-case note (issue point 2): the inclusive upper-bound arm of
iterator_bounds_with_rangereturnedNone(unbounded) when the bound serialized to all-0xFF. That is fine for fixed-width keys, but a longer key extending the maximum sorts after it and was wrongly included in a..=maxrange. This PR returnsser(max) ++ [0]instead (the first key past the maximum), mirroring the excluded-lower arm, with a regression test. On top of that it adds reverse-direction and prefix regression coverage (forward/reverse symmetry across all bound kinds, prefix scans at the0xFFserialization limit, boundary keys present in the DB).Remaining
MAXusages (deliberately not removed)A few iterator bounds still build
MAX/MINkeys by hand. They are left as-is because they are not pure forward prefix scans, so the new prefix APIs do not fit them cleanly; each would be a separate, higher-risk change:get_transactionsreversed overtransaction_order, keyed by a bareTxSequenceNumber):..=cursor.unwrap_or(MAX)— there is no composite prefix here,MAXjust means "no cursor → start from the end of the table". Removing it means branching oncursor(..=cvs..), pure cosmetics over a non-fragile sentinel.get_transactions_from_index,get_event_from_index,events_by_transaction,all_events): these scan a leading key downwards from a cursor. They would fit a reversedsafe_iter_with_prefix_from, but that is a new method with a different shape (in reverse the cursor is the upper bound and the prefix is the lower), and these sites currently stop via atake_while/predicate rather than a DB-level lower bound — switching the cutoff mechanism is a behavioral change that warrants its own PR with tests.owner_iter(gRPC): theOwnerIndexKeyprefix length is filter-dependent (owner, or(owner, id_hash), or(owner, id_hash, params_hash)), and theMAXvalues span the remaining unfixed fields. Expressible via prefix scans, but only by reworkingowner_boundsper filter branch — not a mechanical swap.range_iter_live_object_set:min_for_id(id_A)..=max_for_id(id_B)spans a range of different object ids (table sharding for background scans). The heads differ, so there is no common prefix; the alternative..(next(id_B), 0)just moves the synthetic key elsewhere.schedule_delete_rangefor events,compact_rangefor objects) and config sentinels (num_epochs_to_retain == u64::MAXmeaning "retain forever"): these are not iterators at all — a "delete/compact by prefix" helper would be a separate API surface, and the config value is not a key.Net: this PR removes the
MAXkeys that were standing in for a missing forward prefix API. The reversed-prefix cases are the natural next step (a reversedprefix_from) and are left for a follow-up.Links to any relevant issues
fixes #10929
How the change has been tested
cargo check/cargo clippy --all-targets --all-features -- -D warningsare green fortyped-store,iota-core,iota-transactional-test-runner, andstarfish-core; thetyped-storeunit suite passes (48/48, including the new symmetry/prefix regression tests). Each commit builds and passes the typed-store suite on its own. Note: running the typed-store suite with parallelcargo testhits a pre-existing PrometheusAlreadyRegregistry race (also fails ondevelop); the suite is green undernextest(process per test) and with--test-threads=1.