Skip to content

fix(typed-store): make iterator bounds consistent and add prefix scans#11865

Open
chriscandless wants to merge 5 commits into
developfrom
fix/typed-store-iterator-bounds
Open

fix(typed-store): make iterator bounds consistent and add prefix scans#11865
chriscandless wants to merge 5 commits into
developfrom
fix/typed-store-iterator-bounds

Conversation

@chriscandless

@chriscandless chriscandless commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Description of change

Make the typed-store iterator wrapper produce consistent key ranges in both directions and remove the need for hand-built maximum keys in composite-key scans.

safe_iter_with_bounds(lo, hi) and reversed_safe_iter_with_bounds(lo, hi) computed their raw rocksdb bounds independently: the forward iterator covered [lo, hi) while the reversed one covered [lo, hi], so reversing a scan did not yield the same key set. Separately, scanning "all entries under a leading key" of a composite key required constructing artificial maximum keys (ObjectKey::max_for_id, usize::MAX tuple tails, etc.).

Those issues are solved with two new sets of methods:

  • Added safe_range_iter_reversed - reversed version of pre-existing safe_range_iter. It is a better version of reversed_safe_iter_with_bounds, because it explicitly defines type of bounds (included, excluded). Also safe_range_iter and safe_range_iter_reversed are now using same underlying logic for bound calculation, so it is impossible that they ever become asymmetric.
  • Added safe_iter_with_prefix / safe_iter_with_prefix_reversed - enable clear iteration by prefix, which is what magical MAX values were mostly used for. This removes most current and future uses of MAX values. A safe_iter_with_prefix_from variant covers cursor-paginated prefix scans (start from a cursor, no MAX upper bound).

The change is split into five self-contained commits (the latter four are independently revertable):

  1. fix(typed-store): unify iterator bound handling and add prefix scans — route all bound computation through one range engine (iterator_bounds_with_range) and add safe_range_iter_reversed, which by construction yields exactly the keys of safe_range_iter in descending order. SafeRevIter now takes an exclusive upper boundary and steps off the boundary key when the initial seek lands on it, so exclusive upper bounds work in reverse as well. Add safe_iter_with_prefix / safe_iter_with_prefix_reversed, which derive the upper bound by incrementing the serialized prefix bytes, and migrate AuthorityStore::get_latest_marker as the first consumer. The Option-based wrappers keep their existing documented semantics as thin presets over the range engine.

  2. refactor(typed-store): remove reversed_safe_iter_with_bounds in favor of safe_range_iter_reversed — the Option-based reverse iterator hard-coded inclusive bounds, invisible at the call site. Migrate all callers to the range form, where inclusivity is spelled out by the range syntax, and drop the wrapper together with its vestigial Result. The translation is semantics-preserving:

    old new
    (None, None) ..
    (None, Some(hi)) ..=hi
    (Some(lo), Some(hi)) lo..=hi

    Note for upstream ports: Sui still has reversed_safe_iter_with_bounds; ported code touching these call sites should apply the mapping above.

  3. refactor(iota-core): use prefix iterators instead of artificial MAX keys — replace the pure prefix scans (get_events_by_events_digest, have_deleted_owned_object_at_version_or_after, get_latest_live_version_for_object_id, get_latest_object_ref_or_tombstone, get_latest_object_or_tombstone, try_get_object, and the #[cfg(msim)] count_object_versions) with prefix iterators. Sites where a bound carries a real value (cursors, version cutoffs, range_iter_live_object_set spanning multiple ids) keep the explicit range form. Two behavior deltas, both safe: (a) a prefix scan also tightens the lower bound (from unbounded / min_for_id to the prefix start) — safe because every migrated reader re-checks the leading key (epoch / object id) of the first row, so a narrower lower bound cannot change the result; (b) the replaced ..=max_for_id upper bounds capped the version component at MAX_VALID_EXCL while a prefix scan covers the full u64 space — equivalent because sentinel versions (CANCELLED_READ, CONGESTED_*, MAX_VALID_EXCL) are scheduling-time artifacts, never persisted as object or marker table keys (a write-path invariant; verified by inspecting the write paths, not unit-tested, since a read-side test could only re-demonstrate the inclusion rather than prove the invariant).

  4. refactor(typed-store,iota-core): add safe_iter_with_prefix_from and drop MAX bounds in cursor scans — add safe_iter_with_prefix_from(&prefix, &cursor) (the rocksdb prefix_iterator_from idiom: lower bound prefix ++ cursor, upper bound derived by incrementing the serialized prefix bytes). cursor is the remainder of the key after the prefix, not a full key, so the prefix is never duplicated and the lower bound is a prefix of the key by construction. Migrates the cursor-paginated prefix scans (get_dynamic_fields_iterator, dynamic_field_iter, package_versions_iter) off their (prefix, MAX) upper bounds; those callers keep their .skip(cursor.is_some()) (the cursor stays the inclusive lower bound). get_newer_object_keys is migrated too but has no cursor — its lower bound is object.1.next() (version + 1, an exclusive-by-construction start) and it uses no .skip. The prefix scan covers the whole prefix, including a key sitting exactly at the type's maximum tail that the old exclusive (prefix, MAX) bound dropped — never a real version/marker, a corrected edge for dynamic-field ids, equivalent for all real data. Reversed-pagination scans and the hash-bounded owner_iter keep explicit ranges — their bounds are not pure prefixes.

  5. fix(typed-store): surface deserialization errors in SafeIter instead of ending iterationSafeIter::next mapped a failed key/value deserialization to None via .ok(), indistinguishable from end-of-data, so one undeserializable row silently truncated the rest of the scan (the item type is Result<(K, V), _> precisely so the error can be reported; upstream returns it, the IOTA port had regressed to swallowing it with no test on the path). Now returns Some(Err(TypedStoreError::Serialization(..))), matching how the in-memory backend (memstore) and DBMap::get/multi_get already handle deserialization failures, with a regression test that reopens a column family with a wider value type and asserts iteration yields Some(Err(..)). Found in review of this PR.

The forward safe_iter_with_bounds is intentionally kept: its [lo, hi) contract matches the conventional half-open semantics, it is part of the Map trait shared with upstream, and it is now implemented as a thin preset over safe_range_iter with the contract documented on the trait.

Intentional runtime-behavior changes (no consumer relies on the old behavior):

  • Prefix scans include an entry at the type's maximum tail that the old exclusive (prefix, MAX) bound excluded (see commits 3-4); only matters for a key exactly at that tail, which the migrated tables never hold.
  • SafeIter::next now surfaces a deserialization failure as Some(Err(..)) instead of mapping it to None (silently truncating the scan); every iterator consumer is affected — ?/collect propagate the error, .next()-style readers return it. No consumer was relying on silent skip-on-bad-row.

Edge-case note (issue point 2): the inclusive upper-bound arm of iterator_bounds_with_range returned None (unbounded) when the bound serialized to all-0xFF. That is fine for fixed-width keys, but a longer key extending the maximum sorts after it and was wrongly included in a ..=max range. This PR returns ser(max) ++ [0] instead (the first key past the maximum), mirroring the excluded-lower arm, with a regression test. On top of that it adds reverse-direction and prefix regression coverage (forward/reverse symmetry across all bound kinds, prefix scans at the 0xFF serialization limit, boundary keys present in the DB).

Remaining MAX usages (deliberately not removed)

A few iterator bounds still build MAX/MIN keys by hand. They are left as-is because they are not pure forward prefix scans, so the new prefix APIs do not fit them cleanly; each would be a separate, higher-risk change:

  • Reversed pagination without a prefix (get_transactions reversed over transaction_order, keyed by a bare TxSequenceNumber): ..=cursor.unwrap_or(MAX) — there is no composite prefix here, MAX just means "no cursor → start from the end of the table". Removing it means branching on cursor (..=c vs ..), pure cosmetics over a non-fragile sentinel.
  • Reversed prefix pagination (get_transactions_from_index, get_event_from_index, events_by_transaction, all_events): these scan a leading key downwards from a cursor. They would fit a reversed safe_iter_with_prefix_from, but that is a new method with a different shape (in reverse the cursor is the upper bound and the prefix is the lower), and these sites currently stop via a take_while/predicate rather than a DB-level lower bound — switching the cutoff mechanism is a behavioral change that warrants its own PR with tests.
  • owner_iter (gRPC): the OwnerIndexKey prefix length is filter-dependent (owner, or (owner, id_hash), or (owner, id_hash, params_hash)), and the MAX values span the remaining unfixed fields. Expressible via prefix scans, but only by reworking owner_bounds per filter branch — not a mechanical swap.
  • range_iter_live_object_set: min_for_id(id_A)..=max_for_id(id_B) spans a range of different object ids (table sharding for background scans). The heads differ, so there is no common prefix; the alternative ..(next(id_B), 0) just moves the synthetic key elsewhere.
  • Range delete / compaction (schedule_delete_range for events, compact_range for objects) and config sentinels (num_epochs_to_retain == u64::MAX meaning "retain forever"): these are not iterators at all — a "delete/compact by prefix" helper would be a separate API surface, and the config value is not a key.

Net: this PR removes the MAX keys that were standing in for a missing forward prefix API. The reversed-prefix cases are the natural next step (a reversed prefix_from) and are left for a follow-up.

Links to any relevant issues

fixes #10929

How the change has been tested

  • Basic tests (linting, compilation, formatting, unit/integration tests)
  • Patch-specific tests (correctness, functionality coverage)
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that new and existing unit tests pass locally with my changes

cargo check / cargo clippy --all-targets --all-features -- -D warnings are green for typed-store, iota-core, iota-transactional-test-runner, and starfish-core; the typed-store unit suite passes (48/48, including the new symmetry/prefix regression tests). Each commit builds and passes the typed-store suite on its own. Note: running the typed-store suite with parallel cargo test hits a pre-existing Prometheus AlreadyReg registry race (also fails on develop); the suite is green under nextest (process per test) and with --test-threads=1.

@iota-ci iota-ci added core-protocol node Issues related to the Core Node team labels Jun 11, 2026
@chriscandless chriscandless force-pushed the fix/typed-store-iterator-bounds branch 2 times, most recently from cec50cf to e4adf79 Compare June 12, 2026 17:40
@chriscandless chriscandless marked this pull request as ready for review June 12, 2026 18:05
@chriscandless chriscandless requested review from a team as code owners June 12, 2026 18:05
Comment thread crates/typed-store/src/database.rs
Comment thread crates/typed-store/src/rocks/safe_iter.rs Outdated
Comment thread crates/typed-store/src/rocks/tests.rs
@chriscandless chriscandless force-pushed the fix/typed-store-iterator-bounds branch 2 times, most recently from 6f393f2 to ae44d0a Compare June 15, 2026 10:00
@chriscandless

Copy link
Copy Markdown
Contributor Author

I found important bug - I've accidentally changed values of cursors from "dynamic remainder" to "entire key". I now fixing it.

@chriscandless chriscandless force-pushed the fix/typed-store-iterator-bounds branch 4 times, most recently from 28787d8 to 8df9d58 Compare June 16, 2026 14:38
Comment thread crates/typed-store/src/database.rs
match &self.db.storage {
Storage::Rocks(db) => {
let readopts =
rocks_util::apply_range_bounds(self.opts.readopts(), lower_bound, None);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand why not set the upper bound here too (it is set in iter_forward_raw)? That way rocksdb would know that the range is bounded from both sides.

@chriscandless chriscandless Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both ways are equivalent from the point of view of logic:

  • current variant uses seek_for_prev + prev` to land on the first included element from the top
  • the same effect can be achieved by using readopts (iterate_upper_bound) + seek_to_last

There are two reasons it is done the current way:

  • semantical reason: readops is a regular way to achieve getting "invalid" element when reaching that position. There is no need in setting it in reverse func if we only descend from it.
  • historical reason: iterate_upper_bound is designed for forward iteration, and its correctness for reverse seek (seek_for_prev) wasn't guaranteed - the seek could still land on a key at/above the bound. The original reverse iterator avoided relying on it too.

@chriscandless chriscandless Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see this ticket was even closed as not planned so maybe that "historical" bug even still there and actually not "historical" at all lol: facebook/rocksdb#9904)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that forward and reverse iterators have slightly different initialization ways, and None for upper bound works here. Thought it would just be nicer and symmetric with upper bound set to the provided value, especially that we already have one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add it there just for symmetric logic, it may confuse someone, because the upper bound is already implemented using seek_for_prev. So additionally setting the upper bound via readopts may look strange and confusing. To make it less confusing we'd have to remove the seek_for_prevlogic - but its necessity was explained in the previous comment.

Comment thread crates/typed-store/src/util.rs Outdated
Comment thread crates/typed-store/src/util.rs Outdated
Comment on lines 101 to 103

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Let's remove this check. In all the code paths is_max is being checked already, doing it here is useless.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is tricky.

In Rust when making addition it is expected to specify in the name of the function what would happen in case of overflow: saturating_add (checks for max/min), wrapping_add (MAX goes to MIN, kind-a C++), checked_add / overflowing_add (return Option/Result/bool).
Such a plain name as big_endian_add_one does not specify anything, which can be treated as "caller checks for max". Practically this means "undefined behaviour in case of overflow", and that is kind-a not Rust-style. So we need to do at least anything about it.

In theory we can super easily check in the loop that max value was reached and return error. But that is waaay too intrusive for this code - nobody actually cares about it, everyone already checks for is_max.

So what I did: I put debug_assert!(!is_max(v)) at the start. And yeah, I know that is ineffective - it can be done more effectively just my modifying the addition loop. But I wanted to exclude that code in prod, so I decided to go with that ineffective double-iteration over value to be able to use lean debug_assert.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug_assert! is one way to do that. Another way would be to put unreachable! after the loop and instead of breaking the loop in the else branch -- just return;.

@chriscandless chriscandless Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know what? That is actually more justified than what I did. My code has a chance to return wrong key in prod, which is not good. Panicking is to be on a safer side, which is good, especially considering it is super easy to do here. So switched to unreachable!

Comment on lines +136 to +141
iter.db_iter.seek_for_prev(&bound);
// The bound is exclusive: move to the previous key if the
// excluded key actually exists in the db.
if iter.db_iter.valid() && iter.db_iter.key() == Some(bound.as_slice()) {
iter.db_iter.prev();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't seek_for_prev already return an entry with the key strictly less that bound?

@chriscandless chriscandless Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah.
The name is really confusing, I also was googling it. seek_for_prev may land on a given target if that element is present. So it's like "inclusive upper bound". That's the reason why that conditional prev is there.

(see the doc - it has this example with Prev: https://github.com/facebook/rocksdb/wiki/SeekForPrev)

@chriscandless chriscandless force-pushed the fix/typed-store-iterator-bounds branch 3 times, most recently from 5fc489e to 4ac5720 Compare June 17, 2026 17:12
@chriscandless chriscandless force-pushed the fix/typed-store-iterator-bounds branch from 4ac5720 to 30e5b39 Compare June 22, 2026 12:24
chriscandless and others added 5 commits June 22, 2026 14:53
safe_iter_with_bounds and reversed_safe_iter_with_bounds computed their
raw rocksdb bounds independently, so the same logical bounds covered
different key sets in the two directions ([lo, hi) forward vs [lo, hi]
reversed). Route all bound computation through one range engine
(iterator_bounds_with_range) and add safe_range_iter_reversed, which by
construction yields exactly the keys of safe_range_iter in descending
order; the Option-based wrappers keep their documented semantics as thin
presets on top of it. SafeRevIter now takes an exclusive upper boundary
and steps off the boundary key when the initial seek lands on it, making
exclusive upper bounds work in reverse as well.

Composite-key scans had to construct artificial maximum keys (e.g.
ObjectKey::max_for_id) to bound the iteration. Add safe_iter_with_prefix
and safe_iter_with_prefix_reversed, which derive the upper bound by
incrementing the serialized prefix bytes instead, and migrate
AuthorityStore::get_latest_marker to the reversed prefix scan as the
first consumer.

Part of #10929.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… of safe_range_iter_reversed

The Option-based reverse iterator hard-coded inclusive bounds, which is
non-standard (the forward APIs treat the upper bound as exclusive) and
invisible at the call site. Migrate all callers to
safe_range_iter_reversed, where the inclusivity is spelled out by the
range syntax itself, and drop the wrapper:

- (None, None)         -> ..
- (None, Some(hi))     -> ..=hi
- (Some(lo), Some(hi)) -> lo..=hi

The translation is semantics-preserving: the range forms construct
exactly the same Bound::Included pairs the wrapper produced internally.
This also drops the wrapper's vestigial Result, removing ?/expect noise
at the call sites.

Note for upstream ports: Sui still has reversed_safe_iter_with_bounds;
ported code touching these call sites should apply the mapping above.

Part of #10929.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Several scans over composite-key tables bounded the iteration with
hand-built maximum keys (ObjectKey::max_for_id, usize::MAX tuples,
SequenceNumber::MAX_VALID_EXCL + ObjectDigest::MAX) to express "all
entries under a leading key". Replace the pure prefix scans with
safe_iter_with_prefix / safe_iter_with_prefix_reversed, which derive the
upper bound from the serialized prefix bytes instead:

- get_events_by_events_digest: all events of one digest
- have_deleted_owned_object_at_version_or_after: latest marker under
  (epoch, object id)
- get_latest_live_version_for_object_id: latest live-marker ObjectRef
- get_latest_object_ref_or_tombstone / get_latest_object_or_tombstone /
  try_get_object: latest object version
- count_object_versions (msim-only): all versions of an object id

Sites where a bound carries a real value (cursors, version cutoffs,
range_iter_live_object_set spanning multiple ids) keep the explicit
range form.

Besides the version-tail change below, a prefix scan also tightens the
lower bound (from unbounded / min_for_id to the prefix start). That is
safe: every migrated reader already re-checks the leading key (epoch /
object id) of the first row it reads, so a narrower lower bound cannot
change the result.

The replaced ..=max_for_id bounds capped the version component at
MAX_VALID_EXCL while a prefix scan covers the full u64 space; this is
equivalent because sentinel versions (CANCELLED_READ, CONGESTED_*) are
scheduling-time artifacts and are never persisted as object or marker
table keys.

Part of #10929.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rop MAX bounds in cursor scans

Paginated prefix scans bounded the iteration with an artificial maximum key
for the upper end (`(prefix, MAX)`) while the lower end carried a real
cursor. Add `safe_iter_with_prefix_from(&prefix, &cursor)` -- the rocksdb
`prefix_iterator_from` idiom: the lower bound is `prefix ++ cursor` (the
`cursor` is the remainder of the key after the prefix, not a full key) and
the upper bound is derived by incrementing the serialized prefix bytes, so
these scans no longer need a synthetic maximum key.

Cursor-paginated scans, where the cursor is the inclusive lower bound and
the caller keeps its `.skip(cursor.is_some())` to make it exclusive:

- IndexStore::get_dynamic_fields_iterator (jsonrpc)
- GrpcIndexesStore dynamic_field_iter / package_versions_iter

AuthorityPerpetualTables::get_newer_object_keys is migrated too, but it has
no cursor: its lower bound is object.1.next() (version + 1), an
exclusive-by-construction start, and it uses no `.skip`.

The prefix scan covers the whole prefix, including an entry sitting exactly
at the type's maximum tail that the old exclusive `(prefix, MAX)` upper
bound dropped. This differs from the old behavior only for such a max-tail
key, which is never a real object version or marker; for dynamic-field ids
it is a corrected edge. Equivalent for all real data otherwise. Reversed
pagination and the hash-bounded owner_iter keep explicit ranges -- their
bounds are not pure prefixes.

Part of #10929.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…of ending iteration

`SafeIter::next` mapped a failed key/value deserialization to `None` via
`.ok()`, which is indistinguishable from "end of data": a single
undeserializable row silently truncated the rest of the scan, despite the
iterator item being `Result<(K, V), TypedStoreError>` precisely so such
errors can be reported.

Return `Some(Err(TypedStoreError::Serialization(..)))` for a row that fails
to deserialize, matching how the in-memory backend (`memstore`) and
`DBMap::get`/`multi_get` already handle deserialization failures. Callers
that `?` now propagate the error instead of seeing a short read; callers
that `.unwrap()` fail loudly instead of losing data.

Adds a regression test that reopens a column family with a wider value type
and asserts iteration yields `Some(Err(..))` rather than silently ending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chriscandless chriscandless force-pushed the fix/typed-store-iterator-bounds branch from 30e5b39 to 62fb793 Compare June 22, 2026 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core-protocol node Issues related to the Core Node team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

typed-store: issues with typed-store iterator

7 participants