[Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart by bfish713 · Pull Request #4537 · EspressoSystems/espresso-network

bfish713 · 2026-06-18T20:07:28Z

Closes #<ISSUE_NUMBER>

This PR:

This PR does not:

Key places to review:

gemini-code-assist · 2026-06-18T20:07:32Z

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

claude · 2026-06-19T13:45:51Z

Claude finished @bfish713's task in 4m 28s —— View job

Code Review: [Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart

Read repository guidelines (CLAUDE.md)
Read existing review comments
Gather full PR diff context
Review all changed files in detail
Post comprehensive review feedback

Summary

This PR adds a critical safety mechanism: persisting the locked QC (high QC) before releasing a phase-2 vote, and restoring it on restart. This prevents a restarted node from regressing to the (older) decided-anchor QC and potentially voting for a conflicting proposal. The design is well thought out and the implementation is clean.

What looks good

Core safety invariant is sound: The persist-before-vote-release pattern (PersistHighQc → StorageOutput::HighQc → release_vote2) correctly ensures the lock is durable before the vote goes out.
Lock-as-proof-of-reconstruction (parent_reconstructed) is an elegant optimization that avoids re-fetching VID shares after restart.
seed_locked_cert ordering (after seed_parent) is correct — the persisted lock can be newer than the anchor QC.
Monotonic durability watermark (bump_stored_high_qc) correctly uses >= comparisons throughout.
Good test coverage: New tests cover the restart seeding paths (seeded parent, lock-as-reconstruction-proof, blocked-without-parent) and the existing test_vote2_gated_on_vid_storage is updated to verify the new HighQc gate.
DB migrations are clean — single-row tables with bool PRIMARY KEY DEFAULT true is a well-known pattern.

Inline comments posted

I've posted 6 inline comments covering:

TOCTOU race in monotonicity guard (traits.rs): The load-then-write pattern has a race window, though mitigated by backend atomicity. Clarification on intent would help.
Early return in HighQc handler (consensus.rs): Skips release_vote1/release_proposal — likely intentional but worth confirming.
locked_view() vs emitted PersistHighQc (consensus.rs): Question about whether required can differ from the just-emitted persist.
Infinite retry without escalation (storage.rs): The retry loop has no upper bound, consistent with other storage methods but now directly gates liveness.
Lock-as-proof-of-reconstruction design (consensus.rs): Positive feedback — the reasoning is sound.
seed_locked_cert and blocks_reconstructed (consensus.rs): Question about whether the lock view could be re-voted via maybe_vote_2_and_update_lock.

No issues found in

SQL persistence (sql.rs): Correctly uses WRITE_BACKOFF retry with upsert semantics.
FS persistence (fs.rs): Correctly acquires write lock and uses replace for atomic overwrite.
Coordinator integration (coordinator.rs): PersistHighQc output is correctly dispatched to storage.append_high_qc2.
Context startup (context.rs): load_high_qc2 is called before coordinator construction, and maybe_locked_qc cleanly passes it through the builder.
Test harness (utils.rs, coordinator_builder.rs): Correctly mirrors the production code's seeding behavior.

claude · 2026-06-19T13:48:52Z

+    async fn append_high_qc2(&self, high_qc: Certificate1<SeqTypes>) -> anyhow::Result<()> {
+        // The lock advances monotonically, but persistence writes are spawned
+        // concurrently and retried, so a stale write can land after a newer
+        // one. Skip writes that would regress the stored view.
+        if let Some(existing) = (**self).load_high_qc2().await?
+            && existing.view_number() >= high_qc.view_number()
+        {
+            return Ok(());
+        }
+        (**self).append_high_qc2(high_qc).await


Potential concern: TOCTOU race in the monotonicity guard

This load-then-conditional-write has a race window: between load_high_qc2() returning and append_high_qc2() executing, another concurrent storage task could have written a newer QC, which this write would then overwrite with an older one.

In practice this is mitigated because:

The SQL backend uses upsert (atomic at DB level)

The FS backend holds a write lock across replace

The consensus layer's bump_stored_high_qc is itself monotonic in-memory

But it's worth noting the comment says "Skip writes that would regress the stored view" while the implementation doesn't fully guarantee that under concurrency. If the underlying storage backends already guarantee monotonicity (via upsert semantics or write locking), this check is redundant. If they don't, this check is insufficient.

Is the intent for this to be a best-effort optimization (reducing unnecessary writes), or a safety invariant? If the former, a comment clarifying that would help. If the latter, consider using an atomic compare-and-swap or holding a lock across both the read and write.

claude · 2026-06-19T13:49:03Z

+            StorageOutput::HighQc(view) => {
+                self.bump_stored_high_qc(view);
+                // A higher durable lock can unblock phase-2 votes across many
+                // views, not just `view`, so re-check every pending vote2.
+                let pending: Vec<ViewNumber> = self.pending_vote2.keys().copied().collect();
+                for view in pending {
+                    self.release_vote2(view, outbox);
+                }
+                return;
+            },


Nit: The return here skips release_vote1 and release_proposal

The HighQc branch does an early return, which means it never calls release_vote1(view, outbox) or release_proposal(view, outbox) for the view associated with this storage output.

This is likely intentional — HighQc doesn't provide any information relevant to vote1 or proposal release — but it's worth confirming. If a HighQc event happens to share a view number with a pending vote1 that is also waiting on other conditions, the early return won't delay it since release_vote1/release_proposal are called on each of their own storage events. But it does mean a HighQc event at view V won't opportunistically release a vote1/proposal at the same view V.

Is this the intended behavior? If so, a brief comment like "// HighQc only gates vote2; skip vote1/proposal release" before the return would clarify the intent.

claude · 2026-06-19T13:49:21Z

        self.voted_2_views.insert(view);
+        // The lock is set above (this point is unreachable with no lock), and
+        // is at or above `view`. The vote may only go out once that lock is
+        // durable.
+        let required = self
+            .locked_view()
+            .expect("locked_cert is set before voting in phase 2");


Question: Can locked_view() ever be less than view here?

The comment says "The lock is set above ... and is at or above view", but actually cert1.view_number() equals view (the cert certifies the same view as the vote). When the lock is not updated (because locked_cert.view_number() >= cert1.view_number()), the locked view is the existing lock which could be at any view ≥ some prior view.

The expect is safe since locked_cert is always Some by this point (either just set above, or already set from a prior view). But could required here be a different view than the PersistHighQc that was just emitted? If the lock wasn't updated (the if on line 2013 was false), no PersistHighQc is emitted, but required would be the old locked view which is already durable from a prior persist. So it seems correct — just worth confirming the reasoning.

claude · 2026-06-19T13:49:31Z

+    pub fn append_high_qc2(&mut self, high_qc: Certificate1<T>) {
+        let view = high_qc.view_number();
+        let storage = self.storage.clone();
+        let handle = self.tasks.spawn(async move {
+            loop {
+                match storage.append_high_qc2(high_qc.clone()).await {
+                    Ok(()) => return Some(StorageOutput::HighQc(view)),
+                    Err(err) => {
+                        warn!(%err, %view, "failed to append high qc, retrying");
+                        sleep(RETRY_DELAY).await;
+                    },
+                }
+            }
+        });
+        self.handles.entry(view).or_default().push(handle);


Infinite retry loop without backoff escalation or abort

This loop retries forever with a fixed 300ms delay. Unlike the SQL backend which uses WRITE_BACKOFF with WRITE_RETRY_MAX, this has no upper bound on retries. If storage is permanently broken (e.g. disk full, permissions changed), this task will retry forever and the vote2 will never be released.

Consider either:

Adding a max retry count that eventually returns None (and logs an error), or

Using an exponential backoff

This is consistent with other storage methods in this file (append_vid, append_da, etc.) which also retry forever, so this may be an intentional design choice. But since HighQc now gates vote2 release, an infinite hang here directly impacts liveness.

claude · 2026-06-19T13:49:42Z

+    fn parent_reconstructed(
+        &self,
+        parent_view: ViewNumber,
+        parent_block_commitment: VidCommitment2,
+        parent_leaf: Commitment<Leaf2<T>>,
+    ) -> bool {
+        self.blocks_reconstructed
+            .contains(&(parent_view, parent_block_commitment))
+            || self.locked_cert.as_ref().is_some_and(|lock| {
+                lock.view_number() == parent_view && lock.data().leaf_commit == parent_leaf
+            })
+    }


Nice design: lock-as-proof-of-reconstruction is a clean optimization

The reasoning here is sound — a node only locks on a block it has reconstructed, and a persisted lock is written for exactly such a view, so matching the lock against the parent is itself proof of reconstruction. This elegantly avoids the need to re-fetch VID shares after restart.

One thing to verify: if the lock is restored from persistence but the blocks_reconstructed set is empty (post-restart), and a new proposal arrives building on a view different from the locked view, this correctly falls through to false. The first condition checks blocks_reconstructed (empty), and the second checks lock.view_number() == parent_view (won't match if parent is different from lock). Looks correct.

claude · 2026-06-19T13:49:53Z

+    pub fn seed_locked_cert(&mut self, cert1: Certificate1<T>) {
+        let view = cert1.view_number();
+        self.bump_stored_high_qc(view);
+        self.certs.entry(view).or_insert_with(|| cert1.clone());
+        if self
+            .locked_cert
+            .as_ref()
+            .is_none_or(|locked| locked.view_number() < view)
+        {
+            self.locked_cert = Some(cert1);
+        }
+    }


Question: Should seed_locked_cert also insert into blocks_reconstructed?

seed_parent inserts into blocks_reconstructed for the anchor, but seed_locked_cert does not. This is intentional because parent_reconstructed() uses the lock itself as proof of reconstruction. But it means that after restart, for the lock view:

parent_reconstructed(lock_view, ...) returns true (via the lock check)

blocks_reconstructed.contains((lock_view, ...)) returns false

The maybe_vote_2_and_update_lock method checks blocks_reconstructed directly (line ~2000-2008), not parent_reconstructed. So if a cert1 + proposal arrives for the same view as the restored lock, the vote2 path would fail the blocks_reconstructed check. Is this expected because the node wouldn't receive a cert1 for a view it already locked on? (Since voted_2_views would catch the duplicate.)

github-actions · 2026-06-20T01:29:02Z

Successfully created backport PR for release-ff-upgrade-test:

[Backport release-ff-upgrade-test] [Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart #4551

…storage for Vote2; Load Locked QC on Restart (#4551) [Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart (#4537) * require high qc storage for voting, load on restart * seed with all proposals, consider locked view reconstructed * add schemas * comments * enforce locked qc invarient in storage (cherry picked from commit 0151062) Co-authored-by: Brendon Fish <bfish713@gmail.com>

require high qc storage for voting, load on restart

002131e

bfish713 added 3 commits June 19, 2026 09:43

seed with all proposals, consider locked view reconstructed

8e35828

add schemas

c7ecbed

Merge branch 'main' into bf/qc

ecd13f1

bfish713 marked this pull request as ready for review June 19, 2026 13:45

bfish713 requested review from imabdulbasit, jbearer, lukaszrzasik, ss-es and sveitser as code owners June 19, 2026 13:45

claude Bot reviewed Jun 19, 2026

View reviewed changes

Comment thread crates/espresso/node/api/migrations/postgres/V1503__high_qc2.sql

imabdulbasit approved these changes Jun 19, 2026

View reviewed changes

bfish713 added 4 commits June 19, 2026 10:53

comments

4722854

enforce locked qc invarient in storage

ed11623

Merge remote-tracking branch 'origin/main' into bf/qc

a79dbd3

lint

ab1e65e

bfish713 enabled auto-merge (squash) June 20, 2026 00:06

bfish713 added the backport release-ff-upgrade-test label Jun 20, 2026

bfish713 merged commit 0151062 into main Jun 20, 2026
171 of 180 checks passed

bfish713 deleted the bf/qc branch June 20, 2026 01:26

github-actions Bot mentioned this pull request Jun 20, 2026

[Backport release-ff-upgrade-test] [Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart #4551

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart#4537

[Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart#4537
bfish713 merged 8 commits into
mainfrom
bf/qc

bfish713 commented Jun 18, 2026

Uh oh!

gemini-code-assist Bot commented Jun 18, 2026

Uh oh!

claude Bot commented Jun 19, 2026 •

edited

Loading

Uh oh!

claude Bot Jun 19, 2026

Uh oh!

claude Bot Jun 19, 2026

Uh oh!

claude Bot Jun 19, 2026

Uh oh!

claude Bot Jun 19, 2026

Uh oh!

claude Bot Jun 19, 2026

Uh oh!

claude Bot Jun 19, 2026

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bfish713 commented Jun 18, 2026

This PR:

This PR does not:

Key places to review:

Uh oh!

gemini-code-assist Bot commented Jun 18, 2026

Uh oh!

claude Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: [Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart

Summary

What looks good

Inline comments posted

No issues found in

Uh oh!

claude Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Jun 19, 2026 •

edited

Loading