Conversation
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. |
|
Claude finished @bfish713's task in 4m 28s —— View job Code Review: [Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart
SummaryThis PR adds a critical safety mechanism: persisting the locked QC (high QC) before releasing a phase-2 vote, and restoring it on restart. This prevents a restarted node from regressing to the (older) decided-anchor QC and potentially voting for a conflicting proposal. The design is well thought out and the implementation is clean. What looks good
Inline comments postedI've posted 6 inline comments covering:
No issues found in
|
| async fn append_high_qc2(&self, high_qc: Certificate1<SeqTypes>) -> anyhow::Result<()> { | ||
| // The lock advances monotonically, but persistence writes are spawned | ||
| // concurrently and retried, so a stale write can land after a newer | ||
| // one. Skip writes that would regress the stored view. | ||
| if let Some(existing) = (**self).load_high_qc2().await? | ||
| && existing.view_number() >= high_qc.view_number() | ||
| { | ||
| return Ok(()); | ||
| } | ||
| (**self).append_high_qc2(high_qc).await |
There was a problem hiding this comment.
Potential concern: TOCTOU race in the monotonicity guard
This load-then-conditional-write has a race window: between load_high_qc2() returning and append_high_qc2() executing, another concurrent storage task could have written a newer QC, which this write would then overwrite with an older one.
In practice this is mitigated because:
- The SQL backend uses
upsert(atomic at DB level) - The FS backend holds a write lock across
replace - The consensus layer's
bump_stored_high_qcis itself monotonic in-memory
But it's worth noting the comment says "Skip writes that would regress the stored view" while the implementation doesn't fully guarantee that under concurrency. If the underlying storage backends already guarantee monotonicity (via upsert semantics or write locking), this check is redundant. If they don't, this check is insufficient.
Is the intent for this to be a best-effort optimization (reducing unnecessary writes), or a safety invariant? If the former, a comment clarifying that would help. If the latter, consider using an atomic compare-and-swap or holding a lock across both the read and write.
| StorageOutput::HighQc(view) => { | ||
| self.bump_stored_high_qc(view); | ||
| // A higher durable lock can unblock phase-2 votes across many | ||
| // views, not just `view`, so re-check every pending vote2. | ||
| let pending: Vec<ViewNumber> = self.pending_vote2.keys().copied().collect(); | ||
| for view in pending { | ||
| self.release_vote2(view, outbox); | ||
| } | ||
| return; | ||
| }, |
There was a problem hiding this comment.
Nit: The return here skips release_vote1 and release_proposal
The HighQc branch does an early return, which means it never calls release_vote1(view, outbox) or release_proposal(view, outbox) for the view associated with this storage output.
This is likely intentional — HighQc doesn't provide any information relevant to vote1 or proposal release — but it's worth confirming. If a HighQc event happens to share a view number with a pending vote1 that is also waiting on other conditions, the early return won't delay it since release_vote1/release_proposal are called on each of their own storage events. But it does mean a HighQc event at view V won't opportunistically release a vote1/proposal at the same view V.
Is this the intended behavior? If so, a brief comment like "// HighQc only gates vote2; skip vote1/proposal release" before the return would clarify the intent.
| self.voted_2_views.insert(view); | ||
| // The lock is set above (this point is unreachable with no lock), and | ||
| // is at or above `view`. The vote may only go out once that lock is | ||
| // durable. | ||
| let required = self | ||
| .locked_view() | ||
| .expect("locked_cert is set before voting in phase 2"); |
There was a problem hiding this comment.
Question: Can locked_view() ever be less than view here?
The comment says "The lock is set above ... and is at or above view", but actually cert1.view_number() equals view (the cert certifies the same view as the vote). When the lock is not updated (because locked_cert.view_number() >= cert1.view_number()), the locked view is the existing lock which could be at any view ≥ some prior view.
The expect is safe since locked_cert is always Some by this point (either just set above, or already set from a prior view). But could required here be a different view than the PersistHighQc that was just emitted? If the lock wasn't updated (the if on line 2013 was false), no PersistHighQc is emitted, but required would be the old locked view which is already durable from a prior persist. So it seems correct — just worth confirming the reasoning.
| pub fn append_high_qc2(&mut self, high_qc: Certificate1<T>) { | ||
| let view = high_qc.view_number(); | ||
| let storage = self.storage.clone(); | ||
| let handle = self.tasks.spawn(async move { | ||
| loop { | ||
| match storage.append_high_qc2(high_qc.clone()).await { | ||
| Ok(()) => return Some(StorageOutput::HighQc(view)), | ||
| Err(err) => { | ||
| warn!(%err, %view, "failed to append high qc, retrying"); | ||
| sleep(RETRY_DELAY).await; | ||
| }, | ||
| } | ||
| } | ||
| }); | ||
| self.handles.entry(view).or_default().push(handle); |
There was a problem hiding this comment.
Infinite retry loop without backoff escalation or abort
This loop retries forever with a fixed 300ms delay. Unlike the SQL backend which uses WRITE_BACKOFF with WRITE_RETRY_MAX, this has no upper bound on retries. If storage is permanently broken (e.g. disk full, permissions changed), this task will retry forever and the vote2 will never be released.
Consider either:
- Adding a max retry count that eventually returns
None(and logs an error), or - Using an exponential backoff
This is consistent with other storage methods in this file (append_vid, append_da, etc.) which also retry forever, so this may be an intentional design choice. But since HighQc now gates vote2 release, an infinite hang here directly impacts liveness.
| fn parent_reconstructed( | ||
| &self, | ||
| parent_view: ViewNumber, | ||
| parent_block_commitment: VidCommitment2, | ||
| parent_leaf: Commitment<Leaf2<T>>, | ||
| ) -> bool { | ||
| self.blocks_reconstructed | ||
| .contains(&(parent_view, parent_block_commitment)) | ||
| || self.locked_cert.as_ref().is_some_and(|lock| { | ||
| lock.view_number() == parent_view && lock.data().leaf_commit == parent_leaf | ||
| }) | ||
| } |
There was a problem hiding this comment.
Nice design: lock-as-proof-of-reconstruction is a clean optimization
The reasoning here is sound — a node only locks on a block it has reconstructed, and a persisted lock is written for exactly such a view, so matching the lock against the parent is itself proof of reconstruction. This elegantly avoids the need to re-fetch VID shares after restart.
One thing to verify: if the lock is restored from persistence but the blocks_reconstructed set is empty (post-restart), and a new proposal arrives building on a view different from the locked view, this correctly falls through to false. The first condition checks blocks_reconstructed (empty), and the second checks lock.view_number() == parent_view (won't match if parent is different from lock). Looks correct.
| pub fn seed_locked_cert(&mut self, cert1: Certificate1<T>) { | ||
| let view = cert1.view_number(); | ||
| self.bump_stored_high_qc(view); | ||
| self.certs.entry(view).or_insert_with(|| cert1.clone()); | ||
| if self | ||
| .locked_cert | ||
| .as_ref() | ||
| .is_none_or(|locked| locked.view_number() < view) | ||
| { | ||
| self.locked_cert = Some(cert1); | ||
| } | ||
| } |
There was a problem hiding this comment.
Question: Should seed_locked_cert also insert into blocks_reconstructed?
seed_parent inserts into blocks_reconstructed for the anchor, but seed_locked_cert does not. This is intentional because parent_reconstructed() uses the lock itself as proof of reconstruction. But it means that after restart, for the lock view:
parent_reconstructed(lock_view, ...)returnstrue(via the lock check)blocks_reconstructed.contains((lock_view, ...))returnsfalse
The maybe_vote_2_and_update_lock method checks blocks_reconstructed directly (line ~2000-2008), not parent_reconstructed. So if a cert1 + proposal arrives for the same view as the restored lock, the vote2 path would fail the blocks_reconstructed check. Is this expected because the node wouldn't receive a cert1 for a view it already locked on? (Since voted_2_views would catch the duplicate.)
|
Successfully created backport PR for |
…storage for Vote2; Load Locked QC on Restart (#4551) [Fast Finality] Require Locked QC storage for Vote2; Load Locked QC on Restart (#4537) * require high qc storage for voting, load on restart * seed with all proposals, consider locked view reconstructed * add schemas * comments * enforce locked qc invarient in storage (cherry picked from commit 0151062) Co-authored-by: Brendon Fish <bfish713@gmail.com>
Closes #<ISSUE_NUMBER>
This PR:
This PR does not:
Key places to review: