fix(tbtc/signer): gate plaintext state, zeroize FFI buffers, allow periodic reshare#4124
Merged
mswilkison merged 12 commits intoJul 1, 2026
Conversation
…riodic reshare Three findings from the #4005 review (workflow review + Codex): 1. [HIGH] persistence: the legacy unencrypted plaintext state fallback was accepted unconditionally, bypassing the AEAD envelope -- anyone who could write the state file could forge it (cleared replay markers / attacker key material) without the state-encryption key. Per the secret-material hardening plan the plaintext path is now emergency-rollback-only: compile-time disabled in release builds, rejected in production profiles, and gated behind an explicit TBTC_SIGNER_PERMIT_PLAINTEXT_STATE_ROLLBACK opt-in. The two tests that load plaintext now opt in; a negative assertion pins the fail-closed default. 2. [Codex P1] ffi: free_buffer now zeroizes the buffer before deallocation, so secret material returned by nonce/DKG/key-package endpoints is wiped rather than left in allocator memory when a caller forgets to clear it. 3. [Codex P2] lifecycle: a subsequent same-session refresh with updated shares produced a different fingerprint and was rejected as SessionConflict, so a long-lived session could never advance refresh_history past the first refresh. A different fingerprint is now treated as a new periodic refresh; idempotent replay of the same fingerprint is unchanged. Deferred (flagged, not in this change): - Signing-policy firewall production force-on: force-enabling it makes the entire firewall policy config mandatory in production (a deployment-contract change); the verifier flagged it as needing product confirmation, so left as-is. - State-at-rest anti-rollback freshness counter and quarantine-reset marker loss: design-level / inherent to the opt-in policy. Verified: cargo test (297 passed, 0 failed) and cargo clippy -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Follow-up to the #4124 review (multi-agent + Codex), which found regressions in the periodic-refresh change and a gap in the FFI wipe: 1. Refresh stale-retry (CONFIRMED, Codex P2): removing SessionConflict made refresh re-execute on a replayed/older request -- re-deriving stale shares and bumping the epoch. Now a fingerprint already in refresh_history (accepted but no longer current) is rejected as a stale retry; a genuinely-new fingerprint still proceeds (repeatable periodic reshares). RefreshHistoryRecord gains an optional request_fingerprint (serde-default, backward compatible). 2. Unbounded refresh_history (CONFIRMED): the removed guard was the only bound. Cap per-session history at MAX_REFRESH_HISTORY (256), dropping oldest; strictly-increasing epochs preserved so continuity still holds. 3. FFI incomplete wipe (PLAUSIBLE): to_ffi_buffer's into_boxed_slice() shrinks and reallocates when capacity > len (serde_json over-allocates), freeing the original secret buffer un-wiped. Copy into an exact boxed slice and zeroize the source Vec so no un-wiped copy survives. 4. Quarantine schema-mismatch test exercised plaintext-refusal, not schema validation, after the plaintext gate landed: opt into the rollback flag like its sibling so it tests the intended path. 5. Plaintext-gate doc comment overclaimed "compile-time disabled in release"; clarify the load-bearing guard is the runtime non-production check. Verified: cargo test (298 passed) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex re-review: the rollback-path tests opt into the plaintext path, which legacy_plaintext_state_permitted() compiles out in release (cfg!(debug_assertions) is false), so under `cargo test --release` they fail at the post-acceptance assertions. cfg-gate the three plaintext-acceptance tests to debug_assertions; release always refuses plaintext before those assertions, so there is nothing for them to cover there. Verified: cargo test (298 passed, debug) + cargo test --release (295 passed; the 3 tests correctly compiled out) + clippy --release -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex re-review: a host that installs an init-time config has signer_env_var read the installed config, not the process environment, so the documented TBTC_SIGNER_PERMIT_PLAINTEXT_STATE_ROLLBACK env flag had no effect for it -- the emergency plaintext migration was impossible to enable via the configured-host path. Add permit_plaintext_state_rollback to InitSignerConfigRequest and map it in config_values_from_request, mirroring the other enforcement flags. Verified: cargo test (298 passed) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex re-review: a session loaded from state written before RefreshHistoryRecord.request_fingerprint deserializes its history records with None, so the last accepted refresh fingerprint survives only in session.refresh_request_fingerprint. The first post-upgrade periodic refresh overwrote that value, so a delayed retry of the pre-upgrade refresh no longer matched the history scan and was accepted as a new epoch instead of SessionConflict. Before overwriting refresh_request_fingerprint, backfill the previously-accepted fingerprint onto the most-recent history record when it is not already tracked (the legacy None case). Adds a regression test that simulates pre-upgrade state (strip history fingerprints) and verifies a retry of the old refresh is rejected after a new one. Verified: cargo test (299 debug / 296 release) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex re-review: refresh_cadence_status reported refresh_count as session.refresh_history.len(), which the MAX_REFRESH_HISTORY cap bounds -- so a long-lived session that ran more than 256 refreshes showed a stuck count of 256 even as later refreshes succeeded. Add a monotonic SessionState.refresh_count (and its PersistedSessionState mirror, serde-default for back-compat, plus both conversions) incremented on each accepted refresh and backfilled from the retained history length for legacy state; report it from refresh_cadence_status instead of the pruned history length. Adds a regression test that prunes history and asserts the reported count is the true total. Verified: cargo test (300 passed) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… load Codex re-review: refresh_count is #[serde(default)], so state written before the field existed deserializes it to 0 even when refresh_history holds accepted refreshes -- and since refresh_cadence_status now reports refresh_count, an upgraded node reported 0 until the next refresh. Backfill it from refresh_history.len() in the Persisted->Session conversion (evaluated before refresh_history is moved), with a regression test. Verified: cargo test + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntine recheck)
Codex re-review of the signer FFI surface:
1. [P1] Secret request fields not zeroized (api.rs): SignShareRequest.nonces_hex
and key_package_hex deserialized into plain Strings dropped without
zeroization, retaining the one-time nonce / private key package in freed heap.
Wrap both in Zeroizing<String> (wiped on drop) + a redacting Debug so the
secrets no longer leak via {:?} either.
2. [P1] Panic hook defeats payload redaction (ffi.rs): catch_unwind does not
suppress Rust's default panic hook, so a panic payload (path/config/secret)
prints to stderr before being converted to the redacted ErrorResponse. Install
a panic hook at the FFI boundary that redacts the payload outside development.
3. [P2] Quarantine not rechecked on cached-round reuse (signing.rs): the matched-
fingerprint reuse branch returned a fresh share without the new-round path's
enforce_not_quarantined_identifiers check, so a participant quarantined after
the round opened could still obtain a share. Apply the check on reuse too.
Verified: cargo test (301 passed) + clippy -D warnings + fmt clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pty history Codex re-review: legacy state written before refresh_history existed can carry refresh_request_fingerprint/refresh_result with an EMPTY refresh_history. The fingerprint backfill used refresh_history.last_mut(), which is None for empty history, so the previous fingerprint was not recorded and a delayed retry of that pre-upgrade refresh was re-executed as a new refresh. When there is no record to backfill onto, synthesize one from the cached refresh_result (prior epoch, keeping history strictly increasing) carrying the previous fingerprint, so stale retries stay rejected. Adds a regression test for the empty-history legacy shape. Verified: cargo test (302 passed) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h env Two Codex-flagged issues in the non-interactive signing path: 1. start_sign_round fingerprinted and derived the round id from the raw request.message_hex string. hex::decode accepts mixed casing, so two calls carrying identical message bytes but different hex casing derived different fingerprints/round ids and a semantically identical retry was rejected as a SessionConflict. Lowercase message_hex right after it validates as hex, mirroring the interactive path (interactive.rs), so the fingerprint and derive_round_id are casing-independent. 2. The README-documented `cargo bench --features bench-restart-hook --bench phase5_roast` failed in a clean shell: ensure_benchmark_environment never set TBTC_SIGNER_PROFILE (missing -> production) or TBTC_SIGNER_STATE_ENCRYPTION_KEY_HEX (required by the default `env` state-key provider), so the first RunDkg persist aborted. Seed profile=development, provider=env, and a fixed dev key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The prior commit lowercases message_hex before computing every canonical and legacy fingerprint. That regressed the upgrade path: when a node upgrades with an active sign round that a previous build started using mixed/uppercase message_hex, the persisted sign_request_fingerprint was computed over that original casing. The same retry then matched none of the fingerprints and fell through to SessionConflict instead of reusing the cached round. Capture the pre-canonicalization casing and, only when it differs from the lowercased form, compute the four legacy fingerprint shapes over the original casing and accept them in the cached-round match. A match migrates the stored fingerprint to the canonical lowercase form, so the compatibility cost is paid once per in-flight round. Adds a regression test that persists a mixed-case fingerprint and asserts the retry reuses the round and migrates it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d result The empty-history synthesize branch was guarded by `else if let Some(previous_result)`, so a legacy/degraded state that kept refresh_request_fingerprint but whose refresh_result deserialized to None (and whose refresh_history is empty) fell through: neither the last_mut() backfill nor the synthesize branch ran. The next refresh then overwrote the only copy of the old fingerprint, so a delayed retry of the pre-upgrade request was treated as fresh and advanced the epoch — a replay. Make the branch unconditional: synthesize a history record carrying the previous fingerprint whether or not a cached result is present. Prefer the result's epoch (when non-zero and below the new refresh) and share count; otherwise fall back to refresh_epoch-1 (min 1) and a zero share count. refresh_epoch_counter is persisted, so a prior accepted refresh implies refresh_epoch >= 2 and the fallback stays non-zero and strictly below the new record, keeping history monotonic. Adds a regression test (empty history + refresh_result=None) verified to fail against the pre-fix code (the stale retry was re-executed instead of rejected). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
01e274b
into
extraction/frost-signer-mirror-2026-05-26
20 checks passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three findings from the #4005 review (the multi-agent custody review + Codex's review), fixed and verified.
1. [HIGH] Plaintext state fallback was accepted unconditionally —
persistence.rsdecode_persisted_state_storage_formatfell back to parsing an unencrypted, unauthenticatedPersistedEngineStatewhenever the bytes weren't a valid AEAD envelope — with no gate at all. The AEAD envelope is the only integrity control for state-at-rest, so anyone who could write the state file could forge it (clearedconsumed_*replay markers → resume-into-double-sign, or an attackerdkg_key_package→ admission bypass) without holding the state-encryption key.Per the secret-material hardening plan, the plaintext path is now emergency-rollback-only: compile-time disabled in release builds (
cfg!(debug_assertions)), rejected in production profiles, and gated behind an explicitTBTC_SIGNER_PERMIT_PLAINTEXT_STATE_ROLLBACKopt-in. The two tests that load plaintext (migration + schema-mismatch) opt in, and a negative assertion pins the fail-closed default.2. [Codex P1] Secret FFI buffers left unwiped on free —
ffi.rsfree_buffernowzeroizes the buffer before deallocation, so secret material returned by nonce/DKG/key-package endpoints (e.g.frost_tbtc_generate_nonces_and_commitments) is wiped rather than left in allocator memory when a caller forgets to clear it. (zeroizeis a volatile wipe the optimizer can't elide.)3. [Codex P2] Periodic reshare was a one-shot —
lifecycle.rsAfter the first
refresh_shares, any later reshare with updatedcurrent_shareshad a different fingerprint and was rejected asSessionConflict, so a long-lived session could never advancerefresh_history/refresh_count/cadence past the first refresh. A different fingerprint is now treated as a new periodic refresh; idempotent replay of the same fingerprint is unchanged.Deferred (flagged for the team)
Verification
cargo test— 297 passed, 0 failed.cargo clippy --all-targets -- -D warnings— clean (exit 0).Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com
🤖 Generated with Claude Code