Skip to content

Feat: Single Sequencer#949

Open
tomatoishealthy wants to merge 19 commits into
mainfrom
feat/sequencer-optimize
Open

Feat: Single Sequencer#949
tomatoishealthy wants to merge 19 commits into
mainfrom
feat/sequencer-optimize

Conversation

@tomatoishealthy
Copy link
Copy Markdown
Contributor

@tomatoishealthy tomatoishealthy commented May 9, 2026

Summary by CodeRabbit

  • New Features

    • Added Sequencer High Availability (HA) system with Raft-based clustering, enabling failover and cluster membership management.
    • Upgraded L1Sequencer contract to maintain sequencer history with L2 block activation scheduling instead of single address.
    • Made block application idempotent and reorg-aware for proper state handling.
  • Tests

    • Added comprehensive test suites for HA operations, including cluster formation, failover, and API verification.
    • Added L1Sequencer contract tests covering initialization, history management, and ownership controls.
  • Chores

    • Updated dependencies across all modules to new upstream versions.

Review Change Stack

@tomatoishealthy tomatoishealthy requested a review from a team as a code owner May 9, 2026 15:28
@tomatoishealthy tomatoishealthy requested review from twcctop and removed request for a team May 9, 2026 15:28
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 9, 2026

📝 Walkthrough

Walkthrough

This PR introduces Sequencer History V2 alongside a new Raft-based High-Availability (HA) system for the Morph sequencer. The L1Sequencer contract is refactored from a single address to an ordered history with L2-block activation scheduling. The verifier switches from TTL-based caching to history-backed lookups, and a new hakeeper module provides Raft-based consensus for cluster-mode sequencer operation with leader election and block replication. The executor gains idempotent block application, and comprehensive Docker/test infrastructure validates cluster topology and P2P optimizations.

Changes

Sequencer History V2 with High-Availability Raft Support

Layer / File(s) Summary
L1Sequencer Contract and ABI Bindings
contracts/contracts/l1/L1Sequencer.sol, bindings/bindings/l1sequencer.go, contracts/contracts/test/L1Sequencer.t.sol, contracts/contracts/test/base/L1SequencerBase.t.sol, contracts/deploy/022-SequencerInit.ts
Contract refactored to store ordered HistoryRecord[] with L2 block activation heights, replacing immediate-effect updates. New initializeHistory(firstSequencer, upgradeL2Block) and updateSequencer(newSequencer, startL2Block) admin flows. View functions now expose getSequencerAt(l2Height) for height-based lookup and full history access. Generated bindings and comprehensive test suite (28 test functions) validate initialization, history growth, binary-search lookup, and ownership controls.
Sequencer Verification from Cache to History
node/l1sequencer/verifier.go
SequencerVerifier completely rewritten: replaces TTL-based single-address cache with L1 contract history polling. Background refresh loop uses exponential backoff (10s) while history is empty, then switches to 5-minute polling. Maintains interval cursor for O(1) height lookups. New API: SequencerAtHeight(l2Height), IsSequencerAt(addr, l2Height), VerificationStartHeight(), Stop(). First history load sets Tendermint upgrade height from first record's StartL2Block.
L1 Sequencer Signer Refactoring
node/l1sequencer/signer.go
Signer interface simplified: removed IsActiveSequencer contract. LocalSigner decoupled from SequencerVerifier; now stores only private key and derived address. Constructor signature updated to eliminate verifier parameter.
HA Configuration Model and Validation
node/hakeeper/config.go, node/hakeeper/ha.toml.example
New hakeeper configuration package with nested Config, ConsensusConfig, SnapshotConfig, TimeoutConfig, RPCConfig. Includes DefaultConfig(), LoadFile(path) for TOML overlay, Resolve(homeDir) for auto-populate ServerID/StorageDir/AdvertisedAddr, Validate() for required fields and port ranges, and LogEffectiveConfig() for operator visibility.
HA Service: Raft FSM and Core Components
node/hakeeper/block_fsm.go, node/hakeeper/block_payload.go, node/hakeeper/ha_service.go, node/hakeeper/leader_monitor.go
BlockFSM implements Raft log apply with decoded block delivery via buffered channel and onApplied callback injection. Snapshot/restore persist only 8-byte appliedHeight. Block serialization/deserialization helpers wrap proto conversion errors. HAService orchestrates Raft: storage, transport, snapshotting, optional bootstrap, leader-readiness monitoring with Barrier() synchronization, and join loop with exponential backoff. Exposes Commit(block) for replication and Subscribe() for committed-block streaming.
HA Management RPC API
node/hakeeper/rpc/api.go, node/hakeeper/rpc/auth.go, node/hakeeper/rpc/auth_test.go, node/hakeeper/rpc/backend.go, node/hakeeper/rpc/client.go, node/hakeeper/rpc/server.go, node/hakeeper/rpc/types.go
Complete JSON-RPC management API: API interface with leader queries, voter/non-voter membership changes, leader transfer, and cluster-membership reporting. APIBackend delegates to ConsensusAdapter (Raft). HTTP server with optional token-based write-method auth. Client library for remote HA management. RPC types: ServerSuffrage enum, ClusterMembership, ServerInfo, ConsensusAdapter interface. Auth middleware enforces constant-time token comparison, preserves request body, and supports batch JSON-RPC requests.
Executor Idempotent Block Application
node/core/executor.go, node/types/retryable_client.go, node/types/retryable_client_test.go
ApplyBlockV2 signature changed to (applied bool, err error) with reorg-aware skip logic: checks L2 head height, skips apply if existing block hash matches at same height. Added Syncer() getter and SetSyncer() idempotent setter. RetryableClient.NewL2BlockV2 wraps L2BlockV2 creation with retry policy. Permanent-error classification extended to include wrong block number, missing parent, block hash mismatch, invalid NextL1MsgIndex.
Node Startup and HA Wiring
node/cmd/node/main.go
L2NodeMain integrates HA service: removed ConsensusSwitchHeight logic. Added conditional eager L1 syncer startup for post-upgrade nodes. initHAService helper builds HA config from defaults, TOML file, and CLI flag overrides. SequencerVerifier now requires sequencer contract address (fails if unset). SetupNode accepts optional HA parameter. LocalSigner created without verifier. Verifier stopped during shutdown.
CLI Flag Configuration for HA
node/flags/flags.go
Added seven HA flags: ha.enabled, ha.config, ha.bootstrap, ha.join, ha.server-id, ha.advertised-addr, ha.rpc-token, all with MORPH_NODE_HA_* environment mappings. Removed consensus.switchHeight. Flags inserted after sequencer.privateKey in global Flags slice.
Syncer Start Guard
node/sync/syncer.go
Syncer.Start() updated with atomic.Bool guard to prevent duplicate initialization: subsequent calls log and return early.
Module Dependencies and Versions
Makefile, bindings/go.mod, contracts/go.mod, node/go.mod, ops/l2-genesis/go.mod, ops/tools/go.mod, oracle/go.mod, token-price-oracle/go.mod, tx-submitter/go.mod, go-ethereum submodule
Coordinated version/replace updates across all modules: tendermint replaced with morph-l2/tendermint v0.0.0-20260515043308-c6f7e21e4b14, go-ethereum pinned to v0.0.0-20260508105911-56deb7072ae4. Makefile pins ETHEREUM_SUBMODULE_COMMIT_OR_TAG, ETHEREUM_TARGET_VERSION, TENDERMINT_TARGET_VERSION. Hashicorp Raft upgraded to v1.7.3, raft-boltdb to v2.3.1 in node/go.mod. go-ethereum submodule updated to commit 62952ec7d188953ee7b3224c3693baffa97c8084.
Docker Test Infrastructure for HA
ops/docker-sequencer-test/docker-compose.ha-override.yml, ops/docker-sequencer-test/docker-compose.override.yml, ops/docker/.env
New docker-compose.ha-override.yml defines isolated HA cluster: three geth clients (ha-geth-0/1/2) and three Tendermint nodes (ha-node-0 bootstrap, ha-node-1/2 followers) with Raft configuration. Updated docker-compose.override.yml removes sequencer key from node-0 (fullnode mode), adds malicious node for P2P security testing. Updated .env with new contract addresses.
Comprehensive HA Integration Test Suite
ops/docker-sequencer-test/run-ha-test.sh
Complete end-to-end HA V2 test runner: config validation, cluster formation (3-voter verification, join retry), block production (leader growth, follower silence, sync matching), failover (election timing, leader rejoin), admin API (8 endpoints with version optimistic locks), lifecycle (follower loops, restart recovery, Barrier evidence), P2P optimization (reactor settings via block deltas and log analysis). Generates markdown test report with pass/fail/skip aggregation.
HA Performance Testing Framework
ops/docker-sequencer-test/run-perf-test.sh
New performance benchmarking script: builds perf-instrumented Docker images, deploys L1/L2 devnet, starts HA cluster, runs txflood load generator, collects [PERF] log entries, and produces per-node timing summaries (produceBlock, HAService.Commit, BlockFSM.Apply, ApplyBlock) with Raft overhead analysis.
Test Script Updates and P2P Security
ops/docker-sequencer-test/run-test.sh
Removed set_upgrade_height helper. Added wait_for_l1_finalized for finality safety. L1Sequencer history initialization in devnet deployment (calls initializeHistory with UPGRADE_HEIGHT). New P2P malicious node testing: build-malicious command, swap_sentry_to_malicious, and test_p2p_security with staged rejection validation and forge-attempt detection. Extended CLI help and entrypoints.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Possibly related PRs

  • morph-l2/morph#748: Updates the same Makefile dependency pins (ETHEREUM_SUBMODULE_COMMIT_OR_TAG, ETHEREUM_TARGET_VERSION, TENDERMINT_TARGET_VERSION) and go.mod replace/require entries for go-ethereum and tendermint.
  • morph-l2/morph#936: Refactors the same RetryableClient code and permanent-error classification logic in node/types/retryable_client.go.
  • morph-l2/morph#723: Updates go-ethereum dependency versions across go.mod files via Makefile/submodule tracking and replace/require directives.

Suggested reviewers

  • twcctop
  • panos-xyz
  • chengwenxi

🐰 A sequencer's tale in history made,
From single point to cluster's trade!
With Raft and votes and leader's race, 🚀
The L2 finds a safer place.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sequencer-optimize

allen.wu and others added 18 commits May 9, 2026 23:42
- L1Sequencer.sol: sequencerHistory[], updateSequencer, getSequencerAt, initializeHistory
- Bindings: updated ABI for new contract interface
- SequencerVerifier: L1 history cache with interval cursor optimization
- Signer: simplified interface (removed IsActiveSequencer)
- 022-SequencerInit.ts: fixed initialize call (1 param instead of 2)
- Docker: added L1_SEQUENCER_CONTRACT env for all nodes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- run-test.sh: added build-malicious and p2p-test commands
- docker-compose.override.yml: malicious-geth-0 and malicious-node-0 services
- Tests: T-01~T-05 (active attacks) + T-06 (BlockSync pollution) + T-07 (resilience)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix grep -c multiline: use || true instead of || echo "0"
- Fix env var loss: malicious override must include full env list
- Swap approach: reuse synced sentry instead of fresh malicious container
- Uncomment CONSENSUS_SWITCH_HEIGHT for V2 mode activation
- Add SEQUENCER_PRIVATE_KEY to node-0 override

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use staking key (0xd998...) as SEQUENCER_PRIVATE_KEY for node-0
- Add initializeHistory() call in setup to register sequencer on L1
- Fixes "no sequencer record" error in V2 mode

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- T-06: use blocksync-forge (blocksync/reactor.go) instead of sync-forge
  (broadcast_reactor.go) - targets the actual V1 vulnerability path
- T-06: stop node-3 to create gap, restart to trigger BlockSync
- Phase 0: explicit checks for V2 mode, signer, and switch height
- T-04: use futureHeight (currentHeight+10000) for deterministic unsolicited test
- Separate log files per phase to prevent cross-contamination

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add L1Sequencer.t.sol: 27 Foundry tests covering initialize,
  initializeHistory, updateSequencer, getSequencerAt binary search
  edge cases, and access control
- Regenerate l1sequencer.go with abigen (bytecode now matches current
  contract with sequencerHistory[], binary search, etc.)
- Update verifier.go: L1SequencerHistoryRecord -> L1SequencerSequencerRecord
- Add exponential backoff retry (10s -> 20s -> ... -> 5min) when initial
  history load fails, instead of waiting full 5 minutes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoids stuttering in abigen output (L1SequencerSequencerRecord ->
L1SequencerHistoryRecord). No ABI/storage layout change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…om L1 contract

Unify the upgrade height source: instead of a CLI flag / env var, the
verifier now sets upgrade.UpgradeBlockHeight from the first history
record fetched from the L1Sequencer contract.

- node/l1sequencer/verifier.go: call SetUpgradeBlockHeight on first
  successful history load (prev==0)
- node/cmd/node/main.go: remove ConsensusSwitchHeight flag read;
  require L1 Sequencer contract address
- node/flags/flags.go: delete ConsensusSwitchHeight flag definition
- docker-compose.override.yml: remove 5× MORPH_NODE_CONSENSUS_SWITCH_HEIGHT
- run-test.sh: remove set_upgrade_height function, add wait_for_l1_finalized
  to ensure L1 contract data is finalized before L2 nodes start

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These env var overrides (DEPOSIT_CONTRACT_ADDRESS, SYNC_START_HEIGHT)
and the malicious_geth_data volume should be managed via overlay/override
files, not by modifying the base compose file directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ncer

Add hakeeper module implementing a 3-node Raft cluster for sequencer HA.
The HA cluster provides automatic leader election, block replication, and
failover without changing the on-chain sequencer identity.

node/hakeeper/:
- HAService: wraps hashicorp/raft, implements SequencerHA interface
- Config: layered loading (defaults -> TOML file -> CLI flags -> resolve -> validate)
  supports auto-detection of server_id (hostname) and advertised_addr (local IP)
- BlockFSM: Raft FSM for block replication; onApplied callback drives geth apply
- leaderMonitor: gates block production behind Barrier to ensure log catch-up
- rpc/: JSON-RPC admin API (ha_leader, ha_clusterMembership, ha_addServerAsVoter,
  ha_removeServer, ha_transferLeader, ha_transferLeaderToServer)
  with HTTP middleware token auth on write operations

node/flags/flags.go:
- New flags: --ha.enabled, --ha.config, --ha.bootstrap, --ha.join,
  --ha.server-id, --ha.advertised-addr, --ha.rpc-token

node/cmd/node/main.go:
- initHAService(): init HA from flags/config when --ha.enabled is set
- Fix typed-nil interface bug: pass untyped nil when HA is disabled

node/sequencer/tm_node.go:
- Pass HA service to tendermint node setup

node/go.mod:
- Add hashicorp/raft v1.7.1, raft-boltdb/v2

ops/docker-sequencer-test/:
- docker-compose.ha-override.yml: 3-node Raft cluster config for devnet
- run-ha-test.sh: 29-case integration test suite (config, cluster, block
  production, failover, admin API, lifecycle)
- run-perf-test.sh: performance test harness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire up the new engine_newL2BlockV2 API for reorg support:

- Executor.ApplyBlockV2 now returns (applied bool, err error) matching
  the updated L2Node interface; detects idempotent skips and reorgs
  using BlockNumber + BlockByNumber checks before calling NewL2BlockV2
- RetryableClient.NewL2BlockV2 wraps the new authclient method with
  exponential backoff retry; excludes WrongBlockNumberError and
  ParentNotFoundError from retry (permanent errors)
- RetryableClient.AssembleL2BlockV2 added for parent-hash-based block
  assembly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add BlockHashMismatchError and InvalidNextL1MsgIndexError to the
retryableError() exclusion list so the executor stops re-sending
invalid payloads back to geth.

Made-with: Cursor
… not in PBFT validator set

- Add Syncer()/SetSyncer() accessors to Executor for explicit syncer wiring
- Start L1 syncer eagerly in main.go for separated-deployment / HA sequencers
  that are not PBFT validators (lazy-init path never fires for them)
- Guard Syncer.Start() with atomic flag to prevent duplicate goroutines

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Increase blockCh buffer from 200 to 1000 to reduce drops under load.
- Panic on nil onApplied callback in BlockFSM.Apply: this can only happen
  due to a programmer error (forgot to wire SetOnBlockApplied) and would
  otherwise silently succeed on the leader while followers diverge.
- gofmt: realign one-line method bodies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- go-ethereum: v0.0.0-20260508105911-56deb7072ae4
- tendermint: v0.0.0-20260508065906-9e56b04da3c8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add explicit replace directives in every go.mod to override MVS,
because token-price-oracle indirectly required v1.10.14-..., which
caused all workspace modules to resolve to the older version and
miss new APIs (NewL2BlockV2, AssembleL2BlockV2, SetBlockTags,
MorphTxType, updated AssembleL2Block/NewL2Block signatures).

Drop the local-path replace block from go.work so the pinned
pseudo-version is actually used.

Follow-up: investigate the indirect dep that requires v1.10.14
and bump it so these per-module replaces can be removed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tomatoishealthy tomatoishealthy force-pushed the feat/sequencer-optimize branch from 650d038 to 8f4c5ee Compare May 9, 2026 15:46
Pull in the persistent-peer ban exemption + sigStore.Close fixes from
morph-l2/tendermint feat/sequencer-optimize (commit c6f7e21e4).

Updated via 'make update' after bumping TENDERMINT_TARGET_VERSION in
the Makefile. All sub-modules tidied. morphnode + tendermint binaries
build cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
contracts/deploy/022-SequencerInit.ts (1)

31-32: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace toLocaleLowerCase() with toLowerCase() for locale-insensitive address comparisons.

Lines 31 and 48 use toLocaleLowerCase(), which is locale-dependent. For hex address normalization, use toLowerCase() to avoid edge-case mismatches.

Suggested fix
-        (await IL1SequencerProxy.implementation()).toLocaleLowerCase() !== L1SequencerImplAddress.toLocaleLowerCase()
+        (await IL1SequencerProxy.implementation()).toLowerCase() !== L1SequencerImplAddress.toLowerCase()
...
-                    (await IL1SequencerProxy.implementation()).toLocaleLowerCase() === L1SequencerImplAddress.toLocaleLowerCase()
+                    (await IL1SequencerProxy.implementation()).toLowerCase() === L1SequencerImplAddress.toLowerCase()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@contracts/deploy/022-SequencerInit.ts` around lines 31 - 32, The code uses
locale-dependent string normalization when comparing addresses: replace all
calls to toLocaleLowerCase() with toLowerCase() (e.g., in the comparison
involving IL1SequencerProxy.implementation() and L1SequencerImplAddress and any
other occurrences in this file such as the second usage at the later comparison)
so address comparisons are locale-insensitive; update those invocations to call
toLowerCase() on the strings being compared.
ops/docker-sequencer-test/run-test.sh (1)

53-60: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make parse failures explicit in get_block_number.

This can now return an empty string on transient RPC errors, but callers still do integer arithmetic/comparisons on the result. That turns a parse failure into integer expression expected noise or bogus test flow. Return non-zero here and let callers retry or fail fast.

Suggested fix
 get_block_number() {
     local rpc_url="${1:-$L2_RPC}"
     local result
     result=$(curl -s -X POST -H "Content-Type: application/json" \
         --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
         "$rpc_url" 2>/dev/null)
-    echo "$result" | grep -o '"result":"[^"]*"' | cut -d'"' -f4 | xargs printf "%d" 2>/dev/null || true
+    local hex
+    hex=$(echo "$result" | grep -o '"result":"[^"]*"' | cut -d'"' -f4)
+    [ -n "$hex" ] || return 1
+    printf "%d" "$hex" 2>/dev/null
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ops/docker-sequencer-test/run-test.sh` around lines 53 - 60, The
get_block_number function currently prints an empty string on parse failure
which causes callers to see "integer expression expected"; modify
get_block_number to detect when the curl/grep/parse yields no numeric block
number (empty or non-numeric), write a clear error to stderr including the
rpc_url and raw response for debugging, and exit the function with a non-zero
status (return 1) so callers can retry or fail fast; keep the existing signature
and behavior on success (print numeric block and return 0).
🟡 Minor comments (6)
ops/docker-sequencer-test/run-perf-test.sh-27-27 (1)

27-27: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fragile docker build context: BITGET_ROOT assumes a specific parent-directory layout.

BITGET_ROOT is computed as the parent of the morph repo and then used as the docker build context with Dockerfile paths like morph/ops/docker-sequencer-test/Dockerfile.l2-geth-test. This implicitly requires (1) the morph repo to be checked out into a directory literally named morph, and (2) the parent directory to be a "safe" docker context — for many devs the parent is $HOME or a multi-repo workspace, which means the build either fails (wrong dirname) or sends gigabytes of unrelated trees as build context. The variable name (BITGET_ROOT) also leaks an internal/organizational assumption into an OSS test script.

If a sibling tree is genuinely required (e.g., a vendored go-ethereum checkout next to morph/), make the requirement explicit and overridable, e.g.:

-BITGET_ROOT="$(cd "$MORPH_ROOT/.." && pwd)"
+# Parent dir is used as docker build context (must contain `morph/` and any
+# sibling sources referenced by the Dockerfiles). Override via env if your
+# layout differs.
+BUILD_CONTEXT_ROOT="${BUILD_CONTEXT_ROOT:-$(cd "$MORPH_ROOT/.." && pwd)}"

and update lines 117/119/123 to use $BUILD_CONTEXT_ROOT plus the resolved repo dirname (basename "$MORPH_ROOT") instead of the hard-coded morph/ prefix. Otherwise, please document the required layout in the usage header.

Also applies to: 114-124

ops/docker-sequencer-test/run-perf-test.sh-197-203 (1)

197-203: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

do_start reports success even when no HA leader is elected.

wait_for_ha_leader 60 || { log_warn "HA leader not found, checking logs..."; } swallows the failure, then line 202 unconditionally prints log_ok "HA cluster running" and returns 0. do_run will then start txflood and gather [PERF] data against a cluster with no leader, producing misleading or empty summaries. For a perf harness this should fail fast.

🛠️ Proposed fix
     # Wait for HA leader
     log_info "Waiting for HA cluster formation..."
     sleep 10
-    wait_for_ha_leader 60 || { log_warn "HA leader not found, checking logs..."; }
+    wait_for_ha_leader 60 || { log_err "HA leader not found; aborting"; return 1; }

     log_ok "HA cluster running"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ops/docker-sequencer-test/run-perf-test.sh` around lines 197 - 203, The
do_start function currently ignores wait_for_ha_leader failures and always
prints log_ok "HA cluster running", so change do_start to treat
wait_for_ha_leader failure as fatal: call wait_for_ha_leader 60 and if it fails,
emit a clear error (use log_error or log_warn) and return a non-zero status (or
exit 1) so do_start does not print log_ok nor allow do_run to proceed; only
print log_ok "HA cluster running" when wait_for_ha_leader succeeds. Ensure the
change references wait_for_ha_leader, do_start, do_run, and log_ok so the
failure propagates and the perf harness fails fast.
node/hakeeper/block_fsm.go-48-59 (1)

48-59: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Buffer size inconsistency between comment and code.

The comment on line 48 states "Buffer of 200" but the code on line 59 initializes the channel with a buffer of 1000. Update the comment to match the implementation.

📝 Proposed fix
-	// Buffer of 200 gives ample room for transient subscriber slowness.
+	// Buffer of 1000 gives ample room for transient subscriber slowness.
 	blockCh chan *types.BlockV2
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/block_fsm.go` around lines 48 - 59, The comment above the
blockCh field in the BlockFSM struct is out of sync: update the comment that
currently says "Buffer of 200 gives ample room for transient subscriber
slowness." to reflect the actual buffer size used in NewBlockFSM (1000). Change
the comment near the blockCh declaration in BlockFSM so it reads something like
"Buffer of 1000 gives ample room for transient subscriber slowness." to match
the initialization in NewBlockFSM.
node/hakeeper/config.go-152-152 (1)

152-152: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

IPv6 wildcard address not handled.

The check listenAddr != "0.0.0.0" only handles the IPv4 wildcard address. If operators configure listen_addr = "::" (the IPv6 wildcard), the current logic will incorrectly treat it as a specific IP and use it directly, which will fail when constructing the advertised address.

🔧 Proposed fix to handle IPv6 wildcard
-	if listenAddr != "0.0.0.0" && listenAddr != "" {
+	if listenAddr != "0.0.0.0" && listenAddr != "::" && listenAddr != "" {
 		return net.JoinHostPort(listenAddr, port), nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/config.go` at line 152, The current check uses listenAddr !=
"0.0.0.0" which misses the IPv6 wildcard "::"; update the logic that decides
whether to use listenAddr for the advertised address to treat any unspecified IP
as a wildcard. Parse listenAddr with net.ParseIP and use
parsedIP.IsUnspecified() (and still treat empty string as wildcard) so both
"0.0.0.0" and "::" are handled; change the conditional around listenAddr used
when building the advertised address (the code referencing listenAddr and the
advertised address construction) to only use listenAddr when parsedIP is non-nil
and not unspecified.
node/hakeeper/rpc/client.go-80-84 (1)

80-84: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Return nil pointer on error.

When CallContext returns an error, the function still returns &membership (a pointer to a zero-valued struct). Returning nil on error is clearer and follows Go conventions.

🛡️ Proposed fix
 func (c *APIClient) ClusterMembership(ctx context.Context) (*ClusterMembership, error) {
 	var membership ClusterMembership
 	err := c.c.CallContext(ctx, &membership, prefixRPC("clusterMembership"))
+	if err != nil {
+		return nil, err
+	}
 	return &membership, err
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/rpc/client.go` around lines 80 - 84, The ClusterMembership
method currently always returns &membership even when c.c.CallContext returns an
error; update APIClient.ClusterMembership so that after calling
c.c.CallContext(ctx, &membership, prefixRPC("clusterMembership")) you check the
returned err and if err != nil return nil, err, otherwise return &membership,
nil; this uses the APIClient.ClusterMembership, CallContext and
ClusterMembership identifiers to locate and implement the change.
node/core/executor.go-382-387 (1)

382-387: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard SetSyncer(nil).

Assigning a nil *sync.Syncer to e.l1MsgReader produces a non-nil interface value, so later e.l1MsgReader == nil checks will pass and the first read call will dereference nil. Reject nil before wiring it in.

Suggested fix
 func (e *Executor) SetSyncer(s *sync.Syncer) {
+	if s == nil {
+		return
+	}
 	if e.syncer != nil {
 		return
 	}
 	e.syncer = s
 	e.l1MsgReader = s
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/core/executor.go` around lines 382 - 387, The SetSyncer method currently
assigns a nil *sync.Syncer into e.syncer and e.l1MsgReader which yields a
non-nil interface and breaks later nil checks; update Executor.SetSyncer to
reject nil by returning early (or logging/erroring) if the provided *sync.Syncer
is nil, and only wire s to e.syncer and e.l1MsgReader when s != nil so that
e.l1MsgReader comparisons remain valid (refer to method SetSyncer and fields
e.syncer and e.l1MsgReader).
🧹 Nitpick comments (7)
node/hakeeper/ha_service.go (1)

181-185: ⚖️ Poor tradeoff

Panic on FSM.Apply error crashes the leader node.

If the FSM's onApplied callback fails (e.g., geth ApplyBlock returns an error), the leader panics and terminates. This is a hard failure that requires node restart and operator intervention.

This design choice treats FSM apply failures as invariant violations: the block was committed to Raft (quorum achieved) but couldn't be applied to local state. Since followers also apply the same block, a failure here suggests either:

  • A bug in the block application logic
  • State corruption
  • Transient resource exhaustion

Panicking forces immediate visibility and prevents the leader from continuing in an inconsistent state. However, it also means a single bad block can take down the leader, triggering failover.

Verify that this aligns with the intended failure mode. If transient errors (e.g., disk I/O spikes) are possible, consider retry logic before panicking.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/ha_service.go` around lines 181 - 185, The current leader code
panics when the FSM Apply callback returns an error (the block handling path
that checks f.Response() and calls panic with "hakeeper: leader FSM.Apply
failed"), which kills the leader process; change this behavior in the leader FSM
apply-handling (the block using f.Response(), resp.(error) check and panic) to
instead surface the error and perform controlled handling: replace the panic
with a bounded retry/backoff loop that re-invokes the local apply handler (or
requeues the Raft log application) a configurable number of times, and if still
failing, log the full error via the existing logger and trigger a safe
leadership relinquish (step down) or mark the node unhealthy rather than
terminating the process; ensure you preserve the original error in logs and
metrics and reference the same response check (f.Response()) and error branch so
callers can observe the failure mode.
node/hakeeper/config.go (2)

203-208: 💤 Low value

Port validation allows port 0 for consensus and RPC.

Ports 0 (OS-assigned) are technically valid but likely not appropriate for Raft consensus and RPC servers, which need stable, well-known ports for cluster membership and client connections. While the current bounds check prevents negative values and values >65535, explicitly rejecting port 0 would prevent misconfigurations.

🔧 Proposed validation to reject port 0
-	if c.Consensus.ListenPort < 0 || c.Consensus.ListenPort > math.MaxUint16 {
+	if c.Consensus.ListenPort <= 0 || c.Consensus.ListenPort > math.MaxUint16 {
 		return fmt.Errorf("invalid consensus.listen_port: %d", c.Consensus.ListenPort)
 	}
-	if c.RPC.ListenPort < 0 || c.RPC.ListenPort > math.MaxUint16 {
+	if c.RPC.ListenPort <= 0 || c.RPC.ListenPort > math.MaxUint16 {
 		return fmt.Errorf("invalid rpc.listen_port: %d", c.RPC.ListenPort)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/config.go` around lines 203 - 208, The current port checks for
c.Consensus.ListenPort and c.RPC.ListenPort allow 0; update the validation to
reject zero by changing the conditions from "< 0 || > math.MaxUint16" to "<= 0
|| > math.MaxUint16" (or explicitly add "|| c.Consensus.ListenPort == 0" and
same for c.RPC.ListenPort) so that ports must be in 1..65535 and return the same
formatted error messages if invalid.

164-191: ⚡ Quick win

Function name implies general IP detection but only returns IPv4.

The function localNonLoopbackIP only returns IPv4 addresses (line 185: ip.To4()), but the name suggests it would handle any IP version. The error message on line 190 correctly states "IPv4", but the function signature and name do not document this limitation.

If IPv6 support is not required, consider renaming to localNonLoopbackIPv4 for clarity. If IPv6 should be supported, extend the logic to handle IPv6 addresses.

📝 Option 1: Rename function to clarify IPv4-only behavior
-func localNonLoopbackIP() (string, error) {
+func localNonLoopbackIPv4() (string, error) {
 	ifaces, err := net.Interfaces()
📝 Option 2: Add IPv6 support
 			if ip4 := ip.To4(); ip4 != nil && !ip4.IsLoopback() {
 				return ip4.String(), nil
 			}
+			// Fallback to IPv6 if no IPv4 available
+			if ip.To16() != nil && !ip.IsLoopback() {
+				return ip.String(), nil
+			}
 		}
 	}
-	return "", fmt.Errorf("no non-loopback IPv4 address found")
+	return "", fmt.Errorf("no non-loopback IP address found")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/config.go` around lines 164 - 191, The function
localNonLoopbackIP currently only returns IPv4 (uses ip.To4()) but its
name/signature implies any IP; either rename it to localNonLoopbackIPv4 and
update its error text and all callers to reflect IPv4-only behavior (change the
function name and the error string "no non-loopback IPv4 address found"), or if
you need IPv6 support, modify localNonLoopbackIP to accept both families by
removing the ip.To4() check and returning ip.String() for non-loopback IPv4 and
IPv6 addresses (ensure callers handle possible IPv6 literals); update tests and
any usages accordingly.
node/hakeeper/rpc/server.go (2)

69-82: ⚡ Quick win

Consider graceful shutdown with timeout.

Stop() uses httpServer.Close(), which immediately closes all active connections. Using httpServer.Shutdown(ctx) with a timeout context would allow in-flight requests to complete gracefully.

♻️ Proposed graceful shutdown
 // Stop gracefully shuts down the server.
 func (s *Server) Stop() {
 	s.log.Info("Stopping hakeeper RPC server")
 	if s.httpServer != nil {
-		if err := s.httpServer.Close(); err != nil {
+		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+		defer cancel()
+		if err := s.httpServer.Shutdown(ctx); err != nil {
 			s.log.Error("hakeeper RPC server shutdown error", "err", err)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/rpc/server.go` around lines 69 - 82, Stop currently calls
s.httpServer.Close() which force-closes active connections; replace that with a
graceful shutdown using context.WithTimeout and s.httpServer.Shutdown(ctx)
(e.g., 5s or configurable) to allow in-flight requests to finish, log any
shutdown errors including context timeouts, and only call s.httpServer.Close()
as a last-resort fallback if Shutdown returns an error; ensure this occurs
inside Server.Stop before waiting on s.wg and before calling s.rpcServer.Stop().

56-67: ⚖️ Poor tradeoff

Start returns before server is listening.

Start() spawns a goroutine and returns immediately, before ListenAndServe confirms the server is listening. If the server fails to start (e.g., port already in use), the error is only logged inside the goroutine and the caller receives no indication of failure.

Consider using a channel to signal when the server is ready or if startup failed, so the caller can handle initialization errors appropriately.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/rpc/server.go` around lines 56 - 67, Start currently returns
before the server has bound its port; fix Start in Server to synchronously bind
the listening socket and only return nil after successful bind. Use net.Listen
to create a listener for s.httpServer.Addr (capture error and return it), then
spawn the goroutine that calls s.httpServer.Serve(listener) and keeps the
existing waitgroup/error logging logic; ensure Serve is used instead of
ListenAndServe so the goroutine uses the prebound listener and shutdown still
works with that listener.
node/hakeeper/rpc/backend.go (1)

10-11: ⚡ Quick win

Logger field is stored but never used.

The log field is passed to the constructor and stored but is not used in any of the delegate methods. Consider either adding appropriate logging (e.g., for errors, debug info) or removing the field if not needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/rpc/backend.go` around lines 10 - 11, The APIBackend struct
stores a log.Logger in the field named "log" but never uses it; either remove
the unused field from APIBackend (and its constructor NewAPIBackend) or wire it
into the delegate methods on APIBackend by replacing silent error returns with
process logging—use the "log" field to emit contextual messages (errors,
warnings, or debug) inside methods that call the underlying delegate (so any
method on APIBackend that currently forwards calls without logging should log
failures or notable events using a descriptive message and the err value).
node/hakeeper/rpc/client.go (1)

9-10: ⚡ Quick win

Consider making RPCNamespace a constant.

Since RPCNamespace is never modified, declare it as const instead of var to prevent accidental reassignment and clarify intent.

♻️ Proposed fix
-// RPCNamespace is the JSON-RPC namespace for the HA management API.
-var RPCNamespace = "ha"
+// RPCNamespace is the JSON-RPC namespace for the HA management API.
+const RPCNamespace = "ha"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/rpc/client.go` around lines 9 - 10, RPCNamespace is declared as
a package-level var but never modified; change its declaration to a const
(replace var RPCNamespace = "ha" with const RPCNamespace = "ha") to prevent
reassignment and clarify intent—update any references remain unchanged since
const has the same visibility and name RPCNamespace.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go-ethereum`:
- Line 1: The repository's go-ethereum submodule commit
(62952ec7d188953ee7b3224c3693baffa97c8084) does not match the commit referenced
in the Makefile and go.mod files (56deb7072ae467a12a850815c7a5c09b7c2782ba /
v0.0.0-20260508105911-56deb7072ae4); fix by making them consistent: either
update the submodule to the declared commit (56deb707...) or update the Makefile
and every go.mod to reference the actual submodule commit (62952ec7...).
Concretely, pick one source-of-truth, then update the Makefile variable(s) that
pin go-ethereum and all go.mod require/replace entries to the chosen
commit/version, and ensure the git submodule is checked out/synced to that
commit so builds are reproducible.

In `@node/cmd/node/main.go`:
- Around line 153-160: This branch starts l1Syncer via node.NewSyncer and
l1Syncer.Start but never assigns it to the outer syncer variable, so shutdown
only stops the original syncer and leaves l1Syncer running; fix by ensuring the
started syncer is the one cleaned up — either assign the outer syncer variable
to l1Syncer (e.g., syncer = l1Syncer) or update the cleanup path to fetch
executor.Syncer() and Stop() it; make the change in the branch where you call
node.NewSyncer, executor.SetSyncer, and l1Syncer.Start so the same syncer
instance is stopped on shutdown or error.

In `@node/hakeeper/ha_service.go`:
- Around line 165-198: Commit currently calls h.r.Apply with raftInfiniteTimeout
(0) which can block forever if quorum is lost; change Commit to use a finite
timeout instead of raftInfiniteTimeout (e.g., 30s–60s) when calling h.r.Apply,
add retry logic with exponential backoff and a maximum retry window around the
Apply call (use the returned Future f and its Response/Error handling to surface
failures), and integrate a circuit-breaker/health flag that trips after repeated
Apply timeouts to short-circuit further commits and emit an alert/metric; also
expose quorum/health state via a health-check endpoint so operators and
monitoring can detect and react to stalled Commit calls.

In `@node/hakeeper/rpc/client.go`:
- Around line 28-29: The DialAPIClient function currently prepends "http://" to
addr, forcing unencrypted connections; change it to use the provided addr as the
full URL (including scheme) when calling ethrpc.DialContext so callers can pass
"https://..." for TLS; ensure DialAPIClient (and any callers) expect a full URL
and optionally validate that addr has a valid scheme (http or https) before
calling ethrpc.DialContext to avoid silently malformed URLs.

In `@node/types/retryable_client.go`:
- Around line 120-135: The wrapper NewL2BlockV2 on RetryableClient is using the
shared rc.b backoff which is not thread-safe; create a fresh backoff per RPC by
calling newRetryBackoff() (or otherwise constructing/resetting a new backoff)
inside NewL2BlockV2 and pass that to backoff.Retry instead of rc.b; update the
call site in NewL2BlockV2 to use the fresh backoff and apply the same pattern to
the other RetryableClient wrapper methods so each RPC uses its own independent
backoff instance.

In `@ops/docker-sequencer-test/docker-compose.ha-override.yml`:
- Around line 127-128: Remove the hard-coded private key values for
MORPH_NODE_SEQUENCER_PRIVATE_KEY and instead reference an external secret or
environment variable; replace each literal assignment (the three occurrences of
MORPH_NODE_SEQUENCER_PRIVATE_KEY=0x...) with a variable reference (e.g. set
MORPH_NODE_SEQUENCER_PRIVATE_KEY=${MORPH_NODE_SEQUENCER_PRIVATE_KEY} in the
compose env block or switch the service to use a docker secret named
sequencer_private_key) and add a .env.example placeholder and/or instructions to
mount the real key via a secret or external .env (ensure the real .env is
gitignored). Ensure all occurrences (the three commits) are updated to use the
same external reference so the key is no longer stored in source control.
- Around line 120-125: The HA admin RPC port mappings currently publish the
admin endpoint on all interfaces (e.g., "- "9501:9401"" and the analogous
entries at the other occurrences), so change each host-side binding to bind to
localhost by prefixing the host port with 127.0.0.1 (e.g.,
"127.0.0.1:9501:9401") for the entries that expose port 9401; update the same
pattern at the other two occurrences referenced (the blocks around the other
port mappings) so the Raft admin RPC is only reachable from the host loopback.

In `@ops/docker-sequencer-test/run-ha-test.sh`:
- Around line 1166-1185: The test currently uses a future version
(stale_version_high = current_version + 100) which can be accepted and still
marked PASS; change the test to use a genuinely stale version (e.g.
stale_version_low = current_version - 1, guarding if current_version==0 use 0)
when calling ha_call with "ha_addServerAsVoter" so the request truly represents
an older optimistic-lock version, then assert failure by checking the ha_call
response contains an "error" or that get_membership_version(leader_rpc) remains
unchanged; update the record_test branches for "TC-API-08" to FAIL when the
stale request is accepted and the membership version changed or the member was
added, and PASS only when an error is returned or membership_version did not
change.
- Around line 667-680: The loop currently hardcodes "ha-node-1 ha-node-2" so it
skips any other followers (e.g., ha-node-0); update the loop to iterate over the
full set of HA nodes instead of only two (for example derive nodes from a
HA_NODES list or expand to "ha-node-0 ha-node-1 ha-node-2"), ensure you map each
node to its RPC variable (replace HA_RPC_NODE1/HA_RPC_NODE2 mapping with a
lookup that matches the node name to HA_RPC_NODE0/1/2 or to an associative
array), keep the existing is_ha_leader check and COMPOSE_HA logs/grep logic, and
append results to follower_produce_logs as before so every non-leader node is
inspected.
- Around line 1073-1077: The cleanup path deletes the wrong raft directory
because the constructed path for rm -rf doesn't match the directory naming used
in setup_ha_nodes_config (it produces names like .devnet/ha-node0 but the script
currently resolves to .devnet/ha-node-0); update the removal to compute the
exact node directory name the same way setup_ha_nodes_config does (derive the
target directory from target_follower_id using the same transformation or reuse
the variable/logic from setup_ha_nodes_config) so rm -rf
"$DOCKER_DIR/.devnet/<correct-node-dir>/raft" removes the stale raft state for
the intended node; ensure this change is applied where post_add_count check does
the force-recover (references: post_add_count, target_follower_id, DOCKER_DIR,
setup_ha_nodes_config).
- Around line 1412-1420: leader_height_before/after are being fetched from
HA_RPC_NODE0 (the admin RPC) which doesn't return L2 blockNumber; replace the HA
admin RPC with the current leader's geth RPC when calling get_block_number so
leader_height_* reflects the leader fullnode L2 height, and stop hard-coding
node 0 — use the variable that holds the current leader RPC endpoint (the same
leader identifier used elsewhere in the script) so gap = leader_height_after -
follower_height_after is valid; update the two calls to get_block_number that
set leader_height_before and leader_height_after accordingly.

In `@ops/docker-sequencer-test/run-perf-test.sh`:
- Around line 277-307: The produceBlock awk parser currently writes apply_ms
into the same variable as raft_commit_ms (both assign to commit), corrupting
raft_commit_ms; update the parser to parse apply_ms into its own variable (e.g.,
apply), add accumulators s_apply and trackers max_apply/min_apply alongside the
existing s_commit/max_commit/min_commit, increment s_apply and update
max_apply/min_apply in the per-line block, and add a printf row for "apply_ms:"
in the END block (similar to raft_commit_ms:) so apply_ms is reported separately
while raft_commit_ms remains correct.

In `@ops/docker-sequencer-test/run-test.sh`:
- Around line 530-533: The build_malicious_image function uses an undefined
BITGET_ROOT which causes cd to fail; fix by either initializing BITGET_ROOT from
an existing repo variable or using that existing variable directly. For example,
set BITGET_ROOT="${BITGET_ROOT:-$POLYREPO_ROOT}" near script initialization or
replace cd "$BITGET_ROOT" inside build_malicious_image with cd "$POLYREPO_ROOT"
(or cd "$MORPH_ROOT" if that is the intended repo), referencing
build_malicious_image, BITGET_ROOT, POLYREPO_ROOT and MORPH_ROOT to locate the
change.
- Around line 215-225: The try/except around the initializeHistory call is
swallowing all errors and masking real failures; change it so initializeHistory
failures on a fresh deploy surface: either remove the broad except and let
exceptions propagate from run_command('initializeHistory(address,uint64)', ...)
so the script fails on real errors, or catch the exception, inspect the error
text returned by run_command and only suppress it if it unmistakably indicates
"already initialized" (otherwise re-raise); update the block around run_command
/ initializeHistory / L1Sequencer to log and then raise for unexpected errors so
the test cannot continue without a valid sequencer history.
- Around line 534-552: The script currently stashes local changes, checks out
test/p2p-security, and runs docker build but does not guarantee cleanup on
failure; update the block that sets original_branch, runs git stash, git
checkout test/p2p-security and docker build to always restore the original
branch and stash state by installing a trap-based cleanup (or use a separate git
worktree) that on EXIT checks if original_branch is set and then runs git
checkout "$original_branch" and git stash pop (if stash was created), ensuring
the trap is registered immediately after computing original_branch and before
any potentially failing command (refer to the original_branch variable, git
stash, git checkout test/p2p-security and docker build -t
morph-node-malicious:latest).

---

Outside diff comments:
In `@contracts/deploy/022-SequencerInit.ts`:
- Around line 31-32: The code uses locale-dependent string normalization when
comparing addresses: replace all calls to toLocaleLowerCase() with toLowerCase()
(e.g., in the comparison involving IL1SequencerProxy.implementation() and
L1SequencerImplAddress and any other occurrences in this file such as the second
usage at the later comparison) so address comparisons are locale-insensitive;
update those invocations to call toLowerCase() on the strings being compared.

In `@ops/docker-sequencer-test/run-test.sh`:
- Around line 53-60: The get_block_number function currently prints an empty
string on parse failure which causes callers to see "integer expression
expected"; modify get_block_number to detect when the curl/grep/parse yields no
numeric block number (empty or non-numeric), write a clear error to stderr
including the rpc_url and raw response for debugging, and exit the function with
a non-zero status (return 1) so callers can retry or fail fast; keep the
existing signature and behavior on success (print numeric block and return 0).

---

Minor comments:
In `@node/core/executor.go`:
- Around line 382-387: The SetSyncer method currently assigns a nil *sync.Syncer
into e.syncer and e.l1MsgReader which yields a non-nil interface and breaks
later nil checks; update Executor.SetSyncer to reject nil by returning early (or
logging/erroring) if the provided *sync.Syncer is nil, and only wire s to
e.syncer and e.l1MsgReader when s != nil so that e.l1MsgReader comparisons
remain valid (refer to method SetSyncer and fields e.syncer and e.l1MsgReader).

In `@node/hakeeper/block_fsm.go`:
- Around line 48-59: The comment above the blockCh field in the BlockFSM struct
is out of sync: update the comment that currently says "Buffer of 200 gives
ample room for transient subscriber slowness." to reflect the actual buffer size
used in NewBlockFSM (1000). Change the comment near the blockCh declaration in
BlockFSM so it reads something like "Buffer of 1000 gives ample room for
transient subscriber slowness." to match the initialization in NewBlockFSM.

In `@node/hakeeper/config.go`:
- Line 152: The current check uses listenAddr != "0.0.0.0" which misses the IPv6
wildcard "::"; update the logic that decides whether to use listenAddr for the
advertised address to treat any unspecified IP as a wildcard. Parse listenAddr
with net.ParseIP and use parsedIP.IsUnspecified() (and still treat empty string
as wildcard) so both "0.0.0.0" and "::" are handled; change the conditional
around listenAddr used when building the advertised address (the code
referencing listenAddr and the advertised address construction) to only use
listenAddr when parsedIP is non-nil and not unspecified.

In `@node/hakeeper/rpc/client.go`:
- Around line 80-84: The ClusterMembership method currently always returns
&membership even when c.c.CallContext returns an error; update
APIClient.ClusterMembership so that after calling c.c.CallContext(ctx,
&membership, prefixRPC("clusterMembership")) you check the returned err and if
err != nil return nil, err, otherwise return &membership, nil; this uses the
APIClient.ClusterMembership, CallContext and ClusterMembership identifiers to
locate and implement the change.

In `@ops/docker-sequencer-test/run-perf-test.sh`:
- Around line 197-203: The do_start function currently ignores
wait_for_ha_leader failures and always prints log_ok "HA cluster running", so
change do_start to treat wait_for_ha_leader failure as fatal: call
wait_for_ha_leader 60 and if it fails, emit a clear error (use log_error or
log_warn) and return a non-zero status (or exit 1) so do_start does not print
log_ok nor allow do_run to proceed; only print log_ok "HA cluster running" when
wait_for_ha_leader succeeds. Ensure the change references wait_for_ha_leader,
do_start, do_run, and log_ok so the failure propagates and the perf harness
fails fast.

---

Nitpick comments:
In `@node/hakeeper/config.go`:
- Around line 203-208: The current port checks for c.Consensus.ListenPort and
c.RPC.ListenPort allow 0; update the validation to reject zero by changing the
conditions from "< 0 || > math.MaxUint16" to "<= 0 || > math.MaxUint16" (or
explicitly add "|| c.Consensus.ListenPort == 0" and same for c.RPC.ListenPort)
so that ports must be in 1..65535 and return the same formatted error messages
if invalid.
- Around line 164-191: The function localNonLoopbackIP currently only returns
IPv4 (uses ip.To4()) but its name/signature implies any IP; either rename it to
localNonLoopbackIPv4 and update its error text and all callers to reflect
IPv4-only behavior (change the function name and the error string "no
non-loopback IPv4 address found"), or if you need IPv6 support, modify
localNonLoopbackIP to accept both families by removing the ip.To4() check and
returning ip.String() for non-loopback IPv4 and IPv6 addresses (ensure callers
handle possible IPv6 literals); update tests and any usages accordingly.

In `@node/hakeeper/ha_service.go`:
- Around line 181-185: The current leader code panics when the FSM Apply
callback returns an error (the block handling path that checks f.Response() and
calls panic with "hakeeper: leader FSM.Apply failed"), which kills the leader
process; change this behavior in the leader FSM apply-handling (the block using
f.Response(), resp.(error) check and panic) to instead surface the error and
perform controlled handling: replace the panic with a bounded retry/backoff loop
that re-invokes the local apply handler (or requeues the Raft log application) a
configurable number of times, and if still failing, log the full error via the
existing logger and trigger a safe leadership relinquish (step down) or mark the
node unhealthy rather than terminating the process; ensure you preserve the
original error in logs and metrics and reference the same response check
(f.Response()) and error branch so callers can observe the failure mode.

In `@node/hakeeper/rpc/backend.go`:
- Around line 10-11: The APIBackend struct stores a log.Logger in the field
named "log" but never uses it; either remove the unused field from APIBackend
(and its constructor NewAPIBackend) or wire it into the delegate methods on
APIBackend by replacing silent error returns with process logging—use the "log"
field to emit contextual messages (errors, warnings, or debug) inside methods
that call the underlying delegate (so any method on APIBackend that currently
forwards calls without logging should log failures or notable events using a
descriptive message and the err value).

In `@node/hakeeper/rpc/client.go`:
- Around line 9-10: RPCNamespace is declared as a package-level var but never
modified; change its declaration to a const (replace var RPCNamespace = "ha"
with const RPCNamespace = "ha") to prevent reassignment and clarify
intent—update any references remain unchanged since const has the same
visibility and name RPCNamespace.

In `@node/hakeeper/rpc/server.go`:
- Around line 69-82: Stop currently calls s.httpServer.Close() which
force-closes active connections; replace that with a graceful shutdown using
context.WithTimeout and s.httpServer.Shutdown(ctx) (e.g., 5s or configurable) to
allow in-flight requests to finish, log any shutdown errors including context
timeouts, and only call s.httpServer.Close() as a last-resort fallback if
Shutdown returns an error; ensure this occurs inside Server.Stop before waiting
on s.wg and before calling s.rpcServer.Stop().
- Around line 56-67: Start currently returns before the server has bound its
port; fix Start in Server to synchronously bind the listening socket and only
return nil after successful bind. Use net.Listen to create a listener for
s.httpServer.Addr (capture error and return it), then spawn the goroutine that
calls s.httpServer.Serve(listener) and keeps the existing waitgroup/error
logging logic; ensure Serve is used instead of ListenAndServe so the goroutine
uses the prebound listener and shutdown still works with that listener.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 35229859-a993-4347-996e-549d05c91f82

📥 Commits

Reviewing files that changed from the base of the PR and between 58c70f9 and a64bbfd.

⛔ Files ignored due to path filters (9)
  • bindings/go.sum is excluded by !**/*.sum
  • contracts/go.sum is excluded by !**/*.sum
  • go.work.sum is excluded by !**/*.sum
  • node/go.sum is excluded by !**/*.sum
  • ops/l2-genesis/go.sum is excluded by !**/*.sum
  • ops/tools/go.sum is excluded by !**/*.sum
  • oracle/go.sum is excluded by !**/*.sum
  • token-price-oracle/go.sum is excluded by !**/*.sum
  • tx-submitter/go.sum is excluded by !**/*.sum
📒 Files selected for processing (43)
  • Makefile
  • bindings/bindings/l1sequencer.go
  • bindings/go.mod
  • contracts/contracts/l1/L1Sequencer.sol
  • contracts/contracts/test/L1Sequencer.t.sol
  • contracts/contracts/test/base/L1SequencerBase.t.sol
  • contracts/deploy/022-SequencerInit.ts
  • contracts/go.mod
  • go-ethereum
  • node/cmd/node/main.go
  • node/core/executor.go
  • node/flags/flags.go
  • node/go.mod
  • node/hakeeper/block_fsm.go
  • node/hakeeper/block_payload.go
  • node/hakeeper/config.go
  • node/hakeeper/ha.toml.example
  • node/hakeeper/ha_service.go
  • node/hakeeper/leader_monitor.go
  • node/hakeeper/rpc/api.go
  • node/hakeeper/rpc/auth.go
  • node/hakeeper/rpc/auth_test.go
  • node/hakeeper/rpc/backend.go
  • node/hakeeper/rpc/client.go
  • node/hakeeper/rpc/server.go
  • node/hakeeper/rpc/types.go
  • node/l1sequencer/signer.go
  • node/l1sequencer/verifier.go
  • node/sequencer/tm_node.go
  • node/sync/syncer.go
  • node/types/retryable_client.go
  • node/types/retryable_client_test.go
  • ops/docker-sequencer-test/docker-compose.ha-override.yml
  • ops/docker-sequencer-test/docker-compose.override.yml
  • ops/docker-sequencer-test/run-ha-test.sh
  • ops/docker-sequencer-test/run-perf-test.sh
  • ops/docker-sequencer-test/run-test.sh
  • ops/docker/.env
  • ops/l2-genesis/go.mod
  • ops/tools/go.mod
  • oracle/go.mod
  • token-price-oracle/go.mod
  • tx-submitter/go.mod

Comment thread go-ethereum
@@ -1 +1 @@
Subproject commit 045be0fdc7ca6f80e18eb4e26f7452500292ccec
Subproject commit 62952ec7d188953ee7b3224c3693baffa97c8084
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check go-ethereum version consistency across submodule, go.mod files, and Makefile

echo "=== Submodule commit ==="
cat go-ethereum

echo -e "\n=== Makefile go-ethereum pins ==="
rg -n "ETHEREUM_SUBMODULE_COMMIT_OR_TAG|ETHEREUM_TARGET_VERSION" Makefile

echo -e "\n=== go.mod replace directives for go-ethereum ==="
fd -e mod -x rg -n "github.com/morph-l2/go-ethereum" {}

echo -e "\n=== Extracting commit from pseudo-version (if present) ==="
# Pseudo-version format: v0.0.0-yyyymmddhhmmss-abcdefabcdef
fd -e mod -x rg -oP "github.com/morph-l2/go-ethereum.*v\d+\.\d+\.\d+-\d{14}-\K[a-f0-9]{12}" {}

Repository: morph-l2/morph

Length of output: 2224


🏁 Script executed:

git config -f .gitmodules --get-regexp 'go-ethereum' || echo "No .gitmodules found"
git submodule status go-ethereum 2>/dev/null || echo "Submodule status failed, checking .git/modules"
cat .git/modules/go-ethereum/HEAD 2>/dev/null || echo "No HEAD file"
cd go-ethereum && git rev-parse HEAD 2>/dev/null || echo "Cannot get HEAD from submodule"

Repository: morph-l2/morph

Length of output: 275


Fix submodule commit mismatch with version declarations.

The go-ethereum submodule is currently at commit 62952ec7d188953ee7b3224c3693baffa97c8084, but the Makefile and all go.mod files declare a different commit: 56deb7072ae467a12a850815c7a5c09b7c2782ba (referenced as version v0.0.0-20260508105911-56deb7072ae4 in go.mod). This inconsistency breaks reproducible builds and will cause version mismatches between the actual submodule and the declared dependency version. Either update the submodule to the Makefile's intended commit, or update the Makefile and all go.mod files to point to the current submodule commit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go-ethereum` at line 1, The repository's go-ethereum submodule commit
(62952ec7d188953ee7b3224c3693baffa97c8084) does not match the commit referenced
in the Makefile and go.mod files (56deb7072ae467a12a850815c7a5c09b7c2782ba /
v0.0.0-20260508105911-56deb7072ae4); fix by making them consistent: either
update the submodule to the declared commit (56deb707...) or update the Makefile
and every go.mod to reference the actual submodule commit (62952ec7...).
Concretely, pick one source-of-truth, then update the Makefile variable(s) that
pin go-ethereum and all go.mod require/replace entries to the chosen
commit/version, and ensure the git submodule is checked out/synced to that
commit so builds are reproducible.

Comment thread node/cmd/node/main.go
Comment on lines +153 to +160
if signer != nil && executor.Syncer() == nil {
l1Syncer, err := node.NewSyncer(ctx, home, nodeConfig)
if err != nil {
return fmt.Errorf("failed to init L1 syncer for post-upgrade sequencer: %w", err)
}
executor.SetSyncer(l1Syncer)
l1Syncer.Start()
nodeConfig.Logger.Info("L1 syncer start", "reason", "post-upgrade sequencer not in PBFT validator set")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Track and stop the eager syncer.

This branch starts l1Syncer, but the shutdown path only stops the outer syncer variable. A later startup error or a clean shutdown leaves this polling loop and its RPC client running until process exit. Reuse the outer variable here, or stop executor.Syncer() during cleanup.

Suggested fix
-			l1Syncer, err := node.NewSyncer(ctx, home, nodeConfig)
+			syncer, err = node.NewSyncer(ctx, home, nodeConfig)
 			if err != nil {
 				return fmt.Errorf("failed to init L1 syncer for post-upgrade sequencer: %w", err)
 			}
-			executor.SetSyncer(l1Syncer)
-			l1Syncer.Start()
+			executor.SetSyncer(syncer)
+			syncer.Start()
 			nodeConfig.Logger.Info("L1 syncer start", "reason", "post-upgrade sequencer not in PBFT validator set")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if signer != nil && executor.Syncer() == nil {
l1Syncer, err := node.NewSyncer(ctx, home, nodeConfig)
if err != nil {
return fmt.Errorf("failed to init L1 syncer for post-upgrade sequencer: %w", err)
}
executor.SetSyncer(l1Syncer)
l1Syncer.Start()
nodeConfig.Logger.Info("L1 syncer start", "reason", "post-upgrade sequencer not in PBFT validator set")
if signer != nil && executor.Syncer() == nil {
syncer, err = node.NewSyncer(ctx, home, nodeConfig)
if err != nil {
return fmt.Errorf("failed to init L1 syncer for post-upgrade sequencer: %w", err)
}
executor.SetSyncer(syncer)
syncer.Start()
nodeConfig.Logger.Info("L1 syncer start", "reason", "post-upgrade sequencer not in PBFT validator set")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/cmd/node/main.go` around lines 153 - 160, This branch starts l1Syncer
via node.NewSyncer and l1Syncer.Start but never assigns it to the outer syncer
variable, so shutdown only stops the original syncer and leaves l1Syncer
running; fix by ensuring the started syncer is the one cleaned up — either
assign the outer syncer variable to l1Syncer (e.g., syncer = l1Syncer) or update
the cleanup path to fetch executor.Syncer() and Stop() it; make the change in
the branch where you call node.NewSyncer, executor.SetSyncer, and l1Syncer.Start
so the same syncer instance is stopped on shutdown or error.

Comment on lines +165 to +198
func (h *HAService) Commit(block *types.BlockV2) error {
t0 := time.Now()

data, err := encodeBlock(block)
if err != nil {
return fmt.Errorf("Commit: encode: %w", err)
}
encodeDur := time.Since(t0)

t1 := time.Now()
f := h.r.Apply(data, raftInfiniteTimeout)
if err := f.Error(); err != nil {
return err
}
raftDur := time.Since(t1)

if resp := f.Response(); resp != nil {
if err, ok := resp.(error); ok {
panic(fmt.Sprintf("hakeeper: leader FSM.Apply failed: %v", err))
}
}

totalDur := time.Since(t0)
h.logger.Debug("[PERF] HAService.Commit",
"height", block.Number,
"encode_ms", float64(encodeDur.Microseconds())/1000.0,
"raft_ms", float64(raftDur.Microseconds())/1000.0,
"total_ms", float64(totalDur.Microseconds())/1000.0,
"dataBytes", len(data),
"txCount", len(block.Transactions),
)

return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Infinite timeout on Raft.Apply can hang the sequencer indefinitely.

Line 175 uses raftInfiniteTimeout (0), which makes the leader block forever until quorum is achieved. If the cluster loses quorum (e.g., network partition, follower crashes), Commit() will hang indefinitely, freezing block production.

While this ensures correctness (no blocks committed without quorum), it creates an operational hazard: the sequencer appears unresponsive with no automatic recovery. Operators must restore quorum manually.

Consider adding:

  • A finite timeout (e.g., 30s-60s) with retry logic and alerting
  • Health check endpoints that report quorum loss
  • Circuit-breaker pattern to detect prolonged failures

Alternatively, document this behavior explicitly in operator runbooks so that monitoring can detect hung Commit() calls via metrics (e.g., time since last block).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/ha_service.go` around lines 165 - 198, Commit currently calls
h.r.Apply with raftInfiniteTimeout (0) which can block forever if quorum is
lost; change Commit to use a finite timeout instead of raftInfiniteTimeout
(e.g., 30s–60s) when calling h.r.Apply, add retry logic with exponential backoff
and a maximum retry window around the Apply call (use the returned Future f and
its Response/Error handling to surface failures), and integrate a
circuit-breaker/health flag that trips after repeated Apply timeouts to
short-circuit further commits and emit an alert/metric; also expose
quorum/health state via a health-check endpoint so operators and monitoring can
detect and react to stalled Commit calls.

Comment on lines +28 to +29
func DialAPIClient(ctx context.Context, addr string, token string) (*APIClient, error) {
c, err := ethrpc.DialContext(ctx, "http://"+addr)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hardcoded HTTP scheme prevents TLS encryption.

The function hardcodes "http://", forcing all RPC communication to be unencrypted. The management API handles sensitive cluster operations (leader transfer, membership changes) that should be protected in production.

🔒 Proposed fix to support both HTTP and HTTPS
-func DialAPIClient(ctx context.Context, addr string, token string) (*APIClient, error) {
-	c, err := ethrpc.DialContext(ctx, "http://"+addr)
+// DialAPIClient dials a hakeeper RPC server at the given address and returns
+// an APIClient. addr should include the scheme (http:// or https://).
+// token is sent as the Authorization header on every request;
+// pass empty string if the server has no auth configured.
+// The caller is responsible for calling Close() when done.
+func DialAPIClient(ctx context.Context, addr string, token string) (*APIClient, error) {
+	c, err := ethrpc.DialContext(ctx, addr)

Callers should then pass the full URL with scheme, e.g., "https://127.0.0.1:8545".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/hakeeper/rpc/client.go` around lines 28 - 29, The DialAPIClient function
currently prepends "http://" to addr, forcing unencrypted connections; change it
to use the provided addr as the full URL (including scheme) when calling
ethrpc.DialContext so callers can pass "https://..." for TLS; ensure
DialAPIClient (and any callers) expect a full URL and optionally validate that
addr has a valid scheme (http or https) before calling ethrpc.DialContext to
avoid silently malformed URLs.

Comment on lines +120 to +135
func (rc *RetryableClient) NewL2BlockV2(ctx context.Context, executableL2Data *catalyst.ExecutableL2Data, isSafe bool) (err error) {
if retryErr := backoff.Retry(func() error {
respErr := rc.authClient.NewL2BlockV2(ctx, executableL2Data, isSafe)
if respErr != nil {
rc.logger.Error("NewL2BlockV2 failed",
"block_number", executableL2Data.Number,
"isSafe", isSafe,
"error", respErr)
if retryableError(respErr) {
return respErr
}
err = respErr
}
return nil
}, rc.b); retryErr != nil {
return retryErr
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🌐 Web query:

github.com/cenkalti/backoff/v4 ExponentialBackOff Reset before reuse concurrent use thread safe

💡 Result:

The ExponentialBackOff implementation in the cenkalti/backoff library (including v4 and v5) is explicitly not thread-safe [1][2][3]. The library maintainer has explicitly declined requests to add internal mutexes to the ExponentialBackOff struct, stating a preference to keep the library minimal and requiring users to handle thread safety externally if needed [4]. Key points regarding concurrent use and resetting: 1. Thread Safety: Because the implementation is not thread-safe, you must not access or modify an ExponentialBackOff instance (including calling NextBackOff or Reset) from multiple goroutines simultaneously [5][4]. 2. Resetting: The Reset method is intended to restart the backoff timer and interval [1]. Like other methods on the struct, it is not thread-safe and must be protected by external synchronization (such as a sync.Mutex) if the instance is shared across goroutines [5][4]. 3. Recommended Approach: If you require thread-safe behavior, you should wrap the ExponentialBackOff instance with your own synchronization logic (e.g., a mutex) in your application code [4]. Alternatively, for many common use cases, the library recommends using the provided Retry function, which manages the backoff lifecycle internally [6][7].

Citations:


🏁 Script executed:

# Find all wrapper methods in RetryableClient that use rc.b
rg -A 5 'func \(rc \*RetryableClient\)' node/types/retryable_client.go | head -100

Repository: morph-l2/morph

Length of output: 4454


🏁 Script executed:

# Check how RetryableClient methods are called in the codebase
rg 'rc\.' node/types/retryable_client.go | wc -l

Repository: morph-l2/morph

Length of output: 61


🏁 Script executed:

# Find usages of RetryableClient in other files
rg -l 'RetryableClient' node/ --type go | head -20

Repository: morph-l2/morph

Length of output: 192


🏁 Script executed:

# Check if RetryableClient methods are called in concurrent contexts
rg -B 2 -A 2 'go.*rc\.' node/ --type go | head -50

Repository: morph-l2/morph

Length of output: 40


🏁 Script executed:

# Look for how RetryableClient instance is passed and used in executors/sequencers
rg 'NewRetryableClient\|retryClient' node/sequencer/tm_node.go -A 3 -B 3

Repository: morph-l2/morph

Length of output: 40


🏁 Script executed:

# Check for concurrent invocations in derivation
rg '\brc\.' node/derivation/derivation.go | head -20

Repository: morph-l2/morph

Length of output: 40


🏁 Script executed:

# Get more context on how retryClient is used in the codebase
rg 'retryClient' node/sequencer/ -B 1 -A 1 --type go

Repository: morph-l2/morph

Length of output: 40


🏁 Script executed:

# Check executor.go for retryClient usage
rg 'retryClient\|rc' node/core/executor.go -B 1 -A 1 --type go | head -40

Repository: morph-l2/morph

Length of output: 40


🏁 Script executed:

# Look at the actual RetryableClient struct definition and initialization
cat -n node/types/retryable_client.go | head -50

Repository: morph-l2/morph

Length of output: 2173


🏁 Script executed:

# Read the NewRetryableClient function
sed -n '51,70p' node/types/retryable_client.go

Repository: morph-l2/morph

Length of output: 887


🏁 Script executed:

# Check how RetryableClient is instantiated and stored (look for fields in parent structs)
rg 'retryClient|RetryableClient' node/sequencer/ node/core/ node/derivation/ node/blocktag/ -B 2 -A 2 --type go | head -80

Repository: morph-l2/morph

Length of output: 2712


🏁 Script executed:

# Search for concurrent method calls on retryClient or similar patterns
rg 'go.*\(.*\)|goroutine|spawn' node/sequencer/tm_node.go -B 2 -A 2 --type go | head -50

Repository: morph-l2/morph

Length of output: 40


🏁 Script executed:

# Look for goroutine spawning in modules that use RetryableClient
rg 'go func' node/sequencer/ node/core/ node/derivation/ node/blocktag/ --type go | head -30

Repository: morph-l2/morph

Length of output: 145


🏁 Script executed:

# Check if the Executor or other structs spawn goroutines that call l2Client methods
rg -A 5 'func.*l2Client' node/core/executor.go --type go | head -40

Repository: morph-l2/morph

Length of output: 40


🏁 Script executed:

# Look for concurrent calls to l2Client methods in derivation
rg 'l2Client\.' node/derivation/derivation.go --type go -B 1 -A 1 | head -50

Repository: morph-l2/morph

Length of output: 689


🏁 Script executed:

# Get full context of the goroutines in derivation.go
rg -B 10 -A 10 'go func' node/derivation/derivation.go --type go | head -100

Repository: morph-l2/morph

Length of output: 1237


🏁 Script executed:

# Check if multiple goroutines access l2Client concurrently
cat -n node/derivation/derivation.go | grep -A 30 'go func'

Repository: morph-l2/morph

Length of output: 2142


🏁 Script executed:

# Check if there's any synchronization (mutex) around rc.b or the RetryableClient
rg 'sync.Mutex|Lock|Unlock' node/types/retryable_client.go --type go

Repository: morph-l2/morph

Length of output: 40


🏁 Script executed:

# Verify: the same retryClient instance is used in a loop calling multiple methods
sed -n '149,170p' node/derivation/derivation.go

Repository: morph-l2/morph

Length of output: 420


🏁 Script executed:

# Check derivationBlock method to see all l2Client calls in the main loop
rg -A 50 'func.*derivationBlock' node/derivation/derivation.go | head -80

Repository: morph-l2/morph

Length of output: 1897


🏁 Script executed:

# Check if this PR actually implements the fix or just adds NewL2BlockV2
# Look at the full retryable_client.go to see the struct definition
wc -l node/types/retryable_client.go

Repository: morph-l2/morph

Length of output: 93


🏁 Script executed:

# Check the end of the file to see if there's a fix for the backoff reuse
tail -50 node/types/retryable_client.go

Repository: morph-l2/morph

Length of output: 1794


🏁 Script executed:

# Verify the PR only added NewL2BlockV2 without fixing the shared backoff issue
git diff --stat 2>/dev/null | grep retryable_client || rg 'newRetryBackoff\|fresh' node/types/retryable_client.go --type go

Repository: morph-l2/morph

Length of output: 40


Use a fresh backoff per RPC call.

This new wrapper still reuses the shared rc.b field. Since github.com/cenkalti/backoff/v4's ExponentialBackOff is not thread-safe and RetryableClient is shared across goroutines (e.g., in Derivation.Start()), concurrent calls to different methods can corrupt retry state and interfere with each other's backoff schedules. Create and reset a fresh backoff for each retry attempt instead of reusing the shared field.

Suggested approach
 type RetryableClient struct {
 	authClient *authclient.Client
 	ethClient  *ethclient.Client
-	b          backoff.BackOff
 	logger     tmlog.Logger
 }

+func newRetryBackoff() backoff.BackOff {
+	bo := backoff.NewExponentialBackOff()
+	bo.MaxElapsedTime = GethRetryMaxElapsedTime
+	bo.Reset()
+	return bo
+}
+
 func NewRetryableClient(authClient *authclient.Client, ethClient *ethclient.Client, logger tmlog.Logger) *RetryableClient {
 	logger = logger.With("module", "retryClient")
-	bo := backoff.NewExponentialBackOff()
-	bo.MaxElapsedTime = GethRetryMaxElapsedTime
 	return &RetryableClient{
 		authClient: authClient,
 		ethClient:  ethClient,
-		b:          bo,
 		logger:     logger,
 	}
 }

Then swap rc.b for newRetryBackoff() in each wrapper, including NewL2BlockV2.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/types/retryable_client.go` around lines 120 - 135, The wrapper
NewL2BlockV2 on RetryableClient is using the shared rc.b backoff which is not
thread-safe; create a fresh backoff per RPC by calling newRetryBackoff() (or
otherwise constructing/resetting a new backoff) inside NewL2BlockV2 and pass
that to backoff.Retry instead of rc.b; update the call site in NewL2BlockV2 to
use the fresh backoff and apply the same pattern to the other RetryableClient
wrapper methods so each RPC uses its own independent backoff instance.

Comment on lines +1412 to +1420
leader_height_before=$(get_block_number "$HA_RPC_NODE0")
follower_height_before=$(get_block_number "$L2_RPC_NODE0")
sleep 10
leader_height_after=$(get_block_number "$HA_RPC_NODE0")
follower_height_after=$(get_block_number "$L2_RPC_NODE0")

local follower_delta=$((follower_height_after - follower_height_before))
local gap=$((leader_height_after - follower_height_after))
if [ "$follower_delta" -ge 1 ] && [ "$gap" -lt 10 ]; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

TC-P2P-01 is reading block height from the HA admin RPC, not the leader geth RPC.

HA_RPC_NODE0 is the admin endpoint, so eth_blockNumber here returns no L2 height. That makes gap invalid and can let the test pass even when the fullnode is far behind. It also hard-codes node 0 instead of the current leader.

Suggested fix
-    leader_height_before=$(get_block_number "$HA_RPC_NODE0")
+    local leader_rpc leader_geth_rpc
+    leader_rpc=$(find_leader_rpc)
+    leader_geth_rpc=$(ha_rpc_to_geth_rpc "$leader_rpc")
+    leader_height_before=$(get_block_number "$leader_geth_rpc")
     follower_height_before=$(get_block_number "$L2_RPC_NODE0")
     sleep 10
-    leader_height_after=$(get_block_number "$HA_RPC_NODE0")
+    leader_height_after=$(get_block_number "$leader_geth_rpc")
     follower_height_after=$(get_block_number "$L2_RPC_NODE0")
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 1412-1412: leader_height_before appears unused. Verify use (or export if used externally).

(SC2034)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ops/docker-sequencer-test/run-ha-test.sh` around lines 1412 - 1420,
leader_height_before/after are being fetched from HA_RPC_NODE0 (the admin RPC)
which doesn't return L2 blockNumber; replace the HA admin RPC with the current
leader's geth RPC when calling get_block_number so leader_height_* reflects the
leader fullnode L2 height, and stop hard-coding node 0 — use the variable that
holds the current leader RPC endpoint (the same leader identifier used elsewhere
in the script) so gap = leader_height_after - follower_height_after is valid;
update the two calls to get_block_number that set leader_height_before and
leader_height_after accordingly.

Comment on lines +277 to +307
grep 'produceBlock' "$logfile" | awk '
{
build=0; sign=0; commit=0; total=0; tx=0; gas=0
for(i=1;i<=NF;i++) {
if($i ~ /build_ms=/) { split($i,a,"="); build=a[2]+0 }
if($i ~ /sign_ms=/) { split($i,a,"="); sign=a[2]+0 }
if($i ~ /raft_commit_ms=/) { split($i,a,"="); commit=a[2]+0 }
if($i ~ /apply_ms=/) { split($i,a,"="); commit=a[2]+0 }
if($i ~ /total_ms=/) { split($i,a,"="); total=a[2]+0 }
if($i ~ /txCount=/) { split($i,a,"="); tx=a[2]+0 }
if($i ~ /gasUsed=/) { split($i,a,"="); gas=a[2]+0 }
}
n++; s_build+=build; s_sign+=sign; s_commit+=commit; s_total+=total; s_tx+=tx; s_gas+=gas
if(build>max_build) max_build=build
if(commit>max_commit) max_commit=commit
if(total>max_total) max_total=total
if(n==1 || build<min_build) min_build=build
if(n==1 || commit<min_commit) min_commit=commit
if(n==1 || total<min_total) min_total=total
}
END {
if(n>0) {
printf " %-18s avg=%-10.2f min=%-10.2f max=%.2f\n", "build_ms:", s_build/n, min_build, max_build
printf " %-18s avg=%-10.2f min=%-10.2f max=%.2f\n", "sign_ms:", s_sign/n, 0, 0
printf " %-18s avg=%-10.2f min=%-10.2f max=%.2f\n", "raft_commit_ms:", s_commit/n, min_commit, max_commit
printf " %-18s avg=%-10.2f min=%-10.2f max=%.2f\n", "total_ms:", s_total/n, min_total, max_total
printf " %-18s avg=%.1f\n", "txCount:", s_tx/n
printf " %-18s avg=%.0f\n", "gasUsed:", s_gas/n
}
}'
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bug: apply_ms clobbers raft_commit_ms in the produceBlock awk parser.

Lines 283 and 284 both assign to the same accumulator (commit):

if($i ~ /raft_commit_ms=/) { split($i,a,"="); commit=a[2]+0 }
if($i ~ /apply_ms=/)  { split($i,a,"="); commit=a[2]+0 }

Whichever field appears later on the log line wins, so the summary printed as raft_commit_ms: (line 301) actually reports apply_ms (or a mix, depending on field order). apply_ms is also never surfaced in its own row even though the parser pretends to read it. This silently corrupts one of the headline metrics the script exists to produce.

🐛 Proposed fix: track `apply` separately and print it
             {
-                build=0; sign=0; commit=0; total=0; tx=0; gas=0
+                build=0; sign=0; commit=0; apply=0; total=0; tx=0; gas=0
                 for(i=1;i<=NF;i++) {
                     if($i ~ /build_ms=/)  { split($i,a,"="); build=a[2]+0 }
                     if($i ~ /sign_ms=/)   { split($i,a,"="); sign=a[2]+0 }
                     if($i ~ /raft_commit_ms=/) { split($i,a,"="); commit=a[2]+0 }
-                    if($i ~ /apply_ms=/)  { split($i,a,"="); commit=a[2]+0 }
+                    if($i ~ /apply_ms=/)  { split($i,a,"="); apply=a[2]+0 }
                     if($i ~ /total_ms=/)  { split($i,a,"="); total=a[2]+0 }
                     if($i ~ /txCount=/)   { split($i,a,"="); tx=a[2]+0 }
                     if($i ~ /gasUsed=/)   { split($i,a,"="); gas=a[2]+0 }
                 }
-                n++; s_build+=build; s_sign+=sign; s_commit+=commit; s_total+=total; s_tx+=tx; s_gas+=gas
+                n++; s_build+=build; s_sign+=sign; s_commit+=commit; s_apply+=apply; s_total+=total; s_tx+=tx; s_gas+=gas
                 if(build>max_build) max_build=build
                 if(commit>max_commit) max_commit=commit
                 if(total>max_total) max_total=total
                 if(n==1 || build<min_build) min_build=build
                 if(n==1 || commit<min_commit) min_commit=commit
                 if(n==1 || total<min_total) min_total=total
             }
             END {
                 if(n>0) {
                     printf "  %-18s avg=%-10.2f min=%-10.2f max=%.2f\n", "build_ms:", s_build/n, min_build, max_build
-                    printf "  %-18s avg=%-10.2f min=%-10.2f max=%.2f\n", "sign_ms:", s_sign/n, 0, 0
+                    printf "  %-18s avg=%-10.2f\n", "sign_ms:", s_sign/n
                     printf "  %-18s avg=%-10.2f min=%-10.2f max=%.2f\n", "raft_commit_ms:", s_commit/n, min_commit, max_commit
+                    printf "  %-18s avg=%-10.2f\n", "apply_ms:", s_apply/n
                     printf "  %-18s avg=%-10.2f min=%-10.2f max=%.2f\n", "total_ms:", s_total/n, min_total, max_total
                     printf "  %-18s avg=%.1f\n", "txCount:", s_tx/n
                     printf "  %-18s avg=%.0f\n", "gasUsed:", s_gas/n
                 }
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ops/docker-sequencer-test/run-perf-test.sh` around lines 277 - 307, The
produceBlock awk parser currently writes apply_ms into the same variable as
raft_commit_ms (both assign to commit), corrupting raft_commit_ms; update the
parser to parse apply_ms into its own variable (e.g., apply), add accumulators
s_apply and trackers max_apply/min_apply alongside the existing
s_commit/max_commit/min_commit, increment s_apply and update max_apply/min_apply
in the per-line block, and add a printf row for "apply_ms:" in the END block
(similar to raft_commit_ms:) so apply_ms is reported separately while
raft_commit_ms remains correct.

Comment on lines +215 to +225
try:
run_command([
'cast', 'send', l1_sequencer_addr,
'initializeHistory(address,uint64)',
sequencer_addr, str(upgrade_height),
'--rpc-url', 'http://127.0.0.1:9545',
'--private-key', deployer_pk
])
log.info('L1Sequencer history initialized successfully')
except Exception as e:
log.info(f'L1Sequencer initializeHistory failed (may already be initialized): {e}')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't ignore initializeHistory failures on a fresh deployment.

This script redeploys Proxy__L1Sequencer just above, so “may already be initialized” should not be the normal case here. Catching every exception and continuing can start the rest of the test with no sequencer history and hide the real setup failure.

Suggested direction
-    except Exception as e:
-        log.info(f'L1Sequencer initializeHistory failed (may already be initialized): {e}')
+    except Exception as e:
+        raise RuntimeError(f'L1Sequencer initializeHistory failed: {e}') from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
run_command([
'cast', 'send', l1_sequencer_addr,
'initializeHistory(address,uint64)',
sequencer_addr, str(upgrade_height),
'--rpc-url', 'http://127.0.0.1:9545',
'--private-key', deployer_pk
])
log.info('L1Sequencer history initialized successfully')
except Exception as e:
log.info(f'L1Sequencer initializeHistory failed (may already be initialized): {e}')
try:
run_command([
'cast', 'send', l1_sequencer_addr,
'initializeHistory(address,uint64)',
sequencer_addr, str(upgrade_height),
'--rpc-url', 'http://127.0.0.1:9545',
'--private-key', deployer_pk
])
log.info('L1Sequencer history initialized successfully')
except Exception as e:
raise RuntimeError(f'L1Sequencer initializeHistory failed: {e}') from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ops/docker-sequencer-test/run-test.sh` around lines 215 - 225, The try/except
around the initializeHistory call is swallowing all errors and masking real
failures; change it so initializeHistory failures on a fresh deploy surface:
either remove the broad except and let exceptions propagate from
run_command('initializeHistory(address,uint64)', ...) so the script fails on
real errors, or catch the exception, inspect the error text returned by
run_command and only suppress it if it unmistakably indicates "already
initialized" (otherwise re-raise); update the block around run_command /
initializeHistory / L1Sequencer to log and then raise for unexpected errors so
the test cannot continue without a valid sequencer history.

Comment on lines +530 to +533
build_malicious_image() {
log_info "Building malicious node image from test/p2p-security branch..."
cd "$BITGET_ROOT"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

build-malicious uses an undefined repo root.

The script defines MORPH_ROOT and POLYREPO_ROOT, but not BITGET_ROOT, so this command exits on the first cd before it builds anything.

Suggested fix
 build_malicious_image() {
     log_info "Building malicious node image from test/p2p-security branch..."
-    cd "$BITGET_ROOT"
+    cd "$POLYREPO_ROOT"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
build_malicious_image() {
log_info "Building malicious node image from test/p2p-security branch..."
cd "$BITGET_ROOT"
build_malicious_image() {
log_info "Building malicious node image from test/p2p-security branch..."
cd "$POLYREPO_ROOT"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ops/docker-sequencer-test/run-test.sh` around lines 530 - 533, The
build_malicious_image function uses an undefined BITGET_ROOT which causes cd to
fail; fix by either initializing BITGET_ROOT from an existing repo variable or
using that existing variable directly. For example, set
BITGET_ROOT="${BITGET_ROOT:-$POLYREPO_ROOT}" near script initialization or
replace cd "$BITGET_ROOT" inside build_malicious_image with cd "$POLYREPO_ROOT"
(or cd "$MORPH_ROOT" if that is the intended repo), referencing
build_malicious_image, BITGET_ROOT, POLYREPO_ROOT and MORPH_ROOT to locate the
change.

Comment on lines +534 to +552
# Save current tendermint branch state
cd tendermint
local original_branch
original_branch=$(git branch --show-current)
git stash 2>/dev/null || true

# Switch to malicious branch
git checkout test/p2p-security
cd "$BITGET_ROOT"

# Build using same Dockerfile, different tag
docker build -t morph-node-malicious:latest \
-f morph/ops/docker-sequencer-test/Dockerfile.l2-node-test .

# Switch back
cd tendermint
git checkout "$original_branch"
git stash pop 2>/dev/null || true
cd "$BITGET_ROOT"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always restore the tendermint checkout.

This function stashes changes and checks out test/p2p-security, but with set -e any failing git checkout or docker build leaves the repo on the wrong branch with a pending stash. Restore in a trap or use a separate worktree.

Suggested hardening
 build_malicious_image() {
     log_info "Building malicious node image from test/p2p-security branch..."
-    cd "$BITGET_ROOT"
+    cd "$POLYREPO_ROOT"
 
     # Save current tendermint branch state
     cd tendermint
     local original_branch
     original_branch=$(git branch --show-current)
     git stash 2>/dev/null || true
+    trap 'git checkout "$original_branch" >/dev/null 2>&1 || true; git stash pop 2>/dev/null || true' RETURN
 
     # Switch to malicious branch
     git checkout test/p2p-security
-    cd "$BITGET_ROOT"
+    cd "$POLYREPO_ROOT"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ops/docker-sequencer-test/run-test.sh` around lines 534 - 552, The script
currently stashes local changes, checks out test/p2p-security, and runs docker
build but does not guarantee cleanup on failure; update the block that sets
original_branch, runs git stash, git checkout test/p2p-security and docker build
to always restore the original branch and stash state by installing a trap-based
cleanup (or use a separate git worktree) that on EXIT checks if original_branch
is set and then runs git checkout "$original_branch" and git stash pop (if stash
was created), ensuring the trap is registered immediately after computing
original_branch and before any potentially failing command (refer to the
original_branch variable, git stash, git checkout test/p2p-security and docker
build -t morph-node-malicious:latest).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant