fix(tmpnetjs): fail fast with actionable errors on boot misconfigurations #4

Merged: owenwahlgren merged 2 commits into main from fix/tmpnet-boot-diagnostics on Jun 12, 2026
@owenwahlgren

Collaborator

Problem

Three boot misconfigurations each burned minutes of timeout with the real cause buried in a node log — or reported nowhere at all:

  1. Missing local staking keys → nodes boot with ephemeral certs, aren't genesis validators, and P-Chain bootstrap times out with "Timed out waiting for P-Chain bootstrap: undefined". The only signal ("bls: node is not a validator") sits in /ext/health.
  2. RPCChainVM protocol mismatch (avalanchego vs subnet-evm plugin) → the L1 RPC 404s for the full 3-minute timeout while "error creating chain ... handshake failed" sits in myl1-rpc-*.log.
  3. Booting over a half-dead previous run → stale nodes hold ports ("bind: address already in use" appears minutes in), and a late-running reaper from the old up() can kill the new network's nodes.

Changes

  • Preflight checks at the top of up() (internal/preflight.ts):
    • refuse when the pid file records live processes from a previous run
    • refuse when any node HTTP port can't be bound
    • compare RPCChainVM protocol versions of avalanchego (--version-json) and the subnet-evm plugin (--version), best-effort, and refuse on mismatch with both versions and paths in the message
  • Staking keys are now required: resolveStakingKeysDir throws (listing dirs tried + the AVALANCHEGO_STAKING_KEYS_DIR fix) instead of silently falling back to ephemeral certs that can never bootstrap; a partial key set errors per-node instead of downgrading one node to a non-validator.
  • PreflightError skips the reap-on-failure path — nothing was spawned, and reaping would kill the previous (possibly healthy) network the error is telling the user about. (Found live: the first version of the stale-network check reaped the very network it refused to boot over.)
  • L1 RPC timeouts now attach the cause: both wait sites scan the node logs for the error creating chain line for that blockchain ID and append it (internal/diagnose.ts).
  • Better timeout reporting: waitForBootstrap/waitForNodeID report the last RPC response instead of undefined, with a pointer to the /ext/health signal; startPrimaryNetwork logs the resolved binary + staking-keys paths up front.
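The first two preflight checks listed above (live pids from a previous run, unbindable node ports) could be sketched roughly as follows. This is a minimal illustration, not the contents of internal/preflight.ts: the function names `isProcessAlive`, `findLivePids`, and `canBindPort`, and the default host, are hypothetical.

```typescript
import * as net from "node:net";

// Returns true when a process with this pid exists. Signal 0 performs the
// existence/permission check without delivering a signal.
export function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Hypothetical helper: narrow the pids recorded in a pid file to those
// that are still running.
export function findLivePids(recorded: number[]): number[] {
  return recorded.filter(isProcessAlive);
}

// Resolves true if the port can be bound on the given host, false if the
// bind fails (for example because a stale node still holds the port).
export function canBindPort(port: number, host = "127.0.0.1"): Promise<boolean> {
  return new Promise((resolve) => {
    const server = net.createServer();
    server.once("error", () => resolve(false));
    server.once("listening", () => server.close(() => resolve(true)));
    server.listen(port, host);
  });
}
```

Checking both conditions before anything is spawned is what lets the refusal be instant, rather than surfacing as a bind error minutes into the boot.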

Verification

All on a real network (macOS, avalanchego v1.14.0 + subnet-evm v0.8.0):

  • stale-network check: ran up() over a live network → instant refusal listing live pids, previous network untouched
  • protocol mismatch: avalanchego(proto 45) + subnet-evm(proto 44) → instant refusal naming both versions (previously: 3-min timeout)
  • missing staking keys: instant refusal with fix instructions (previously: 60s timeout, undefined)
  • happy path: full up() boots clean through L1 + ICM + relayer, ending with "[up] network ready"
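The protocol-mismatch refusal verified above reduces to a comparison that only fires when both versions are known. A minimal sketch, assuming the two protocol numbers have already been parsed from the binaries' version output; the `BinaryProtocol` type and `protocolMismatchMessage` name are hypothetical, and the message wording is illustrative rather than the actual error text:

```typescript
export interface BinaryProtocol {
  path: string;
  // undefined when the version output could not be parsed (best-effort).
  protocol: number | undefined;
}

// Returns an error message on a confirmed mismatch, or undefined when the
// versions agree or when either one is unknown.
export function protocolMismatchMessage(
  node: BinaryProtocol,
  plugin: BinaryProtocol,
): string | undefined {
  if (node.protocol === undefined || plugin.protocol === undefined) return undefined;
  if (node.protocol === plugin.protocol) return undefined;
  return (
    `RPCChainVM protocol mismatch: avalanchego at ${node.path} uses protocol ${node.protocol}, ` +
    `subnet-evm plugin at ${plugin.path} uses protocol ${plugin.protocol}`
  );
}
```

Treating an unparseable version as "unknown" rather than as a failure keeps the check best-effort: it can only block a boot when it has positive evidence of a mismatch.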


Three failure modes previously burned minutes of timeout with the real
cause buried in a node log or nowhere at all:

- Missing local staking keys: nodes booted with ephemeral certs, were not
  genesis validators, and P-Chain bootstrap timed out reporting
  "undefined". Now refuses to boot up front, listing the dirs tried and
  the AVALANCHEGO_STAKING_KEYS_DIR fix; a partial key set also errors
  instead of silently downgrading one node to a non-validator.

- RPCChainVM protocol mismatch between avalanchego and the subnet-evm
  plugin: the L1 RPC 404'd for the full 3-minute timeout while the
  handshake error sat in the node log. New preflight compares both
  binaries' protocol versions (best-effort --version parsing) and
  refuses to boot on mismatch; the L1-RPC timeout errors now also attach
  the "error creating chain" line scanned from the node logs.

- Booting over a half-dead previous run: stale nodes held ports and a
  late reaper could kill the new network's nodes. up() now refuses when
  the pid file records live processes or node ports are taken.
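Attaching the "error creating chain" line to an L1 RPC timeout, as described in the second item, could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and it assumes only that the relevant log line contains both the phrase "error creating chain" and the blockchain ID, without relying on any particular log format.

```typescript
// Finds the first log line that reports a chain-creation error for the
// given blockchain ID, or undefined if there is none.
export function findChainCreationError(
  logText: string,
  blockchainId: string,
): string | undefined {
  return logText
    .split(/\r?\n/)
    .find((line) => line.includes("error creating chain") && line.includes(blockchainId));
}

// Appends the discovered cause to a timeout message when one was found.
export function withCause(message: string, cause: string | undefined): string {
  return cause === undefined ? message : `${message}\ncause: ${cause.trim()}`;
}
```

A caller would read each node log, pass its contents through `findChainCreationError`, and wrap the timeout message with `withCause` before throwing.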

Preflight failures throw PreflightError and skip the reap-on-failure
path — nothing was spawned, and reaping would kill the previous
(possibly healthy) network the error is telling the user about.
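The skip-the-reaper behaviour can be illustrated with a dedicated error class and a single cleanup decision. This is a minimal sketch, not the actual up() implementation: `shouldReap` and `upWithCleanup` are hypothetical names, and the three callbacks stand in for the real preflight, boot, and reap steps.

```typescript
// Thrown before anything is spawned; signals that cleanup must not run.
export class PreflightError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PreflightError";
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Reap only when the failure happened after processes may have been spawned.
export function shouldReap(err: unknown): boolean {
  return !(err instanceof PreflightError);
}

export async function upWithCleanup(
  preflight: () => Promise<void>,
  boot: () => Promise<void>,
  reap: () => Promise<void>,
): Promise<void> {
  try {
    await preflight();
    await boot();
  } catch (err) {
    // A preflight refusal leaves the previous network untouched.
    if (shouldReap(err)) await reap();
    throw err;
  }
}
```

Keying the decision on the error type, rather than on where the exception was caught, means any check that throws PreflightError is covered without extra bookkeeping.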

Also: waitForBootstrap/waitForNodeID timeouts now report the last RPC
response (not "undefined"), with a pointer to the /ext/health signal;
startPrimaryNetwork logs the resolved binary + staking keys paths.
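Reporting the last RPC response on timeout amounts to remembering the most recent probe result inside the polling loop. A minimal sketch, assuming a generic probe function; `describeTimeout` and `waitUntil` are hypothetical names and not the signatures of waitForBootstrap or waitForNodeID, and the message wording is illustrative:

```typescript
// Builds a timeout message that never degrades to a bare "undefined".
export function describeTimeout(what: string, lastResponse: unknown): string {
  const last =
    lastResponse === undefined ? "no response received" : JSON.stringify(lastResponse);
  return (
    `Timed out waiting for ${what}; last RPC response: ${last}. ` +
    `Check /ext/health on the node for the underlying reason.`
  );
}

// Polls probe() until done() accepts its result or the timeout elapses.
export async function waitUntil<T>(
  what: string,
  probe: () => Promise<T>,
  done: (value: T) => boolean,
  timeoutMs: number,
  intervalMs = 500,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let last: T | undefined;
  while (Date.now() < deadline) {
    try {
      last = await probe();
      if (done(last)) return last;
    } catch {
      // The node may not be listening yet; keep the previous response.
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(describeTimeout(what, last));
}
```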
Comment thread packages/tmpnetjs/src/internal/preflight.ts Fixed
Addresses the CodeQL finding on PR #4: execSync interpolated the
env-derived binary path into a shell command string. execFileSync
invokes the binary directly with no shell, so the path is only ever
an argv entry.
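The difference the commit describes can be shown with Node's `child_process` API. A minimal sketch, where `readVersionOutput` is a hypothetical helper name: because `execFileSync` takes the binary path and arguments separately and does not spawn a shell by default, metacharacters in an env-derived path are never interpreted.

```typescript
import { execFileSync } from "node:child_process";

// Runs the binary directly (no shell) and returns its trimmed stdout.
// binaryPath is passed as the executable and args as argv entries, so
// neither is ever parsed as shell syntax.
export function readVersionOutput(binaryPath: string, args: string[]): string {
  return execFileSync(binaryPath, args, { encoding: "utf8" }).trim();
}
```

For example, `readVersionOutput(process.execPath, ["--version"])` returns the version string of the running Node binary.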
@owenwahlgren merged commit b3c579c into main on Jun 12, 2026
7 checks passed

3 participants