Skip to content

Post-audit cleanup: lifecycle bugs, harness fidelity, examples, CLI, installer#2

Merged
owenwahlgren merged 6 commits into
mainfrom
fix/post-audit-cleanup
May 26, 2026
Merged

Post-audit cleanup: lifecycle bugs, harness fidelity, examples, CLI, installer#2
owenwahlgren merged 6 commits into
mainfrom
fix/post-audit-cleanup

Conversation

@owenwahlgren

Copy link
Copy Markdown
Collaborator

Summary

End-to-end cleanup pass driven by a production-readiness audit before flipping the repo public. Six commits, partitioned by domain:

  1. chore: bump workspace to v0.1.0, add CHANGELOG.md
  2. fix(tmpnetjs): unify pid schema, kill process groups, make boot interruption-safe
  3. feat(tmpnetjs): CLI cleanup, snapshot meta sidecar, NetworkTimeouts API
  4. fix(examples): transfer-token decimals mismatch, env-var fallbacks, safer add-validator teardown
  5. feat(harness): per-messenger origin tracking, fee>0 example/test, gated _harness_* admin
  6. feat(installer): required checksum verification with explicit escape hatch

What this fixes

Process lifecycle (was the biggest sharp edge)

The previous orchestrator had two PID writers with incompatible schemas clobbering each other, and down() only SIGTERM'd parent PIDs. The subnet-evm plugin children were orphaned under PID 1 every cycle — a dev machine accumulated 5 plugins from 4-day-old runs.

  • Single canonical pid schema: { hash?, processes: [{ name, pid, pgid, kind, ... }] }
  • Spawns with detached: true, kills the entire process group (kill(-pgid, ...)) with SIGTERM → 3s → SIGKILL
  • Defensive ps-based sweep keyed on the subnet-evm VM ID and the workDir (so a sweep in one workspace doesn't kill another)
  • up() records PIDs incrementally and installs SIGINT/SIGTERM handlers — Ctrl-C during boot now tears down cleanly

Universal-deployer correctness on re-up

deployIcmStack previously assumed the Teleporter deployer (Anvil#1) had nonce=0 on every chain at deploy time. After down → up (snapshot path), C-Chain state was restored with the deployer at nonce 2 from the previous deploy, but L1 was recreated fresh at nonce 0. Result: Teleporter at different addresses on the two chains, breaking the universal-deployer invariant the relayer requires. Now:

  • Idempotent deploy: check the canonical (deployer, nonce=0) address — reuse if Teleporter is already there
  • Defensive guard: if no code at canonical but deployer nonce > 0, abort with a "run pnpm run clean" hint instead of silently producing a broken network
  • On stale-snapshot rejection, also wipe p.data so the fresh cold boot starts truly clean

CLI no longer lies

  • tmpnetjs status was advertised in --help but returned "not yet implemented" — now implements it, tolerant of both old and new pid schemas
  • tmpnetjs up --fresh was parsed then explicitly discarded (void fresh;) — now wired to up({ fresh: true })
  • tmpnetjs down --keep-snapshot was misleading (default behavior, no-op); the actual destructive flag is now --delete-snapshot
  • Stage prints throughout up() — the previously-silent first 2 minutes of cold boot is now visible

Harness fidelity gaps closed

Three places where a test could pass dishonestly while a real chain would behave differently:

  • Universal-deployer check was structurally bypassed in MockWarpPrecompile.getVerifiedWarpMessage — fixed by tracking per-messenger origin; test_rogueSourceMessenger_universalDeployerCheck_reverts confirms
  • Fees > 0 path had zero coverage — added FeeMessage.sol + test exercising real safeTransferFrom + redeemRelayerRewards
  • _harness_* admin entrypoints were unguarded — any contract under test could forge messages mid-test; now gated by a one-shot _harness_init(harness_) + onlyHarness modifier

forge test: 14 → 18 passed.

Examples bugs

  • transfer-token.ts deployed a decimals()=6 token but parsed amounts at 18 decimals — a user copying the scaffold with real 6-decimal USDC would have shipped a 10^12× over-send. Token is now consistently 18 decimals and renamed to DEMO/Demo Token.
  • examples/README.md documented env vars that didn't work and a ExampleERC20 contract name that doesn't exist — now matches the code (flag > env > default precedence).
  • add-validator.ts told users to clean up with pkill -9 avalanchego, which nukes the whole tmpnet. Now prints the actual spawned PID + log path.

Supply chain (installer)

  • icm-relayer + signature-aggregator binary downloads now require checksum verification (was best-effort and silently skipped on parse miss or 404)
  • INTERCHAIN_KIT_SKIP_CHECKSUM=1 is the explicit escape hatch, with a loud console.warn when used
  • Test coverage 11 → 21 via a download injector that lets unit tests exercise 404, missing-entry, and bypass paths without network

Verification

  • pnpm install --frozen-lockfile: clean
  • pnpm -r --filter './packages/**' build: clean
  • pnpm --filter @interchain-kit/examples typecheck: clean
  • pnpm --filter @interchain-kit/icm-services-installer test: 21/21 pass
  • forge test --root contracts: 18/18 pass
  • forge fmt --check --root contracts: clean
  • End-to-end live network: cold-up + ICM demo + ICTT demo + snapshot-restore-up + ICM demo, all green
  • pnpm run down: zero orphaned plugin processes (previously: 5+ orphans every cycle)

Test plan

  • CI is green
  • Local cold boot still works (no pnpm run clean first should be fine post-this-PR)
  • Local snapshot-restore boot still works
  • pnpm run down leaves no avalanchego/srEXi/icm-relayer/signature-aggregator processes alive

Initial pre-public-release versions. Adds Keep-a-Changelog-format
CHANGELOG.md summarizing what ships in 0.1.0.
…ruption-safe

Previously two writers (spawnTracked vs the orchestrator's inline
writePidRecord) clobbered each other on pids.json, and down() only
SIGTERM'd parent PIDs — subnet-evm plugin children were left orphaned
under PPID=1 every cycle. Confirmed in the wild: 5 plugins from
4-day-old runs accumulated on a dev machine.

This commit:

- Consolidates on the richer internal/process.ts schema:
  { hash?, processes: [{ name, pid, pgid, kind, binary, logFile, startedAt }] }.
  Records PIDs incrementally via spawnTracked instead of batching at the
  end, so SIGINT during boot can reap what was spawned so far.
- Spawns every long-running process detached so it becomes a process
  group leader; kills via process.kill(-pgid, ...) with SIGTERM → 3s
  grace → SIGKILL. Adds a defensive ps-based sweep filtered by VM ID
  AND workDir for any leaked plugin children.
- Installs SIGINT/SIGTERM handlers in up() and wraps the boot in
  try/catch so a half-finished up still tears down cleanly.
- Makes installFeeStatePatch reversible — exports uninstallFeeStatePatch
  and calls it from every teardown path so embedders (Vitest setups
  etc.) don't end up with a permanently wrapped globalThis.fetch.
- Adds stage prints throughout up() so the previously-silent 2-minute
  cold-boot phase ([primary] booting, P-Chain bootstrap, CreateSubnet,
  CreateChain, ConvertSubnetToL1Tx) is visible.
- Strips $HOME/code/avalanchego and $HOME/code/avalanche-benchmark
  fallback paths from spawn.ts and l1/create.ts — those leaked a
  developer's personal layout into a public repo.
- Makes deployIcmStack idempotent: checks the canonical
  (deployer, nonce=0) address first and reuses if Teleporter is already
  there (covers the snapshot-restore path, where C-Chain state from a
  prior run already has Teleporter at the universal-deployer address).
  Defensive guard: if no code at the canonical address but deployer
  nonce > 0, abort with a "run pnpm run clean" hint instead of landing
  Teleporter at a non-canonical address.
- On stale-snapshot rejection (validateSnapshot returns ok=false), wipe
  the data dir too. Otherwise residual C-Chain state from the rejected
  run leaves the Teleporter deployer with a non-zero nonce and breaks
  the universal-deployer invariant the relayer depends on.
- cli.ts: implement \`tmpnetjs status\` (was returning "not yet
  implemented"). Wire \`--fresh\` through to up({fresh:true}) (was
  parsed then discarded). Replace the misleading \`--keep-snapshot\`
  flag (default behavior, no-op) with the actually-destructive
  \`--delete-snapshot\`. Rewrote USAGE to match.
- snapshot.ts: write a snapshot.meta.json sidecar alongside the
  tarball (configHash, avalanchegoBinary fingerprint via mtime+size,
  captured-at timestamp). validateSnapshot returns {ok, reason} so
  callers can fall through to cold boot on mismatch instead of
  silently restoring stale state across an avalanchego upgrade.
- types.ts + internal/config.ts: introduce NetworkTimeouts so slow
  CI runners can override the magic values in validator-set.ts
  (postAdvance, epoch, sigagg health, l1 RPC). Defaults preserved.
- internal/config.ts: paths(workDir) now exposes relayerConfigPath,
  sigaggConfigPath, and the sibling icmRelayerStorageDir — \`clean()\`
  wipes all three, fixing the leftover-config-on-clean issue.
- Root README: drop the "17 harness tests" claim (actual is 18 after
  this PR adds fee + universal-deployer tests).
…validator teardown

- transfer-token.ts: the DemoERC20 was deployed with decimals()=6 but
  ERC20TokenHome was told tokenHomeDecimals=18 and parseUnits used 18.
  A user copying the scaffold with a real 6-decimal token (USDC) would
  silently send 10^12× their intended amount. Fix: deploy DemoERC20
  with 18 decimals explicitly, rename the token to "Demo Token"/"DEMO"
  so the symbol stops lying.
- send-message.ts + transfer-token.ts: accept process.env.DESTINATION
  and AMOUNT as fallbacks (flag > env > default). The examples/README
  documented env-var invocations that didn't actually work.
- examples/README.md: fix the `ExampleERC20` -> `DemoERC20` reference
  and document the flag/env precedence.
- add-validator.ts: the script previously told users to clean up with
  `pkill -9 avalanchego`, which nukes the entire tmpnet. The script
  already knows the spawned child's PID — print it and the log path
  instead.
…e admin entrypoints

Three harness fidelity gaps a previous audit flagged. Each gap let a
test pass dishonestly while a real chain would behave differently.

1. Universal-deployer assumption was structurally bypassed.
   MockWarpPrecompile.getVerifiedWarpMessage used to return
   originSenderAddress=msg.sender unconditionally, so Teleporter's
   "require(originSenderAddress == address(this))" check at
   TeleporterMessenger:259-262 always passed. A user with a custom
   factory deploying messengers at different addresses got false greens.

   Fix: Inflight tracks the source messenger; the harness records
   canonicalMessenger[] for whatever it deployed via deployChain, and
   for any source outside that set the mock reports the actual sender
   so the real Teleporter check fires. New
   test_rogueSourceMessenger_universalDeployerCheck_reverts confirms.

2. Fees > 0 path was untested. Every example used feeInfo.amount=0,
   missing safeTransferFrom + redeemRelayerRewards entirely. Adds
   contracts/src/examples/icm-basics/FeeMessage.sol (reuses
   DemoERC20) and a test that asserts the fee actually moves and
   redeemRelayerRewards credits the harness as relayer.

3. _harness_* admin entrypoints on MockWarpPrecompile were
   unguarded. Any contract under test could forge messages mid-test.
   Gated via a one-shot _harness_init(harness_) + onlyHarness modifier;
   FoundryWarpHarness sets itself as the harness at construction.
   test_unauthorisedHarnessAdmin_reverts proves the gate.

forge test: 14 → 18 passed.
…atch

icm-services-installer used to swallow every checksum-related error
("if checksums.txt 404s, continue without verification — per spec").
The releases this installer pins (icm-relayer v1.7.5,
signature-aggregator v0.5.4) do publish checksums.txt, so make
verification required:

- checksums.txt 404, parse-miss, missing-entry, and hash mismatch all
  throw with actionable errors that name the override env var.
- INTERCHAIN_KIT_SKIP_CHECKSUM=1 reverts to the prior best-effort
  behavior, emitting a loud console.warn each time it's used.
- Refactored InstallOptions to accept an optional download injector so
  the new tests can exercise 404, missing-entry, and bypass paths
  without network. Exported Downloader type for those tests.

Test coverage: 11 → 21. New cases cover findChecksum
(present/absent/case-mismatch/malformed), verifySha256
(happy/mismatch), and three installer paths through the injector.
@owenwahlgren owenwahlgren merged commit 3b234c2 into main May 26, 2026
5 checks passed
@owenwahlgren owenwahlgren deleted the fix/post-audit-cleanup branch May 26, 2026 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant