Skip to content

fix(review,cso): SHA-pin PR diff to prevent worktree-flip-during-review#1317

Open
habassa5 wants to merge 5 commits intogarrytan:mainfrom
habassa5:fix/security-review-stale-git-context
Open

fix(review,cso): SHA-pin PR diff to prevent worktree-flip-during-review#1317
habassa5 wants to merge 5 commits intogarrytan:mainfrom
habassa5:fix/security-review-stale-git-context

Conversation

@habassa5
Copy link
Copy Markdown

@habassa5 habassa5 commented May 4, 2026

Summary

/review and /cso use working-tree-dependent git refs (origin/<base>, <base>..HEAD) when computing the PR diff. Inside Agent SDK sessions where nested subagents share the worktree, a stray git checkout anywhere in the call tree (e.g., to inspect a sibling branch's file) silently re-targets every later git diff command, and the review reports findings on the wrong branch. Observed empirically three times in one session in a downstream project's PR reviewers.

This PR introduces a {{PR_DIFF_PIN}} template fragment (Step 0.5 of the affected skills) that resolves BASE_SHA and HEAD_SHA to immutable commit SHAs at the start of the skill — via gh pr view --json baseRefOid,headRefOid in PR context, or git rev-parse origin/<base> plus local HEAD otherwise — and rewrites every subsequent git diff/git log/git show to use those SHAs by value. Commit SHAs are immutable across worktree flips.

The bug, by example

A /review session is in progress on branch feature-B. A nested subagent runs git checkout feature-A to inspect a sibling branch's implementation, then doesn't switch back. From that point on, every git diff origin/main in the review skill renders against feature-A, and the review reports vulnerabilities and dead code from feature-A while purporting to review feature-B.

# Before the fix (working-tree dependent — bug)
git diff origin/main           # whatever branch HEAD currently points to
git log origin/main..HEAD      # whatever branch HEAD currently points to
git diff origin/main...HEAD    # whatever branch HEAD currently points to
git diff --name-only origin/HEAD...   # same — Claude Code's built-in /security-review uses this exact pattern

# After the fix (SHA-pinned — correct)
git diff "$BASE_SHA" "$HEAD_SHA"
git log "$BASE_SHA..$HEAD_SHA"
git show "$HEAD_SHA:VERSION"

Empirical evidence (from the downstream project that surfaced this)

Three back-to-back PR reviewers in the same session observed wrong-branch diffs:

What's NOT changed

  • /security-review (Claude Code built-in) has the exact same class of bug — cli.js ships the prompt git diff --name-only origin/HEAD... literally — but lives in Anthropic's binary and is out of gstack's reach. After this PR, /cso is a strictly safer alternative for security audits run from inside multi-agent sessions. The Claude Code bug is being filed separately upstream with Anthropic.
  • /ship, /codex, /document-release have similar bare-ref patterns. They are reachable from outside multi-agent reviews where the flip risk is lower; this PR leaves them on bare refs for back-compat. Adopting {{PR_DIFF_PIN}} there is one-line each in their .tmpl. Tracked as follow-up.
  • scripts/slop-diff.ts internally uses ${base}...HEAD and walks the working tree. Properly fixing it requires changing the script's interface to accept a head-SHA argument. Tracked as follow-up.

Test plan

  • bun test test/pr-diff-pin-regression.test.ts — 8 tests, ~6s, free tier
    • Includes a real two-branch-fixture test that reproduces the failure mode end-to-end (proves the bug is real, then proves SHA-pinning fixes it)
    • Smell tests catch regressions where someone re-introduces a bare-ref pattern into review/SKILL.md.tmpl or cso/SKILL.md.tmpl — both fenced bash blocks AND inline backtick-quoted commands
    • Verified empirically: deliberately injecting Run \git diff origin/`` into review/SKILL.md.tmpl makes the smell test fail loudly
  • bun test test/skill-collision-sentinel.test.ts — confirms no skill-name collisions introduced
  • bun run gen:skill-docs --dry-run — confirms generated SKILL.md files are in sync with templates
  • Adversarial review by independent codex CLI at HIGH reasoning effort (per CLAUDE.md 全ファイル日本語化: スコープ整理と実施計画 #67) surfaced 7 substantive concerns; all addressed in commit 9404fa9a. Re-running the same adversarial framing on the fixed branch is recommended before merge.
  • End-to-end fresh-clone validation (per CLAUDE.md /plan-ceo-review stack in the planning mode and started implementing the code #66): gh repo clone <fork> -- --branch fix/... to a clean directory, bun install, run regression test, all pass.

Adversarial codex findings addressed

Severity Finding Resolution
P1 Resolver-emitted commands still used bare refs Threaded ctx.skillName === 'review' through 6 affected resolvers; pinned-SHA forms inside /review, bare refs preserved for /ship
P2 Uncommitted local edits dropped in local-/review path Added $REVIEW_DIRTY env var (default 0 in PR context, 1 for local pre-PR), Step 3 instructs combining git diff "$BASE_SHA" with git diff "$HEAD_SHA" when set
P2 Stale base ref via fetch + rev-parse Now uses gh pr view --json baseRefOid directly (immutable) inside PR context
P2 Stale base ref when fetch fails Warn explicitly + add git cat-file -e probe for both pinned SHAs
P2 gh pr diff blessed as immutable but isn't Replaced with explicit warning; recommend gh api /repos/.../compare/$BASE_SHA...$HEAD_SHA if user wants GitHub-rendered diff
P2 Smell test only checked fenced bash blocks Extended to also catch inline backtick-quoted git/gh/bun imperatives
P2 /cso exit(1) on missing SHAs broke non-diff scope flags Soft-fail with WARN; /review re-asserts requirement with explicit check

Files changed

CHANGELOG.md                        |  55 ++
VERSION                             |   2 +-
package.json                        |   2 +-
scripts/resolvers/utility.ts        |  ~140 (+ generatePrDiffPin)
scripts/resolvers/index.ts          |   3
scripts/resolvers/review.ts         |  51 ± (skillName-conditional pinned vs bare)
scripts/resolvers/review-army.ts    |  18 ± (skillName-conditional pinned vs bare)
review/SKILL.md.tmpl                |  29 ± (Step 0.5 inclusion + Step 1/3 rewrite)
review/SKILL.md                     | 108 ± (regenerated)
review/checklist.md                 |   4 (doc reference to pinned form)
review/greptile-triage.md           |   2 (doc reference to pinned form)
cso/SKILL.md.tmpl                   |  13 ± (Step 0.5 inclusion + Phase 2 diff-mode rewrite)
cso/SKILL.md                        | 129 ± (regenerated)
test/pr-diff-pin-regression.test.ts | 280 (new — 8 tests, ~6s, free tier)

Failure mode tracking

This change closes the failure mode named shared-checkout-branch-flip-during-review in CLAUDE.md tracking. The same name appears in Step 0.5 documentation and the regression test's docblock so future maintainers can grep for it.

Compatibility / breaking changes

None. The bare-ref forms are preserved for /ship and other skills that don't yet adopt {{PR_DIFF_PIN}}. Users running /review and /cso will see a new "Pinned review context:" log line at the start of the skill (Step 0.5), but the rest of the workflow is functionally identical except in the buggy worktree-flip case.

🤖 Generated with Claude Code


View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

habassa5 and others added 5 commits May 4, 2026 04:48
…mid-review

Per gstack's bisect-commits style, this is the source-of-truth commit
(template + resolver + test). The regenerated SKILL.md files land in
the next commit. The CHANGELOG and VERSION bump land in the third.

Bug: a long-running review skill that reads git state through symbolic
refs (`HEAD`, `origin/<base>`, `origin/HEAD`) silently re-renders against
whatever the working tree is checked out to at the moment each `git diff`
runs — not against the branch the user asked to review. Inside Agent SDK
sessions where nested subagents share the worktree, a stray `git checkout`
anywhere in the call tree (e.g., to inspect a sibling branch's file) flips
HEAD and never restores it. Every later diff command then targets the
wrong branch, and the review reports findings on unrelated code.

Empirical evidence (claude-teams-bot, 2026-05-04): three consecutive
PR reviewers in one session observed the upstream Claude Code built-in
`/security-review` rendering against the wrong branch. The same class
of bug exists in gstack `/review` (uses `git diff origin/<base>`) and
`/cso` `--diff` mode (uses `git log <base>..HEAD`).

Fix: a new `{{PR_DIFF_PIN}}` template fragment (Step 0.5 of the skill)
resolves `BASE_SHA` and `HEAD_SHA` to immutable commit SHAs at the start
of the skill, via `gh pr view --json headRefOid` in PR context or
`git rev-parse origin/<base>` plus local `HEAD` otherwise. Every later
`git diff`, `git log`, and `git show` invocation is rewritten to use
those SHAs by value. SHAs are immutable across worktree flips.

Out of scope (left for follow-up PRs once per-skill verification is
done): `ship`, `codex`, and `document-release` have similar bare-ref
patterns but are reachable from outside multi-agent reviews where the
flip risk is lower. Adopting `{{PR_DIFF_PIN}}` there is one line each.

Not addressed: the upstream Claude Code `/security-review` built-in has
the same bug (`git diff --name-only origin/HEAD...`) but lives in
`cli.js` and is out of gstack's reach. After this change, gstack
`/cso` is a strictly safer alternative for security audits run from
inside multi-agent sessions.

Test: `test/pr-diff-pin-regression.test.ts` builds a real two-branch
fixture, simulates the worktree flip, and asserts (a) bare `git diff
main` reports the wrong branch (proving the failure mode is real),
(b) `git diff "$BASE_SHA" "$HEAD_SHA"` reports the right branch
(proving the fix works), (c) the same holds for `git log` and
`git show`, and (d) the templates use `{{PR_DIFF_PIN}}` and SHAs in
imperative bash blocks. 8 tests, ~6s, free tier.

Refs: claude-teams-bot CLAUDE.md §58 (stale-git-context bug),
§62-§68 (build-discipline gates), failure-mode name
`shared-checkout-branch-flip-during-review`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical regeneration via `bun run gen:skill-docs`. No manual edits.
Mirrors the previous commit's template changes into the rendered files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps to v1.26.3.0 (PATCH on top of v1.26.2.0). Single-skill bug fix
plus regression test plus reusable resolver. Net diff ~660 lines, no
new user-facing capability beyond the safety guarantee, fits PATCH
guidance per CLAUDE.md commit-style rules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent codex review (HIGH effort, adversarial mindset per CLAUDE.md
garrytan#67) surfaced 7 substantive concerns against the initial fix. This commit
addresses all of them. Findings:

P1 — Resolver-emitted commands still used bare refs.
  /review imports several resolver-generated sections (SCOPE_DRIFT,
  PLAN_COMPLETION_AUDIT_REVIEW, ADVERSARIAL_STEP, REVIEW_ARMY) whose
  templates emitted `git diff origin/<base>` and `git log origin/<base>..HEAD`
  unchanged. Even with PR_DIFF_PIN at the top of the skill, those later
  sections would re-render against the working tree if a subagent flipped
  it. Fix: thread `ctx.skillName === 'review'` through the affected
  resolvers (`generateScopeDrift`, `generatePlanCompletionAuditInner`,
  `generateAdversarialStep`, `generateSpecialistSelection`,
  `generateSpecialistDispatch`, `generateRedTeam`) and emit pinned-SHA
  forms inside `/review` while leaving `/ship` on the bare-ref form for
  back-compat (ship doesn't yet adopt PR_DIFF_PIN).

P2 — Uncommitted local edits dropped in the local-/review path.
  Pre-fix `/review` ran `git diff origin/<base>` against the working
  tree, which included uncommitted edits — a real workflow for users
  who run /review before /ship creates the PR. Fix: introduce
  `$REVIEW_DIRTY` env var (default 0 in PR context, 1 for local pre-PR).
  When 1, Step 3 instructs the agent to combine `git diff "$BASE_SHA"`
  (committed, pinned) with `git diff "$HEAD_SHA"` (uncommitted, by
  necessity working-tree-dependent — the user's caller knows when they're
  in dirty mode and reads it once at the start).

P2 — Use baseRefOid from gh pr view directly (immutable) instead of
re-resolving via fetch + rev-parse origin/<base>.
  `gh pr view` returns both `baseRefOid` and `headRefOid` — pinning to
  those is strictly more correct than `git rev-parse origin/<base>`,
  which can be stale if the local clone hasn't fetched recently. Fix:
  read baseRefOid from the same `gh pr view` JSON we already fetch.
  Fall back to `git rev-parse origin/<base>` only outside PR context.

P2 — Stale base ref when fetch fails.
  In non-PR fallback, `git fetch origin "$BASE_BRANCH"` is wrapped in
  `|| true` and silently pins whatever local tracking ref exists. Fix:
  warn explicitly when the fetch fails, surface what was pinned. Also
  add a `git cat-file -e` probe for both pinned SHAs and warn when
  either is not present in the local object store (catches missing
  fetch-of-PR-head + missing-pull-ref scenarios).

P2 — Don't bless `gh pr diff "$PR_NUMBER"` as immutable.
  Initial fix recommended `gh pr diff` as an alternative; codex pointed
  out that it re-resolves HEAD and BASE server-side at every call, so a
  force-pushed PR head or fast-forwarded base mid-review changes its
  output. Fix: replace the recommendation with an explicit warning. If
  the user genuinely wants the GitHub-rendered diff, point them at
  `gh api /repos/.../compare/$BASE_SHA...$HEAD_SHA` which IS
  SHA-pinned.

P2 — Smell test only checks fenced bash blocks, missing inline imperatives.
  Step 3 of the original /review used inline backtick-wrapped commands
  ("Run `git diff origin/<base>` to ..."). The smell test in
  pr-diff-pin-regression.test.ts only inspected fenced bash blocks, so
  a future regression that puts a bare ref inside backticks in narrative
  prose would slip through. Fix: extend imperativeBashCommands() to also
  pull inline backtick-wrapped `git`/`gh`/`bun` commands. Skip don't/do
  comparison rows and explicit "**Don't**" labels via heuristic, with
  comments calling out each filtering rule. Verified empirically: a
  deliberately-injected `Run \`git diff origin/<base>\`` inline string
  now fails the test loudly.

P2 — Don't require PR pinning for non-diff /cso modes.
  Original fix exited(1) when SHAs couldn't be resolved, which broke
  /cso --infra, /cso --supply-chain, /cso --owasp (these don't need a
  diff). Fix: soft-fail with WARN. /review re-asserts the requirement
  with an explicit `[ -n "$BASE_SHA" ] && [ -n "$HEAD_SHA" ]` check at
  Step 1; /cso scope-flag modes degrade gracefully.

P3 — Pin slop scan head as well.
  Acknowledged but out of scope. `scripts/slop-diff.ts` internally uses
  `${base}...HEAD`, computes merge-base from working-tree HEAD, and
  walks the worktree directly. Properly fixing it requires changing
  the slop-diff.ts internals to accept a head-SHA argument. Tracked as
  follow-up; current change passes `$BASE_SHA` to slop-diff which still
  drifts on its head side but at least pins the base.

Tests still pass (8/8 in pr-diff-pin-regression). The deliberate-bad-
inline-pattern smoke test confirms the smell test catches the new class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical regeneration via `bun run gen:skill-docs`. No manual edits.
Mirrors the previous commit's resolver+template changes into the
rendered files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant