benchmarks: Suite C post-dedup re-run + diff vs 2026-06-11 baseline (#78) #88

Merged: edheltzel merged 2 commits into main from worktree-benchmarks+suite-c-post-dedup-78 on Jun 18, 2026
@edheltzel (Owner)

Closes #78 — the Hardening & Dedup Proof milestone capstone.

What this does

Runs recall dedup --execute over the seeded Suite C corpora, re-runs Suite C, and diffs against the committed benchmarks/results/2026-06-11T09-36-53-suite-C.jsonl baseline. The diff is dedup's efficacy report — the empirical evidence the parked entity-keying gate (#49) is waiting on.

Deps #63 (cross-run dedup safety, PR #79) and #70 (silent-zero hardening, PR #83) are already on main.

Mechanism (surgical, opt-in)

Adds a dedup pass to runSuiteC, gated by RECALL_BENCH_C_DEDUP=1 (default OFF). Between seedFixture() and the measured queries it runs the real dedup code path, mirroring recall dedup --execute exactly:

const plan = planDedup(db);                        // all tables, default 0.95 threshold, semantic on
applyDedupPlan(db, plan, { destructive: false });  // --execute → MARK (no --delete)

dedup_planned / dedup_marked counts are emitted as samples per corpus size, and the "Dedup was NOT run" caveat flips to a "Dedup WAS run" caveat when enabled.
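The opt-in gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `dedupEnabled`, `maybeRunDedup`, and the `Env` and `DedupCounts` types are hypothetical names, and the `plan` / `apply` callbacks merely stand in for `planDedup(db)` and `applyDedupPlan(db, plan, { destructive: false })`. It also assumes that only the literal value `1` enables the pass.

```typescript
// Hypothetical shapes: the real plan/apply types in the repo are not shown in this PR.
type Env = Record<string, string | undefined>;

interface DedupCounts {
  planned: number;
  marked: number;
}

// Assumption: only the literal value "1" turns the pass on; anything else is OFF.
function dedupEnabled(env: Env): boolean {
  return env["RECALL_BENCH_C_DEDUP"] === "1";
}

// Intended to run between seeding and the measured queries.
// `plan` and `apply` stand in for planDedup(db) and applyDedupPlan(db, plan, ...).
function maybeRunDedup<P>(
  env: Env,
  plan: () => P,
  countPlanned: (p: P) => number,
  apply: (p: P) => number,
): DedupCounts | null {
  if (!dedupEnabled(env)) return null; // default OFF: the baseline path is untouched
  const p = plan();
  return { planned: countPlanned(p), marked: apply(p) };
}
```

Returning `null` when the gate is off keeps the default path free of any dedup side effects, which is the property the byte-for-byte baseline reproduction relies on.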

The run is real, not an all-zeros artifact

  • Silent-zero canary (#83) passed and getLastSearchErrors() is clean (the harness throws if either check fails).
  • Dedup exercised: marked 0 / 5 / 288 / 8,328 across 100 / 1k / 10k / 100k (8,621 total).
  • Inertness/reproducibility proof: re-running with dedup OFF reproduced the committed baseline byte-for-byte across all 76 relevance metrics, so every delta is attributable solely to dedup.
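A check of the kind described in the bullets above, comparing two metric sets and listing only the entries that differ, might look like the sketch below. The `Metrics` shape, `diffMetrics`, and the `mrr@5/<size>` key names are assumptions made for illustration; the actual JSONL schema and diff tooling are not shown in this PR. The sample values are the overall MRR@5 figures reported in this PR.

```typescript
// Hypothetical metric container: metric name -> value.
type Metrics = Record<string, number>;

interface MetricDelta {
  name: string;
  baseline: number;
  candidate: number;
  delta: number;
}

// Lists every metric whose value differs between two runs.
// A metric missing from either side is reported as changed (NaN !== NaN).
function diffMetrics(baseline: Metrics, candidate: Metrics): MetricDelta[] {
  const names = Array.from(
    new Set([...Object.keys(baseline), ...Object.keys(candidate)]),
  ).sort();
  const changed: MetricDelta[] = [];
  for (const name of names) {
    const b = baseline[name] ?? NaN;
    const c = candidate[name] ?? NaN;
    if (b !== c) changed.push({ name, baseline: b, candidate: c, delta: c - b });
  }
  return changed;
}

// Overall MRR@5 values from this PR, keyed by corpus size (key names are illustrative).
const headlineBaseline: Metrics = {
  "mrr@5/100": 0.7381,
  "mrr@5/1000": 0.2857,
  "mrr@5/10000": 0.1095,
  "mrr@5/100000": 0.0,
};
const headlinePostDedup: Metrics = {
  "mrr@5/100": 0.7381,
  "mrr@5/1000": 0.2857,
  "mrr@5/10000": 0.1131,
  "mrr@5/100000": 0.0143,
};
```

An empty result from `diffMetrics(a, b)` corresponds to the dedup-OFF reproduction; a non-empty result lists exactly the metrics that dedup moved.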

Headline result — MRR@5 (overall)

| Corpus size | Baseline | Post-dedup | Δ |
| --- | --- | --- | --- |
| 100 | 0.7381 | 0.7381 | +0.0000 |
| 1,000 | 0.2857 | 0.2857 | +0.0000 |
| 10,000 | 0.1095 | 0.1131 | +0.0036 |
| 100,000 | 0.0000 | 0.0143 | +0.0143 |
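For readers less familiar with the metric, MRR@5 is conventionally the mean, over all queries, of the reciprocal rank of the first relevant result, counting a query as 0 when no relevant result appears in the top 5. The sketch below follows that textbook definition; whether Suite C computes it in exactly this way is an assumption, and `reciprocalRankAtK`, `mrrAtK`, and the `RankedQuery` shape are hypothetical names.

```typescript
// Hypothetical query result: ranked result ids plus the set of relevant ids.
interface RankedQuery {
  ranked: string[];
  relevant: Set<string>;
}

// Reciprocal rank of the first relevant hit within the top k, else 0.
function reciprocalRankAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const limit = Math.min(k, ranked.length);
  for (let i = 0; i < limit; i++) {
    const id = ranked[i];
    if (id !== undefined && relevant.has(id)) return 1 / (i + 1);
  }
  return 0;
}

// Mean reciprocal rank at cutoff k (default 5); an empty query set scores 0.
function mrrAtK(queries: RankedQuery[], k = 5): number {
  if (queries.length === 0) return 0;
  let sum = 0;
  for (const q of queries) sum += reciprocalRankAtK(q.ranked, q.relevant, k);
  return sum / queries.length;
}
```

Under this definition a target pushed from rank 1 to rank 6 by noise contributes 0 rather than 1, which is why the score can fall to 0.0000 at large corpus sizes even when the target is still retrievable further down the list.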

Only 9 of 76 relevance metrics changed, all tiny, all at 10k/100k — from thinning incidental noise-vs-noise exact collisions, not from resolving the precision trap. The MRR ladder still collapses 0.74 → 0.29 → 0.11 → 0.01.

Why (the #49 case): the corpus trap is near-duplicates (target.text + " - " + suffix) and entity-collisions. Exact-match dedup keys on normalized full text so it never groups a target with its near-duplicate, and the semantic pass needs stored embeddings that neither Suite C nor recall dedup --execute produces. Exact + stored-embedding dedup is structurally unable to move precision-under-noise; lightweight entity keying (#49) or an embedding-backed semantic pass is required.
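The structural point can be illustrated with a small sketch of exact-match grouping on a normalized full-text key. The normalization used here (trim, lowercase, collapse whitespace) and the names `normalizeKey` and `groupExactDuplicates` are assumptions for illustration only; the real normalization in the dedup code is not shown in this PR. Under any full-text key, a near-duplicate built as target.text + " - " + suffix yields a different key from its target, so the two are never grouped.

```typescript
// Assumed normalization: trim, lowercase, collapse runs of whitespace.
function normalizeKey(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Groups texts that share a normalized key; only groups with 2+ members are duplicates.
function groupExactDuplicates(texts: string[]): string[][] {
  const groups = new Map<string, string[]>();
  for (const t of texts) {
    const key = normalizeKey(t);
    const existing = groups.get(key);
    if (existing) existing.push(t);
    else groups.set(key, [t]);
  }
  return Array.from(groups.values()).filter((g) => g.length > 1);
}

// Illustrative corpus entries (made up for this sketch).
const target = "Rotate the API key";
const nearDuplicate = target + " - " + "staging notes";
const exactCopy = "  rotate the  API key ";
```

Here `groupExactDuplicates([target, nearDuplicate, exactCopy])` groups `target` with `exactCopy` but leaves `nearDuplicate` alone, so the entry that competes with the target at query time survives the pass.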

Committed artifacts

  • benchmarks/results/2026-06-18T08-53-25-suite-C.jsonl / .md — the post-dedup run
  • benchmarks/results/2026-06-18T08-53-25-suite-C.diff.md — full diff vs baseline
  • benchmarks/results/2026-06-18T08-53-25-suite-C.manifest.md — reproducible run manifest

Verification

bun run lint clean; bun test tests/benchmarks/ tests/lib/dedup.property.test.ts tests/commands/dedup.test.ts → 69 pass (incl. a new test asserting the dedup pass is opt-in and wired).

…ine (#78)

Add an opt-in dedup pass to runSuiteC (RECALL_BENCH_C_DEDUP=1) that mirrors
'recall dedup --execute' (exact + stored-embedding semantic, non-destructive
marking) over each seeded corpus between seeding and measurement. Default OFF,
so the no-dedup baseline path is unchanged — verified by reproducing the
committed 2026-06-11 baseline byte-for-byte across all 76 relevance metrics.

Commit the post-dedup results (jsonl + md), the diff vs baseline, and a
reproducible run manifest. Headline: exact-only dedup moves MRR negligibly
(+0/+0/+0.0036/+0.0143 across 100/1k/10k/100k); the precision-under-noise
collapse is intact because the corpus trap is near-duplicate/entity-collision
noise, not exact duplicates, and the semantic pass is skipped (no stored
embeddings). This is the empirical evidence gate #49 was waiting on.

edheltzel merged commit a80943b into main on Jun 18, 2026
2 checks passed