benchmarks: Suite C post-dedup re-run + diff vs 2026-06-11 baseline (#78) by edheltzel · Pull Request #88 · edheltzel/Recall

edheltzel · 2026-06-18T08:56:40Z

Closes #78 — the Hardening & Dedup Proof milestone capstone.

What this does

Runs recall dedup --execute over the seeded Suite C corpora, re-runs Suite C, and diffs against the committed benchmarks/results/2026-06-11T09-36-53-suite-C.jsonl baseline. The diff is dedup's efficacy report — the empirical evidence the parked entity-keying gate (#49) is waiting on.

Deps #63 (cross-run dedup safety, PR #79) and #70 (silent-zero hardening, PR #83) are already on main.

Mechanism (surgical, opt-in)

Adds a dedup pass to runSuiteC, gated by RECALL_BENCH_C_DEDUP=1 (default OFF). Between seedFixture() and the measured queries it runs the real dedup code path, mirroring recall dedup --execute exactly:

const plan = planDedup(db);                        // all tables, default 0.95 threshold, semantic on
applyDedupPlan(db, plan, { destructive: false });  // --execute → MARK (no --delete)

dedup_planned / dedup_marked counts are emitted as samples per corpus size, and the "Dedup was NOT run" caveat flips to a "Dedup WAS run" caveat when enabled.
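The opt-in gate described above can be sketched as follows. This is a minimal illustration, not the code in runSuiteC: planDedupStub, applyDedupPlanStub, maybeDedup, and the Sample shape are hypothetical stand-ins. The strict comparison against "1" and the empty sample list when the gate is off are assumptions.

```typescript
// Hypothetical stand-ins: the real planDedup/applyDedupPlan live in Recall
// and take a database handle; these stubs only model the counts.
type DedupPlan = { duplicateIds: number[] };
type Sample = { metric: string; corpusSize: number; value: number };

// Stub planner: treats every id it is handed as a planned duplicate.
function planDedupStub(dupes: number[]): DedupPlan {
  return { duplicateIds: dupes.slice() };
}

// Stub applier: models only the non-destructive MARK path and
// returns how many rows were marked.
function applyDedupPlanStub(plan: DedupPlan, opts: { destructive: boolean }): number {
  if (opts.destructive) throw new Error("sketch only models the MARK path");
  return plan.duplicateIds.length;
}

// Assumption: the gate is on only when the variable is exactly "1".
function dedupEnabled(env: Record<string, string | undefined>): boolean {
  return env["RECALL_BENCH_C_DEDUP"] === "1";
}

// Runs between seeding and the measured queries; default OFF leaves
// the baseline path untouched.
function maybeDedup(
  env: Record<string, string | undefined>,
  corpusSize: number,
  dupes: number[],
): { samples: Sample[]; caveat: string } {
  if (!dedupEnabled(env)) {
    return { samples: [], caveat: "Dedup was NOT run" };
  }
  const plan = planDedupStub(dupes);
  const marked = applyDedupPlanStub(plan, { destructive: false });
  return {
    samples: [
      { metric: "dedup_planned", corpusSize, value: plan.duplicateIds.length },
      { metric: "dedup_marked", corpusSize, value: marked },
    ],
    caveat: "Dedup WAS run",
  };
}
```

Passing the environment in as a parameter, rather than reading it globally, keeps the gate testable in both states without mutating process state.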

The run is real, not an all-zeros artifact

Silent-zero canary (#83, fix(benchmarks): harden Suite C against silently recording an all-zero baseline) passed and getLastSearchErrors() came back clean; the harness throws if either check fails.
Dedup exercised: marked 0 / 5 / 288 / 8,328 across 100 / 1k / 10k / 100k (8,621 total).
Inertness/reproducibility proof: re-running with dedup OFF reproduced the committed baseline byte-for-byte across all 76 relevance metrics, so every delta is attributable solely to dedup.
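The inertness check amounts to confirming that a dedup-OFF re-run changes no metric, so that any metric changed in the dedup-ON run is attributable to dedup. Below is a rough sketch of that comparison. The JSONL row shape (metric, corpus, value) and the helper names are assumptions, not the schema of the committed results files. It also compares parsed values, which is weaker than the byte-for-byte comparison reported above.

```typescript
type Metrics = Record<string, number>;

// Assumed row shape: {"metric": "...", "corpus": 100, "value": 0.7381}.
// The real Suite C JSONL schema is not shown in this PR.
function parseMetrics(jsonl: string): Metrics {
  const out: Metrics = Object.create(null);
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines
    const row = JSON.parse(trimmed) as { metric: string; corpus: number; value: number };
    out[row.metric + "@" + row.corpus] = row.value;
  }
  return out;
}

// Returns the sorted keys whose values differ, including keys that
// are present on only one side.
function changedMetrics(baseline: Metrics, candidate: Metrics): string[] {
  const seen: Record<string, true> = Object.create(null);
  for (const k of Object.keys(baseline)) seen[k] = true;
  for (const k of Object.keys(candidate)) seen[k] = true;
  return Object.keys(seen)
    .filter((k) => baseline[k] !== candidate[k])
    .sort();
}

// Values taken from the headline MRR@5 table; field names are hypothetical.
const baselineJsonl = [
  '{"metric":"mrr_at_5","corpus":100,"value":0.7381}',
  '{"metric":"mrr_at_5","corpus":10000,"value":0.1095}',
].join("\n");
const dedupOffJsonl = baselineJsonl; // a faithful re-run reproduces the baseline
const dedupOnJsonl = [
  '{"metric":"mrr_at_5","corpus":100,"value":0.7381}',
  '{"metric":"mrr_at_5","corpus":10000,"value":0.1131}',
].join("\n");

const inertDiff = changedMetrics(parseMetrics(baselineJsonl), parseMetrics(dedupOffJsonl));
const dedupDiff = changedMetrics(parseMetrics(baselineJsonl), parseMetrics(dedupOnJsonl));
```

An empty inertDiff is the precondition for reading dedupDiff as the effect of dedup alone.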

Headline result — MRR@5 (overall)

corpus	baseline	post-dedup	Δ
100	0.7381	0.7381	+0.0000
1,000	0.2857	0.2857	+0.0000
10,000	0.1095	0.1131	+0.0036
100,000	0.0000	0.0143	+0.0143

Only 9 of 76 relevance metrics changed, all tiny, all at 10k/100k — from thinning incidental noise-vs-noise exact collisions, not from resolving the precision trap. The MRR ladder still collapses 0.74 → 0.29 → 0.11 → 0.01.
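For readers unfamiliar with the metric, here is a small sketch of the standard definition of MRR@5: each query scores 1/rank of its first relevant hit within the top 5, or 0 if none appears, and the scores are averaged. Suite C's actual scoring code is not shown in this PR, so the function names and input shape are assumptions.

```typescript
type RankedQuery = { ranked: string[]; relevant: string[] };

// Reciprocal rank with cutoff k: 1/rank of the first relevant id
// within the top k results, or 0 if no relevant id appears there.
function reciprocalRankAtK(rankedIds: string[], relevantIds: string[], k: number): number {
  const top = rankedIds.slice(0, k);
  for (let i = 0; i < top.length; i++) {
    if (relevantIds.indexOf(top[i]) !== -1) return 1 / (i + 1);
  }
  return 0;
}

// Mean of the per-query reciprocal ranks; defined as 0 for an empty query set.
function meanReciprocalRankAtK(queries: RankedQuery[], k: number): number {
  if (queries.length === 0) return 0;
  let sum = 0;
  for (const q of queries) sum += reciprocalRankAtK(q.ranked, q.relevant, k);
  return sum / queries.length;
}

// Three illustrative queries: target at rank 1, at rank 2, and at rank 6
// (pushed out of the top 5 by noise, so it contributes 0).
const exampleQueries: RankedQuery[] = [
  { ranked: ["t1", "n1", "n2", "n3", "n4"], relevant: ["t1"] },
  { ranked: ["n1", "t2", "n2", "n3", "n4"], relevant: ["t2"] },
  { ranked: ["n1", "n2", "n3", "n4", "n5", "t3"], relevant: ["t3"] },
];
const exampleMrr = meanReciprocalRankAtK(exampleQueries, 5); // (1 + 0.5 + 0) / 3
```

The third query shows why noise is so damaging under a cutoff: a target displaced to rank 6 scores the same as a target that was never retrieved.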

Why (the #49 case): the corpus trap is near-duplicates (target.text + " - " + suffix) and entity-collisions. Exact-match dedup keys on normalized full text so it never groups a target with its near-duplicate, and the semantic pass needs stored embeddings that neither Suite C nor recall dedup --execute produces. Exact + stored-embedding dedup is structurally unable to move precision-under-noise; lightweight entity keying (#49) or an embedding-backed semantic pass is required.
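The structural limit is easy to see in miniature. In the sketch below, normalizeKey is an assumed normalization (lowercase, collapse whitespace, trim); the real Recall normalizer is not shown in this PR and may differ. The near-duplicate follows the target.text + " - " + suffix shape described above. The sample strings are invented for illustration.

```typescript
// Assumed normalization; the actual key function in Recall may differ.
function normalizeKey(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Groups texts whose normalized full text is identical and returns
// only the groups that contain more than one member.
function groupExact(texts: string[]): string[][] {
  const byKey: Record<string, string[]> = Object.create(null);
  for (const t of texts) {
    const key = normalizeKey(t);
    if (!byKey[key]) byKey[key] = [];
    byKey[key].push(t);
  }
  return Object.keys(byKey)
    .map((k) => byKey[k])
    .filter((group) => group.length > 1);
}

const target = "Rotate the API key every 90 days";
const nearDuplicate = target + " - " + "draft note"; // corpus trap shape
const exactCopy = "  rotate the API   key every 90 days ";

// Only the whitespace/case variant collapses onto the target; the suffixed
// near-duplicate keeps a distinct key and survives dedup.
const exactGroups = groupExact([target, nearDuplicate, exactCopy]);
```

Any suffix changes the full-text key, so no amount of normalization short of entity-level or embedding-level comparison would place the near-duplicate in the target's group.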

Committed artifacts

benchmarks/results/2026-06-18T08-53-25-suite-C.jsonl / .md — the post-dedup run
benchmarks/results/2026-06-18T08-53-25-suite-C.diff.md — full diff vs baseline
benchmarks/results/2026-06-18T08-53-25-suite-C.manifest.md — reproducible run manifest

Verification

bun run lint clean; bun test tests/benchmarks/ tests/lib/dedup.property.test.ts tests/commands/dedup.test.ts → 69 pass (incl. a new test asserting the dedup pass is opt-in and wired).

…ine (#78)

Add an opt-in dedup pass to runSuiteC (RECALL_BENCH_C_DEDUP=1) that mirrors 'recall dedup --execute' (exact + stored-embedding semantic, non-destructive marking) over each seeded corpus between seeding and measurement. Default OFF, so the no-dedup baseline path is unchanged; this was verified by reproducing the committed 2026-06-11 baseline byte-for-byte across all 76 relevance metrics.

Commit the post-dedup results (jsonl + md), the diff vs baseline, and a reproducible run manifest.

Headline: exact-only dedup moves MRR negligibly (+0/+0/+0.0036/+0.0143 across 100/1k/10k/100k); the precision-under-noise collapse is intact because the corpus trap is near-duplicate/entity-collision noise, not exact duplicates, and the semantic pass is skipped (no stored embeddings). This is the empirical evidence gate #49 was waiting on.

edheltzel added 2 commits June 18, 2026 04:56

Merge branch 'main' into worktree-benchmarks+suite-c-post-dedup-78

4400c83

edheltzel merged commit a80943b into main Jun 18, 2026
2 checks passed
