benchmarks: Suite C post-dedup re-run + diff vs 2026-06-11 baseline (#78) #88
Merged
…ine (#78)

Add an opt-in dedup pass to runSuiteC (RECALL_BENCH_C_DEDUP=1) that mirrors 'recall dedup --execute' (exact + stored-embedding semantic, non-destructive marking) over each seeded corpus between seeding and measurement. Default OFF, so the no-dedup baseline path is unchanged, verified by reproducing the committed 2026-06-11 baseline byte-for-byte across all 76 relevance metrics.

Commit the post-dedup results (jsonl + md), the diff vs baseline, and a reproducible run manifest.

Headline: exact-only dedup moves MRR negligibly (+0/+0/+0.0036/+0.0143 across 100/1k/10k/100k); the precision-under-noise collapse is intact because the corpus trap is near-duplicate/entity-collision noise, not exact duplicates, and the semantic pass is skipped (no stored embeddings). This is the empirical evidence gate #49 was waiting on.
Closes #78 — the Hardening & Dedup Proof milestone capstone.
What this does
Runs recall dedup --execute over the seeded Suite C corpora, re-runs Suite C, and diffs against the committed benchmarks/results/2026-06-11T09-36-53-suite-C.jsonl baseline. The diff is dedup's efficacy report: the empirical evidence the parked entity-keying gate (#49) is waiting on.

Deps #63 (cross-run dedup safety, PR #79) and #70 (silent-zero hardening, PR #83) are already on main.

Mechanism (surgical, opt-in)
Adds a dedup pass to runSuiteC, gated by RECALL_BENCH_C_DEDUP=1 (default OFF). Between seedFixture() and the measured queries it runs the real dedup code path, mirroring recall dedup --execute exactly: an exact pass plus a stored-embedding semantic pass, with non-destructive marking.

dedup_planned/dedup_marked counts are emitted as samples per corpus size, and the "Dedup was NOT run" caveat flips to a "Dedup WAS run" caveat when enabled.

The run is real, not an all-zeros artifact

getLastSearchErrors() clean (the harness throws on either).

Headline result: MRR@5 (overall)
Only 9 of 76 relevance metrics changed, all small and all at 10k/100k (overall MRR moves by +0.0036 at 10k and +0.0143 at 100k, and not at all at 100 or 1k). The changes come from thinning incidental noise-vs-noise exact collisions, not from resolving the precision trap. The MRR ladder still collapses 0.74 → 0.29 → 0.11 → 0.01.
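A minimal sketch of why exact-match keying can only thin exact collisions. It assumes a hypothetical normalizer (lowercase, collapse whitespace, trim); normalizeText and exactDuplicateGroups are illustrative names, not the functions used by recall dedup, whose actual normalization may differ. The near-duplicate is built in the target.text + " - " + suffix shape used by the corpus trap.

```typescript
// Hypothetical normalizer (assumption): lowercase, collapse whitespace, trim.
function normalizeText(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Group texts by their normalized full-text key and keep only groups
// with more than one member, i.e. the exact-duplicate groups.
function exactDuplicateGroups(texts: string[]): string[][] {
  const byKey = new Map<string, string[]>();
  for (const text of texts) {
    const key = normalizeText(text);
    const group = byKey.get(key) ?? [];
    group.push(text);
    byKey.set(key, group);
  }
  return Array.from(byKey.values()).filter((group) => group.length > 1);
}

// Illustrative corpus (made-up strings, not taken from Suite C).
const target = "Alice moved to Berlin in 2024";
const corpus = [
  target,
  target + " - " + "archived note", // near-duplicate: different full-text key
  "unrelated noise entry",
  "Unrelated  noise entry", // collides with the line above after normalization
];

const groups = exactDuplicateGroups(corpus);
const targetGrouped = groups.some((group) => group.includes(target));

console.log(groups.length); // 1 (only the noise-vs-noise pair)
console.log(targetGrouped); // false (the target never joins its near-duplicate)
```

Under these assumptions the only group found is the noise pair, so removing it shrinks the corpus slightly without touching the entries that compete with the target.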
Why (the #49 case): the corpus trap is near-duplicates (target.text + " - " + suffix) and entity-collisions. Exact-match dedup keys on normalized full text so it never groups a target with its near-duplicate, and the semantic pass needs stored embeddings that neither Suite C nor recall dedup --execute produces. Exact + stored-embedding dedup is structurally unable to move precision-under-noise; lightweight entity keying (#49) or an embedding-backed semantic pass is required.

Committed artifacts
benchmarks/results/2026-06-18T08-53-25-suite-C.jsonl/.md: the post-dedup run
benchmarks/results/2026-06-18T08-53-25-suite-C.diff.md: full diff vs baseline
benchmarks/results/2026-06-18T08-53-25-suite-C.manifest.md: reproducible run manifest

Verification

bun run lint clean; bun test tests/benchmarks/ tests/lib/dedup.property.test.ts tests/commands/dedup.test.ts → 69 pass (incl. a new test asserting the dedup pass is opt-in and wired).
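
The opt-in property that the new test asserts can be sketched roughly as below. isDedupEnabled and caveatFor are hypothetical helpers written for illustration; how runSuiteC actually reads RECALL_BENCH_C_DEDUP is not shown here, and the sketch assumes that only the literal value "1" enables the pass.

```typescript
// Hypothetical helpers for illustration only; not the harness's real code.
type Env = Record<string, string | undefined>;

// Assumption: only the literal "1" turns the pass on; unset or any other
// value leaves the default (OFF) baseline path in effect.
function isDedupEnabled(env: Env): boolean {
  return env["RECALL_BENCH_C_DEDUP"] === "1";
}

// The caveat wording follows the two variants quoted in the PR description.
function caveatFor(dedupEnabled: boolean): string {
  return dedupEnabled ? "Dedup WAS run" : "Dedup was NOT run";
}

const offByDefault = isDedupEnabled({});
const onWhenSet = isDedupEnabled({ RECALL_BENCH_C_DEDUP: "1" });

console.log(offByDefault, caveatFor(offByDefault)); // false Dedup was NOT run
console.log(onWhenSet, caveatFor(onWhenSet)); // true Dedup WAS run
```

Passing the environment in as a parameter, rather than reading a global, keeps a check like this easy to test without mutating process state.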