Fix buffered MinHashLSH query aggregation across storage backends#307
Fix buffered MinHashLSH query aggregation across storage backends#307dipeshbabu wants to merge 4 commits intoekzhu:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the collect_query_buffer method in datasketch/lsh.py to correctly process buffered queries by unioning candidates across bands for each query before intersecting across the buffer. It also adds a test case to verify that buffered results match direct query results. Feedback identifies a potential bug where the use of zip could truncate results when using the Cassandra storage backend and suggests a more efficient implementation for the set().union() call.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #307 +/- ##
=========================================
Coverage ? 77.52%
=========================================
Files ? 15
Lines ? 2060
Branches ? 0
=========================================
Hits ? 1597
Misses ? 463
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
Fix
MinHashLSH.collect_query_buffer()so buffered queries aggregate candidates the same way as repeated calls toquery(), including whenusing the Cassandra storage backend.
Problem
The buffered query path was intersecting per-band result sets directly. That is stricter than normal LSH query behavior, which unions
candidates across bands for a query and only then intersects across multiple buffered queries.
This caused valid candidates to be dropped when using buffered queries.
The Cassandra backend also exposed a related issue in buffered selects: repeated buffered lookups with the same hash key could be collapsed
instead of preserving one result list per buffered query. That breaks per-query aggregation logic.
Fix
prepicklebehaviorTest
collect_query_buffer()returns the same candidates asquery()for a case where the old implementationdropped a valid match
Verification
[0, 1]uvx ruff check .uv run pytestin Linux/WSL:158 passed, 76 skippedlychee