
Move AsyncMinHashLSH from experimental to datasketch.aio #314

Open
ekzhu wants to merge 14 commits into master from
claude/create-pr-update-description-1Uyvr

Conversation

@ekzhu
Owner

@ekzhu ekzhu commented Apr 14, 2026

Summary

Promotes AsyncMinHashLSH out of the datasketch.experimental namespace into a first-class datasketch.aio submodule, deprecates the old import paths, and runs the storage-backend integration tests directly on PRs. This is a re-submission of #301 from a new branch, with the description rewritten to accurately reflect the changes.

What does this PR do?

New datasketch.aio submodule

  • Adds datasketch/aio/__init__.py, datasketch/aio/lsh.py, and moves datasketch/experimental/aio/storage.py to datasketch/aio/storage.py.
  • Exposes AsyncMinHashLSH, AsyncMinHashLSHInsertionSession, and AsyncMinHashLSHDeleteSession from datasketch.aio.
  • Re-exports AsyncMinHashLSH from the top-level datasketch package so from datasketch import AsyncMinHashLSH works. The import itself is always safe; instantiation still requires motor / redis.asyncio to be installed.

Deprecation of datasketch.experimental

  • datasketch.experimental, datasketch.experimental.aio, and datasketch.experimental.aio.lsh now re-export from datasketch.aio and emit a DeprecationWarning on attribute access.
  • Warnings are wired up via PEP 562 __getattr__ and cached back into globals() so the warning fires exactly once per process and intermediate package imports don't trigger redundant warnings.
  • TYPE_CHECKING re-exports keep static analyzers happy without importing the real symbols eagerly.
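The shim pattern these bullets describe can be sketched in a self-contained form. `old_pkg` and `Thing` below are illustrative names, not datasketch's actual modules; the real shims re-export from `datasketch.aio`:

```python
import sys
import types
import warnings

# Build a throwaway module to stand in for the deprecated package.
shim = types.ModuleType("old_pkg")
_REAL = object()  # stand-in for the symbol re-exported from its new home

def _shim_getattr(name):
    # PEP 562: called only when `name` is not already in the module dict.
    if name == "Thing":
        warnings.warn(
            "old_pkg.Thing is deprecated; import it from new_pkg instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Cache into the module's globals so later accesses bypass this
        # hook entirely -- the warning fires exactly once per process.
        shim.__dict__["Thing"] = _REAL
        return _REAL
    raise AttributeError(name)

shim.__getattr__ = _shim_getattr
sys.modules["old_pkg"] = shim

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    a = shim.Thing  # triggers the hook and one DeprecationWarning
    b = shim.Thing  # served from the cached globals(); no second warning
```

After the block runs, `caught` holds exactly one `DeprecationWarning` and `a is b`, which is the "fires exactly once" behavior the bullet describes.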

Packaging

  • Adds a new aio optional-dependency group in pyproject.toml (aiounittest, motor>3.6.0).
  • Keeps experimental_aio as a duplicated alias for backwards compatibility (PEP 621 does not support referencing another optional-dependency group).
  • Drops */experimental/* from the coverage omit list now that the module is no longer experimental-only.

Docs

  • docs/documentation.rst and docs/lsh.rst updated to point at datasketch.aio.AsyncMinHashLSH.
  • Removes the "currently experimental, interface may change" note from the async LSH docs.

CI

  • test-cassandra.yml, test-mongo.yml, and test-redis.yml now trigger on push / pull_request against master directly instead of cascading off a completed Test workflow run, so integration tests actually gate PRs.

Tests

  • test/aio/test_lsh.py updated to import from the new location and gains a test for the get_subset_counts async method.

Backwards compatibility

Old imports still work:

from datasketch.experimental import AsyncMinHashLSH            # DeprecationWarning
from datasketch.experimental.aio import AsyncMinHashLSH        # DeprecationWarning
from datasketch.experimental.aio.lsh import AsyncMinHashLSH    # DeprecationWarning

New preferred imports:

from datasketch.aio import AsyncMinHashLSH
from datasketch import AsyncMinHashLSH

Checklist

  • Are unit tests passing?
  • Documentation added/updated for all public APIs?
  • Is this a breaking change? No — old imports keep working behind a DeprecationWarning.

https://claude.ai/code

ekzhu and others added 9 commits January 19, 2026 23:08
- Create new `datasketch/aio/` submodule with AsyncMinHashLSH,
  AsyncMinHashLSHInsertionSession, and AsyncMinHashLSHDeleteSession
- Add optional AsyncMinHashLSH export to main datasketch package
- Deprecate experimental module with warnings pointing to new location
- Update tests to use new import path
- Rename `experimental_aio` dependency group to `aio` (keep alias)
- Remove experimental from coverage omit

New import paths:
- `from datasketch.aio import AsyncMinHashLSH`
- `from datasketch import AsyncMinHashLSH`

Old imports still work but emit DeprecationWarning.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Sort __all__ lists alphabetically (RUF022)
- Add noqa: E402 for intentional imports after deprecation warnings
- Add blank line after docstring Example section (D413)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Bug fixes:
- Add missing await on self.execute() in AsyncRedisBuffer
- Fix getmany() to correctly use Redis pipeline (don't await individual commands)
- Add getmany() method to AsyncMongoListStorage for get_subset_counts support
- Fix typo: "AsyncMinHash" -> "AsyncMinHashLSH" in error message
- Remove dead code: await self.keys (storage objects aren't awaitable)

Performance:
- Change tuple buffers to lists in AsyncMongoBuffer for O(1) append

Documentation:
- Fix module docstring example to use prepickle=True for string keys
- Simplify insertion_session and delete_session docstring examples

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This test exercises the getmany() method that was fixed in a previous
commit, ensuring proper coverage for the async implementation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change workflow triggers from workflow_run (which runs after main Test
completes and doesn't show as PR checks) to push/pull_request triggers.
This makes integration test results visible directly in PR status checks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change workflow trigger from workflow_run to push/pull_request triggers
for consistency with MongoDB and Redis test workflows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Initial plan

* Update documentation and fix RedisStorage import issue

Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>

* Add clarifying comments for optional Redis dependency handling

Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>
…ecations (#313)

- Delete orphaned datasketch/experimental/aio/storage.py. Nothing imports
  from it (the shims all forward to datasketch.aio.lsh), and the stale
  file still contained the pre-fix bugs that datasketch/aio/storage.py
  silently fixed: un-awaited AsyncRedisBuffer.execute(), tuple-based
  buffer stacks, missing AsyncMongoListStorage.getmany, and a broken
  pipeline getmany impl.

- Drop the dead try/except ImportError fallback around
  `from datasketch.aio import AsyncMinHashLSH` in datasketch/__init__.py.
  datasketch.aio.storage already degrades gracefully when motor/redis are
  absent, so the fallback was unreachable and only served to rebind the
  public symbol to None for type checkers.

- Fix AsyncMinHashLSH.get_subset_counts under prepickle=True: query keys
  now get pickled to match the stored representation, so the lookup
  actually finds the inserted keys. Adds a regression test.

- Refactor the experimental deprecation shims to emit exactly one
  DeprecationWarning per deprecated import path (down from three). Uses
  PEP 562 __getattr__ with globals() caching so the warning fires on
  first attribute access and subsequent accesses skip the hook. Static
  analyzers see the __all__ names via TYPE_CHECKING-guarded imports.

- Minor: add asyncio.run(main()) to the datasketch.aio docstring example
  so it's copy-paste runnable, and add a KEEP IN SYNC comment on the
  duplicated experimental_aio extras group in pyproject.toml.

Co-authored-by: Claude <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request transitions the asynchronous MinHash LSH implementation from an experimental status to a stable datasketch.aio module, including deprecation proxies for the old paths. The changes also introduce a getmany method for storage backends and update the project configuration to include an aio extra. Feedback focuses on resolving a potential type error regarding the storage basename, restoring explicit storage initialization to prevent runtime issues, and optimizing MongoDB's getmany implementation with asyncio.gather.

Comment thread datasketch/aio/lsh.py
Comment thread datasketch/aio/lsh.py
Comment thread datasketch/aio/storage.py Outdated
Addresses the gemini-code-assist review comment on PR #314: the previous
sequential loop issued N round trips in series. asyncio.gather runs the
underlying find() queries concurrently.

https://claude.ai/code
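The change this commit describes, collapsing N sequential round trips into concurrent queries, can be sketched without a live MongoDB. `fetch_one` below is a hypothetical stand-in for the per-key motor `find()` call:

```python
import asyncio

async def fetch_one(key):
    # Stand-in for one Mongo round trip (the real code would issue a
    # motor find() against the collection for this key).
    await asyncio.sleep(0)
    return (key, [f"value-for-{key}"])

async def getmany(*keys):
    # asyncio.gather launches all per-key queries concurrently instead
    # of awaiting each one in series inside a loop.
    results = await asyncio.gather(*(fetch_one(k) for k in keys))
    return [vals for _key, vals in results]

result = asyncio.run(getmany("a", "b"))  # → [['value-for-a'], ['value-for-b']]
```

With real network latency, the gathered version takes roughly one round-trip time instead of N.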
@ekzhu
Owner Author

ekzhu commented Apr 14, 2026

@dipeshbabu could you help review this PR?

Comment thread datasketch/aio/storage.py
async def getmany(self, *keys):
    pipe = self._redis.pipeline()
    pipe.multi()
    for key in keys:
Contributor


This now hardcodes lrange(), but AsyncRedisSetStorage inherits getmany() too. Before this, the call went through _get_items(), so the set-backed subclass used smembers() correctly. As written, unordered Redis storage will issue LRANGE against set keys and fail with WRONGTYPE.

Owner Author


Good catch — fixed in feead1d.

AsyncRedisListStorage.getmany now routes through _get_items, matching the existing itemcounts / _get_len pattern so AsyncRedisSetStorage's smembers override is used when the method is called on set-backed storage:

async def getmany(self, *keys):
    pipe = self._redis.pipeline()
    for key in keys:
        await self._get_items(pipe, self.redis_key(key))
    return await pipe.execute()

I also added a regression test (test_unordered_storage_getmany in test/aio/test_lsh.py) that calls getmany on an unordered hashtable storage, which would have hit WRONGTYPE against Redis sets under the old implementation.

On the CI suggestion — updated the three integration-test workflows (test-mongo.yml, test-redis.yml, test-cassandra.yml) to install the new aio extra instead of the deprecated experimental_aio alias. Both still resolve to the same dependency set, but using aio aligns CI with the new public API surface as you suggested.


Generated by Claude Code

@dipeshbabu
Contributor

There’s one blocking issue here in the Redis async storage path.

AsyncRedisListStorage.getmany() now calls lrange() directly. That works for the list-backed storage, but AsyncRedisSetStorage inherits the same method. Before this change, getmany() went through _get_items(), so the set-backed subclass used smembers() correctly. With the current version, unordered Redis storage will send LRANGE to set keys and fail with WRONGTYPE.

I think this should be fixed before merge, either by routing getmany() back through _get_items() or by overriding it in AsyncRedisSetStorage. Since datasketch.aio is now the public surface, it would be good to align CI/install coverage with the new aio extra too.

Hardcoding `LRANGE` in `AsyncRedisListStorage.getmany` broke the
inherited method on `AsyncRedisSetStorage`: the subclass overrides
`_get_items` to use `SMEMBERS`, but the pipeline in `getmany` bypassed
that override and would fail with `WRONGTYPE` against set keys.

Route `getmany` through `_get_items` so the subclass override applies,
following the existing `itemcounts`/`_get_len` pattern in the same file.

Add a regression test that calls `getmany` on an unordered (set-backed)
hashtable storage, which exercises the previously-broken path under
Redis.

Update the three integration-test workflows (mongo, redis, cassandra)
to install the new `aio` extra instead of the deprecated
`experimental_aio` alias, aligning CI install coverage with the new
public API surface.

https://claude.ai/code
Contributor

@dipeshbabu dipeshbabu left a comment


The Redis `getmany()` issue I flagged is fixed in `feead1d`; routing through `_get_items()` restores the `AsyncRedisSetStorage`/`SMEMBERS` behavior, and the added regression test covers that path. The integration workflows are also now installing the new `aio` extra, and the latest CI run is green.

I found one remaining blocker before promoting this out of `experimental`: `prepickle=True` is not applied consistently in the async public key APIs. `has_key()` and `_remove()` still use the caller's raw key, while `_insert()` stores the pickled key. This breaks `has_key()` and `remove()` for non-bytes keys with `prepickle=True`.

Please fix that and add regression coverage in `test/aio/test_lsh.py`, ideally under `TestAsyncMinHashLSHWithPrepickle` starting at line 583. The tests should cover `has_key()`, `remove()`, and ideally `delete_session.remove()` with a string key.

Comment thread datasketch/aio/lsh.py

async def has_key(self, key):
    """See :class:`datasketch.MinHashLSH`."""
    return await self.keys.has_key(key)
Contributor


This still misses prepickle=True for the public key APIs. _insert() pickles the key before storing it, and query() unpickles returned candidates, but has_key() receives the caller's raw key here. With prepickle=True, this returns False for a key that was just inserted:

async with AsyncMinHashLSH(storage_config=storage_config, prepickle=True) as lsh:
    await lsh.insert("a", m)
    assert await lsh.has_key("a")  # currently False

_remove() has the same issue on lines 291-300: it checks and fetches/removes using the raw key, so await lsh.remove("a") raises ValueError after inserting "a" with prepickle=True.

Please normalize/pickle keys consistently for has_key() and _remove(). Be careful not to double-pickle the key in _insert(), since _insert() currently calls has_key() after it has already pickled the key.

Owner Author


Verified the bug and fixed in 85907e7.

You're right on both counts — has_key() was forwarding the raw user key to storage, and _remove() was using the raw key for the existence check, the keys.get lookup, and the deletes. Under prepickle=True storage holds pickled bytes, so:

await lsh.insert("a", m)
await lsh.has_key("a")   # returned False
await lsh.remove("a")    # raised ValueError

I mirrored the sync MinHashLSH._remove pattern (datasketch/lsh.py:520-522): pickle inside _remove first, then call storage primitives directly so the existence check, lookup, and deletes all see the stored form.

To handle the double-pickling concern in _insert: has_key() now pickles internally for the public API, but _insert was already pickling the key before its has_key call. Rather than depend on subtle ordering, I switched _insert's duplicate-check call to self.keys.has_key(key) directly, since key is already in its stored (pickled) form at that point.

Added three regression tests under TestAsyncMinHashLSHWithPrepickle covering exactly the surface you called out: has_key, remove, and delete_session.remove with string keys (all parameterized over both mongo and redis storage).
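The normalization described above can be sketched with stub classes. `KeysStub` and `LSHStub` are illustrative stand-ins, not datasketch's real storage or LSH classes; the point is only that the public API pickles before lookup while the internal duplicate check uses the already-pickled key:

```python
import asyncio
import pickle

class KeysStub:
    # Minimal async key store; holds whatever bytes it is given.
    def __init__(self):
        self._d = {}
    async def insert(self, key):
        self._d[key] = True
    async def has_key(self, key):
        return key in self._d

class LSHStub:
    def __init__(self, prepickle=True):
        self.prepickle = prepickle
        self.keys = KeysStub()

    async def insert(self, key):
        if self.prepickle:
            key = pickle.dumps(key)
        # Duplicate check goes straight to storage with the already-
        # pickled key, avoiding a double pickle via the public has_key().
        if await self.keys.has_key(key):
            raise ValueError("key exists")
        await self.keys.insert(key)

    async def has_key(self, key):
        if self.prepickle:
            key = pickle.dumps(key)  # normalize to the stored form
        return await self.keys.has_key(key)

async def main():
    lsh = LSHStub(prepickle=True)
    await lsh.insert("a")
    return await lsh.has_key("a")

found = asyncio.run(main())  # → True (was False before the fix)
```

Without the `pickle.dumps` in `has_key`, the lookup compares the raw string `"a"` against stored pickled bytes and misses, which is exactly the reported bug.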


Generated by Claude Code

Contributor


@ekzhu, looks good now; I didn't see any blockers.

claude added 3 commits April 16, 2026 21:49
`AsyncMinHashLSH.has_key()` forwarded the caller's raw key to storage,
and `_remove()` reused the raw key for the existence check, hashtable
lookup, and the delete itself. Under `prepickle=True` the storage holds
pickled bytes, so:

  await lsh.insert("a", m)
  await lsh.has_key("a")   # returned False
  await lsh.remove("a")    # raised ValueError

Mirror the sync `MinHashLSH._remove` pattern: pickle the key inside
`_remove` first and call the storage primitives (`self.keys.has_key`,
`self.keys.get`, `self.keys.remove`, `hashtable.remove_val`) directly so
they all see the stored representation.

`has_key()` now pickles internally for the public API. `_insert()` was
calling `has_key()` after it had already pickled the key, which
accidentally worked but would have started double-pickling once
`has_key` learned to pickle on its own; switch `_insert`'s duplicate
check to `self.keys.has_key()` directly to keep that path single-pickle.

Add regression tests under `TestAsyncMinHashLSHWithPrepickle` covering
`has_key`, `remove`, and `delete_session.remove` with string keys.

https://claude.ai/code
Add four tests covering behaviors that are part of the public async
contract but were only tested implicitly (or not at all):

- test_insert_duplicate_raises: re-inserting an existing bytes key
  must raise ValueError
- test_insert_check_duplication_false: check_duplication=False bypass
  must allow re-insertion without error
- test_is_empty: is_empty() must flip False after insert and back to
  True after removing all keys
- test_insert_duplicate_prepickle: duplicate detection must also work
  for pickled string keys and the bypass must still apply

Design gaps (no async `merge` or `add_to_query_buffer`) are tracked
separately - they are missing APIs rather than missing tests and are
out of scope for this PR.

https://claude.ai/code
Pipeline command methods in redis.asyncio are synchronous (they queue
and return self). The previous approach routed through async _get_items
which awaited the pipeline return value — while Pipeline.__await__
makes this technically valid, it adds unnecessary async machinery to
what should be a synchronous queue operation.

Revert AsyncRedisListStorage.getmany to the original direct
pipe.lrange() pattern, and add an explicit getmany override in
AsyncRedisSetStorage that uses pipe.smembers(). This is the
"override in subclass" approach from the reviewer's suggestion, and
keeps pipeline command queuing purely synchronous.

https://claude.ai/code
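The "override in subclass" shape this commit describes can be sketched with a fake Redis client standing in for redis.asyncio. `FakeRedis`, `ListStorage`, and `SetStorage` below are illustrative, not the actual datasketch classes; command methods queue synchronously and only `execute()` is awaited:

```python
import asyncio

class FakeRedis:
    # Records queued pipeline commands instead of talking to Redis.
    def __init__(self):
        self.calls = []
    def pipeline(self):
        return self  # the fake doubles as its own pipeline
    def lrange(self, key, start, end):
        self.calls.append(("LRANGE", key))
        return self  # synchronous queuing, mirroring redis.asyncio
    def smembers(self, key):
        self.calls.append(("SMEMBERS", key))
        return self
    async def execute(self):
        return list(self.calls)

class ListStorage:
    def __init__(self, redis):
        self._redis = redis
    def redis_key(self, key):
        return b"prefix:" + key
    async def getmany(self, *keys):
        pipe = self._redis.pipeline()
        for key in keys:
            pipe.lrange(self.redis_key(key), 0, -1)  # list-backed path
        return await pipe.execute()

class SetStorage(ListStorage):
    async def getmany(self, *keys):
        # Explicit override so set-backed keys never see LRANGE,
        # avoiding the WRONGTYPE failure mode flagged in review.
        pipe = self._redis.pipeline()
        for key in keys:
            pipe.smembers(self.redis_key(key))
        return await pipe.execute()

calls = asyncio.run(SetStorage(FakeRedis()).getmany(b"k1", b"k2"))
# calls → [("SMEMBERS", b"prefix:k1"), ("SMEMBERS", b"prefix:k2")]
```

Compared with routing through an async `_get_items`, this keeps pipeline command queuing purely synchronous, as the commit message argues, at the cost of one duplicated loop in the subclass.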