Maintenance of diskann-providers #899

@hildebrandmw

Description

The diskann-providers crate has become a dumping ground for architectural cruft. Containing it in one place is useful, but it blurs what actually belongs here. Here's my opinionated take on how to tackle this.

I think four things need to happen before we can cleanly extract the inmem providers into their own crate (and similarly split out bf-tree). Ordered from easy/mechanical to architecturally involved:

  • Cruft removal: consolidate and clean up.
  • Migrate unit/integration tests in diskann_async.rs to diskann.
  • Move reasonable portions of the PQ code to diskann-quantization.
  • Finalize our comprehensive saving and loading story.
  • Finally, make a clean diskann-inmem crate (and others).

Importantly, I don't think much outside the following justifies creating an auxiliary crate — at least until several rounds of cleansing and extraction have been performed. (I reserve the right to be wrong.)

  • MinMaxRepr: Maybe we should group this with the implementations in diskann and feature MinMax more prominently?
  • Saving/Loading overhaul. This is a great candidate for a new crate.

Cruft Removal

There are a bunch of utilities that have either been superseded or are straight-up dead code (not flagged as such because of public exports). These should be audited and either deleted or migrated to their newer counterparts. A few quick examples:

  • utils/normalizing_util.rs: Is this still used?
  • utils/kmeans.rs: KMeans has been replaced for PQ-related purposes by the faster implementation in diskann-quantization. The last user of this flow is the disk-index partitioning. It's possible that the BLAS-based implementation in diskann-providers is faster than the approach taken in diskann-quantization for full-dimensional data. To that end, we should do one of the following:
    • Audit and replace the use of KMeans in disk partitioning to use diskann-quantization and get rid of the diskann-providers implementation.
    • Upgrade the implementation in diskann-quantization to be on-par for full-dimensional data.
    • Move the implementation to diskann-disk where it's actually used.
  • utils/math_util.rs: See the above kmeans use. Also, vector generation with norm can be replaced by the implementations in diskann-utils.
  • Rethink file-centric flow? Many of the implementations in diskann-providers/utils derive their staying power because they work in-place on raw files. Is this still needed and if so, is there a cleaner way of decoupling the implementation from files?
  • rayon_util.rs: The purpose of rayon_util is to make the rayon thread pool explicit and to allow inspection of the number of threads available.
    One side-effect is that traits are designed to either take an explicit thread pool or an integer for the number of threads.
    Because this is done with generics, it encourages monomorphization high in the stack.
    I vote we either always take thread pools to make this explicit to the callers, or use an enum for this to cut down on generics.
  • Synthetic/structured data: Audit the users and move synthetic data generation to more appropriate locations (diskann-tools already has its own synthetic data creation?).
    Structured data like the grid points already have a replacement in diskann.
    We can remove the implementations in diskann-providers.
  • model/scratch/pq_scratch.rs: I'm pretty sure the only ultimate users here are diskann-disk and that this data structure and its transitive dependencies (quantizer_preprocess.rs) can all be moved to diskann-disk.
  • model/graph/graph_data_model/adjacency_list.rs: This has almost been completely replaced with the struct in diskann/graph, but diskann-disk apparently uses the interesting conversion from &[u8]. This can either be moved to diskann-disk, or diskann-disk can be updated to use the implementation in diskann/graph.
  • model/graph/traits/graph_data_types.rs: This type should no longer be needed by non-diskann-disk code (and if it is, changing it to a raw T: VectorRepr should be straightforward).
    We should remove this type from diskann-providers and update remaining uses within diskann-providers to take a vector type, ID type, and associated data type independently as needed.
    The problem with using this large list of associated types to instantiate generic functions that only use a portion of them is that it can lead to redundant monomorphizations, and it makes it unclear at use sites which associated data types are actually used.
  • Get rid of AlignedBoxWithSlice. Its implementation has literally been replaced by diskann-quantization::alloc::Poly. Its one non-trivial method, split_into_nonoverlapping_mut_slices, can simply be replaced with rayon's par_chunks_mut.

Test Migration

  • Migrate tests. Currently, diskann_async.rs is pulling double-duty as tests for the inmem index and tests for diskann/graph/index.rs.
    The test infrastructure in diskann has gotten much better for running such unit tests.
    All unit tests should be migrated to diskann and use appropriate baselines.

It might not be straightforward to decouple the graph-invariant tests from the inmem-specific code, but using the test provider in diskann will provide much better checks on the indexing code.

If we still want inmem tests, we should look carefully at using a trait object to wrap the code under test, to avoid the insane number of test monomorphizations currently present in diskann_async.rs.

I do question the usefulness of these tests when it comes to providing long-term guarantees. For really protecting the integrity of the inmem providers in an end-to-end context, I think we'll want the upcoming native A/B test infrastructure in diskann-benchmark, using checked-in baselines as a basis for regression testing.

PQ

Much of the PQ code in diskann-providers predates the diskann-quantization crate. Some items are candidates for moving to diskann-quantization or elsewhere.

What to do with FixedChunkPQTable

I really do not like this data structure. The way it stores the PQ pivot data is almost the worst possible layout for any kind of efficient processing.
Ideally, we'd use something more like the TransposedTable in diskann-quantization for distance table population (though this comes at the potential cost of slower quant-quant distances).
However, there are a few issues that keep us from being able to just do this:

  1. For historical reasons, a centroids vector has been used to center PQ data.
    For vanilla PQ based on L2 distances between centers, an affine shift is not really useful.
    However, we need to support this shift for backwards compatibility with some implementations.
    I'd be comfortable moving application of this shift elsewhere in the pre-processing stack, but I do not think diskann-quantization should be obligated to support this.
  2. The state of OPQ is pretty nebulous at this point. As far as I know, no one is using it and I'd be surprised if it actually worked correctly. We should make a decision on whether we actually want to support this or if we should simply ditch it. The latter would greatly simplify a lot of code.
  3. The FixedChunkPQTable supports quant-quant distances. The implementation is efficient-ish, but could be much better. We may consider building a distance table specifically to accelerate quant-quant distances, but this would be large (at least 64kB per chunk).

Candidates for moving to diskann-quantization

  • Distance computers (table lookup + bulk lookup): Even this is not necessarily straightforward because of memory pooling.

Candidates for moving out

  • Multi-PQ: This can likely be moved internally to where it is actually needed.

Open Questions

The code in pq_construction.rs is too opinionated about its file operations to justify moving it to a lower-level crate.

Save/Load Overhaul

Most of the code in storage/* is related to the various ad-hoc storage mechanisms used by various index implementations.
Suhas has been working on a proposal for a structured and unified way of saving and loading.
Hopefully, we can use this to dump the problematic SaveWith/LoadWith traits and get rid of the various ad-hoc methods strewn about.
I do not think it's worth trying to clean this up until the replacement is ready.

Extracting and Fortifying Inmem

There are currently several issues with the in-memory providers, not least of which is being buried 8 layers deep in a directory.

  • Deep-rooted unsafety in AlignedMemoryVectorStore (this is a hard problem to tackle efficiently).
  • Tricky generic bounds on some quantizers (this is largely diskann-quantization's fault, I'm sorry).
  • Sketchy APIs around start point creation/constructors.

These do not necessarily need to be fixed before creating a diskann-inmem crate, but some considerations must be made first.
Because of the reliance of the inmem providers on the PQ code, we either need to fix the PQ story before isolating, or make a new diskann-inmem that relies on diskann-providers until the other issues can be sorted out.
Similarly, the save/load story keeps these implementations largely coupled with the storage related code in diskann-providers.
Thus, without these underlying issues being addressed, a crate split wouldn't really make a semantic boundary.
