Maintenance of diskann-providers #899

@hildebrandmw

Description

The diskann-providers crate has become a dumping ground for architectural cruft. Containing it in one place is useful, but it blurs what actually belongs here. Here's my opinionated take on how to tackle this.

I think four things need to happen before we can cleanly extract the inmem providers into their own crate (and similarly split out bf-tree). Ordered from easy/mechanical to architecturally involved:

  • Cruft removal: consolidate and clean up.
  • Migrate unit/integration tests in diskann_async.rs to diskann.
  • Move reasonable portions of the PQ code to diskann-quantization.
  • Finalize our comprehensive saving and loading story.
  • Finally, make a clean diskann-inmem crate (and others).

Importantly, I don't think much outside the following justifies creating an auxiliary crate — at least until several rounds of cleansing and extraction have been performed. (I reserve the right to be wrong.)

  • MinMaxRepr: Maybe we should group this with the implementations in diskann and feature MinMax more prominently?
  • Saving/Loading overhaul. This is a great candidate for a new crate.

Cruft Removal

There are a bunch of utilities that have either been superseded or are straight-up dead code (not flagged as such because of public exports). These should be audited and either deleted or migrated to their newer counterparts. A few quick examples:

  • utils/normalizing_util.rs: Is this still used?
  • utils/kmeans.rs: KMeans has been replaced for PQ-related purposes by the faster implementation in diskann-quantization. The last user of this flow is the disk-index partitioning. It's possible that the BLAS-based implementation in diskann-providers is faster than the approach taken in diskann-quantization for full-dimensional data. To that end, we should do one of the following:
    • Audit and replace the use of KMeans in disk partitioning to use diskann-quantization and get rid of the diskann-providers implementation.
    • Upgrade the implementation in diskann-quantization to be on-par for full-dimensional data.
    • Move the implementation to diskann-disk where it's actually used.
  • utils/math_util.rs: See the above kmeans use. Also, vector generation with norm can be replaced by the implementations in diskann-utils.
  • Rethink file-centric flow? Many of the implementations in diskann-providers/utils derive their staying power because they work in-place on raw files. Is this still needed and if so, is there a cleaner way of decoupling the implementation from files?
  • rayon_util.rs: The purpose of rayon_util is to make the rayon thread pool explicit and to allow inspection of the number of threads available.
    One side-effect is that traits are designed to either take an explicit thread pool or an integer for the number of threads.
    Because this is done with generics, it encourages monomorphization high in the stack.
    I vote we either always take thread pools to make this explicit to the callers, or use an enum for this to cut down on generics.
  • Synthetic/structured data: Audit the users and move synthetic data generation to more appropriate locations (diskann-tools already has its own synthetic data creation?).
    Structured data like the grid points already have a replacement in diskann.
    We can remove the implementations in diskann-providers.
  • model/scratch/pq_scratch.rs: I'm pretty sure the only ultimate users here are diskann-disk and that this data structure and its transitive dependencies (quantizer_preprocess.rs) can all be moved to diskann-disk.
  • model/graph/graph_data_model/adjacency_list.rs: This has almost been completely replaced with the struct in diskann/graph, but diskann-disk apparently uses the interesting conversion from &[u8]. This can either be moved to diskann-disk, or diskann-disk can be updated to use the implementation in diskann/graph.
  • model/graph/traits/graph_data_types.rs: This type should no longer be needed by non-diskann-disk code (and if it is, changing it to a raw T: VectorRepr should be straightforward).
    We should remove this type from diskann-providers and update remaining uses within diskann-providers to take a vector type, ID type, and associated data type independently as needed.
    The problem with using this large list of associated types to instantiate generic functions that only use a portion of them is that it can lead to redundant monomorphizations, and it makes it unclear at use sites which associated data types are actually used.
  • Get rid of AlignedBoxWithSlice. Its implementation has literally been replaced by diskann-quantization::alloc::Poly. Its one non-trivial method, split_into_nonoverlapping_mut_slices, can simply be replaced with rayon's par_chunks_mut.

Test Migration

  • Migrate tests. Currently, diskann_async.rs is pulling double-duty as tests for the inmem index and tests for diskann/graph/index.rs.
    The test infrastructure in diskann has gotten much better for running such unit tests.
    All unit tests should be migrated to diskann and use appropriate baselines.

It might not be straightforward to decouple the graph-invariant tests from the inmem-specific code, but using the test provider in diskann will provide much better checks on the indexing code.

If we still want inmem tests, we should look carefully at using a trait object to wrap the code under test, to avoid the insane number of test monomorphizations currently present in diskann_async.rs.

I do question the usefulness of these tests when it comes to providing long-term guarantees. For really protecting the integrity of the inmem providers in an end-to-end context, I think we'll want the upcoming native A/B test infrastructure in diskann-benchmark, using checked-in baselines as a basis for regression testing.

PQ

Much of the PQ code in diskann-providers predates the diskann-quantization crate. Some items are candidates for moving to diskann-quantization or elsewhere.

What to do with FixedChunkPQTable

I really do not like this data structure. The way it stores the PQ pivot data is almost the worst possible layout for any kind of efficient processing.
Ideally, we'd use something more like the TransposedTable in diskann-quantization for distance table population (though this comes at the potential cost of slower quant-quant distances).
However, there are a few issues that keep us from being able to just do this:

  1. For historical reasons, a centroids vector has been used to center PQ data.
    For vanilla PQ based on L2 distances between centers, an affine shift is not really useful.
    However, we need to support this shift for backwards compatibility with some implementations.
    I'd be comfortable moving application of this shift elsewhere in the pre-processing stack, but I do not think diskann-quantization should be obligated to support this.
  2. The state of OPQ is pretty nebulous at this point. As far as I know, no one is using it and I'd be surprised if it actually worked correctly. We should make a decision on whether we actually want to support this or if we should simply ditch it. The latter would greatly simplify a lot of code.
  3. The FixedChunkPQTable supports quant-quant distances. The implementation is efficient-ish, but could be much better. We may consider building a distance table specifically to accelerate quant-quant distances, but this would be large (at least 64kB per chunk).

Candidates for moving to diskann-quantization

  • Distance computers (table lookup + bulk lookup): Even this is not necessarily straightforward because of memory pooling.

Candidates for moving out

  • Multi-PQ: This can likely be moved internally to where it is actually needed.

Open Questions

The code in pq_construction.rs is too opinionated about its file operations to justify moving it to a lower-level crate.

Save/Load Overhaul

Most of the code in storage/* is related to the various ad-hoc storage mechanisms used by various index implementations.
Suhas has been working on a proposal for a structured and unified way of saving and loading.
Hopefully, we can use this to dump the problematic SaveWith/LoadWith traits and get rid of the various ad-hoc methods strewn about.
I do not think it's worth trying to clean this up until the replacement is ready.

Extracting and Fortifying Inmem

There are currently several issues with the in-memory providers, not least of which is being buried 8 layers deep in a directory.

  • Deep-rooted unsafety in AlignedMemoryVectorStore (this is a hard problem to tackle efficiently).
  • Tricky generic bounds on some quantizers (this is largely diskann-quantization's fault, I'm sorry).
  • Sketchy APIs around start point creation/constructors.

These do not necessarily need to be fixed before creating a diskann-inmem crate, but some considerations must be made first.
Because of the reliance of the inmem providers on the PQ code, we either need to fix the PQ story before isolating, or make a new diskann-inmem that relies on diskann-providers until the other issues can be sorted out.
Similarly, the save/load story keeps these implementations largely coupled with the storage related code in diskann-providers.
Thus, without these underlying issues being addressed, a crate split wouldn't really make a semantic boundary.
