Skip to content

fix: optimize segment compaction by reusing first vector index#114

Closed
kgeg401 wants to merge 3 commits into
alibaba:mainfrom
kgeg401:fix/issue-98-optimize-merge-reuse
Closed

fix: optimize segment compaction by reusing first vector index#114
kgeg401 wants to merge 3 commits into
alibaba:mainfrom
kgeg401:fix/issue-98-optimize-merge-reuse

Conversation

@kgeg401
Copy link
Copy Markdown
Contributor

@kgeg401 kgeg401 commented Feb 14, 2026

Summary

  • add a no-filter compaction fast path that reuses the first segment's vector index as merge base
  • keep full rebuild behavior as fallback for filtered compaction and unsupported index layouts
  • cover both vector-index and filtered fallback regressions with segment helper tests

Root Cause

ReduceVectorIndex always rebuilt merged indexes from scratch, even when compaction had no filter and segments were layout-compatible. That caused avoidable index rebuild work during merge/compact.

Validation

  • Added CompactTask_VectorIndexThreeSegmentsRegression to verify no-filter multi-segment compaction correctness
  • Added CompactTask_FilterMultiSegmentsRegression to verify filtered path still uses fallback semantics
  • Local compile/test execution is not available in this environment (no C/C++ compiler toolchain), so CI is the authoritative validation

Fixes #98

@feihongxu0824
Copy link
Copy Markdown
Collaborator

Thank you for your contribution. The zvec maintainer team is currently on holiday for the Chinese New Year (Spring Festival) in mainland China. Please expect delayed responses, and we appreciate your patience.

@kgeg401 kgeg401 force-pushed the fix/issue-98-optimize-merge-reuse branch from 0cb23df to 98d5708 Compare February 28, 2026 16:06
@kgeg401 kgeg401 force-pushed the fix/issue-98-optimize-merge-reuse branch from 98d5708 to 17581fd Compare February 28, 2026 16:07
@kgeg401
Copy link
Copy Markdown
Contributor Author

kgeg401 commented Mar 1, 2026

Added a quantized-path regression test for segment compaction:\n\n- New CompactTask_QuantizedVectorIndexThreeSegmentsRegression in segment_helper_test.cc.\n- Covers multi-segment, no-filter compaction with quantized vector indexes and verifies quantized index availability after compaction.\n\nThis is intended to directly exercise the get_quant_vector_indexer(...) merge-reuse path.

@kgeg401
Copy link
Copy Markdown
Contributor Author

kgeg401 commented Mar 8, 2026

Added the quantized-path regression test for segment compaction, and the PR is otherwise unchanged. This should be ready for maintainer review when bandwidth allows.

@JalinWang JalinWang self-requested a review March 12, 2026 09:24
@JalinWang
Copy link
Copy Markdown
Collaborator

Thanks a lot for the contribution, and sorry for the long delay on this review — we had some internal bandwidth constraints. We've made PR/issue handling to a regular cadence :)

On the change itself: I feel this optimization is better placed in the core reducer layer(mixed_streamer_reducer) rather than the DB layer. If it stays only in DB, users of core directly can’t benefit from it. What do you think?

Also, nice improvement on deduplicating repeated merge logic. One more cleanup suggestion: merge_options construction could likely be moved out of the lambda and built once outside the loop.

@JalinWang
Copy link
Copy Markdown
Collaborator

Following up on my earlier comment — after digging deeper into the implementation and thinking about where this is heading, I'd like to walk back the suggestion to push this into mixed_streamer_reducer.

Reversing the layering call

Moving "pick a base indexer / copy the file / fall back to full rebuild" breaks the abstraction. Also, future compaction/optimize improvements might bring more db-level scheduling (e.g., segments reordering). So we'd rather keep this layer in SegmentHelper.

A few issues in the current implementation

  1. UT doesn't compile.

  2. row_id_filter is never null, so the reuse path is dead code. ExecuteCompactTask unconditionally constructs make_shared<RowIdFilter>(...) regardless of whether delete_row_id_bitmap is empty, so the first check in can_reuse_first_indexer (filter != nullptr) always short-circuits.

  3. No handling for mixed index types (output HNSW, input FLAT). A writing segment's vector indexer is FLAT, but compact output uses the user-configured type (HNSW / HNSW_RABITQ / IVF / …). The PR breaks the on-disk format of indexes, leading to crashes.

  4. Rules in can_reuse_first_indexer are too strict.

    • input_segments.size() <= 1 → false: a single input segment is actually the best case for reuse (just copy), shouldn't be excluded.
    • fetch_indexers(input_segment).size() != 1 → false: writing segments are composed of multiple blocks, each with its own indexer — "one segment, many indexers" is the common case, and this rule rejects almost all realistic compactions. The check should be on whether the first indexer of the first segment is reusable, not on every segment having exactly one.

Next steps

We've taken your PR as a starting point, fixed the issues above, and refactored a bit. That work is up at #440. We'll close this PR once that's merged — thanks again for the initial version, it gave us a base to build on :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhance]: improve optimze()/merge()

3 participants