perf(segment): reuse first vector index file as merge base during compaction #440

Open
JalinWang wants to merge 38 commits into
alibaba:main from
JalinWang:fix/issue-98-optimize-merge-reuse

Collaborator

@JalinWang JalinWang commented Jun 1, 2026

Closes #98.

Summary

ReduceVectorIndex always rebuilt merged vector indexes from scratch during compaction/optimize, even when no filtering was needed and the segments were layout-compatible. This PR adds a fast path that copies the first segment's vector index file as the merge base and only merges the tail segments into it, falling back to a full rebuild when reuse is unsafe.
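
The copy-then-merge flow with its fallback can be illustrated with a small, self-contained sketch. This is not the zvec code: `Segment`, `MergePath`, `AppendSegment`, `ReadIds`, and `MergeSketch` are hypothetical names, and an "index file" is modelled as a text file with one doc id per line. Only `std::filesystem::copy_file` and the stream classes are real library calls.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical stand-ins, not zvec types: a segment is a list of doc ids and
// an "index file" is a text file holding one id per line.
using Segment = std::vector<int>;

enum class MergePath { kReuse, kRebuild };

// Appends every doc of `segment` to the file at `path`, creating it if needed.
inline bool AppendSegment(const fs::path &path, const Segment &segment) {
  std::ofstream out(path, std::ios::app);
  for (int id : segment) out << id << '\n';
  out.flush();
  return out.good();
}

// Reads the ids back so the merged result can be inspected.
inline std::vector<int> ReadIds(const fs::path &path) {
  std::ifstream in(path);
  std::vector<int> ids;
  for (int id; in >> id;) ids.push_back(id);
  return ids;
}

// Fast path: copy the first segment's file as the merge base and append only
// the tail segments. Slow path: rebuild the output from every segment.
inline MergePath MergeSketch(const fs::path &first_index_file,
                             const std::vector<Segment> &segments,
                             bool can_reuse, const fs::path &output) {
  if (can_reuse && !segments.empty()) {
    std::error_code ec;
    const bool copied = fs::copy_file(
        first_index_file, output, fs::copy_options::overwrite_existing, ec);
    if (copied && !ec) {
      for (std::size_t i = 1; i < segments.size(); ++i) {
        AppendSegment(output, segments[i]);
      }
      return MergePath::kReuse;
    }
    // Copy failed: fall through to the full rebuild instead of aborting.
  }
  { std::ofstream truncate(output, std::ios::trunc); }  // start from empty
  for (const Segment &segment : segments) AppendSegment(output, segment);
  return MergePath::kRebuild;
}
```

In this toy model both paths produce the same file contents; the point of the fast path is that the work for the first (typically largest) segment is reduced to a file copy.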

Changes

Reuse path in segment compaction

src/db/index/segment/segment_helper.{h,cc}

  • New MergeWithOptionalReuse(output_path, field, source_indexers, filter, concurrency, *merged_indexer) encapsulates the copy-or-rebuild decision. It is called from the unquantized branch and from both the flat and quantize legs of the quantized branch in ReduceVectorIndex.
  • CanReuseFirstIndexer gates reuse on:
    • no filter (filter must be nullptr; not an empty RowIdFilter)
    • non-empty input
    • output index type ∈ { HNSW, HNSW_RABITQ, FLAT } — i.e. streaming-style indexes whose Merge appends to an in-memory graph. IVF/VAMANA rebuild from scratch in Merge (dump-then-reopen) and would silently drop the base file's docs, so they always take the full-rebuild path.
    • first indexer's index_type and quantize_type match the output field's
  • On reuse: FileHelper::CopyFile → Open(create_new=false) → Merge(tail_indexers, …). On any copy failure, logs a warning and falls back to the rebuild path.
  • ExecuteCompactTask now leaves row_id_filter null when the delete bitmap is empty, so the no-deletes compaction can actually take the reuse path.
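
The gating conditions listed above could be expressed roughly as the predicate below. It is a sketch under assumptions: the enum values, `FieldSketch`, `RowIdFilterSketch`, and `CanReuseFirstIndexerSketch` are hypothetical stand-ins for illustration, not the real zvec types or the actual CanReuseFirstIndexer signature.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the schema types, for illustration only.
enum class IndexType { kHnsw, kHnswRabitq, kFlat, kIvf, kVamana };
enum class QuantizeType { kNone, kInt8 };

struct FieldSketch {
  IndexType index_type;
  QuantizeType quantize_type;
};

struct RowIdFilterSketch {};  // any non-null filter disables reuse

// True for index types whose merge appends to an existing graph or array.
inline bool IsAppendStyle(IndexType type) {
  return type == IndexType::kHnsw || type == IndexType::kHnswRabitq ||
         type == IndexType::kFlat;
}

inline bool CanReuseFirstIndexerSketch(const FieldSketch &output,
                                       const std::vector<FieldSketch> &sources,
                                       const RowIdFilterSketch *filter) {
  if (filter != nullptr) return false;  // even an empty filter object blocks reuse
  if (sources.empty()) return false;    // nothing to use as a merge base
  if (!IsAppendStyle(output.index_type)) return false;  // IVF/VAMANA rebuild
  const FieldSketch &first = sources.front();
  return first.index_type == output.index_type &&
         first.quantize_type == output.quantize_type;
}
```

Checking the filter pointer rather than whether the filter is empty mirrors the ExecuteCompactTask change: the caller has to pass no filter at all for the fast path to be eligible.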

Append-not-overwrite when merging into a non-empty target

src/core/mixed_reducer/mixed_streamer_reducer.cc

  • reduce() initializes next_id from target_streamer_->create_(sparse_)provider()->count() when merging into a non-empty target (the reuse case). Previously it started at 0, which would stamp the new docs over the base data.
  • Guarded against null provider so IVF and other not-yet-loaded streamers on the rebuild path are unaffected (also fixes a SIGSEGV in that path). This should be improved later.
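
A minimal sketch of the append-not-overwrite rule, assuming a toy streamer whose doc ids are simply positions in a vector. `ProviderSketch`, `StreamerSketch`, `InitialNextId`, and `ReduceInto` are hypothetical names; only the idea of seeding next_id from the provider's count(), with a null-provider fallback to 0, comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical provider: reports how many docs the target already holds.
struct ProviderSketch {
  std::uint64_t doc_count;
  std::uint64_t count() const { return doc_count; }
};

// Hypothetical streamer: docs are stored by id; an unloaded streamer returns
// a null provider, as the IVF streamers in this PR are described to do.
struct StreamerSketch {
  std::vector<int> docs;
  bool loaded;
  std::unique_ptr<ProviderSketch> create_provider() const {
    if (!loaded) return nullptr;
    return std::make_unique<ProviderSketch>(
        ProviderSketch{static_cast<std::uint64_t>(docs.size())});
  }
};

// Start after the existing docs; fall back to 0 when no provider exists.
inline std::uint64_t InitialNextId(const StreamerSketch &target) {
  const std::unique_ptr<ProviderSketch> provider = target.create_provider();
  return provider ? provider->count() : 0;
}

// Writes each incoming doc at next_id. Had next_id started at 0 on a
// non-empty target, the first writes would land on the base docs.
inline void ReduceInto(StreamerSketch &target, const std::vector<int> &incoming) {
  std::uint64_t next_id = InitialNextId(target);
  for (int doc : incoming) {
    const std::size_t slot = static_cast<std::size_t>(next_id);
    if (slot >= target.docs.size()) target.docs.resize(slot + 1);
    target.docs[slot] = doc;
    ++next_id;
  }
}
```

With a loaded target holding three docs, two incoming docs receive ids 3 and 4 and the base docs stay untouched; with an unloaded, empty target the ids start at 0 as before.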

Public accessor

src/db/index/column/vector_column/vector_column_indexer.h

  • Added field_schema() getter so CanReuseFirstIndexer can compare the source indexer's schema (index/quantize type) against the output field.

Tests

tests/db/index/segment/segment_helper_test.cc

  • Extracted a SegmentCompactReuseTest fixture; rewrote the merge-reuse cases as parameterized regressions covering:
    • HNSW / HNSW_RABITQ / FLAT — reuse path
    • IVF / VAMANA — fallback path (reuse must NOT trigger)
    • single-segment compaction (best case for file-copy reuse)
    • multi-segment compaction with a filter (fallback path)
    • mixed-type merge: writing-segment FLAT indexers → compacted HNSW/IVF/… output

Gains

zvec_segment_indexer_reuse_benchmark.md

| Dataset | Total docs | Init Index | CompactTask (no reuse) | CompactTask (reuse) | Saved | Reduction | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| int8-hnsw-1w | 20,000 | 0.36 s | 9.46 s | 3.58 s | 5.89 s | 62.2% | 2.65× |
| int8-hnsw-10w | 110,000 | 11.31 s | 65.53 s | 4.45 s | 61.08 s | 93.2% | 14.74× |
| int8-hnsw-100w | 1,010,000 | 123.75 s | 541.33 s | 5.87 s | 535.46 s | 98.9% | 92.2× |
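
The PR does not state how the Saved, Reduction, and Speedup columns were derived. The reported values appear consistent with the definitions sketched below, which are an assumption; `Gain` and `ComputeGain` are hypothetical names. Differences in the last digit (for example Saved in the first row) presumably come from rounding of the underlying timings.

```cpp
#include <cassert>
#include <cmath>

// Assumed definitions of the derived benchmark columns (not stated in the PR).
struct Gain {
  double saved_s;        // no-reuse time minus reuse time, in seconds
  double reduction_pct;  // saved time as a percentage of the no-reuse time
  double speedup;        // no-reuse time divided by reuse time
};

inline Gain ComputeGain(double no_reuse_s, double reuse_s) {
  const double saved = no_reuse_s - reuse_s;
  return Gain{saved, 100.0 * saved / no_reuse_s, no_reuse_s / reuse_s};
}
```

For the int8-hnsw-100w row, ComputeGain(541.33, 5.87) gives about 535.46 s saved, a reduction of about 98.9%, and a speedup of about 92.2×, matching the table.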

References

kgeg401 and others added 26 commits February 28, 2026 11:07
# Conflicts:
#	src/db/index/segment/segment_helper.cc
IVF streamers return null from create_provider() until the index is
loaded. On the non-reuse merge path the target streamer is empty, so
the eager next_id init crashed with SIGSEGV. Fall back to 0 when the
provider is null — that is the correct starting id for an empty target.
# Conflicts:
#	src/db/index/segment/segment_helper.cc
@JalinWang JalinWang changed the title from "for ut" to "perf(segment): reuse first vector index file as merge base during compaction" Jun 2, 2026
@JalinWang JalinWang requested review from feihongxu0824 and removed request for richyreachy and zhourrr June 2, 2026 03:40
Comment thread src/db/index/segment/segment_helper.cc Outdated
Comment thread src/db/index/segment/segment_helper.cc
Comment thread src/db/index/segment/segment_helper.h Outdated
}

protected:
VersionManager::Ptr CreateVersionManager(const CollectionSchema &schema) {
Collaborator

Have you actually measured the optimize performance gain for a case such as HNSW 100w+1w (1,000,000 + 10,000 docs)?

Collaborator

It would be best to record the test results, and ideally reflect them in the commit message.

Collaborator Author

Collaborator

This could go directly into the commit message. Is this link maintained separately in a personal repository?

Collaborator

OK, I see it is already in the commit message.

Collaborator

I suggest removing the external link; putting the test design and results in the commit message is enough.

Collaborator

[image] Why is there such a large gap here between the time taken by CreateVectorIndexTask and by CompactTask (no reuse)?

Development

Successfully merging this pull request may close these issues.

[Enhance]: improve optimze()/merge()