Implement true LanceDB hybrid retrieval by jioffe502 · Pull Request #2040 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-05-14T21:21:50Z

Summary

Implements real LanceDB hybrid retrieval by passing aligned raw query_texts alongside precomputed vectors.
Keeps query_texts execution-only: it is stripped from persistent VDB constructor kwargs and forwarded only for hybrid=True retrieval calls.
Replaces the LanceDB hybrid=True retrieval NotImplementedError with LanceDB 0.30.2 hybrid query construction: table.search(query_type="hybrid", vector_column_name=..., fts_columns="text").vector(vector).text(query_text).

Behavioral Notes

No CLI surface changes.
Existing overwrite/append semantics are unchanged.
Dense retrieval stays VDB-agnostic; query_texts is not forwarded for dense retrieval.
Hybrid LanceDB retrieval now requires query_texts and validates that query/vector counts match.
where / _filter, top_k, refine_factor, n_probe / nprobes, result_fields, and search_kwargs behavior are preserved.
A conflicting non-hybrid search_kwargs["query_type"] now raises a clear ValueError.

Validation

cd /localhome/local-jioffe/nv-ingest-lancedb/nemo_retriever
/localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_retriever_queries.py tests/test_nv_ingest_vdb_operator.py tests/test_lancedb_retrieval_where.py
- 32 passed
/localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_root_cli_workflow.py tests/test_graph_pipeline_cli.py tests/test_lancedb_write_policy.py
- 21 passed, 1 warning
git diff --check
- clean

E2E Findings

JP20 LanceDB Hybrid

Page image extraction enabled: yes, default path; no --no-extract-page-as-image.
Pages processed: 1,940
Graph rows: 3,192
Persisted/uploadable rows: 3,185
Recall:
- recall@1: 0.6609
- recall@3: 0.8522
- recall@5: 0.9304
- recall@10: 0.9565
LanceDB indexes confirmed: vector IvfHnswSq plus FTS text_idx.

BO767 LanceDB Hybrid

Pages processed: 54,730
Graph rows: 80,436
Persisted LanceDB rows: 76,299
BEIR queries: 1,005
Total time: 1484.75s / 0:24:44.753
Throughput: 36.86 PPS
Recall:
- recall@1: 0.5811
- recall@3: 0.7950
- recall@5: 0.8488
- recall@10: 0.8985
NDCG:
- ndcg@1: 0.5811
- ndcg@3: 0.7076
- ndcg@5: 0.7297
- ndcg@10: 0.7460
LanceDB table confirmed with 76,299 rows and indexes:
- Index(IvfHnswSq, columns=["vector"], name="vector_idx")
- Index(FTS, columns=["text"], name="text_idx")

Dense vs Hybrid Observability

BEIR metric names are unchanged (recall@k, ndcg@k) and do not themselves indicate dense vs hybrid.
The run summary stdout includes VDB kwargs: {"hybrid": true, ...}.
The runtime summary JSON currently records vdb_op: "lancedb" and metrics but does not include vdb_kwargs or an explicit retrieval mode.
Recommended follow-up: persist vdb_kwargs or retrieval_mode: hybrid|dense into run.runtime.summary.json for easier auditability in future runs.

greptile-apps · 2026-05-14T21:32:02Z

Greptile Summary

Replaces the NotImplementedError stub for LanceDB hybrid retrieval with a working implementation using the LanceDB 0.30.2 hybrid query API (table.search(query_type="hybrid").vector(...).text(...)). The PR also introduces an execution-time query_texts parameter that flows from Retriever through RetrieveVdbOperator to LanceDB.retrieval, stripped from persistent constructor kwargs so dense retrieval is unaffected.

lancedb.py — hybrid path validates and aligns query_texts with vectors, sets query_type=\"hybrid\" and fts_columns in search_kwargs, then calls the LanceDB 0.30.2 builder chain.
operators.py — RetrieveVdbOperator strips query_texts at construction and re-injects it per-call only when effective_hybrid is true.
Six new test_lancedb_retrieval_where.py tests and four new test_nv_ingest_vdb_operator.py tests exercise the happy path, error paths, alignment check, where-filter, and query_type conflict guard.

Confidence Score: 5/5

Safe to merge; the hybrid retrieval path is well-guarded with input validation, and existing dense retrieval is untouched.

The implementation is focused and correct: query_texts is stripped at construction, re-injected only for hybrid calls, and validated for alignment before the search loop. All error paths are exercised by the new tests, E2E recall numbers look solid, and the dense path is unchanged. No data-loss or incorrect-result risk identified.

No files require special attention; the single suggestion in lancedb.py (move alignment check before table connection) is non-blocking.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/vdb/lancedb.py	Core hybrid retrieval implementation: query_texts extraction, query_type/fts_columns injection, alignment check, and the LanceDB 0.30.2 builder chain. Alignment check fires after table connection — minor fail-fast gap.
nemo_retriever/src/nemo_retriever/vdb/operators.py	Strips query_texts from constructor kwargs and conditionally re-injects it at call time via the effective_hybrid guard; logic is correct and well-tested.
nemo_retriever/src/nemo_retriever/retriever.py	Adds alignment-safety comment and explicitly forwards query_texts in exec_kwargs alongside the embedded vectors; change is minimal and correct.
nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py	Docstring-only updates to reflect the new query_texts kwarg contract; no logic changes.
nemo_retriever/tests/test_lancedb_retrieval_where.py	Six new tests cover happy path, ingestion+retrieval E2E, missing query_texts, length mismatch, where-filter, and conflicting query_type; all follow existing test conventions.
nemo_retriever/tests/test_nv_ingest_vdb_operator.py	Four new operator-level tests verify query_texts forwarding for hybrid, hybrid-vdb-instance, dense-override, and dense-only cases.
nemo_retriever/src/nemo_retriever/vdb/README.md	Documentation updated to reflect hybrid retrieval is now implemented and describes the query_texts parameter contract.

Sequence Diagram

sequenceDiagram
    participant R as Retriever
    participant G as Graph (embed -> retrieve)
    participant Op as RetrieveVdbOperator
    participant DB as LanceDB.retrieval

    R->>G: "execute(df, query_texts=[...], top_k=N)"
    G->>G: embed texts to vectors
    G->>Op: "process(vectors, query_texts=[...], top_k=N)"
    Op->>Op: filter_retrieval_kwargs strips query_texts
    Op->>Op: effective_hybrid? check retrieval_kwargs / vdb.hybrid
    alt "hybrid=True"
        Op->>Op: re-inject query_texts into retrieval_kwargs
        Op->>DB: "retrieval(vectors, hybrid=True, query_texts=[...])"
        DB->>DB: validate query_texts not None
        DB->>DB: "validate len(query_texts)==len(vectors)"
        DB->>DB: "set search_kwargs query_type=hybrid, fts_columns=text"
        DB->>DB: table.search(...).vector(v).text(t).where(...).limit(k)
    else dense
        Op->>DB: "retrieval(vectors, hybrid=False)"
        DB->>DB: table.search([vector], ...).where(...).limit(k)
    end
    DB-->>Op: list[list[dict]]
    Op-->>R: normalized hits

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:694-706
The length-alignment check for `query_texts` vs `vectors` is placed after `lancedb.connect(...).open_table(...)`. If the counts don't match the table connection is already open. Moving the alignment validation before the connection follows fail-fast practice and avoids unnecessary I/O on bad input.

```suggestion
        if hybrid:
            vectors_for_search = list(vectors)
            query_texts_list = [query_texts] if isinstance(query_texts, str) else list(query_texts)
            if len(query_texts_list) != len(vectors_for_search):
                raise ValueError(
                    "LanceDB hybrid retrieval requires query_texts length to match vectors length; "
                    f"got query_texts={len(query_texts_list)} vectors={len(vectors_for_search)}."
                )
        else:
            vectors_for_search = vectors
            query_texts_list = []

        table = lancedb.connect(uri=table_path).open_table(table_name)
```

_{Reviews (7): Last reviewed commit: "Clarify hybrid FTS index build" | Re-trigger Greptile}

jioffe502 · 2026-05-18T15:16:41Z

Greptile follow-up addressed in d96dcf47: added retrieval type hints, restored dense lazy iteration, and clarified the missing query_texts error.

…hybrid-search # Conflicts: # nemo_retriever/src/nemo_retriever/vdb/lancedb.py

…hybrid-search

jperez999

Where is the ingestion portion of the code, like where we train the BM25 model? I dont see it in this PR.

jperez999 · 2026-06-02T19:00:10Z

        return counts

-    def retrieval(self, vectors, **kwargs):
+    def retrieval(self, vectors: Iterable[Sequence[float]], **kwargs: Any) -> list[list[dict[str, Any]]]:


i think this should use the hits class shouldnt it?

jioffe502 requested review from a team as code owners May 14, 2026 21:21

jioffe502 requested a review from edknv May 14, 2026 21:21

jioffe502 added 2 commits May 14, 2026 21:25

Implement LanceDB hybrid retrieval

6410d75

Document hybrid query text ordering assumption

1bf9168

greptile-apps Bot reviewed May 14, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated

Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated

Address LanceDB hybrid review comments

d96dcf4

jioffe502 and others added 4 commits May 26, 2026 12:04

Merge branch 'main' into codex/lancedb-true-hybrid-search

89d07a3

Merge remote-tracking branch 'upstream/main' into codex/lancedb-true-…

892b49f

…hybrid-search # Conflicts: # nemo_retriever/src/nemo_retriever/vdb/lancedb.py

Merge remote-tracking branch 'upstream/main' into codex/lancedb-true-…

e965ae7

…hybrid-search

Tighten hybrid retrieval VDB contract

3965bfb

jioffe502 mentioned this pull request Jun 1, 2026

Add VDB runtime summary observability #2198

Draft

jperez999 requested changes Jun 2, 2026

View reviewed changes

Clarify hybrid FTS index build

e3a5f30

jioffe502 force-pushed the codex/lancedb-true-hybrid-search branch from 395ff3d to e3a5f30 Compare June 2, 2026 20:43

jperez999 approved these changes Jun 2, 2026

View reviewed changes

jperez999 merged commit dd62915 into NVIDIA:main Jun 2, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement true LanceDB hybrid retrieval#2040

Implement true LanceDB hybrid retrieval#2040
jperez999 merged 8 commits into
NVIDIA:mainfrom
jioffe502:codex/lancedb-true-hybrid-search

jioffe502 commented May 14, 2026

Uh oh!

greptile-apps Bot commented May 14, 2026 •

edited

Loading

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

jioffe502 commented May 18, 2026

Uh oh!

jperez999 left a comment

Uh oh!

jperez999 Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jioffe502 commented May 14, 2026

Summary

Behavioral Notes

Validation

E2E Findings

JP20 LanceDB Hybrid

BO767 LanceDB Hybrid

Dense vs Hybrid Observability

Uh oh!

greptile-apps Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

jioffe502 commented May 18, 2026

Uh oh!

jperez999 left a comment

Choose a reason for hiding this comment

Uh oh!

jperez999 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 14, 2026 •

edited

Loading