Skip to content

Implement true LanceDB hybrid retrieval#2040

Merged
jperez999 merged 8 commits into
NVIDIA:mainfrom
jioffe502:codex/lancedb-true-hybrid-search
Jun 2, 2026
Merged

Implement true LanceDB hybrid retrieval#2040
jperez999 merged 8 commits into
NVIDIA:mainfrom
jioffe502:codex/lancedb-true-hybrid-search

Conversation

@jioffe502
Copy link
Copy Markdown
Collaborator

Summary

  • Implements real LanceDB hybrid retrieval by passing aligned raw query_texts alongside precomputed vectors.
  • Keeps query_texts execution-only: it is stripped from persistent VDB constructor kwargs and forwarded only for hybrid=True retrieval calls.
  • Replaces the LanceDB hybrid=True retrieval NotImplementedError with LanceDB 0.30.2 hybrid query construction: table.search(query_type="hybrid", vector_column_name=..., fts_columns="text").vector(vector).text(query_text).

Behavioral Notes

  • No CLI surface changes.
  • Existing overwrite/append semantics are unchanged.
  • Dense retrieval stays VDB-agnostic; query_texts is not forwarded for dense retrieval.
  • Hybrid LanceDB retrieval now requires query_texts and validates that query/vector counts match.
  • where / _filter, top_k, refine_factor, n_probe / nprobes, result_fields, and search_kwargs behavior are preserved.
  • A conflicting non-hybrid search_kwargs["query_type"] now raises a clear ValueError.

Validation

  • cd /localhome/local-jioffe/nv-ingest-lancedb/nemo_retriever
  • /localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_retriever_queries.py tests/test_nv_ingest_vdb_operator.py tests/test_lancedb_retrieval_where.py
    • 32 passed
  • /localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_root_cli_workflow.py tests/test_graph_pipeline_cli.py tests/test_lancedb_write_policy.py
    • 21 passed, 1 warning
  • git diff --check
    • clean

E2E Findings

JP20 LanceDB Hybrid

  • Page image extraction enabled: yes, default path; no --no-extract-page-as-image.
  • Pages processed: 1,940
  • Graph rows: 3,192
  • Persisted/uploadable rows: 3,185
  • Recall:
    • recall@1: 0.6609
    • recall@3: 0.8522
    • recall@5: 0.9304
    • recall@10: 0.9565
  • LanceDB indexes confirmed: vector IvfHnswSq plus FTS text_idx.

BO767 LanceDB Hybrid

  • Pages processed: 54,730
  • Graph rows: 80,436
  • Persisted LanceDB rows: 76,299
  • BEIR queries: 1,005
  • Total time: 1484.75s / 0:24:44.753
  • Throughput: 36.86 PPS
  • Recall:
    • recall@1: 0.5811
    • recall@3: 0.7950
    • recall@5: 0.8488
    • recall@10: 0.8985
  • NDCG:
    • ndcg@1: 0.5811
    • ndcg@3: 0.7076
    • ndcg@5: 0.7297
    • ndcg@10: 0.7460
  • LanceDB table confirmed with 76,299 rows and indexes:
    • Index(IvfHnswSq, columns=["vector"], name="vector_idx")
    • Index(FTS, columns=["text"], name="text_idx")

Dense vs Hybrid Observability

  • BEIR metric names are unchanged (recall@k, ndcg@k) and do not themselves indicate dense vs hybrid.
  • The run summary stdout includes VDB kwargs: {"hybrid": true, ...}.
  • The runtime summary JSON currently records vdb_op: "lancedb" and metrics but does not include vdb_kwargs or an explicit retrieval mode.
  • Recommended follow-up: persist vdb_kwargs or retrieval_mode: hybrid|dense into run.runtime.summary.json for easier auditability in future runs.

@jioffe502 jioffe502 requested review from a team as code owners May 14, 2026 21:21
@jioffe502 jioffe502 requested a review from edknv May 14, 2026 21:21
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

Replaces the NotImplementedError stub for LanceDB hybrid retrieval with a working implementation using the LanceDB 0.30.2 hybrid query API (table.search(query_type="hybrid").vector(...).text(...)). The PR also introduces an execution-time query_texts parameter that flows from Retriever through RetrieveVdbOperator to LanceDB.retrieval, stripped from persistent constructor kwargs so dense retrieval is unaffected.

  • lancedb.py — hybrid path validates and aligns query_texts with vectors, sets query_type=\"hybrid\" and fts_columns in search_kwargs, then calls the LanceDB 0.30.2 builder chain.
  • operators.pyRetrieveVdbOperator strips query_texts at construction and re-injects it per-call only when effective_hybrid is true.
  • Six new test_lancedb_retrieval_where.py tests and four new test_nv_ingest_vdb_operator.py tests exercise the happy path, error paths, alignment check, where-filter, and query_type conflict guard.

Confidence Score: 5/5

Safe to merge; the hybrid retrieval path is well-guarded with input validation, and existing dense retrieval is untouched.

The implementation is focused and correct: query_texts is stripped at construction, re-injected only for hybrid calls, and validated for alignment before the search loop. All error paths are exercised by the new tests, E2E recall numbers look solid, and the dense path is unchanged. No data-loss or incorrect-result risk identified.

No files require special attention; the single suggestion in lancedb.py (move alignment check before table connection) is non-blocking.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/vdb/lancedb.py Core hybrid retrieval implementation: query_texts extraction, query_type/fts_columns injection, alignment check, and the LanceDB 0.30.2 builder chain. Alignment check fires after table connection — minor fail-fast gap.
nemo_retriever/src/nemo_retriever/vdb/operators.py Strips query_texts from constructor kwargs and conditionally re-injects it at call time via the effective_hybrid guard; logic is correct and well-tested.
nemo_retriever/src/nemo_retriever/retriever.py Adds alignment-safety comment and explicitly forwards query_texts in exec_kwargs alongside the embedded vectors; change is minimal and correct.
nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py Docstring-only updates to reflect the new query_texts kwarg contract; no logic changes.
nemo_retriever/tests/test_lancedb_retrieval_where.py Six new tests cover happy path, ingestion+retrieval E2E, missing query_texts, length mismatch, where-filter, and conflicting query_type; all follow existing test conventions.
nemo_retriever/tests/test_nv_ingest_vdb_operator.py Four new operator-level tests verify query_texts forwarding for hybrid, hybrid-vdb-instance, dense-override, and dense-only cases.
nemo_retriever/src/nemo_retriever/vdb/README.md Documentation updated to reflect hybrid retrieval is now implemented and describes the query_texts parameter contract.

Sequence Diagram

sequenceDiagram
    participant R as Retriever
    participant G as Graph (embed -> retrieve)
    participant Op as RetrieveVdbOperator
    participant DB as LanceDB.retrieval

    R->>G: "execute(df, query_texts=[...], top_k=N)"
    G->>G: embed texts to vectors
    G->>Op: "process(vectors, query_texts=[...], top_k=N)"
    Op->>Op: filter_retrieval_kwargs strips query_texts
    Op->>Op: effective_hybrid? check retrieval_kwargs / vdb.hybrid
    alt "hybrid=True"
        Op->>Op: re-inject query_texts into retrieval_kwargs
        Op->>DB: "retrieval(vectors, hybrid=True, query_texts=[...])"
        DB->>DB: validate query_texts not None
        DB->>DB: "validate len(query_texts)==len(vectors)"
        DB->>DB: "set search_kwargs query_type=hybrid, fts_columns=text"
        DB->>DB: table.search(...).vector(v).text(t).where(...).limit(k)
    else dense
        Op->>DB: "retrieval(vectors, hybrid=False)"
        DB->>DB: table.search([vector], ...).where(...).limit(k)
    end
    DB-->>Op: list[list[dict]]
    Op-->>R: normalized hits
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:694-706
The length-alignment check for `query_texts` vs `vectors` is placed after `lancedb.connect(...).open_table(...)`. If the counts don't match the table connection is already open. Moving the alignment validation before the connection follows fail-fast practice and avoids unnecessary I/O on bad input.

```suggestion
        if hybrid:
            vectors_for_search = list(vectors)
            query_texts_list = [query_texts] if isinstance(query_texts, str) else list(query_texts)
            if len(query_texts_list) != len(vectors_for_search):
                raise ValueError(
                    "LanceDB hybrid retrieval requires query_texts length to match vectors length; "
                    f"got query_texts={len(query_texts_list)} vectors={len(vectors_for_search)}."
                )
        else:
            vectors_for_search = vectors
            query_texts_list = []

        table = lancedb.connect(uri=table_path).open_table(table_name)
```

Reviews (7): Last reviewed commit: "Clarify hybrid FTS index build" | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated
@jioffe502
Copy link
Copy Markdown
Collaborator Author

Greptile follow-up addressed in d96dcf47: added retrieval type hints, restored dense lazy iteration, and clarified the missing query_texts error.

Copy link
Copy Markdown
Collaborator

@jperez999 jperez999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the ingestion portion of the code, like where we train the BM25 model? I dont see it in this PR.

return counts

def retrieval(self, vectors, **kwargs):
def retrieval(self, vectors: Iterable[Sequence[float]], **kwargs: Any) -> list[list[dict[str, Any]]]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should use the hits class shouldnt it?

@jioffe502 jioffe502 force-pushed the codex/lancedb-true-hybrid-search branch from 395ff3d to e3a5f30 Compare June 2, 2026 20:43
@jperez999 jperez999 merged commit dd62915 into NVIDIA:main Jun 2, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants