Implement true LanceDB hybrid retrieval#2040
Conversation
Greptile SummaryReplaces the
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/vdb/lancedb.py | Core hybrid retrieval implementation: query_texts extraction, query_type/fts_columns injection, alignment check, and the LanceDB 0.30.2 builder chain. Alignment check fires after table connection — minor fail-fast gap. |
| nemo_retriever/src/nemo_retriever/vdb/operators.py | Strips query_texts from constructor kwargs and conditionally re-injects it at call time via the effective_hybrid guard; logic is correct and well-tested. |
| nemo_retriever/src/nemo_retriever/retriever.py | Adds alignment-safety comment and explicitly forwards query_texts in exec_kwargs alongside the embedded vectors; change is minimal and correct. |
| nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py | Docstring-only updates to reflect the new query_texts kwarg contract; no logic changes. |
| nemo_retriever/tests/test_lancedb_retrieval_where.py | Six new tests cover happy path, ingestion+retrieval E2E, missing query_texts, length mismatch, where-filter, and conflicting query_type; all follow existing test conventions. |
| nemo_retriever/tests/test_nv_ingest_vdb_operator.py | Four new operator-level tests verify query_texts forwarding for hybrid, hybrid-vdb-instance, dense-override, and dense-only cases. |
| nemo_retriever/src/nemo_retriever/vdb/README.md | Documentation updated to reflect hybrid retrieval is now implemented and describes the query_texts parameter contract. |
Sequence Diagram
sequenceDiagram
participant R as Retriever
participant G as Graph (embed -> retrieve)
participant Op as RetrieveVdbOperator
participant DB as LanceDB.retrieval
R->>G: "execute(df, query_texts=[...], top_k=N)"
G->>G: embed texts to vectors
G->>Op: "process(vectors, query_texts=[...], top_k=N)"
Op->>Op: filter_retrieval_kwargs strips query_texts
Op->>Op: effective_hybrid? check retrieval_kwargs / vdb.hybrid
alt "hybrid=True"
Op->>Op: re-inject query_texts into retrieval_kwargs
Op->>DB: "retrieval(vectors, hybrid=True, query_texts=[...])"
DB->>DB: validate query_texts not None
DB->>DB: "validate len(query_texts)==len(vectors)"
DB->>DB: "set search_kwargs query_type=hybrid, fts_columns=text"
DB->>DB: table.search(...).vector(v).text(t).where(...).limit(k)
else dense
Op->>DB: "retrieval(vectors, hybrid=False)"
DB->>DB: table.search([vector], ...).where(...).limit(k)
end
DB-->>Op: list[list[dict]]
Op-->>R: normalized hits
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:694-706
The length-alignment check for `query_texts` vs `vectors` is placed after `lancedb.connect(...).open_table(...)`. If the counts don't match the table connection is already open. Moving the alignment validation before the connection follows fail-fast practice and avoids unnecessary I/O on bad input.
```suggestion
if hybrid:
vectors_for_search = list(vectors)
query_texts_list = [query_texts] if isinstance(query_texts, str) else list(query_texts)
if len(query_texts_list) != len(vectors_for_search):
raise ValueError(
"LanceDB hybrid retrieval requires query_texts length to match vectors length; "
f"got query_texts={len(query_texts_list)} vectors={len(vectors_for_search)}."
)
else:
vectors_for_search = vectors
query_texts_list = []
table = lancedb.connect(uri=table_path).open_table(table_name)
```
Reviews (7): Last reviewed commit: "Clarify hybrid FTS index build" | Re-trigger Greptile
|
Greptile follow-up addressed in |
…hybrid-search # Conflicts: # nemo_retriever/src/nemo_retriever/vdb/lancedb.py
jperez999
left a comment
There was a problem hiding this comment.
Where is the ingestion portion of the code, like where we train the BM25 model? I dont see it in this PR.
| return counts | ||
|
|
||
| def retrieval(self, vectors, **kwargs): | ||
| def retrieval(self, vectors: Iterable[Sequence[float]], **kwargs: Any) -> list[list[dict[str, Any]]]: |
There was a problem hiding this comment.
i think this should use the hits class shouldnt it?
395ff3d to
e3a5f30
Compare
Summary
query_textsalongside precomputed vectors.query_textsexecution-only: it is stripped from persistent VDB constructor kwargs and forwarded only forhybrid=Trueretrieval calls.hybrid=TrueretrievalNotImplementedErrorwith LanceDB 0.30.2 hybrid query construction:table.search(query_type="hybrid", vector_column_name=..., fts_columns="text").vector(vector).text(query_text).Behavioral Notes
query_textsis not forwarded for dense retrieval.query_textsand validates that query/vector counts match.where/_filter,top_k,refine_factor,n_probe/nprobes,result_fields, andsearch_kwargsbehavior are preserved.search_kwargs["query_type"]now raises a clearValueError.Validation
cd /localhome/local-jioffe/nv-ingest-lancedb/nemo_retriever/localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_retriever_queries.py tests/test_nv_ingest_vdb_operator.py tests/test_lancedb_retrieval_where.py32 passed/localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_root_cli_workflow.py tests/test_graph_pipeline_cli.py tests/test_lancedb_write_policy.py21 passed, 1 warninggit diff --checkE2E Findings
JP20 LanceDB Hybrid
--no-extract-page-as-image.1,9403,1923,185recall@1: 0.6609recall@3: 0.8522recall@5: 0.9304recall@10: 0.9565IvfHnswSqplus FTStext_idx.BO767 LanceDB Hybrid
54,73080,43676,2991,0051484.75s/0:24:44.75336.86 PPSrecall@1: 0.5811recall@3: 0.7950recall@5: 0.8488recall@10: 0.8985ndcg@1: 0.5811ndcg@3: 0.7076ndcg@5: 0.7297ndcg@10: 0.746076,299rows and indexes:Index(IvfHnswSq, columns=["vector"], name="vector_idx")Index(FTS, columns=["text"], name="text_idx")Dense vs Hybrid Observability
recall@k,ndcg@k) and do not themselves indicate dense vs hybrid.VDB kwargs: {"hybrid": true, ...}.vdb_op: "lancedb"and metrics but does not includevdb_kwargsor an explicit retrieval mode.vdb_kwargsorretrieval_mode: hybrid|denseintorun.runtime.summary.jsonfor easier auditability in future runs.