Fix direct image extraction dispatch#1996
Conversation
Greptile SummaryThis PR fixes auto-dispatch routing for direct image inputs (BMP, TIFF, TIF) by changing
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/graph_ingestor.py | Changes extract() default from extraction_mode="pdf" to "auto" and updates the docstring; __init__ still initializes _extraction_mode to "pdf", creating a minor internal inconsistency. |
| nemo_retriever/tests/test_ingest_interface.py | Adds parametrized BMP/TIFF image materialization tests and a mixed PDF+BMP auto-dispatch test; both new tests mock private internals of _MultiTypeExtractBase rather than the public operator boundary. |
| nemo_retriever/src/nemo_retriever/graph_ingestor.md | New reference documentation covering GraphIngestor usage, extraction modes, auto-dispatch extension groups, split_config, and post-extraction stages. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["GraphIngestor.extract()"] -->|extraction_mode default| B{"extraction_mode"}
B -->|'auto' NEW DEFAULT| C["MultiTypeExtractOperator"]
B -->|'pdf' explicit| D["PDF/Document Graph"]
B -->|'image' explicit| E["ImageLoadActor"]
B -->|'text' explicit| F["TxtSplitActor"]
B -->|'html' explicit| G["HtmlSplitActor"]
B -->|'audio' explicit| H["MediaChunkActor → ASRActor"]
C -->|".pdf .docx .pptx"| D
C -->|".png .jpg .bmp .tiff .tif .svg"| E
C -->|".txt"| F
C -->|".html"| G
C -->|".mp3 .wav"| H
D --> I["_run_pdf_pipeline"]
E --> J["_run_detection_pipeline"]
I --> K["Final Result DataFrame"]
J --> K
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
nemo_retriever/src/nemo_retriever/graph_ingestor.py:220
The instance attribute is still initialized to `"pdf"` in `__init__`, while `extract()` now defaults to `"auto"`. Anyone reading the class body first will conclude the natural default is PDF mode, when in practice `.extract()` (required for any ingestion) overrides it to `"auto"`. Aligning the two removes the misleading signal.
```suggestion
self._extraction_mode: str = "auto"
```
Reviews (5): Last reviewed commit: "Merge branch 'main' into investigate/ima..." | Re-trigger Greptile
| monkeypatch.setattr( | ||
| "nemo_retriever.graph.multi_type_extract_operator._MultiTypeExtractBase._run_detection_pipeline", | ||
| lambda self, batch_df: batch_df, | ||
| ) |
There was a problem hiding this comment.
Mocking too deep in the stack — fragile tests ahead. Both new tests patch
_MultiTypeExtractBase._run_detection_pipeline and _run_pdf_pipeline, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed.
Rule Used: Tests must mock all external service dependencies ... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_ingest_interface.py
Line: 92-95
Comment:
Mocking too deep in the stack — fragile tests ahead. Both new tests patch `_MultiTypeExtractBase._run_detection_pipeline` and `_run_pdf_pipeline`, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed.
**Rule Used:** Tests must mock all external service dependencies ... ([source](https://app.greptile.com/review/custom-context?memory=mock-external-services))
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| @@ -0,0 +1,213 @@ | |||
| # GraphIngestor | |||
There was a problem hiding this comment.
make sure this is still up to date with all the changes we made recently.
Summary
GraphIngestor.extract()toextraction_mode="auto"so supported non-PDF inputs are routed throughMultiTypeExtractOperatorby extension.extraction_mode="pdf"as the explicit dedicated PDF/document graph path.graph_ingestor.mddocumenting GraphIngestor usage, run modes, extraction modes, supported auto-dispatch groups, split configuration, and post-extraction stages.Testing
python -m pytest nemo_retriever/tests/test_ingest_interface.py -q