Skip to content

Fix direct image extraction dispatch#1996

Open
jioffe502 wants to merge 5 commits into
NVIDIA:mainfrom
jioffe502:investigate/image_extraction_bug
Open

Fix direct image extraction dispatch#1996
jioffe502 wants to merge 5 commits into
NVIDIA:mainfrom
jioffe502:investigate/image_extraction_bug

Conversation

@jioffe502
Copy link
Copy Markdown
Collaborator

@jioffe502 jioffe502 commented May 7, 2026

Summary

  • Default GraphIngestor.extract() to extraction_mode="auto" so supported non-PDF inputs are routed through MultiTypeExtractOperator by extension.
  • Preserve extraction_mode="pdf" as the explicit dedicated PDF/document graph path.
  • Add coverage for BMP/TIFF/TIF page image materialization and mixed PDF+BMP auto dispatch.
  • Add graph_ingestor.md documenting GraphIngestor usage, run modes, extraction modes, supported auto-dispatch groups, split configuration, and post-extraction stages.

Testing

  • python -m pytest nemo_retriever/tests/test_ingest_interface.py -q
  • Full JP20 pipeline smoke-tested with the auto dispatch path.

@jioffe502 jioffe502 requested review from a team as code owners May 7, 2026 19:00
@jioffe502 jioffe502 requested a review from jperez999 May 7, 2026 19:00
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR fixes auto-dispatch routing for direct image inputs (BMP, TIFF, TIF) by changing GraphIngestor.extract()'s default extraction_mode from "pdf" to "auto", so callers no longer need to pass extraction_mode="auto" explicitly for mixed or image-only corpora.

  • graph_ingestor.py: one-line default change from "pdf" to "auto" in extract(), with updated docstring.
  • test_ingest_interface.py: adds parametrized BMP/TIFF materialization tests and a mixed PDF+BMP auto-dispatch test, both using monkeypatch.setattr against private _MultiTypeExtractBase methods.
  • graph_ingestor.md: new reference document covering run modes, extraction modes, the auto-dispatch extension table, split_config, and post-extraction stages.

Confidence Score: 5/5

Safe to merge; the change is a one-line default swap in extract() with accompanying tests and documentation, and the MultiTypeExtractOperator auto-dispatch path was already implemented before this PR.

The core change is small and well-scoped: extract() now routes callers through MultiTypeExtractOperator by default rather than requiring an explicit extraction_mode="auto". The underlying dispatch infrastructure predates this PR, so the risk of a broken path is low. The new tests exercise BMP/TIFF materialization and mixed-corpus dispatch, confirming the intended behavior works end-to-end with mocked inference stages.

No files require special attention beyond the minor _extraction_mode initializer inconsistency in graph_ingestor.py.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/graph_ingestor.py Changes extract() default from extraction_mode="pdf" to "auto" and updates the docstring; __init__ still initializes _extraction_mode to "pdf", creating a minor internal inconsistency.
nemo_retriever/tests/test_ingest_interface.py Adds parametrized BMP/TIFF image materialization tests and a mixed PDF+BMP auto-dispatch test; both new tests mock private internals of _MultiTypeExtractBase rather than the public operator boundary.
nemo_retriever/src/nemo_retriever/graph_ingestor.md New reference documentation covering GraphIngestor usage, extraction modes, auto-dispatch extension groups, split_config, and post-extraction stages.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["GraphIngestor.extract()"] -->|extraction_mode default| B{"extraction_mode"}
    B -->|'auto' NEW DEFAULT| C["MultiTypeExtractOperator"]
    B -->|'pdf' explicit| D["PDF/Document Graph"]
    B -->|'image' explicit| E["ImageLoadActor"]
    B -->|'text' explicit| F["TxtSplitActor"]
    B -->|'html' explicit| G["HtmlSplitActor"]
    B -->|'audio' explicit| H["MediaChunkActor → ASRActor"]

    C -->|".pdf .docx .pptx"| D
    C -->|".png .jpg .bmp .tiff .tif .svg"| E
    C -->|".txt"| F
    C -->|".html"| G
    C -->|".mp3 .wav"| H

    D --> I["_run_pdf_pipeline"]
    E --> J["_run_detection_pipeline"]
    I --> K["Final Result DataFrame"]
    J --> K
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/src/nemo_retriever/graph_ingestor.py:220
The instance attribute is still initialized to `"pdf"` in `__init__`, while `extract()` now defaults to `"auto"`. Anyone reading the class body first will conclude the natural default is PDF mode, when in practice `.extract()` (required for any ingestion) overrides it to `"auto"`. Aligning the two removes the misleading signal.

```suggestion
        self._extraction_mode: str = "auto"
```

Reviews (5): Last reviewed commit: "Merge branch 'main' into investigate/ima..." | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/graph_ingestor.py
Comment on lines +92 to +95
monkeypatch.setattr(
"nemo_retriever.graph.multi_type_extract_operator._MultiTypeExtractBase._run_detection_pipeline",
lambda self, batch_df: batch_df,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Mocking too deep in the stack — fragile tests ahead. Both new tests patch _MultiTypeExtractBase._run_detection_pipeline and _run_pdf_pipeline, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed.

Rule Used: Tests must mock all external service dependencies ... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_ingest_interface.py
Line: 92-95

Comment:
Mocking too deep in the stack — fragile tests ahead. Both new tests patch `_MultiTypeExtractBase._run_detection_pipeline` and `_run_pdf_pipeline`, which are private implementation methods two levels below the public boundary. The mocking-discipline rule says to mock at the boundary (the public operator entry point or the I/O seam), not inside the call stack. A future refactoring that renames or restructures these private helpers will silently pass these tests even though the real dispatch logic has changed.

**Rule Used:** Tests must mock all external service dependencies ... ([source](https://app.greptile.com/review/custom-context?memory=mock-external-services))

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@@ -0,0 +1,213 @@
# GraphIngestor
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure this is still up to date with all the changes we made recently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants