fix asr and ocr on default cpu remote by jperez999 · Pull Request #2085 · NVIDIA/NeMo-Retriever

jperez999 · 2026-05-21T16:16:25Z

Description

Summary

End-to-end fix for retriever ingest <video.mp4>, which previously logged success while silently failing both
video frame extraction and audio transcription. Along the way, restructured the ASR actor stack to match the
CPU/GPU split used elsewhere in the codebase, defaulted the remote Parakeet path to NVCF, and made the CLI's
"Ingested …" message report actual landed rows instead of input file count.

What broke and what changed

ASR actor split: remote-only CPU variant, local-only GPU variant

Before: ASRCPUActor was a conflated actor that handled both remote (Riva gRPC) and local (HuggingFace Parakeet)
ASR via an if/else on _use_remote(params). ASRGPUActor inherited from it and merely added the GPU operator
mixin. Construction of either class with default ASRParams() loaded torch unconditionally — which is what made
retriever ingest some.mp3 drag torch + transformers + nemotron-* into the import graph on a CPU-only ingest
path.

After:

ASRCPUActor (audio/cpu_actor.py, new) — remote-only Parakeet/Riva gRPC client. Carries DEFAULT_GRPC_ENDPOINT
= "grpc.nvcf.nvidia.com:443" and DEFAULT_FUNCTION_ID = "bb0837de-8c7b-481f-9ec8-ef5663e9c1fa" class constants,
plus _apply_actor_defaults that pre-fills audio_endpoints, function_id, and auth_token (from NVIDIA_API_KEY
env) when the caller leaves them empty. Honors AUDIO_FUNCTION_ID env override. Never imports torch.
ASRGPUActor (audio/gpu_actor.py, new) — local-only nvidia/parakeet-ctc-1.1b via HuggingFace transformers.
Eagerly loads weights at construction.
_ASRActorBase mixin in audio/asr_actor.py — carries shared _build_output_rows / preprocess / postprocess so
the two variants don't duplicate row-shaping code.
ASRActor archetype — adds prefers_cpu_variant(kwargs) returning True iff params.audio_endpoints is populated,
lazy-imports both variants via cpu_variant_class() / gpu_variant_class(). Also reads AUDIO_GRPC_ENDPOINT from
env at init time and pre-fills audio_endpoints so the env var alone forces the remote (CPU) variant on any
host.
Lazy getattr on audio/asr_actor.py re-exports ASRCPUActor / ASRGPUActor so existing from
nemo_retriever.audio.asr_actor import ASRCPUActor callers keep working without circular-import gymnastics.
Dropped the no-longer-needed AudioChunkParams.audio_only field, its CLI flag (audio/stage.py), and the
matching plumbing through chunk_actor.py / media_interface.py. The video chunker now always pre-extracts MP3
for .mp4/.mov/.avi/.mkv inputs (see Add missing build script, fix docker-compose to use self hosted deplot #4).

Riva client: streaming RPC, correct sample rate, header stripping

ParakeetClient.transcribe() was calling offline_recognize with a RecognitionConfig missing both encoding and
sample_rate_hertz, and convert_to_mono_wav was resampling to 44100 Hz (Parakeet trains at 16 kHz). The NVCF
Parakeet deployment at 1598d209-... and bb0837de-... is streaming-only (verified by probing
GetRivaSpeechRecognitionConfig — both return type=online, streaming=True, offline=False), so offline_recognize
returned INVALID_ARGUMENT: Unavailable model requested no matter what params we sent.

RecognitionConfig now sets encoding=AudioEncoding.LINEAR_PCM and sample_rate_hertz=PARAKEET_SAMPLE_RATE_HZ
(16 kHz).
convert_to_mono_wav resamples to 16 kHz (matching Parakeet's training rate).
transcribe() now calls streaming_response_generator instead of offline_recognize. New _streaming_transcribe
helper:
- Strips the WAV header via the stdlib wave module (the gRPC stream expects raw PCM matching the declared
  encoding; WAV header bytes would be parsed as samples).
- Chunks the PCM payload into 32 KB pieces (~1 s of audio each) and feeds the generator.
- Filters is_final=True results and wraps them in _StreamingResponseShim — a tiny adapter whose .results
  field matches the offline-RPC shape so process_transcription_response keeps working unchanged.

ffmpeg frame extraction: JPEG instead of PNG

The container ships a slim custom-built ffmpeg (/usr/local/bin/ffmpeg, 8.0.1, configured without the PNG
encoder). Frame extraction failed with Default encoder for format image2 (codec png) is probably disabled,
returning 0 frames to VideoFrameOCRActor.

audio/media_interface.py:extract_frames now writes frame%06d.jpg and globs *.jpg. The mjpeg encoder is
built into every ffmpeg variant, so this works regardless of build flags. JPEG @ q:v=2 is near-lossless for OCR
purposes and slashes the per-frame base64 payload by ~10×, which compounds over the OCR HTTP round-trips.
Downstream consumers (Nemotron OCR over base64, perceptual-hash dedup via PIL) are format-agnostic — no other
changes needed.

Audio chunker: always pre-extract MP3 from video

For video inputs the chunker used to preserve the source container, so .mp4 → chunk_NNNN.mp4. The Parakeet
client decodes audio via soundfile/libsndfile, which can't read MP4 containers — every chunk failed with Format
not recognised before reaching the NIM.

media_interface.py:split now unconditionally extracts MP3 audio from any .mp4/.mov/.avi/.mkv input before
chunking. Chunks are emitted as .mp3, which soundfile reads natively.
AudioChunkParams.audio_only and the --audio-only CLI flag are removed (the field was the only knob
controlling this behavior, and there was no reachable consumer that wanted the old behavior).

Param-model defaults: NVCF endpoint baked into the actor

params/models.py:ASRParams docstring updated to reflect the new selection contract (the archetype, not a
field, chooses local vs remote). No force_local field introduced — kept the model minimal.
audio/asr_actor.py:DEFAULT_NGC_ASR_FUNCTION_ID updated to "bb0837de-8c7b-481f-9ec8-ef5663e9c1fa". Single
source of truth — ASRCPUActor.DEFAULT_FUNCTION_ID and asr_params_from_env's default both reference this
constant.
asr_params_from_env and ASRCPUActor both source the bearer token from NVIDIA_API_KEY (was NGC_API_KEY);
auth_token_var parameter default updated accordingly. Stage CLI help text + docstring updated.
audio/stage.py log line no longer claims "local (Parakeet)" when the archetype will actually fill in NVCF —
it now reports archetype-resolved (GPU local / NVCF default) when no explicit endpoint is passed.

CLI ingest message reports actual landed rows

Ingested N document(s) was len(summary["documents"]) — the input file count, never touching LanceDB. Misleading
for any document that explodes into multiple chunks (PDFs, videos), and completely hid the case where every
stage failed and the table got zero rows.

adapters/cli/sdk_workflow.py:ingest_documents runs new _count_lancedb_rows(uri, table) after
vdb_upload(...).ingest() and threads n_rows through the summary. Best-effort: any exception is swallowed and
returns None, so a flaky table read can't break ingestion.
adapters/cli/main.py now prints Ingested {n_files} file(s) → {n_rows} row(s) in LanceDB {uri}/{table}. Falls
back to ... (row count unavailable). when the count failed.

Base dependencies

nvidia-riva-client added to base dependencies in nemo_retriever/pyproject.toml. Required for the Riva gRPC
stubs (riva.client.ASRService, RecognitionConfig, etc.) that the remote ASR path now depends on by default.
nvidia-riva-client v2.25.1 resolves cleanly (uv lock: 355 packages).

Behavior contract after the change

Test impact

Updated tests/test_audio_pipeline_batch.py:test_inprocess_audio_pipeline_local_asr_mocked — now mocks
gather_local_resources to force the GPU variant so the local-Parakeet assertions are deterministic regardless
of CI host.
Updated tests/test_asr_actor.py:test_local_asr_does_not_call_get_client — local-Parakeet probe now constructs
ASRGPUActor directly (the new local-only class) and drops the _client attribute assertion (no longer
meaningful for the GPU variant).
Updated three tests in tests/test_root_cli_workflow.py — assertions now expect the new "N file(s) → M row(s)"
wording and stub sdk_workflow._count_lancedb_rows to keep them off-disk.
All ASR / audio / CLI tests pass (77 passed in the audio batch, 71 passed in the CLI/workflow batch, 1419
passed in the broader sweep).

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

charlesbluca

The container ships a slim custom-built ffmpeg (/usr/local/bin/ffmpeg, 8.0.1, configured without the PNG encoder).

Think this is outdated - with #2052, we should now be pulling ffmpeg 7:4.4.2-0ubuntu0.22.04.1, which I confirmed packages PNG encoder

greptile-apps · 2026-05-21T16:22:15Z

Greptile Summary

This PR fixes end-to-end video ingest by correcting the Riva gRPC streaming path (encoding config + sample rate + WAV header stripping), switching ffmpeg frame extraction from PNG to JPEG, and unconditionally pre-extracting MP3 audio from video containers before chunking. It also restructures the ASR actor stack into discrete CPU (remote/NVCF) and GPU (local Parakeet) variants, defaulting CPU-only hosts to the NVCF Parakeet endpoint, and updates the CLI ingest output to report actual LanceDB row counts.

ASR split: ASRCPUActor (remote Riva gRPC, no torch) and ASRGPUActor (local nvidia/parakeet-ctc-1.1b) replace the old single conflated class; ASRActor archetype lazy-imports the correct variant based on GPU availability and audio_endpoints.
Riva streaming fix: transcribe() now uses streaming_response_generator with correct LINEAR_PCM / 16 kHz config and strips the WAV header via wave before sending raw PCM chunks.
CLI honesty: _count_lancedb_rows() is called post-ingest and the final message now prints N file(s) → M row(s) instead of a misleading input-count-only line.

Confidence Score: 4/5

The core Riva streaming fix, JPEG frame extraction, and unconditional audio pre-extraction are all sound; the ASR actor split is well-structured. One defect in media_interface.split() — the video_audio_separate guard silently never fires for video inputs after the unconditional MP3 pre-extraction reassigns input_path — remains unresolved.

The streaming transcription path, sample-rate alignment, and CPU/GPU actor split are correctly implemented and well-tested. The open video_audio_separate silent no-op means any pipeline with video_audio_separate=True will silently produce no side-car audio chunks on a changed code path introduced by this PR.

nemo_retriever/src/nemo_retriever/audio/media_interface.py — the video_audio_separate guard at line 353 needs to key off the original input suffix (path_input.suffix) rather than the post-conversion suffix.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/audio/cpu_actor.py	New remote-only ASRCPUActor; NVCF defaults filled by _apply_actor_defaults. Early-return when caller supplies explicit endpoints skips auth_token env-var back-fill.
nemo_retriever/src/nemo_retriever/audio/gpu_actor.py	New local-only ASRGPUActor; eagerly loads ParakeetCTC1B1ASR at construction. Missing public unload()/cleanup() method for GPU memory release.
nemo_retriever/src/nemo_retriever/audio/asr_actor.py	Refactored to archetype + _ASRActorBase shared mixin; lazy re-exports via PEP 562 getattr. Logic for env-var-driven endpoint pre-fill and prefers_cpu_variant is sound.
nemo_retriever/src/nemo_retriever/api/internal/primitives/nim/model_interface/parakeet.py	Streaming transcription path added with correct LINEAR_PCM encoding, 16 kHz sample rate, WAV header stripping, and 32 KB PCM chunking. _StreamingResponseShim correctly adapts streaming results to the offline response shape.
nemo_retriever/src/nemo_retriever/audio/media_interface.py	Unconditional video to MP3 pre-extraction fixes the libsndfile compatibility gap; ffmpeg frame extraction switched to JPEG. The video_audio_separate guard at line 353 uses the reassigned MP3 input_path, so its suffix check always fails for video inputs.
nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py	_count_lancedb_rows best-effort helper correctly swallows exceptions and reports None; n_rows threads through the summary dict cleanly.
nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py	Added audio_only flag gating frames_enabled; consistent with video_asr_audio_chunk_params no longer forcing audio_only=True.
nemo_retriever/src/nemo_retriever/params/models.py	audio_only field retained in AudioChunkParams; docstring updated for ASRParams; audio_infer_protocol default remains grpc. No breaking changes.

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/audio/gpu_actor.py:45-50
The new `ASRGPUActor` loads `ParakeetCTC1B1ASR` at construction time but provides no public method to release the GPU allocation afterwards. Per the `gpu-memory-lifecycle` rule, every model-holding class must expose an explicit cleanup path (`del` the tensor reference + `torch.cuda.empty_cache()`); without it, long-running processes or test suites that instantiate multiple actors will accumulate VRAM with no way to reclaim it.

```suggestion
    def __init__(self, params: ASRParams | None = None) -> None:
        super().__init__(params=params)
        self._params = params or ASRParams()
        from nemo_retriever.model.local import ParakeetCTC1B1ASR

        self._model = ParakeetCTC1B1ASR()

    def unload(self) -> None:
        """Release the local Parakeet model and free GPU memory."""
        self._model = None
        try:
            import torch

            torch.cuda.empty_cache()
        except Exception:
            pass
```

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/audio/cpu_actor.py:766-780
**Auth token silently absent for caller-supplied custom endpoints**

`_apply_actor_defaults` returns early before the env-var token check when `grpc_ep` or `http_ep` is truthy. A caller who passes an explicit endpoint without setting `auth_token` on the params object gets a client with no bearer token, even when the API key env var is present in the shell. The NVCF endpoint path (empty endpoints → early-fill) is correct; it is only the user-supplied-endpoint path that skips the credential lookup.

_{Reviews (5): Last reviewed commit: "fix unit tests" | Re-trigger Greptile}

jperez999 · 2026-05-22T00:28:04Z

Centralize gather_local_resources dispatch (4 src + 4 test files)

Every call site that used to from … import gather_local_resources now imports the module (from nemo_retriever.utils import
ray_resource_hueristics as _rrh) and calls _rrh.gather_local_resources(). Affects graph/executor.py, graph/multi_type_extract_operator.py,
graph/operator_archetype.py, graph/operator_resolution.py.

Why this matters: previously each module bound its own local reference to the function, so tests had to
monkeypatch.setattr("nemo_retriever.graph.X.gather_local_resources", …) separately for every dispatch site. Now there's one canonical patch
point — nemo_retriever.utils.ray_resource_hueristics.gather_local_resources — used uniformly by test_asr_actor.py,
test_audio_pipeline_batch.py, test_ocr_version_selection.py, and test_pipeline_graph.py (which dropped the duplicate patches it used to need
on operator_resolution and operator_archetype).

Code-review fixes

params/models.py:177 — ASRParams.audio_infer_protocol default flipped from "http" → "grpc". Callers who pass an explicit gRPC endpoint
without setting the protocol no longer get an HTTP handshake to a gRPC server; _apply_actor_defaults's early-return preserved the bad default,
and "http" was truthy so the or "grpc" fallback in _get_client couldn't recover.
pyproject.toml:71 — nvidia-riva-client got a lower bound of >=2.25.1, matching the version the PR description names.

Test-hermeticity fix

tests/test_service_pipeline_spec.py — added an autouse monkeypatch.delenv fixture for NVIDIA_API_KEY / NGC_API_KEY. _ParamsModel
auto-resolves unset *api_key fields from those env vars, which then trips ServiceIngestor._strip_server_owned() and turns three unit tests red
whenever the shell has a real key set.

Signed-off-by: Julio Perez <jperez@nvidia.com>

Signed-off-by: Julio Perez <jperez@nvidia.com> Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com> (cherry picked from commit 7130a72)

fix asr and ocr on default cpu remote

2587c3f

jperez999 requested review from a team as code owners May 21, 2026 16:16

jperez999 requested a review from nkmcalli May 21, 2026 16:16

charlesbluca requested changes May 21, 2026

View reviewed changes

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/params/models.py Outdated

Comment thread nemo_retriever/src/nemo_retriever/params/models.py

Comment thread nemo_retriever/pyproject.toml Outdated

charlesbluca and others added 4 commits May 21, 2026 12:49

Merge branch 'main' into audio-video-up

9140296

merge in main

0020be5

merge in

6eca4de

fix audio_only deprecation

f490277

jperez999 added 2 commits May 21, 2026 18:07

merge in main

e2dde73

fix unit tests

5a51b69

Signed-off-by: Julio Perez <jperez@nvidia.com>

jperez999 requested a review from charlesbluca May 22, 2026 13:18

jdye64 approved these changes May 22, 2026

View reviewed changes

jperez999 merged commit 7130a72 into NVIDIA:main May 22, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix asr and ocr on default cpu remote#2085

fix asr and ocr on default cpu remote#2085
jperez999 merged 7 commits into
NVIDIA:mainfrom
jperez999:audio-video-up

jperez999 commented May 21, 2026

Uh oh!

charlesbluca left a comment

Uh oh!

greptile-apps Bot commented May 21, 2026 •

edited

Loading

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jperez999 commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jperez999 commented May 21, 2026

Description

Checklist

Uh oh!

charlesbluca left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jperez999 commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented May 21, 2026 •

edited

Loading