Skip to content

fix asr and ocr on default cpu remote#2085

Merged
jperez999 merged 7 commits into
NVIDIA:mainfrom
jperez999:audio-video-up
May 22, 2026
Merged

fix asr and ocr on default cpu remote#2085
jperez999 merged 7 commits into
NVIDIA:mainfrom
jperez999:audio-video-up

Conversation

@jperez999
Copy link
Copy Markdown
Collaborator

Description

Summary

End-to-end fix for retriever ingest <video.mp4>, which previously logged success while silently failing both
video frame extraction and audio transcription. Along the way, restructured the ASR actor stack to match the
CPU/GPU split used elsewhere in the codebase, defaulted the remote Parakeet path to NVCF, and made the CLI's
"Ingested …" message report actual landed rows instead of input file count.

What broke and what changed

  1. ASR actor split: remote-only CPU variant, local-only GPU variant

Before: ASRCPUActor was a conflated actor that handled both remote (Riva gRPC) and local (HuggingFace Parakeet)
ASR via an if/else on _use_remote(params). ASRGPUActor inherited from it and merely added the GPU operator
mixin. Construction of either class with default ASRParams() loaded torch unconditionally — which is what made
retriever ingest some.mp3 drag torch + transformers + nemotron-* into the import graph on a CPU-only ingest
path.

After:

  • ASRCPUActor (audio/cpu_actor.py, new) — remote-only Parakeet/Riva gRPC client. Carries DEFAULT_GRPC_ENDPOINT
    = "grpc.nvcf.nvidia.com:443" and DEFAULT_FUNCTION_ID = "bb0837de-8c7b-481f-9ec8-ef5663e9c1fa" class constants,
    plus _apply_actor_defaults that pre-fills audio_endpoints, function_id, and auth_token (from NVIDIA_API_KEY
    env) when the caller leaves them empty. Honors AUDIO_FUNCTION_ID env override. Never imports torch.
  • ASRGPUActor (audio/gpu_actor.py, new) — local-only nvidia/parakeet-ctc-1.1b via HuggingFace transformers.
    Eagerly loads weights at construction.
  • _ASRActorBase mixin in audio/asr_actor.py — carries shared _build_output_rows / preprocess / postprocess so
    the two variants don't duplicate row-shaping code.
  • ASRActor archetype — adds prefers_cpu_variant(kwargs) returning True iff params.audio_endpoints is populated,
    lazy-imports both variants via cpu_variant_class() / gpu_variant_class(). Also reads AUDIO_GRPC_ENDPOINT from
    env at init time and pre-fills audio_endpoints so the env var alone forces the remote (CPU) variant on any
    host.
  • Lazy getattr on audio/asr_actor.py re-exports ASRCPUActor / ASRGPUActor so existing from
    nemo_retriever.audio.asr_actor import ASRCPUActor callers keep working without circular-import gymnastics.
  • Dropped the no-longer-needed AudioChunkParams.audio_only field, its CLI flag (audio/stage.py), and the
    matching plumbing through chunk_actor.py / media_interface.py. The video chunker now always pre-extracts MP3
    for .mp4/.mov/.avi/.mkv inputs (see Add missing build script, fix docker-compose to use self hosted deplot #4).
  1. Riva client: streaming RPC, correct sample rate, header stripping

ParakeetClient.transcribe() was calling offline_recognize with a RecognitionConfig missing both encoding and
sample_rate_hertz, and convert_to_mono_wav was resampling to 44100 Hz (Parakeet trains at 16 kHz). The NVCF
Parakeet deployment at 1598d209-... and bb0837de-... is streaming-only (verified by probing
GetRivaSpeechRecognitionConfig — both return type=online, streaming=True, offline=False), so offline_recognize
returned INVALID_ARGUMENT: Unavailable model requested no matter what params we sent.

  • RecognitionConfig now sets encoding=AudioEncoding.LINEAR_PCM and sample_rate_hertz=PARAKEET_SAMPLE_RATE_HZ
    (16 kHz).
  • convert_to_mono_wav resamples to 16 kHz (matching Parakeet's training rate).
  • transcribe() now calls streaming_response_generator instead of offline_recognize. New _streaming_transcribe
    helper:
    • Strips the WAV header via the stdlib wave module (the gRPC stream expects raw PCM matching the declared
      encoding; WAV header bytes would be parsed as samples).
    • Chunks the PCM payload into 32 KB pieces (~1 s of audio each) and feeds the generator.
    • Filters is_final=True results and wraps them in _StreamingResponseShim — a tiny adapter whose .results
      field matches the offline-RPC shape so process_transcription_response keeps working unchanged.
  1. ffmpeg frame extraction: JPEG instead of PNG

The container ships a slim custom-built ffmpeg (/usr/local/bin/ffmpeg, 8.0.1, configured without the PNG
encoder). Frame extraction failed with Default encoder for format image2 (codec png) is probably disabled,
returning 0 frames to VideoFrameOCRActor.

  • audio/media_interface.py:extract_frames now writes frame%06d.jpg and globs *.jpg. The mjpeg encoder is
    built into every ffmpeg variant, so this works regardless of build flags. JPEG @ q:v=2 is near-lossless for OCR
    purposes and slashes the per-frame base64 payload by ~10×, which compounds over the OCR HTTP round-trips.
  • Downstream consumers (Nemotron OCR over base64, perceptual-hash dedup via PIL) are format-agnostic — no other
    changes needed.
  1. Audio chunker: always pre-extract MP3 from video

For video inputs the chunker used to preserve the source container, so .mp4 → chunk_NNNN.mp4. The Parakeet
client decodes audio via soundfile/libsndfile, which can't read MP4 containers — every chunk failed with Format
not recognised before reaching the NIM.

  • media_interface.py:split now unconditionally extracts MP3 audio from any .mp4/.mov/.avi/.mkv input before
    chunking. Chunks are emitted as .mp3, which soundfile reads natively.
  • AudioChunkParams.audio_only and the --audio-only CLI flag are removed (the field was the only knob
    controlling this behavior, and there was no reachable consumer that wanted the old behavior).
  1. Param-model defaults: NVCF endpoint baked into the actor
  • params/models.py:ASRParams docstring updated to reflect the new selection contract (the archetype, not a
    field, chooses local vs remote). No force_local field introduced — kept the model minimal.
  • audio/asr_actor.py:DEFAULT_NGC_ASR_FUNCTION_ID updated to "bb0837de-8c7b-481f-9ec8-ef5663e9c1fa". Single
    source of truth — ASRCPUActor.DEFAULT_FUNCTION_ID and asr_params_from_env's default both reference this
    constant.
  • asr_params_from_env and ASRCPUActor both source the bearer token from NVIDIA_API_KEY (was NGC_API_KEY);
    auth_token_var parameter default updated accordingly. Stage CLI help text + docstring updated.
  • audio/stage.py log line no longer claims "local (Parakeet)" when the archetype will actually fill in NVCF —
    it now reports archetype-resolved (GPU local / NVCF default) when no explicit endpoint is passed.
  1. CLI ingest message reports actual landed rows

Ingested N document(s) was len(summary["documents"]) — the input file count, never touching LanceDB. Misleading
for any document that explodes into multiple chunks (PDFs, videos), and completely hid the case where every
stage failed and the table got zero rows.

  • adapters/cli/sdk_workflow.py:ingest_documents runs new _count_lancedb_rows(uri, table) after
    vdb_upload(...).ingest() and threads n_rows through the summary. Best-effort: any exception is swallowed and
    returns None, so a flaky table read can't break ingestion.
  • adapters/cli/main.py now prints Ingested {n_files} file(s) → {n_rows} row(s) in LanceDB {uri}/{table}. Falls
    back to ... (row count unavailable). when the count failed.
  1. Base dependencies
  • nvidia-riva-client added to base dependencies in nemo_retriever/pyproject.toml. Required for the Riva gRPC
    stubs (riva.client.ASRService, RecognitionConfig, etc.) that the remote ASR path now depends on by default.
    nvidia-riva-client v2.25.1 resolves cleanly (uv lock: 355 packages).

Behavior contract after the change

image

Test impact

  • Updated tests/test_audio_pipeline_batch.py:test_inprocess_audio_pipeline_local_asr_mocked — now mocks
    gather_local_resources to force the GPU variant so the local-Parakeet assertions are deterministic regardless
    of CI host.
  • Updated tests/test_asr_actor.py:test_local_asr_does_not_call_get_client — local-Parakeet probe now constructs
    ASRGPUActor directly (the new local-only class) and drops the _client attribute assertion (no longer
    meaningful for the GPU variant).
  • Updated three tests in tests/test_root_cli_workflow.py — assertions now expect the new "N file(s) → M row(s)"
    wording and stub sdk_workflow._count_lancedb_rows to keep them off-disk.
  • All ASR / audio / CLI tests pass (77 passed in the audio batch, 71 passed in the CLI/workflow batch, 1419
    passed in the broader sweep).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@jperez999 jperez999 requested review from a team as code owners May 21, 2026 16:16
@jperez999 jperez999 requested a review from nkmcalli May 21, 2026 16:16
Copy link
Copy Markdown
Collaborator

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container ships a slim custom-built ffmpeg (/usr/local/bin/ffmpeg, 8.0.1, configured without the PNG encoder).

Think this is outdated - with #2052, we should now be pulling ffmpeg 7:4.4.2-0ubuntu0.22.04.1, which I confirmed packages PNG encoder

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR fixes end-to-end video ingest by correcting the Riva gRPC streaming path (encoding config + sample rate + WAV header stripping), switching ffmpeg frame extraction from PNG to JPEG, and unconditionally pre-extracting MP3 audio from video containers before chunking. It also restructures the ASR actor stack into discrete CPU (remote/NVCF) and GPU (local Parakeet) variants, defaulting CPU-only hosts to the NVCF Parakeet endpoint, and updates the CLI ingest output to report actual LanceDB row counts.

  • ASR split: ASRCPUActor (remote Riva gRPC, no torch) and ASRGPUActor (local nvidia/parakeet-ctc-1.1b) replace the old single conflated class; ASRActor archetype lazy-imports the correct variant based on GPU availability and audio_endpoints.
  • Riva streaming fix: transcribe() now uses streaming_response_generator with correct LINEAR_PCM / 16 kHz config and strips the WAV header via wave before sending raw PCM chunks.
  • CLI honesty: _count_lancedb_rows() is called post-ingest and the final message now prints N file(s) → M row(s) instead of a misleading input-count-only line.

Confidence Score: 4/5

The core Riva streaming fix, JPEG frame extraction, and unconditional audio pre-extraction are all sound; the ASR actor split is well-structured. One defect in media_interface.split() — the video_audio_separate guard silently never fires for video inputs after the unconditional MP3 pre-extraction reassigns input_path — remains unresolved.

The streaming transcription path, sample-rate alignment, and CPU/GPU actor split are correctly implemented and well-tested. The open video_audio_separate silent no-op means any pipeline with video_audio_separate=True will silently produce no side-car audio chunks on a changed code path introduced by this PR.

nemo_retriever/src/nemo_retriever/audio/media_interface.py — the video_audio_separate guard at line 353 needs to key off the original input suffix (path_input.suffix) rather than the post-conversion suffix.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/audio/cpu_actor.py New remote-only ASRCPUActor; NVCF defaults filled by _apply_actor_defaults. Early-return when caller supplies explicit endpoints skips auth_token env-var back-fill.
nemo_retriever/src/nemo_retriever/audio/gpu_actor.py New local-only ASRGPUActor; eagerly loads ParakeetCTC1B1ASR at construction. Missing public unload()/cleanup() method for GPU memory release.
nemo_retriever/src/nemo_retriever/audio/asr_actor.py Refactored to archetype + _ASRActorBase shared mixin; lazy re-exports via PEP 562 getattr. Logic for env-var-driven endpoint pre-fill and prefers_cpu_variant is sound.
nemo_retriever/src/nemo_retriever/api/internal/primitives/nim/model_interface/parakeet.py Streaming transcription path added with correct LINEAR_PCM encoding, 16 kHz sample rate, WAV header stripping, and 32 KB PCM chunking. _StreamingResponseShim correctly adapts streaming results to the offline response shape.
nemo_retriever/src/nemo_retriever/audio/media_interface.py Unconditional video to MP3 pre-extraction fixes the libsndfile compatibility gap; ffmpeg frame extraction switched to JPEG. The video_audio_separate guard at line 353 uses the reassigned MP3 input_path, so its suffix check always fails for video inputs.
nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py _count_lancedb_rows best-effort helper correctly swallows exceptions and reports None; n_rows threads through the summary dict cleanly.
nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py Added audio_only flag gating frames_enabled; consistent with video_asr_audio_chunk_params no longer forcing audio_only=True.
nemo_retriever/src/nemo_retriever/params/models.py audio_only field retained in AudioChunkParams; docstring updated for ASRParams; audio_infer_protocol default remains grpc. No breaking changes.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/audio/gpu_actor.py:45-50
The new `ASRGPUActor` loads `ParakeetCTC1B1ASR` at construction time but provides no public method to release the GPU allocation afterwards. Per the `gpu-memory-lifecycle` rule, every model-holding class must expose an explicit cleanup path (`del` the tensor reference + `torch.cuda.empty_cache()`); without it, long-running processes or test suites that instantiate multiple actors will accumulate VRAM with no way to reclaim it.

```suggestion
    def __init__(self, params: ASRParams | None = None) -> None:
        super().__init__(params=params)
        self._params = params or ASRParams()
        from nemo_retriever.model.local import ParakeetCTC1B1ASR

        self._model = ParakeetCTC1B1ASR()

    def unload(self) -> None:
        """Release the local Parakeet model and free GPU memory."""
        self._model = None
        try:
            import torch

            torch.cuda.empty_cache()
        except Exception:
            pass
```

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/audio/cpu_actor.py:766-780
**Auth token silently absent for caller-supplied custom endpoints**

`_apply_actor_defaults` returns early before the env-var token check when `grpc_ep` or `http_ep` is truthy. A caller who passes an explicit endpoint without setting `auth_token` on the params object gets a client with no bearer token, even when the API key env var is present in the shell. The NVCF endpoint path (empty endpoints → early-fill) is correct; it is only the user-supplied-endpoint path that skips the credential lookup.

Reviews (5): Last reviewed commit: "fix unit tests" | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/params/models.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/params/models.py
Comment thread nemo_retriever/pyproject.toml Outdated
@jperez999
Copy link
Copy Markdown
Collaborator Author

  1. Centralize gather_local_resources dispatch (4 src + 4 test files)

Every call site that used to from … import gather_local_resources now imports the module (from nemo_retriever.utils import
ray_resource_hueristics as _rrh) and calls _rrh.gather_local_resources(). Affects graph/executor.py, graph/multi_type_extract_operator.py,
graph/operator_archetype.py, graph/operator_resolution.py.

Why this matters: previously each module bound its own local reference to the function, so tests had to
monkeypatch.setattr("nemo_retriever.graph.X.gather_local_resources", …) separately for every dispatch site. Now there's one canonical patch
point — nemo_retriever.utils.ray_resource_hueristics.gather_local_resources — used uniformly by test_asr_actor.py,
test_audio_pipeline_batch.py, test_ocr_version_selection.py, and test_pipeline_graph.py (which dropped the duplicate patches it used to need
on operator_resolution and operator_archetype).

  1. Code-review fixes
  • params/models.py:177 — ASRParams.audio_infer_protocol default flipped from "http" → "grpc". Callers who pass an explicit gRPC endpoint
    without setting the protocol no longer get an HTTP handshake to a gRPC server; _apply_actor_defaults's early-return preserved the bad default,
    and "http" was truthy so the or "grpc" fallback in _get_client couldn't recover.
  • pyproject.toml:71 — nvidia-riva-client got a lower bound of >=2.25.1, matching the version the PR description names.
  1. Test-hermeticity fix
  • tests/test_service_pipeline_spec.py — added an autouse monkeypatch.delenv fixture for NVIDIA_API_KEY / NGC_API_KEY. _ParamsModel
    auto-resolves unset *api_key fields from those env vars, which then trips ServiceIngestor._strip_server_owned() and turns three unit tests red
    whenever the shell has a real key set.

jperez999 added 2 commits May 21, 2026 18:07
Signed-off-by: Julio Perez <jperez@nvidia.com>
@jperez999 jperez999 requested a review from charlesbluca May 22, 2026 13:18
@jperez999 jperez999 merged commit 7130a72 into NVIDIA:main May 22, 2026
9 checks passed
charlesbluca added a commit that referenced this pull request May 22, 2026
Signed-off-by: Julio Perez <jperez@nvidia.com>
Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
(cherry picked from commit 7130a72)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants