fix asr and ocr on default cpu remote#2085
Conversation
charlesbluca
left a comment
There was a problem hiding this comment.
The container ships a slim custom-built ffmpeg (/usr/local/bin/ffmpeg, 8.0.1, configured without the PNG encoder).
Think this is outdated - with #2052, we should now be pulling ffmpeg 7:4.4.2-0ubuntu0.22.04.1, which I confirmed packages PNG encoder
Greptile SummaryThis PR fixes end-to-end video ingest by correcting the Riva gRPC streaming path (encoding config + sample rate + WAV header stripping), switching ffmpeg frame extraction from PNG to JPEG, and unconditionally pre-extracting MP3 audio from video containers before chunking. It also restructures the ASR actor stack into discrete CPU (remote/NVCF) and GPU (local Parakeet) variants, defaulting CPU-only hosts to the NVCF Parakeet endpoint, and updates the CLI ingest output to report actual LanceDB row counts.
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/audio/cpu_actor.py | New remote-only ASRCPUActor; NVCF defaults filled by _apply_actor_defaults. Early-return when caller supplies explicit endpoints skips auth_token env-var back-fill. |
| nemo_retriever/src/nemo_retriever/audio/gpu_actor.py | New local-only ASRGPUActor; eagerly loads ParakeetCTC1B1ASR at construction. Missing public unload()/cleanup() method for GPU memory release. |
| nemo_retriever/src/nemo_retriever/audio/asr_actor.py | Refactored to archetype + _ASRActorBase shared mixin; lazy re-exports via PEP 562 getattr. Logic for env-var-driven endpoint pre-fill and prefers_cpu_variant is sound. |
| nemo_retriever/src/nemo_retriever/api/internal/primitives/nim/model_interface/parakeet.py | Streaming transcription path added with correct LINEAR_PCM encoding, 16 kHz sample rate, WAV header stripping, and 32 KB PCM chunking. _StreamingResponseShim correctly adapts streaming results to the offline response shape. |
| nemo_retriever/src/nemo_retriever/audio/media_interface.py | Unconditional video to MP3 pre-extraction fixes the libsndfile compatibility gap; ffmpeg frame extraction switched to JPEG. The video_audio_separate guard at line 353 uses the reassigned MP3 input_path, so its suffix check always fails for video inputs. |
| nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py | _count_lancedb_rows best-effort helper correctly swallows exceptions and reports None; n_rows threads through the summary dict cleanly. |
| nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py | Added audio_only flag gating frames_enabled; consistent with video_asr_audio_chunk_params no longer forcing audio_only=True. |
| nemo_retriever/src/nemo_retriever/params/models.py | audio_only field retained in AudioChunkParams; docstring updated for ASRParams; audio_infer_protocol default remains grpc. No breaking changes. |
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/audio/gpu_actor.py:45-50
The new `ASRGPUActor` loads `ParakeetCTC1B1ASR` at construction time but provides no public method to release the GPU allocation afterwards. Per the `gpu-memory-lifecycle` rule, every model-holding class must expose an explicit cleanup path (`del` the tensor reference + `torch.cuda.empty_cache()`); without it, long-running processes or test suites that instantiate multiple actors will accumulate VRAM with no way to reclaim it.
```suggestion
def __init__(self, params: ASRParams | None = None) -> None:
super().__init__(params=params)
self._params = params or ASRParams()
from nemo_retriever.model.local import ParakeetCTC1B1ASR
self._model = ParakeetCTC1B1ASR()
def unload(self) -> None:
"""Release the local Parakeet model and free GPU memory."""
self._model = None
try:
import torch
torch.cuda.empty_cache()
except Exception:
pass
```
### Issue 2 of 2
nemo_retriever/src/nemo_retriever/audio/cpu_actor.py:766-780
**Auth token silently absent for caller-supplied custom endpoints**
`_apply_actor_defaults` returns early before the env-var token check when `grpc_ep` or `http_ep` is truthy. A caller who passes an explicit endpoint without setting `auth_token` on the params object gets a client with no bearer token, even when the API key env var is present in the shell. The NVCF endpoint path (empty endpoints → early-fill) is correct; it is only the user-supplied-endpoint path that skips the credential lookup.
Reviews (5): Last reviewed commit: "fix unit tests" | Re-trigger Greptile
Every call site that used to from … import gather_local_resources now imports the module (from nemo_retriever.utils import Why this matters: previously each module bound its own local reference to the function, so tests had to
|
Signed-off-by: Julio Perez <jperez@nvidia.com>
Signed-off-by: Julio Perez <jperez@nvidia.com> Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com> (cherry picked from commit 7130a72)
Description
Summary
End-to-end fix for retriever ingest <video.mp4>, which previously logged success while silently failing both
video frame extraction and audio transcription. Along the way, restructured the ASR actor stack to match the
CPU/GPU split used elsewhere in the codebase, defaulted the remote Parakeet path to NVCF, and made the CLI's
"Ingested …" message report actual landed rows instead of input file count.
What broke and what changed
Before: ASRCPUActor was a conflated actor that handled both remote (Riva gRPC) and local (HuggingFace Parakeet)
ASR via an if/else on _use_remote(params). ASRGPUActor inherited from it and merely added the GPU operator
mixin. Construction of either class with default ASRParams() loaded torch unconditionally — which is what made
retriever ingest some.mp3 drag torch + transformers + nemotron-* into the import graph on a CPU-only ingest
path.
After:
= "grpc.nvcf.nvidia.com:443" and DEFAULT_FUNCTION_ID = "bb0837de-8c7b-481f-9ec8-ef5663e9c1fa" class constants,
plus _apply_actor_defaults that pre-fills audio_endpoints, function_id, and auth_token (from NVIDIA_API_KEY
env) when the caller leaves them empty. Honors AUDIO_FUNCTION_ID env override. Never imports torch.
Eagerly loads weights at construction.
the two variants don't duplicate row-shaping code.
lazy-imports both variants via cpu_variant_class() / gpu_variant_class(). Also reads AUDIO_GRPC_ENDPOINT from
env at init time and pre-fills audio_endpoints so the env var alone forces the remote (CPU) variant on any
host.
nemo_retriever.audio.asr_actor import ASRCPUActor callers keep working without circular-import gymnastics.
matching plumbing through chunk_actor.py / media_interface.py. The video chunker now always pre-extracts MP3
for .mp4/.mov/.avi/.mkv inputs (see Add missing build script, fix docker-compose to use self hosted deplot #4).
ParakeetClient.transcribe() was calling offline_recognize with a RecognitionConfig missing both encoding and
sample_rate_hertz, and convert_to_mono_wav was resampling to 44100 Hz (Parakeet trains at 16 kHz). The NVCF
Parakeet deployment at 1598d209-... and bb0837de-... is streaming-only (verified by probing
GetRivaSpeechRecognitionConfig — both return type=online, streaming=True, offline=False), so offline_recognize
returned INVALID_ARGUMENT: Unavailable model requested no matter what params we sent.
(16 kHz).
helper:
encoding; WAV header bytes would be parsed as samples).
field matches the offline-RPC shape so process_transcription_response keeps working unchanged.
The container ships a slim custom-built ffmpeg (/usr/local/bin/ffmpeg, 8.0.1, configured without the PNG
encoder). Frame extraction failed with Default encoder for format image2 (codec png) is probably disabled,
returning 0 frames to VideoFrameOCRActor.
built into every ffmpeg variant, so this works regardless of build flags. JPEG @ q:v=2 is near-lossless for OCR
purposes and slashes the per-frame base64 payload by ~10×, which compounds over the OCR HTTP round-trips.
changes needed.
For video inputs the chunker used to preserve the source container, so .mp4 → chunk_NNNN.mp4. The Parakeet
client decodes audio via soundfile/libsndfile, which can't read MP4 containers — every chunk failed with Format
not recognised before reaching the NIM.
chunking. Chunks are emitted as .mp3, which soundfile reads natively.
controlling this behavior, and there was no reachable consumer that wanted the old behavior).
field, chooses local vs remote). No force_local field introduced — kept the model minimal.
source of truth — ASRCPUActor.DEFAULT_FUNCTION_ID and asr_params_from_env's default both reference this
constant.
auth_token_var parameter default updated accordingly. Stage CLI help text + docstring updated.
it now reports archetype-resolved (GPU local / NVCF default) when no explicit endpoint is passed.
Ingested N document(s) was len(summary["documents"]) — the input file count, never touching LanceDB. Misleading
for any document that explodes into multiple chunks (PDFs, videos), and completely hid the case where every
stage failed and the table got zero rows.
vdb_upload(...).ingest() and threads n_rows through the summary. Best-effort: any exception is swallowed and
returns None, so a flaky table read can't break ingestion.
back to ... (row count unavailable). when the count failed.
stubs (riva.client.ASRService, RecognitionConfig, etc.) that the remote ASR path now depends on by default.
nvidia-riva-client v2.25.1 resolves cleanly (uv lock: 355 packages).
Behavior contract after the change
Test impact
gather_local_resources to force the GPU variant so the local-Parakeet assertions are deterministic regardless
of CI host.
ASRGPUActor directly (the new local-only class) and drops the _client attribute assertion (no longer
meaningful for the GPU variant).
wording and stub sdk_workflow._count_lancedb_rows to keep them off-disk.
passed in the broader sweep).
Checklist