Add OpenTelemetry tracing and metrics#2019
Conversation
Greptile SummaryThis PR adds OpenTelemetry tracing and metrics to the NeMo Retriever pipeline — both library mode (via
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/nim/probe.py | Step 1 span exits without calling _finalize_probe_span when the health check returns a non-2xx status — the most common path triggering step 2. Spans lack probe_status and have UNSET status in this case. |
| nemo_retriever/src/nemo_retriever/observability/configure.py | New file: one-shot OTEL SDK setup with lazy imports, provider wiring, reset_instrument_cache() call post-configure, and atexit-registered shutdown; looks correct. |
| nemo_retriever/src/nemo_retriever/observability/spans.py | New file: operator_span and pipeline_span context managers with clean exception capture, metric recording in finally, and safe_add guards; correct. |
| nemo_retriever/src/nemo_retriever/observability/ray_integration.py | New file: RayOperatorSpanWrapper correctly injects operator into actor, extracts parent context once at construction, and wraps each batch in an operator span. |
| nemo_retriever/src/nemo_retriever/service/processing/pool.py | Adds _JobProgress dataclass, _resolve_batch_span_parents, worker OTEL init with atexit registration, trace_context per-page carrier, and lifecycle metric counters; addressed concerns from prior review rounds. |
| nemo_retriever/src/nemo_retriever/service/app.py | OTEL configured before FastAPI app creation (required for middleware ordering), otel_shutdown stored on app.state and called in lifespan teardown; correct sequence. |
| nemo_retriever/src/nemo_retriever/service/config.py | New OTELServiceConfig Pydantic model with sensible defaults; resource_attributes uses empty-dict sentinel that callers convert to None before passing to OTELConfig. |
| nemo_retriever/src/nemo_retriever/graph/executor.py | Wraps each inprocess operator in operator_span, injects W3C parent carrier for Ray batch operators, and propagates OTEL env vars into runtime_env. |
| nemo_retriever/src/nemo_retriever/graph_ingestor.py | Adds otel param to GraphIngestor.__init__, registers shutdown with atexit, and wraps pipeline execution in pipeline_span; correct handling of library-mode OTEL lifecycle. |
| nemo_retriever/tests/test_observability.py | New test file covering operator span happy/failure paths, service-mode batch context inheritance, and Ray actor parent linkage; session-scoped providers work around OTEL single-install limitation. |
Sequence Diagram
sequenceDiagram
participant Client
participant FastAPI as FastAPI (HTTP span)
participant IngestRouter as ingest.py
participant Pool as ProcessingPool
participant Worker as Worker Subprocess
participant Ray as Ray Actor
Client->>FastAPI: POST /ingest (W3C traceparent)
FastAPI->>IngestRouter: ingest_document()
IngestRouter->>IngestRouter: tag_job / tag_upload
IngestRouter->>Pool: "try_submit(trace_context=inject_current_context())"
Pool-->>Worker: PageDescriptor (trace_context carrier)
Note over Worker: _run_pipeline_batch()
Worker->>Worker: _resolve_batch_span_parents(carrier)
Worker->>Worker: pool.run_pipeline_batch span (parented to HTTP span)
loop each operator
Worker->>Worker: operator_span(stage_name)
end
Worker-->>Pool: BatchWorkerResult
Pool->>Pool: _handle_job_completion + SSE
Note over Ray: RayDataExecutor (batch mode)
Pool->>Ray: ds.map_batches(RayOperatorSpanWrapper)
Ray->>Ray: extract_context(carrier) at __init__
loop each batch
Ray->>Ray: "operator_span(node_name, parent_context=driver_ctx)"
end
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
nemo_retriever/src/nemo_retriever/nim/probe.py:134-138
**Step 1 span exits unfinalized on non-2xx health response**
When the health endpoint returns a non-2xx status (e.g. 404 for cloud NIMs that don't expose `/v1/health/ready`) the `with _probe_span` block exits normally without any `_finalize_probe_span` call. The comment at line 184 explicitly calls this the *common path* that triggers step 2, so every step-2 probe emits a span without `probe_status` and with UNSET span status — breaking any dashboard filter on that attribute.
```suggestion
span.set_attribute("http.status_code", int(resp.status_code))
if resp.ok:
_finalize_probe_span(span, "ok")
_probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok"))
return
_finalize_probe_span(span, "non_2xx", f"http {resp.status_code}")
```
Reviews (4): Last reviewed commit: "add uv.lock" | Re-trigger Greptile
| span.set_attribute("http.status_code", int(resp.status_code)) | ||
| if resp.ok: | ||
| _finalize_probe_span(span, "ok") | ||
| _probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok")) | ||
| return |
There was a problem hiding this comment.
Step 1 span exits unfinalized on non-2xx health response
When the health endpoint returns a non-2xx status (e.g. 404 for cloud NIMs that don't expose /v1/health/ready) the with _probe_span block exits normally without any _finalize_probe_span call. The comment at line 184 explicitly calls this the common path that triggers step 2, so every step-2 probe emits a span without probe_status and with UNSET span status — breaking any dashboard filter on that attribute.
| span.set_attribute("http.status_code", int(resp.status_code)) | |
| if resp.ok: | |
| _finalize_probe_span(span, "ok") | |
| _probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok")) | |
| return | |
| span.set_attribute("http.status_code", int(resp.status_code)) | |
| if resp.ok: | |
| _finalize_probe_span(span, "ok") | |
| _probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok")) | |
| return | |
| _finalize_probe_span(span, "non_2xx", f"http {resp.status_code}") |
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/nim/probe.py
Line: 134-138
Comment:
**Step 1 span exits unfinalized on non-2xx health response**
When the health endpoint returns a non-2xx status (e.g. 404 for cloud NIMs that don't expose `/v1/health/ready`) the `with _probe_span` block exits normally without any `_finalize_probe_span` call. The comment at line 184 explicitly calls this the *common path* that triggers step 2, so every step-2 probe emits a span without `probe_status` and with UNSET span status — breaking any dashboard filter on that attribute.
```suggestion
span.set_attribute("http.status_code", int(resp.status_code))
if resp.ok:
_finalize_probe_span(span, "ok")
_probe_results.append(ProbeResult(url=health_url, name=name, prefix=prefix, status="ok"))
return
_finalize_probe_span(span, "non_2xx", f"http {resp.status_code}")
```
How can I resolve this? If you propose a fix, please make it concise.
Description
Checklist