Skip to content

Add SLURM Ray smoke kit#2010

Draft
charlesbluca wants to merge 18 commits into
NVIDIA:mainfrom
charlesbluca:codex/nemo-intracluster-ray-slurm
Draft

Add SLURM Ray smoke kit#2010
charlesbluca wants to merge 18 commits into
NVIDIA:mainfrom
charlesbluca:codex/nemo-intracluster-ray-slurm

Conversation

@charlesbluca
Copy link
Copy Markdown
Collaborator

@charlesbluca charlesbluca commented May 11, 2026

Description

Adds a SLURM Ray smoke kit for validating NeMo-Retriever against an externally started intra-cluster Ray cluster.

This includes:

  • a profile-driven tools/slurm-ray-smoke/submit.sh launcher that stages the current source tree, renders job.sbatch, and submits it over SSH
  • smoke scripts for both direct RayDataExecutor and GraphIngestor(run_mode="batch", ray_address=...)
  • temporary text-ingestion fixes for local tokenizer paths and Ray/Pandas byte payload normalization
  • README/PATCH/triage documentation for new users running the experiment on their own SLURM clusters

Validation:

  • direct RayDataExecutor smoke succeeded with three rows and split/chunk work placed on different nodes
  • GraphIngestor smoke succeeded with three rows against the same external Ray address
  • Ray reported both nemo_head and nemo_worker custom resources

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

No docker-compose or Helm values changes are included.

@charlesbluca charlesbluca changed the title [codex] Add SLURM Ray smoke kit Add SLURM Ray smoke kit May 11, 2026
@charlesbluca charlesbluca marked this pull request as ready for review May 11, 2026 12:56
@charlesbluca charlesbluca requested review from a team as code owners May 11, 2026 12:56
@charlesbluca charlesbluca requested a review from jioffe502 May 11, 2026 12:56
@charlesbluca charlesbluca marked this pull request as draft May 11, 2026 12:56
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR adds a tools/slurm-ray-smoke/ kit for validating NeMo-Retriever against an externally started intra-cluster Ray cluster, along with two production fixes in nemo_retriever/ that were discovered during the SLURM experiment.

  • nemo_retriever/txt/split.py: _get_tokenizer now skips the HF revision lookup when tokenizer_model_id points to a local path, with a matching unit test.
  • nemo_retriever/txt/ray_data.py: Adds _coerce_binary_payload to normalise memoryview, bytearray, PyArrow scalar, and .tobytes() payloads before text splitting; the integration in TxtSplitCPUActor.process wraps row processing in a bare except Exception: continue that silently drops failures without logging.
  • tools/slurm-ray-smoke/: Profile-driven submit.sh launcher, job.sbatch.sh template, two smoke scripts, and triage docs for running the two-node Ray experiment on arbitrary SLURM clusters.

Confidence Score: 3/5

The two production changes to ray_data.py and split.py need fixes before merging: silent exception swallowing will cause invisible row loss in production, and a test assertion mismatch means the new coercion tests will not pass as written.

The except Exception: continue in TxtSplitCPUActor.process silently discards every row-level failure with no log output — documents disappear from the pipeline without any signal. The new parametrised tests assert params=actor._params (None on the archetype wrapper) but the delegate normalises that to TextChunkParams(), so assert_called_once_with will fail at runtime. Both issues are in the active production path under nemo_retriever/. The SLURM tooling is operational/internal and carries lower risk, though the curl | sh installer step in job.sbatch.sh is worth addressing before wider distribution.

nemo_retriever/src/nemo_retriever/txt/ray_data.py (silent exception) and nemo_retriever/tests/test_actor_operators.py (wrong params assertion) need the most attention before merge.

Security Review

  • tools/slurm-ray-smoke/templates/job.sbatch.sh line 80: curl -LsSf https://astral.sh/uv/install.sh | ... sh downloads and executes a remote installer script on every allocated cluster node with no checksum, signature, or version pin. A compromised CDN response would execute arbitrary code on all SLURM compute nodes.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/txt/ray_data.py Adds _coerce_binary_payload helper and integrates it into TxtSplitCPUActor.process; the bare except Exception: continue silently drops rows without logging, and the helper has unbounded recursion for nested as_py() wrappers.
nemo_retriever/src/nemo_retriever/txt/split.py Adds local-path detection to _get_tokenizer to skip HF revision lookup when a local directory is supplied; clean and well-contained change.
nemo_retriever/tests/test_actor_operators.py Adds parametrised tests for binary payload coercion; assert_called_once_with(..., params=actor._params) compares against None from the archetype wrapper, while the delegate normalises None to TextChunkParams(), so the assertion will always fail.
nemo_retriever/tests/test_txt_split.py Adds test_get_tokenizer_local_path_skips_revision_lookup to verify the local-path branch; straightforward and correct.
tools/slurm-ray-smoke/smokes/_common.py New shared helpers for JSON output, Ray state introspection, and cluster readiness assertions; missing SPDX license header.
tools/slurm-ray-smoke/smokes/nemo_executor_smoke.py Direct RayDataExecutor smoke that pins stages to custom resources and asserts cross-node execution; missing SPDX license header.
tools/slurm-ray-smoke/smokes/nemo_graph_ingestor_smoke.py GraphIngestor batch-mode smoke; missing SPDX license header.
tools/slurm-ray-smoke/submit.sh Profile-driven launcher that validates inputs, stages tarballs, renders job.sbatch, and submits over SSH; input validation is thorough (allowlist, regex, quote checks).
tools/slurm-ray-smoke/templates/job.sbatch.sh SLURM sbatch template that bootstraps Ray across nodes; installs uv via `curl

Comments Outside Diff (1)

  1. nemo_retriever/src/nemo_retriever/txt/ray_data.py, line 121-131 (link)

    P1 Bare except silently drops rows

    Every failure inside the try block (coercion error, tokenizer crash, or any txt_bytes_to_chunks_df exception) is swallowed with continue and produces no log output. In production this means documents disappear from the pipeline without any trace, violating the no-bare-except rule which requires at least logging with full context at an exception boundary.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: nemo_retriever/src/nemo_retriever/txt/ray_data.py
    Line: 121-131
    
    Comment:
    **Bare except silently drops rows**
    
    Every failure inside the `try` block (coercion error, tokenizer crash, or any `txt_bytes_to_chunks_df` exception) is swallowed with `continue` and produces no log output. In production this means documents disappear from the pipeline without any trace, violating the `no-bare-except` rule which requires at least logging with full context at an exception boundary.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 5
nemo_retriever/src/nemo_retriever/txt/ray_data.py:121-131
**Bare except silently drops rows**

Every failure inside the `try` block (coercion error, tokenizer crash, or any `txt_bytes_to_chunks_df` exception) is swallowed with `continue` and produces no log output. In production this means documents disappear from the pipeline without any trace, violating the `no-bare-except` rule which requires at least logging with full context at an exception boundary.

### Issue 2 of 5
nemo_retriever/tests/test_actor_operators.py:634-641
**`params=actor._params` asserts `None`, not the delegate's default**

`actor = TxtSplitActor()` sets `actor._params = None`. `ArchetypeOperator._resolve_delegate` constructs `TxtSplitCPUActor(params=None)`, whose `__init__` normalises `None` via `params or TextChunkParams()`, so `delegate._params = TextChunkParams()`. The actual call to `txt_bytes_to_chunks_df` passes `params=TextChunkParams()`, but `assert_called_once_with(..., params=actor._params)` asserts `params=None`. The assertion will always fail unless `TextChunkParams() == None`, which it does not. Use `params=TextChunkParams()` (or the equivalent default instance) as the expected value.

### Issue 3 of 5
tools/slurm-ray-smoke/smokes/_common.py:1-3
**Missing SPDX license header**

All three new Python files in `smokes/` (`_common.py`, `nemo_executor_smoke.py`, `nemo_graph_ingestor_smoke.py`) are missing the required SPDX header block. Every `.py` file added to this repository must begin with:

```
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```

### Issue 4 of 5
tools/slurm-ray-smoke/templates/job.sbatch.sh:80
**Remote script piped to shell without integrity check**

`curl -LsSf https://astral.sh/uv/install.sh | ... sh` downloads and executes arbitrary code from the network at job runtime with no checksum or signature verification. If the CDN is compromised or the URL redirected, every cluster node running this job would execute malicious code. Consider pinning the installer to a specific version digest, vendoring the `uv` binary into the repository, or downloading to a file first so the contents can be inspected before execution.

### Issue 5 of 5
nemo_retriever/src/nemo_retriever/txt/ray_data.py:24-42
**Potential infinite recursion via `as_py()` chain**

`_coerce_binary_payload` calls itself recursively when `as_py()` is present. If `as_py()` returns an object that also exposes `as_py()` (e.g. a nested Arrow scalar wrapper), this recurses without bound. Adding a depth guard or converting only one level would prevent an unexpected `RecursionError` crashing a Ray worker.

Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'upstream/m..." | Re-trigger Greptile

Comment on lines +634 to +641
def test_process_coerces_binary_payload_variants(self, mock_fn, raw, expected_bytes):
expected = pd.DataFrame({"text": ["chunk"], "path": ["/a.txt"], "page_number": [0], "metadata": [{}]})
mock_fn.return_value = expected
actor = self._make()
df = pd.DataFrame({"bytes": [raw], "path": ["/a.txt"]})
result = actor.process(df)
mock_fn.assert_called_once_with(expected_bytes, "/a.txt", params=actor._params)
pd.testing.assert_frame_equal(result, expected)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 params=actor._params asserts None, not the delegate's default

actor = TxtSplitActor() sets actor._params = None. ArchetypeOperator._resolve_delegate constructs TxtSplitCPUActor(params=None), whose __init__ normalises None via params or TextChunkParams(), so delegate._params = TextChunkParams(). The actual call to txt_bytes_to_chunks_df passes params=TextChunkParams(), but assert_called_once_with(..., params=actor._params) asserts params=None. The assertion will always fail unless TextChunkParams() == None, which it does not. Use params=TextChunkParams() (or the equivalent default instance) as the expected value.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_actor_operators.py
Line: 634-641

Comment:
**`params=actor._params` asserts `None`, not the delegate's default**

`actor = TxtSplitActor()` sets `actor._params = None`. `ArchetypeOperator._resolve_delegate` constructs `TxtSplitCPUActor(params=None)`, whose `__init__` normalises `None` via `params or TextChunkParams()`, so `delegate._params = TextChunkParams()`. The actual call to `txt_bytes_to_chunks_df` passes `params=TextChunkParams()`, but `assert_called_once_with(..., params=actor._params)` asserts `params=None`. The assertion will always fail unless `TextChunkParams() == None`, which it does not. Use `params=TextChunkParams()` (or the equivalent default instance) as the expected value.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +1 to +3
from __future__ import annotations

import json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing SPDX license header

All three new Python files in smokes/ (_common.py, nemo_executor_smoke.py, nemo_graph_ingestor_smoke.py) are missing the required SPDX header block. Every .py file added to this repository must begin with:

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

Rule Used: Python files added in this PR must include the SPD... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: tools/slurm-ray-smoke/smokes/_common.py
Line: 1-3

Comment:
**Missing SPDX license header**

All three new Python files in `smokes/` (`_common.py`, `nemo_executor_smoke.py`, `nemo_graph_ingestor_smoke.py`) are missing the required SPDX header block. Every `.py` file added to this repository must begin with:

```
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```

**Rule Used:** Python files added in this PR must include the SPD... ([source](https://app.greptile.com/review/custom-context?memory=spdx-license-header))

How can I resolve this? If you propose a fix, please make it concise.

tar -xzf "$SOURCE_TARBALL" -C "$SOURCE_ROOT"

if [[ ! -x "$UV_BIN" ]]; then
curl -LsSf https://astral.sh/uv/install.sh | /usr/bin/env UV_INSTALL_DIR="$UV_INSTALL_DIR" sh
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 security Remote script piped to shell without integrity check

curl -LsSf https://astral.sh/uv/install.sh | ... sh downloads and executes arbitrary code from the network at job runtime with no checksum or signature verification. If the CDN is compromised or the URL redirected, every cluster node running this job would execute malicious code. Consider pinning the installer to a specific version digest, vendoring the uv binary into the repository, or downloading to a file first so the contents can be inspected before execution.

Prompt To Fix With AI
This is a comment left during a code review.
Path: tools/slurm-ray-smoke/templates/job.sbatch.sh
Line: 80

Comment:
**Remote script piped to shell without integrity check**

`curl -LsSf https://astral.sh/uv/install.sh | ... sh` downloads and executes arbitrary code from the network at job runtime with no checksum or signature verification. If the CDN is compromised or the URL redirected, every cluster node running this job would execute malicious code. Consider pinning the installer to a specific version digest, vendoring the `uv` binary into the repository, or downloading to a file first so the contents can be inspected before execution.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +24 to +42
def _coerce_binary_payload(value: Any) -> Any:
if value is None:
return None
if isinstance(value, bytes):
return value
if isinstance(value, bytearray):
return bytes(value)
if isinstance(value, memoryview):
return value.tobytes()

as_py = getattr(value, "as_py", None)
if callable(as_py):
return _coerce_binary_payload(as_py())

tobytes = getattr(value, "tobytes", None)
if callable(tobytes):
return tobytes()

return value
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Potential infinite recursion via as_py() chain

_coerce_binary_payload calls itself recursively when as_py() is present. If as_py() returns an object that also exposes as_py() (e.g. a nested Arrow scalar wrapper), this recurses without bound. Adding a depth guard or converting only one level would prevent an unexpected RecursionError crashing a Ray worker.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/txt/ray_data.py
Line: 24-42

Comment:
**Potential infinite recursion via `as_py()` chain**

`_coerce_binary_payload` calls itself recursively when `as_py()` is present. If `as_py()` returns an object that also exposes `as_py()` (e.g. a nested Arrow scalar wrapper), this recurses without bound. Adding a depth guard or converting only one level would prevent an unexpected `RecursionError` crashing a Ray worker.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant