Update #2612 #2879

Draft
tdene wants to merge 22 commits into NVIDIA-NeMo:youngeunk/topology-aware-placement from tdene:tde/update_2612

Conversation

@tdene tdene commented Jun 21, 2026

Contributor

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

ananthsub and others added 22 commits June 4, 2026 10:38
Hand-port of the topology-aware actor placement feature, combining the net
effect of upstream commits 8bd2417 (NVLink-aware training) and 2c2b9f60
(topology-aware inference placement). The topology logic is grafted onto
the existing setup shape directly so it doesn't carry along the surrounding
NeMo Gym reservation block from the source branch.

virtual_cluster.py: add NVLINK_DOMAIN_*/TOPO_RANK_* constants and
DEFAULT_PORT_RANGE_*, get_ray_cluster_topology(), select_segment_nodes(),
_sort_bundle_indices_by_topology(); replace GetGPUIDActor with a
_get_gpu_id_info Ray task that also returns (nvlink_domain, topo_rank); add
port_range_low/high, segment_size, node_resource_constraints params and
_nvlink_domain_per_bundle_index state to RayVirtualCluster; merge
node_resource_constraints into bundle specs; topology-aware
_get_sorted_bundle_indices.
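
A minimal sketch of what topology-aware bundle ordering could look like, assuming each bundle index has already been mapped to an (nvlink_domain, topo_rank, gpu_id) tuple. The function name, the tuple layout, and the sort key are illustrative assumptions, not the code in virtual_cluster.py:

```python
from typing import Dict, List, Tuple

# Hypothetical per-bundle info: (nvlink_domain, topo_rank, gpu_id).
BundleInfo = Tuple[str, int, int]


def sort_bundle_indices_by_topology(bundle_info: Dict[int, BundleInfo]) -> List[int]:
    """Order bundle indices so bundles in the same NVLink domain are adjacent.

    Within a domain, bundles are ordered by topo_rank and then by GPU id.
    """
    return sorted(bundle_info, key=lambda idx: bundle_info[idx])


# Illustrative data only: four bundles spread over two NVLink domains.
bundle_info = {
    0: ("domain-b", 1, 0),
    1: ("domain-a", 0, 1),
    2: ("domain-a", 0, 0),
    3: ("domain-b", 0, 0),
}
order = sort_bundle_indices_by_topology(bundle_info)
print(order)  # [2, 1, 3, 0]
```

Sorting on the whole tuple keeps the ordering deterministic, so consecutive ranks land in the same domain whenever the domain has enough bundles.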

vllm_generation.py: add init_cluster_placement_groups staticmethod for
deterministic PG ordering when other components compete for Ray resources;
add topology arguments to allocate_worker_groups and warn when a
model-parallel group straddles NVLink domains.
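
The straddling check can be sketched as follows, assuming workers are already in placement order and that each consecutive block of `group_size` workers forms one model-parallel group. The helper name and the grouping rule are assumptions for illustration, not the logic in vllm_generation.py:

```python
import warnings
from typing import List, Sequence


def find_straddling_groups(domains: Sequence[str], group_size: int) -> List[int]:
    """Return indices of model-parallel groups that span more than one NVLink domain.

    `domains[i]` is the NVLink domain of the i-th worker in placement order.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    straddling = []
    for start in range(0, len(domains), group_size):
        group = domains[start:start + group_size]
        if len(set(group)) > 1:
            straddling.append(start // group_size)
    return straddling


# Illustrative data only: the first group of four crosses from domain "a" to "b".
domains = ["a", "a", "a", "b", "b", "b", "b", "b"]
bad_groups = find_straddling_groups(domains, group_size=4)
if bad_groups:
    warnings.warn(f"model-parallel groups {bad_groups} straddle NVLink domains")
```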

grpo.py: read cluster.segment_size, build node_resource_constraints from
get_ray_cluster_topology()/select_segment_nodes() in non-colocated setup,
relocate inference cluster creation, call
VllmGeneration.init_cluster_placement_groups so inference PGs claim
domain-aligned nodes first.
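
One way to picture segment selection, assuming a segment is a block of `segment_size` nodes that must all share an NVLink domain. The commit message does not spell out the actual semantics of select_segment_nodes(), so the function below is a hypothetical stand-in with an assumed signature:

```python
from collections import defaultdict
from typing import Dict, List


def select_segment_nodes_sketch(
    node_domains: Dict[str, str], num_nodes: int, segment_size: int
) -> List[str]:
    """Pick `num_nodes` nodes as whole segments drawn from a single domain each."""
    if segment_size <= 0 or num_nodes % segment_size != 0:
        raise ValueError("num_nodes must be a positive multiple of segment_size")
    by_domain: Dict[str, List[str]] = defaultdict(list)
    for node, domain in sorted(node_domains.items()):
        by_domain[domain].append(node)

    selected: List[str] = []
    for domain in sorted(by_domain):
        nodes = by_domain[domain]
        # Only whole segments are usable; leftover nodes in a domain are skipped.
        usable = len(nodes) - len(nodes) % segment_size
        for start in range(0, usable, segment_size):
            if len(selected) == num_nodes:
                return selected
            selected.extend(nodes[start:start + segment_size])
    if len(selected) < num_nodes:
        raise RuntimeError("not enough domain-aligned segments available")
    return selected


# Illustrative data only: domain "a" has a leftover node that cannot form a segment.
node_domains = {"n0": "a", "n1": "a", "n2": "a", "n3": "b", "n4": "b"}
picked = select_segment_nodes_sketch(node_domains, num_nodes=4, segment_size=2)
print(picked)  # ['n0', 'n1', 'n3', 'n4']
```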

ray.sub: write topology_probe.sh that parses ClusterUUID from nvidia-smi -q
and topo_rank from SLURM_TOPOLOGY_ADDR, source it before each ray start, and
register nvlink_domain_<uuid>/topo_rank as Ray resources. Added
`export RAY_RESOURCES` (missing in upstream 8bd2417), required so the
variable propagates to the `bash /launch-head.sh` child invocation.
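
The probe step can be approximated in Python as a sketch. It assumes the ClusterUUID appears on a line of the form `ClusterUUID : <value>` in the `nvidia-smi -q` output, and it takes topo_rank as an already-computed integer because the derivation from SLURM_TOPOLOGY_ADDR is not described here. The resource amounts are placeholders, and the sample text is an illustrative excerpt rather than captured output:

```python
import json
import re
from typing import Dict, Optional


def parse_cluster_uuid(nvidia_smi_q: str) -> Optional[str]:
    """Extract the first ClusterUUID value from `nvidia-smi -q` style text."""
    match = re.search(r"^\s*ClusterUUID\s*:\s*(\S+)\s*$", nvidia_smi_q, re.MULTILINE)
    return match.group(1) if match else None


def build_ray_resources(cluster_uuid: Optional[str], topo_rank: Optional[int]) -> Dict[str, int]:
    """Build a custom-resource dict; the amounts used here are assumptions."""
    resources: Dict[str, int] = {}
    if cluster_uuid:
        resources[f"nvlink_domain_{cluster_uuid}"] = 1
    if topo_rank is not None:
        resources["topo_rank"] = topo_rank
    return resources


# Illustrative excerpt only, not real nvidia-smi output.
sample = """
    Fabric
        State       : Completed
        ClusterUUID : 1234abcd-0000-0000-0000-000000000001
"""
uuid = parse_cluster_uuid(sample)
resources = build_ray_resources(uuid, topo_rank=3)
# A JSON string like this is what `ray start --resources=...` accepts.
ray_resources_json = json.dumps(resources)
print(ray_resources_json)
```

Returning an empty dict when no ClusterUUID is found lets nodes without NVLink fabric information start normally without topology resources.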

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
…JSON

Signed-off-by: Terry Kong <terryk@nvidia.com>
This reverts commit 2c7af4e.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
…erplate

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
…rence topology

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
…isting configs

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
copy-pr-bot Bot commented Jun 21, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Documentation (Improvements or additions to documentation) and CI (Relating to CI) labels Jun 21, 2026
@github-actions


✅ Submodule Fast-Forward Check Results

Check based on commit: 48c74fa (PR #2879 from tde/update_2612)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of youngeunk/topology-aware-placement branch (fast-forward)
Gym: ✅ PR branch is ahead of youngeunk/topology-aware-placement branch (fast-forward)
Megatron-Bridge: ✅ PR branch is ahead of youngeunk/topology-aware-placement branch (fast-forward)

All submodule changes look good! ✨

@tdene tdene mentioned this pull request Jun 21, 2026
@terrykong terrykong force-pushed the youngeunk/topology-aware-placement branch 2 times, most recently from 0f2632b to 6874bf5 Compare June 22, 2026 19:43

Labels

CI (Relating to CI), Documentation (Improvements or additions to documentation)

Projects

None yet


4 participants