Update #2612 #2879

Draft

tdene wants to merge 22 commits into NVIDIA-NeMo:youngeunk/topology-aware-placement from tdene:tde/update_2612

tdene commented Jun 21, 2026

Contributor

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

ananthsub and others added 22 commits

June 4, 2026 10:38


          feat: topology-aware inference placement for non-colocated vLLM

06b957b

Hand-port of the topology-aware actor placement feature, combining the net
effect of upstream commits 8bd2417 (NVLink-aware training) and 2c2b9f60
(topology-aware inference placement). The topology logic is grafted onto
the existing setup shape directly so it doesn't carry along the surrounding
NeMo Gym reservation block from the source branch.

virtual_cluster.py: add NVLINK_DOMAIN_*/TOPO_RANK_* constants and
DEFAULT_PORT_RANGE_*, get_ray_cluster_topology(), select_segment_nodes(),
_sort_bundle_indices_by_topology(); replace GetGPUIDActor with a
_get_gpu_id_info Ray task that also returns (nvlink_domain, topo_rank); add
port_range_low/high, segment_size, node_resource_constraints params and
_nvlink_domain_per_bundle_index state to RayVirtualCluster; merge
node_resource_constraints into bundle specs; topology-aware
_get_sorted_bundle_indices.
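
To illustrate the topology-aware ordering described above, here is a minimal sketch, not the code from this commit. It assumes each bundle index maps to an (nvlink_domain, topo_rank) pair, as the _get_gpu_id_info task is said to return, and sorts so that bundles in the same NVLink domain end up adjacent. The function name and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def sort_bundle_indices_by_topology(
    topo_by_bundle: Dict[int, Tuple[str, int]],
) -> List[int]:
    """Order bundle indices by (nvlink_domain, topo_rank, bundle index).

    Hypothetical helper: the real _sort_bundle_indices_by_topology may use
    different inputs or tie-breaking rules.
    """
    return sorted(
        topo_by_bundle,
        key=lambda idx: (topo_by_bundle[idx][0], topo_by_bundle[idx][1], idx),
    )


# Made-up example: four bundles spread over two NVLink domains.
bundles = {
    0: ("domain-b", 3),
    1: ("domain-a", 7),
    2: ("domain-b", 1),
    3: ("domain-a", 2),
}
order = sort_bundle_indices_by_topology(bundles)
print(order)  # [3, 1, 2, 0]
```

Using the bundle index as the last sort key keeps the order deterministic when two bundles report the same domain and rank.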

vllm_generation.py: add init_cluster_placement_groups staticmethod for
deterministic PG ordering when other components compete for Ray resources;
add topology arguments to allocate_worker_groups and warn when a
model-parallel group straddles NVLink domains.
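
The straddling warning mentioned above could look roughly like the sketch below. It assumes ranks are laid out contiguously, so that each model-parallel group is a slice of group_size consecutive ranks, and that the NVLink domain of every rank is already known. The function name and message text are hypothetical, not taken from vllm_generation.py.

```python
import warnings
from typing import List, Sequence


def groups_straddling_domains(
    domain_per_rank: Sequence[str], group_size: int
) -> List[int]:
    """Return indices of model-parallel groups spanning more than one domain."""
    if group_size <= 0 or len(domain_per_rank) % group_size != 0:
        raise ValueError("rank count must be a positive multiple of group_size")

    straddling = []
    for group in range(len(domain_per_rank) // group_size):
        members = domain_per_rank[group * group_size : (group + 1) * group_size]
        if len(set(members)) > 1:
            straddling.append(group)
            warnings.warn(
                f"model-parallel group {group} spans NVLink domains "
                f"{sorted(set(members))}"
            )
    return straddling


# Made-up layout: 8 ranks, groups of 4, the first group crosses a boundary.
domains = ["a", "a", "a", "b", "b", "b", "b", "b"]
result = groups_straddling_domains(domains, group_size=4)
print(result)  # [0]
```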

grpo.py: read cluster.segment_size, build node_resource_constraints from
get_ray_cluster_topology()/select_segment_nodes() in non-colocated setup,
relocate inference cluster creation, call
VllmGeneration.init_cluster_placement_groups so inference PGs claim
domain-aligned nodes first.

ray.sub: write topology_probe.sh that parses ClusterUUID from nvidia-smi -q
and topo_rank from SLURM_TOPOLOGY_ADDR, source it before each ray start, and
register nvlink_domain_<uuid>/topo_rank as Ray resources. Added
`export RAY_RESOURCES` (missing in upstream 8bd2417), required so the
variable propagates to the `bash /launch-head.sh` child invocation.
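
The probe itself is a shell script, but the idea can be sketched in Python: pull the ClusterUUID field out of `nvidia-smi -q` text and build the JSON resource map that `ray start --resources` accepts. This is an assumption-heavy illustration. The sample text below is made up and only mimics a "name : value" layout, the resource values are guesses, and how topo_rank is derived from SLURM_TOPOLOGY_ADDR is not shown because the commit message does not describe it.

```python
import json
import re
from typing import Optional

# Matches a "ClusterUUID : <value>" line; the exact spacing is an assumption.
_CLUSTER_UUID_RE = re.compile(r"^\s*ClusterUUID\s*:\s*(\S+)\s*$", re.MULTILINE)


def parse_cluster_uuid(nvidia_smi_q_output: str) -> Optional[str]:
    """Return the first ClusterUUID value found, or None if absent."""
    match = _CLUSTER_UUID_RE.search(nvidia_smi_q_output)
    return match.group(1) if match else None


def build_ray_resources(cluster_uuid: Optional[str], topo_rank: int) -> str:
    """Build a JSON string suitable for `ray start --resources=...`.

    The values (1 for the domain marker, the rank for topo_rank) are
    hypothetical; the real script may register different quantities.
    """
    resources = {}
    if cluster_uuid is not None:
        resources[f"nvlink_domain_{cluster_uuid}"] = 1
    resources["topo_rank"] = topo_rank
    return json.dumps(resources)


# Illustrative text only, not captured from a real machine.
SAMPLE = """
    Fabric
        State                             : Completed
        ClusterUUID                       : 11111111-2222-3333-4444-555555555555
"""
uuid = parse_cluster_uuid(SAMPLE)
resources_json = build_ray_resources(uuid, topo_rank=4)
print(resources_json)
```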

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>


          bug fix in topology-aware placement

d5970ef

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>


          fix: make vllm placement group init idempotent

2c7af4e

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>


          fix: restore SBATCH singleton dependency in ray.sub

db9e24e

Signed-off-by: Terry Kong <terryk@nvidia.com>


          fix(ray.sub): force base-10 for TOPO_RANK fallbacks to avoid invalid JSON

69545b9

Signed-off-by: Terry Kong <terryk@nvidia.com>
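
Some background on why base-10 matters here: JSON does not allow leading zeros in numbers, and bash arithmetic treats a leading zero as octal, so a zero-padded rank such as 007 or 08 can break the resource string. The sketch below shows the same failure mode and fix in Python. It is an analogy to the shell change, not the contents of ray.sub, and the variable names are hypothetical.

```python
import json


def normalize_rank(raw: str) -> int:
    """Parse a possibly zero-padded rank string explicitly as base 10."""
    return int(raw, 10)


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


raw = "007"  # made-up zero-padded rank

# Splicing the raw string into JSON yields a number with leading zeros.
bad = '{"topo_rank": ' + raw + "}"

# Normalizing first gives a plain decimal integer.
good = json.dumps({"topo_rank": normalize_rank(raw)})

print(is_valid_json(bad))   # False
print(good)                 # {"topo_rank": 7}
```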


          Revert "fix: make vllm placement group init idempotent"

b51e143

This reverts commit 2c7af4e.

Signed-off-by: Terry Kong <terryk@nvidia.com>


          test

b01c048

Signed-off-by: Terry Kong <terryk@nvidia.com>


          fix: remove unnecessary try/except in _get_gpu_id_info

1826f03

Signed-off-by: Terry Kong <terryk@nvidia.com>


          feat: add segment_size to ClusterConfig and exemplar configs

c1f547a

Signed-off-by: Terry Kong <terryk@nvidia.com>


          feat: add topology-aware placement to SFT

1556d11

Signed-off-by: Terry Kong <terryk@nvidia.com>


          feat: add topology-aware placement to DPO

5ca28b2

Signed-off-by: Terry Kong <terryk@nvidia.com>


          feat: add topology-aware placement to distillation

dd1813c

Signed-off-by: Terry Kong <terryk@nvidia.com>


          test: add unit tests for topology-aware placement

adcd78d

Signed-off-by: Terry Kong <terryk@nvidia.com>


          refactor: extract prepare_segment_topology to eliminate topology boilerplate

ec11276

Signed-off-by: Terry Kong <terryk@nvidia.com>


          feat: add topology-aware placement to RM

1d18106

Signed-off-by: Terry Kong <terryk@nvidia.com>


          fix: use generation_config[backend] to detect vLLM vs SGLang for inference topology

03682f9

Signed-off-by: Terry Kong <terryk@nvidia.com>


          fix: assert bundle count matches world_size after topology sort

cae36c7

Signed-off-by: Terry Kong <terryk@nvidia.com>
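
A guard of this kind might look like the sketch below, assuming the sort returns a flat list of bundle indices and that exactly one bundle is expected per rank. The function name is hypothetical, and the duplicate check is an extra assumption that the commit title does not mention.

```python
from typing import Sequence


def check_sorted_bundles(sorted_indices: Sequence[int], world_size: int) -> None:
    """Fail fast if the topology sort dropped or duplicated bundles."""
    assert len(sorted_indices) == world_size, (
        f"expected {world_size} bundles after topology sort, "
        f"got {len(sorted_indices)}"
    )
    # Extra check (assumption): every bundle index appears only once.
    assert len(set(sorted_indices)) == len(sorted_indices), (
        "duplicate bundle index after topology sort"
    )


check_sorted_bundles([2, 0, 1], world_size=3)  # passes silently
```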


          fix: make ClusterConfig.segment_size NotRequired to avoid breaking existing configs

f70121f

Signed-off-by: Terry Kong <terryk@nvidia.com>
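
For readers unfamiliar with NotRequired, the sketch below shows the general pattern on a stand-in TypedDict. Only segment_size comes from the commit title; the class name and the other fields are placeholders, and the real ClusterConfig may differ.

```python
from typing import Optional

try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:  # older interpreters
    from typing_extensions import NotRequired, TypedDict


class ExampleClusterConfig(TypedDict):
    """Stand-in for ClusterConfig; fields other than segment_size are assumed."""

    gpus_per_node: int
    num_nodes: int
    segment_size: NotRequired[int]


def segment_size_of(cfg: ExampleClusterConfig) -> Optional[int]:
    """Read the optional key without raising on configs that omit it."""
    return cfg.get("segment_size")


old_cfg: ExampleClusterConfig = {"gpus_per_node": 8, "num_nodes": 2}
new_cfg: ExampleClusterConfig = {"gpus_per_node": 8, "num_nodes": 2, "segment_size": 4}

print(segment_size_of(old_cfg))  # None
print(segment_size_of(new_cfg))  # 4
```

Marking the key NotRequired lets type checkers accept configs written before the field existed, while callers use .get() to treat a missing value as "no segment constraint".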


          ci: trigger DCO re-check with updated base branch

dc00082

Signed-off-by: Terry Kong <terryk@nvidia.com>


          Merge remote-tracking branch 'origin/main' into tde/megatron_inf_debug

c05984a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix: Reconcile NVIDIA-NeMo#2315 and NVIDIA-NeMo#2612

1aee1ba

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>


          fix: Reconcile NVIDIA-NeMo#2612 and NVIDIA-NeMo#2267

48c74fa

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

copy-pr-bot Bot commented Jun 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot added Documentation CI labels

github-actions Bot commented Jun 21, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 48c74fa (PR #2879 from tde/update_2612)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of youngeunk/topology-aware-placement branch (fast-forward)
Gym: ✅ PR branch is ahead of youngeunk/topology-aware-placement branch (fast-forward)
Megatron-Bridge: ✅ PR branch is ahead of youngeunk/topology-aware-placement branch (fast-forward)

All submodule changes look good! ✨

tdene mentioned this pull request

#2355 and #2612 conflict #2850

Closed

terrykong force-pushed the youngeunk/topology-aware-placement branch 2 times, most recently from 0f2632b to 6874bf5 Compare

June 22, 2026 19:43
