
Deprecate AllToAllvDynamic algo and free its eager SharedResource memory (#2467) #2467

Open
saifhhasan wants to merge 1 commit into meta-pytorch:main from saifhhasan:export-D102687702

@saifhhasan saifhhasan commented May 9, 2026

Summary:

The ctran AllToAllvDynamic path eagerly allocated ~0.4-0.5 GiB of GPU memory per comm (e.g. PAFT's intra-replica CPU comms with 1024+ ranks) via SharedResource sized to NCCL_CTRAN_ALLTOALLV_DYNAMIC_MAX_NUM_COUNTS_PER_PEER * nRanks * CTRAN_ALGO_MAX_THREAD_BLOCKS, regardless of whether the comm ever used the algo. The feature is being deprecated entirely (see P2290241655 for the memory-validation analysis).
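
To make the sizing concrete, here is a minimal sketch of the arithmetic described above. It is not the code from CtranAlgo.cc: the struct, the function names, and the bytes-per-entry factor are hypothetical, and the PR does not state the real CVAR default or the value of CTRAN_ALGO_MAX_THREAD_BLOCKS. It only shows that the footprint is a product that grows linearly with nRanks and is paid by every comm, used or not.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the three factors named in the summary.
// The real defaults and the per-entry element size are NOT given in this PR.
struct EagerSizingParams {
  std::size_t maxNumCountsPerPeer;  // NCCL_CTRAN_ALLTOALLV_DYNAMIC_MAX_NUM_COUNTS_PER_PEER
  std::size_t nRanks;               // ranks in the communicator
  std::size_t maxThreadBlocks;      // CTRAN_ALGO_MAX_THREAD_BLOCKS
  std::size_t bytesPerEntry;        // assumed element size, for illustration only
};

// Bytes reserved per comm under eager sizing: a plain product of the factors.
constexpr std::size_t eagerBytes(const EagerSizingParams& p) {
  return p.maxNumCountsPerPeer * p.nRanks * p.maxThreadBlocks * p.bytesPerEntry;
}

// Convert a byte count to GiB for comparison with the reported figure.
constexpr double toGiB(std::size_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

With the purely illustrative inputs {1024, 1024, 64, 8}, the product is 536,870,912 bytes, i.e. 0.5 GiB. Those inputs are not the real defaults; they only show how a 1024-rank comm can reach the reported range, and that doubling nRanks doubles the cost.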

This diff deprecates ctran-side AllToAllvDynamic and removes the integration from NCCLX v2.27/v2.28/v2.29:

ctran changes:

  • Deletes the dedicated AllToAllvDynamic{,Common,HintUtils,PImpl,Split,SplitNonContig} impl/header files and dedicated test files.
  • Removes ctranAllToAllvDynamic{,Split,SplitNonContig,Support} from Ctran.h, related OpType/KernelType/PersistentObj entries, the alltoallv_dynamic OpElem union member, the alltoallvdynamic{,p} type namespaces, and the device-state fields peerAllToAllvDynamicBufsMap / alltoallvDynamicSendbuffsMap.
  • Strips the eager per-comm SharedResource sizing in CtranAlgo.cc (the actual memory win) and the 4 cudaHostAlloc'd CPU staging buffers (sendCounts/sendIndices/sendIndicesBlockLengths/sendbuffsPtr).
  • Removes the NCCL_CTRAN_ALLTOALLV_DYNAMIC_{NUM_THREAD_BLOCKS,THREAD_BLOCK_SIZE,MAX_NUM_COUNTS_PER_PEER} CVARs and the CpuBroadcastTest setenv workaround, which is no longer needed once the default is gone.
  • Prunes references in CtranGpeImpl, hints, colltrace, CudaGraphUtilsImpl, genctran.py, BUCK + def_build.bzl + tests/BUCK, and shared test files (CtranGpeKernelUT, KernelFlagCmdOwnershipUT, CollTraceWrapperUT, CtranDistUT, HintsCheck, CtranCudaGraphAllToAllTest).
  • Removes stale ALLTOALLV_DYNAMIC switch cases from CtranGpe.cc (free/setStatus/kernelTypeNameMap) and stale SENDCOUNTS/RECVCOUNTS tmpbuf segments from CtranAlgo.cc.

NCCLX integration changes (v2.27/v2.28/v2.29):

  • Stubs all 5 ncclx::alltoallvDynamic{,Split,SplitNonContig,Dispatch,Combine} functions in collectives.cc to return ncclInvalidUsage (preserving function signatures for ABI compatibility).
  • Removes #define NCCL_ALLTOALLV_DYNAMIC_SUPPORTED from nccl.h.in (keeps API declarations for ABI).
  • Strips AllToAllvDynamicHintUtils usage from ncclx/meta/hints/Hints.cc and ncclx/meta/wrapper/MetaFactory.cc.
  • Removes ctranAllToAllvDynamicSupport test cases from ncclx/meta/tests/CommWithCtranTest.cc.
  • Removes ALLTOALLV_DYNAMIC case arms from ncclx/meta/colltrace/CollTraceFunc.cc.
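
The stubbing approach in the list above can be sketched roughly as follows. This is an illustration, not the code in collectives.cc: the `sketch` namespace, the simplified parameter list, and `dynamicPathCompiledIn` are hypothetical, and the enum is a local stand-in for `ncclResult_t` so the example compiles on its own. It assumes callers use NCCL_ALLTOALLV_DYNAMIC_SUPPORTED as a compile-time feature check.

```cpp
#include <cstddef>

namespace sketch {

// Local stand-in for ncclResult_t; only the two values used here are listed.
enum ncclResult_t { ncclSuccess = 0, ncclInvalidUsage = 5 };

// Hypothetical, simplified signature. The symbol and its parameters stay in
// place so existing binaries still link; only the body is replaced.
ncclResult_t alltoallvDynamic(
    const void* const* sendbuffs,
    const std::size_t* sendcounts,
    void* const* recvbuffs,
    std::size_t maxRecvcount) {
  // Arguments are intentionally unused: the algorithm no longer exists.
  (void)sendbuffs;
  (void)sendcounts;
  (void)recvbuffs;
  (void)maxRecvcount;
  return ncclInvalidUsage;
}

// Callers that guard on the feature macro compile the dynamic path out once
// the #define is removed from the header.
inline bool dynamicPathCompiledIn() {
#ifdef NCCL_ALLTOALLV_DYNAMIC_SUPPORTED
  return true;
#else
  return false;
#endif
}

}  // namespace sketch
```

Under these assumptions, a caller that ignores the macro still links and runs, but gets ncclInvalidUsage at call time and must take another path.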

The Triton GIN/window-based device alltoallv_dynamic in comms/pipes/collectives/triton/ is intentionally kept (it is a separate implementation unrelated to this memory waste).

Reviewed By: minsii

Differential Revision: D102687702

@meta-cla meta-cla Bot added the CLA Signed label May 9, 2026
@meta-codesync meta-codesync Bot commented May 9, 2026

@saifhhasan has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102687702.

saifhhasan pushed a commit to saifhhasan/torchcomms-1 that referenced this pull request May 11, 2026
…ory (meta-pytorch#2467)

@saifhhasan saifhhasan force-pushed the export-D102687702 branch from ce7db2c to f6a99d8 Compare May 11, 2026 05:50
@meta-codesync meta-codesync Bot changed the title from "Deprecate AllToAllvDynamic algo and free its eager SharedResource memory" to "Deprecate AllToAllvDynamic algo and free its eager SharedResource memory (#2467)" May 11, 2026
@saifhhasan saifhhasan force-pushed the export-D102687702 branch from f6a99d8 to b8df01e Compare May 11, 2026 05:54

Labels

CLA Signed, fb-exported, meta-exported
