Deprecate AllToAllvDynamic algo and free its eager SharedResource memory (#2467) by saifhhasan · Pull Request #2467 · meta-pytorch/torchcomms

saifhhasan · 2026-05-09T06:27:00Z

Summary:

The ctran AllToAllvDynamic path eagerly allocated ~0.4-0.5 GiB of GPU memory per comm (e.g. PAFT's intra-replica CPU comms with 1024+ ranks) via SharedResource sized to NCCL_CTRAN_ALLTOALLV_DYNAMIC_MAX_NUM_COUNTS_PER_PEER * nRanks * CTRAN_ALGO_MAX_THREAD_BLOCKS, regardless of whether the comm ever used the algo. The feature is being deprecated entirely (see P2290241655 for the memory-validation analysis).
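To make the sizing rule above concrete, here is a minimal sketch of the arithmetic. The helper `estimateEagerBytes`, the `bytesPerElement` parameter, and every numeric value are assumptions for illustration; the source gives only the product of the three factors and the observed ~0.4-0.5 GiB total, not the actual defaults or element size.

```cpp
#include <cstdint>

// Hypothetical helper, not part of ctran: estimates the bytes of an eagerly
// sized per-comm buffer that holds one element per
// (count slot, rank, thread block) triple.
constexpr std::uint64_t estimateEagerBytes(
    std::uint64_t maxNumCountsPerPeer,  // stands in for NCCL_CTRAN_ALLTOALLV_DYNAMIC_MAX_NUM_COUNTS_PER_PEER
    std::uint64_t nRanks,               // number of ranks in the comm
    std::uint64_t maxThreadBlocks,      // stands in for CTRAN_ALGO_MAX_THREAD_BLOCKS
    std::uint64_t bytesPerElement) {    // assumed; the source does not state it
  return maxNumCountsPerPeer * nRanks * maxThreadBlocks * bytesPerElement;
}

// Illustrative values only; none of them are taken from the source.
static_assert(estimateEagerBytes(8, 1024, 64, 8) == 4194304ULL,
              "8 * 1024 * 64 * 8 bytes is 4 MiB");
```

Whatever the real constants are, the product grows linearly with `nRanks`, and the allocation was made whether or not the comm ever ran the algo, which is why large comms paid the most.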

This diff deprecates ctran-side AllToAllvDynamic and removes the integration from NCCLX v2.27/v2.28/v2.29:

ctran changes:

- Deletes the dedicated AllToAllvDynamic{,Common,HintUtils,PImpl,Split,SplitNonContig} impl/header files and dedicated test files.
- Removes ctranAllToAllvDynamic{,Split,SplitNonContig,Support} from Ctran.h, related OpType/KernelType/PersistentObj entries, the alltoallv_dynamic OpElem union member, the alltoallvdynamic{,p} type namespaces, and the device-state fields peerAllToAllvDynamicBufsMap / alltoallvDynamicSendbuffsMap.
- Strips the eager per-comm SharedResource sizing in CtranAlgo.cc (the actual memory win) and the 4 cudaHostAlloc'd CPU staging buffers (sendCounts/sendIndices/sendIndicesBlockLengths/sendbuffsPtr).
- Removes the NCCL_CTRAN_ALLTOALLV_DYNAMIC_{NUM_THREAD_BLOCKS,THREAD_BLOCK_SIZE,MAX_NUM_COUNTS_PER_PEER} CVARs and the CpuBroadcastTest setenv workaround, which is no longer needed once the default is gone.
- Prunes references in CtranGpeImpl, hints, colltrace, CudaGraphUtilsImpl, genctran.py, BUCK + def_build.bzl + tests/BUCK, and shared test files (CtranGpeKernelUT, KernelFlagCmdOwnershipUT, CollTraceWrapperUT, CtranDistUT, HintsCheck, CtranCudaGraphAllToAllTest).
- Removes stale ALLTOALLV_DYNAMIC switch cases from CtranGpe.cc (free/setStatus/kernelTypeNameMap) and stale SENDCOUNTS/RECVCOUNTS tmpbuf segments from CtranAlgo.cc.

NCCLX integration changes (v2.27/v2.28/v2.29):

- Stubs all 5 ncclx::alltoallvDynamic{,Split,SplitNonContig,Dispatch,Combine} functions in collectives.cc to return ncclInvalidUsage (preserving function signatures for ABI compatibility).
- Removes #define NCCL_ALLTOALLV_DYNAMIC_SUPPORTED from nccl.h.in (keeps API declarations for ABI).
- Strips AllToAllvDynamicHintUtils usage from ncclx/meta/hints/Hints.cc and ncclx/meta/wrapper/MetaFactory.cc.
- Removes ctranAllToAllvDynamicSupport test cases from ncclx/meta/tests/CommWithCtranTest.cc.
- Removes ALLTOALLV_DYNAMIC case arms from ncclx/meta/colltrace/CollTraceFunc.cc.
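The stub-plus-macro pattern described in the NCCLX changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in collectives.cc: `MockResult`, the `sketch` namespace, the three-argument signature, and `tryDynamicAllToAllv` are all hypothetical stand-ins, since the real ncclx signatures are not shown in this PR description.

```cpp
#include <cstddef>

// Stand-in for the NCCL result type so the sketch compiles on its own.
// Only the two outcomes needed here are modeled.
enum MockResult { mockSuccess, mockInvalidUsage };

namespace sketch {
// Hypothetical signature. The declaration and symbol are kept so that
// existing callers still compile and link, but every call is rejected.
inline MockResult alltoallvDynamic(const void* /*sendbuff*/,
                                   void* /*recvbuff*/,
                                   std::size_t /*count*/) {
  return mockInvalidUsage;  // mirrors "return ncclInvalidUsage" in the stubs
}
}  // namespace sketch

// Hypothetical caller that gates on the feature macro. With the #define
// removed from the header, the guarded branch is compiled out and the
// caller takes its fallback path without ever reaching the stub.
inline bool tryDynamicAllToAllv(const void* send, void* recv, std::size_t count) {
#ifdef NCCL_ALLTOALLV_DYNAMIC_SUPPORTED
  return sketch::alltoallvDynamic(send, recv, count) == mockSuccess;
#else
  (void)send;
  (void)recv;
  (void)count;
  return false;  // feature unavailable; use a different collective
#endif
}
```

Callers that check the macro see the feature as absent at compile time, while callers that invoke the functions unconditionally get a runtime error code instead of a link failure.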

The Triton GIN/window-based device alltoallv_dynamic in comms/pipes/collectives/triton/ is intentionally kept (it is a separate implementation unrelated to this memory waste).

Reviewed By: minsii

Differential Revision: D102687702

meta-codesync · 2026-05-09T06:27:43Z

@saifhhasan has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102687702.


meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 9, 2026

meta-codesync Bot added fb-exported meta-exported labels May 9, 2026

saifhhasan force-pushed the export-D102687702 branch from ce7db2c to f6a99d8 Compare May 11, 2026 05:50

meta-codesync Bot changed the title ~~Deprecate AllToAllvDynamic algo and free its eager SharedResource memory~~ Deprecate AllToAllvDynamic algo and free its eager SharedResource memory (#2467) May 11, 2026

saifhhasan force-pushed the export-D102687702 branch from f6a99d8 to b8df01e Compare May 11, 2026 05:54

saifhhasan wants to merge 1 commit into meta-pytorch:main from saifhhasan:export-D102687702
