Improve support for very large inputs. #105

Merged

romerojosh merged 3 commits into main on Feb 17, 2026
Conversation
/build

🚀 Build workflow triggered!

✅ Build workflow passed!
Recent flagship GPU models have significantly increased HBM capacities (e.g., GB200 with 186 GB of HBM, GB300 with 279 GB of HBM), enabling users to run much larger per-GPU problem sizes with cuDecomp than in the past.
cuDecomp still relies on standard MPI APIs, which have count/offset arguments limited to the maximum value of an `int32_t`. We use appropriate MPI datatypes in the MPI communication routines so that this maximum applies to the number of elements rather than bytes. This results in the following problem size limitations, in terms of local pencil size:

- `MPI_Alltoall` with an `int32_t` count: `(2^31 - 1) * <nranks>` elements per local pencil
- `MPI_Alltoallv`-like patterns with `int32_t` counts and offsets: `(2^31 - 1)` elements per local pencil

This leads to a maximum local pencil size of 8 GiB for our smallest supported type, `float`, up to 32 GiB for `complex<double>`. With workspace requirements (2x the local pencil size), and assuming most workloads use GPU memory for other things as well, these limitations were generally not an issue for users on GPUs with 40-80 GiB capacities. That said, the code currently does not inform users when they have violated these limitations and will just silently fail, which is not ideal.

This PR remedies this situation by:
- Replacing `int32_t` with `int64_t` for internal count/offset handling, only downcasting to `int32_t` when needed for MPI APIs. This enables the NCCL- and NVSHMEM-based backends to correctly run on very large inputs without these MPI-specific limits.
- Checking whether counts/offsets exceed the `int32_t` maximum before downcasting and, if so, throwing a "not supported" error informing the user that the particular transpose/halo backend is not usable with their problem size.

When "big count" support is more widely available in MPI, we can adopt those APIs (e.g., `MPI_Alltoall_c`), but as of now, even the current OpenMPI 5.x release does not provide these functions.

This PR does not address communication backend autotuning with these large input sizes. Autotuning can potentially error out when testing the MPI-based backends, even if the NCCL and NVSHMEM backends are viable candidates. This will be addressed in a follow-up PR.