
Improve support for very large inputs. #105

Merged

romerojosh merged 3 commits into main from large_count_guardrails on Feb 17, 2026
Conversation

@romerojosh
Collaborator

Recent flagship GPU models have significantly increased HBM capacities (e.g., GB200 with 186 GB of HBM, GB300 with 279 GB of HBM), enabling users to run much larger per-GPU problem sizes with cuDecomp than in the past.

cuDecomp still relies on standard MPI APIs, which have count/offset arguments limited to the maximum value of int32_t. We use appropriate MPI datatypes in the MPI communication routines so that this maximum limit applies to the number of elements rather than bytes. This results in the following problem size limitations, in terms of local pencil size:

  • For problems routed to MPI_Alltoall with an int32_t count: (2^31-1) * <nranks> elements per local pencil
  • For problems routed to other routines with MPI_Alltoallv-like patterns with int32_t counts and offsets: (2^31-1) elements per local pencil

This leads to a maximum local pencil size of 8 GiB for our smallest supported type, float, up to 32 GiB for complex<double>. With workspace requirements (2x the local pencil size), and assuming most workloads use GPU memory for other things as well, these limitations were generally not an issue for users on GPUs with 40-80 GiB capacities. However, the code currently does not inform users when they violate these limits and just fails silently, which is not ideal.

This PR remedies this situation by:

  1. Replacing int32_t with int64_t for internal count/offset handling, only downcasting to int32_t when needed for MPI APIs. This enables the NCCL- and NVSHMEM-based backends to correctly run on very large inputs without these MPI-specific limits.
  2. Adding checks that count and offset arguments fit in int32_t before downcasting and, if not, throwing a not-supported error informing the user that the particular transpose/halo backend is not usable at their problem size.

When "big count" support is more widely available in MPI, we can adopt those APIs (e.g., MPI_Alltoall_c), but as of now, even the current Open MPI 5.x release does not have these functions available.

This PR does not address communication backend autotuning for these large input sizes. Autotuning can potentially error out when testing the MPI-based backends, even if the NCCL and NVSHMEM backends are viable candidates. This will be addressed in a follow-up PR.

@romerojosh
Collaborator Author

/build

@github-actions

🚀 Build workflow triggered! View run

@github-actions

✅ Build workflow passed! View run

@romerojosh force-pushed the large_count_guardrails branch from 3272e1b to c1feca0 on February 17, 2026 at 19:42
@romerojosh merged commit a8f5668 into main on Feb 17, 2026
4 checks passed
@romerojosh deleted the large_count_guardrails branch on February 24, 2026