Use native NVSHMEM synchronization APIs in NVSHMEM backends #111

Open
romerojosh wants to merge 2 commits into main from nvshmem_barrier_signal_v2

@romerojosh
Collaborator

This PR is a second attempt at #107.

The implementation is unchanged from the original PR, but this PR adds a workaround for an NVSHMEM bug affecting versions <= 3.2.5. Testing showed that remote (i.e. non-NVLINK) signaling operations can, in some scenarios, trigger segfaults due to a caching bug in the symmetric heap management. The workaround is to enforce a symmetric heap granularity of at least 1 GiB (up from the default of 512 MiB) for impacted NVSHMEM versions.
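
For readers who want a concrete picture of the workaround, here is a minimal sketch of the version check and the granularity floor. It is not taken from this PR's diff: the helper names are hypothetical, and it assumes the granularity is controlled through the `NVSHMEM_CUMEM_GRANULARITY` environment variable (in bytes), set before NVSHMEM is initialized. The PR may apply the setting differently.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>
#include <tuple>

// Hypothetical constants for illustration; values follow the PR description.
constexpr std::uint64_t kDefaultGranularity = 1ull << 29;  // 512 MiB default
constexpr std::uint64_t kMinGranularity = 1ull << 30;      // 1 GiB floor

// True for NVSHMEM versions <= 3.2.5, the range described as impacted.
bool needs_granularity_workaround(int major, int minor, int patch) {
  return std::make_tuple(major, minor, patch) <= std::make_tuple(3, 2, 5);
}

// Returns the granularity to use: at least 1 GiB on impacted versions,
// otherwise the requested value unchanged.
std::uint64_t effective_granularity(std::uint64_t requested,
                                    int major, int minor, int patch) {
  if (needs_granularity_workaround(major, minor, patch) &&
      requested < kMinGranularity) {
    return kMinGranularity;
  }
  return requested;
}

// Assumption: the setting is read from NVSHMEM_CUMEM_GRANULARITY at init
// time, so this must run before NVSHMEM initialization. Uses POSIX setenv.
void apply_granularity_workaround(int major, int minor, int patch) {
  const char* cur = std::getenv("NVSHMEM_CUMEM_GRANULARITY");
  const std::uint64_t requested =
      cur ? std::strtoull(cur, nullptr, 10) : kDefaultGranularity;
  const std::uint64_t g = effective_granularity(requested, major, minor, patch);
  if (g != requested) {
    setenv("NVSHMEM_CUMEM_GRANULARITY", std::to_string(g).c_str(), 1);
  }
}
```

A user-provided value of 1 GiB or more is left untouched in this sketch, as is any value on a non-impacted version.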

This can increase memory usage by up to 512 MiB for users currently running NVSHMEM backends with NVSHMEM version <= 3.2.5. To avoid this increase, users can upgrade to NVSHMEM 3.3.24 or greater (included in NVHPC SDK 25.9+).
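
To see where the "up to 512 MiB" figure comes from, the following sketch assumes the symmetric heap is rounded up to a multiple of the granularity. That rounding model and the helper names are assumptions for illustration only, not NVSHMEM APIs.

```cpp
#include <cstdint>

constexpr std::uint64_t MiB = 1ull << 20;

// Round size up to the next multiple of granularity (granularity > 0).
constexpr std::uint64_t round_up(std::uint64_t size, std::uint64_t granularity) {
  return ((size + granularity - 1) / granularity) * granularity;
}

// Extra memory reserved when the granularity goes from 512 MiB to 1 GiB.
constexpr std::uint64_t extra_bytes(std::uint64_t heap_size) {
  return round_up(heap_size, 1024 * MiB) - round_up(heap_size, 512 * MiB);
}

// A 1.5 GiB heap stays at 1.5 GiB with 512 MiB granularity but becomes
// 2 GiB with 1 GiB granularity: the worst case of 512 MiB extra.
static_assert(extra_bytes(1536 * MiB) == 512 * MiB, "worst case");
// A 600 MiB heap rounds to 1 GiB under both settings: no extra memory.
static_assert(extra_bytes(600 * MiB) == 0, "no increase");
```

Under this model the increase is either 0 or 512 MiB, depending on where the requested heap size falls relative to the 1 GiB boundaries.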

…ckends (#107)" (#110)"

This reverts commit e401242.

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh
Collaborator Author

/build

@github-actions

🚀 Build workflow triggered! View run

@github-actions

✅ Build workflow passed! View run
