
transport: Implement EFA fabric type selection using NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT #35

Open
nmazzilli3 wants to merge 2 commits into NVIDIA:devel from nmazzilli3:efa-direct-env-variable

Conversation

@nmazzilli3

  • Add environment variable NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT to toggle between the efa-direct and efa fabrics
  • Implement EFA fabric type selection using NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT
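Assuming the variable is read as a boolean at transport initialization (as the PR title suggests; the exact semantics are defined by the patch itself), usage would presumably look like:

```shell
# Hypothetical usage: ask the libfabric transport to skip the efa-direct
# provider and fall back to the plain efa (protocol) fabric.
export NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT=1

# Then launch the NVSHMEM application as usual, e.g.:
#   mpirun -np 2 ./my_nvshmem_app
echo "NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT=$NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT"
```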

…BRIC_EFA_DIRECT to toggle between efa-direct and efa-proto fabrics

Signed-off-by: Nick Mazzilli <nmazzill@amazon.com>
…M_DISABLE_LIBFABRIC_EFA_DIRECT

Signed-off-by: Nick Mazzilli <nmazzill@amazon.com>
@quanta42

quanta42 commented Apr 21, 2026

Hi @a-szegel @abrooks98, hit this exact issue today on AWS p5en with libfabric 2.4 + NVSHMEM 3.6.5: any host-side RMA from a non-symmetric-heap source buffer (e.g. nvshmemx_putmem_nbi_on_stream / nvshmemx_int64_p_on_stream with a regular CUDA tensor) crashes with EINVAL and "EFA direct requires FI_MR_LOCAL but application does not provide a valid desc".
Device-side collectives like nvshmem_broadcast work fine since their source buffers are in the symm heap and have valid MRs.

This patch is exactly what we need, any chance it could be prioritized for the next release?
Happy to test against any pre-release if helpful.

Thanks @nmazzilli3 for putting it up!

@a-szegel
Contributor

a-szegel commented Apr 21, 2026

Hey @quanta42 ,

"EFA direct requires FI_MR_LOCAL but application does not provide a valid desc"

This sounds like a bug introduced in the NVSHMEM libfabric transport. Why are you passing in a NULL or invalid descriptor? The efa-direct provider should be the path NVSHMEM takes on p5en with a recent EFA Installer (>1.44.0), although we always recommend using the latest EFA Installer.

This change was just put there so we could swap back and forth between the protocol provider and the direct provider for testing purposes. This is not the correct workaround for your issue.

@rauteric
Contributor

Hi @quanta42,

In your code, are you registering the source buffer with nvshmemx_buffer_register before using it with the put APIs? This is required for the put APIs if you are not using symmetric memory. EFA full fabric (not direct) may have supported using a buffer without registering it, even though registration is required by the NVSHMEM API.

Here is the relevant NVSHMEM API doc: link
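A minimal sketch of the registration flow described above, assuming the `nvshmemx_buffer_register`/`nvshmemx_buffer_unregister` API from the NVSHMEM memory-management docs (buffer size, peer choice, and error handling are illustrative; this needs NVSHMEM-capable hardware to actually run):

```c
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();

    size_t bytes = 1 << 20;
    void *src;
    cudaMalloc(&src, bytes);
    /* Register the non-symmetric source buffer so the transport has a
     * valid memory registration (the missing "desc" in the EFA error). */
    nvshmemx_buffer_register(src, bytes);

    void *dst = nvshmem_malloc(bytes);   /* symmetric destination */
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();

    /* Host-issued put from the registered local buffer. */
    nvshmemx_putmem_nbi_on_stream(dst, src, bytes, peer, stream);
    nvshmemx_quiet_on_stream(stream);
    cudaStreamSynchronize(stream);

    nvshmemx_buffer_unregister(src);
    nvshmem_free(dst);
    cudaFree(src);
    nvshmem_finalize();
    return 0;
}
```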

@quanta42

Thanks @a-szegel @rauteric,

That's the answer. Our source was a plain cudaMalloc from PyTorch's caching allocator with no nvshmemx_buffer_register call. So this is on us, not a transport bug.

Will fix on our side by hooking PyTorch's caching allocator.

One thing the doc doesn't quantify: any rough order of magnitude on nvshmemx_buffer_register cost (per call, and per byte if it pins/maps), so we can size the hook correctly?

@a-szegel
Contributor

Memory registration is expensive, so it is best to alloc/register once during init and reuse the same buffers for the life of the program.
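That register-once-and-reuse pattern could be sketched as follows (pool size, the staging-copy approach, and the helper names are illustrative, not part of the NVSHMEM API; only `nvshmemx_buffer_register`/`nvshmemx_buffer_unregister` and `nvshmemx_putmem_nbi_on_stream` come from the docs):

```c
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

/* Register one large staging pool at init and reuse it for the life of
 * the program, instead of paying registration cost on every transfer. */
static void *g_pool;
static const size_t g_pool_bytes = 64ull << 20;  /* 64 MiB, illustrative */

void pool_init(void) {
    cudaMalloc(&g_pool, g_pool_bytes);
    nvshmemx_buffer_register(g_pool, g_pool_bytes);  /* one-time cost */
}

void pool_fini(void) {
    nvshmemx_buffer_unregister(g_pool);
    cudaFree(g_pool);
}

/* Per-transfer hot path: stage into the already-registered pool, then put.
 * No registration happens here. */
void put_via_pool(void *sym_dst, const void *src, size_t bytes, int pe,
                  cudaStream_t s) {
    cudaMemcpyAsync(g_pool, src, bytes, cudaMemcpyDeviceToDevice, s);
    nvshmemx_putmem_nbi_on_stream(sym_dst, g_pool, bytes, pe, s);
}
```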
