
transport: Implement EFA fabric type selection using NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT #35

Open
nmazzilli3 wants to merge 2 commits into NVIDIA:devel from nmazzilli3:efa-direct-env-variable

Conversation

@nmazzilli3

  • Add environment variable NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT to toggle between the efa-direct and efa fabrics
  • Implement EFA fabric type selection using NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT
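Assuming the variable is read as a boolean at transport initialization (as the PR title suggests; the exact semantics are defined by the patch itself), usage would presumably look like:

```shell
# Hypothetical usage: ask the libfabric transport to skip the efa-direct
# provider and fall back to the plain efa (protocol) fabric.
export NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT=1

# Then launch the NVSHMEM application as usual, e.g.:
#   mpirun -np 2 ./my_nvshmem_app
echo "NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT=$NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT"
```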

…BRIC_EFA_DIRECT to toggle between efa-direct and efa-proto fabrics

Signed-off-by: Nick Mazzilli <nmazzill@amazon.com>
…M_DISABLE_LIBFABRIC_EFA_DIRECT

Signed-off-by: Nick Mazzilli <nmazzill@amazon.com>
@quanta42

quanta42 commented Apr 21, 2026

Hi @a-szegel @abrooks98, hit this exact issue today on AWS p5en with libfabric 2.4 + NVSHMEM 3.6.5: any host-side RMA from a non-symmetric-heap source buffer (e.g. nvshmemx_putmem_nbi_on_stream / nvshmemx_int64_p_on_stream with a regular CUDA tensor) crashes with EINVAL and "EFA direct requires FI_MR_LOCAL but application does not provide a valid desc".
Device-side collectives like nvshmem_broadcast work fine since their source buffers are in the symm heap and have valid MRs.

This patch is exactly what we need, any chance it could be prioritized for the next release?
Happy to test against any pre-release if helpful.

Thanks @nmazzilli3 for putting it up!

@a-szegel
Contributor

a-szegel commented Apr 21, 2026

Hey @quanta42 ,

"EFA direct requires FI_MR_LOCAL but application does not provide a valid desc"

This sounds like a bug introduced in the NVSHMEM libfabric transport. Why are you passing in a NULL or invalid descriptor? The efa-direct provider should be the path NVSHMEM takes on p5en with a recent EFA Installer (>1.44.0), although we always recommend using the latest EFA Installer.

This change was just put there so we could swap back and forth between the protocol provider and the direct provider for testing purposes. This is not the correct workaround for your issue.

@rauteric
Contributor

Hi @quanta42,

In your code, are you registering the source buffer with nvshmemx_buffer_register before using it with the put APIs? This is required for the put APIs if you are not using symmetric memory. EFA full fabric (not direct) may have supported using a buffer without registering it, even though registration is required by the NVSHMEM API.

Here is the relevant NVSHMEM API doc: link
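A minimal sketch of the registration flow described above, assuming the `nvshmemx_buffer_register`/`nvshmemx_buffer_unregister` API from the NVSHMEM memory-management docs (buffer size, peer choice, and error handling are illustrative; this needs NVSHMEM-capable hardware to actually run):

```c
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();

    size_t bytes = 1 << 20;
    void *src;
    cudaMalloc(&src, bytes);
    /* Register the non-symmetric source buffer so the transport has a
     * valid memory registration (the missing "desc" in the EFA error). */
    nvshmemx_buffer_register(src, bytes);

    void *dst = nvshmem_malloc(bytes);   /* symmetric destination */
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();

    /* Host-issued put from the registered local buffer. */
    nvshmemx_putmem_nbi_on_stream(dst, src, bytes, peer, stream);
    nvshmemx_quiet_on_stream(stream);
    cudaStreamSynchronize(stream);

    nvshmemx_buffer_unregister(src);
    nvshmem_free(dst);
    cudaFree(src);
    nvshmem_finalize();
    return 0;
}
```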

@quanta42

Thanks @a-szegel @rauteric,

That's the answer. Our source was a plain cudaMalloc from PyTorch's caching allocator with no nvshmemx_buffer_register call. So this is on us, not a transport bug.

Will fix on our side by hooking PyTorch's caching allocator.

One thing the doc doesn't quantify: any rough order of magnitude on nvshmemx_buffer_register cost (per call, and per byte if it pins/maps), so we can size the hook correctly?

@a-szegel
Contributor

Memory registration is expensive, so it is best to alloc/register once during init and reuse the same buffers for the life of the program.
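That register-once-and-reuse pattern could be sketched as follows (pool size, the staging-copy approach, and the helper names are illustrative, not part of the NVSHMEM API; only `nvshmemx_buffer_register`/`nvshmemx_buffer_unregister` and `nvshmemx_putmem_nbi_on_stream` come from the docs):

```c
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

/* Register one large staging pool at init and reuse it for the life of
 * the program, instead of paying registration cost on every transfer. */
static void *g_pool;
static const size_t g_pool_bytes = 64ull << 20;  /* 64 MiB, illustrative */

void pool_init(void) {
    cudaMalloc(&g_pool, g_pool_bytes);
    nvshmemx_buffer_register(g_pool, g_pool_bytes);  /* one-time cost */
}

void pool_fini(void) {
    nvshmemx_buffer_unregister(g_pool);
    cudaFree(g_pool);
}

/* Per-transfer hot path: stage into the already-registered pool, then put.
 * No registration happens here. */
void put_via_pool(void *sym_dst, const void *src, size_t bytes, int pe,
                  cudaStream_t s) {
    cudaMemcpyAsync(g_pool, src, bytes, cudaMemcpyDeviceToDevice, s);
    nvshmemx_putmem_nbi_on_stream(sym_dst, g_pool, bytes, pe, s);
}
```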
