transport: Implement EFA fabric type selection using NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT #35
nmazzilli3 wants to merge 2 commits into NVIDIA:devel
Conversation
nmazzilli3 commented on Nov 21, 2025
- Add the environment variable NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT to toggle between the efa-direct and efa fabrics (usage sketch below)
- Implement EFA fabric type selection using NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT
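A minimal usage sketch of the toggle, assuming the libfabric transport reads the variable during nvshmem_init() and treats a non-empty value such as "1" as "disable efa-direct" (the accepted values and default are defined by this patch, not confirmed here):

```c
/* Sketch: opting out of the efa-direct provider before NVSHMEM initializes.
 * Using the value "1" is an assumed convention; consult the patch for the
 * exact semantics. */
#include <stdlib.h>
#include <nvshmem.h>

int main(void) {
    /* Must be set before nvshmem_init() so the libfabric transport sees it
     * when selecting a fabric/provider. Equivalent to exporting
     * NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT=1 in the launch environment. */
    setenv("NVSHMEM_DISABLE_LIBFABRIC_EFA_DIRECT", "1", 1);

    nvshmem_init();
    /* ... application ... */
    nvshmem_finalize();
    return 0;
}
```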
Hi @a-szegel @abrooks98, hit this exact issue today on AWS p5en with libfabric 2.4 + NVSHMEM 3.6.5: any host-side RMA from a non-symmetric-heap source buffer (e.g. nvshmemx_putmem_nbi_on_stream / nvshmemx_int64_p_on_stream with a regular CUDA tensor) crashes with EINVAL and "EFA direct requires FI_MR_LOCAL but application does not provide a valid desc". This patch is exactly what we need; any chance it could be prioritized for the next release? Thanks @nmazzilli3 for putting it up!
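For context, a minimal sketch of the call pattern described above; the buffer size, peer selection, and omitted error handling are placeholders, not details taken from the original report:

```c
/* Sketch: host-issued, on-stream put whose source is ordinary cudaMalloc
 * memory, i.e. not on the symmetric heap and not registered with NVSHMEM. */
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();

    size_t bytes = 1 << 20;
    void *dst = nvshmem_malloc(bytes);   /* symmetric destination */

    void *src = NULL;
    cudaMalloc(&src, bytes);             /* local source, NOT registered */

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* On the efa-direct path the transport needs a valid memory descriptor
     * (FI_MR_LOCAL) for `src`; without one, the reported EINVAL failure
     * occurs here. */
    nvshmemx_putmem_nbi_on_stream(dst, src, bytes, peer, stream);
    nvshmemx_quiet_on_stream(stream);
    cudaStreamSynchronize(stream);

    cudaFree(src);
    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```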
Hey @quanta42,
This sounds like a bug introduced in the NVSHMEM libfabric transport. Why are you passing in a NULL or invalid descriptor? The efa-direct provider should be the path NVSHMEM takes on p5en with a recent EFA Installer (>1.44.0), although we always recommend using the latest EFA Installer. This change was only added so we could swap back and forth between the protocol provider and the direct provider for testing purposes; it is not the correct workaround for your issue.
Hi @quanta42, in your code, are you registering the source buffer with nvshmemx_buffer_register? Here is the relevant NVSHMEM API doc: link
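A sketch of what that registration could look like for a plain cudaMalloc source buffer; the size, stream, and peer are illustrative, and registering once up front (rather than per transfer) follows the guidance later in this thread:

```c
/* Sketch: register a locally allocated source buffer once, reuse it for
 * host-issued on-stream puts, and unregister it before freeing. */
void *src = NULL;
cudaMalloc(&src, bytes);
nvshmemx_buffer_register(src, bytes);   /* gives the transport a valid desc */

/* ... many transfers reusing the same registered buffer ... */
nvshmemx_putmem_nbi_on_stream(dst, src, bytes, peer, stream);
nvshmemx_quiet_on_stream(stream);

nvshmemx_buffer_unregister(src);        /* before cudaFree(src) */
cudaFree(src);
```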
That's the answer. Our source was a plain cudaMalloc from PyTorch's caching allocator with no nvshmemx_buffer_register call, so this is on us, not a transport bug. We'll fix it on our side by hooking PyTorch's caching allocator. One thing the doc doesn't quantify: is there a rough order of magnitude for nvshmemx_buffer_register cost (per call, and per byte if it pins/maps), so we can size the hook correctly?
Memory registration is expensive, so it is best to alloc/register once during init and reuse the same buffers for the life of the program. |