libfabric: fix multi-NIC RX imbalance for EFA transport #76
Open
crazyguitar wants to merge 2 commits into NVIDIA:devel from
Conversation
The multi-NIC round-robin in get_next_ep() only balanced TX by rotating the local sending EP. The remote target EP and MR key selection were coupled to the same local domain index, causing all incoming RDMA writes to land on a single NIC per GPU.

Decouple remote NIC selection from the local EP by introducing get_next_remote_domain(), which uses (my_pe + target_pe) % num_proxy_domains to distribute RX across all remote NICs. Different senders now target different NICs on the same destination PE.

Before: TX distributed across 4 NICs, RX bottlenecked on 1 NIC
After: TX and RX both distributed across 4 NICs
The static remote NIC mapping (my_pe + target_pe) % num_proxy_domains distributes RX across remote NICs when multiple senders target the same PE, but it does not balance RX when a single sender repeatedly puts to the same destination: all traffic lands on the same remote NIC.

Add a remote_ep_cntr that round-robins the remote domain selection on top of the per-sender base offset. This ensures RX is distributed even in single-sender-to-single-receiver patterns (e.g., the two-PE case), while still spreading traffic from different senders across different remote NICs.
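Taken together, the two commits amount to a selection routine like the sketch below. The struct layout, field names, and function signature here are assumptions for illustration, not the actual transport code:

```c
#include <stdint.h>

/* Illustrative transport state only; the real code keeps these
 * fields elsewhere. num_proxy_domains is the NIC count per PE
 * (4 per GPU on p5.48xlarge). */
typedef struct {
    int      my_pe;
    int      num_proxy_domains;
    uint64_t remote_ep_cntr; /* rotating counter added by commit 2 */
} transport_state_t;

/* Pick the remote domain (NIC) on target_pe for the next RDMA op:
 * base   - static per-sender offset so different senders hit
 *          different remote NICs on the same PE (commit 1);
 * rotate - round-robin on top of base so a single sender also
 *          spreads its RX across all remote NICs (commit 2). */
static inline int get_next_remote_domain(transport_state_t *s, int target_pe)
{
    int base   = (s->my_pe + target_pe) % s->num_proxy_domains;
    int rotate = (int)(s->remote_ep_cntr++ % s->num_proxy_domains);
    return (base + rotate) % s->num_proxy_domains;
}
```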
Hi team,
I tested NVSHMEM 3.6.5 multi-NIC on an AWS p5.48xlarge Slurm cluster using the EFA libfabric transport. While TX traffic is evenly distributed across all 4 EFA NICs per GPU, RX traffic is funneled to a single NIC per GPU, creating a receive-side bottleneck. The following figure shows the issue, using rdmatop to monitor EFA traffic:

After further investigation, I found that the root cause is that rma_impl couples the remote target_ep and MR key to the sender's local domain_idx, so all senders targeting the same destination PE route RX to the same remote NIC.
For example, when all PEs send to PE 8:
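In rough pseudocode (identifier names are assumed; the per-PE EP stride of 5 is inferred from the arithmetic below):

```c
/* Hedged sketch of the old mapping in rma_impl: the remote EP index
 * is derived from the sender's local domain index, so every sender
 * computes the same remote EP for a given target PE. */
static int old_target_ep(int target_pe, int num_eps_per_pe,
                         int local_domain_idx)
{
    return target_pe * num_eps_per_pe + local_domain_idx;
}
```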
All PEs compute the same target_ep = 8*5 + 1 = 41, directing all traffic to PE 8's EP1 (same NIC). To fix this, we shift the remote domain index by my_pe so each sender targets a different remote NIC:
For example:
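A minimal sketch of the shifted mapping, using the same assumed names (num_proxy_domains = 4 NICs per GPU on p5.48xlarge):

```c
/* With the fix, the remote domain depends on the sender's PE, so
 * different senders land on different remote NICs of the same PE. */
static int new_target_ep(int my_pe, int target_pe, int num_eps_per_pe,
                         int num_proxy_domains)
{
    int remote_domain = (my_pe + target_pe) % num_proxy_domains;
    return target_pe * num_eps_per_pe + remote_domain;
}

/* target_pe = 8: my_pe 0 -> EP 40, my_pe 1 -> EP 41,
 * my_pe 2 -> EP 42, my_pe 3 -> EP 43, my_pe 4 -> EP 40, ... */
```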
Test
I used the scripts here for testing. Before the fix, I observed that throughput was no better (or was worse) compared to 3.5.21, which does not support multi-NIC.
Result
After the fix, RX throughput is evenly distributed across all NICs.
Additionally, the alltoall benchmark shows higher bandwidth.
2-PE shmem_put_bw test result