
libfabric: fix multi-NIC RX imbalance for EFA transport#76

Open
crazyguitar wants to merge 2 commits into NVIDIA:devel from crazyguitar:hotfix/efa-rx-round-robin

Conversation

@crazyguitar commented Mar 25, 2026

Hi team,

I tested NVSHMEM 3.6.5 multi-NIC on an AWS p5.48xlarge Slurm cluster using the EFA libfabric transport. While TX traffic is evenly distributed across all 4 EFA NICs per GPU, RX traffic is funneled to a single NIC per GPU, creating a receive-side bottleneck. The following rdmatop capture of EFA traffic shows the issue:
[Screenshot 2026-03-24: rdmatop showing all RX traffic landing on a single EFA NIC per GPU]

After further investigation, the root cause is that rma_impl couples the remote target_ep and MR key to the sender's local domain_idx, so all senders targeting the same destination PE route RX to the same remote NIC.

target_ep = pe * libfabric_state->eps.size() + ep_idx;

For example, when all PEs send to PE 8:

PE 0 (my_pe=0), proxy_ep_cntr starts at 0:
  req 1: PE 8  → ep_idx=1 → target_ep=8*5+1=41,  hdls[1].key → PE8's EP1 (NIC 1)
PE 1 (my_pe=1), proxy_ep_cntr ALSO starts at 0:
  req 1: PE 8  → ep_idx=1 → target_ep=8*5+1=41,  hdls[1].key → PE8's EP1 (NIC 1)
PE 2 (my_pe=2), proxy_ep_cntr ALSO starts at 0:
  req 1: PE 8  → ep_idx=1 → target_ep=8*5+1=41,  hdls[1].key → PE8's EP1 (NIC 1)
...

PE 8 receives (req 1):
    PE 0→8: ep_idx=1 → target_ep=41, hdls[1].key → NIC 1 ← same! 
    PE 1→8: ep_idx=1 → target_ep=41, hdls[1].key → NIC 1  ← same! 
    PE 2→8: ep_idx=1 → target_ep=41, hdls[1].key → NIC 1  ← same!
    PE 3→8: ep_idx=1 → target_ep=41, hdls[1].key → NIC 1  ← same!
    PE 4→8: ep_idx=1 → target_ep=41, hdls[1].key → NIC 1  ← same!
    PE 5→8: ep_idx=1 → target_ep=41, hdls[1].key → NIC 1  ← same!
    PE 6→8: ep_idx=1 → target_ep=41, hdls[1].key → NIC 1  ← same!
    PE 7→8: ep_idx=1 → target_ep=41, hdls[1].key → NIC 1  ← same!
    ALL on NIC 1

All PEs compute the same target_ep = 8*5 + 1 = 41, directing all traffic to PE 8's EP1 (same NIC). To fix this, we shift the remote domain index by my_pe so each sender targets a different remote NIC:

// remote_ep_cntr is a per-sender RX round-robin counter
int base = (my_pe + pe) % state->num_proxy_domains;
int rr = (state->remote_ep_cntr++) % state->num_proxy_domains;
return ((base + rr) % state->num_proxy_domains) + state->num_host_domains;

For example:

PE 0 (my_pe=0), proxy_ep_cntr starts at 0:
  req 1: PE 8  → ep_idx=1, remote=(0+8)%4+1=1  → target_ep=8*5+1=41,  hdls[1].key → PE8's EP1 (NIC 1)

PE 1 (my_pe=1), proxy_ep_cntr ALSO starts at 0:
  req 1: PE 8  → ep_idx=1, remote=(1+8)%4+1=2  → target_ep=8*5+2=42,  hdls[2].key → PE8's EP2 (NIC 2)
...
PE 8 receives (req 1):
  PE 0→8: target_ep=41 (EP1), hdls[1].key → NIC 1
  PE 1→8: target_ep=42 (EP2), hdls[2].key → NIC 2
  PE 2→8: target_ep=43 (EP3), hdls[3].key → NIC 3
  PE 3→8: target_ep=44 (EP4), hdls[4].key → NIC 4
  PE 4→8: target_ep=41 (EP1), hdls[1].key → NIC 1
  PE 5→8: target_ep=42 (EP2), hdls[2].key → NIC 2
  PE 6→8: target_ep=43 (EP3), hdls[3].key → NIC 3
  PE 7→8: target_ep=44 (EP4), hdls[4].key → NIC 4
  2 senders per NIC, balanced ✓

Test

I used the scripts here for testing. Before the fix, throughput was no better (and sometimes worse) than 3.5.21, which does not support multi-NIC:

salloc -N 4 bash examples/nvshmem/nvshmem.sbatch \
  /opt/nvshmem/bin/perftest/device/coll/alltoall_latency \
  -b 16 -e 16M -f 2 -n 2048 -s all

Result

3.6.5 (before fix):
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s)
128         32        32-bit    block     206.624061        0.001         0.001
256         64        32-bit    block     208.148360        0.001         0.001
512         128       32-bit    block     206.503510        0.002         0.002
1024        256       32-bit    block     210.159823        0.005         0.005
2048        512       32-bit    block     206.275970        0.010         0.010
4096        1024      32-bit    block     207.383484        0.020         0.019
8192        2048      32-bit    block     213.627428        0.038         0.037
16384       4096      32-bit    block     208.908468        0.078         0.076
32768       8192      32-bit    block     210.768104        0.155         0.151
65536       16384     32-bit    block     209.307849        0.313         0.303
131072      32768     32-bit    block     211.907744        0.619         0.599
262144      65536     32-bit    block     213.547572        1.228         1.189
524288      131072    32-bit    block     224.403918        2.336         2.263
1048576     262144    32-bit    block     243.535280        4.306         4.171
2097152     524288    32-bit    block     235.484257        8.906         8.627
4194304     1048576   32-bit    block     368.143827        11.393        11.037
8388608     2097152   32-bit    block     1047.248483       8.010         7.760
16777216    4194304   32-bit    block     1689.005733       9.933         9.623
33554432    8388608   32-bit    block     3266.027927       10.274        9.953
67108864    16777216  32-bit    block     6136.326790       10.936        10.595
134217728   33554432  32-bit    block     12551.132202      10.694        10.359
268435456   67108864  32-bit    block     25309.883118      10.606        10.275
536870912   134217728 32-bit    block     45084.136963      11.908        11.536
1073741824  268435456 32-bit    block     97204.849243      11.046        10.701

3.5.21:
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s)
128         32        32-bit    block     203.268245        0.001         0.001
256         64        32-bit    block     207.915753        0.001         0.001
512         128       32-bit    block     207.727894        0.002         0.002
1024        256       32-bit    block     209.168151        0.005         0.005
2048        512       32-bit    block     208.447486        0.010         0.010
4096        1024      32-bit    block     209.432021        0.020         0.019
8192        2048      32-bit    block     209.919959        0.039         0.038
16384       4096      32-bit    block     205.063894        0.080         0.077
32768       8192      32-bit    block     213.067710        0.154         0.149
65536       16384     32-bit    block     209.287599        0.313         0.303
131072      32768     32-bit    block     213.387713        0.614         0.595
262144      65536     32-bit    block     211.623609        1.239         1.200
524288      131072    32-bit    block     220.823601        2.374         2.300
1048576     262144    32-bit    block     199.459255        5.257         5.093
2097152     524288    32-bit    block     262.414247        7.992         7.742
4194304     1048576   32-bit    block     423.305243        9.908         9.599
8388608     2097152   32-bit    block     770.919442        10.881        10.541
16777216    4194304   32-bit    block     1305.613399       12.850        12.448
33554432    8388608   32-bit    block     2475.131273       13.557        13.133
67108864    16777216  32-bit    block     5027.430534       13.349        12.931
134217728   33554432  32-bit    block     8755.047798       15.330        14.851
268435456   67108864  32-bit    block     19208.671570      13.975        13.538
536870912   134217728 32-bit    block     37267.414093      14.406        13.956
1073741824  268435456 32-bit    block     74889.076233      14.338        13.890

After the fix, RX throughput is evenly distributed across all NICs:

[Screenshot 2026-03-25: rdmatop showing RX traffic evenly distributed across all EFA NICs]

Additionally, the alltoall benchmark shows higher bandwidth:

size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
128         32        32-bit    block     225.921094        0.001         0.001       
256         64        32-bit    block     223.368153        0.001         0.001       
512         128       32-bit    block     225.843951        0.002         0.002       
1024        256       32-bit    block     223.616228        0.005         0.004       
2048        512       32-bit    block     226.259828        0.009         0.009        
4096        1024      32-bit    block     226.011723        0.018         0.018       
8192        2048      32-bit    block     225.235820        0.036         0.035       
16384       4096      32-bit    block     223.464191        0.073         0.071       
32768       8192      32-bit    block     227.559865        0.144         0.139       
65536       16384     32-bit    block     226.339921        0.290         0.280       
131072      32768     32-bit    block     229.552284        0.571         0.553                  
262144      65536     32-bit    block     223.636284        1.172         1.136       
524288      131072    32-bit    block     233.147785        2.249         2.178       
1048576     262144    32-bit    block     248.655841        4.217         4.085       
2097152     524288    32-bit    block     243.967786        8.596         8.327       
4194304     1048576   32-bit    block     238.519609        17.585        17.035      
8388608     2097152   32-bit    block     319.415510        26.262        25.442      
16777216    4194304   32-bit    block     489.671409        34.262        33.191      
33554432    8388608   32-bit    block     754.934430        44.447        43.058      
67108864    16777216  32-bit    block     1532.771945       43.783        42.414      
134217728   33554432  32-bit    block     3658.159018       36.690        35.543      
268435456   67108864  32-bit    block     7111.878395       37.745        36.565      
536870912   134217728  32-bit    block     13918.029785      38.574        37.368      
1073741824  268435456  32-bit    block     27586.933136      38.922        37.706 

2-PE shmem_put_bw test

salloc -N 2 NTASKS_PER_NODE=1 bash examples/nvshmem/nvshmem.sbatch \
  /opt/nvshmem/install/bin/perftest/device/pt-to-pt/shmem_put_bw -b 8 -e 1G -f 2 -n 10000 -w 100

Result

#shmem_put_bw_uni
size(B)     scope     BW (GB/sec)
8           None      0.004235                                                                
16          None      0.008460                                                                
32          None      0.016930                                                                
64          None      0.033860                                                                
128         None      0.067679                                                                
256         None      0.010519                                                                
512         None      0.021184                                                                
1024        None      0.042399        
2048        None      0.084520        
4096        None      0.170939        
8192        None      0.333568        
16384       None      0.655396        
32768       None      1.322301        
65536       None      2.732627        
131072      None      5.265629        
262144      None      10.510950       
524288      None      21.094450       
1048576     None      43.720306       
2097152     None      46.641781       
4194304     None      47.746071       
8388608     None      48.348083       
16777216    None      48.883083       
33554432    None      48.746395       
67108864    None      48.837250       
134217728   None      48.878212       
268435456   None      48.886234       
536870912   None      48.879814       
1073741824  None      48.886169
[Screenshot 2026-03-25: rdmatop during the 2-PE shmem_put_bw run, showing RX spread across NICs]

The multi-NIC round-robin in get_next_ep() only balanced TX by rotating
the local sending EP. The remote target EP and MR key selection were
coupled to the same local domain index, causing all incoming RDMA writes
to land on a single NIC per GPU.

Decouple remote NIC selection from local EP by introducing
get_next_remote_domain(), which uses (my_pe + target_pe) %
num_proxy_domains to distribute RX across all remote NICs. Different
senders now target different NICs on the same destination PE.

Before: TX distributed across 4 NICs, RX bottlenecked on 1 NIC
After:  TX and RX both distributed across 4 NICs
The static remote NIC mapping (my_pe + target_pe) % num_proxy_domains
distributes RX across remote NICs when multiple senders target the same
PE, but does not balance RX when a single sender repeatedly puts to the
same destination — all traffic lands on the same remote NIC.

Add a remote_ep_cntr that round-robins the remote domain selection on
top of the per-sender base offset. This ensures RX is distributed even
in single-sender-to-single-receiver patterns (e.g., two-PE case), while
still spreading traffic from different senders across different remote
NICs.
