mpirun noticed that process rank 1 with PID 40524 on node node02 exited on signal 6 (Aborted).
void ibdev2netdev(const char indev[64], char netdev[64]) {
    /* "mlx5_bond_<N>" is rewritten to "bond<N>"; any other name is copied unchanged. */
    if (strncmp(indev, "mlx5_bond_", 10) == 0) {
        strcpy(netdev, "bond");
        strcat(netdev, indev + 10);
    } else {
        strcpy(netdev, indev);
    }
}
The logic of this code looks odd to me: it seems to produce the kernel network interface name rather than the IB device name.
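To make the behaviour concrete, here is a minimal sketch of the same mapping. It is my own rewrite, not the project's code: the name `ibdev2netdev_sketch` and the `DEV_NAME_LEN` constant are made up for illustration, and I swapped `strcpy`/`strcat` for bounded `snprintf` calls.

```c
#include <stdio.h>
#include <string.h>

#define DEV_NAME_LEN 64  /* assumed to match the 64-byte buffers in the snippet above */

/* Hypothetical rewrite of the quoted mapping, using bounded copies.
 * "mlx5_bond_<N>" becomes "bond<N>"; every other name is passed through. */
void ibdev2netdev_sketch(const char *indev, char netdev[DEV_NAME_LEN]) {
    const char prefix[] = "mlx5_bond_";
    const size_t prefix_len = sizeof(prefix) - 1;  /* 10 characters */

    if (strncmp(indev, prefix, prefix_len) == 0) {
        /* Bonded device: keep only the suffix after the prefix */
        snprintf(netdev, DEV_NAME_LEN, "bond%s", indev + prefix_len);
    } else {
        /* Anything else is copied unchanged */
        snprintf(netdev, DEV_NAME_LEN, "%s", indev);
    }
}
```

Called with `"mlx5_bond_0"` this writes `"bond0"` into `netdev`, which looks like a kernel bonding interface name; called with `"mlx5_0"` it writes `"mlx5_0"` back unchanged. If the result is later matched against RDMA device names, `bond0` would presumably not match, but whether that is what triggers the `g_ibv_ctx` assertion is only my guess and I have not verified it.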
mpirun --prefix /usr/local/openmpi-4.1.6 -np 2 \
    -hostfile ./hostfile \
    --allow-run-as-root \
    --mca pml ob1 \
    --mca btl self,tcp,vader \
    --mca btl_openib_warn_no_device_params_found 0 \
    -x AS_LOG_LEVEL=0 \
    ./bin/SimAI_phynet ./hostlist -g 2 \
    -w /home/SimAI/aicb/results/workload/None-gpt_13B-world_size2-tp2-pp1-ep1-gbs32-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt \
    -i 3
SimAI_phynet: /home/SimAI/astra-sim-alibabacloud/astra-sim/system/SimAiFlowModelRdma.cc:706: int FlowPhyRdma::ibv_init(): Assertion `g_ibv_ctx != NULL' failed.
[test:3070717] *** Process received signal ***
[test:3070717] Signal: Aborted (6)
[test:3070717] Signal code: (-6)
[test:3070717] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x78a877a45330]
[test:3070717] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x78a877a9eb2c]
[test:3070717] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x78a877a4527e]
[test:3070717] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x78a877a288ff]
[test:3070717] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2881b)[0x78a877a2881b]
[test:3070717] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x3b517)[0x78a877a3b517]
[test:3070717] [ 6] ./bin/SimAI_phynet(+0x314c7)[0x5885632614c7]
[test:3070717] [ 7] ./bin/SimAI_phynet(+0x20c68)[0x588563250c68]
[test:3070717] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x78a877a2a1ca]
[test:3070717] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x78a877a2a28b]
[test:3070717] [10] ./bin/SimAI_phynet(+0x1bf35)[0x58856324bf35]
[test:3070717] *** End of error message ***
SimAI_phynet: /home/SimAI/astra-sim-alibabacloud/astra-sim/system/SimAiFlowModelRdma.cc:706: int FlowPhyRdma::ibv_init(): Assertion `g_ibv_ctx != NULL' failed.
[test-Super-Server:40524] *** Process received signal ***
[test-Super-Server:40524] Signal: Aborted (6)
[test-Super-Server:40524] Signal code: (-6)
[test-Super-Server:40524] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7e33a3442520]
[test-Super-Server:40524] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7e33a34969fc]
[test-Super-Server:40524] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7e33a3442476]
[test-Super-Server:40524] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7e33a34287f3]
[test-Super-Server:40524] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7e33a342871b]
[test-Super-Server:40524] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7e33a3439e96]
[test-Super-Server:40524] [ 6] ./bin/SimAI_phynet(+0x30961)[0x59ba865f3961]
[test-Super-Server:40524] [ 7] ./bin/SimAI_phynet(+0x1fc95)[0x59ba865e2c95]
[test-Super-Server:40524] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7e33a3429d90]
[test-Super-Server:40524] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7e33a3429e40]
[test-Super-Server:40524] [10] ./bin/SimAI_phynet(+0x1b015)[0x59ba865de015]
[test-Super-Server:40524] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 1 with PID 40524 on node node02 exited on signal 6 (Aborted).