
Using SimAI to simulate physical packet sending, but it never gets running and keeps reporting that opening the device failed #280

@Tupperis

Description


mpirun --prefix /usr/local/openmpi-4.1.6 -np 2 \
  -hostfile ./hostfile \
  --allow-run-as-root \
  --mca pml ob1 \
  --mca btl self,tcp,vader \
  --mca btl_openib_warn_no_device_params_found 0 \
  -x AS_LOG_LEVEL=0 \
  ./bin/SimAI_phynet ./hostlist -g 2 \
  -w /home/SimAI/aicb/results/workload/None-gpt_13B-world_size2-tp2-pp1-ep1-gbs32-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt \
  -i 3
SimAI_phynet: /home/SimAI/astra-sim-alibabacloud/astra-sim/system/SimAiFlowModelRdma.cc:706: int FlowPhyRdma::ibv_init(): Assertion `g_ibv_ctx != NULL' failed.
[test:3070717] *** Process received signal ***
[test:3070717] Signal: Aborted (6)
[test:3070717] Signal code: (-6)
[test:3070717] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x78a877a45330]
[test:3070717] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x78a877a9eb2c]
[test:3070717] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x78a877a4527e]
[test:3070717] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x78a877a288ff]
[test:3070717] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2881b)[0x78a877a2881b]
[test:3070717] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x3b517)[0x78a877a3b517]
[test:3070717] [ 6] ./bin/SimAI_phynet(+0x314c7)[0x5885632614c7]
[test:3070717] [ 7] ./bin/SimAI_phynet(+0x20c68)[0x588563250c68]
[test:3070717] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x78a877a2a1ca]
[test:3070717] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x78a877a2a28b]
[test:3070717] [10] ./bin/SimAI_phynet(+0x1bf35)[0x58856324bf35]
[test:3070717] *** End of error message ***
SimAI_phynet: /home/SimAI/astra-sim-alibabacloud/astra-sim/system/SimAiFlowModelRdma.cc:706: int FlowPhyRdma::ibv_init(): Assertion `g_ibv_ctx != NULL' failed.
[test-Super-Server:40524] *** Process received signal ***
[test-Super-Server:40524] Signal: Aborted (6)
[test-Super-Server:40524] Signal code: (-6)
[test-Super-Server:40524] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7e33a3442520]
[test-Super-Server:40524] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7e33a34969fc]
[test-Super-Server:40524] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7e33a3442476]
[test-Super-Server:40524] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7e33a34287f3]
[test-Super-Server:40524] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7e33a342871b]
[test-Super-Server:40524] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7e33a3439e96]
[test-Super-Server:40524] [ 6] ./bin/SimAI_phynet(+0x30961)[0x59ba865f3961]
[test-Super-Server:40524] [ 7] ./bin/SimAI_phynet(+0x1fc95)[0x59ba865e2c95]
[test-Super-Server:40524] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7e33a3429d90]
[test-Super-Server:40524] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7e33a3429e40]
[test-Super-Server:40524] [10] ./bin/SimAI_phynet(+0x1b015)[0x59ba865de015]
[test-Super-Server:40524] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 1 with PID 40524 on node node02 exited on signal 6 (Aborted).
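
The failed assertion `g_ibv_ctx != NULL` in `FlowPhyRdma::ibv_init()` suggests that neither rank obtained an RDMA device context. One quick thing to rule out is that a node simply has no RDMA device registered. The sketch below is not part of SimAI; it assumes a Linux host where the kernel lists registered RDMA devices under `/sys/class/infiniband`, and the helper name `count_sysfs_entries` is made up for illustration.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/* Hypothetical helper: count the non-hidden entries in a directory.
 * Returns -1 if the directory cannot be opened. Pointing it at
 * /sys/class/infiniband (an assumption about the host) gives the
 * number of RDMA devices the kernel has registered. */
int count_sysfs_entries(const char *dir_path) {
    assert(dir_path != NULL);

    DIR *dir = opendir(dir_path);
    if (dir == NULL) {
        return -1; /* directory missing or unreadable */
    }

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.') {
            continue; /* skip ".", ".." and hidden entries */
        }
        count++;
    }

    closedir(dir);
    return count;
}
```

If `count_sysfs_entries("/sys/class/infiniband")` returns -1 or 0 on either node, no RDMA device is visible there and opening one cannot succeed. A positive count would point instead at how the device name is selected. The `ibv_devices` utility from rdma-core reports similar information.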

void ibdev2netdev(const char indev[64], char netdev[64]) {
  if (strncmp(indev, "mlx5_bond_", 10) == 0) {
    strcpy(netdev, "bond");
    strcat(netdev, indev + 10);
  } else {
    strcpy(netdev, indev);
  }
}
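The logic of this code looks odd to me. It seems to end up with the kernel network interface name rather than the name of the IB device.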
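
To make the concern concrete, here is a minimal standalone restatement of the mapping in the snippet above. It is a sketch, not the SimAI source: the name `ibdev2netdev_sketch` is hypothetical, and `snprintf` is swapped in for `strcpy`/`strcat` so the copy stays within the 64-byte buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical restatement of the mapping quoted above.
 * "mlx5_bond_<N>" becomes "bond<N>"; any other name is copied
 * through unchanged. */
void ibdev2netdev_sketch(const char *indev, char netdev[64]) {
    assert(indev != NULL && netdev != NULL);

    if (strncmp(indev, "mlx5_bond_", 10) == 0) {
        /* Keep only the suffix after the 10-character prefix. */
        snprintf(netdev, 64, "bond%s", indev + 10);
    } else {
        /* No translation: the IB device name is reused as-is. */
        snprintf(netdev, 64, "%s", indev);
    }
}
```

With this mapping, `mlx5_bond_0` turns into `bond0`, while a name such as `mlx5_0` passes through unchanged. On a host without bonded devices, the result is therefore the IB device name itself and not a kernel interface name such as `eth0`. Whether that explains the failed open depends on how the caller uses the result, which the snippet does not show.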
