
H100 NVL support #4

@vishnukarthikl

Description

Hi all, I want to check whether gpu-admin-tools supports the H100 NVL variant of GPUs. I tried to test this but am getting a few errors (GPU broken / operation not permitted). Since I haven't used this tool before, I might be using it wrong, or there could be issues at the system level that I am not aware of.

Machine setup:

  1. GPU: 2 H100 NVL GPUs
  2. OS: Rocky Linux 8.6 with SELinux in permissive mode
  3. Nvidia Driver: 535.183.01
nvidia-smi
Mon Aug 19 20:46:03 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                Off | 00000000:4E:00.0 Off |                    0 |
| N/A   37C    P0              69W / 400W |      0MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 NVL                Off | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0              63W / 400W |      0MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
nvidia-smi topo -m
        GPU0   GPU1   NIC0   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0     X     NV12   NODE   0-23,48-71     0               N/A
GPU1    NV12    X     NODE   0-23,48-71     0               N/A
NIC0    NODE   NODE    X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0

Run:

sudo ./nvidia_gpu_tools.py --gpu 1 --block-all-nvlinks
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu', '1', '--block-all-nvlinks']
  File "./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "./nvidia_gpu_tools.py", line 3657, in __init__
    self.set_command_memory(True)
  File "./nvidia_gpu_tools.py", line 1617, in set_command_memory
    self.command["MEMORY"] = 1 if enable else 0
  File "./nvidia_gpu_tools.py", line 1266, in __setitem__
    self._write()
  File "./nvidia_gpu_tools.py", line 1257, in _write
    self.dev.write(self.offset, self.value.raw, self.size)
  File "./nvidia_gpu_tools.py", line 154, in write
    os.write(self.fd, data_from_int(data, size))
2024-08-19,20:47:31.187 ERROR    GPU /sys/bus/pci/devices/0000:4e:00.0 broken: [Errno 1] Operation not permitted
2024-08-19,20:47:31.191 ERROR    Config space working True
  File "./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "./nvidia_gpu_tools.py", line 3657, in __init__
    self.set_command_memory(True)
  File "./nvidia_gpu_tools.py", line 1617, in set_command_memory
    self.command["MEMORY"] = 1 if enable else 0
  File "./nvidia_gpu_tools.py", line 1266, in __setitem__
    self._write()
  File "./nvidia_gpu_tools.py", line 1257, in _write
    self.dev.write(self.offset, self.value.raw, self.size)
  File "./nvidia_gpu_tools.py", line 154, in write
    os.write(self.fd, data_from_int(data, size))
2024-08-19,20:47:31.206 ERROR    GPU /sys/bus/pci/devices/0000:62:00.0 broken: [Errno 1] Operation not permitted
2024-08-19,20:47:31.210 ERROR    Config space working True
GPUs:
  0 GPU 0000:4e:00.0 [broken, cfg space working 1 bars configured 1]
  1 GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1]
Other:
Topo:
  Intel root port 0000:4d:01.0
   GPU 0000:4e:00.0 ? 0x2321 BAR0 0x212002000000
   GPU 0000:4e:00.0 [broken, cfg space working 1 bars configured 1]
  Intel root port 0000:61:01.0
   GPU 0000:62:00.0 ? 0x2321 BAR0 0x216002000000
   GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1]
2024-08-19,20:47:31.210 INFO     Selected GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1]
2024-08-19,20:47:31.210 ERROR    GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1] is broken and --recover-broken-gpu was not specified, returning failure.
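For context on where this fails: reading the traceback, the `os.write` that raises EPERM happens while `set_command_memory` updates the device's PCI command register (it sets `command["MEMORY"]`, i.e. the Memory Space Enable bit), and the tool then marks the GPU broken. Below is a minimal sketch of that register update against an in-memory buffer; the offsets and bit positions come from the standard PCI config-space layout, but the helper name is mine, not from nvidia_gpu_tools.py:

```python
import struct

# Standard PCI configuration-space layout: the 16-bit command register
# sits at offset 0x04, and bit 1 is Memory Space Enable.
PCI_COMMAND_OFFSET = 0x04
PCI_COMMAND_MEMORY = 1 << 1

def set_memory_enable(config: bytearray, enable: bool) -> None:
    """Toggle the Memory Space Enable bit in a copy of PCI config space.

    The real tool performs the equivalent read-modify-write against
    /sys/bus/pci/devices/<bdf>/config, which is where EPERM surfaces.
    """
    (cmd,) = struct.unpack_from("<H", config, PCI_COMMAND_OFFSET)
    cmd = cmd | PCI_COMMAND_MEMORY if enable else cmd & ~PCI_COMMAND_MEMORY
    struct.pack_into("<H", config, PCI_COMMAND_OFFSET, cmd)

cfg = bytearray(64)  # stand-in for the first 64 bytes of config space
set_memory_enable(cfg, True)
print(hex(struct.unpack_from("<H", cfg, PCI_COMMAND_OFFSET)[0]))  # -> 0x2
```

Since the write fails with EPERM even under sudo, the kernel itself is likely rejecting config-space writes. One common cause is kernel lockdown (often enabled automatically with Secure Boot); the active mode can be checked in `/sys/kernel/security/lockdown`.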

Thanks!
