
H100 NVL support #4

@vishnukarthikl

Description

Hi all, I want to check whether gpu-admin-tools supports the H100 NVL variant of GPUs. I tried to test this but am getting a few errors (GPU broken / operation not permitted). Since I haven't used this tool before, I might be using it wrong, or there could be issues at the system level that I am not aware of.

Machine setup:

  1. GPU: 2 H100 NVL GPUs
  2. OS: Rocky Linux 8.6 with SELinux in permissive mode
  3. Nvidia Driver: 535.183.01
nvidia-smi
Mon Aug 19 20:46:03 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                Off | 00000000:4E:00.0 Off |                    0 |
| N/A   37C    P0              69W / 400W |      0MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 NVL                Off | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0              63W / 400W |      0MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
nvidia-smi topo -m
        GPU0   GPU1   NIC0   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0     X     NV12   NODE   0-23,48-71     0               N/A
GPU1    NV12    X     NODE   0-23,48-71     0               N/A
NIC0    NODE   NODE    X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0

Run:

sudo ./nvidia_gpu_tools.py --gpu 1 --block-all-nvlinks
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu', '1', '--block-all-nvlinks']
  File "./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "./nvidia_gpu_tools.py", line 3657, in __init__
    self.set_command_memory(True)
  File "./nvidia_gpu_tools.py", line 1617, in set_command_memory
    self.command["MEMORY"] = 1 if enable else 0
  File "./nvidia_gpu_tools.py", line 1266, in __setitem__
    self._write()
  File "./nvidia_gpu_tools.py", line 1257, in _write
    self.dev.write(self.offset, self.value.raw, self.size)
  File "./nvidia_gpu_tools.py", line 154, in write
    os.write(self.fd, data_from_int(data, size))
2024-08-19,20:47:31.187 ERROR    GPU /sys/bus/pci/devices/0000:4e:00.0 broken: [Errno 1] Operation not permitted
2024-08-19,20:47:31.191 ERROR    Config space working True
  File "./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "./nvidia_gpu_tools.py", line 3657, in __init__
    self.set_command_memory(True)
  File "./nvidia_gpu_tools.py", line 1617, in set_command_memory
    self.command["MEMORY"] = 1 if enable else 0
  File "./nvidia_gpu_tools.py", line 1266, in __setitem__
    self._write()
  File "./nvidia_gpu_tools.py", line 1257, in _write
    self.dev.write(self.offset, self.value.raw, self.size)
  File "./nvidia_gpu_tools.py", line 154, in write
    os.write(self.fd, data_from_int(data, size))
2024-08-19,20:47:31.206 ERROR    GPU /sys/bus/pci/devices/0000:62:00.0 broken: [Errno 1] Operation not permitted
2024-08-19,20:47:31.210 ERROR    Config space working True
GPUs:
  0 GPU 0000:4e:00.0 [broken, cfg space working 1 bars configured 1]
  1 GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1]
Other:
Topo:
  Intel root port 0000:4d:01.0
   GPU 0000:4e:00.0 ? 0x2321 BAR0 0x212002000000
   GPU 0000:4e:00.0 [broken, cfg space working 1 bars configured 1]
  Intel root port 0000:61:01.0
   GPU 0000:62:00.0 ? 0x2321 BAR0 0x216002000000
   GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1]
2024-08-19,20:47:31.210 INFO     Selected GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1]
2024-08-19,20:47:31.210 ERROR    GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1] is broken and --recover-broken-gpu was not specified, returning failure.
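For context on where this fails: reading the traceback, the `os.write` that raises EPERM happens while `set_command_memory` updates the device's PCI command register (it sets `command["MEMORY"]`, i.e. the Memory Space Enable bit), and the tool then marks the GPU broken. Below is a minimal sketch of that register update against an in-memory buffer; the offsets and bit positions come from the standard PCI config-space layout, but the helper name is mine, not from nvidia_gpu_tools.py:

```python
import struct

# Standard PCI configuration-space layout: the 16-bit command register
# sits at offset 0x04, and bit 1 is Memory Space Enable.
PCI_COMMAND_OFFSET = 0x04
PCI_COMMAND_MEMORY = 1 << 1

def set_memory_enable(config: bytearray, enable: bool) -> None:
    """Toggle the Memory Space Enable bit in a copy of PCI config space.

    The real tool performs the equivalent read-modify-write against
    /sys/bus/pci/devices/<bdf>/config, which is where EPERM surfaces.
    """
    (cmd,) = struct.unpack_from("<H", config, PCI_COMMAND_OFFSET)
    cmd = cmd | PCI_COMMAND_MEMORY if enable else cmd & ~PCI_COMMAND_MEMORY
    struct.pack_into("<H", config, PCI_COMMAND_OFFSET, cmd)

cfg = bytearray(64)  # stand-in for the first 64 bytes of config space
set_memory_enable(cfg, True)
print(hex(struct.unpack_from("<H", cfg, PCI_COMMAND_OFFSET)[0]))  # -> 0x2
```

Since the write fails with EPERM even under sudo, the kernel itself is likely rejecting config-space writes. One common cause is kernel lockdown (often enabled automatically with Secure Boot); the active mode can be checked in `/sys/kernel/security/lockdown`.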

Thanks!
