Hi all, I want to check whether gpu-admin-tools supports the H100 NVL variant of GPUs. I tried it out, but I am getting a few errors (GPU broken / operation not permitted). Since I haven't used this tool before, I might be using it wrong, or there could be issues at the system level that I am not aware of.
Machine setup:
- GPUs: 2× NVIDIA H100 NVL
- OS: Rocky Linux 8.6 with SELinux in permissive mode
- NVIDIA driver: 535.183.01
```
$ nvidia-smi
Mon Aug 19 20:46:03 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                Off | 00000000:4E:00.0 Off |                    0 |
| N/A   37C    P0              69W / 400W |      0MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 NVL                Off | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0              63W / 400W |      0MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```
```
$ nvidia-smi topo -m
        GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NODE    0-23,48-71      0               N/A
GPU1    NV12     X      NODE    0-23,48-71      0               N/A
NIC0    NODE    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
```
Run:

```
$ sudo ./nvidia_gpu_tools.py --gpu 1 --block-all-nvlinks
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu', '1', '--block-all-nvlinks']
  File "./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "./nvidia_gpu_tools.py", line 3657, in __init__
    self.set_command_memory(True)
  File "./nvidia_gpu_tools.py", line 1617, in set_command_memory
    self.command["MEMORY"] = 1 if enable else 0
  File "./nvidia_gpu_tools.py", line 1266, in __setitem__
    self._write()
  File "./nvidia_gpu_tools.py", line 1257, in _write
    self.dev.write(self.offset, self.value.raw, self.size)
  File "./nvidia_gpu_tools.py", line 154, in write
    os.write(self.fd, data_from_int(data, size))
2024-08-19,20:47:31.187 ERROR GPU /sys/bus/pci/devices/0000:4e:00.0 broken: [Errno 1] Operation not permitted
2024-08-19,20:47:31.191 ERROR Config space working True
  File "./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "./nvidia_gpu_tools.py", line 3657, in __init__
    self.set_command_memory(True)
  File "./nvidia_gpu_tools.py", line 1617, in set_command_memory
    self.command["MEMORY"] = 1 if enable else 0
  File "./nvidia_gpu_tools.py", line 1266, in __setitem__
    self._write()
  File "./nvidia_gpu_tools.py", line 1257, in _write
    self.dev.write(self.offset, self.value.raw, self.size)
  File "./nvidia_gpu_tools.py", line 154, in write
    os.write(self.fd, data_from_int(data, size))
2024-08-19,20:47:31.206 ERROR GPU /sys/bus/pci/devices/0000:62:00.0 broken: [Errno 1] Operation not permitted
2024-08-19,20:47:31.210 ERROR Config space working True
GPUs:
  0 GPU 0000:4e:00.0 [broken, cfg space working 1 bars configured 1]
  1 GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1]
Other:
Topo:
  Intel root port 0000:4d:01.0
    GPU 0000:4e:00.0 ? 0x2321 BAR0 0x212002000000
    GPU 0000:4e:00.0 [broken, cfg space working 1 bars configured 1]
  Intel root port 0000:61:01.0
    GPU 0000:62:00.0 ? 0x2321 BAR0 0x216002000000
    GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1]
2024-08-19,20:47:31.210 INFO Selected GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1]
2024-08-19,20:47:31.210 ERROR GPU 0000:62:00.0 [broken, cfg space working 1 bars configured 1] is broken and --recover-broken-gpu was not specified, returning failure.
```
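For what it's worth, the traceback shows the tool failing on its very first config-space write (setting the Memory Space Enable bit of the PCI COMMAND register), and EPERM on that write even under `sudo` is often an environment restriction (for example kernel lockdown on Secure Boot systems) rather than a broken GPU. Below is a minimal sketch, independent of gpu-admin-tools, that attempts the equivalent write through sysfs; the device path is taken from the log above and is an assumption about the target system:

```python
import os
import struct

def read_command_register(cfg_path):
    # The PCI COMMAND register is 16 bits, little-endian, at offset 0x04
    # of the device's config space.
    with open(cfg_path, "rb") as f:
        f.seek(0x04)
        (cmd,) = struct.unpack("<H", f.read(2))
    return cmd

def try_enable_memory(cfg_path):
    # Set bit 1 (Memory Space Enable) and write the register back.
    # A PermissionError here reproduces the failure seen in
    # gpu-admin-tools' set_command_memory().
    cmd = read_command_register(cfg_path)
    try:
        fd = os.open(cfg_path, os.O_RDWR)
        try:
            os.pwrite(fd, struct.pack("<H", cmd | 0x2), 0x04)
        finally:
            os.close(fd)
        return "write ok"
    except PermissionError as e:
        return f"write rejected: {e}"

if __name__ == "__main__":
    # Device path from the log above; adjust for your system.
    path = "/sys/bus/pci/devices/0000:62:00.0/config"
    if os.path.exists(path):
        print(hex(read_command_register(path)), try_enable_memory(path))
```

If this write is rejected even as root, it may be worth checking `cat /sys/kernel/security/lockdown` (on kernels built with the lockdown LSM) and whether Secure Boot is enabled, since lockdown blocks direct PCI config-space writes from userspace.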
Thanks!