Hi there!
This tool is the closest I've come to a method for discovering whether my hypervisors have NVLinks installed on my H100 PCIe GPUs. I use VFIO with no MIG, so there is no NVIDIA driver on the hypervisor.
I noticed that neither the H100 PCIe nor the A100 actually supports the --nvlink-debug-dump flag:
- The H100 PCIe (0x2331) is not in either device list in `read_module_id_h100()` (`gpu-admin-tools/nvidia_gpu_tools.py`, lines 4871 and 4881 at commit `e07d271`):

  ```python
  if self.device in [0x2330, 0x2336, 0x2324, 0x233f]:
  ```

  ```python
  if self.device in [0x2330, 0x2336, 0x233f]:
  ```
- The A100 entry doesn't have `links_per_group`, `base_offset`, or `per_group_offset` present (`gpu-admin-tools/nvidia_gpu_tools.py`, lines 718 to 729 at commit `e07d271`):

  ```python
  "A100": {
      "name": "A100",
      "arch": "ampere",
      "pmu_reset_in_pmc": False,
      "memory_clear_supported": True,
      "forcing_ecc_on_after_reset_supported": True,
      "nvdec": [0, 1, 2, 3, 4],
      "nvenc": [],
      "other_falcons": ["sec", "gsp"],
      "nvlink": {
          "number": 12,
      },
  ```
I'm not a hardware engineer and I don't know the A100's register specs, but I was able to get the H100 PCIe working* simply by adding 0x2331 to the lists at the lines above.
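To make the change concrete, here is a sketch of the patched check. The device IDs other than 0x2331 come from the existing lists quoted above; `supports_nvlink_debug_dump` is a hypothetical helper name I'm using for illustration, not a function in `nvidia_gpu_tools.py`:

```python
# Device lists from read_module_id_h100(), with the H100 PCIe ID (0x2331)
# appended as described in this issue.
H100_MODULE_ID_DEVICES = [0x2330, 0x2336, 0x2324, 0x233f, 0x2331]

def supports_nvlink_debug_dump(device_id: int) -> bool:
    # Mirrors the `if self.device in [...]` membership check in the tool.
    return device_id in H100_MODULE_ID_DEVICES

print(supports_nvlink_debug_dump(0x2331))  # True with the patch applied
```

The same one-line addition would apply to the second list at line 4881.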
As for the A100, I've been able to get it to work intermittently by reusing the H100's offsets, but it's really hit and miss.
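Presumably formal support would mean filling in an `nvlink` block for the A100 shaped like the one below. The field names are the ones named earlier in this issue; the numeric values here are placeholders only, not verified A100 register offsets:

```python
# Hypothetical sketch of a completed A100 nvlink entry.
# "number" comes from the existing table; the other values are
# placeholders standing in for the real (unknown) A100 offsets.
A100_NVLINK = {
    "number": 12,
    "links_per_group": 4,      # placeholder
    "base_offset": 0x0,        # placeholder
    "per_group_offset": 0x0,   # placeholder
}
```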
I'm hoping to get some insight into whether it'd be possible to add formal support for this. Thank you!
* No clue if it's accurate or not, but it's proven reliable enough for me to know whether NVLink bridges are installed on GPU pairs.