NVLink debug data with Ampere A100 and H100 PCIe? #5

@Banshee1221

Description

Hi there!

This tool is the closest thing I've found to a method for discovering whether my hypervisors have NVLink bridges installed on my H100 PCIe GPUs. I use VFIO with no MIG, so there is no NVIDIA driver on the hypervisor.

I noticed that neither the H100 PCIe nor the A100 actually supports the --nvlink-debug-dump flag:

  • H100 PCIe (0x2331) is not added to the list of devices in read_module_id_h100()
  • A100 doesn't have the links_per_group, base_offset, or per_group_offset keys present in its entry:

        "A100": {
            "name": "A100",
            "arch": "ampere",
            "pmu_reset_in_pmc": False,
            "memory_clear_supported": True,
            "forcing_ecc_on_after_reset_supported": True,
            "nvdec": [0, 1, 2, 3, 4],
            "nvenc": [],
            "other_falcons": ["sec", "gsp"],
            "nvlink": {
                "number": 12,
            },
        },

I'm not a hardware engineer and I don't know the specs for the registers of the A100, but I was able to get the H100 PCIe working* by simply adding 0x2331 to the lists in the above lines.
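For clarity, here is a minimal sketch of the kind of change I made. This is not the actual tool's source; the list name and the existing 0x2330 entry are assumptions on my part, and the only point is that read_module_id_h100() gates on a device-ID list that doesn't include the PCIe variant:

```python
# Hedged sketch, not the real code: the fix amounts to appending the
# H100 PCIe PCI device ID to the allow-list consulted by read_module_id_h100().
H100_DEVICE_IDS = [
    0x2330,  # existing H100 SXM entry (assumed)
    0x2331,  # H100 PCIe -- the ID this issue asks to add
]

def is_h100_supported(device_id: int) -> bool:
    """Return True when the device ID is covered by the H100 code path."""
    return device_id in H100_DEVICE_IDS
```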

As for the A100, I've been able to get it to work intermittently by reusing the same offsets as the H100, but it's really hit and miss.
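To make the gap concrete, here is a sketch of what a complete A100 nvlink entry might look like. The three layout values are unknown to me for Ampere, so they are left as None placeholders; the real values would have to come from someone with access to the register specs:

```python
# Hedged sketch of a completed A100 "nvlink" sub-dict. The three layout
# values below are UNKNOWN for Ampere and are deliberate None placeholders,
# not real register offsets.
A100_NVLINK = {
    "number": 12,              # from the existing entry
    "links_per_group": None,   # unknown for Ampere (placeholder)
    "base_offset": None,       # unknown for Ampere (placeholder)
    "per_group_offset": None,  # unknown for Ampere (placeholder)
}
```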

I'm hoping to get some insight on whether it'd be possible to get formal support for this? Thank you!

* No clue if it's accurate or not, but it's proven reliable enough for me to know whether NVLink bridges are installed on GPU pairs.
