uvm_pmm_gpu_device_p2p_init() doesn't correctly refcount p2pdma pages #1023

@tim-day-387

NVIDIA Open GPU Kernel Modules Version

590.48.01

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Amazon Linux 2023

Kernel Release

I've tested multiple custom-built kernels (> 6.16).

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA H100 80GB HBM3

Describe the bug

After:

commit b7e2823787735ca009e63f35f164b46df0ef096c
Author: Alistair Popple <apopple@nvidia.com>
Date: Fri Feb 28 14:31:05 2025 +1100

mm/mm_init: move p2pdma page refcount initialisation to p2pdma

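p2pdma pages are not being refcounted correctly. That commit moved p2pdma page refcount initialisation out of mm_init and into the p2pdma code, and uvm_pmm_gpu_device_p2p_init() allocates PCI P2PDMA pages directly, so it presumably never initialises their refcount. This causes CUDA to incorrectly conclude that p2pdma is not supported. Applying a change like: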

diff --git a/kernel-open/nvidia-uvm/uvm_pmm_gpu.c b/kernel-open/nvidia-uvm/uvm_pmm_gpu.c
index 97ff13dc..9585ad0d 100644
--- a/kernel-open/nvidia-uvm/uvm_pmm_gpu.c
+++ b/kernel-open/nvidia-uvm/uvm_pmm_gpu.c
@@ -3352,8 +3352,10 @@ void uvm_pmm_gpu_device_p2p_init(uvm_parent_gpu_t *parent_gpu)
     // allocate PCI P2PDMA pages directly
     p2p_page = pfn_to_page(pci_start_pfn);
     page_pgmap(p2p_page)->ops = &uvm_device_p2p_pgmap_ops;
-    for (; page_to_pfn(p2p_page) < pci_end_pfn; p2p_page++)
+    for (; page_to_pfn(p2p_page) < pci_end_pfn; p2p_page++) {
         p2p_page->zone_device_data = NULL;
+        set_page_count(p2p_page, 1);
+    }
 
     parent_gpu->device_p2p_initialised = true;
 }

appears to fix the issue.
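
To illustrate why a missing refcount initialisation could make p2pdma look unsupported, here is a minimal userspace sketch. It is not kernel code: `struct model_page` and every `model_*` function are hypothetical stand-ins, and it assumes (without confirmation from the driver or CUDA sources) that some consumer takes page references with a check similar to the kernel's `get_page_unless_zero()`, which refuses pages whose refcount is zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the refcount is modelled. */
struct model_page {
    int refcount;
};

/* Loosely models get_page_unless_zero(): only take a reference
 * if the page already has a non-zero refcount. */
bool model_try_get_page(struct model_page *page)
{
    if (page->refcount == 0)
        return false;
    page->refcount++;
    return true;
}

/* Drop a reference previously taken with model_try_get_page(). */
void model_put_page(struct model_page *page)
{
    assert(page->refcount > 0);
    page->refcount--;
}

/* Models the init loop from the diff above. With set_count == false the
 * pages keep a refcount of 0 (assumed state after the commit); with
 * set_count == true it mirrors the proposed set_page_count(p2p_page, 1). */
void model_p2p_init(struct model_page *pages, size_t n, bool set_count)
{
    for (size_t i = 0; i < n; i++) {
        pages[i].refcount = 0;
        if (set_count)
            pages[i].refcount = 1;
    }
}

/* Count how many pages a consumer could successfully take a reference on. */
size_t model_count_pinnable(struct model_page *pages, size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        if (model_try_get_page(&pages[i])) {
            ok++;
            model_put_page(&pages[i]);
        }
    }
    return ok;
}
```

In this model, `model_p2p_init(pages, n, false)` leaves no page pinnable, while `model_p2p_init(pages, n, true)` makes all `n` pages pinnable, which matches the observation that setting the page count to 1 appears to fix the issue.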

To Reproduce

Run CUFILE_USE_PCIP2PDMA=1 /usr/local/cuda/gds/tools/gdscheck -p. The output does not report p2pdma as supported.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response
