
Kernel panic at shutdown due to unsafe use of nvkms-kapi.h:struct NvKmsKapiCallbacks #1033

@roman-zilka

Description


NVIDIA Open GPU Kernel Modules Version

590.48.01 and others

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Gentoo Linux

Kernel Release

6.18.12-gentoo, custom build

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GeForce RTX 5070 and others

Describe the bug

In the downstream Gentoo Linux bug https://bugs.gentoo.org/969413, multiple users (including myself) have reported freezes or kernel panics during system shutdown with various GPU driver versions, including 590.48.01, and various kernel versions, including 6.18.12.

We've discovered that when the kernel is built with RANDSTRUCT_FULL or RANDSTRUCT_PERFORMANCE, the GPU driver causes the system to do one of the following during shutdown/reboot, just before the physical power-down:

  • nothing unusual
  • freeze
  • kernel panic

[Screenshot: kernel message displayed during shutdown]

Which of the three behaviors occurs is tied to the kernel build. Disabling RANDSTRUCT in the kernel configuration fixes the issue: poweroff is then always clean. The message in the attached screenshot led me to identify struct NvKmsKapiCallbacks, declared in nvkms-kapi.h, as the source of the bug: a function pointer stored in it is called unsafely somewhere, and when RANDSTRUCT shuffles the struct's members, the wrong function is called instead. (RANDSTRUCT automatically randomizes structs that consist entirely of function pointers unless they are annotated with __no_randomize_layout, so this struct is shuffled even though it carries no randomization attribute.) In the screenshot, the function called is the suspend/resume callback, which is out of place on my system because I don't use suspend. When a freeze happens instead of a panic, the function called is presumably probe.

The following patch disables randomization of this struct and prevents the freeze/panic with RANDSTRUCT enabled, which confirms the cause of the bug. The effect of the patch has been verified by me (590.48.01 open driver, 6.18.12, RTX 5070) and, so far, by one other Gentoo user (https://bugs.gentoo.org/969413#c41, 580.126.09 proprietary driver, unknown GPU, unknown kernel). Of course, it cannot be considered a fix, only a workaround.

diff -Naur work/kernel/common/inc/nvkms-kapi.h work-new/kernel/common/inc/nvkms-kapi.h
--- work/kernel/common/inc/nvkms-kapi.h	2025-12-08 13:50:32.000000000 +0100
+++ work-new/kernel/common/inc/nvkms-kapi.h	2026-02-17 17:13:43.980685382 +0100
@@ -603,7 +603,7 @@
     void (*suspendResume)(NvBool suspend);
     void (*remove)(NvU32 gpuId);
     void (*probe)(const struct NvKmsKapiGpuInfo *gpu_info);
-};
+} __attribute__((no_randomize_layout));
 
 struct NvKmsKapiFunctionsTable {
 
diff -Naur work/kernel-module-source/kernel-open/common/inc/nvkms-kapi.h work-new/kernel-module-source/kernel-open/common/inc/nvkms-kapi.h
--- work/kernel-module-source/kernel-open/common/inc/nvkms-kapi.h	2025-12-08 13:51:05.000000000 +0100
+++ work-new/kernel-module-source/kernel-open/common/inc/nvkms-kapi.h	2026-02-17 17:13:08.037511054 +0100
@@ -603,7 +603,7 @@
     void (*suspendResume)(NvBool suspend);
     void (*remove)(NvU32 gpuId);
     void (*probe)(const struct NvKmsKapiGpuInfo *gpu_info);
-};
+} __attribute__((no_randomize_layout));
 
 struct NvKmsKapiFunctionsTable {
 
diff -Naur work/kernel-module-source/src/nvidia-modeset/kapi/interface/nvkms-kapi.h work-new/kernel-module-source/src/nvidia-modeset/kapi/interface/nvkms-kapi.h
--- work/kernel-module-source/src/nvidia-modeset/kapi/interface/nvkms-kapi.h	2025-12-08 13:49:16.000000000 +0100
+++ work-new/kernel-module-source/src/nvidia-modeset/kapi/interface/nvkms-kapi.h	2026-02-17 17:13:23.621208451 +0100
@@ -603,7 +603,7 @@
     void (*suspendResume)(NvBool suspend);
     void (*remove)(NvU32 gpuId);
     void (*probe)(const struct NvKmsKapiGpuInfo *gpu_info);
-};
+} __attribute__((no_randomize_layout));
 
 struct NvKmsKapiFunctionsTable {
 
diff -Naur work/kernel-open/common/inc/nvkms-kapi.h work-new/kernel-open/common/inc/nvkms-kapi.h
--- work/kernel-open/common/inc/nvkms-kapi.h	2025-12-08 13:51:05.000000000 +0100
+++ work-new/kernel-open/common/inc/nvkms-kapi.h	2026-02-17 17:13:33.981717353 +0100
@@ -603,7 +603,7 @@
     void (*suspendResume)(NvBool suspend);
     void (*remove)(NvU32 gpuId);
     void (*probe)(const struct NvKmsKapiGpuInfo *gpu_info);
-};
+} __attribute__((no_randomize_layout));
 
 struct NvKmsKapiFunctionsTable {
 

To Reproduce

  1. Compile kernel 6.18.12 with CONFIG_RANDSTRUCT_FULL=y.
  2. Compile the open GPU driver 590.48.01 for that kernel.
  3. Boot the new kernel.
  4. Shut the system down and observe one of: a normal shutdown, a freeze, or a kernel panic. The same behavior occurs on every shutdown with a given kernel build. Each of the three outcomes has some probability, and which one occurs is decided at kernel build time. If the shutdown is clean, repeat the whole process until a freeze or panic happens.

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response
