System hangs on reboot — NVIDIA open kernel module PCI shutdown handler hangs after PCIe bus error (RTX A1000) #1027

@ElCoyote27

Description

NVIDIA Open GPU Kernel Modules Version

590.48.01

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Fedora 42 (Adams)

Kernel Release

6.18.9-100.fc42.x86_64

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA RTX A1000 (GA107GL) [10de:25b0] (rev a1)

Describe the bug

System Information

Component        Details
OS               Fedora 42 (Adams)
Kernel           6.18.9-100.fc42.x86_64
Hardware         Lenovo ThinkStation P350
GPU              NVIDIA RTX A1000 (GA107GL) [10de:25b0] (rev a1)
Driver           NVIDIA 590.48.01, open kernel modules (Dual MIT/GPL)
Driver source    negativo17 repo, akmod-nvidia-590.48.01-3.fc42.x86_64
Display server   Xorg with GDM (graphical.target, autologin enabled)

Problem Description

When issuing /sbin/reboot, the system hangs indefinitely during the shutdown sequence. The machine remains pingable (kernel and network stack are alive) but SSH is refused (sshd has already been stopped) and the reboot never completes. The only recovery is a physical power cycle.

Initially this appeared to occur only after kernel updates, but it has since become reproducible on every reboot.

Journal Evidence

The following was captured from journalctl -b -1 after a power-cycle recovery. During shutdown, after X/GDM stops, the NVIDIA GPU throws a PCIe bus error:

Feb 15 12:03:39 hidal /usr/libexec/gdm-x-session[2319]: (II) NVIDIA(GPU-0): Deleting GPU-0
Feb 15 12:03:39 hidal /usr/libexec/gdm-x-session[2319]: (WW) xf86CloseConsole: KDSETMODE failed: Input/output error
Feb 15 12:03:39 hidal /usr/libexec/gdm-x-session[2319]: (WW) xf86CloseConsole: VT_GETMODE failed: Input/output error
Feb 15 12:03:39 hidal kernel: nvidia 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Feb 15 12:03:39 hidal kernel: nvidia 0000:01:00.0: device [10de:25b0] error status/mask=00000001/0000a000
Feb 15 12:03:39 hidal kernel: nvidia 0000:01:00.0: [ 0] RxErr (First)
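For anyone trying to check whether their own machine logs the same error, the lines above can be isolated from the previous boot's kernel log with a small filter. This is only a sketch: the function name filter_aer_lines is hypothetical, and the patterns are taken solely from the three kernel messages quoted above, so other AER error types will not match.

```shell
#!/bin/sh
# Hypothetical helper: keep only PCIe AER lines of the kind quoted above.
# Reads log text on stdin and writes the matching lines to stdout.
filter_aer_lines() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'PCIe Bus Error|error status/mask=|RxErr' || true
}

# Usage (kernel messages from the previous boot):
#   journalctl -k -b -1 --no-pager | filter_aer_lines
```

An empty result only means that none of these three patterns appeared in that boot's kernel log.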

The system then proceeds through the shutdown sequence until systemd-shutdown takes over, where it hangs permanently:

Feb 15 12:03:40 hidal systemd[1]: Reached target shutdown.target - System Shutdown.
Feb 15 12:03:40 hidal systemd[1]: Reached target final.target - Late Shutdown Services.
Feb 15 12:03:41 hidal systemd-shutdown[1]: Syncing filesystems and block devices.
Feb 15 12:03:41 hidal systemd-shutdown[1]: Sending SIGTERM to remaining processes...

The journal ends here. The system never completes the reboot.

Diagnosis

Through testing, the following was confirmed:

  • reboot -f works — skipping systemd's shutdown sequence and calling reboot(2) directly always succeeds.
  • Normal reboot with NVIDIA modules unloaded works — after running systemctl isolate multi-user.target followed by rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia, a normal /sbin/reboot completes cleanly.
  • Normal reboot with NVIDIA modules loaded hangs — consistently, every time.

Conclusion: The NVIDIA open kernel module's PCI .shutdown callback hangs when called during the kernel's device shutdown path, likely because the GPU is in a bad state following the PCIe RxErr physical layer error.
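The second test above (reboot with the modules unloaded) could be scripted roughly as follows. This is a sketch under assumptions: both function names are hypothetical, the module list is the one used in the test above, and the check relies on the module name being the first field of each line in /proc/modules.

```shell
#!/bin/sh
# Hypothetical helper: print the names of loaded NVIDIA modules.
# Reads /proc/modules-style text on stdin (module name is the first field).
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Hypothetical wrapper around the manual test described above. It is only
# defined here, never called: running it stops the graphical session,
# unloads the driver, and reboots the machine.
reboot_without_nvidia() {
    systemctl isolate multi-user.target &&
    rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia &&
    [ -z "$(loaded_nvidia_modules < /proc/modules)" ] &&
    /sbin/reboot
}
```

The reboot is attempted only if no module whose name starts with "nvidia" is still listed, so a failed rmmod does not get mistaken for a successful test.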

Workaround

A systemd service that unloads all NVIDIA modules after services have stopped but before the final reboot resolves the issue:

# /etc/systemd/system/nvidia-unload.service

[Unit]
Description=Unload NVIDIA modules during shutdown
DefaultDependencies=no
After=shutdown.target
Before=systemd-reboot.service systemd-poweroff.service systemd-halt.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia 2>/dev/null; exit 0'
TimeoutStartSec=30

[Install]
WantedBy=reboot.target poweroff.target halt.target
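To activate the unit, it has to be registered with systemd once. A minimal sketch, assuming the file was saved at the path shown in its first comment line; the function name enable_nvidia_unload and the SYSTEMCTL override are hypothetical conveniences, not part of systemd:

```shell
#!/bin/sh
# Hypothetical helper: register the unit above with systemd.
# SYSTEMCTL can be overridden (for example SYSTEMCTL=echo) for a dry run.
enable_nvidia_unload() {
    sc=${SYSTEMCTL:-systemctl}
    # Re-read unit files, then create the links requested by [Install].
    "$sc" daemon-reload &&
    "$sc" enable nvidia-unload.service
}

# Usage, as root, after saving the unit file:
#   enable_nvidia_unload
# Dry run that only prints the systemctl arguments:
#   SYSTEMCTL=echo enable_nvidia_unload
```

Enabling, rather than starting, is the relevant step here: the unit is meant to be pulled in by reboot.target, poweroff.target, or halt.target, not run during normal operation.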

To Reproduce

Install Fedora 42 on a machine with an NVIDIA RTX A1000 GPU, use the open GPU kernel modules, and run /sbin/reboot. The system hangs during shutdown.

Bug Incidence

Once

nvidia-bug-report.log.gz

More Info

No response

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
