nvidia-drm: handle -EDEADLK in nv_drm_reset_input_colorspace #1031

Open
jopamo wants to merge 1 commit into NVIDIA:main from jopamo:fix/nvidia-drm-edeadlk-reset-colorspace-clean

jopamo commented Feb 18, 2026

drm_atomic_get_plane_state() and drm_atomic_commit() can return -EDEADLK when ww-mutex deadlock avoidance triggers. The current
nv_drm_reset_input_colorspace() path drops locks and returns without running the required modeset backoff/retry flow.

Rework the function to retry the atomic sequence with drm_modeset_backoff(&ctx), rebuilding atomic state on each retry, and only finish once the sequence succeeds or another error is returned.

Signed-off-by: Paul Moses <p@1g4.org>
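
The backoff/retry flow described in the PR text can be modelled in plain userspace C. This is only a sketch of the control flow, not the patch itself: `mock_ctx` and the `mock_*` functions are hypothetical stand-ins for the acquire context, drm_atomic_get_plane_state(), drm_atomic_commit() and drm_modeset_backoff(), and the mock reports -EDEADLK a fixed number of times purely to exercise the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the modeset acquire context. */
struct mock_ctx {
    int deadlocks_left; /* how many more times the mock reports -EDEADLK */
    int backoffs;       /* number of backoff calls performed */
    int rebuilds;       /* number of times atomic state was (re)built */
};

/* Stand-in for building atomic state (e.g. getting the plane state). */
static int mock_build_state(struct mock_ctx *ctx)
{
    ctx->rebuilds++;
    if (ctx->deadlocks_left > 0) {
        ctx->deadlocks_left--;
        return -EDEADLK;
    }
    return 0;
}

/* Stand-in for the atomic commit; always succeeds in this model. */
static int mock_commit(struct mock_ctx *ctx)
{
    (void)ctx;
    return 0;
}

/* Stand-in for the modeset backoff step. */
static void mock_backoff(struct mock_ctx *ctx)
{
    ctx->backoffs++;
}

/*
 * Retry the whole sequence until it succeeds or fails with an error
 * other than -EDEADLK. State is rebuilt from scratch on every pass.
 */
static int reset_with_retry(struct mock_ctx *ctx)
{
    int ret;

    assert(ctx != NULL);
    for (;;) {
        ret = mock_build_state(ctx);
        if (ret == 0)
            ret = mock_commit(ctx);
        if (ret != -EDEADLK)
            return ret;
        mock_backoff(ctx);
    }
}
```

With `deadlocks_left = 2`, the loop backs off twice and rebuilds state three times before returning 0.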

CLAassistant commented Feb 18, 2026

CLA assistant check
All committers have signed the CLA.

jopamo (Author) commented Feb 18, 2026

[ 52.222066] WARNING: CPU: 16 PID: 2149 at drivers/gpu/drm/drm_modeset_lock.c:278 drm_modeset_drop_locks+0x72/0x80
[ 52.222074] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_hda_codec_nvhdmi snd_hda_codec_hdmi snd_hda_intel iwlmvm btusb snd_hda_codec btmtk btrtl btbcm btintel snd_hda_core ptp snd_intel_dspcfg bluetooth snd_hwdep bridge snd_pcm iwlwifi snd_timer snd rapl wmi_bmof stp i2c_piix4 soundcore k10temp llc mousedev nvidia_uvm(OE) loop hid_logitech_hidpp hid_multitouch hid_logitech_dj nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) video drm_ttm_helper ttm wmi hid_generic usbhid
[ 52.222106] CPU: 16 UID: 0 PID: 2149 Comm: systemd-logind Tainted: G OE 6.18.10 #4 PREEMPT(full) f701ad6788475e32c7d0194fc1c6c2aa36fe3705
[ 52.222110] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 52.222112] RIP: 0010:drm_modeset_drop_locks+0x72/0x80
[ 52.222115] Code: 89 42 08 48 89 10 48 89 1b 48 8d bb 50 ff ff ff 48 89 5b 08 e8 8f c3 b6 00 48 8b 85 a8 00 00 00 4c 39 e0 75 c0 5b 5d 41 5c c3 <0f> 0b eb a4 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 0f 1f 44 00
[ 52.222117] RSP: 0018:ffffd07985213be0 EFLAGS: 00010282
[ 52.222119] RAX: 0000000000000004 RBX: 00000000ffffffdd RCX: 0000000000000001
[ 52.222121] RDX: 0000000000000010 RSI: ffffffff85feb679 RDI: ffffd07985213c00
[ 52.222122] RBP: ffffd07985213c00 R08: 000000000001e8f4 R09: 0000000000000000
[ 52.222123] R10: 0000000000000010 R11: 0000000000000000 R12: ffff8ee2de163898
[ 52.222124] R13: 0000000000000000 R14: ffffd07985213db8 R15: 000000000000641f
[ 52.222125] FS: 00007e340ef64d00(0000) GS:ffff8f01d3635000(0000) knlGS:0000000000000000
[ 52.222127] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 52.222129] CR2: 00007e097570ec70 CR3: 000000012368b000 CR4: 0000000000f50ef0
[ 52.222130] PKRU: 55555554
[ 52.222131] Call Trace:
[ 52.222133] <TASK>
[ 52.222137] nv_drm_register_drm_device+0x1a2/0x1240 [nvidia_drm 0a7b62a05317cd80f21dac2767c3383f9905171d]
[ 52.222144] ? drm_setmaster_ioctl+0x190/0x190
[ 52.222148] nv_drm_register_drm_device+0xdd8/0x1240 [nvidia_drm 0a7b62a05317cd80f21dac2767c3383f9905171d]
[ 52.222150] drm_dropmaster_ioctl+0xa9/0x140
[ 52.222153] drm_ioctl_kernel+0xaa/0x110
[ 52.222157] drm_ioctl+0x260/0x510
[ 52.222159] ? drm_setmaster_ioctl+0x190/0x190
[ 52.222164] __x64_sys_ioctl+0x419/0x980
[ 52.222169] do_syscall_64+0x96/0xa80
[ 52.222173] ? trace_hardirqs_on_prepare+0x80/0xc0
[ 52.222177] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 52.222180] RIP: 0033:0x7e340e723089
[ 52.222182] Code: 00 00 00 48 89 44 24 18 48 8d 44 24 60 c7 04 24 18 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1e 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 52.222183] RSP: 002b:00007ffd23eb6410 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 52.222185] RAX: ffffffffffffffda RBX: 00005a7cc7e8a3b0 RCX: 00007e340e723089
[ 52.222187] RDX: 0000000000000000 RSI: 000000000000641f RDI: 0000000000000020
[ 52.222188] RBP: 00005a7cc7e8bef0 R08: 00005a7cc7e8a3b0 R09: 0000000000000000
[ 52.222189] R10: 0000000000000006 R11: 0000000000000202 R12: 00005a7cc7e81020
[ 52.222190] R13: 00005a7cc7e74190 R14: 00007ffd23eb6560 R15: 0000000000000000
[ 52.222194] </TASK>
[ 52.222195] irq event stamp: 0
[ 52.222196] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[ 52.222198] hardirqs last disabled at (0): [] copy_process+0xa38/0x21c0
[ 52.222202] softirqs last enabled at (0): [] copy_process+0xa38/0x21c0
[ 52.222204] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 52.222205] ---[ end trace 0000000000000000 ]---
[ 52.222207] [drm] [nvidia-drm] [GPU ID 0x00002d00] nv_drm_reset_input_colorspace failed with error code: -35 !
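
The "-35" in the last log line is a negated errno value. As a quick check, and assuming a Linux userspace where EDEADLK is defined as 35 (errno numbering differs on other platforms), a small helper can map such a code back to its name. `errno_name` is a hypothetical helper written for this illustration and covers only a few values.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Map a negated errno value, as kernel code returns it, to a symbolic
 * name. Only a handful of codes are covered; anything else is "unknown".
 */
static const char *errno_name(int err)
{
    switch (-err) {
    case EDEADLK:
        return "EDEADLK";
    case EINTR:
        return "EINTR";
    case ENOMEM:
        return "ENOMEM";
    default:
        return "unknown";
    }
}
```

On Linux, `errno_name(-35)` yields "EDEADLK", which matches the error reported by nv_drm_reset_input_colorspace above.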

aritger (Collaborator) commented Feb 19, 2026

Thanks for the patch and backtrace. What are the steps you're using, or particular configuration, to trigger the problem?

jopamo (Author) commented Feb 19, 2026

GPU/driver: 5060 Ti — 590.48.01
Kernel: 6.18.10

This happened during boot while I was debugging an unrelated kernel module (act_gate). I wasn’t actively exercising the DRM stack; the warning showed up as part of early system bring-up with heavy lock debugging enabled.

From what I can tell, the relevant locking contract in the kernel source is very explicit about -EDEADLK handling in the DRM atomic + ww-mutex paths:

  • include/drm/drm_modeset_lock.h documents that callers must back off and retry on -EDEADLK.
  • drivers/gpu/drm/drm_modeset_lock.c makes it clear that drm_modeset_backoff() is required when ww deadlock avoidance triggers.
  • The same file also warns that attempting to acquire additional locks after -EDEADLK without first backing off is invalid.
  • drivers/gpu/drm/drm_atomic.c documents that both drm_atomic_get_plane_state() and drm_atomic_commit() may return -EDEADLK, and that the entire atomic sequence must be restarted.
  • include/linux/ww_mutex.h specifies that held ww locks must be released and the slowpath retry used when deadlock detection fires.
  • Documentation/locking/ww-mutex-design.rst shows the canonical retry pattern for -EDEADLK.
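
One way to picture the rule in the bullets above (no further locking, and no plain unlock, after -EDEADLK until a backoff has happened) is a small state model. This is a userspace sketch with hypothetical names (`model_ctx`, `model_lock`, `model_backoff`, `model_drop_locks`); it assumes the acquire context remembers that it is contended until backoff clears that state, and it counts misuse instead of emitting a kernel warning.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_ctx {
    bool contended; /* set when a lock attempt returned -EDEADLK */
    int held;       /* number of locks currently held */
    int warnings;   /* times the model flagged misuse */
};

/* Try to take one lock; `conflict` simulates deadlock avoidance firing. */
static int model_lock(struct model_ctx *ctx, bool conflict)
{
    assert(ctx != NULL);
    if (ctx->contended) {
        /* Locking again before backing off is flagged as misuse. */
        ctx->warnings++;
        return -EDEADLK;
    }
    if (conflict) {
        ctx->contended = true;
        return -EDEADLK;
    }
    ctx->held++;
    return 0;
}

/* Backoff: release everything held and clear the contended state. */
static void model_backoff(struct model_ctx *ctx)
{
    ctx->held = 0;
    ctx->contended = false;
}

/* Final unlock: flagged as misuse if the context is still contended. */
static void model_drop_locks(struct model_ctx *ctx)
{
    if (ctx->contended)
        ctx->warnings++;
    ctx->held = 0;
}
```

In this model, returning early after -EDEADLK and calling `model_drop_locks()` directly raises the warning count, while going through `model_backoff()` first does not.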
#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RAW_LOCK_NESTING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_LOCK_TORTURE_TEST=m
CONFIG_WW_MUTEX_SELFTEST=m
# CONFIG_SCF_TORTURE_TEST is not set
CONFIG_CSD_LOCK_WAIT_DEBUG=y
# CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

CONFIG_TRACE_IRQFLAGS=y
CONFIG_TRACE_IRQFLAGS_NMI=y
# CONFIG_NMI_CHECK_CPU is not set
CONFIG_DEBUG_IRQFLAGS=y
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_KOBJECT_RELEASE is not set

#
# Debug kernel data structures
#
CONFIG_DEBUG_LIST=y
# CONFIG_DEBUG_PLIST is not set
CONFIG_DEBUG_SG=y
CONFIG_DEBUG_NOTIFIERS=y
# CONFIG_DEBUG_CLOSURES is not set
# CONFIG_DEBUG_MAPLE_TREE is not set
# end of Debug kernel data structures

#
# RCU Debugging
#
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_LIST=y
CONFIG_TORTURE_TEST=m
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
# CONFIG_RCU_CPU_STALL_CPUTIME is not set
# CONFIG_RCU_CPU_STALL_NOTIFIER is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

Given the above, it looks like the atomic path needs to follow the standard ww-mutex backoff/retry sequence when -EDEADLK is hit, rather than dropping out early.

I haven't had stability issues in conjunction with this, but based on the kernel docs this doesn't appear to be a false positive.

Binary-Eater (Collaborator) commented

I think we would need to explore Wound/Wait Deadlock Prevention conceptually more before being able to approach any sort of fix. My take is that the issue only reproduces with kernel lock debugging enabled, and we are not currently aware of any live occurrences. nvidia-drm does not make use of TTM or any of the upstream GPU resource managers, so it could be that the design difference is falsely triggering the deadlock detector.

https://docs.kernel.org/locking/ww-mutex-design.html

jopamo (Author) commented Feb 19, 2026

With CONFIG_PROVE_LOCKING and CONFIG_DEBUG_WW_MUTEX_SLOWPATH enabled, lockdep models the ww_mutex graph and flags the violation. That is not lockdep confusion. It is the retry protocol not being followed.

The fact that this only reproduces with lock debugging enabled is expected. Lockdep is designed to expose latent ordering bugs that depend on timing. Not seeing a production deadlock does not establish correctness. The atomic helpers assume drivers implement the documented backoff pattern.

On TTM, its absence is not directly relevant. The ww_mutex and modeset locking rules apply to any driver using DRM atomic helpers regardless of memory manager. The expectations come from DRM core, not TTM. If anything, diverging from common upstream patterns makes strict adherence more important.

This is not a lockdep false positive due to design differences. It is a missing retry path in a ww_mutex context and lock debugging simply makes it visible.

jopamo (Author) commented Feb 20, 2026

Took a closer look. The GSP-RM path still leads to GSP RPC timeouts (Xid 119). In my testing the DRM-side error path is effectively masked because GSP wedges first. I can reproduce a GPU hang/reset by racing DRM atomic commits with DROP_MASTER/SET_MASTER. The failure manifests as a GSP RPC timeout: fn 76 (GSP_RM_CONTROL) data0=0x20800a6a data1=0x0, followed by Xid 62/109/119 and “GPU reset required”. Adding a small delay between DROP_MASTER and SET_MASTER reduces the reproduction rate, which suggests a tight timing window.
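
A rough sketch of the DROP_MASTER/SET_MASTER side of such a race, with the delay as a parameter. This is not the reproducer used for the logs below, which was not posted: `toggle_master` is a hypothetical helper, the concurrent atomic commits are assumed to come from another DRM client, and the caller is assumed to hold DRM master on an already-open device node. `DRM_IOCTL_DROP_MASTER` and `DRM_IOCTL_SET_MASTER` come from the kernel's `drm/drm.h` UAPI header. Per the report above, running something like this against a live display may hang the GPU.

```c
#include <assert.h>
#include <drm/drm.h>   /* DRM_IOCTL_SET_MASTER, DRM_IOCTL_DROP_MASTER */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Drop and re-acquire DRM master `iterations` times on an open DRM
 * device fd, sleeping `delay_us` microseconds between the two ioctls.
 * Returns the number of completed cycles, or -1 if an ioctl fails.
 */
static int toggle_master(int fd, int iterations, unsigned int delay_us)
{
    int i;

    for (i = 0; i < iterations; i++) {
        if (ioctl(fd, DRM_IOCTL_DROP_MASTER, 0) != 0)
            return -1;
        if (delay_us)
            usleep(delay_us); /* widen the window between drop and set */
        if (ioctl(fd, DRM_IOCTL_SET_MASTER, 0) != 0)
            return -1;
    }
    return i;
}
```

Passing `delay_us = 0` keeps the window as tight as possible; a non-zero value corresponds to the "small delay" mentioned above.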

Feb 19 21:55:12 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 62, 32344000 0000b670 00000000 206a7a8a 206a6c4a 206a6db8 206a52ae 206a5aca
Feb 19 21:55:12 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
Feb 19 21:55:22 localhost kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Feb 19 21:55:24 localhost kernel: NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked! Notify Timeout Seconds: 7
Feb 19 21:55:26 localhost kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00002d00] Flip event timeout on head 0
Feb 19 21:55:26 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 19 21:55:27 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 109, pid=587, name=(udev-worker), channel 0x00000001, errorString CTX SWITCH TIMEOUT, Info 0x4000
Feb 19 21:55:28 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 19 21:55:30 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 19 21:55:30 localhost kernel: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
Feb 19 21:55:30 localhost kernel: NVRM: _kgspLogXid119: Note: Please also check logs above.
Feb 19 21:55:30 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 119, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 1043 (0x20800a6a 0x0).
Feb 19 21:55:30 localhost kernel: NVRM: kgspPrintGspBinBuildId_IMPL: GSP bin buildId: cf812cb3f2f1e8c8209dc2e446fdf536ba9ec88f
Feb 19 21:55:30 localhost kernel: NVRM: task watchdog timeout @ pc:0x15261ee, partition:4#0, task:3
Feb 19 21:55:30 localhost kernel: NVRM: Reported by libos partition:4#5 kernel v3.1 [0] @ ts:5181887
Feb 19 21:55:30 localhost kernel: NVRM: RISC-V CSR State:
Feb 19 21:55:30 localhost kernel: NVRM: sstatus:0x0000000200000020 sscratch:0xffffffffa3013960 sie:0x0000000000000220 sip:0x0000000000000020
Feb 19 21:55:30 localhost kernel: NVRM: sepc:0x00000000015261ee stval:0x0000000000000000 scause:0x8000000000000005
Feb 19 21:55:30 localhost kernel: NVRM: RISC-V GPR State:
Feb 19 21:55:30 localhost kernel: NVRM: ra:0x0000000001526ce0 sp:0x00000007f780f250 gp:0x0000000000000000 tp:0x00000007f7c00000
Feb 19 21:55:30 localhost kernel: NVRM: a0:0x001268e635282380 a1:0x0000000000000000 a2:0x00000007f780f2b0 a3:0x0000000000000002
Feb 19 21:55:30 localhost kernel: NVRM: a4:0x0000000000001630 a5:0x00000000009502f9 a6:0x000fffffffffffff a7:0x0000000000000000
Feb 19 21:55:30 localhost kernel: NVRM: s0:0x00000007f780f2a0 s1:0x00000007f097b698 s2:0x0000000000000059 s3:0x00000000000000b4
Feb 19 21:55:30 localhost kernel: NVRM: s4:0x001268e635282380 s5:0x00000002540be400 s6:0x00000000000f4240 s7:0x00000000202bd840
Feb 19 21:55:30 localhost kernel: NVRM: s8:0x0000000004164180 s9:0x0000000004164180 s10:0x001268e5f6cc9a60 s11:0x00000007f0a02e10
Feb 19 21:55:30 localhost kernel: NVRM: t0:0x0000000000000005 t1:0x0000000000000003 t2:0x0000000000000000 t3:0x000fffffffffffff
Feb 19 21:55:30 localhost kernel: NVRM: t4:0x0000000000000000 t5:0x00000007f780f401 t6:0x0000000000000020
Feb 19 21:55:30 localhost kernel: NVRM: Stack Trace:
Feb 19 21:55:30 localhost kernel: NVRM: 0x00000000015261ee
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001526ce0
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001aa1678
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001acc430
Feb 19 21:55:30 localhost kernel: NVRM: 0x000000000169f602
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001684dec
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001b120ee
Feb 19 21:55:30 localhost kernel: NVRM: 0x000000000143a254
Feb 19 21:55:30 localhost kernel: NVRM: 0x000000000143a3d2
Feb 19 21:55:30 localhost kernel: NVRM: 0x000000000143a6de
Feb 19 21:55:30 localhost kernel: NVRM: 0x000000000179e67c
Feb 19 21:55:30 localhost kernel: NVRM: 0x000000000179ffae
Feb 19 21:55:30 localhost kernel: NVRM: 0x00000000017a0e0c
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001aabf30
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001393af6
Feb 19 21:55:30 localhost kernel: NVRM: 0x000000000139131a
Feb 19 21:55:30 localhost kernel: NVRM: 0x00000000015264b4
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001ad71dc
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001ad8ed2
Feb 19 21:55:30 localhost kernel: NVRM: 0x00000000019eb3f0
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001b36596
Feb 19 21:55:30 localhost kernel: NVRM: 0x00000000019dab2a
Feb 19 21:55:30 localhost kernel: NVRM: PC Trace:
Feb 19 21:55:30 localhost kernel: NVRM: 0x00000000015261ee 0x000000000010013e 0x00000000015261ee 0x0000000001526cdc 0x0000000001526456
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001526cc8 0x000000000181d2d8 0x0000000001526cb8 0x0000000001a9aa3a 0x0000000001b015f4
Feb 19 21:55:30 localhost kernel: NVRM: 0x000000000126de06 0x00000000013e73e2 0x0000000001b00b82 0x00000000013e73f2 0x0000000001418630
Feb 19 21:55:30 localhost kernel: NVRM: 0x00000000013e7602 0x000000000141853a 0x00000000013e741e 0x0000000001a9309a 0x00000000013e7400
Feb 19 21:55:30 localhost kernel: NVRM: 0x000000000126ddf8 0x0000000001b01420 0x00000000019da9d0 0xffffffff93004490 0x00000000019da9c6
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001b013f8 0xffffffff9300444e 0x0000000001b013e6 0x00000000019da9d0 0xffffffff93004490
Feb 19 21:55:30 localhost kernel: NVRM: 0x00000000019da9c6 0x0000000001b0169e 0xffffffff9300444e 0x0000000001b0168c 0x000000000181d1c0
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001b014c6 0x00000000017ce930 0x0000000001b01494 0x0000000001a9aa08 0x00000000010a4e3c
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c
Feb 19 21:55:30 localhost kernel: NVRM: 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e
Feb 19 21:55:30 localhost kernel: NVRM: 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e
Feb 19 21:55:30 localhost kernel: NVRM: Local I/O Register State:
Feb 19 21:55:30 localhost kernel: NVRM: 0x01450800:0x00000000 0x01450900:0xbadf5041 0x01450a00:0x00000000 0x01450c00:0x00000000
Feb 19 21:55:30 localhost kernel: NVRM: 0x01454a00:0x810490d2 0x01454b00:0x010800d0 0x01454c00:0x00080000 0x01400200:0x00000040
Feb 19 21:55:30 localhost kernel: NVRM: GPU0 GSP RPC buffer contains function 4100 (RC_TRIGGERED) sequence 0 and data 0x0000000000000001 0x000000000000006d.
Feb 19 21:55:30 localhost kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Feb 19 21:55:30 localhost kernel: NVRM: entry function sequence data0 data1 ts_start ts_end duration actively_polling
Feb 19 21:55:30 localhost kernel: NVRM: 0 76 GSP_RM_CONTROL 1043 0x0000000020800a6a 0x0000000000000000 0x00064b396416b93c 0x0000000000000000 y
Feb 19 21:55:30 localhost kernel: NVRM: -1 76 GSP_RM_CONTROL 1042 0x00000000007302a4 0x0000000000000010 0x00064b3964030186 0x00064b3964030322 412us
Feb 19 21:55:30 localhost kernel: NVRM: -2 76 GSP_RM_CONTROL 1041 0x0000000000731144 0x0000000000000078 0x00064b3964026e1e 0x00064b3964030157 37689us
Feb 19 21:55:30 localhost kernel: NVRM: -3 10 FREE 1040 0x00000000000100bc 0x0000000000000000 0x00064b3964026ad0 0x00064b3964026de3 787us
Feb 19 21:55:30 localhost kernel: NVRM: -4 10 FREE 1039 0x00000000000100bb 0x0000000000000000 0x00064b3963f316e3 0x00064b3964026a50 1004ms
Feb 19 21:55:30 localhost kernel: NVRM: -5 76 GSP_RM_CONTROL 1038 0x0000000000730275 0x000000000000000c 0x00064b3963c54e70 0x00064b3963c54f84 276us
Feb 19 21:55:30 localhost kernel: NVRM: -6 76 GSP_RM_CONTROL 1037 0x0000000000730288 0x000000000000003c 0x00064b3963c54ca1 0x00064b3963c54e5a 441us
Feb 19 21:55:30 localhost kernel: NVRM: -7 76 GSP_RM_CONTROL 1036 0x0000000000731144 0x0000000000000078 0x00064b3963c548f4 0x00064b3963c54c79 901us
Feb 19 21:55:30 localhost kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Feb 19 21:55:30 localhost kernel: NVRM: entry function sequence data0 data1 ts_start ts_end duration during_incomplete_rpc
Feb 19 21:55:30 localhost kernel: NVRM: 0 4100 RC_TRIGGERED 0 0x0000000000000001 0x000000000000006d 0x00064b39644054d5 0x00064b39644054e6 17us y
Feb 19 21:55:30 localhost kernel: NVRM: -1 4102 OS_ERROR_LOG 0 0x0000000000000000 0x0000000000000000 0x00064b3964403a06 0x00064b3964403a2a 36us y
Feb 19 21:55:30 localhost kernel: NVRM: -2 4130 RECOVERY_ACTION 0 0x0000000000000000 0x0000000000000000 0x00064b39635e3bcd 0x00064b39635e3bd6 9us
Feb 19 21:55:30 localhost kernel: NVRM: -3 4102 OS_ERROR_LOG 0 0x0000000000000000 0x0000000000000000 0x00064b39635e2d6f 0x00064b39635e3bcc 3677us
Feb 19 21:55:30 localhost kernel: NVRM: -4 4108 UCODE_LIBOS_PRINT 0 0x0000000000000000 0x0000000000000000 0x00064b39635e1943 0x00064b39635e1943
Feb 19 21:55:30 localhost kernel: NVRM: -5 4108 UCODE_LIBOS_PRINT 0 0x0000000000000000 0x0000000000000000 0x00064b39635e18fc 0x00064b39635e18fd 1us
Feb 19 21:55:30 localhost kernel: NVRM: -6 4108 UCODE_LIBOS_PRINT 0 0x0000000000000000 0x0000000000000000 0x00064b39635e18fc 0x00064b39635e18fc
Feb 19 21:55:30 localhost kernel: NVRM: -7 4108 UCODE_LIBOS_PRINT 0 0x0000000000000000 0x0000000000000000 0x00064b39635e18d9 0x00064b39635e18da 1us
Feb 19 21:55:30 localhost kernel: CPU: 6 UID: 0 PID: 0 Comm: swapper/6 Tainted: G OE 6.18.10 #6 PREEMPT(full) f615b8cac33edf2a38e4a72344c6b408eacc033d
Feb 19 21:55:30 localhost kernel: Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Feb 19 21:55:30 localhost kernel: Call Trace:
Feb 19 21:55:30 localhost kernel: <IRQ>
Feb 19 21:55:30 localhost kernel: dump_stack_lvl+0x4d/0x70
Feb 19 21:55:30 localhost kernel: kgspPrintGspBinBuildId_IMPL+0xa91/0xdd0 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: vgpuIsCallingContextPlugin+0x26a5/0x30b0 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: ? osGetCurrentThread+0x26/0x60 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: ? rmDeviceGpuLockIsOwner+0x29/0x90 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: rpcRmApiControl_GSP+0x76f/0x940 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: krcWatchdog_IMPL+0x4d0/0x530 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: ? os_get_monotonic_time_ns_hr+0xf0/0xf0 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: ? try_to_wake_up+0x131/0x14b0
Feb 19 21:55:30 localhost kernel: krcWatchdogTimerProc+0x48/0x70 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: osGetNvGlobalRegistryDword+0x8b/0x100 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: osRun1HzCallbacksNow+0xa4/0x130 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: rm_run_rc_callback+0x6c/0x90 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: ? _raw_write_lock_irq+0xd0/0xd0
Feb 19 21:55:30 localhost kernel: ? __x86_indirect_thunk_r15+0xd/0x26d [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: nvidia_isr+0x1633/0x1a40 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: call_timer_fn+0x2c/0x1e0
Feb 19 21:55:30 localhost kernel: __run_timers+0x575/0x870
Feb 19 21:55:30 localhost kernel: ? __x86_indirect_thunk_r15+0xd/0x26d [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 19 21:55:30 localhost kernel: ? call_timer_fn+0x1e0/0x1e0
Feb 19 21:55:30 localhost kernel: ? _raw_spin_lock_irq+0x84/0xe0
Feb 19 21:55:30 localhost kernel: ? _raw_spin_lock_bh+0xe0/0xe0
Feb 19 21:55:30 localhost kernel: ? sched_balance_rq+0x4a9/0x3110
Feb 19 21:55:30 localhost kernel: ? sched_clock+0x10/0x20
Feb 19 21:55:30 localhost kernel: timer_expire_remote+0xf2/0x190
Feb 19 21:55:30 localhost kernel: ? timer_base_is_idle+0x20/0x20
Feb 19 21:55:30 localhost kernel: tmigr_handle_remote+0x5a8/0xbe0
Feb 19 21:55:30 localhost kernel: ? tmigr_cpu_activate+0x150/0x150
Feb 19 21:55:30 localhost kernel: ? _raw_spin_lock_irq+0x84/0xe0
Feb 19 21:55:30 localhost kernel: ? __hrtimer_run_queues+0x379/0x7a0
Feb 19 21:55:30 localhost kernel: ? sched_clock+0x10/0x20
Feb 19 21:55:30 localhost kernel: ? sched_clock_cpu+0x69/0x5a0
Feb 19 21:55:30 localhost kernel: run_timer_softirq+0x1f7/0x280
Feb 19 21:55:30 localhost kernel: ? __run_timers+0x870/0x870
Feb 19 21:55:30 localhost kernel: ? ktime_get+0x5e/0x150
Feb 19 21:55:30 localhost kernel: handle_softirqs+0x198/0x580
Feb 19 21:55:30 localhost kernel: ? tasklet_unlock_wait+0x50/0x50
Feb 19 21:55:30 localhost kernel: ? irqtime_account_irq+0x44/0x2b0
Feb 19 21:55:30 localhost kernel: irq_exit_rcu+0xb8/0xf0
Feb 19 21:55:30 localhost kernel: sysvec_apic_timer_interrupt+0x7f/0xc0
Feb 19 21:55:30 localhost kernel: </IRQ>
Feb 19 21:55:30 localhost kernel: <TASK>
Feb 19 21:55:30 localhost kernel: asm_sysvec_apic_timer_interrupt+0x1a/0x20
Feb 19 21:55:30 localhost kernel: RIP: 0010:cpuidle_enter_state+0xc9/0x4b0
Feb 19 21:55:30 localhost kernel: Code: c5 fb e8 9a f5 ff ff 8b 73 04 bf ff ff ff ff 49 89 c4 e8 da 9a e6 fe 31 ff e8 53 fe c1 fb 45 84 ff 0f 85 78 01 00 00 fb 85 ed <0f> 88 4b 01 00 00 48 8b 3c 24 e8 08 87 e6 fe 4c 89 e7 4c 63 e5 49
Feb 19 21:55:30 localhost kernel: RSP: 0018:ffff888100ecfd88 EFLAGS: 00000202
Feb 19 21:55:30 localhost kernel: RAX: dffffc0000000000 RBX: ffff88810ab82800 RCX: 0000000000000000
Feb 19 21:55:30 localhost kernel: RDX: ffff889f5db3f700 RSI: 1ffff113ebb6809b RDI: ffff889f5db404d8
Feb 19 21:55:30 localhost kernel: RBP: 0000000000000002 R08: 0000000000000002 R09: ffffffffa1dd3e19
Feb 19 21:55:30 localhost kernel: R10: ffff889f5db3abeb R11: ffff889f5db30260 R12: 00000027dbc421e9
Feb 19 21:55:30 localhost kernel: R13: ffffffffa473bee0 R14: 0000000000000002 R15: 0000000000000000
Feb 19 21:55:30 localhost kernel: ? ct_kernel_enter.isra.0+0x59/0xb0
Feb 19 21:55:30 localhost kernel: ? cpuidle_enter_state+0xbd/0x4b0
Feb 19 21:55:30 localhost kernel: cpuidle_enter+0x4c/0xa0
Feb 19 21:55:30 localhost kernel: do_idle+0x2b7/0x3c0
Feb 19 21:55:30 localhost kernel: ? arch_cpu_idle_exit+0x40/0x40
Feb 19 21:55:30 localhost kernel: ? __switch_to+0xb2b/0x10f0
Feb 19 21:55:30 localhost kernel: cpu_startup_entry+0x53/0x70
Feb 19 21:55:30 localhost kernel: start_secondary+0x200/0x2b0
Feb 19 21:55:30 localhost kernel: ? set_cpu_sibling_map+0x2360/0x2360
Feb 19 21:55:30 localhost kernel: common_startup_64+0x13e/0x141
Feb 19 21:55:30 localhost kernel: </TASK>
Feb 19 21:55:30 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
Feb 19 21:55:30 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
Feb 19 21:55:30 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
Feb 19 21:55:30 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
Feb 19 21:55:30 localhost kernel: NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x110624, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
Feb 19 21:55:30 localhost kernel: NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x11062c, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
Feb 19 21:55:30 localhost kernel: NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x111404, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
Feb 19 21:55:30 localhost kernel: NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x111408, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
Feb 19 21:55:30 localhost kernel: NVRM: kflcnDumpTracepc_GA102: Trace buffer blocked, skipping.
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc : 0181d23e
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvCpuctl : 00000180
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqmask : 810490d2
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqdest : 010800d0
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrStat : 00000000
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrInfo : badf5041
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrAddr : 0000000000000000
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvHubErrStat : 00000000
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconMailbox : 0:00000438 1:00000438
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqstat : 00400050
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqmode : ffacfc24
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifInstblk : 00000000
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCtl : badf5720
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifThrottle : badf5720
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkBlk : 0:00000000 1:00000000
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkCtl : 0:00000000 1:00000000
Feb 19 21:55:30 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCg1 : 0000000f
Feb 19 21:55:30 localhost kernel: NVRM: _kgspLogXid119: ********************************************************************************
Feb 19 21:55:30 localhost kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 1043!
Feb 19 21:55:35 localhost kernel: NVRM: GPU0 _kgspProcessRpcEvent: Unexpected RPC event from GPU0: 0x4c (GSP_RM_CONTROL), sequence: 1043
Feb 19 21:55:43 localhost kernel: NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked! Notify Timeout Seconds: 7
Feb 19 21:55:43 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 109, pid=587, name=(udev-worker), channel 0x00000001, errorString CTX SWITCH TIMEOUT, Info 0x4000
Feb 19 21:55:45 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 19 21:55:47 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 19 21:55:49 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 119, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 1051 (0x20800a6a 0x0).
Feb 19 21:55:49 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
Feb 19 21:55:49 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
Feb 19 21:55:49 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
Feb 19 21:55:49 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
Feb 19 21:55:49 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc : 019da9ba
Feb 19 21:55:49 localhost kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 1051!
Feb 19 21:55:49 localhost kernel: clocksource: Long readout interval, skipping watchdog check: cs_nsec: 4966685104 wd_nsec: 4966688922
Feb 19 21:55:57 localhost kernel: NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked! Notify Timeout Seconds: 7
Feb 19 21:55:57 localhost kernel: NVRM: _kgspProcessRpcEvent: Unexpected RPC event from GPU0: 0x4c (GSP_RM_CONTROL), sequence: 1051
Feb 19 21:55:59 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 19 21:56:00 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 109, pid=587, name=(udev-worker), channel 0x00000001, errorString CTX SWITCH TIMEOUT, Info 0x4000
Feb 19 21:56:01 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 19 21:56:03 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 119, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 1052 (0x20800a6a 0x0).
Feb 19 21:56:03 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
Feb 19 21:56:03 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
Feb 19 21:56:03 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
Feb 19 21:56:03 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc : 010a4e2e
Feb 19 21:56:03 localhost kernel: NVRM: nvAssertFailedNoLog: Assertion failed: Back to back GSP RPC timeout detected! GPU marked for reset @ kernel_gsp.c:2428
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: Core is booted.
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: RSTAT0 0x0000000000000004
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: RSTAT3 0x0000000000000000
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: RSTAT4 0x0000000000000000
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: PC = 0x00000000010a4d50
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: ra:0x0000000001a9a976 sp:0x00000007f780f180 gp:0x0000000000000005 tp:0x00000007f7c00000
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: a0:0x0000000000000005 a1:0x0000000000000003 a2:0x0000000000000000 a3:0x00000007f780f1c0
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: a4:0x00000007f780f1d0 a5:0x00000007f0997a10 a6:0x0000000000000023 a7:0x00000007f0997a10
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: s0:0x0000000000000001 s1:0x0000000000000022 s2:0x0000000000000000 s3:0x000fffffffffffff
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: s4:0x0000000000000000 s5:0x00000007f0997a10 s6:0x00000007f0979ab8 s7:0x00000007f078b5f0
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: s8:0x000000000000000b s9:0x00000007f780f2b0 s10:0x000000000000003e s11:0x0000000001526c96
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: t0:0x0000000004164180 t1:0x0000000000000000 t2:0x00000007f0a02e10 t3:0x000fffffffffffff
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: t4:0x0000000000000000 t5:0x00000007f780f401 t6:0x0000000000000020
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: csr[803] = 0x0000000000000000
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: csr[895] = 0x0000000000000000
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: csr[897] = 0x0000000000000005
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: csr[899] = 0x0000000000000001
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: csr[89a] = 0x0000000000000000
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: csr[8b4] = 0x0000000000000000
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: csr[8b5] = 0x0000000000000000
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: csr[c00] = 0x0000004079ec0446
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: csr[c01] = 0x001268ee54037fc0
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind00: 0x0000000001a9a976
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind01: 0x0000000001526c96
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind02: 0x0000000001aa1678
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind03: 0x0000000001acc430
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind04: 0x000000000169f602
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind05: 0x0000000001684dec
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind06: 0x0000000001b120ee
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind07: 0x000000000143a254
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind08: 0x000000000143a3d2
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind09: 0x000000000143a6de
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind10: 0x000000000179e67c
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind11: 0x000000000179ffae
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind12: 0x00000000017a0e0c
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind13: 0x0000000001aabf30
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind14: 0x0000000001393af6
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind15: 0x000000000139131a
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind16: 0x00000000015264b4
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind17: 0x0000000001ad71dc
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind18: 0x0000000001ad8ed2
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind19: 0x00000000019eb3f0
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind20: 0x0000000001b36596
Feb 19 21:56:03 localhost kernel: NVRM: kflcnCoreDumpDestructive_IMPL: ICD: unwind complete.
Feb 19 21:56:03 localhost kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 1052!

@jopamo
Author

jopamo commented Feb 20, 2026

Never mind, I can hit both with lock debugging off.

Feb 20 01:34:20 localhost kernel: ------------[ cut here ]------------
Feb 20 01:34:20 localhost kernel: WARNING: CPU: 29 PID: 19780 at drivers/gpu/drm/drm_modeset_lock.c:278 drm_modeset_drop_locks+0x155/0x2a0
Feb 20 01:34:20 localhost kernel: Modules linked in: tun snd_seq_dummy snd_hrtimer snd_seq snd_seq_device iwlmvm btusb snd_hda_codec_nvhdmi snd_hda_codec_hdmi btmtk btrtl snd_hda_intel btbcm btintel snd_hda_codec ptp snd_hda_core bluetooth snd_intel_dspcfg snd_hwdep iwlwifi snd_pcm snd_timer snd rapl soundcore wmi_bmof k10temp i2c_piix4 mousedev nvidia_uvm(OE) bridge stp llc loop hid_logitech_hidpp hid_multitouch hid_logitech_dj nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) video drm_ttm_helper ttm wmi hid_generic usbhid
Feb 20 01:34:20 localhost kernel: CPU: 29 UID: 1000 PID: 19780 Comm: poc Tainted: G OE 6.18.10 #6 PREEMPT(full) f615b8cac33edf2a38e4a72344c6b408eacc033d
Feb 20 01:34:20 localhost kernel: Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Feb 20 01:34:20 localhost kernel: RIP: 0010:drm_modeset_drop_locks+0x155/0x2a0
Feb 20 01:34:20 localhost kernel: Code: 89 52 08 48 8d 7a d8 e8 49 3f 27 02 80 7d 00 00 75 2c 49 8b 45 28 4c 39 e0 0f 85 36 ff ff ff 48 83 c4 20 5b 5d 41 5c 41 5d c3 <0f> 0b e9 e4 fe ff ff 4c 89 e7 e8 3c d0 63 fe e9 21 ff ff ff 4c 89
Feb 20 01:34:20 localhost kernel: RSP: 0018:ffff8881e263f830 EFLAGS: 00010282
Feb 20 01:34:20 localhost kernel: RAX: dffffc0000000000 RBX: ffff8881992b39e8 RCX: 0000000000000001
Feb 20 01:34:20 localhost kernel: RDX: 1ffff1103c4c7f18 RSI: 0000000000000004 RDI: ffff8881e263f8c0
Feb 20 01:34:20 localhost kernel: RBP: 1ffff1103c4c7f11 R08: 0000000000000001 R09: ffffffff8751957b
Feb 20 01:34:20 localhost kernel: R10: ffff8881062dd007 R11: ffff889f5e6c881c R12: 00000000ffffffdd
Feb 20 01:34:20 localhost kernel: R13: ffff8881e263f8a8 R14: ffff8881062dd2b0 R15: ffff88816fef7000
Feb 20 01:34:20 localhost kernel: FS: 000074f5349fe6c0(0000) GS:ffff889fd0c19000(0000) knlGS:0000000000000000
Feb 20 01:34:20 localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 01:34:20 localhost kernel: CR2: 000074f5349fdf78 CR3: 000000016c4c5000 CR4: 0000000000f50ef0
Feb 20 01:34:20 localhost kernel: PKRU: 55555554
Feb 20 01:34:20 localhost kernel: Call Trace:
Feb 20 01:34:20 localhost kernel: <TASK>
Feb 20 01:34:20 localhost kernel: ? __drm_atomic_state_free+0x13f/0x290
Feb 20 01:34:20 localhost kernel: nv_drm_register_drm_device+0x1fc5/0x3280 [nvidia_drm c72ee0042a96708f46fc4d939ac115febc3c3d09]
Feb 20 01:34:20 localhost kernel: ? nv_drm_register_drm_device+0x1cb0/0x3280 [nvidia_drm c72ee0042a96708f46fc4d939ac115febc3c3d09]
Feb 20 01:34:20 localhost kernel: ? __mutex_lock_slowpath+0x10/0x10
Feb 20 01:34:20 localhost kernel: drm_dropmaster_ioctl+0x2d6/0x500
Feb 20 01:34:20 localhost kernel: ? drm_setmaster_ioctl+0x660/0x660
Feb 20 01:34:20 localhost kernel: drm_ioctl_kernel+0x15f/0x2f0
Feb 20 01:34:20 localhost kernel: ? drm_setversion+0x810/0x810
Feb 20 01:34:20 localhost kernel: ? page_counter_cancel+0x1f/0x150
Feb 20 01:34:20 localhost kernel: drm_ioctl+0x496/0xaf0
Feb 20 01:34:20 localhost kernel: ? drm_setmaster_ioctl+0x660/0x660
Feb 20 01:34:20 localhost kernel: ? __css_rstat_lock.isra.0+0x1f0/0x1f0
Feb 20 01:34:20 localhost kernel: ? drm_ioctl_kernel+0x2f0/0x2f0
Feb 20 01:34:20 localhost kernel: ? try_charge_memcg+0x967/0xdf0
Feb 20 01:34:20 localhost kernel: ? __lruvec_stat_mod_folio+0x15c/0x240
Feb 20 01:34:20 localhost kernel: ? __pte_offset_map_lock+0x10a/0x200
Feb 20 01:34:20 localhost kernel: nv_drm_register_drm_device+0x87f/0x3280 [nvidia_drm c72ee0042a96708f46fc4d939ac115febc3c3d09]
Feb 20 01:34:20 localhost kernel: ? __folio_batch_add_and_move+0x132/0x1c0
Feb 20 01:34:20 localhost kernel: ? nv_drm_register_drm_device+0x7e0/0x3280 [nvidia_drm c72ee0042a96708f46fc4d939ac115febc3c3d09]
Feb 20 01:34:20 localhost kernel: ? __handle_mm_fault+0x133a/0x1e90
Feb 20 01:34:20 localhost kernel: ? fdget+0x2e1/0x4a0
Feb 20 01:34:20 localhost kernel: __x64_sys_ioctl+0x738/0x1120
Feb 20 01:34:20 localhost kernel: ? ioctl_file_clone+0xb0/0xb0
Feb 20 01:34:20 localhost kernel: ? _raw_spin_lock_irq+0x84/0xe0
Feb 20 01:34:20 localhost kernel: ? _raw_spin_lock_bh+0xe0/0xe0
Feb 20 01:34:20 localhost kernel: ? __rseq_handle_notify_resume+0x4b7/0xae0
Feb 20 01:34:20 localhost kernel: ? recalc_sigpending+0x131/0x200
Feb 20 01:34:20 localhost kernel: ? __x64_sys_rt_sigprocmask+0x241/0x400
Feb 20 01:34:20 localhost kernel: ? __x64_sys_rseq+0x6b0/0x6b0
Feb 20 01:34:20 localhost kernel: ? __x64_sys_sigprocmask+0x330/0x330
Feb 20 01:34:20 localhost kernel: ? _raw_spin_lock_irq+0x84/0xe0
Feb 20 01:34:20 localhost kernel: do_syscall_64+0x7c/0xb30
Feb 20 01:34:20 localhost kernel: ? exc_page_fault+0x6e/0xc0
Feb 20 01:34:20 localhost kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
Feb 20 01:34:20 localhost kernel: RIP: 0033:0x74f535323089
Feb 20 01:34:20 localhost kernel: Code: 00 00 00 48 89 44 24 18 48 8d 44 24 60 c7 04 24 18 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1e 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Feb 20 01:34:20 localhost kernel: RSP: 002b:000074f5349fde10 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
Feb 20 01:34:20 localhost kernel: RAX: ffffffffffffffda RBX: 000000005d07fa88 RCX: 000074f535323089
Feb 20 01:34:20 localhost kernel: RDX: 0000000000000000 RSI: 000000000000641f RDI: 0000000000000003
Feb 20 01:34:20 localhost kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Feb 20 01:34:20 localhost kernel: R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000001
Feb 20 01:34:20 localhost kernel: R13: 00007ffe11c4d8b6 R14: 000074f5349fecdc R15: 00007ffe11c4d8b8
Feb 20 01:34:20 localhost kernel: </TASK>
Feb 20 01:34:20 localhost kernel: ---[ end trace 0000000000000000 ]---
Feb 20 01:34:20 localhost kernel: [drm] [nvidia-drm] [GPU ID 0x00002d00] nv_drm_reset_input_colorspace failed with error code: -35 !
Feb 20 01:34:20 localhost kernel: NVRM: GPU at PCI:0000:2d:00: GPU-e6514a33-4652-a527-3b8e-05f66dd304d8
Feb 20 01:34:20 localhost kernel: NVRM: GPU Board Serial Number: 0
Feb 20 01:34:20 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 62, 32344000 0000b670 00000000 206a7a8a 206a6c4a 206a6db8 206a52ae 206a5aca
Feb 20 01:34:20 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
Feb 20 01:34:29 localhost kernel: NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked! Notify Timeout Seconds: 7
Feb 20 01:34:30 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 109, pid=619, name=(udev-worker), channel 0x00000001, errorString CTX SWITCH TIMEOUT, Info 0x4000
Feb 20 01:34:31 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 20 01:34:33 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 20 01:34:34 localhost kernel: ------------[ cut here ]------------
Feb 20 01:34:34 localhost kernel: WARNING: CPU: 28 PID: 19780 at drivers/gpu/drm/drm_modeset_lock.c:278 drm_modeset_drop_locks+0x155/0x2a0
Feb 20 01:34:34 localhost kernel: Modules linked in: tun snd_seq_dummy snd_hrtimer snd_seq snd_seq_device iwlmvm btusb snd_hda_codec_nvhdmi snd_hda_codec_hdmi btmtk btrtl snd_hda_intel btbcm btintel snd_hda_codec ptp snd_hda_core bluetooth snd_intel_dspcfg snd_hwdep iwlwifi snd_pcm snd_timer snd rapl soundcore wmi_bmof k10temp i2c_piix4 mousedev nvidia_uvm(OE) bridge stp llc loop hid_logitech_hidpp hid_multitouch hid_logitech_dj nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) video drm_ttm_helper ttm wmi hid_generic usbhid
Feb 20 01:34:34 localhost kernel: CPU: 28 UID: 1000 PID: 19780 Comm: poc Tainted: G W OE 6.18.10 #6 PREEMPT(full) f615b8cac33edf2a38e4a72344c6b408eacc033d
Feb 20 01:34:34 localhost kernel: Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Feb 20 01:34:34 localhost kernel: RIP: 0010:drm_modeset_drop_locks+0x155/0x2a0
Feb 20 01:34:34 localhost kernel: Code: 89 52 08 48 8d 7a d8 e8 49 3f 27 02 80 7d 00 00 75 2c 49 8b 45 28 4c 39 e0 0f 85 36 ff ff ff 48 83 c4 20 5b 5d 41 5c 41 5d c3 <0f> 0b e9 e4 fe ff ff 4c 89 e7 e8 3c d0 63 fe e9 21 ff ff ff 4c 89
Feb 20 01:34:34 localhost kernel: RSP: 0018:ffff8881e263f830 EFLAGS: 00010282
Feb 20 01:34:34 localhost kernel: RAX: dffffc0000000000 RBX: ffff888150b5f9e8 RCX: 0000000000000001
Feb 20 01:34:34 localhost kernel: RDX: 1ffff1103c4c7f18 RSI: 0000000000000004 RDI: ffff8881e263f8c0
Feb 20 01:34:34 localhost kernel: RBP: 1ffff1103c4c7f11 R08: 0000000000000001 R09: ffffffff8751957b
Feb 20 01:34:34 localhost kernel: R10: ffff8881062dd007 R11: ffff889f5e64881c R12: 00000000ffffffdd
Feb 20 01:34:34 localhost kernel: R13: ffff8881e263f8a8 R14: ffff8881062dd2b0 R15: ffff88816fef7000
Feb 20 01:34:34 localhost kernel: FS: 000074f5349fe6c0(0000) GS:ffff889fd0b99000(0000) knlGS:0000000000000000
Feb 20 01:34:34 localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 01:34:34 localhost kernel: CR2: 000076064e6003e8 CR3: 000000016c4c5000 CR4: 0000000000f50ef0
Feb 20 01:34:34 localhost kernel: PKRU: 55555554
Feb 20 01:34:34 localhost kernel: Call Trace:
Feb 20 01:34:34 localhost kernel: <TASK>
Feb 20 01:34:34 localhost kernel: ? __drm_atomic_state_free+0x13f/0x290
Feb 20 01:34:34 localhost kernel: nv_drm_register_drm_device+0x1fc5/0x3280 [nvidia_drm c72ee0042a96708f46fc4d939ac115febc3c3d09]
Feb 20 01:34:34 localhost kernel: ? nv_drm_register_drm_device+0x1cb0/0x3280 [nvidia_drm c72ee0042a96708f46fc4d939ac115febc3c3d09]
Feb 20 01:34:34 localhost kernel: ? __mutex_lock_slowpath+0x10/0x10
Feb 20 01:34:34 localhost kernel: ? _raw_spin_lock+0x83/0xe0
Feb 20 01:34:34 localhost kernel: drm_dropmaster_ioctl+0x2d6/0x500
Feb 20 01:34:34 localhost kernel: ? drm_setmaster_ioctl+0x660/0x660
Feb 20 01:34:34 localhost kernel: drm_ioctl_kernel+0x15f/0x2f0
Feb 20 01:34:34 localhost kernel: ? drm_setversion+0x810/0x810
Feb 20 01:34:34 localhost kernel: ? update_entity_lag+0x116/0x180
Feb 20 01:34:34 localhost kernel: ? sched_clock+0x10/0x20
Feb 20 01:34:34 localhost kernel: ? sched_clock_cpu+0x69/0x5a0
Feb 20 01:34:34 localhost kernel: ? dequeue_entities+0x452/0x2ed0
Feb 20 01:34:34 localhost kernel: drm_ioctl+0x496/0xaf0
Feb 20 01:34:34 localhost kernel: ? drm_setmaster_ioctl+0x660/0x660
Feb 20 01:34:34 localhost kernel: ? drm_ioctl_kernel+0x2f0/0x2f0
Feb 20 01:34:34 localhost kernel: ? finish_task_switch.isra.0+0x1a1/0x710
Feb 20 01:34:34 localhost kernel: ? io_schedule_timeout+0x130/0x130
Feb 20 01:34:34 localhost kernel: nv_drm_register_drm_device+0x87f/0x3280 [nvidia_drm c72ee0042a96708f46fc4d939ac115febc3c3d09]
Feb 20 01:34:34 localhost kernel: ? _raw_spin_lock_irqsave+0x89/0xe0
Feb 20 01:34:34 localhost kernel: ? _raw_write_unlock_irqrestore+0x70/0x70
Feb 20 01:34:34 localhost kernel: ? nv_drm_register_drm_device+0x7e0/0x3280 [nvidia_drm c72ee0042a96708f46fc4d939ac115febc3c3d09]
Feb 20 01:34:34 localhost kernel: ? debug_object_free+0x27a/0x5a0
Feb 20 01:34:34 localhost kernel: ? schedule+0x74/0x290
Feb 20 01:34:34 localhost kernel: ? debug_object_init_on_stack+0x30/0x30
Feb 20 01:34:34 localhost kernel: ? console_conditional_schedule+0x20/0x20
Feb 20 01:34:34 localhost kernel: ? __hrtimer_setup+0x33/0x220
Feb 20 01:34:34 localhost kernel: ? fdget+0x2e1/0x4a0
Feb 20 01:34:34 localhost kernel: __x64_sys_ioctl+0x738/0x1120
Feb 20 01:34:34 localhost kernel: ? ioctl_file_clone+0xb0/0xb0
Feb 20 01:34:34 localhost kernel: ? __hrtimer_setup+0x220/0x220
Feb 20 01:34:34 localhost kernel: ? __asan_memset+0x27/0x50
Feb 20 01:34:34 localhost kernel: ? __rseq_handle_notify_resume+0x4b7/0xae0
Feb 20 01:34:34 localhost kernel: ? __x64_sys_rseq+0x6b0/0x6b0
Feb 20 01:34:34 localhost kernel: ? __x64_sys_clock_adjtime+0x80/0x80
Feb 20 01:34:34 localhost kernel: do_syscall_64+0x7c/0xb30
Feb 20 01:34:34 localhost kernel: ? exc_page_fault+0x6e/0xc0
Feb 20 01:34:34 localhost kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
Feb 20 01:34:34 localhost kernel: RIP: 0033:0x74f535323089
Feb 20 01:34:34 localhost kernel: Code: 00 00 00 48 89 44 24 18 48 8d 44 24 60 c7 04 24 18 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1e 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Feb 20 01:34:34 localhost kernel: RSP: 002b:000074f5349fde10 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
Feb 20 01:34:34 localhost kernel: RAX: ffffffffffffffda RBX: 0000000003ef850b RCX: 000074f535323089
Feb 20 01:34:34 localhost kernel: RDX: 0000000000000000 RSI: 000000000000641f RDI: 0000000000000003
Feb 20 01:34:34 localhost kernel: RBP: 000074f5349fde80 R08: 0000000000000000 R09: 0000000000000000
Feb 20 01:34:34 localhost kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000120
Feb 20 01:34:34 localhost kernel: R13: 00007ffe11c4d8b6 R14: 000074f5349fecdc R15: 00007ffe11c4d8b8
Feb 20 01:34:34 localhost kernel: </TASK>
Feb 20 01:34:34 localhost kernel: ---[ end trace 0000000000000000 ]---
Feb 20 01:34:34 localhost kernel: [drm] [nvidia-drm] [GPU ID 0x00002d00] nv_drm_reset_input_colorspace failed with error code: -35 !
Feb 20 01:34:35 localhost kernel: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
Feb 20 01:34:35 localhost kernel: NVRM: _kgspLogXid119: Note: Please also check logs above.
Feb 20 01:34:35 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 119, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 3325 (0x20800a6a 0x0).
Feb 20 01:34:35 localhost kernel: NVRM: kgspPrintGspBinBuildId_IMPL: GSP bin buildId: cf812cb3f2f1e8c8209dc2e446fdf536ba9ec88f
Feb 20 01:34:35 localhost kernel: NVRM: task watchdog timeout @ pc:0x1b0128c, partition:4#0, task:3
Feb 20 01:34:35 localhost kernel: NVRM: Reported by libos partition:4#5 kernel v3.1 [0] @ ts:5195030
Feb 20 01:34:35 localhost kernel: NVRM: RISC-V CSR State:
Feb 20 01:34:35 localhost kernel: NVRM: sstatus:0x0000000200000020 sscratch:0xffffffffa3013960 sie:0x0000000000000220 sip:0x0000000000000020
Feb 20 01:34:35 localhost kernel: NVRM: sepc:0x0000000001b0128c stval:0x0000000000000000 scause:0x8000000000000005
Feb 20 01:34:35 localhost kernel: NVRM: RISC-V GPR State:
Feb 20 01:34:35 localhost kernel: NVRM: ra:0x0000000001a9aa0c sp:0x00000007f780f1c0 gp:0x0000000000000000 tp:0x00000007f7c00000
Feb 20 01:34:35 localhost kernel: NVRM: a0:0x00000007f078b5f0 a1:0x00000007f780f2b0 a2:0x0000000000000001 a3:0x0000000000000000
Feb 20 01:34:35 localhost kernel: NVRM: a4:0x0000000000000000 a5:0xffffffffffffffff a6:0x000fffffffffffff a7:0x0000000000000000
Feb 20 01:34:35 localhost kernel: NVRM: s0:0x00000007f780f2a0 s1:0xffffffffffffffff s2:0x0000000000000001 s3:0x000000000419f1d8
Feb 20 01:34:35 localhost kernel: NVRM: s4:0x00000007f078b5f0 s5:0x0000000020358e18 s6:0x00000007f780f2b0 s7:0x000000000000003e
Feb 20 01:34:35 localhost kernel: NVRM: s8:0x0000000001526c96 s9:0x0000000004164180 s10:0x0000000020358e00 s11:0x00000007f0a02e10
Feb 20 01:34:35 localhost kernel: NVRM: t0:0x0000000000000005 t1:0x0000000000000003 t2:0x0000000000000000 t3:0x000fffffffffffff
Feb 20 01:34:35 localhost kernel: NVRM: t4:0x0000000000000000 t5:0x00000007f780f401 t6:0x0000000000000020
Feb 20 01:34:35 localhost kernel: NVRM: Stack Trace:
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001b0128c
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001526c96
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001aa1678
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001acc430
Feb 20 01:34:35 localhost kernel: NVRM: 0x000000000169f602
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001684dec
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001b120ee
Feb 20 01:34:35 localhost kernel: NVRM: 0x000000000143a254
Feb 20 01:34:35 localhost kernel: NVRM: 0x000000000143a3d2
Feb 20 01:34:35 localhost kernel: NVRM: 0x000000000143a6de
Feb 20 01:34:35 localhost kernel: NVRM: 0x000000000179e67c
Feb 20 01:34:35 localhost kernel: NVRM: 0x000000000179ffae
Feb 20 01:34:35 localhost kernel: NVRM: 0x00000000017a0e0c
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001aabf30
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001393af6
Feb 20 01:34:35 localhost kernel: NVRM: 0x000000000139131a
Feb 20 01:34:35 localhost kernel: NVRM: 0x00000000015264b4
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001ad71dc
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001ad8ed2
Feb 20 01:34:35 localhost kernel: NVRM: 0x00000000019eb3f0
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001b36596
Feb 20 01:34:35 localhost kernel: NVRM: 0x00000000019dab2a
Feb 20 01:34:35 localhost kernel: NVRM: PC Trace:
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001b0128c 0x000000000010013e 0x0000000001b0128c 0x0000000001a9aa08 0x00000000010a4e3c
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c
Feb 20 01:34:35 localhost kernel: NVRM: 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c
Feb 20 01:34:35 localhost kernel: NVRM: 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c
Feb 20 01:34:35 localhost kernel: NVRM: 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e
Feb 20 01:34:35 localhost kernel: NVRM: 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e 0x0000000001a9a99c
Feb 20 01:34:35 localhost kernel: NVRM: 0x00000000010a4e1e 0x0000000001a9a99c 0x00000000010a4e1e
Feb 20 01:34:35 localhost kernel: NVRM: Local I/O Register State:
Feb 20 01:34:35 localhost kernel: NVRM: 0x01450800:0x00000000 0x01450900:0xbadf5100 0x01450a00:0x00000000 0x01450c00:0x00000000
Feb 20 01:34:35 localhost kernel: NVRM: 0x01454a00:0x810490d2 0x01454b00:0x010800d0 0x01454c00:0x00080000 0x01400200:0x00000040
Feb 20 01:34:35 localhost kernel: NVRM: GPU0 GSP RPC buffer contains function 4100 (RC_TRIGGERED) sequence 0 and data 0x0000000000000001 0x000000000000006d.
Feb 20 01:34:35 localhost kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Feb 20 01:34:35 localhost kernel: NVRM: entry function sequence data0 data1 ts_start ts_end duration actively_polling
Feb 20 01:34:35 localhost kernel: NVRM: 0 76 GSP_RM_CONTROL 3325 0x0000000020800a6a 0x0000000000000000 0x00064b3c738dfecf 0x0000000000000000 y
Feb 20 01:34:35 localhost kernel: NVRM: -1 76 GSP_RM_CONTROL 3324 0x0000000020800a56 0x000000000000005c 0x00064b3c730a9eea 0x00064b3c730ab453 5481us
Feb 20 01:34:35 localhost kernel: NVRM: -2 76 GSP_RM_CONTROL 3323 0x0000000020802801 0x0000000000000004 0x00064b3c7309be2a 0x00064b3c7309cc97 3693us
Feb 20 01:34:35 localhost kernel: NVRM: -3 76 GSP_RM_CONTROL 3322 0x0000000020802802 0x0000000000000004 0x00064b3c73024d47 0x00064b3c73025657 2320us
Feb 20 01:34:35 localhost kernel: NVRM: -4 76 GSP_RM_CONTROL 3321 0x0000000020802802 0x0000000000000004 0x00064b3c72f2f001 0x00064b3c72f2fa28 2599us
Feb 20 01:34:35 localhost kernel: NVRM: -5 76 GSP_RM_CONTROL 3320 0x0000000020802802 0x0000000000000004 0x00064b3c72e39579 0x00064b3c72e39e2b 2226us
Feb 20 01:34:35 localhost kernel: NVRM: -6 76 GSP_RM_CONTROL 3319 0x0000000020802802 0x0000000000000004 0x00064b3c72d438bc 0x00064b3c72d441b3 2295us
Feb 20 01:34:35 localhost kernel: NVRM: -7 76 GSP_RM_CONTROL 3318 0x0000000020802802 0x0000000000000004 0x00064b3c72c4ddd3 0x00064b3c72c4e560 1933us
Feb 20 01:34:35 localhost kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Feb 20 01:34:35 localhost kernel: NVRM: entry function sequence data0 data1 ts_start ts_end duration during_incomplete_rpc
Feb 20 01:34:35 localhost kernel: NVRM: 0 4100 RC_TRIGGERED 0 0x0000000000000001 0x000000000000006d 0x00064b3c73a08c7a 0x00064b3c73a08c8d 19us y
Feb 20 01:34:35 localhost kernel: NVRM: -1 4102 OS_ERROR_LOG 0 0x0000000000000000 0x0000000000000000 0x00064b3c73a070a3 0x00064b3c73a07966 2243us y
Feb 20 01:34:35 localhost kernel: NVRM: -2 4130 RECOVERY_ACTION 0 0x0000000000000000 0x0000000000000000 0x00064b3c730a9e68 0x00064b3c730a9e74 12us
Feb 20 01:34:35 localhost kernel: NVRM: -3 4102 OS_ERROR_LOG 0 0x0000000000000000 0x0000000000000000 0x00064b3c730a9e3b 0x00064b3c730a9e67 44us
Feb 20 01:34:35 localhost kernel: NVRM: -4 4108 UCODE_LIBOS_PRINT 0 0x0000000000000000 0x0000000000000000 0x00064b3c730a874d 0x00064b3c730a874d
Feb 20 01:34:35 localhost kernel: NVRM: -5 4108 UCODE_LIBOS_PRINT 0 0x0000000000000000 0x0000000000000000 0x00064b3c730a874d 0x00064b3c730a874d
Feb 20 01:34:35 localhost kernel: NVRM: -6 4108 UCODE_LIBOS_PRINT 0 0x0000000000000000 0x0000000000000000 0x00064b3c730a86cc 0x00064b3c730a86cc
Feb 20 01:34:35 localhost kernel: NVRM: -7 4108 UCODE_LIBOS_PRINT 0 0x0000000000000000 0x0000000000000000 0x00064b3c730a86cb 0x00064b3c730a86cb
Feb 20 01:34:35 localhost kernel: CPU: 6 UID: 0 PID: 0 Comm: swapper/6 Tainted: G W OE 6.18.10 #6 PREEMPT(full) f615b8cac33edf2a38e4a72344c6b408eacc033d
Feb 20 01:34:35 localhost kernel: Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Feb 20 01:34:35 localhost kernel: Call Trace:
Feb 20 01:34:35 localhost kernel: <IRQ>
Feb 20 01:34:35 localhost kernel: dump_stack_lvl+0x4d/0x70
Feb 20 01:34:35 localhost kernel: kgspPrintGspBinBuildId_IMPL+0xa91/0xdd0 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: vgpuIsCallingContextPlugin+0x26a5/0x30b0 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: ? osGetCurrentThread+0x26/0x60 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: ? rmDeviceGpuLockIsOwner+0x29/0x90 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: rpcRmApiControl_GSP+0x76f/0x940 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: krcWatchdog_IMPL+0x4d0/0x530 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: ? os_get_monotonic_time_ns_hr+0xf0/0xf0 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: krcWatchdogTimerProc+0x48/0x70 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: osGetNvGlobalRegistryDword+0x8b/0x100 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: osRun1HzCallbacksNow+0xa4/0x130 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: rm_run_rc_callback+0x6c/0x90 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: ? __x86_indirect_thunk_r15+0xd/0x26d [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: nvidia_isr+0x1633/0x1a40 [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: call_timer_fn+0x2c/0x1e0
Feb 20 01:34:35 localhost kernel: __run_timers+0x575/0x870
Feb 20 01:34:35 localhost kernel: ? __x86_indirect_thunk_r15+0xd/0x26d [nvidia d77db9053aae9deb06d908cd2319f68dd09798a6]
Feb 20 01:34:35 localhost kernel: ? sched_clock+0x10/0x20
Feb 20 01:34:35 localhost kernel: ? call_timer_fn+0x1e0/0x1e0
Feb 20 01:34:35 localhost kernel: ? sched_autogroup_create_attach+0x340/0x340
Feb 20 01:34:35 localhost kernel: ? _raw_spin_lock_irq+0x84/0xe0
Feb 20 01:34:35 localhost kernel: ? _raw_spin_lock_bh+0xe0/0xe0
Feb 20 01:34:35 localhost kernel: ? psi_group_change+0x3df/0x840
Feb 20 01:34:35 localhost kernel: ? sched_clock+0x10/0x20
Feb 20 01:34:35 localhost kernel: timer_expire_remote+0xf2/0x190
Feb 20 01:34:35 localhost kernel: ? timer_base_is_idle+0x20/0x20
Feb 20 01:34:35 localhost kernel: tmigr_handle_remote+0x5a8/0xbe0
Feb 20 01:34:35 localhost kernel: ? tmigr_cpu_activate+0x150/0x150
Feb 20 01:34:35 localhost kernel: ? debug_object_destroy+0x3f0/0x3f0
Feb 20 01:34:35 localhost kernel: ? _raw_spin_lock_irq+0x84/0xe0
Feb 20 01:34:35 localhost kernel: ? __hrtimer_run_queues+0x3b0/0x7a0
Feb 20 01:34:35 localhost kernel: ? sched_clock+0x10/0x20
Feb 20 01:34:35 localhost kernel: ? sched_clock_cpu+0x69/0x5a0
Feb 20 01:34:35 localhost kernel: run_timer_softirq+0x1f7/0x280
Feb 20 01:34:35 localhost kernel: ? __run_timers+0x870/0x870
Feb 20 01:34:35 localhost kernel: ? ktime_get+0x5e/0x150
Feb 20 01:34:35 localhost kernel: handle_softirqs+0x198/0x580
Feb 20 01:34:35 localhost kernel: ? tasklet_unlock_wait+0x50/0x50
Feb 20 01:34:35 localhost kernel: ? irqtime_account_irq+0x44/0x2b0
Feb 20 01:34:35 localhost kernel: irq_exit_rcu+0xb8/0xf0
Feb 20 01:34:35 localhost kernel: sysvec_apic_timer_interrupt+0x7f/0xc0
Feb 20 01:34:35 localhost kernel: </IRQ>
Feb 20 01:34:35 localhost kernel: <TASK>
Feb 20 01:34:35 localhost kernel: asm_sysvec_apic_timer_interrupt+0x1a/0x20
Feb 20 01:34:35 localhost kernel: RIP: 0010:cpuidle_enter_state+0xc9/0x4b0
Feb 20 01:34:35 localhost kernel: Code: c5 fb e8 9a f5 ff ff 8b 73 04 bf ff ff ff ff 49 89 c4 e8 da 9a e6 fe 31 ff e8 53 fe c1 fb 45 84 ff 0f 85 78 01 00 00 fb 85 ed <0f> 88 4b 01 00 00 48 8b 3c 24 e8 08 87 e6 fe 4c 89 e7 4c 63 e5 49
Feb 20 01:34:35 localhost kernel: RSP: 0018:ffff888100eefd88 EFLAGS: 00000202
Feb 20 01:34:35 localhost kernel: RAX: dffffc0000000000 RBX: ffff88810a943800 RCX: 0000000000000000
Feb 20 01:34:35 localhost kernel: RDX: ffff889f5db3f700 RSI: 1ffff113ebb6809b RDI: ffff889f5db404d8
Feb 20 01:34:35 localhost kernel: RBP: 0000000000000002 R08: 0000000000000002 R09: ffffffff897d3e19
Feb 20 01:34:35 localhost kernel: R10: ffff889f5db3abeb R11: 00000000000001ef R12: 000007c911aa77e5
Feb 20 01:34:35 localhost kernel: R13: ffffffff8c13bee0 R14: 0000000000000002 R15: 0000000000000000
Feb 20 01:34:35 localhost kernel: ? ct_kernel_enter.isra.0+0x59/0xb0
Feb 20 01:34:35 localhost kernel: ? cpuidle_enter_state+0xbd/0x4b0
Feb 20 01:34:35 localhost kernel: cpuidle_enter+0x4c/0xa0
Feb 20 01:34:35 localhost kernel: do_idle+0x2b7/0x3c0
Feb 20 01:34:35 localhost kernel: ? arch_cpu_idle_exit+0x40/0x40
Feb 20 01:34:35 localhost kernel: ? __switch_to+0xb10/0x10f0
Feb 20 01:34:35 localhost kernel: cpu_startup_entry+0x53/0x70
Feb 20 01:34:35 localhost kernel: start_secondary+0x200/0x2b0
Feb 20 01:34:35 localhost kernel: ? set_cpu_sibling_map+0x2360/0x2360
Feb 20 01:34:35 localhost kernel: common_startup_64+0x13e/0x141
Feb 20 01:34:35 localhost kernel: </TASK>
Feb 20 01:34:35 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
Feb 20 01:34:35 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
Feb 20 01:34:35 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
Feb 20 01:34:35 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
Feb 20 01:34:35 localhost kernel: NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x110624, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
Feb 20 01:34:35 localhost kernel: NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x11062c, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
Feb 20 01:34:35 localhost kernel: NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x111404, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
Feb 20 01:34:35 localhost kernel: NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x111408, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
Feb 20 01:34:35 localhost kernel: NVRM: kflcnDumpTracepc_GA102: Trace buffer blocked, skipping.
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc : 015261f2
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvCpuctl : 00000180
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqmask : 810490d2
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqdest : 010800d0
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrStat : 00000000
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrInfo : badf5100
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrAddr : 0000000000000000
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvHubErrStat : 00000000
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconMailbox : 0:000003e0 1:000003e0
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqstat : 00400050
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqmode : ffacfc24
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifInstblk : 00000000
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCtl : badf5720
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifThrottle : badf5720
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkBlk : 0:00000000 1:00000000
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkCtl : 0:00000000 1:00000000
Feb 20 01:34:35 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: fbifCg1 : 0000000f
Feb 20 01:34:35 localhost kernel: NVRM: _kgspLogXid119: ********************************************************************************
Feb 20 01:34:35 localhost kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 3325!
Feb 20 01:34:38 localhost kernel: NVRM: GPU0 _kgspProcessRpcEvent: Unexpected RPC event from GPU0: 0x4c (GSP_RM_CONTROL), sequence: 3325
Feb 20 01:34:46 localhost kernel: NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked! Notify Timeout Seconds: 7
Feb 20 01:34:46 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 109, pid=619, name=(udev-worker), channel 0x00000001, errorString CTX SWITCH TIMEOUT, Info 0x4000
Feb 20 01:34:48 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 20 01:34:50 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 20 01:34:52 localhost kernel: NVRM: Xid (PCI:0000:2d:00): 119, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 3347 (0x20800a6a 0x0).
Feb 20 01:34:52 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
Feb 20 01:34:52 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
Feb 20 01:34:52 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
Feb 20 01:34:52 localhost kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
Feb 20 01:34:52 localhost kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc : 01a98e76
Feb 20 01:34:52 localhost kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 3347!
Feb 20 01:34:58 localhost kernel: NVRM: GPU0 _kgspProcessRpcEvent: Unexpected RPC event from GPU0: 0x4c (GSP_RM_CONTROL), sequence: 3347
Feb 20 01:35:00 localhost kernel: NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked! Notify Timeout Seconds: 7
