Fix for nvidia-smi hanging after approximately 66 days of uptime.
The function os_get_monotonic_time_ns() in kernel-open/nvidia/os-interface.c uses jiffies_to_timespec64(jiffies, &ts) to obtain
monotonic time. The jiffies counter is an unsigned 32-bit value that wraps at 2^32 ticks. At HZ=750 this occurs after 2^32 / 750 / 86400 = 66.3 days. When the wrap
occurs, jiffies_to_timespec64() returns a near-zero value, causing time to appear to jump backwards.
This breaks timeout comparisons throughout the driver. Code in thread_state.c, locks.c,
and gpu_timeout.c stores a start time and later checks if currentTime >= startTime + timeout. After the wrap, currentTime is
suddenly much smaller than startTime, so these comparisons behave incorrectly and all operations appear to have timed out immediately.
The fix replaces the jiffies-based implementation with ktime_get_raw_ts64(), which reads from hardware timers and provides a monotonic 64-bit nanosecond timestamp that won't
wrap for centuries. This matches the implementation already used by os_get_monotonic_time_ns_hr() in the same file.
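As a rough check on the "won't wrap for centuries" claim, the sketch below flattens a `struct timespec` into a 64-bit nanosecond count and estimates how long such a counter lasts. It is a user-space illustration only, not the driver's code; the helper names `timespec_to_ns` and `u64_ns_wrap_years` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: flatten a non-negative timespec into nanoseconds. */
uint64_t timespec_to_ns(const struct timespec *ts)
{
    assert(ts != NULL && ts->tv_sec >= 0);
    return (uint64_t)ts->tv_sec * 1000000000ULL + (uint64_t)ts->tv_nsec;
}

/* Years until an unsigned 64-bit nanosecond counter wraps (365.25-day years). */
double u64_ns_wrap_years(void)
{
    return (double)UINT64_MAX / 1e9 / 86400.0 / 365.25;
}
```

`u64_ns_wrap_years()` evaluates to roughly 584 years, so the width of the nanosecond value itself is not a practical limit.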
Yeah, on 64bit systems (the only ones supported by this codebase), jiffies will be a regular 64bit variable, so this won't overflow. Also, as mentioned in #971 (comment) the nvidia-smi hang bug is not part of this codebase at all. (Also, who uses CONFIG_HZ=750?)
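The wrap arithmetic from the description, and the 64-bit case raised in this comment, can be compared with a small sketch. The function name `wrap_days` is hypothetical and the code is plain user-space C, not part of the driver.

```c
#include <assert.h>

/* Days until a free-running tick counter of `bits` bits wraps
 * when it advances `hz` times per second. */
double wrap_days(unsigned bits, double hz)
{
    assert(hz > 0.0);
    double ticks = 1.0;
    for (unsigned i = 0; i < bits; i++)
        ticks *= 2.0;            /* builds 2^bits; exact in a double */
    return ticks / hz / 86400.0; /* ticks -> seconds -> days */
}
```

`wrap_days(32, 750.0)` gives about 66.3 days, matching the figure in the description, while `wrap_days(64, 750.0)` is on the order of 10^11 days.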
That said, I think moving from jiffies to ktime here might be worthwhile anyway. However, AFAICT ktime_get_raw_ts64() was added in 4.18, and we still support 4.4 and later, so it would need some extra conftests to decide which one to use, and I don't think it's worth having diverging behavior based on the kernel version. Maybe wait until we drop those kernels and then move unconditionally.
After the wrap, currentTime will suddenly be smaller than startTime, so it will be smaller than startTime + timeout too. So why would every operation appear to time out?
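The question can be made concrete with a sketch of two ways a timeout check might be written. Both functions are hypothetical; which form the driver actually uses in each file is not established in this thread. With the additive form quoted in the description, a backward jump makes the check fail to fire. With an elapsed-time form, unsigned subtraction wraps to a huge value and the check fires immediately.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical additive form, as quoted in the PR description. */
int timed_out_additive(uint64_t current, uint64_t start, uint64_t timeout)
{
    assert(timeout <= UINT64_MAX - start); /* start + timeout must not overflow */
    return current >= start + timeout;
}

/* Hypothetical elapsed-time form: if current < start, the unsigned
 * subtraction wraps around to a very large value. */
int timed_out_elapsed(uint64_t current, uint64_t start, uint64_t timeout)
{
    return (current - start) >= timeout;
}
```

For `start = 1000`, `timeout = 100`, and a clock that has jumped back to `current = 5`, `timed_out_additive` returns 0 and `timed_out_elapsed` returns 1, so whether a backward jump looks like "never times out" or "times out immediately" depends on how the comparison is written.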