fix: surface deep_gemm load failures instead of silently disabling FP8 by S1ro1 · Pull Request #2803 · PrimeIntellect-ai/prime-rl

S1ro1 · 2026-06-13T22:55:18Z

Problem

fp8_linear.py wraps the deep_gemm import in a broad except ImportError:

try:
    import deep_gemm
except ImportError:
    deep_gemm = None

This is meant to support CPU-only environments where deep_gemm isn't installed. But ImportError also fires when deep_gemm is installed and its native extension fails to load — e.g. a wheel that links libcudart.so.13 running on a host with only CUDA 12. In that case the symbol is silently set to None, and the real failure doesn't surface until the first FP8 forward pass, as a misleading:

AttributeError: 'NoneType' object has no attribute 'fp8_gemm_nt'
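The failure mode above can be reproduced without deep_gemm or a GPU. The sketch below is only an illustration: `fake_deep_gemm`, `BrokenNativeLoader`, `BrokenNativeFinder`, and `broad_import` are hypothetical names, and the loader simply raises the same kind of `ImportError` the dynamic linker would produce for a missing `libcudart.so.13`.

```python
import importlib
import importlib.abc
import importlib.util
import sys


class BrokenNativeLoader(importlib.abc.Loader):
    """Hypothetical loader: pretends the package exists but its extension cannot load."""

    def create_module(self, spec):
        return None  # fall back to the default module object

    def exec_module(self, module):
        # Same exception type a missing shared library produces at import time.
        raise ImportError("libcudart.so.13: cannot open shared object file")


class BrokenNativeFinder(importlib.abc.MetaPathFinder):
    """Hypothetical finder that makes 'fake_deep_gemm' look installed."""

    def find_spec(self, fullname, path, target=None):
        if fullname == "fake_deep_gemm":
            return importlib.util.spec_from_loader(fullname, BrokenNativeLoader())
        return None  # let the regular finders handle everything else


sys.meta_path.insert(0, BrokenNativeFinder())


def broad_import(name):
    """Mirrors the original pattern: any ImportError becomes None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# The load failure is swallowed here; nothing is reported at import time.
backend = broad_import("fake_deep_gemm")

# The problem only shows up later, at first use, with an unrelated-looking error.
try:
    backend.fp8_gemm_nt
except AttributeError as exc:
    late_error = str(exc)
```

Here `backend` ends up as `None` even though the package was "installed", and the first attribute access produces the misleading NoneType message rather than the original linker error.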

We hit this on a multi-node GLM-5.1 run: the deep_gemm release wheel linked against CUDA 13 libraries while the nodes shipped CUDA 12.9, so every FP8 matmul died with the NoneType error deep in the trainer. It took hours of misdirected debugging to trace it back to the swallowed import-time error libcudart.so.13: cannot open shared object file.

Fix

Narrow the except to ModuleNotFoundError:

- Not installed (CPU-only) → still raises ModuleNotFoundError and falls back to None. Behavior is unchanged.
- Installed but failed to load (missing or mismatched CUDA libs) → raises a plain ImportError, which now propagates with its real cause, making the misconfiguration obvious at import time.

Aligns with the "errors should never pass silently" principle — the CPU fallback stays, but a broken native install no longer masquerades as a missing-attribute bug.
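A minimal sketch of the narrowed handler, assuming a generic helper rather than the actual code in fp8_linear.py. `optional_import`, `fake_broken_backend`, `fake_backend_that_is_not_installed`, and the loader/finder classes are hypothetical names used only to simulate the two cases without CUDA.

```python
import importlib
import importlib.abc
import importlib.util
import sys


class BrokenNativeLoader(importlib.abc.Loader):
    """Hypothetical loader that fails the way a mislinked extension would."""

    def create_module(self, spec):
        return None  # fall back to the default module object

    def exec_module(self, module):
        raise ImportError("libcudart.so.13: cannot open shared object file")


class BrokenNativeFinder(importlib.abc.MetaPathFinder):
    """Hypothetical finder that makes 'fake_broken_backend' look installed."""

    def find_spec(self, fullname, path, target=None):
        if fullname == "fake_broken_backend":
            return importlib.util.spec_from_loader(fullname, BrokenNativeLoader())
        return None  # let the regular finders handle everything else


sys.meta_path.insert(0, BrokenNativeFinder())


def optional_import(name):
    """Return the module, or None only when it is genuinely not installed."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        # Narrow handler: a plain ImportError (broken load) is not caught here.
        return None


# Case 1: module is absent, so the fallback to None still applies.
missing = optional_import("fake_backend_that_is_not_installed")

# Case 2: module is present but fails to load, so the real cause propagates.
try:
    optional_import("fake_broken_backend")
    surfaced = None
except ImportError as exc:
    surfaced = exc
```

Because `ModuleNotFoundError` is a subclass of `ImportError`, catching only the subclass keeps the "not installed" fallback while letting every other import failure reach the caller with its original message.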

`import deep_gemm` was wrapped in a broad `except ImportError: deep_gemm = None`. That correctly handles CPU-only environments (module not installed), but it also swallows a genuine load failure: deep_gemm present but its CUDA libs unresolved (e.g. a wheel linking libcudart.so.13 on a CUDA 12 host). The import silently sets None and the failure resurfaces much later, in the first FP8 forward, as a misleading `AttributeError: 'NoneType' object has no attribute 'fp8_gemm_nt'`.

Narrow the except to `ModuleNotFoundError` so "not installed" still falls back to None, while a broken native load propagates with its real cause (the missing .so), making the misconfiguration obvious at import time.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

S1ro1 wants to merge 1 commit into main from fix/deep-gemm-import-surface
