fix: surface deep_gemm load failures instead of silently disabling FP8#2803

Draft
S1ro1 wants to merge 1 commit into main from fix/deep-gemm-import-surface

S1ro1 commented Jun 13, 2026

Collaborator

Problem

fp8_linear.py wraps the deep_gemm import in a catch-all except ImportError:

try:
    import deep_gemm
except ImportError:
    deep_gemm = None

This is meant to support CPU-only environments where deep_gemm isn't installed. But ImportError also fires when deep_gemm is installed and its native extension fails to load — e.g. a wheel that links libcudart.so.13 running on a host with only CUDA 12. In that case the symbol is silently set to None, and the real failure doesn't surface until the first FP8 forward pass, as a misleading:

AttributeError: 'NoneType' object has no attribute 'fp8_gemm_nt'

We hit this on a multi-node GLM-5.1 run: the deep_gemm release wheel linked CUDA 13 libs while the nodes shipped CUDA 12.9, so every FP8 matmul died with the NoneType error deep in the trainer — hours of misdirection before tracing it back to a swallowed libcudart.so.13: cannot open shared object file at import.
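
The failure mode above can be reproduced without deep_gemm or a GPU. The sketch below is illustrative only: `fake_broken_gemm` is a hypothetical stand-in module that raises the same kind of `ImportError` a missing shared library would, and `import_with_broad_except` mirrors the old pattern rather than quoting fp8_linear.py.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a package whose native extension fails to load.
FAKE_NAME = "fake_broken_gemm"
FAKE_SOURCE = (
    'raise ImportError("libcudart.so.13: cannot open shared object file: '
    'No such file or directory")\n'
)


def import_with_broad_except(name):
    """Old pattern: every ImportError, whatever its cause, becomes None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def demo() -> str:
    """Import the broken stand-in, then use it the way an FP8 forward would."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, f"{FAKE_NAME}.py").write_text(FAKE_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            gemm = import_with_broad_except(FAKE_NAME)  # real cause is lost here
        finally:
            sys.path.remove(tmp)

    try:
        gemm.fp8_gemm_nt  # first use, far away from the import site
    except AttributeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(demo())
```

The message returned by `demo()` mentions only `NoneType`; nothing in it points back at the shared-library load failure.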

Fix

Narrow the except to ModuleNotFoundError:

  • Not installed (CPU-only) → still raises ModuleNotFoundError, falls back to None. Unchanged behavior.
  • Installed but failed to load (missing/mismatched CUDA libs) → raises a plain ImportError, which now propagates with its real cause, making the misconfiguration obvious at import time.

Aligns with the "errors should never pass silently" principle — the CPU fallback stays, but a broken native install no longer masquerades as a missing-attribute bug.
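
A minimal sketch of the two cases listed above, again using a hypothetical stand-in module instead of deep_gemm. `import_optional` and `check_behaviour` are illustrative names, not code from this PR; the only behavior taken from the PR is narrowing the handler to `ModuleNotFoundError`.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_optional(name):
    """Return the module, or None only when it cannot be found at all."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        # Not installed (for example a CPU-only environment): keep the fallback.
        return None


def check_behaviour():
    """Return (result for a missing module, message from a broken module)."""
    # Case 1: module is not installed, so the fallback still applies.
    missing = import_optional("surely_not_installed_pkg_xyz")

    # Case 2: module is found but fails while loading (simulated here by a
    # hypothetical module that raises a plain ImportError on import).
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "fake_broken_gemm_fixed.py").write_text(
            'raise ImportError("libcudart.so.13: cannot open shared object file")\n'
        )
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            import_optional("fake_broken_gemm_fixed")
        except ImportError as exc:
            propagated = str(exc)  # the real cause reaches the caller
        else:
            propagated = None
        finally:
            sys.path.remove(tmp)

    return missing, propagated


if __name__ == "__main__":
    print(check_behaviour())
```

One caveat that may be worth checking separately: `ModuleNotFoundError` is also raised when a package is installed but one of the modules it imports is missing, so that case would still fall back to None. Comparing the exception's `name` attribute against the package name is one way to tell the two apart; the PR as described does not do this.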

`import deep_gemm` was wrapped in a catch-all `except ImportError: deep_gemm = None`.
That correctly handles CPU-only environments (module not installed), but it also
swallows a genuine load failure — deep_gemm present but its CUDA libs unresolved
(e.g. a wheel linking libcudart.so.13 on a CUDA 12 host). The import silently
sets None and the failure resurfaces much later, in the first FP8 forward, as a
misleading `AttributeError: 'NoneType' object has no attribute 'fp8_gemm_nt'`.

Narrow the except to `ModuleNotFoundError` so "not installed" still falls back to
None, while a broken native load propagates with its real cause (the missing
.so), making the misconfiguration obvious at import time.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>