Fix for Black Image Issue on Tesla V100 (Volta Architecture)
Issue Description
When using Q4_0 quantized models on Tesla V100 (Volta architecture), the generated images are completely black. This issue does not occur on newer GPUs like RTX 4060 Ti (Ada Lovelace architecture).
Root Cause Analysis
The V100 (Volta architecture) Tensor Cores can experience FP16 overflow when using CUBLAS_COMPUTE_16F (FP16 accumulation) for matrix multiplication. The intermediate calculation values can exceed the FP16 representable range (max ~65504), causing overflow and resulting in black images.
Newer architectures (Turing, Ampere, Ada Lovelace) do not exhibit this issue, presumably because their Tensor Cores handle the intermediate accumulation differently at the hardware level.
Proposed Fix
File Location
ggml/src/ggml-cuda/ggml-cuda.cu
Modification (around line 1310)
Add GGML_CUDA_CC_VOLTA to the condition for FP32 accumulation:
```cpp
// Before:
if (GGML_CUDA_CC_IS_CDNA(cc) || GGML_CUDA_CC_IS_RDNA4(cc)) {
    // FP32 accumulation path
    ...
}

// After:
if (GGML_CUDA_CC_IS_CDNA(cc) || GGML_CUDA_CC_IS_RDNA4(cc) || cc == GGML_CUDA_CC_VOLTA) {
    // FP32 accumulation path
    ...
}
```
Technical Details
| Setting | Input Format | Accumulator Precision | Output Format |
|---|---|---|---|
| CUBLAS_COMPUTE_16F | FP16 | FP16 | FP16 |
| CUBLAS_COMPUTE_32F | FP16 | FP32 | FP16/FP32 |
V100 Tensor Cores support FP16 input + FP32 accumulation with the same throughput (125 TFLOPS) as FP16 accumulation.
Performance Comparison
Tested on Tesla V100-SXM2-32GB with Z-Image Turbo (Q4_0), 512x512, 4 steps:
| Metric | Before Fix (with --type bf16) | After Fix (default) |
|---|---|---|
| Model Loading | 7.44s | 2.48s |
| Text Encoding | 213ms | 68ms |
| Sampling | 5.68s | 1.63s |
| VAE Decoding | 0.26s | 0.26s |
| Total Generation | 6.23s | 1.99s |
| VRAM Usage | 20.3 GB | 7.7 GB |
The fix provides:
- ~3x faster total generation time (6.23 s → 1.99 s)
- ~62% lower VRAM usage (20.3 GB → 7.7 GB)
- No need for the `--type bf16` workaround
GPU Architecture Reference
| GPU | Compute Capability | Architecture | BF16 Support |
|---|---|---|---|
| Tesla V100 | 7.0 | Volta | No |
| RTX 2080 Ti | 7.5 | Turing | No |
| RTX 3090 | 8.6 | Ampere | Yes |
| RTX 4060 Ti | 8.9 | Ada Lovelace | Yes |
Disclaimer
Important: This fix was identified and implemented with the assistance of AI (Claude/GPT). I am not a CUDA programming expert and cannot guarantee that this modification:
- Is the optimal solution for the problem
- Does not have unintended side effects on other operations
- Is compatible with all use cases and hardware configurations
I am sharing this information for reference purposes only. The maintainers should evaluate whether this approach is appropriate for inclusion in the main codebase. If there are better solutions or if this modification could cause issues elsewhere, please disregard this suggestion.
Related Information
- The `--type bf16` workaround forces full BF16 computation, which works but significantly increases memory usage and computation time
- This fix allows V100 users to run quantized models (Q4_0, Q8_0, etc.) efficiently without the BF16 workaround
- The modification only affects the Volta architecture (compute capability 7.0)