Fix for Black Image Issue on Tesla V100 (Volta Architecture)
Issue Description
When using Q4_0 quantized models on Tesla V100 (Volta architecture), the generated images are completely black. This issue does not occur on newer GPUs like RTX 4060 Ti (Ada Lovelace architecture).
Root Cause Analysis
The V100 (Volta architecture) Tensor Cores can experience FP16 overflow when using CUBLAS_COMPUTE_16F (FP16 accumulation) for matrix multiplication. The intermediate calculation values can exceed the FP16 representable range (max ~65504), causing overflow and resulting in black images.
Newer architectures (Turing, Ampere, Ada Lovelace) do not exhibit this issue, presumably because their Tensor Cores handle the intermediate accumulation differently at the hardware level.
Proposed Fix
File Location
ggml/src/ggml-cuda/ggml-cuda.cu
Modification (around line 1310)
Add GGML_CUDA_CC_VOLTA to the condition for FP32 accumulation:
```cpp
// Before:
if (GGML_CUDA_CC_IS_CDNA(cc) || GGML_CUDA_CC_IS_RDNA4(cc)) {
    // FP32 accumulation path
    ...
}

// After:
if (GGML_CUDA_CC_IS_CDNA(cc) || GGML_CUDA_CC_IS_RDNA4(cc) || cc == GGML_CUDA_CC_VOLTA) {
    // FP32 accumulation path
    ...
}
```
Technical Details
| Setting | Input Format | Accumulator Precision | Output Format |
|---|---|---|---|
| CUBLAS_COMPUTE_16F | FP16 | FP16 | FP16 |
| CUBLAS_COMPUTE_32F | FP16 | FP32 | FP16/FP32 |
V100 Tensor Cores support FP16 input + FP32 accumulation with the same throughput (125 TFLOPS) as FP16 accumulation.
Performance Comparison
Tested on Tesla V100-SXM2-32GB with Z-Image Turbo (Q4_0), 512x512, 4 steps:
| Metric | Before Fix (with --type bf16) | After Fix (default) |
|---|---|---|
| Model Loading | 7.44s | 2.48s |
| Text Encoding | 213ms | 68ms |
| Sampling | 5.68s | 1.63s |
| VAE Decoding | 0.26s | 0.26s |
| Total Generation | 6.23s | 1.99s |
| VRAM Usage | 20.3 GB | 7.7 GB |
The fix provides:
- ~3x faster total generation time (6.23 s → 1.99 s)
- ~62% lower VRAM usage (20.3 GB → 7.7 GB)
- No need for the `--type bf16` workaround
GPU Architecture Reference
| GPU | Compute Capability | Architecture | BF16 Support |
|---|---|---|---|
| Tesla V100 | 7.0 | Volta | No |
| RTX 2080 Ti | 7.5 | Turing | No |
| RTX 3090 | 8.6 | Ampere | Yes |
| RTX 4060 Ti | 8.9 | Ada Lovelace | Yes |
Disclaimer
Important: This fix was identified and implemented with the assistance of AI (Claude/GPT). I am not a CUDA programming expert and cannot guarantee that this modification:
- Is the optimal solution for the problem
- Does not have unintended side effects on other operations
- Is compatible with all use cases and hardware configurations
I am sharing this information for reference purposes only. The maintainers should evaluate whether this approach is appropriate for inclusion in the main codebase. If there are better solutions or if this modification could cause issues elsewhere, please disregard this suggestion.
Related Information
- The `--type bf16` workaround forces full BF16 computation, which works but significantly increases memory usage and computation time
- This fix allows V100 users to run quantized models (Q4_0, Q8_0, etc.) efficiently without the BF16 workaround
- The modification only affects the Volta architecture (compute capability 7.0)