UPSTREAM PR #1313: fix: ucache: normalize reuse error #73
Conversation
**Overview**

Analysis of commit e082a01 ("ucache: normalize error scaling") across 49,737 functions (60 modified, 0 new, 0 removed, 49,677 unchanged) reveals minor, localized performance regressions justified by algorithmic improvements to the U-Cache system.

Binaries Analyzed:
Power consumption remains essentially unchanged, indicating energy-neutral algorithmic enhancements.

**Function Analysis**

- `UCacheState::before_condition()` (both binaries): Performance-critical cache decision function called per denoising step. Throughput time increased +38.5% (+265ns absolute, 689ns → 955ns); response time +0.67-0.79% (+508-592ns, ~75,530ns → ~76,039ns). The changes implement EMA-based error normalization with consecutive-skip penalties, replacing simpler static reference-based logic. The 265ns self-time increase is negligible against the 75.8μs total execution dominated by tensor operations, and is vastly outweighed by potential millisecond-scale savings from improved cache hit rates.
- `UCacheState::get_adaptive_threshold()` (both binaries): Threshold calculation function showing +32.3% response time (+116ns, 359ns → 475ns) and +11.6% throughput time (+37ns, 322ns → 359ns). Added progress clamping (`std::max`/`std::min`) and enhanced fallback logic improve correctness and prevent invalid threshold multipliers.
- `std::_Hashtable::end()` (sd-cli): Standard-library iterator accessor showing +136.7% response time (+162ns, 119ns → 281ns) and +194.7% throughput time (+162ns, 83ns → 245ns). The regression stems from increased call frequency and reduced inlining opportunities due to expanded cache-state complexity, not from direct code changes.
- `UCacheConfig` constructors (both binaries): Configuration-struct initialization showing +10% (+8.35-8.39ns, 83ns → 92ns) from adding

Other analyzed functions showed improvements (`vector::back` -54.5%, `__shared_count` constructors -17.9%) or minor regressions in standard-library operations with negligible absolute impact.

**Additional Findings**

The U-Cache improvements directly optimize ML inference by gating GPU-accelerated U-Net evaluations. The 10-46 microsecond cumulative overhead per image (0.001-0.023% of typical 200-5000ms inference time) is negligible compared to the 10-100ms saved per avoided U-Net evaluation.
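The EMA-based error normalization with consecutive-skip penalties described above is not shown in this summary. Below is a minimal sketch of how such a scheme can work; the struct name, fields, and constants (`alpha`, `skip_penalty`) are all hypothetical and are not the actual stable-diffusion.cpp implementation.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch: EMA-normalized cache-reuse error with a
// consecutive-skip penalty. Names and constants are illustrative only.
struct UCacheSketch {
    double ema_error = 0.0;      // running EMA of per-step output change
    bool ema_init = false;
    int consecutive_skips = 0;
    double alpha = 0.3;          // EMA smoothing factor
    double skip_penalty = 0.25;  // penalty per consecutive skipped step

    // Returns true when the cached result may be reused for this step.
    bool should_skip(double raw_error, double threshold) {
        // Update the EMA so the error scale adapts to runtime signal
        // dynamics instead of a fixed, model-dependent reference.
        if (!ema_init) {
            ema_error = raw_error;
            ema_init = true;
        } else {
            ema_error = alpha * raw_error + (1.0 - alpha) * ema_error;
        }

        // Normalize by the EMA; guard against division by ~0.
        double normalized = raw_error / std::max(ema_error, 1e-8);

        // Penalize long skip runs so stale caches get refreshed.
        double effective =
            normalized * (1.0 + skip_penalty * consecutive_skips);

        bool skip = effective < threshold;
        consecutive_skips = skip ? consecutive_skips + 1 : 0;
        return skip;
    }
};
```

The penalty term makes each additional consecutive skip harder to justify, which bounds how long the cached activations can drift from the true U-Net output.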
Changes prioritize inference quality and numerical stability, representing mature optimization practices for production ML systems.

🔎 Full breakdown: Loci Inspector
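The progress clamping via `std::max`/`std::min` noted for `UCacheState::get_adaptive_threshold()` can be sketched as follows; the function name, multiplier policy, and fallback behavior here are hypothetical, not the actual implementation.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch of an adaptive threshold with progress clamping,
// in the spirit of the std::max/std::min guards mentioned above.
double get_adaptive_threshold_sketch(double base_threshold,
                                     int step, int total_steps) {
    // Fallback: without valid step info, return the unmodified base.
    if (total_steps <= 0) return base_threshold;

    // Clamp progress to [0, 1] so out-of-range step counters cannot
    // produce an invalid threshold multiplier.
    double progress = static_cast<double>(step) / total_steps;
    progress = std::max(0.0, std::min(1.0, progress));

    // Example policy: permit more reuse late in sampling (multiplier
    // grows linearly from 1.0 to 2.0 as progress goes 0 -> 1).
    double multiplier = 1.0 + progress;
    return base_threshold * multiplier;
}
```

Without the clamp, a step counter that overshoots `total_steps` (or goes negative during warm-up) would yield a multiplier outside the intended range, which is exactly the invalid-multiplier failure the change guards against.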
dd19ab8 to 98460a7
Note
Source pull request: leejet/stable-diffusion.cpp#1313
This PR updates the ucache skip math so that reuse decisions are normalized by runtime signal dynamics rather than by raw model scale, which made behavior inconsistent across checkpoints.
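The motivation for normalizing by runtime signal dynamics can be illustrated with a toy comparison; both functions below are hypothetical and only demonstrate the scale-dependence problem, not the PR's actual code.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative only: a fixed absolute threshold is checkpoint-dependent.
// The same 5% relative change passes for a small-activation model but
// fails for a large-activation one.
bool reuse_raw(double delta, double abs_threshold) {
    return delta < abs_threshold;
}

// Normalizing by a runtime reference (e.g. a running output magnitude)
// makes the decision scale-invariant across checkpoints.
bool reuse_normalized(double delta, double reference,
                      double rel_threshold) {
    return delta / std::max(reference, 1e-8) < rel_threshold;
}
```

With a fixed threshold of 0.1, a model whose outputs sit near magnitude 1.0 reuses on a 0.05 change while a model near magnitude 10.0 refuses the equivalent 0.5 change; the normalized form treats both as the same 5% change.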