
UPSTREAM PR #1313: fix: ucache: normalize reuse error #73

Open
loci-dev wants to merge 1 commit into main from loci/pr-1313-ucache-fix

Conversation


@loci-dev loci-dev commented Mar 4, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1313

This PR updates the ucache skip math so that reuse decisions are normalized by runtime signal dynamics rather than raw model scale, which previously made skip behavior inconsistent across checkpoints.

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod March 4, 2026 04:15 — with GitHub Actions Inactive

loci-review bot commented Mar 4, 2026

Overview

Analysis of commit e082a01 ("ucache: normalize error scaling") across 49,737 functions (60 modified, 0 new, 0 removed, 49,677 unchanged) reveals minor, localized performance regressions justified by algorithmic improvements to the U-Cache system.

Binaries Analyzed:

  • build.bin.sd-cli: 491,394.58 nJ → 491,419.51 nJ (+0.005%)
  • build.bin.sd-server: 527,270.52 nJ → 526,865.36 nJ (-0.077%)

Power consumption remains essentially unchanged, indicating energy-neutral algorithmic enhancements.

Function Analysis

UCacheState::before_condition() (both binaries) - Performance-critical cache decision function called per denoising step. Throughput time increased +38.5% (+265ns absolute, 689ns → 955ns), response time +0.67-0.79% (+508-592ns, ~75,530ns → ~76,039ns). Changes implement sophisticated EMA-based error normalization with consecutive skip penalties, replacing simpler static reference-based logic. The 265ns self-time increase is negligible compared to the 75.8μs total execution dominated by tensor operations, and vastly outweighed by potential millisecond-scale savings from improved cache hit rates.
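The description above (EMA-based error normalization plus a consecutive-skip penalty) can be sketched roughly as follows. This is an illustrative reconstruction, not the actual stable-diffusion.cpp code; all names (UCacheSketch, can_reuse, ema_decay, the 0.1 penalty slope) are hypothetical.

```cpp
#include <cmath>

// Hypothetical sketch of an EMA-normalized reuse decision with a
// consecutive-skip penalty, as described in the review above.
struct UCacheSketch {
    float ema_error         = 0.0f;  // running EMA of per-step output change
    float ema_decay         = 0.9f;  // EMA smoothing factor (assumed value)
    int   consecutive_skips = 0;

    // Returns true if the cached U-Net output may be reused this step.
    bool can_reuse(float raw_error, float threshold) {
        // Track the runtime error scale, so the decision adapts to the
        // checkpoint's dynamics instead of its raw output magnitude.
        ema_error = ema_decay * ema_error + (1.0f - ema_decay) * raw_error;

        // Normalize the current error by the running scale.
        float norm_error = raw_error / (ema_error + 1e-8f);

        // Penalize long skip streaks so error cannot silently accumulate.
        float penalty = 1.0f + 0.1f * static_cast<float>(consecutive_skips);

        if (norm_error * penalty < threshold) {
            ++consecutive_skips;
            return true;   // skip: reuse cached result
        }
        consecutive_skips = 0;
        return false;      // recompute
    }
};
```

Note how the penalty term forces a periodic recompute even when the normalized error stays low, bounding how far a skip streak can drift from the true output.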

UCacheState::get_adaptive_threshold() (both binaries) - Threshold calculation function showing +32.3% response time (+116ns, 359ns → 475ns) and +11.6% throughput time (+37ns, 322ns → 359ns). Added progress clamping (std::max/std::min) and enhanced fallback logic improve correctness and prevent invalid threshold multipliers.
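The progress clamping and fallback described here might look something like the sketch below. The function name, the multiplier schedule, and the [0.5, 1.5] range are illustrative assumptions, not the project's actual code; only the std::max/std::min clamp pattern is taken from the review above.

```cpp
#include <algorithm>

// Hypothetical sketch of an adaptive threshold with progress clamping.
float get_adaptive_threshold_sketch(float base_threshold,
                                    int step, int total_steps) {
    // Fallback: with an invalid step count, keep the base threshold.
    if (total_steps <= 0) {
        return base_threshold;
    }
    // Clamp progress to [0, 1] so out-of-range step indices cannot
    // produce an invalid threshold multiplier.
    float progress = static_cast<float>(step) / static_cast<float>(total_steps);
    progress = std::max(0.0f, std::min(1.0f, progress));

    // Example schedule (assumed): allow more reuse later in denoising.
    float multiplier = 0.5f + progress;   // ranges over [0.5, 1.5]
    return base_threshold * multiplier;
}
```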

std::_Hashtable::end() (sd-cli) - Standard library iterator accessor showing +136.7% response time (+162ns, 119ns → 281ns) and +194.7% throughput time (+162ns, 83ns → 245ns). Regression stems from increased call frequency and reduced inlining opportunities due to expanded cache state complexity, not direct code changes.

UCacheConfig constructors (both binaries) - Configuration struct initialization showing +10% (+8.35-8.39ns, 83ns → 92ns) from adding relative_norm_gain parameter. One-time initialization overhead is negligible.

Other analyzed functions showed improvements (vector::back -54.5%, __shared_count constructors -17.9%) or minor regressions in standard library operations with negligible absolute impact.

Additional Findings

The U-Cache improvements directly optimize ML inference by gating GPU-accelerated U-Net evaluations. The 10-46 microsecond cumulative overhead per image (0.001-0.023% of typical 200-5000ms inference time) is negligible compared to 10-100ms saved per avoided U-Net evaluation. Changes prioritize inference quality and numerical stability, representing mature optimization practices for production ML systems.
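As a quick sanity check on the percentages quoted above, using only the figures stated in this report:

```cpp
// Back-of-envelope check of the overhead-vs-savings claim above.
// Worst-case 46 us of cumulative cache overhead against a fast
// 200 ms inference: 46e-6 / 0.2 = 2.3e-4, i.e. 0.023% of total time.
constexpr double worst_overhead_s = 46e-6;
constexpr double fast_inference_s = 0.2;
constexpr double worst_fraction   = worst_overhead_s / fast_inference_s;

// One avoided U-Net evaluation saves at least ~10 ms, so a single
// cache hit repays the entire per-image overhead more than 200x.
constexpr double min_saving_s = 10e-3;
constexpr double payback      = min_saving_s / worst_overhead_s;
```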

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from dd19ab8 to 98460a7 Compare March 10, 2026 04:15
