
UPSTREAM PR #1313: fix: ucache: normalize reuse error #73

Open
loci-dev wants to merge 1 commit into main from loci/pr-1313-ucache-fix

Conversation


@loci-dev loci-dev commented Mar 4, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1313

This PR updates the ucache skip math so that reuse decisions are normalized by runtime signal dynamics rather than raw model scale, which previously made skip behavior inconsistent across checkpoints.

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod March 4, 2026 04:15 — with GitHub Actions Inactive

loci-review bot commented Mar 4, 2026

Overview

Analysis of commit e082a01 ("ucache: normalize error scaling") across 49,737 functions (60 modified, 0 new, 0 removed, 49,677 unchanged) reveals minor, localized performance regressions justified by algorithmic improvements to the U-Cache system.

Binaries Analyzed:

  • build.bin.sd-cli: 491,394.58 nJ → 491,419.51 nJ (+0.005%)
  • build.bin.sd-server: 527,270.52 nJ → 526,865.36 nJ (-0.077%)

Power consumption remains essentially unchanged, indicating energy-neutral algorithmic enhancements.

Function Analysis

UCacheState::before_condition() (both binaries) - Performance-critical cache decision function called per denoising step. Throughput time increased +38.5% (+265ns absolute, 689ns → 955ns), response time +0.67-0.79% (+508-592ns, ~75,530ns → ~76,039ns). Changes implement sophisticated EMA-based error normalization with consecutive skip penalties, replacing simpler static reference-based logic. The 265ns self-time increase is negligible compared to the 75.8μs total execution dominated by tensor operations, and vastly outweighed by potential millisecond-scale savings from improved cache hit rates.
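The description above (EMA-based error normalization plus a consecutive-skip penalty) can be sketched roughly as follows. This is an illustrative reconstruction, not the actual stable-diffusion.cpp code; all names (UCacheSketch, can_reuse, ema_decay, the 0.1 penalty slope) are hypothetical.

```cpp
#include <cmath>

// Hypothetical sketch of an EMA-normalized reuse decision with a
// consecutive-skip penalty, as described in the review above.
struct UCacheSketch {
    float ema_error         = 0.0f;  // running EMA of per-step output change
    float ema_decay         = 0.9f;  // EMA smoothing factor (assumed value)
    int   consecutive_skips = 0;

    // Returns true if the cached U-Net output may be reused this step.
    bool can_reuse(float raw_error, float threshold) {
        // Track the runtime error scale, so the decision adapts to the
        // checkpoint's dynamics instead of its raw output magnitude.
        ema_error = ema_decay * ema_error + (1.0f - ema_decay) * raw_error;

        // Normalize the current error by the running scale.
        float norm_error = raw_error / (ema_error + 1e-8f);

        // Penalize long skip streaks so error cannot silently accumulate.
        float penalty = 1.0f + 0.1f * static_cast<float>(consecutive_skips);

        if (norm_error * penalty < threshold) {
            ++consecutive_skips;
            return true;   // skip: reuse cached result
        }
        consecutive_skips = 0;
        return false;      // recompute
    }
};
```

Note how the penalty term forces a periodic recompute even when the normalized error stays low, bounding how far a skip streak can drift from the true output.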

UCacheState::get_adaptive_threshold() (both binaries) - Threshold calculation function showing +32.3% response time (+116ns, 359ns → 475ns) and +11.6% throughput time (+37ns, 322ns → 359ns). Added progress clamping (std::max/std::min) and enhanced fallback logic improve correctness and prevent invalid threshold multipliers.
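The progress clamping and fallback described here might look something like the sketch below. The function name, the multiplier schedule, and the [0.5, 1.5] range are illustrative assumptions, not the project's actual code; only the std::max/std::min clamp pattern is taken from the review above.

```cpp
#include <algorithm>

// Hypothetical sketch of an adaptive threshold with progress clamping.
float get_adaptive_threshold_sketch(float base_threshold,
                                    int step, int total_steps) {
    // Fallback: with an invalid step count, keep the base threshold.
    if (total_steps <= 0) {
        return base_threshold;
    }
    // Clamp progress to [0, 1] so out-of-range step indices cannot
    // produce an invalid threshold multiplier.
    float progress = static_cast<float>(step) / static_cast<float>(total_steps);
    progress = std::max(0.0f, std::min(1.0f, progress));

    // Example schedule (assumed): allow more reuse later in denoising.
    float multiplier = 0.5f + progress;   // ranges over [0.5, 1.5]
    return base_threshold * multiplier;
}
```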

std::_Hashtable::end() (sd-cli) - Standard library iterator accessor showing +136.7% response time (+162ns, 119ns → 281ns) and +194.7% throughput time (+162ns, 83ns → 245ns). Regression stems from increased call frequency and reduced inlining opportunities due to expanded cache state complexity, not direct code changes.

UCacheConfig constructors (both binaries) - Configuration struct initialization showing +10% (+8.35-8.39ns, 83ns → 92ns) from adding relative_norm_gain parameter. One-time initialization overhead is negligible.

Other analyzed functions showed improvements (vector::back -54.5%, __shared_count constructors -17.9%) or minor regressions in standard library operations with negligible absolute impact.

Additional Findings

The U-Cache improvements directly optimize ML inference by gating GPU-accelerated U-Net evaluations. The 10-46 microsecond cumulative overhead per image (0.001-0.023% of typical 200-5000ms inference time) is negligible compared to 10-100ms saved per avoided U-Net evaluation. Changes prioritize inference quality and numerical stability, representing mature optimization practices for production ML systems.
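As a quick sanity check on the percentages quoted above, using only the figures stated in this report:

```cpp
// Back-of-envelope check of the overhead-vs-savings claim above.
// Worst-case 46 us of cumulative cache overhead against a fast
// 200 ms inference: 46e-6 / 0.2 = 2.3e-4, i.e. 0.023% of total time.
constexpr double worst_overhead_s = 46e-6;
constexpr double fast_inference_s = 0.2;
constexpr double worst_fraction   = worst_overhead_s / fast_inference_s;

// One avoided U-Net evaluation saves at least ~10 ms, so a single
// cache hit repays the entire per-image overhead more than 200x.
constexpr double min_saving_s = 10e-3;
constexpr double payback      = min_saving_s / worst_overhead_s;
```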

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from dd19ab8 to 98460a7 Compare March 10, 2026 04:15
