
UPSTREAM PR #1307: reset weight adapter for models if no loras in request (fix 'sticky loras') #72

Open
loci-dev wants to merge 1 commit into main from loci/pr-1307-master

Conversation

@loci-dev loci-dev commented Mar 3, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1307

Currently, weight_adapter remains unchanged if there are no LoRAs in the request.
As a result, after a generation that used a given set of LoRAs, every subsequent request that specifies no LoRAs will keep using the last specified ones.

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod March 3, 2026 04:16 — with GitHub Actions
loci-review bot commented Mar 3, 2026

Overview

Analysis of commit 659c150 ("reset weight adapter for models if no loras in request") across two binaries shows minimal performance impact. Of 49,737 total functions, 45 were modified (0.09%), with no new or removed functions.

Binaries analyzed:

  • build.bin.sd-cli: +0.036% power consumption (+176.40 nJ)
  • build.bin.sd-server: -0.057% power consumption (-299.06 nJ)

Both power consumption changes are negligible, indicating no meaningful energy impact.

Function Analysis

apply_loras_at_runtime (both binaries) received the intentional code change: four null pointer assignments that reset the weight adapters, fixing the "sticky LoRAs" bug:

  • sd-cli: Response time +31,222ns (+0.10%), throughput time +160ns (+10.39%)
  • sd-server: Response time -8,212ns (-0.027%), throughput time +166ns (+10.87%)

The 160-166ns throughput increase directly corresponds to the four added pointer assignments. This overhead is negligible within the function's 30.8ms execution time and represents less than 0.001% of typical multi-second inference workloads. The correctness improvement (preventing incorrect model outputs from persistent LoRA state) fully justifies this minimal cost.

Standard library functions show compiler-driven variations without source changes:

  • std::vector::back (sd-server): Improved by 188ns (-55.9% throughput), beneficial optimization
  • std::_Rb_tree::begin (sd-cli): Regressed by 182ns (+289% throughput), but absolute impact negligible
  • std::_Rb_tree::_S_key (sd-server): Regressed by 186ns (+311% throughput), but absolute impact negligible

GGML functions show minor regressions likely from indirect effects:

  • ggml_new_tensor_impl (sd-cli): +27ns (+4.0% throughput)
  • apply_unary_op (sd-server): +71ns (+9.95% throughput)

Other analyzed functions saw negligible changes, primarily reflecting compiler optimization variance rather than algorithmic modifications.

Additional Findings

The modified function (apply_loras_at_runtime) manages LoRA (Low-Rank Adaptation) application for ML model customization. The fix ensures clean model state between inference requests, preventing artifacts where previous adaptations incorrectly influenced subsequent generations. No GPU functions were directly modified. The changes do not affect the inference critical path (tensor operations, diffusion loop), which operates at millisecond to second timescales, making the nanosecond-level changes immaterial to overall performance.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 44ec1be to 682032b on March 6, 2026 04:14