
UPSTREAM PR #1261: refactor: move VAE tiling parameters to SDGenerationParams#63

Open
loci-dev wants to merge 1 commit into main from loci/pr-1261-sd_refactor_vae_tiling

Conversation

@loci-dev

Note

Source pull request: leejet/stable-diffusion.cpp#1261

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod February 20, 2026 04:17 — with GitHub Actions Inactive
@loci-review

loci-review bot commented Feb 20, 2026

Overview

Analysis of stable-diffusion.cpp refactoring commit (2367bc7: "move VAE tiling parameters to SDGenerationParams") across 48,374 functions shows minimal performance impact. Modified: 78 functions; New: 80; Removed: 80; Unchanged: 48,136.

Binaries Analyzed:

  • build.bin.sd-server: +0.614% power consumption (515,491.29 nJ → 518,655.31 nJ)
  • build.bin.sd-cli: +0.706% power consumption (480,109.60 nJ → 483,500.24 nJ)

The refactoring successfully moves VAE tiling parameters from context initialization to per-generation configuration, enabling flexible memory management with acceptable performance trade-offs.

Function Analysis

Configuration Parsing (Initialization Only):

SDContextParams::get_options() improved across both binaries: response time -6.6% (sd-server: 279,572ns → 261,119ns; sd-cli: 280,187ns → 261,795ns), throughput time -7.6% to -9.6% due to removing 4 VAE tiling options. This simplification reduced branching and parsing overhead.

SDGenerationParams::get_options() regressed consistently: response time +5.95-5.96% (sd-server: 306,582ns → 324,830ns; sd-cli: 307,317ns → 325,643ns), throughput time +6.11% due to adding the same 4 options with complex parsing logic. The ~200ns self-time increase reflects additional option registration overhead.

SDGenerationParams::to_string() (sd-cli) regressed +17.4% throughput time (1,714ns → 2,012ns) from serializing 6 additional vae_tiling_params fields—expected for a diagnostic function.

GGML Backend (Model Loading/Inference):

make_block_q4_Kx8 (sd-server) regressed +7.9% (8,126ns → 8,768ns) in both response and throughput time, indicating intrinsic overhead in quantization repacking. Affects model loading, not inference hot path.

forward_mul_mat for block_iq4_nl (sd-server) shows +5.38% response time regression (12,916ns → 13,611ns) while throughput time remains stable (2,390ns), indicating child function slowdown rather than direct implementation changes. This matrix multiplication function is inference-critical, though stable self-time suggests indirect impact.

Standard Library Optimizations:

Multiple functions improved significantly: std::make_move_iterator -58.6% response time (287ns → 119ns), __gnu_cxx::__normal_iterator::operator+ -42.1% (165ns → 95ns), std::swap -11% (112ns → 100ns), std::__unique -5.8% response time. These compiler optimizations partially offset regressions.

Other analyzed functions (JSON access, regex compilation, vector reallocation) showed minor self-time variations with negligible total execution impact.

Additional Findings

The architectural refactoring achieves its goal of enabling per-generation VAE tiling control with minimal cost. Configuration parsing improvements offset regressions, resulting in balanced initialization performance. Most performance changes affect initialization rather than inference hot paths. The forward_mul_mat regression warrants monitoring in production, though stable self-time suggests the function's implementation is unchanged with slowdown in GGML dependencies. Power consumption increases (<1%) are negligible for image generation workloads taking seconds to minutes per image.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 2cf1d7d to 44ec1be Compare March 4, 2026 04:14
@loci-dev loci-dev force-pushed the loci/pr-1261-sd_refactor_vae_tiling branch from 2367bc7 to 065e498 Compare March 5, 2026 04:16
@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod March 5, 2026 04:16 — with GitHub Actions Inactive
@loci-review

loci-review bot commented Mar 5, 2026

Overview

Analysis of stable-diffusion.cpp across 49,806 functions (103 modified, 80 new, 80 removed, 49,543 unchanged) reveals minimal overall performance impact from a single commit refactoring VAE tiling parameters. Power consumption changes are negligible: build.bin.sd-server decreased 0.032% (527,129.70 → 526,960.94 nJ) and build.bin.sd-cli increased 0.091% (491,105.58 → 491,553.48 nJ).

Function Analysis

Most Significant Regression:

  • neon_compute_fp16_to_fp32 (sd-server): Throughput time increased 122% (+94.38ns, 77.22ns → 171.60ns), response time increased 112% (+94.38ns). This GGML SIMD function for FP16-to-FP32 conversion shows concerning regression for mixed-precision inference on ARM. No application code changes detected; regression originates from GGML submodule.

Most Significant Improvements:

  • std::vector<sd_lora_t>::end() (sd-server): Throughput time decreased 75% (-183.30ns), response time decreased 69% (-183.30ns)
  • std::vector::begin() (sd-cli): Throughput time decreased 74% (-180.81ns), response time decreased 68% (-180.81ns)
  • copy_data_to_backend_tensor (sd-server): Throughput time decreased 57% (-197.69ns), improving GPU data transfer efficiency
  • ggml_compute_forward_map_custom3 (sd-server): Throughput time decreased 35% (-76.84ns), benefiting custom tensor operations

Expected Overhead:

  • SDGenerationParams::get_options() (both binaries): Throughput time increased ~11% (+340ns) from added VAE tiling parameter parsing. This one-time CLI initialization cost is justified by architectural improvements enabling per-generation VAE memory control.

Other analyzed functions showed minor compiler-induced variations in standard library templates with negligible practical impact.

Additional Findings

The SIMD regression in neon_compute_fp16_to_fp32 could impact ARM-based deployments using mixed-precision inference. The function is invoked per conversion, so at an assumed 500,000 conversions per image the cumulative effect is an estimated 40-50ms of added latency. However, improvements in backend data transfer and custom operations provide offsetting benefits. The refactoring successfully achieves architectural goals without meaningful energy penalties, with actual inference impact highly dependent on model precision requirements and hardware backend configuration.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

