
UPSTREAM PR #1217: feat(server): add generation metadata to png images #41

Open

loci-dev wants to merge 2 commits into main from loci/pr-1217-sd_server_png_metadata

Conversation

loci-dev commented Feb 2, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1217

loci-review bot commented Feb 2, 2026

No summary available at this time. Visit Loci Inspector to review detailed analysis.

loci-dev force-pushed the main branch 27 times, most recently from 68f62a5 to 342c73d on February 9, 2026 04:49
loci-dev force-pushed the main branch 3 times, most recently from 3ad80c4 to 74d69ae on February 12, 2026 04:47
loci-dev force-pushed the loci/pr-1217-sd_server_png_metadata branch from 9533c5e to be6f95b on February 21, 2026 04:12
loci-dev temporarily deployed to stable-diffusion-cpp-prod on February 21, 2026 04:12 with GitHub Actions
loci-review bot commented Feb 21, 2026

Overview

Analysis of 48,320 functions across two binaries reveals minimal performance impact. Modified functions: 111 (0.23%), new: 11, removed: 6, unchanged: 48,192 (99.73%).

Binaries analyzed:

  • build.bin.sd-cli: +0.708% power consumption (+3,398.65 nJ)
  • build.bin.sd-server: +0.721% power consumption (+3,717.22 nJ)

Changes stem from PNG metadata embedding feature additions across 5 files. Performance impacts are concentrated in C++ standard library functions rather than application code, likely due to compiler optimization differences between builds.

Function Analysis

Significant regressions (200-316% increases in throughput time):

  • __iter_equals_val (sd-cli): +316.56% throughput (+184.66ns), +233.86% response (+184.65ns). Used in std::find operations during tokenization and parameter validation. No source changes; STL implementation affected by compiler differences.

  • std::_Rb_tree::end/begin (both binaries, 3 instances): +289-307% throughput (+182-183ns), +222-228% response. Used in std::map iterations for configuration, embeddings, and parameter lookups. No source changes; red-black tree accessor functions affected by inlining decisions.

  • std::vector::end for MountPointEntry (sd-server): +306.60% throughput (+183.29ns), +227.57% response. Used in HTTP file request handling. Likely lost an inlining optimization.

  • __val_comp_iter (sd-server): +260.22% throughput (+221.99ns), +186.75% response. Compiler-generated comparator for HTTP range coalescing. No source changes.

  • _M_bucket_index (sd-cli): +54.48% throughput (+40.52ns), +20.86% response. Hash table operations for CacheDitConditionState::cache_diffs.

  • make_shared<Conv2d> (sd-cli): +51.56% throughput (+44.10ns), +1.92% response. Affects model initialization, not inference.

Significant improvements:

  • std::vector<std::thread>::end (sd-cli): -75.41% throughput (-183.30ns), -69.13% response. Improves thread synchronization during model loading.

  • make_move_iterator (sd-server): -68.40% throughput (-168.52ns), -58.61% response. Better move semantics optimization.

  • Iterator operator+ for LoraModel (sd-server): -48.19% throughput (-69.31ns), -42.12% response. Improves LoRA weight patching.

Other analyzed functions showed negligible changes.

Additional Findings

All affected functions are in initialization, configuration, or post-processing paths, not in the critical ML inference loop. Core GPU operations (GGML tensor computations, diffusion steps, VAE decoding) remain unaffected. Cumulative worst-case overhead across all regressions is ~1µs, negligible against typical inference times of 2-10 seconds. The 0.7% power increase is an acceptable trade-off: it buys PNG metadata embedding and the reproducibility it enables without impacting inference quality or speed.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-dev force-pushed the main branch 3 times, most recently from 2cf1d7d to 44ec1be on March 4, 2026 04:15
loci-dev force-pushed the loci/pr-1217-sd_server_png_metadata branch from be6f95b to fdbebe1 on March 6, 2026 04:58
loci-dev temporarily deployed to stable-diffusion-cpp-prod on March 6, 2026 04:58 with GitHub Actions
loci-review bot commented Mar 6, 2026

Overview

Analysis of 49,745 functions across two binaries revealed 103 modified, 13 new, and 6 removed functions. Power consumption changed minimally: build.bin.sd-cli increased 0.099% (+485 nJ), while build.bin.sd-server decreased 0.013% (-68 nJ). The changes implement the metadata embedding feature and were not intended as performance optimizations.

Function Analysis

Critical Regression:

  • neon_compute_fp16_to_fp32 (sd-cli): Response time increased 110% (+94ns), throughput time increased 122% (+94ns). This NEON SIMD function performs FP16-to-FP32 conversion, a hot-path operation in ML inference potentially called thousands of times per generation. The regression stems from GGML library changes, not application code, but could significantly impact inference latency.

Notable Improvements:

  • ggml_compute_forward_map_custom3 (sd-server): Response time decreased 33% (-77ns), throughput time decreased 35% (-77ns). GGML custom operation dispatch optimization benefits tensor computations.
  • copy_data_to_backend_tensor (sd-server): Response time decreased 12% (-199ns), throughput time decreased 57% (-198ns). Improved tensor transfer efficiency benefits model initialization.
  • vector::end() (sd-server) and vector::begin() (sd-cli): Both improved 68-75% (-181ns), benefiting LoRA configuration iteration and command-line parsing.

Other Regressions:
STL functions showed mixed results, with several exhibiting 50-180% increases in throughput time but minimal absolute impact (60-190ns), primarily in initialization and cleanup code rather than inference hot paths. These variations stem from compiler and standard-library differences between builds.

Additional Findings

The neon_compute_fp16_to_fp32 regression is the primary concern for ML workloads. If called frequently during inference (e.g., 10,000 times per forward pass across 50 diffusion steps), the cumulative impact could reach 40+ milliseconds per image. GGML improvements partially offset this, but profiling real workloads is recommended to quantify actual inference impact. Most other changes affect initialization/cleanup phases with negligible end-to-end impact.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

