
UPSTREAM PR #1217: feat(server): add generation metadata to png images #41

Open

loci-dev wants to merge 2 commits into main from loci/pr-1217-sd_server_png_metadata

Conversation

loci-dev commented Feb 2, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1217

loci-review bot commented Feb 2, 2026

No summary available at this time. Visit Loci Inspector to review detailed analysis.

loci-dev force-pushed the main branch 27 times, most recently from 68f62a5 to 342c73d on February 9, 2026 04:49
loci-dev force-pushed the main branch 3 times, most recently from 3ad80c4 to 74d69ae on February 12, 2026 04:47
loci-dev force-pushed the loci/pr-1217-sd_server_png_metadata branch from 9533c5e to be6f95b on February 21, 2026 04:12
loci-dev temporarily deployed to stable-diffusion-cpp-prod on February 21, 2026 04:12 with GitHub Actions
loci-review bot commented Feb 21, 2026

Overview

Analysis of 48,320 functions across two binaries reveals minimal performance impact. Modified functions: 111 (0.23%), new: 11, removed: 6, unchanged: 48,192 (99.73%).

Binaries analyzed:

  • build.bin.sd-cli: +0.708% power consumption (+3,398.65 nJ)
  • build.bin.sd-server: +0.721% power consumption (+3,717.22 nJ)

Changes stem from PNG metadata embedding feature additions across 5 files. Performance impacts are concentrated in C++ standard library functions rather than application code, likely due to compiler optimization differences between builds.

Function Analysis

Significant regressions (200-316% increases in throughput time):

  • __iter_equals_val (sd-cli): +316.56% throughput (+184.66ns), +233.86% response (+184.65ns). Used in std::find operations during tokenization and parameter validation. No source changes; STL implementation affected by compiler differences.

  • std::_Rb_tree::end/begin (both binaries, 3 instances): +289-307% throughput (+182-183ns), +222-228% response. Used in std::map iterations for configuration, embeddings, and parameter lookups. No source changes; red-black tree accessor functions affected by inlining decisions.

  • std::vector::end for MountPointEntry (sd-server): +306.60% throughput (+183.29ns), +227.57% response. Used in HTTP file request handling. Likely lost an inlining optimization.

  • __val_comp_iter (sd-server): +260.22% throughput (+221.99ns), +186.75% response. Compiler-generated comparator for HTTP range coalescing. No source changes.

  • _M_bucket_index (sd-cli): +54.48% throughput (+40.52ns), +20.86% response. Hash table operations for CacheDitConditionState::cache_diffs.

  • make_shared<Conv2d> (sd-cli): +51.56% throughput (+44.10ns), +1.92% response. Affects model initialization, not inference.

Significant improvements:

  • std::vector<std::thread>::end (sd-cli): -75.41% throughput (-183.30ns), -69.13% response. Improves thread synchronization during model loading.

  • make_move_iterator (sd-server): -68.40% throughput (-168.52ns), -58.61% response. Better move semantics optimization.

  • Iterator operator+ for LoraModel (sd-server): -48.19% throughput (-69.31ns), -42.12% response. Improves LoRA weight patching.

Other analyzed functions showed negligible changes.

Additional Findings

All affected functions are in initialization, configuration, or post-processing paths, not in the critical ML inference loop. Core GPU operations (GGML tensor computations, diffusion steps, VAE decoding) remain unaffected. Cumulative worst-case overhead across all regressions is ~1µs, negligible against typical inference times of 2-10 seconds. The 0.7% power increase is an acceptable trade-off: it buys PNG metadata embedding and the reproducibility it enables without impacting inference quality or speed.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-dev force-pushed the main branch 3 times, most recently from 2cf1d7d to 44ec1be on March 4, 2026 04:15
loci-dev force-pushed the loci/pr-1217-sd_server_png_metadata branch from be6f95b to fdbebe1 on March 6, 2026 04:58
loci-dev temporarily deployed to stable-diffusion-cpp-prod on March 6, 2026 04:58 with GitHub Actions
loci-review bot commented Mar 6, 2026

Overview

Analysis of 49,745 functions across two binaries revealed 103 modified, 13 new, and 6 removed functions. Power consumption changed minimally: build.bin.sd-cli increased 0.099% (+485 nJ), while build.bin.sd-server decreased 0.013% (-68 nJ). The changes implement the metadata embedding feature and were not intended as performance optimizations.

Function Analysis

Critical Regression:

  • neon_compute_fp16_to_fp32 (sd-cli): Response time increased 110% (+94ns), throughput time increased 122% (+94ns). This NEON SIMD function performs FP16-to-FP32 conversion, a hot-path operation in ML inference potentially called thousands of times per generation. The regression stems from GGML library changes, not application code, but could significantly impact inference latency.

Notable Improvements:

  • ggml_compute_forward_map_custom3 (sd-server): Response time decreased 33% (-77ns), throughput time decreased 35% (-77ns). GGML custom operation dispatch optimization benefits tensor computations.
  • copy_data_to_backend_tensor (sd-server): Response time decreased 12% (-199ns), throughput time decreased 57% (-198ns). Improved tensor transfer efficiency benefits model initialization.
  • vector::end() (sd-server) and vector::begin() (sd-cli): Both improved 68-75% (-181ns), benefiting LoRA configuration iteration and command-line parsing.

Other Regressions:
STL functions showed mixed results, with several exhibiting 50-180% increases in throughput time but minimal absolute impact (60-190ns), primarily in initialization and cleanup code rather than inference hot paths. These variations stem from compiler and standard-library differences between builds.

Additional Findings

The neon_compute_fp16_to_fp32 regression is the primary concern for ML workloads. If called frequently during inference (e.g., 10,000 times per forward pass across 50 diffusion steps), the cumulative impact could reach 40+ milliseconds per image. GGML improvements partially offset this, but profiling real workloads is recommended to quantify actual inference impact. Most other changes affect initialization/cleanup phases with negligible end-to-end impact.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

