Skip to content

UPSTREAM PR #1316: fix: sd-server memory leak#75

Open
loci-dev wants to merge 2 commits intomainfrom
loci/pr-1316-master
Open

UPSTREAM PR #1316: fix: sd-server memory leak#75
loci-dev wants to merge 2 commits intomainfrom
loci/pr-1316-master

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Mar 4, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1316

free results (sd_images_t array) from generate_images() after write_image_to_vector()

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod March 4, 2026 04:57 — with GitHub Actions Inactive
@loci-review
Copy link

loci-review bot commented Mar 4, 2026

Overview

Analysis of stable-diffusion.cpp compared 49,758 functions across two versions, identifying 56 modified functions, 1 new function, and 49,701 unchanged. The target version introduces a memory leak fix in sd-server (commit 85676b7) with minimal performance impact.

Binaries analyzed:

  • build.bin.sd-server: -0.028% power consumption (527,129.70 nJ → 526,980.49 nJ)
  • build.bin.sd-cli: 0.0% power consumption change (491,105.58 nJ → 491,105.69 nJ)

Overall impact is positive, with hot-path optimizations offsetting minor regressions in non-critical functions.

Function Analysis

Performance-critical improvements:

  • ggml_compute_forward_map_custom3: Response time decreased 32.85% (-76.86ns: 233.99ns → 157.13ns); throughput time decreased 35.05% (-76.85ns: 219.25ns → 142.40ns). This GGML tensor operation is called thousands of times per inference, compounding to meaningful savings.

  • GGMLRunner::copy_data_to_backend_tensor: Response time decreased 11.45% (-197.67ns: 1,725.58ns → 1,527.91ns); throughput time decreased 56.89% (-197.69ns: 347.49ns → 149.80ns). Critical for GPU inference, reducing data transfer overhead between host and backend memory.

  • std::vector<sd_lora_t>::end: Response time decreased 69.44% (-183.29ns: 263.94ns → 80.65ns); throughput time decreased 75.41% (-183.29ns: 243.07ns → 59.78ns). Benefits LoRA configuration processing during request handling.

Non-critical regressions:

  • std::vector::begin and std::vector::begin: Both show ~289% throughput increases (+180.81ns: 62.49ns → 243.30ns), but absolute impact is negligible for these STL accessors used during initialization.

  • nlohmann::json::lexer::scan_string: Response time increased 6.01% (+2,530ns: 42,105.09ns → 44,635.09ns) with stable throughput time (-0.42%). Affects request parsing at boundary, not inference hot path.

Other analyzed functions showed minor variations in standard library operations with negligible real-world impact.

Additional Findings

The memory leak fix (free_results() function added at three cleanup sites) prevents unbounded memory growth in long-running servers without measurable energy overhead. Indirect benefits include reduced heap fragmentation improving allocator performance, evident in the 50% throughput improvement for std::_Construct string operations. The combined hot-path optimizations provide approximately 0.79ms savings per image generation, while the memory stability improvements are essential for production deployments handling sustained workloads.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants