Add Boogu-Image generation, editing, and turbo pipelines by Boogu-Team · Pull Request #14040 · huggingface/diffusers

Boogu-Team · 2026-06-22T15:55:13Z

What does this PR do?

Adds the Boogu-Image family of pipelines to diffusers:

BooguImagePipeline — text-to-image generation and instruction-based image editing.
BooguImageTurboPipeline — few-step DMD distilled generation.
fp8 quantized inference examples for both, targeting the published fp8 checkpoints.

The integration is purely additive: it introduces new files only and does not modify any existing upstream module.

Published checkpoints (Hugging Face Hub, Boogu/…): Boogu-Image-0.1-Base, Boogu-Image-0.1-Edit, Boogu-Image-0.1-Turbo, and their -fp8 variants.

Pipelines & model

| Component | File |
| --- | --- |
| Generation / edit pipeline | `src/diffusers/pipelines/boogu/pipeline_boogu.py` |
| Turbo (DMD few-step) pipeline | `src/diffusers/pipelines/boogu/pipeline_boogu_turbo.py` |
| Transformer backbone (model, attention module + processors, RoPE) | `src/diffusers/models/transformers/transformer_boogu.py` |
| Image processor | `src/diffusers/pipelines/boogu/image_processor.py` |

Docs: docs/source/en/api/pipelines/boogu.md. Runnable examples: examples/boogu/ (base / edit / turbo + fp8, with README.md).

Convention compliance

Implemented against the repo's .ai rules:

Single-file model. The transformer backbone, its attention module (BooguImageAttention) + stateless processors, and the RoPE helper all live in transformer_boogu.py (one model = one file).
Attention. Routed through dispatch_attention_fn (no direct F.scaled_dot_product_attention in the forward path); attention masks always materialized as [B, 1, 1, L] bool to stay bit-exact with the trained checkpoints under the native bf16 backend. BooguImageAttention (modelled on Flux2Attention) holds the to_q/to_k/to_v/norm_q/norm_k/to_out layers and dispatches to a stateless processor.
Inference-only. No weight-init (initialize_weights) — the model loads pretrained weights. __init__ follows the lazy convention.
Turbo pipeline. BooguImageTurboPipeline is a standalone DiffusionPipeline (not a subclass of BooguImagePipeline), with shared methods carried via # Copied from and kept in sync by make fix-copies. Device placement / offloading / component registration reuse the DiffusionPipeline base class.
Standard CFG. The pipeline exposes standard classifier-free guidance only. No dead code paths, no silent except Exception fallbacks, no unused "API-consistency" parameters — training / ablation / prompt-tuning code from the research repo was removed; only the inference path is integrated.

Verification

All 6 examples/boogu/ scripts run end-to-end and produce correct images (base / edit / turbo + fp8).
Every refactor verified against a pre-refactor reference. The base path (T2I + CFG, which exercises both attention processors) is bit-exact (maxdiff = 0) on GPU; the edit path (TI2I double-guidance) matches within that path's inherent GPU nondeterminism (~8–9e-2, confirmed equal to a same-code self-vs-self run). Checkpoints load strict (no missing/unexpected keys).
CI gates pass locally: ruff check, ruff format --check, make fix-copies (no diff), check_dummies.
Test suite under tests/pipelines/boogu/ and tests/models/transformers/test_models_transformer_boogu.py passes.

Notes for reviewers

Double-stream attention & checkpoint keys. The single-stream attention processor is fully stateless. For the double-stream block, the per-stream projections (img_to_q / instruct_to_q / img_out / …) are stored in the already-published checkpoints under …img_instruct_attn.processor.*. Moving them onto BooguImageAttention to make that processor fully stateless would rename those state-dict keys and require re-saving the released Hub checkpoints. To keep existing downloads loading, those projections stay on the processor module for now; happy to do the key migration + re-publish in a follow-up if you prefer the fully-stateless form.
Boosted Orthogonal Guidance (BOG) has been removed — it defaulted to off, so standard-CFG output is unchanged. We'd like to re-introduce it as its own guider class under src/diffusers/guiders/ (and route advanced features through modular diffusers) in a follow-up PR, keeping this one focused on the standard pipeline.
TeaCache was removed from both the model and the pipeline; caching is deferred to CacheMixin / FirstBlockCache (and the WIP in Implement TeaCache #12652).
_keep_in_fp32_modules is intentionally not set: forcing time_caption_embed to fp32 changes bf16 inference numerics, so we left it to the model author / reviewer's call.
The fp8 examples include a small DeepGEMM-disable shim with version branches that are load-bearing for transformers 5.10.x (the env var alone does not disable DeepGEMM there).
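
The key migration mentioned in the double-stream note could look roughly like the sketch below. It is an illustration only: the helper name `migrate_double_stream_keys`, the `blocks.0.` prefix, and the target layout (dropping the `.processor` segment so the projections sit directly on the attention module) are assumptions, not something this PR implements.

```python
def migrate_double_stream_keys(state_dict):
    """Return a new state dict with `img_instruct_attn.processor.*` keys moved onto the module.

    Hypothetical helper: the target key layout is an assumption for illustration.
    """
    old = "img_instruct_attn.processor."
    new = "img_instruct_attn."
    migrated = {}
    for key, value in state_dict.items():
        # Only the double-stream projection keys are renamed; all other keys pass through.
        migrated[key.replace(old, new) if old in key else key] = value
    return migrated


# Made-up keys for illustration; a real checkpoint maps these names to tensors.
example = {
    "blocks.0.img_instruct_attn.processor.img_to_q.weight": "w_img_q",
    "blocks.0.img_instruct_attn.processor.instruct_to_q.weight": "w_instruct_q",
    "blocks.0.attn.to_q.weight": "w_single_q",
}
migrated = migrate_double_stream_keys(example)
print(sorted(migrated))
```

A rename like this is exactly why the released checkpoints would have to be re-saved: a strict load against the new layout would otherwise report the old `processor.*` names as unexpected keys.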

Required Hub-side checkpoint changes (no code change in this PR)

Important for reviewers: the published Boogu/Boogu-Image-0.1-* checkpoints on the Hub are still packaged for the old custom-remote-code loading path and will not load correctly against this PR's native diffusers classes until the Hub repos are updated. We will push these Hub-repo edits to land together with the PR. Diff below is Hub main → required state.

1. model_index.json — point component classes at diffusers, not the bundled remote-code modules.
On Hub main it currently is:

"scheduler":   ["scheduling_flow_match_euler_discrete_time_shifting", "FlowMatchEulerDiscreteScheduler"],
"transformer": ["transformer_boogu", "BooguImageTransformer2DModel"],

Both must become library refs so from_pretrained resolves the in-tree classes added by this PR:

"scheduler":   ["diffusers", "FlowMatchEulerDiscreteScheduler"],
"transformer": ["diffusers", "BooguImageTransformer2DModel"],

2. Delete the two remote-code shim modules (they are thin re-export stubs that import boogu … from the private research package and raise ModuleNotFoundError: boogu for any external user):

transformer/transformer_boogu.py
scheduler/scheduling_flow_match_euler_discrete_time_shifting.py

3. scheduler/scheduler_config.json — drop the legacy custom-scheduler keys. Hub main has:

{ "_class_name": "FlowMatchEulerDiscreteScheduler", "_diffusers_version": "0.33.1",
  "do_shift": true, "dynamic_time_shift": false, "time_shift_version": "v1",
  "seq_len": 4096, "num_train_timesteps": 1000 }

Remove do_shift, dynamic_time_shift, time_shift_version (these drove the old custom scheduler's time-shift; the official scheduler ignores them). Keep seq_len: the pipeline's time-shift adapter reads scheduler.config.seq_len to compute the shift mu, so it is load-bearing. The final config therefore keeps _class_name / _diffusers_version / seq_len / num_train_timesteps.

4. transformer/config.json — remove prompt_tuning_configs (the prompt-tuning subsystem was dropped from this integration; the key is otherwise ignored with a warning).

The same four edits apply to all six repos (Base, Edit, Turbo and their -fp8 variants). The mllm/, processor/, vae/ and weight files are unchanged.
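
Edits 1 and 3 can be sketched as plain JSON transformations. This is a minimal illustration, not the script used for the Hub update: the helper names `update_model_index` and `update_scheduler_config` are hypothetical, and the input values are copied from the Hub-main snippets quoted above.

```python
import json

# Keys that only the old custom scheduler understood (edit 3).
LEGACY_SCHEDULER_KEYS = ("do_shift", "dynamic_time_shift", "time_shift_version")


def update_model_index(model_index):
    """Point the scheduler and transformer entries at the diffusers library (edit 1)."""
    updated = dict(model_index)
    for name in ("scheduler", "transformer"):
        _old_module, class_name = updated[name]
        updated[name] = ["diffusers", class_name]
    return updated


def update_scheduler_config(config):
    """Drop the legacy time-shift keys; seq_len stays because the pipeline reads it (edit 3)."""
    return {key: value for key, value in config.items() if key not in LEGACY_SCHEDULER_KEYS}


model_index = {
    "scheduler": ["scheduling_flow_match_euler_discrete_time_shifting", "FlowMatchEulerDiscreteScheduler"],
    "transformer": ["transformer_boogu", "BooguImageTransformer2DModel"],
}
scheduler_config = {
    "_class_name": "FlowMatchEulerDiscreteScheduler",
    "_diffusers_version": "0.33.1",
    "do_shift": True,
    "dynamic_time_shift": False,
    "time_shift_version": "v1",
    "seq_len": 4096,
    "num_train_timesteps": 1000,
}

new_model_index = update_model_index(model_index)
new_scheduler_config = update_scheduler_config(scheduler_config)
print(json.dumps(new_model_index, indent=2))
print(json.dumps(new_scheduler_config, indent=2))
```

Both helpers return new dictionaries rather than mutating their input, which makes it easy to diff the Hub-main state against the required state before uploading.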

Integrate the Boogu-Image model into diffusers:
- Models: BooguImageTransformer2DModel, PromptEmbedding, Boogu attention processors, Lumina2 blocks, and rotary embeddings.
- Pipelines: BooguImagePipeline (text-to-image and instruction editing) and BooguImageTurboPipeline (DMD few-step text-to-image).
- Scheduler: flow-match Euler scheduler with training-aligned time shifting.
- Internal utils: TaylorSeer cache, TeaCache params, DPM cache helpers, and optional Triton fused RMSNorm.
- Loading: resolve published checkpoints' custom module names to the integrated classes via module aliases, so from_pretrained needs no trust_remote_code.
- Docs and runnable examples under docs/ and examples/boogu/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop the Boogu-only TaylorSeer caching feature, which was only half-removed in the working tree (left dangling `enable_taylorseer` references that raised NameError, and collaterally deleted the TeaCache `__init__` setup so the transformer raised AttributeError on `enable_teacache`).
- transformer_boogu.py: remove the remaining TaylorSeer branches; restore the TeaCache init block (enable_teacache, enable_teacache_for_all_layers, teacache_rel_l1_thresh, teacache_params, rescale_func) and the numpy / TeaCacheParams imports it needs.
- pipeline_boogu.py: drop the cache_init import, the enable_taylorseer plumbing and per-condition cache_dic/current branches, collapsing each `if enable_taylorseer / elif enable_teacache` into a plain `if enable_teacache`.
- Delete cache_functions/ and taylorseer_utils/ (Boogu-added, TaylorSeer-only, now unreferenced). The upstream hooks-based TaylorSeerCacheConfig is untouched.
- Remove BOOGU_INTEGRATION.md (ephemeral integration notes); add an environment install link to examples/boogu/README.md.

The pipeline uses the official FlowMatchEulerDiscreteScheduler via the thin BooguFlowMatchEulerDiscreteScheduler subclass (reuses the parent step).

Tests: test_models_transformer_boogu (15 passed) and test_boogu (20 passed) green; check_copies and check_dummies pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… adapter

Replace the BooguFlowMatchEulerDiscreteScheduler subclass with the official FlowMatchEulerDiscreteScheduler plus a standalone set_flow_match_timesteps adapter that applies Boogu's training-aligned static v1 time shift and 0->1 sigma schedule, reusing the parent's exponential shift formula.
- Add pipelines/boogu/flow_match_boogu.py with set_flow_match_timesteps
- Route the flow-match branch of retrieve_timesteps through the adapter (annotated "# Adapted from" to reflect the intentional divergence)
- Update pipeline/test type hints and imports to the official scheduler
- Drop the scheduler subclass and its registrations (schedulers/__init__, top-level __init__, dummy_pt_objects)

Numerically equivalent to the old subclass (max diff ~6e-08). The boogu test suite shows no regression vs the pre-change tree (same 11 pre-existing MLLM device-placement failures, 19 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…writer

Reduce the boogu pipeline package from 7 files to 4 by removing dead and misplaced code, keeping the default T2I/TI2I inference path unchanged.
- Inline set_flow_match_timesteps into pipeline_boogu.py (single caller) and delete flow_match_boogu.py, per the "inline single-caller helpers" rule.
- Replace the image_processor.preprocess override (which duplicated the parent VaeImageProcessor wholesale) with a thin override that only derives the Boogu max_pixels/max_side_length target size, then delegates to the parent. Verified bit-identical output across sizes/constraints (max diff 0.0).
- Remove BooguImageLoraLoaderMixin / lora_pipeline.py: LoRA is unused on the inference path, and the mixin belongs in loaders/ by diffusers convention.
- Remove the instruction-rewriter feature entirely (static_skills.py, instruct_reasoner_static_skills.py, and ~1100 lines of rewriter methods, state, and public kwargs). It was gated by use_rewrite_text_instruction (default False) and unused by every example/test; the skills files were its only consumers.

Net: -2255 / +74 lines. End-to-end TI2I inference reproduces the standalone reference (mean pixel diff 8.8, unchanged from before), and the boogu test suite shows the same pre-existing baseline (11 failed / 19 passed / 4 skipped, the 11 being unrelated MLLM device-placement failures). check_copies and check_dummies pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Both edit examples ran with no negative prompt. At text_guidance_scale=4.0 the model guides away from the negative instruction, so omitting it left the output oversaturated and under-stylized (style transfer barely applied). Add the standard negative prompt used by the reference inference so the colored-pencil style conversion comes through. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tions)

Self-review against .ai/{AGENTS,models,pipelines,review-rules}.md surfaced a batch of mechanical issues fixed here (no behavior change on the default path; boogu test suite unchanged at 16/53/7, identical failure set — the remaining failures are a pre-existing MLLM cpu/cuda device-placement issue).

pipeline_boogu.py:
- Remove dead helpers: _project, _sigmoid_kernel, _softmax_kernel, the non-newton-schulz bog_norm branches, MomentumRollingSum._append_and_save (+ now-unused pathlib import).
- Drop unused __call__ params verbose and callback_on_step_end_tensor_inputs, a bare `latents.shape[0]` expression, and several commented-out code blocks.
- Replace all print() with module logger; drop emoji/blank-line prints.

pipeline_boogu_turbo.py:
- Add module logger; replace the inference print() with logger.info.

transformer_boogu.py:
- Default attention to the SDPA processor instead of selecting it from an os.getenv("device") read at __init__ (non-standard, and forced flash in fp32); drop the now-unused Flash2Varlen imports and the single-stream block alias.
- Replace np.poly1d TeaCache rescale with inline Horner eval; drop numpy import.
- Fix _no_split_modules / _repeated_blocks (remove the alias string that never matched __class__.__name__ and the invalid "nn.Embedding" entry).
- Give PromptEmbedding flat @register_to_config kwargs so from_pretrained round-trips; remove its non-standard from_config override.
- Remove dead self.layers, enable_teacache_for_all_layers, a commented-out param, a discarded dict lookup, and a stale section comment.

attention_processor_boogu.py:
- Remove no-op `layer = layer.to(device)` loops (rebind a local, never move the module) plus the bare shape expressions and commented debug lines above them.

image_processor.py:
- Guard get_new_height_width against None max_pixels / max_side_length (previously TypeError / UnboundLocalError when called with defaults); output is bit-identical when both constraints are set.
- Sync the class docstring to the actual __init__ signature.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
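
The switch from np.poly1d to an inline Horner evaluation mentioned in this commit can be pictured with the sketch below. The function name `horner` and the coefficients are made up for illustration (the actual TeaCache rescale coefficients are not given here); the only convention carried over is np.poly1d's ordering, highest degree first.

```python
def horner(coefficients, x):
    """Evaluate a polynomial whose coefficients are ordered highest degree first."""
    result = 0.0
    for coefficient in coefficients:
        # Each step folds one more coefficient in: (((c0)*x + c1)*x + c2) ...
        result = result * x + coefficient
    return result


# Illustrative polynomial 2x^2 + 3x + 1 evaluated at x = 2.
value = horner([2.0, 3.0, 1.0], 2.0)
print(value)  # 15.0
```

Horner's rule needs only one multiply and one add per coefficient, so it removes the numpy dependency without changing the polynomial being evaluated.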

No released Boogu checkpoint ships a PromptEmbedding / prompt-tuning subfolder, so the prompt-tuning path is never exercised by a published model. Per .ai/AGENTS.md ("only keep the inference path you are actually integrating"), remove it entirely: - Delete PromptEmbedding (transformer_boogu.py), BooguImagePromptTuningPipeline (pipeline_boogu.py), and BooguImagePromptTuningRotaryPosEmbed (rope_boogu.py). - Drop the model's unused prompt_tuning_configs config arg, the pipeline's prompt_embedding attribute + set_prompt_embedding(), and the use_prompt_tuning_embedding branch of _get_instruction_feature_embeds (the normal VLM-encoding path is unchanged). The now-orphaned has_offload_strategy / _module_execution_device helpers go with it. - Remove the PromptEmbedding registrations (lazy import structure, top-level export, dummy object). Removing BooguImagePromptTuningPipeline also drops 2 of the 4 except-Exception fallback blocks (the other 2, in BooguImagePipeline, are handled separately). Verified: cached checkpoint transformer loads with no missing/unexpected keys (prompt_tuning_configs in config.json is now harmlessly ignored); import + ruff clean; no orphaned references remain. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

_get_instruction_feature_embeds wrapped the single-layer MLLM call in try output_hidden_states=False / except -> output_hidden_states=True and hidden_states[-1]. Both paths return the same tensor (.last_hidden_state == .hidden_states[-1]), so the except branch only masked real errors behind a UserWarning. Per .ai/AGENTS.md ("raise a concise error for unsupported cases rather than adding complex fallback logic"), call the single path unconditionally and let genuine failures surface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Per .ai/models.md, attention processors must use dispatch_attention_fn rather than calling F.scaled_dot_product_attention / flash_attn_varlen_func directly. Rewrite the two live processors (single-stream BooguImageAttnProcessor and double-stream BooguImageDoubleStreamSelfAttnProcessor) to feed (B, L, H, D) tensors to dispatch_attention_fn with _attention_backend / _parallel_config, and delete the two dead *Flash2Varlen classes and their _upad_input helpers (no longer instantiated; varlen unpadding is handled inside the dispatcher). File shrinks 1128 -> 383 lines. State_dict keys are unchanged: the double-stream QKV/out projections stay on the processor module (...processor.img_to_q / instruct_to_q / img_out / instruct_out), so published checkpoints load strictly with no remapping. The attention mask is always materialized as a [B, 1, 1, L] bool mask (never dropped to None when no token is padded): the native backend rounds bf16 differently on its masked vs no-mask paths, and matching the trained behavior keeps output bit-identical to the pre-refactor pipeline. Verified bit-exact (maxdiff 0.0): CPU tiny-model forward, GPU bf16 single forward, and GPU end-to-end base / edit / turbo. Checkpoint loads strict; pytest suite unchanged at 16/53/7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Per .ai/pipelines.md gotcha #4, a pipeline variant must be its own class with a duplicated __call__ rather than subclassing another pipeline in core src/ (the flux / sdxl / wan / qwenimage convention).

BooguImageTurboPipeline previously subclassed BooguImagePipeline and overrode processing() with a DMD branch. Reparent it to DiffusionPipeline and give it its own pure-T2I DMD __call__: the setup (device management, encode_instruction, prepare_image, prepare_latents, RoPE) mirrors the parent's T2I path, then runs the DMD predict/renoise loop and decode directly — byte-for-byte the same computation the old processing() DMD branch performed. The DMD path takes no scheduler, reference images, or classifier-free guidance, so the negative / empty / BOG / cfg kwargs are dropped from the turbo signature.

Shared utilities (encode_instruction, prepare_latents, prepare_image, predict, device management, the guidance-scale properties, …) are carried as `# Copied from diffusers.pipelines.boogu.pipeline_boogu.BooguImagePipeline.<method>` so make fix-copies keeps them in sync.

Verified: end-to-end turbo output is bit-identical to the pre-change subclass (maxdiff 0.0); base / edit unaffected (also 0.0); check_copies consistent; ruff clean; pytest suite unchanged at 16/53/7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Self-review round 2 against .ai rules, after the four structural refactors. No numerical change: CPU and GPU end-to-end (base/edit/turbo) A/B stay bit-identical (maxdiff 0.0); pytest suite unchanged at 16/53/7.

Dead code removed:
- MASK_VISION_TOKENS_FEATURE / VISION_TOKEN_IDs and their truncation branch (no public API ever sets them) plus the now-unused input_ids local.
- base_sequence_length parameter and its proportional-attention branch from both attention processors (never passed by the transformer); drops the math import.
- BooguImageRotaryPosEmbed reduced to the only thing used — the static get_freqs_cis — dropping its dead __init__/_get_freqs_cis/forward (the transformer uses BooguImageDoubleStreamRotaryPosEmbed; the pipeline only calls the static method).
- Commented-out guidance formula and the `+ +` unary-plus typos in the triple guidance combination; stale docstrings (a "LoRA loading" mention with no LoRA, a reference to an internal training dataset class, a "may not be actually used" development note).

Correctness / convention:
- assert -> raise ValueError in the transformer / rope / attention forward paths (asserts are stripped under python -O).
- _validate_device_format now relies on the validator's own raise instead of returning an ignored bool.
- MomentumRollingSum states are only constructed when boosted orthogonal guidance is enabled.
- encode_instruction return annotation corrected (it returns six values).
- BooguImageTransformerTesterConfig inherits BaseModelTesterConfig (gives it model_split_percents etc., matching the other transformer tests).
- examples: edit / edit_fp8 raise a clear error if base.png is missing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
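
The assert-to-ValueError change in this commit follows a common pattern, sketched below. The function `check_head_dim` and its arguments are hypothetical and stand in for whichever shape checks the forward paths perform.

```python
def check_head_dim(hidden_size, num_heads):
    """Validate a shape constraint with an explicit raise instead of an assert."""
    # `assert hidden_size % num_heads == 0` would be removed entirely under `python -O`,
    # so the check below is the form that survives optimized runs.
    if hidden_size % num_heads != 0:
        raise ValueError(f"hidden_size ({hidden_size}) must be divisible by num_heads ({num_heads}).")
    return hidden_size // num_heads


head_dim = check_head_dim(64, 8)

try:
    check_head_dim(65, 8)
    raised = False
except ValueError:
    raised = True
print(head_dim, raised)
```

The explicit raise also gives callers a specific exception type to handle, where a failed assert only produces a bare AssertionError.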

Collapse statements that fit on one line after the previous cleanup, so `make style` / `ruff format --check` is clean for the PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The instruct_reasoner_static_skills.py prompt-template module was removed during cleanup; its per-file ruff ignore in pyproject.toml pointed at a file that no longer exists. Remove the dead entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

yiyixuxu

thanks a lot for the PR and the very thoughtful self-review :)
i left some comments/questions

The triton fused-RMSNorm / flash-attn SwiGLU paths were gated behind an `os.getenv("device")` guard that defaulted to "cpu", so the published inference path always fell back to torch.nn.RMSNorm and a torch SwiGLU. Remove the unused ops/triton kernels (1261 lines) and ops/simple_layer_norm, drop the dead env-guard in block_lumina2, and the now-unused is_triton_available helper. Numerically identical to the default path; addresses reviewer feedback (single-file convention prep + perf-path removal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

diffusers follows a one-model-one-file convention. Merge the Boogu model's helper modules into transformer_boogu.py:
- rope_boogu.py -> RoPE section
- block_lumina2.py -> norm / feed-forward / embedding section
- attention_processor_boogu.py -> attention-processor section

Update the two pipelines and the transformer test to import BooguImageRotaryPosEmbed from transformer_boogu. Pure code relocation: the class bodies are unchanged, so checkpoints load identically and base/edit/turbo remain bit-exact (verified end-to-end on GPU). Addresses reviewer single-file convention feedback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A pipeline subclass should only carry pipeline-specific steps; device placement, offloading, and component registration belong to DiffusionPipeline. Remove the custom devices_manager / set_mllm / set_transformer / set_processor / set_scheduler / _validate_device_format / _check_device_strategy_validity methods, the enable_*_offload_flag / user_set_pipe_device state, and the now-unused validator_utils helper. __call__ resolves the device via the base class's _execution_device and drops its redundant `device=` kwarg; the mllm lm_head stripping stays in __init__. This also makes the inherited to()/enable_*_offload tests pass (previously 16-17 device/offload failures, now 0). Addresses reviewer feedback on pipeline-subclass responsibilities. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Boogu-Team · 2026-06-23T07:12:55Z

> thanks a lot for the PR and the very thoughtful self-review :)
> i left some comments/questions

Thank you for the quick and thoughtful review, @yiyixuxu — much appreciated! 🙏

I've addressed all the comments and pushed the changes (commits 9e672c2, d202a23, 5cef903):

single-file convention — merged the model's helper modules (rope / blocks / attention processors) into transformer_boogu.py.
triton fused RMSNorm — removed; the published path always fell back to torch.nn.RMSNorm, so it's numerically identical.
pipeline-subclass responsibilities — dropped all the custom device / offload / component-setter infrastructure from both pipelines; they now rely on DiffusionPipeline (.to() / _execution_device / enable_*_offload). This also made the previously-failing device/offload tests pass.

I've replied inline on each thread with the specifics and resolved them.
Happy to iterate further — thanks again!

yiyixuxu

thanks!
i left more comments

diffusers is inference-only, so the model never needs weight initialization — from_pretrained overwrites every parameter. Remove all initialize_weights / _initialize_weights methods and their __init__ call sites across the embedding, attention processors, and transformer blocks/model. Rewrite pipelines/boogu/__init__.py to use the standard _LazyModule import structure (matching flux2) instead of eager imports. Addresses reviewer comments r3470443836 / r3470445779 / r3470475167 / r3470478664 / r3470480686 / r3470553769 / r3470278582. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The norm / feed-forward / embedding helpers were named after Lumina but are not byte-identical to the canonical diffusers versions (Boogu uses torch.nn.RMSNorm and a fp32 SwiGLU), so they cannot share a `# Copied from` link. Rename them to make ownership clear:
- LuminaRMSNormZero -> BooguImageRMSNormZero
- LuminaLayerNormContinuous -> BooguImageLayerNormContinuous
- LuminaFeedForward -> BooguImageFeedForward
- Lumina2CombinedTimestepCaptionEmbedding -> BooguImageCombinedTimestepCaptionEmbedding

Pure rename (class names are not part of the state dict), so checkpoints load identically. Addresses reviewer comments r3470436365 / r3470439983.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

BooguImageRotaryPosEmbed was a stateless namespace class exposing a single static get_freqs_cis; turn it into a module-level function. Its body was also duplicated verbatim as a never-called static method on BooguImageDoubleStreamRotaryPosEmbed — remove that dead copy (the module uses the instance _get_freqs_cis). Update the two pipelines and the transformer test to import the function. freqs_cis output is bit-identical. Addresses reviewer comments r3470387790 / r3470399801. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

TeaCache was bundled inline in the model forward and threaded through every guidance branch of the pipeline denoising loop. Per reviewer guidance, drop it: diffusers already ships FirstBlockCache via CacheMixin and a TeaCache backend is in progress (huggingface#12652), so caching should live in the model framework, not be hand-rolled here. Remove the model-side enable_teacache / teacache_params / rescale-coefficient state and the cache short-circuit in the single-stream stage, the per-condition TeaCacheParams bookkeeping in the pipeline loop, and delete the now-unused utils/teacache_util.py. TeaCache defaulted to off, so the default inference path is unchanged. Addresses reviewer comments r3470572315 / r3470616028 / r3470619173 / r3470928656. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the hand-written slope formula in set_flow_match_timesteps with the canonical calculate_shift helper (Copied from flux), reading the shift config with .get(..., default) like Flux does. Algebraically identical to the previous expression — for the published config (seq_len == max_image_seq_len == 4096) mu stays 1.15 — and verified bit-exact end-to-end on GPU (base maxdiff 0). NOTE on the related comment about routing through retrieve_timesteps(sigmas=...) instead of writing scheduler.timesteps/sigmas directly: Boogu feeds the 0->1 sigma to the transformer AS the timestep, whereas the official set_timesteps forces timesteps = sigmas * num_train_timesteps. Routing the custom 0->1 schedule through it would change the timestep scale handed to the model, so the adapter still sets the schedule explicitly. Addresses reviewer comment r3470650628 (and explains r3470809227). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
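
As a worked check of the claim that mu stays 1.15 for the published config, here is the linear interpolation that `calculate_shift` performs in the Flux pipelines. The default arguments below are Flux's; that Boogu's shift config uses the same base/max values is an assumption, though it is consistent with the 1.15 quoted in this commit.

```python
def calculate_shift(image_seq_len, base_seq_len=256, max_seq_len=4096, base_shift=0.5, max_shift=1.15):
    """Linearly interpolate the time-shift mu between base_shift and max_shift by sequence length."""
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    intercept = base_shift - slope * base_seq_len
    return image_seq_len * slope + intercept


# Published config: seq_len == max_image_seq_len == 4096, so mu lands on max_shift.
mu = calculate_shift(4096)
print(round(mu, 6))
```

Because the line passes through (base_seq_len, base_shift) and (max_seq_len, max_shift), any config with seq_len equal to the maximum sequence length yields mu equal to max_shift regardless of the slope.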

Replace the shared `attention_processor.Attention` used by the Boogu blocks with a dedicated `BooguImageAttention(nn.Module, AttentionModuleMixin)` that holds the per-head layers (to_q/to_k/to_v/norm_q/norm_k/to_out) and defers computation to a stateless processor, addressing the review request to keep processors free of standalone attention layers.
- Single-stream blocks (`attn`, `img_self_attn`) now own their q/k/v on the BooguImageAttention module; `BooguImageAttnProcessor` reads them off `attn` exactly as before (it was already stateless).
- The double-stream `img_instruct_attn` is built with `has_qkv=False`: its per-stream projections (img_to_q/instruct_to_q/img_out/instruct_out, ...) stay on `BooguImageDoubleStreamSelfAttnProcessor` so the published checkpoint keys (`...img_instruct_attn.processor.img_to_q` etc.) are preserved and no weight re-save is required. The module dispatches to the processor via its forward, so the block calls `self.img_instruct_attn(...)` directly instead of reaching into `.processor`.
- qk-norm uses diffusers' RMSNorm (float32-upcasting), matching the previous `Attention(qk_norm="rms_norm")` numerics bit-for-bit; scale = dim_head**-0.5.
- Drop the now-dead `del img_instruct_attn.to_q/to_k/to_v` + requires_grad block (those layers are never created with has_qkv=False).
- Remove the unused `Attention` import.

Verified: strict checkpoint load (0 missing/unexpected keys); GPU base (T2I+CFG, exercises both processors) bit-exact maxdiff=0 vs pre-refactor reference; edit (TI2I) maxdiff 8.6e-2 == same-code self-vs-self 8.2e-2 (the double-guidance path's inherent GPU nondeterminism, not a regression); ruff check/format, check_copies, check_dummies clean; 35 passed / 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Per review, drop the experimental boosted-orthogonal-guidance (BOG) path so the pipeline exposes only standard classifier-free guidance.

Removed:
- `MomentumRollingSum` class.
- `_project_matrix`, `_newtonschulz5_batched`, `bog_norm`, and `calculate_boosted_orthogonal_guidance` methods.
- `use_boosted_orthogonal_guidance` / `bog_mu` / `bog_range` / `bog_interval` and the six `*_momentum_rolling_sum_*` arguments from `__call__`, plus the matching `processing()` parameters and the momentum-state plumbing at the call site.

BOG defaulted to `False`, so every guidance branch already fell through to the plain `model_pred - model_pred_uncond` delta; removing the dead BOG branches leaves the default numerics unchanged.

Verified: GPU base (T2I+CFG) bit-exact maxdiff=0 vs pre-change reference; edit (TI2I) maxdiff 9.4e-2, within the double-guidance path's inherent GPU nondeterminism band (cf. same-code self-check ~8e-2); ruff check/format, check_copies clean; 20 passed / 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
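
For readers unfamiliar with the "plain `model_pred - model_pred_uncond` delta", the textbook classifier-free guidance combination is sketched below on plain Python lists. This is the generic formula, not a copy of the pipeline's code: the function name is hypothetical, and how Boogu combines the text and image guidance terms on the edit path may differ.

```python
def classifier_free_guidance(pred_cond, pred_uncond, guidance_scale):
    """Move the unconditional prediction toward the conditional one by guidance_scale."""
    return [
        uncond + guidance_scale * (cond - uncond)  # (cond - uncond) is the guidance delta
        for cond, uncond in zip(pred_cond, pred_uncond)
    ]


# Toy two-element predictions; real predictions are latent tensors.
guided = classifier_free_guidance([1.0, 2.0], [0.5, 1.0], 4.0)
print(guided)  # [2.5, 5.0]
```

With guidance_scale equal to 1 the result is just the conditional prediction, which is why a scale of 1 is commonly treated as guidance being off.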

Boogu-Team · 2026-06-25T08:06:49Z

> thanks!
> i left more comments

Thanks for the detailed second-round review! I've pushed updates addressing it — one commit per topic so each is easy to review in isolation:

Commit	What
`05cc899ae`	Drop all weight-init (`initialize_weights`), follow the lazy / inference-only `__init__` convention
`a597eed0b`	Rename the Lumina-derived layers to `BooguImage*`
`36298d3a6`	Collapse the RoPE namespace class into a `get_freqs_cis` function (and drop the dead duplicate)
`c9a6b5d06`	Remove TeaCache (model + pipeline) and delete `utils/teacache_util.py`; defer caching to `CacheMixin` / FirstBlockCache
`662e6ca11`	Compute the time-shift `mu` via `calculate_shift` (`# Copied from` flux)
`2977b2dce`	Introduce `BooguImageAttention` (flux2-style) that holds the layers; processors dispatch through it
`3b56e275c`	Remove boosted-orthogonal-guidance; the pipeline now exposes standard CFG only
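As context for commit `662e6ca11`: the Flux helper it copies computes `mu` by linear interpolation over the image sequence length. The sketch below follows that shape. The default values are assumed from the Flux pipeline and may differ from what Boogu's scheduler config supplies.

```python
def calculate_shift(
    image_seq_len,
    base_seq_len=256,   # assumed Flux default
    max_seq_len=4096,   # assumed Flux default
    base_shift=0.5,     # assumed Flux default
    max_shift=1.15,     # assumed Flux default
):
    """Linearly interpolate the time-shift mu between base_shift and max_shift."""
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    intercept = base_shift - slope * base_seq_len
    return image_seq_len * slope + intercept


# Longer token sequences (larger images) get a larger shift.
mu_small = calculate_shift(256)
mu_large = calculate_shift(4096)
```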

I've replied inline on each thread. Of the 20 comments, 18 are done; 2 I've left open with a question for you rather than guess:

retrieve_timesteps(..., sigmas=) (pipeline timesteps) — Boogu feeds the 0→1 sigma value directly as the timestep, whereas set_timesteps(sigmas=) forces timesteps = sigmas * num_train_timesteps + its own shift, which changes the numerics and breaks bit-exactness. I kept the explicit assignment for now and proposed a small scheduler flag as a clean alternative — would value your call on the thread.
Inlining img_patch_embed_and_refine into forward — happy to inline per the coding-style doc, just wanted to confirm the preference since forward is already large.
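The first open question comes down to a change in the numeric range of the timesteps. The following is a sketch under stated assumptions: `num_train_timesteps=1000`, the static shift formula, and `shift=3.0` are illustrative choices, and both function names are hypothetical. It contrasts feeding the 0→1 sigma directly with a scheduler-style rescaling.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive."""
    return [start + (stop - start) * i / (num - 1) for i in range(num)]


def direct_timesteps(sigmas):
    """Boogu-style: the sigma value itself is passed to the model as the timestep."""
    return list(sigmas)


def scheduler_style_timesteps(sigmas, num_train_timesteps=1000, shift=3.0):
    """Assumed scheduler behaviour: apply its own shift, then scale the sigmas."""
    shifted = [shift * s / (1 + (shift - 1) * s) for s in sigmas]
    return [s * num_train_timesteps for s in shifted]


sigmas = linspace(0.0, 1.0, 5)
direct = direct_timesteps(sigmas)             # stays within [0, 1]
rescaled = scheduler_style_timesteps(sigmas)  # lands within [0, 1000]
```

Because the model was trained on the first convention, switching to the second changes its inputs, which is why the thread treats it as a bit-exactness issue rather than a style choice.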

Two things I'd also like your steer on:

Double-stream attention & checkpoint keys. The single-stream processor is now fully stateless. For the double-stream block, the per-stream projections (img_to_q / instruct_to_q / img_out / ...) are stored in the already-published checkpoints under ...img_instruct_attn.processor.*. Moving them onto BooguImageAttention to make that processor fully stateless would rename those state-dict keys and require re-saving the released Hub checkpoints (Base / Edit / Turbo / Turbo-fp8). To preserve compatibility for existing downloads, I kept those projections on the processor for this round. If you'd prefer the fully-stateless form, we'll do the key migration + re-publish in a follow-up once we can coordinate the Hub re-upload.
Boosted Orthogonal Guidance. I removed it outright here (it defaulted to off, so standard CFG output is unchanged / bit-exact). I'd like to bring it back as its own guider class under src/diffusers/guiders/ — and route advanced features through modular diffusers — in a follow-up PR, keeping this one focused on the standard pipeline. Does that sequencing work for you?
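To make the cost of the fully-stateless option concrete, this is roughly what the key migration would look like on a state dict. It is a sketch only: the target layout (projections moved directly onto the attention module) and the `layers.0.` prefix are hypothetical, since the PR elides the real key prefix, and the values stand in for tensors.

```python
OLD_SEGMENT = ".img_instruct_attn.processor."
NEW_SEGMENT = ".img_instruct_attn."  # hypothetical post-migration layout


def migrate_processor_keys(state_dict):
    """Rename processor-owned projection keys; leave every other key untouched."""
    return {
        key.replace(OLD_SEGMENT, NEW_SEGMENT): value
        for key, value in state_dict.items()
    }


published = {
    "layers.0.img_instruct_attn.processor.img_to_q.weight": "w_img_q",
    "layers.0.img_instruct_attn.processor.instruct_to_q.weight": "w_ins_q",
    "layers.0.img_self_attn.to_q.weight": "w_self_q",
}
migrated = migrate_processor_keys(published)

# Keys an old checkpoint would be missing under the new layout.
missing_in_old = set(migrated) - set(published)
```

Every renamed key shows up as missing when an unmigrated checkpoint is loaded strictly, which is why the rename implies re-saving the released weights.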

Verification: every change is checked on GPU. The base path (T2I + CFG, which exercises both attention processors) is bit-exact (maxdiff=0) vs the pre-refactor reference; the edit path (TI2I double-guidance) matches within that path's inherent GPU nondeterminism (~8–9e-2, same as a same-code self-vs-self run). ruff check / ruff format / check_copies / check_dummies are clean, and the model + pipeline test suites pass. The six examples/boogu scripts (base / edit / turbo, each + fp8) all generate images correctly.
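The acceptance logic described here can be sketched as follows. `maxdiff` is the largest elementwise absolute difference; the `slack` factor and the helper name `within_noise_band` are assumptions for illustration, not part of the PR's test harness.

```python
def maxdiff(a, b):
    """Largest elementwise absolute difference between two flat outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


def within_noise_band(candidate_diff, self_vs_self_diff, slack=1.5):
    """Accept a diff no larger than slack times the same-code run-to-run noise."""
    return candidate_diff <= slack * self_vs_self_diff


# Deterministic path: require exact equality with the reference.
base_ok = maxdiff([0.1, 0.2], [0.1, 0.2]) == 0.0

# Nondeterministic path: compare against the measured self-vs-self floor,
# using the figures reported in the commit messages.
edit_ok = within_noise_band(8.6e-2, 8.2e-2)
```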

Thanks again — happy to iterate on the two open questions.

Boogu-Team and others added 13 commits June 18, 2026 15:51

Boogu: apply ruff format

8952e6d

Collapse statements that fit on one line after the previous cleanup, so `make style` / `ruff format --check` is clean for the PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added the size/L (PR with diff > 200 LOC), documentation (Improvements or additions to documentation), models, tests, utils, pipelines, examples, and schedulers labels and removed the size/L label on Jun 22, 2026

yiyixuxu reviewed Jun 23, 2026

yiyixuxu mentioned this pull request Jun 23, 2026

[.ai] document single-file model layout and "don't reimplement Diffus… #14048

Merged

Boogu-Team and others added 3 commits June 23, 2026 03:12

github-actions Bot added the size/L label (PR with diff > 200 LOC) on Jun 23, 2026

yiyixuxu reviewed Jun 24, 2026

Boogu-Team and others added 7 commits June 25, 2026 03:59

Boogu-Team wants to merge 23 commits into huggingface:main from Boogu-Team:feat/integrate-boogu
