Add Boogu-Image generation, editing, and turbo pipelines#14040
Boogu-Team wants to merge 23 commits into
Integrate the Boogu-Image model into diffusers:
- Models: BooguImageTransformer2DModel, PromptEmbedding, Boogu attention processors, Lumina2 blocks, and rotary embeddings.
- Pipelines: BooguImagePipeline (text-to-image and instruction editing) and BooguImageTurboPipeline (DMD few-step text-to-image).
- Scheduler: flow-match Euler scheduler with training-aligned time shifting.
- Internal utils: TaylorSeer cache, TeaCache params, DPM cache helpers, and optional Triton fused RMSNorm.
- Loading: resolve published checkpoints' custom module names to the integrated classes via module aliases, so from_pretrained needs no trust_remote_code.
- Docs and runnable examples under docs/ and examples/boogu/.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the Boogu-only TaylorSeer caching feature, which was only half-removed in the working tree (left dangling `enable_taylorseer` references that raised NameError, and collaterally deleted the TeaCache `__init__` setup so the transformer raised AttributeError on `enable_teacache`).
- transformer_boogu.py: remove the remaining TaylorSeer branches; restore the TeaCache init block (enable_teacache, enable_teacache_for_all_layers, teacache_rel_l1_thresh, teacache_params, rescale_func) and the numpy / TeaCacheParams imports it needs.
- pipeline_boogu.py: drop the cache_init import, the enable_taylorseer plumbing and per-condition cache_dic/current branches, collapsing each `if enable_taylorseer / elif enable_teacache` into a plain `if enable_teacache`.
- Delete cache_functions/ and taylorseer_utils/ (Boogu-added, TaylorSeer-only, now unreferenced). The upstream hooks-based TaylorSeerCacheConfig is untouched.
- Remove BOOGU_INTEGRATION.md (ephemeral integration notes); add an environment install link to examples/boogu/README.md.
The pipeline uses the official FlowMatchEulerDiscreteScheduler via the thin BooguFlowMatchEulerDiscreteScheduler subclass (reuses the parent step).
Tests: test_models_transformer_boogu (15 passed) and test_boogu (20 passed) green; check_copies and check_dummies pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… adapter
Replace the BooguFlowMatchEulerDiscreteScheduler subclass with the official FlowMatchEulerDiscreteScheduler plus a standalone set_flow_match_timesteps adapter that applies Boogu's training-aligned static v1 time shift and 0->1 sigma schedule, reusing the parent's exponential shift formula.
- Add pipelines/boogu/flow_match_boogu.py with set_flow_match_timesteps
- Route the flow-match branch of retrieve_timesteps through the adapter (annotated "# Adapted from" to reflect the intentional divergence)
- Update pipeline/test type hints and imports to the official scheduler
- Drop the scheduler subclass and its registrations (schedulers/__init__, top-level __init__, dummy_pt_objects)
Numerically equivalent to the old subclass within floating-point rounding (max diff ~6e-08). The boogu test suite shows no regression vs the pre-change tree (same 11 pre-existing MLLM device-placement failures, 19 passed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…writer
Reduce the boogu pipeline package from 7 files to 4 by removing dead and misplaced code, keeping the default T2I/TI2I inference path unchanged.
- Inline set_flow_match_timesteps into pipeline_boogu.py (single caller) and delete flow_match_boogu.py, per the "inline single-caller helpers" rule.
- Replace the image_processor.preprocess override (which duplicated the parent VaeImageProcessor wholesale) with a thin override that only derives the Boogu max_pixels/max_side_length target size, then delegates to the parent. Verified bit-identical output across sizes/constraints (max diff 0.0).
- Remove BooguImageLoraLoaderMixin / lora_pipeline.py: LoRA is unused on the inference path, and the mixin belongs in loaders/ by diffusers convention.
- Remove the instruction-rewriter feature entirely (static_skills.py, instruct_reasoner_static_skills.py, and ~1100 lines of rewriter methods, state, and public kwargs). It was gated by use_rewrite_text_instruction (default False) and unused by every example/test; the skills files were its only consumers.
Net: -2255 / +74 lines. End-to-end TI2I inference reproduces the standalone reference (mean pixel diff 8.8, unchanged from before), and the boogu test suite shows the same pre-existing baseline (11 failed / 19 passed / 4 skipped, the 11 being unrelated MLLM device-placement failures). check_copies and check_dummies pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both edit examples ran with no negative prompt. At text_guidance_scale=4.0 the model guides away from the negative instruction, so omitting it left the output oversaturated and under-stylized (style transfer barely applied). Add the standard negative prompt used by the reference inference so the colored-pencil style conversion comes through. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tions)
Self-review against .ai/{AGENTS,models,pipelines,review-rules}.md surfaced a
batch of mechanical issues fixed here (no behavior change on the default path;
boogu test suite unchanged at 16/53/7, identical failure set — the remaining
failures are a pre-existing MLLM cpu/cuda device-placement issue).
pipeline_boogu.py:
- Remove dead helpers: _project, _sigmoid_kernel, _softmax_kernel, the
non-newton-schulz bog_norm branches, MomentumRollingSum._append_and_save
(+ now-unused pathlib import).
- Drop unused __call__ params verbose and callback_on_step_end_tensor_inputs,
a bare `latents.shape[0]` expression, and several commented-out code blocks.
- Replace all print() with module logger; drop emoji/blank-line prints.
pipeline_boogu_turbo.py:
- Add module logger; replace the inference print() with logger.info.
transformer_boogu.py:
- Default attention to the SDPA processor instead of selecting it from an
os.getenv("device") read at __init__ (non-standard, and forced flash in fp32);
drop the now-unused Flash2Varlen imports and the single-stream block alias.
- Replace np.poly1d TeaCache rescale with inline Horner eval; drop numpy import.
- Fix _no_split_modules / _repeated_blocks (remove the alias string that never
matched __class__.__name__ and the invalid "nn.Embedding" entry).
- Give PromptEmbedding flat @register_to_config kwargs so from_pretrained
round-trips; remove its non-standard from_config override.
- Remove dead self.layers, enable_teacache_for_all_layers, a commented-out
param, a discarded dict lookup, and a stale section comment.
attention_processor_boogu.py:
- Remove no-op `layer = layer.to(device)` loops (rebind a local, never move the
module) plus the bare shape expressions and commented debug lines above them.
image_processor.py:
- Guard get_new_height_width against None max_pixels / max_side_length
(previously TypeError / UnboundLocalError when called with defaults); output
is bit-identical when both constraints are set. Sync the class docstring to
the actual __init__ signature.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
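The np.poly1d replacement listed under transformer_boogu.py above can be sketched as follows. This is a minimal illustration, assuming coefficients are ordered highest degree first (the np.poly1d convention); the function name `horner_eval` and the sample coefficients are hypothetical and are not taken from the PR.

```python
def horner_eval(coefficients, x):
    """Evaluate a polynomial at x using Horner's rule.

    Coefficients are ordered highest degree first, matching np.poly1d.
    """
    result = 0.0
    for coefficient in coefficients:
        # Fold in one coefficient per step: result = result * x + c
        result = result * x + coefficient
    return result


# Hypothetical coefficients for 1*x^2 + 2*x + 3, not the TeaCache values.
print(horner_eval([1.0, 2.0, 3.0], 2.0))  # 11.0
```

Horner's rule needs one multiply and one add per coefficient and no array library, which is presumably why it allowed the numpy import to be dropped.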
No released Boogu checkpoint ships a PromptEmbedding / prompt-tuning subfolder,
so the prompt-tuning path is never exercised by a published model. Per
.ai/AGENTS.md ("only keep the inference path you are actually integrating"),
remove it entirely:
- Delete PromptEmbedding (transformer_boogu.py), BooguImagePromptTuningPipeline
(pipeline_boogu.py), and BooguImagePromptTuningRotaryPosEmbed (rope_boogu.py).
- Drop the model's unused prompt_tuning_configs config arg, the pipeline's
prompt_embedding attribute + set_prompt_embedding(), and the
use_prompt_tuning_embedding branch of _get_instruction_feature_embeds (the
normal VLM-encoding path is unchanged). The now-orphaned has_offload_strategy
/ _module_execution_device helpers go with it.
- Remove the PromptEmbedding registrations (lazy import structure, top-level
export, dummy object).
Removing BooguImagePromptTuningPipeline also drops 2 of the 4 except-Exception
fallback blocks (the other 2, in BooguImagePipeline, are handled separately).
Verified: cached checkpoint transformer loads with no missing/unexpected keys
(prompt_tuning_configs in config.json is now harmlessly ignored); import +
ruff clean; no orphaned references remain.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_get_instruction_feature_embeds wrapped the single-layer MLLM call in
try output_hidden_states=False / except -> output_hidden_states=True and
hidden_states[-1]. Both paths return the same tensor (.last_hidden_state ==
.hidden_states[-1]), so the except branch only masked real errors behind a
UserWarning. Per .ai/AGENTS.md ("raise a concise error for unsupported cases
rather than adding complex fallback logic"), call the single path
unconditionally and let genuine failures surface.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per .ai/models.md, attention processors must use dispatch_attention_fn rather than calling F.scaled_dot_product_attention / flash_attn_varlen_func directly. Rewrite the two live processors (single-stream BooguImageAttnProcessor and double-stream BooguImageDoubleStreamSelfAttnProcessor) to feed (B, L, H, D) tensors to dispatch_attention_fn with _attention_backend / _parallel_config, and delete the two dead *Flash2Varlen classes and their _upad_input helpers (no longer instantiated; varlen unpadding is handled inside the dispatcher). File shrinks 1128 -> 383 lines. State_dict keys are unchanged: the double-stream QKV/out projections stay on the processor module (...processor.img_to_q / instruct_to_q / img_out / instruct_out), so published checkpoints load strictly with no remapping. The attention mask is always materialized as a [B, 1, 1, L] bool mask (never dropped to None when no token is padded): the native backend rounds bf16 differently on its masked vs no-mask paths, and matching the trained behavior keeps output bit-identical to the pre-refactor pipeline. Verified bit-exact (maxdiff 0.0): CPU tiny-model forward, GPU bf16 single forward, and GPU end-to-end base / edit / turbo. Checkpoint loads strict; pytest suite unchanged at 16/53/7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per .ai/pipelines.md gotcha huggingface#4, a pipeline variant must be its own class with a duplicated __call__ rather than subclassing another pipeline in core src/ (the flux / sdxl / wan / qwenimage convention). BooguImageTurboPipeline previously subclassed BooguImagePipeline and overrode processing() with a DMD branch. Reparent it to DiffusionPipeline and give it its own pure-T2I DMD __call__: the setup (device management, encode_instruction, prepare_image, prepare_latents, RoPE) mirrors the parent's T2I path, then runs the DMD predict/renoise loop and decode directly — byte-for-byte the same computation the old processing() DMD branch performed. The DMD path takes no scheduler, reference images, or classifier-free guidance, so the negative / empty / BOG / cfg kwargs are dropped from the turbo signature. Shared utilities (encode_instruction, prepare_latents, prepare_image, predict, device management, the guidance-scale properties, …) are carried as `# Copied from diffusers.pipelines.boogu.pipeline_boogu.BooguImagePipeline.<method>` so make fix-copies keeps them in sync. Verified: end-to-end turbo output is bit-identical to the pre-change subclass (maxdiff 0.0); base / edit unaffected (also 0.0); check_copies consistent; ruff clean; pytest suite unchanged at 16/53/7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Self-review round 2 against .ai rules, after the four structural refactors. No numerical change: CPU and GPU end-to-end (base/edit/turbo) A/B stay bit-identical (maxdiff 0.0); pytest suite unchanged at 16/53/7.
Dead code removed:
- MASK_VISION_TOKENS_FEATURE / VISION_TOKEN_IDs and their truncation branch (no public API ever sets them) plus the now-unused input_ids local.
- base_sequence_length parameter and its proportional-attention branch from both attention processors (never passed by the transformer); drops the math import.
- BooguImageRotaryPosEmbed reduced to the only thing used, the static get_freqs_cis, dropping its dead __init__/_get_freqs_cis/forward (the transformer uses BooguImageDoubleStreamRotaryPosEmbed; the pipeline only calls the static method).
- Commented-out guidance formula and the `+ +` unary-plus typos in the triple guidance combination; stale docstrings (a "LoRA loading" mention with no LoRA, a reference to an internal training dataset class, a "may not be actually used" development note).
Correctness / convention:
- assert -> raise ValueError in the transformer / rope / attention forward paths (asserts are stripped under python -O).
- _validate_device_format now relies on the validator's own raise instead of returning an ignored bool.
- MomentumRollingSum states are only constructed when boosted orthogonal guidance is enabled.
- encode_instruction return annotation corrected (it returns six values).
- BooguImageTransformerTesterConfig inherits BaseModelTesterConfig (gives it model_split_percents etc., matching the other transformer tests).
- examples: edit / edit_fp8 raise a clear error if base.png is missing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
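The assert -> raise ValueError change above follows a common pattern, sketched here under assumptions: `check_sequence_length` and its parameters are hypothetical names chosen for illustration and do not appear in the PR.

```python
def check_sequence_length(seq_len, max_seq_len):
    """Validate an input length with an explicit exception.

    Hypothetical helper for illustration; not a function from the PR.
    """
    # An `assert seq_len <= max_seq_len` here would be removed when Python
    # runs with -O, silently skipping the check.
    if seq_len > max_seq_len:
        raise ValueError(
            f"Sequence length {seq_len} exceeds the supported maximum {max_seq_len}."
        )
    return seq_len
```

An explicit raise keeps the validation active in optimized runs and gives the caller a specific exception type to catch.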
Collapse statements that fit on one line after the previous cleanup, so `make style` / `ruff format --check` is clean for the PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The instruct_reasoner_static_skills.py prompt-template module was removed during cleanup; its per-file ruff ignore in pyproject.toml pointed at a file that no longer exists. Remove the dead entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
yiyixuxu
left a comment
thanks a lot for the PR and the very thoughtful self-review :)
i left some comments/questions
The triton fused-RMSNorm / flash-attn SwiGLU paths were gated behind an
`os.getenv("device")` guard that defaulted to "cpu", so the published
inference path always fell back to torch.nn.RMSNorm and a torch SwiGLU.
Remove the unused ops/triton kernels (1261 lines) and ops/simple_layer_norm,
drop the dead env-guard in block_lumina2, and the now-unused
is_triton_available helper. Numerically identical to the default path;
addresses reviewer feedback (single-file convention prep + perf-path removal).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diffusers follows a one-model-one-file convention. Merge the Boogu model's helper modules into transformer_boogu.py:
- rope_boogu.py -> RoPE section
- block_lumina2.py -> norm / feed-forward / embedding section
- attention_processor_boogu.py -> attention-processor section
Update the two pipelines and the transformer test to import BooguImageRotaryPosEmbed from transformer_boogu. Pure code relocation: the class bodies are unchanged, so checkpoints load identically and base/edit/turbo remain bit-exact (verified end-to-end on GPU). Addresses reviewer single-file convention feedback.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A pipeline subclass should only carry pipeline-specific steps; device placement, offloading, and component registration belong to DiffusionPipeline. Remove the custom devices_manager / set_mllm / set_transformer / set_processor / set_scheduler / _validate_device_format / _check_device_strategy_validity methods, the enable_*_offload_flag / user_set_pipe_device state, and the now-unused validator_utils helper. __call__ resolves the device via the base class's _execution_device and drops its redundant `device=` kwarg; the mllm lm_head stripping stays in __init__. This also makes the inherited to()/enable_*_offload tests pass (previously 16-17 device/offload failures, now 0). Addresses reviewer feedback on pipeline-subclass responsibilities. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Thank you for the quick and thoughtful review, @yiyixuxu — much appreciated! 🙏 I've addressed all the comments and pushed the changes (commits 9e672c2, d202a23, 5cef903):
I've replied inline on each thread with the specifics and resolved them.
yiyixuxu
left a comment
thanks!
i left more comments
diffusers is inference-only, so the model never needs weight initialization — from_pretrained overwrites every parameter. Remove all initialize_weights / _initialize_weights methods and their __init__ call sites across the embedding, attention processors, and transformer blocks/model. Rewrite pipelines/boogu/__init__.py to use the standard _LazyModule import structure (matching flux2) instead of eager imports. Addresses reviewer comments r3470443836 / r3470445779 / r3470475167 / r3470478664 / r3470480686 / r3470553769 / r3470278582. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The norm / feed-forward / embedding helpers were named after Lumina but are not byte-identical to the canonical diffusers versions (Boogu uses torch.nn.RMSNorm and a fp32 SwiGLU), so they cannot share a `# Copied from` link. Rename them to make ownership clear:
- LuminaRMSNormZero -> BooguImageRMSNormZero
- LuminaLayerNormContinuous -> BooguImageLayerNormContinuous
- LuminaFeedForward -> BooguImageFeedForward
- Lumina2CombinedTimestepCaptionEmbedding -> BooguImageCombinedTimestepCaptionEmbedding
Pure rename (class names are not part of the state dict), so checkpoints load identically. Addresses reviewer comments r3470436365 / r3470439983.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BooguImageRotaryPosEmbed was a stateless namespace class exposing a single static get_freqs_cis; turn it into a module-level function. Its body was also duplicated verbatim as a never-called static method on BooguImageDoubleStreamRotaryPosEmbed — remove that dead copy (the module uses the instance _get_freqs_cis). Update the two pipelines and the transformer test to import the function. freqs_cis output is bit-identical. Addresses reviewer comments r3470387790 / r3470399801. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
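For readers unfamiliar with what a get_freqs_cis-style function computes, here is a minimal sketch of a module-level function of that kind. It uses the generic one-dimensional rotary-embedding formulation; the name `get_freqs_cis_1d`, its parameters, and the default `theta` are assumptions, and the actual Boogu function may use a different layout (for example multi-axis positions).

```python
import cmath


def get_freqs_cis_1d(dim, end, theta=10000.0):
    """Return a table of complex rotations exp(i * pos * freq) for 1D RoPE.

    Generic formulation for illustration only; not the PR's implementation.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # One frequency per pair of channels: theta^(-2k/dim)
    freqs = [theta ** (-(2 * k) / dim) for k in range(dim // 2)]
    # Row p holds the unit-magnitude rotations for position p
    return [[cmath.exp(1j * pos * freq) for freq in freqs] for pos in range(end)]


table = get_freqs_cis_1d(dim=8, end=4)
print(len(table), len(table[0]))  # 4 4
```

Because the computation depends only on its arguments, a plain function is enough; no instance state is required.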
TeaCache was bundled inline in the model forward and threaded through every guidance branch of the pipeline denoising loop. Per reviewer guidance, drop it: diffusers already ships FirstBlockCache via CacheMixin and a TeaCache backend is in progress (huggingface#12652), so caching should live in the model framework, not be hand-rolled here. Remove the model-side enable_teacache / teacache_params / rescale-coefficient state and the cache short-circuit in the single-stream stage, the per-condition TeaCacheParams bookkeeping in the pipeline loop, and delete the now-unused utils/teacache_util.py. TeaCache defaulted to off, so the default inference path is unchanged. Addresses reviewer comments r3470572315 / r3470616028 / r3470619173 / r3470928656. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hand-written slope formula in set_flow_match_timesteps with the canonical calculate_shift helper (Copied from flux), reading the shift config with .get(..., default) like Flux does. Algebraically identical to the previous expression — for the published config (seq_len == max_image_seq_len == 4096) mu stays 1.15 — and verified bit-exact end-to-end on GPU (base maxdiff 0). NOTE on the related comment about routing through retrieve_timesteps(sigmas=...) instead of writing scheduler.timesteps/sigmas directly: Boogu feeds the 0->1 sigma to the transformer AS the timestep, whereas the official set_timesteps forces timesteps = sigmas * num_train_timesteps. Routing the custom 0->1 schedule through it would change the timestep scale handed to the model, so the adapter still sets the schedule explicitly. Addresses reviewer comment r3470650628 (and explains r3470809227). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
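The shift computation described above can be sketched in plain Python. The linear interpolation mirrors the calculate_shift helper used by the Flux pipelines, and the exponential shift mirrors the formula in FlowMatchEulerDiscreteScheduler; the default values below are assumptions chosen to be consistent with the mu = 1.15 figure quoted for seq_len 4096, and how Boogu maps the result onto its 0->1 sigma schedule is not shown here.

```python
import math


def calculate_shift(image_seq_len, base_seq_len=256, max_seq_len=4096,
                    base_shift=0.5, max_shift=1.15):
    """Linearly interpolate the shift mu from the image sequence length."""
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b


def time_shift_exponential(mu, sigma, t):
    """Exponential time shift applied to a value t in (0, 1]."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / t - 1.0) ** sigma)


mu = calculate_shift(4096)
print(round(mu, 6))  # 1.15
```

With seq_len equal to max_seq_len the interpolation lands on max_shift, which matches the statement that mu stays 1.15 for the published config.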
Replace the shared `attention_processor.Attention` used by the Boogu blocks with a dedicated `BooguImageAttention(nn.Module, AttentionModuleMixin)` that holds the per-head layers (to_q/to_k/to_v/norm_q/norm_k/to_out) and defers computation to a stateless processor, addressing the review request to keep processors free of standalone attention layers.
- Single-stream blocks (`attn`, `img_self_attn`) now own their q/k/v on the BooguImageAttention module; `BooguImageAttnProcessor` reads them off `attn` exactly as before (it was already stateless).
- The double-stream `img_instruct_attn` is built with `has_qkv=False`: its per-stream projections (img_to_q/instruct_to_q/img_out/instruct_out, ...) stay on `BooguImageDoubleStreamSelfAttnProcessor` so the published checkpoint keys (`...img_instruct_attn.processor.img_to_q` etc.) are preserved and no weight re-save is required. The module dispatches to the processor via its forward, so the block calls `self.img_instruct_attn(...)` directly instead of reaching into `.processor`.
- qk-norm uses diffusers' RMSNorm (float32-upcasting), matching the previous `Attention(qk_norm="rms_norm")` numerics bit-for-bit; scale = dim_head**-0.5.
- Drop the now-dead `del img_instruct_attn.to_q/to_k/to_v` + requires_grad block (those layers are never created with has_qkv=False).
- Remove the unused `Attention` import.
Verified: strict checkpoint load (0 missing/unexpected keys); GPU base (T2I+CFG, exercises both processors) bit-exact maxdiff=0 vs pre-refactor reference; edit (TI2I) maxdiff 8.6e-2 == same-code self-vs-self 8.2e-2 (the double-guidance path's inherent GPU nondeterminism, not a regression); ruff check/format, check_copies, check_dummies clean; 35 passed / 0 failed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
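The qk-norm and scale mentioned above reduce to two small formulas, sketched here in plain Python. The function names and the `eps` default are assumptions for illustration; Python floats are double precision, so the float32 upcast of the real RMSNorm is not modelled.

```python
import math


def rms_norm(values, weight=None, eps=1e-6):
    """RMS-normalize a vector: x / sqrt(mean(x^2) + eps), then scale."""
    mean_square = sum(v * v for v in values) / len(values)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    if weight is None:
        # Identity scaling when no learned weight is supplied
        weight = [1.0] * len(values)
    return [v * inv_rms * w for v, w in zip(values, weight)]


def attention_scale(dim_head):
    """Scaling factor applied to q.k scores before the softmax."""
    return dim_head ** -0.5


print(attention_scale(64))  # 0.125
```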
Per review, drop the experimental boosted-orthogonal-guidance (BOG) path so the pipeline exposes only standard classifier-free guidance. Removed:
- `MomentumRollingSum` class.
- `_project_matrix`, `_newtonschulz5_batched`, `bog_norm`, and `calculate_boosted_orthogonal_guidance` methods.
- `use_boosted_orthogonal_guidance` / `bog_mu` / `bog_range` / `bog_interval` and the six `*_momentum_rolling_sum_*` arguments from `__call__`, plus the matching `processing()` parameters and the momentum-state plumbing at the call site.
BOG defaulted to `False`, so every guidance branch already fell through to the plain `model_pred - model_pred_uncond` delta; removing the dead BOG branches leaves the default numerics unchanged.
Verified: GPU base (T2I+CFG) bit-exact maxdiff=0 vs pre-change reference; edit (TI2I) maxdiff 9.4e-2, within the double-guidance path's inherent GPU nondeterminism band (cf. same-code self-check ~8e-2); ruff check/format, check_copies clean; 20 passed / 0 failed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
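The plain `model_pred - model_pred_uncond` delta referred to above is the core of standard classifier-free guidance. A minimal sketch on Python lists follows; the function name is hypothetical, and the exact combination Boogu uses (in particular for the double-guidance edit path) is not given in this message, so only the single-scale textbook form is shown.

```python
def classifier_free_guidance(pred_cond, pred_uncond, guidance_scale):
    """Combine conditional and unconditional predictions element-wise."""
    return [
        # Move away from the unconditional prediction along the delta
        uncond + guidance_scale * (cond - uncond)
        for cond, uncond in zip(pred_cond, pred_uncond)
    ]


print(classifier_free_guidance([1.0, 2.0], [0.5, 1.0], 4.0))  # [2.5, 5.0]
```

A guidance scale of 1.0 returns the conditional prediction unchanged, which is a quick sanity check for any implementation of this formula.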
Thanks for the detailed second-round review! I've pushed updates addressing it — one commit per topic so each is easy to review in isolation:
I've replied inline on each thread. Of the 20 comments, 18 are done; 2 I've left open with a question for you rather than guess:
Two things I'd also like your steer on:
Verification: every change is checked on GPU. The base path (T2I + CFG, which exercises both attention processors) is bit-exact (maxdiff=0) vs the pre-refactor reference; the edit path (TI2I double-guidance) matches within that path's inherent GPU nondeterminism (~8–9e-2, same as a same-code self-vs-self run). Thanks again; happy to iterate on the two open questions.
What does this PR do?
Adds the Boogu-Image family of pipelines to diffusers:
- BooguImagePipeline: text-to-image generation and instruction-based image editing.
- BooguImageTurboPipeline: few-step DMD distilled generation.
The integration is purely additive: it introduces new files only and does not modify any existing upstream module.
Published checkpoints (Hugging Face Hub, Boogu/…): Boogu-Image-0.1-Base, Boogu-Image-0.1-Edit, Boogu-Image-0.1-Turbo, and their -fp8 variants.
Pipelines & model
- src/diffusers/pipelines/boogu/pipeline_boogu.py
- src/diffusers/pipelines/boogu/pipeline_boogu_turbo.py
- src/diffusers/models/transformers/transformer_boogu.py
- src/diffusers/pipelines/boogu/image_processor.py
Docs: docs/source/en/api/pipelines/boogu.md. Runnable examples: examples/boogu/ (base / edit / turbo + fp8, with README.md).
Convention compliance
Implemented against the repo's .ai rules:
- … BooguImageAttention) + stateless processors, and the RoPE helper all live in transformer_boogu.py (one model = one file).
- dispatch_attention_fn (no direct F.scaled_dot_product_attention in the forward path); attention masks always materialized as [B, 1, 1, L] bool to stay bit-exact with the trained checkpoints under the native bf16 backend.
- BooguImageAttention (modelled on Flux2Attention) holds the to_q/to_k/to_v/norm_q/norm_k/to_out layers and dispatches to a stateless processor.
- … initialize_weights): the model loads pretrained weights.
- __init__ follows the lazy convention.
- BooguImageTurboPipeline is a standalone DiffusionPipeline (not a subclass of BooguImagePipeline), with shared methods carried via # Copied from and kept in sync by make fix-copies. Device placement / offloading / component registration reuse the DiffusionPipeline base class.
- … except Exception fallbacks, no unused "API-consistency" parameters: training / ablation / prompt-tuning code from the research repo was removed; only the inference path is integrated.
Verification
- examples/boogu/ scripts run end-to-end and produce correct images (base / edit / turbo + fp8).
- … maxdiff = 0) on GPU; the edit path (TI2I double-guidance) matches within that path's inherent GPU nondeterminism (~8–9e-2, confirmed equal to a same-code self-vs-self run). Checkpoints load strict (no missing/unexpected keys).
- ruff check, ruff format --check, make fix-copies (no diff), check_dummies.
- tests/pipelines/boogu/ and tests/models/transformers/test_models_transformer_boogu.py passes.
Notes for reviewers
- … img_to_q/instruct_to_q/img_out/ …) are stored in the already-published checkpoints under …img_instruct_attn.processor.*. Moving them onto BooguImageAttention to make that processor fully stateless would rename those state-dict keys and require re-saving the released Hub checkpoints. To keep existing downloads loading, those projections stay on the processor module for now; happy to do the key migration + re-publish in a follow-up if you prefer the fully-stateless form.
- … src/diffusers/guiders/ (and route advanced features through modular diffusers) in a follow-up PR, keeping this one focused on the standard pipeline.
- … CacheMixin / FirstBlockCache (and the WIP in Implement TeaCache #12652).
- _keep_in_fp32_modules is intentionally not set: forcing time_caption_embed to fp32 changes bf16 inference numerics, so we left it to the model author / reviewer's call.
- … transformers 5.10.x (the env var alone does not disable DeepGEMM there).
Required Hub-side checkpoint changes (no code change in this PR)
1. model_index.json: point component classes at diffusers, not the bundled remote-code modules.
   On Hub main it currently is:
   Both must become library refs so from_pretrained resolves the in-tree classes added by this PR:
2. Delete the two remote-code shim modules (they are thin re-export stubs that import boogu … from the private research package and raise ModuleNotFoundError: boogu for any external user):
   - transformer/transformer_boogu.py
   - scheduler/scheduling_flow_match_euler_discrete_time_shifting.py
3. scheduler/scheduler_config.json: drop the legacy custom-scheduler keys. Hub main has:
   { "_class_name": "FlowMatchEulerDiscreteScheduler", "_diffusers_version": "0.33.1", "do_shift": true, "dynamic_time_shift": false, "time_shift_version": "v1", "seq_len": 4096, "num_train_timesteps": 1000 }
   Remove do_shift, dynamic_time_shift, time_shift_version (these drove the old custom scheduler's time-shift; the official scheduler ignores them). Keep seq_len: the pipeline's time-shift adapter reads scheduler.config.seq_len to compute the shift mu, so it is load-bearing. The final config keeps _class_name / _diffusers_version / seq_len / num_train_timesteps.
4. transformer/config.json: remove prompt_tuning_configs (the prompt-tuning subsystem was dropped from this integration; the key is otherwise ignored with a warning).
The same four edits apply to all six repos (Base, Edit, Turbo and their -fp8 variants). The mllm/, processor/, vae/ and weight files are unchanged.
Fixes # (issue)
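The scheduler_config.json cleanup from step 3 of the Hub-side changes can be sketched as a small script. The JSON is the Hub main config quoted in that step; the helper name `strip_legacy_scheduler_keys` is hypothetical, and writing the result back to each Hub repo is left out.

```python
import json

# Keys that drove the old custom scheduler and are ignored by the official one.
LEGACY_KEYS = ("do_shift", "dynamic_time_shift", "time_shift_version")


def strip_legacy_scheduler_keys(config):
    """Return a copy of the scheduler config without the legacy time-shift keys."""
    return {key: value for key, value in config.items() if key not in LEGACY_KEYS}


hub_config = json.loads("""
{
  "_class_name": "FlowMatchEulerDiscreteScheduler",
  "_diffusers_version": "0.33.1",
  "do_shift": true,
  "dynamic_time_shift": false,
  "time_shift_version": "v1",
  "seq_len": 4096,
  "num_train_timesteps": 1000
}
""")

# seq_len is kept because the pipeline's time-shift adapter reads it.
print(json.dumps(strip_legacy_scheduler_keys(hub_config), indent=2))
```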
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.