Skip to content

feat(llama-cpp-localai-paged): paged KV cache llama.cpp backend + cross-request prefix sharing + GB10 decode optimization [WIP]#10462

Open
localai-bot wants to merge 321 commits into
masterfrom
worktree-feat+paged-attention
Open

feat(llama-cpp-localai-paged): paged KV cache llama.cpp backend + cross-request prefix sharing + GB10 decode optimization [WIP]#10462
localai-bot wants to merge 321 commits into
masterfrom
worktree-feat+paged-attention

Conversation

@localai-bot

Copy link
Copy Markdown
Collaborator

Status: draft / WIP - opened to track ongoing GB10 enterprise-serving work. Large branch (kernel experiments + analysis + the shippable feature); will be curated before any merge.

What this is

Vendored, opt-in paged KV cache + cross-request prefix sharing for the llama.cpp backend, plus GB10 (consumer Blackwell, sm_121) decode-path optimization and the supporting analysis. All paged behaviour is gated by LLAMA_KV_PAGED (env) / the kv_paged server option and is off by default - stock builds are byte-identical.

Shippable feature pieces

  • backend/cpp/llama-cpp/patches/paged/0001-0011 - vendored llama.cpp patch series, applied behind the LLAMA_PAGED build flag (patches/paged/, default on; LLAMA_PAGED=off gives a clean upstream checkout). Isolated in prepare.sh + Makefile with a sentinel guard against double-apply.
  • grpc-server.cpp - kv_paged per-server option (0005) + cross-request prefix share wired into update_slots (0008).
  • core/backend/hardware_defaults.go, pkg/xsysinfo/gpu.go - hardware-aware default consolidation.

Key results (measured on DGX Spark / GB10, Qwen3-32B NVFP4)

  • Prefix sharing (RAG / system-prompt fan-out): the cross-request cache reaches the server - concurrent shared-prefix requests skip recompute, 15-25x burst-wall prefill reduction (K=16/32), reuse confirmed (ref_cnt=K, suffix-only prefill).
  • Decode kernel: an in-kernel block-table read (0009) deletes the per-step gather regression; routing the GQA-grouped tile kernel by default (0011) brings paged decode to stock parity (within 1.8%), growing to -6.1% at 16k ctx. Token-correct (CPU byte-identical, GPU within the CUDA batch-shape non-determinism band).
  • Honest framing: the earlier "6x decode gap vs vLLM" was a measurement artifact (the pre-0009 gather regression + a warmup/server number). The real steady-state decode is GPU-bound at the LPDDR5x bandwidth wall (~50% irreducible weight-read floor), ~1.2x from vLLM at the kernel level. Remaining throughput headroom is at the scheduler/serving layer, not the kernel.

Analysis docs live under backend/cpp/llama-cpp/patches/paged/*.md and backend/cpp/llama-cpp/paged/*.md.

Next

  • Scheduler/serving lever (continuous batching + chunked prefill + paged-KV capacity) for aggregate throughput.
  • Correctness hardening (mask-pad invariant assert) + CUDA-graphs confirmation (graphs are already ON in serving).

Not for merge as-is

This branch also contains banked W4A16/Marlin kernel experiments and NVFP4/MXFP4 quality analysis that informed the direction but are not part of the feature. Those will be dropped/split before merge.

mudler added 28 commits June 23, 2026 11:25
… QoS budget)

Surface patch 0013's decoupled per-step prefill-token budget as a per-model
grpc-server option, mirroring the existing kv_paged option. When
max_prefill_tokens (aliases: mpt, prefill_budget) is set to a positive integer,
params_parse setenv's LLAMA_PREFILL_BUDGET before context creation so the
vendored update_slots() scheduler latches it; unset or non-positive leaves the
env untouched, preserving stock unbounded-prefill behaviour (an externally
exported LLAMA_PREFILL_BUDGET still works as an escape hatch).

This bounds the head-of-line decode stall a large prompt inflicts on the
in-flight decoders co-batched with it, with no steady-state throughput cost.

Verified on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8-slot
continuous batching, one ~6k-token prefill injected mid-stream; same binary,
only the budget differs:

  budget      worst decode gap   prefill wall
  unset           2.462 s          6.672 s
  512             0.669 s (3.7x)   7.516 s
  256             0.398 s (6.2x)   8.854 s

Monotonic: a smaller budget cuts the decode stall further at a modest TTFT
cost, the classic chunked-prefill trade-off. grpc-server.cpp compiles cleanly
against the paged build tree.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Same-day steady-state aggregate-decode sweep at npl 8/32/64/128 for three
model classes, replacing the stale ~75-80%-of-vLLM carried figure with a
full concurrency curve.

Findings:
- Dense 32B (NVFP4 vs NVFP4A16): parity at batch-8 (97%), 72-86% mid/high.
- Small 0.6B: parity at batch-8 (99%), 49-67% at high concurrency
  (llama plateaus ~2.0k, vLLM scales to 4.2k; runtime/scheduler-bound).
- MoE 30B-A3B: llama-only at 290-1041 tok/s. vLLM cannot serve it on GB10
  (bf16 hangs at MoE warmup and reboots the box, twice; mxfp4 GGUF expert
  tensors unmappable by vLLM 0.23.0).

Batch-8 anomaly resolved: clean isolated dense batch-8 decode is ~88-90
tok/s (~89 ms/step) across paged-vs-stock (within 2%, paged slightly
faster) and ctx 65536-vs-163840 (within 1%). The prior 471 ms/step was a
mixed-load decode/prefill contention artifact, not paged overhead, ctx
allocation, or NVFP4 cost - the case patch 0013 LLAMA_PREFILL_BUDGET bounds.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Build-ready plan (not implemented) for matching/beating vLLM MoE
grouped-GEMM efficiency on GB10 sm_121 for Qwen3-30B-A3B mxfp4.

Honest reframe: the grouped GEMM the mission scoped to build already
exists upstream and runs on GB10 for mxfp4 - should_use_mmq() routes
MUL_MAT_ID to the grouped mmq path, which already contains both vLLM
building blocks (mm_ids_helper moe_align/scatter + a persistent stream-k
FP4-MMA grouped GEMM). The npl128 cliff was a since-fixed regression, not
a batched-bench artifact; re-measured decode is monotonic 85->1771 t/s.

The one structural gap is M-tile sizing: ggml maximizes mmq_x over the
aggregate token count while vLLM uses a small per-expert BLOCK_SIZE_M, so
each tiny per-expert M-tile is 3-6% filled at decode density. Scope is a
surgical two-step delta (expert-aware mmq_x selection; block-padded
moe_align), the parity gate (test_mul_mat_id bit-exact + ragged small-M),
and a phased plan gated behind the GB10 W4A16 occupancy wall.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror of the dev-tree engine patch (ggml mmq.cuh) into the paged patch set,
plus its measurement writeup. Adds LLAMA_MOE_MMQ_X, an opt-in env cap on the MoE
grouped-GEMM token-tile (mmq_x) for the MUL_MAT_ID path; default-off =
byte-identical to stock.

Honest result of the MoE near-term lever: the npl128 decode cliff does NOT exist
on current HEAD (stock decode is monotonic 85/282/629/935/1295/1779 t/s at npl
1/8/32/64/128/256; the old cliff was fixed upstream by the sorted grouped
FP4-MMA GEMM + MoE stream-k). The cap is therefore not a cliff fix but a modest
high-batch decode micro-optimization: cap64 gives +4.8% decode at npl128 and
+2.3% at npl256 (reproducible, neutral at npl<=64) for a ~1.3% prefill cost;
cap16/cap32 are net-negative (prefill -41% / -17%). Full tables in
MOE_TOKEN_TILE_CAP.md; durable density-aware follow-up in
MOE_GROUPED_GEMM_SCOPE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…015)

Mirror of llama-paged-dev commit 151343b into the pinned paged patch series.
The durable, default-on follow-up to patch 0014's opt-in LLAMA_MOE_MMQ_X global
cap: a host-side density-aware mmq_x auto-select in mul_mat_q_case that caps the
MUL_MAT_ID grouped FP4-MMA token-tile only at low per-expert density (decode) and
keeps the 128 tile at high density (prefill), so it is prefill-safe by construction
(removes 0014's ~1.3% prefill cost). No new kernel.

density_max default = 8 (not tile/4 = 16): 16 equals the 256-expert prefill-ubatch
density and regressed S_PP ~2% on Qwen3.6-35B-A3B NVFP4; 8 sits between decode and
prefill density for n_experts in [128,511] at n_ubatch=512.

Honest result on the mission's MoE target (Qwen3.6-35B-A3B NVFP4, 256 experts +
GDN/SSM linear attention, GB10 sm_121, median of 5 reps): NEUTRAL. Decode S_TG is
within run-to-run noise (npl128 +0.36%) and prefill S_PP neutral (within +/-0.7%).
This model is bound by the SSM recurrence and 256-tiny-expert weight bandwidth, not
the MoE col-tile occupancy, so the col-tile lever has nothing to bite on; a npl128
tile sweep confirms 64 is the only useful width (TILE8 -6.3% ... TILE96 -0.8%). The
lever's real win lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128 per
patch 0014), which the auto-select reproduces at npl128 by construction at zero
prefill cost. Shipped default-on because it is prefill-safe, decode-neutral here,
and correctness-gated.

LLAMA_MOE_MMQ_X (0014) kept as a manual override; LLAMA_MOE_AUTO_TILE=0 restores
exact stock selection. P0 gate: test-backend-ops test_mul_mat_id ragged small-M
NVFP4/MXFP4 MoE decode-density shapes pass CUDA-vs-CPU on GB10 both default-on and
stock. Full rationale and tables in patches/paged/MOE_DENSITY_AUTO_TILE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
First crash-resilient slab of the apples-to-apples NVFP4-vs-NVFP4
llama.cpp-vs-vLLM benchmark on GB10. MoE Qwen3.6-35B-A3B paged
llama.cpp (patch 0015) decode/prefill/TTFT/VRAM at npl 8/32/64/128.
vLLM and dense tables append as the sweeps land.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… dense + MoE)

Full 4-way sweep (npl 8/32/64/128): dense Qwen3.6-27B (clean W4A4) + MoE
Qwen3.6-35B-A3B (vLLM Marlin NvFp4). Parity at npl8; vLLM scales ~2.8-2.9x ahead
on decode at npl128. llama TTFT explodes at high concurrency - run WITHOUT
max_prefill_tokens (0013), the prefill-starvation also drags decode_agg; fair
re-run with the QoS budget pending. llama wins on on-demand memory (paged).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…weep

Re-run the dense Qwen3.6-27B NVFP4 vs vLLM A/B with patch 0013's QoS
prefill budget enabled (LLAMA_PREFILL_BUDGET swept over 256/512/1024),
fixing the prior run that left prefill unbounded and let high-concurrency
prefills starve each other.

At the saturated npl128 point budget=256 is the best lever: decode_agg
134.6 -> 161.2 tok/s (+19.8%) and TTFT 491.2 s -> 305.4 s (-37.8%) vs the
starved stock run, moving llama from 34.5% to 41.3% of vLLM decode. Larger
budgets help less; at light/moderate concurrency the budget is net-negative
for TTFT because this all-at-once workload has no in-flight decode to protect
at t=0. Documented honestly: a real but narrow high-concurrency lever, not a
gap-closer (vLLM still ~2.4x decode / ~12x lower TTFT at npl128).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…dget

Budget 256/512 sweep on the A3B MoE under patch 0013. Mirror image of the
dense case: stock MoE was never prefill-starved (3B active, TTFT 84.8s @npl128),
so the budget is a decode-throughput lever paid for in TTFT, not a TTFT fix.
Budget 256 lifts decode_agg +14% (292->333.5 tok/s) and restores monotonic
decode scaling (kills the stock +7.4% plateau, now +20% into npl128), moving
llama 36.0%->41.1% of vLLM decode. Gap not closed: vLLM still ~2.4x decode and
~12x lower TTFT @npl128.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…recard

Phase 3 synthesis of the max_prefill_tokens (patch 0013) fair re-run:
how much of the gap was prefill starvation, the genuine remaining gap
to vLLM, and where par-or-beat stands per concurrency/model.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…a-server

Build-ready plan (not implemented) for a vLLM-v1-style token-granular
continuous-batch scheduler in tools/server/server-context.cpp update_slots(),
the last lever after patch 0013 on the GB10 NVFP4 llama-vs-vLLM gap.

Key findings that shape the scope:
- The unified mixed batch already exists: Phase 1 (2604-2719) claims every
  ready decode token unconditionally, Phase 2 (2753-3330) fills prefill into
  the same llama_batch. Decode-first is structural, not a thing to build.
- The chunked-prefill slot state already persists across steps (a
  PROCESSING_PROMPT slot with prompt.n_tokens() < task->n_tokens() resumes).
  No slot-state rewrite is needed - the feared big risk does not materialize.
- The only missing piece is the budget POLICY: convert 0013's static per-step
  prefill cap into a dynamic, decode-first, per-slot-fair token budget (one
  total T, decode claims D, prefill gets leftover T-D, capped per slot).
- Honest ceiling: the residual ~2.4x decode gap is a decode-KERNEL batch
  scaling ceiling (~157-161 dense / ~333 MoE @npl128), NOT a scheduler defect.
  The scheduler closes the 12x TTFT gap and holds that ceiling tuning-free;
  the throughput residual is a separate, named decode-kernel lever (P3).

Phased P0-P3 with per-phase payoff, files, risks, and GB10 considerations.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Append a source-verified Review / risk section to
CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. Verdict: scope is sound, GO on P0 ->
P1, conditional P2, separate-track P3.

Key checks against HEAD 151343b:
- Tractability: zero libllama changes. The mixed per-seq prefill+decode
  ubatch is the existing shipping path (common_batch_add per-token pos/seq,
  init_batch split, paged_alloc is hooks on the same llama_kv_cache class,
  not a new class). The new scheduler changes only the prefill token count,
  never the batch structure.
- The real serving config is kv_unified=false (-> n_stream=n_seq_max=128),
  so the split path is split_equal(sequential=true), not the contiguous
  split_simple the pseudocode implies. Fold into P0 ubatch-shape and
  determinism analysis; lock the split path in the A/B.
- CUDA graphs ruled out: both NVFP4 H2H vLLM servers ran --enforce-eager
  (cudagraph_mode=NONE), so the npl128 2.4x decode gap is genuine
  eager-kernel + per-step host overhead. Scheduler cannot close it; the
  157/333 ceiling stands.
- TTFT root quantified: prefill_tps collapses with concurrency for llama
  (dense 1117->125) while vLLM holds flat ~1420. The dynamic T-D budget
  attacks this directly and can sustain prefill_tps >= vLLM during the
  drain, so burst-TTFT parity is mechanically plausible, but it couples to
  a decode-ITL knob (T) that MUST be co-reported with TTFT.

Two calibration fixes required before P1: co-report drain-phase decode-ITL
with TTFT (stop charging/selling the steady-state decode_agg number), and
acknowledge the split_equal/n_stream=128 path. Neither changes the go
decision. P1 is the minimal high-ROI step (handful of line edits at named
seams); gate P2 on P1 metrics; P3 (kernel/CUDA-graph) owns the 2.4x
residual independent of the scheduler.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Decompose vLLM's enforce_eager decode step (attention / weight GEMM /
sampling / host loop) on GB10 (DGX Spark, sm_121) and attribute the
measured ~2.4x NVFP4 decode-throughput gap to its parts, from source
reading plus the existing nsys decode trace and H2H bench logs.

Key finding: the gap is dominantly a KERNEL-efficiency gap (~80-90%),
not a host-overhead gap. llama's GPU is already ~94.6% busy during
steady decode, so a CUDA-graphed decode is a minority lever (~10-20%
of the gap, bounded by the GPU-idle bubble), not the silver bullet.
vLLM's wins: in-kernel paged-decode read (no gather tax), faster
long-context attention, fused native-FP4 / grouped-Marlin GEMM, and
O(1)-in-ctx GDN linear-attention layers on these Qwen3.6 hybrids.
vLLM achieved 2.4x with synchronous scheduling and no CUDA graphs.

Evidence: vllm 0.23.0 source (gpu_model_runner, flash_attn/gdn
backends, modelopt/marlin GEMM, v1/sample), reproduced nsys kernel
categorization (cat2.py), and QWEN36_NVFP4_BENCH / DECODE_GAP_STUDY /
CONTINUOUS_BATCH_SCHEDULER_SCOPE.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…6, continuous-batch P1)

Mirror the P1 engine change of CONTINUOUS_BATCH_SCHEDULER_SCOPE.md into the
vendored paged patch series and surface it as a LocalAI model option.

- patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch:
  supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
  decode-first token budget in update_slots(). At the budget seam (already
  after Phase 1's decode fill, so batch.n_tokens == D is known) compute
  T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch),
  prefill_budget_step = max(n_ubatch, T - D), and a per-slot prompt-chunk
  cap prefill_cap_per_slot; bound the Phase-2 prompt-fill loop and outer
  admission break by these instead of 0013's constant. Policy-only change,
  no new slot states, no batch-formation rewrite, zero libllama changes.
  Decode is structurally claimed first (Phase 1) so the decode-first
  guarantee is free. As decode load D rises the leftover auto-shrinks, so
  the budget self-tunes across npl 8..128 and dense vs MoE and holds the
  GB10 decode ceiling tuning-free (vs 0013's hand-picked 256). The legacy
  LLAMA_PREFILL_BUDGET path is preserved (honoured only when the dynamic
  knob is unset), so 0013 is cleanly subsumed. DEFAULT-OFF byte-identical:
  all-knobs-unset and the degenerate T == n_batch case are bit-identical to
  stock by construction (the n_batch hard ceiling is kept and the dynamic
  bounds reach it at the same point for every D). Orthogonal to
  LLAMA_KV_PAGED.

- grpc-server.cpp: wire the new knob as model options max_batch_tokens / mbt
  (-> LLAMA_MAX_BATCH_TOKENS) and prefill_cap (-> LLAMA_PREFILL_CAP), beside
  the existing max_prefill_tokens / mpt seam; default-off, takes precedence
  over the legacy static budget when set.

- patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md: design, the byte-identical
  determinism analysis (verified by construction), the local patch-apply
  verification, and the gate + A/B bench methodology.

Validation status: the patch applies cleanly on top of LLAMA_VERSION
(f3e1828) + paged 0001-0015, and the off-path / T==n_batch determinism is
proven by construction. The GB10 sm_121 build, the four runtime gates, and
the dense+MoE A/B sweep are PENDING a DGX run (the dev box was unreachable
this session) and are documented as such in P1_DYNAMIC_BUDGET_RESULTS.md; do
not sell the quantitative TTFT payoff until that re-run lands.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The prior all-at-once BURST H2H is adversarial to any prefill budget (TTFT is
prefill-rate-bound, a cap only slows the drain) and showed 0016 ~= 0013. Run a
STAGGERED-arrival benchmark on the GB10 DGX (patch 0016 built @253cbae): a
steady-rate client that keeps a mix of in-flight decoders + newly-arriving
prefills, capturing per-request TTFT and the full inter-token-latency series.

Append the metrics (in-flight decode protection + new-request TTFT, per arm) and
an honest verdict to P1_DYNAMIC_BUDGET_RESULTS.md. On staggered traffic stock's
in-flight decoders freeze multi-second on every prefill admission while both
budget arms keep ITL flat; 0016 (mbt512) sits at a strictly better point on the
protection/TTFT frontier than 0013-256 (equal spike-free protection, materially
lower TTFT/throughput/wall) and adds a decode-adaptive single-T knob. It does not
strictly dominate stock (Pareto tradeoff: smoothness vs raw TTFT). Verdict: 0016
earns its keep over 0013 on staggered traffic; recommend LLAMA_MAX_BATCH_TOKENS=512.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…4x lever

Closes lever 5 of VLLM_DECODE_GROUNDING.md. GGUF metadata + source reading on
the paged dev tree plus nsys decode traces on Qwen3.6-27B NVFP4 (GB10 sm_121)
confirm the Gated-Delta-Net linear-attention layers decode as a fused single
CUDA kernel (gated_delta_net.cu) updating a fixed-size cached recurrent state:
no context-length parameter, no KV re-scan. Matched-batch context-scaling
control (npl4, pure decode) shows the GDN kernel flat (10.3 -> 8.0 us/launch)
across 4x context while full-attention grows 3.1x (27 -> 85 us). GDN is a small,
context-flat share (~0.4-10%% by batch); the FP4 weight GEMM dominates (~67%).
Verdict: GDN decode is efficient, not the cheap model-specific fix; the 2.4x is
the general GEMM + full-attention kernel work, as the grounding concluded.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…/no-go

Roofline at the decode batch shape (M=128, NVFP4 weights) on GB10 (sm_121):
the dense weight-read floor (~1,940 tok/s) and MoE floor (~1,590 tok/s) sit
4-6x above vLLM's 391/811, so 273 GB/s is NOT the wall. At FP4 peak the GEMM is
bandwidth-bound (crossover M*~611 >> 128); at the kernel's ~3% achieved FP4
efficiency it is compute-bound by its own inefficiency (471 ms vs a 66 ms floor).

Verdict: dense decode parity is plausibly reachable via a tuned FP4-MMA decode
M-tile (track B) + fused act-quant (track A), landing 376-394 tok/s = 90-103% of
vLLM 391, but only at the top of the demonstrated GB10 FP4 envelope (~17-21%) and
with no margin (occupancy wall is the binding constraint, not bandwidth). MoE
parity is NOT reachable from the GEMM alone (ceiling ~60-76% of 811): its floor
is the hardest grouped-GEMM regime and ~24% of its step is non-GEMM work outside
track B. GO (conditional) for dense, PARTIAL for MoE. Build-ready phased plan
included; tune the existing block_fp4_mmq path, not a W4A16 rewrite.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…cies

Add the source-read kernel-mechanism map (no cp.async weight pipeline,
mmq_x tile-maximizing selector vs GB10 occupancy, MoE per-expert M-tile
waste, iter_k=512 coupling, ruled-out non-levers) and strip the stray
trailing tags from the prior write.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…er-phase decode_agg

Rewrite the track-B scope into the definitive build-ready plan for the
NVFP4 FP4-MMA decode GEMM toward vLLM GB10 parity. Source-read of the
mmq.cuh/mma.cuh/quantize.cu FP4 path on the dgx paged dev tree settles two
load-bearing facts the prior draft got partly wrong:

  - llama's dense path is already TRUE W4A4 (block_fp4_mmq packs 256 e2m1
    values + ue4m3 scales; the MMA is kind::mxf4nvf4 e2m1.e2m1...ue4m3), so
    there is no activation-bit-width work to do; the whole dense deficit is
    scheduling/occupancy.
  - the mmq_x selector minimizes ntiles_x, which PINS dense decode at
    mmq_x=128 (weights read once). Shrinking mmq_x re-reads the 18 GB
    weights, so the dense occupancy lever is mmq_y-down (BW-neutral), NOT
    mmq_x-down; MoE's free lever is the per-expert mmq_x-down (patch 0015).

Adds the explicit kernel-approach decision (tune the existing FP4-MMA
mul_mat_q; reject the cutlass-SM120 rewrite, dead on GB10 and broken on
sm_121; reject the BF16-Marlin descent), the concrete build-ready changes
(mmq_y/granularity/stream-k knobs, FP4-MMA fragment invariants, the
ue4m3 scale path, and the block_fp4_mmq y-tile ABI contract for the
track-A act-quant fusion handoff), the GB10-fit rules, the bit-exact
test-backend-ops gate with decode-shape + ragged-M cases, and per-phase
expected decode_agg tables.

Verdict (honest, roofline-grounded): the decode GEMM is bandwidth-bound on
the hardware roofline (M=128 << crossover 611; weight-read floors 4-6x
above vLLM) but compute-bound in practice at ~3% FP4 eff, so 273 GB/s is
not the wall. DENSE: GO (conditional) - B+A reaches 376-394 tok/s =
90-103% of vLLM 391, gated by a P2 occupancy kill-gate (<15% FP4 eff ->
parity off). MoE: PARTIAL/NO-GO - ceiling ~76% of 811 (618) from the GEMM
alone; full MoE parity needs the non-GEMM tracks too.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Append section 9 (skeptical staff-CUDA-engineer review) to FP4_GEMM_SCOPE_B.md,
stress-testing the dense/MoE parity verdict against the committed grounding.

Key findings:
- Not the W4A16 wall: the npl-sweep (dense 99/56/46/41% of vLLM at npl 8/32/64/128)
  shows llama's FP4-MMA kernel HITS the weight-read floor at M=8 and FALLS OFF it as
  M grows, while vLLM HOLDS it. Working-path tune, dual existence proof (M=8 + vLLM
  M=128), not a greenfield build. Same binding constraint as W4A16 though (hide
  LPDDR5x latency at the larger tile on an occupancy-dominated part).
- The dense gap is ~82-87% GEMM, ~13-18% non-GEMM (467 ms total = 383-405 GEMM +
  62-84 non-GEMM). B alone caps ~80%; track A is what tips dense over the parity line.
- Sharpest omission: vLLM's M=128 floor is reached via cutlass TMA + deep pipeline -
  the technique the doc forbids on GB10. TMA != manual cp.async (lower occupancy cost);
  it must be an in-scope P2 fallback, not categorically banned.
- Honest landing: dense ~80-90% (parity the optimistic tail, contingent on B+A+floor),
  MoE ~55-65% (parity not reachable from B). Low-regret: even a tripped P2 kill-gate
  lands B+A ~89%, doubling today's 41%.
- Sequencing fix: land A first (defines B's interface + baseline + kill-gate), then
  run B's P2 against the post-A number.

Verdict: DENSE conditional GO (scope as GEMM-gap-closing, not true parity; A-first,
gate at P2, add TMA); MoE NO-GO for parity from B (do the cheap mmq_x-down win as a
1.7-1.85x, not parity).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…sults (patch 0017)

Mirror of llama.cpp dev-tree commit 089f78d. Track B P0 (bit-exact NVFP4 dense decode-shape
MUL_MAT parity gate) + P1 (default-off occupancy levers) for the GB10 dense FP4 weight GEMM.

P1 kill-gate TRIPPED: the cheap host/occupancy levers do not lift decode_agg on GB10 (sm_121).
DENSE q36-27b-nvfp4 @npl128 149.5 -> minblocks2 147.9 (-1.1%) -> dense mmq_x=64 144.3 (-3.5%);
MoE q36-35b-a3b mmq_x-down regresses (TILE16 -3.7%, TILE8 -5.9%, reproduces patch 0015). nsys:
the FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s->3.025s (+8.7% slower) under register-capping
(spilling). The dense M=128 tile is already weight-read/one-read-optimal; the only untested lever
is the structural mmq_y-down (nwarps=4 warp-remap, blocked by nwarps*tile_C::I==mmq_y), deferred
to P2. All levers default-off => default build byte-identical to stock. See THROUGHPUT_B_P1_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Phase 1 measures the CUDA-graph lever on the paged decode (q36-27b-nvfp4
dense, GB10 sm_121, fusion off). The 4-cell decode_agg {stock,paged} x
{graphs on,off} is flat within ~1%: the graphs-on win is +0.13% at npl128
and +1.1% at npl32 (both within run noise). The default paged decode is not
eager: it captures and replays graphs with a 256-token reset cadence
identical to stock non-paged (block-table ne0 = GGML_PAD(n_gather,256) only
steps at 256-token boundaries); only the gather fallback grows n_gather every
step and runs pure eager. 'graphs reused=0' was a uid fast-path false negative
(llama rebuilds the cgraph each step, so the reuse log never fires while the
graph still replays via the instance path).

nsys (reliable eager trace, plus the captured trace re-run with
--cuda-graph-trace=node to defeat nsys omitting graph-internal kernels, an
artifact that otherwise reads 0.3% busy) shows the steady decode is 99.4-99.5%
GPU-busy. Idle is ~0.6% of the step: 0.37% within-step launch gaps (the only
thing graphs remove, cut to 0.11% when captured) plus a 0.24% between-step
host gap (~2ms per step). Throughput is identical on/off.

Verdict: CUDA-graphing the paged decode is not a throughput lever; the decode
is GPU-compute-bound and the 2.6x gap to vLLM (148 vs 391) is in the per-step
GPU kernel work (FP4 GEMM + attention at batch 128), not launch overhead or
the host loop.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…et SSM path)

Phase 1 ruled out CUDA graphs as the paged-decode lever (GPU 99.4% busy,
decode_agg flat graphs on-vs-off) and attributed the 2.6x gap to vLLM to the
per-step GPU kernel work (FP4 GEMM + attention at batch 128). Phase 2 decomposed
that kernel work directly on the Phase-1 nsys reps and corrects the attribution.

Findings (q36-27b-nvfp4 = gguf arch qwen35, a 48:16 hybrid gated-DeltaNet
linear-attention + full-attention model; DGX GB10 sm_121, fusion off):
- Graphs re-confirmed not the lever: fresh paged graphs-ON 146.03 vs OFF 144.90
  t/s (+0.78%, noise); the captured rep is 99.5% busy with the same ~3267ms
  memcpy (graphs capture memcpy nodes too).
- The 99.4% busy is real but ~19% of it is D2D memcpy, not compute: an
  overlap-correct interval-union sweep gives kernels-only 80.2% busy, the gap
  filled by 1584 D2D copies/run (~80/step, ~230MB each = the gated-DeltaNet
  recurrent state). Phase 1's cuda_gpu_trace lumped this into compute.
- Decode GPU-time decomposition (% of kernel+memcpy busy): gated_delta_net 23.4%,
  get_rows 21.9%, D2D state copy 18.9%, FP4 GEMV 15.5%, FP4 GEMM 10.4%,
  full attention 0.4%. Grouped: SSM/gated-DeltaNet machinery ~67%, FP4 matmul
  ~28%, full attention (all paged-attn optimizes) ~0.4%.

Verdict: not graphs, not the host loop, not primarily FP4 GEMM, not attention.
Paged attention touches ~0.4% of decode on this model, so no paged/graph/
block-table change can move decode_agg. The lever is the ggml qwen35
gated-DeltaNet decode: kill the per-layer recurrent-state D2D copy and fuse the
get_rows gather into the recurrence (vLLM's fused_recurrent_gated_delta_rule
keeps state in place). Ceiling: -copy ~146->180; -copy-and-gather ~146->247 t/s.

No code patch (the lever is an SSM-path rewrite, orthogonal to paged attention);
patches/paged/0018 stays free.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Append the four-point synthesis to A2_CUDAGRAPH_DECODE.md: measured
CUDA-graph lever size (<1%, not the guessed 10-20%), the corrected
'eager' premise (default paged decode already captures), the unchanged
37-38% of vLLM at npl128, and the honest verdict that A.2 closes none of
the 2.6x gap because paged attention touches ~0.4% of decode on this
hybrid-SSM model. Residual lever is the qwen35 gated-DeltaNet SSM path
(state D2D copy + get_rows gather), orthogonal to paged attention.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ch 0018)

Mirror of the llama-paged-dev patch 0018 engine change plus the measured
results. Per SSM layer per step decode no longer D2D-copies the full ~225 MB
recurrent state into the cache: the fused gated_delta_net op writes the final
state in place at the active sequences cache slot (new
ggml_gated_delta_net_inplace, src[6] = state_dst), mirroring vLLM
fused_recurrent_gated_delta_rule. SSM math unchanged (bit-identical greedy).

Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on):
  q36-27b-nvfp4 dense: npl32 113.74 -> 136.39 (+19.9 percent),
    npl128 146.23 -> 180.53 (+23.5 percent, = predicted copy-removal ceiling).
  q36-35b-a3b-nvfp4 MoE: npl128 313.36 -> 372.62 (+18.9 percent).
nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
(391 @128) 37.4 -> 46.2 percent. See SSM_DECODE_FIX_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror of the llama-paged-dev patch 0019 engine change plus the measured
results. Step 2 of the SSM decode work: after Step 1 (in-place state write-back,
patch 0018) the largest non-GEMM decode bucket was the recurrent-state get_rows
gather (18.8 percent of decode GPU time). This removes that materialization,
mirroring ggml_ssm_scan's ids source: ggml_gated_delta_net_inplace_ids reads each
sequence's prior state directly from cache[ids[seq]] (src[5] = full cache,
src[7] = ids), so combined with Step 1's in-place write the op reads AND writes
the cache directly with no state materialization at all.

Race-free by construction: identity sequences (ids[seq] == rs_head + seq, the
whole AR decode path) read s0 in place from the destination slot; non-identity
sequences (reorder / rs_zero, e.g. multi-new-seq prefill) read from a disjoint
scratch a small gather kernel populates first. ids stays a device pointer.
Bit-identical to the get_rows path. Gated to qwen35 + qwen35moe; qwen3next,
kimi-linear, the non-fused and rollback paths are unchanged.

Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off):
  q36-27b-nvfp4 dense: npl32 137.64 -> 170.68 (+24.0 percent),
    npl128 186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391).
  q36-35b-a3b-nvfp4 MoE: npl32 299.68 -> 366.69 (+22.4 percent),
    npl128 409.30 -> 553.63 (+35.3 percent).
Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build
(dense + MoE). nsys k_get_rows_float bucket 18.8 -> 0.7 percent. The residual
decode gap to vLLM is now the FP4 GEMM (~48 percent of decode). See
SSM_DECODE_FIX_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…osition

Fresh post-SSM nsys of llama (build-cuda-base, patch 0019) AND vLLM 0.23.0 at
npl128 decode. Reproduces the 391 reference (vLLM 394 t/s eager / 420 graphs,
graphs +6% only) and confirms llama 245 t/s. Both ~98% GPU-busy; the gap is
GPU kernel-time, not idle/host/graphs. GDN compute comparable (llama 4.03 vs
vLLM 3.62 ms/call, +11%). bytes/step: llama not higher (131 vs 85 MB memcpy;
SSM-fix 18GB/step DtoD removal confirmed in-trace). Single biggest llama-specific
overage = FP4 matmul path 236 vs 117 ms/step (+119 ms = 64% of the gap),
dominated by mul_mat_vec_q (FP4 GEMV at batch 128, 132 ms/step, 26%, one per
GDN layer). Track B optimized the wrong FP4 kernel (mul_mat_q, not the GEMV).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ever

Cross-check the adversarial validation against the profiler ground-truth and
finalize DECODE_PARITY_EXPLORE.md. The post-SSM 254->391 decode gap is one
llama-specific defect: the gated-DeltaNet output projection (ssm_out) runs as
an FP4 GEMV (mul_mat_vec_q, 132 ms/step = 26% of decode) at batch 128 instead
of a tensor-core MMQ GEMM. Mechanism confirmed at source: final_output is 3D
[6144,1,n_seqs] so src1->ne[1]=1 trips the MMVQ dispatch (<=8), with the 128
sequences in ne[2]. vLLM packs the same projection into a cutlass M=128 GEMM.

GDN recurrence is only +11%/call (not the lever); P2a optimized the wrong FP4
kernel (the 17% MMQ, not the 26% MMVQ); CUDA graphs, host loop, and DRAM bytes
are all ruled out. Decode parity is reachable in software (not a hardware
floor): identical bytes/floor, vLLM hits 62% util vs llama 40% on the same
GB10. Highest-value next step (~free, bit-exact): collapse final_output to 2D
before ssm_out so M=128 routes to MMQ. Ranked levers + cumulative ceilings
toward 391 documented.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@richiejp

Copy link
Copy Markdown
Collaborator

It seems like the patches are being made apply onto a dirty tree. I put a fix here: richiejp@1060414

Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
both engines pinned the largest llama-specific overage to the gated-DeltaNet
output projection (ssm_out).

The GDN op left its output in SSM layout and the graph reshaped it to 3D
[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
ssm_out weight read across the 128 sequences. vLLM packs the same projection into
one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.

The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the
redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs
3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe /
qwen3next); other archs untouched.

Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
test-backend-ops MUL_MAT and MUL_MAT_ID OK.

decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
  dense q36-27b:     170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
  MoE   q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).

nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) at a LOWER
per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
vs 2.77 ms/call for the old GEMV.

Mirrors DGX dev-tree commit df1cc97.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler added 30 commits July 1, 2026 12:42
Assisted-by: Codex:gpt-5
Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion.

Assisted-by: Codex:gpt-5
Record the phase 110-140 GDN/MoE campaign benchmark log and append the
series-trim decision to the parity handoff: keep the Phase135 routed-FFN
fused-quant line plus the MoE test sentinels and the MTP-draft correctness
fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort,
W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded
in the handoff and the per-phase bench artifacts. Fork re-mirrored on
51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree
097c862c).

Assisted-by: Claude:opus-4.8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… 1edddc8fe

The campaign patches 0048-0063 were added without matching fork commits.
After a keep/drop review, the series is trimmed and re-mirrored 1:1 onto the
fork branch mudler/llama.cpp:localai-paged (HEAD 1edddc8fe, on 51168c5ee).

Kept, renumbered from the fork (now carry Assisted-by + Signed-off-by):
- 0048 test(paged): cover MoE swiglu down chain      (was 0051, fd920cf8a)
- 0049 test(paged): cover MoE weighted combine chain (was 0052, a85c1e098)
- 0050 test(paged): cover ragged MoE dispatch        (was 0053, 2fed6aacf)
- 0051 fix(speculative): disable backend sampling for MTP drafts (was 0054, f1d976f06)
- 0052 feat(paged): whole-pattern MoE matcher + routed-FFN fused NVFP4-quant
       down MMQ (new, 1edddc8fe)

Dropped (no fork commits, removed from the series):
- 0048-0050 W4A16 grouped-tile pack/tune/pad: dead line, W4A16 ~1.5x slower
  than grouped-MMQ.
- 0055-0063 speculative/moe/mul-mat/cublas route traces + the rejected small-M
  tile-policy knob (0059).
- All other 110-140 campaign markers not needed by Phase135 (GPU-sort,
  W4A16-direct-A, boundary trace/timing, Phase133 sorted-F32, Phase134
  fused-SWIGLU, Phase138 finalize) carry no code in this tree.

Tree-hash proof (the mirror invariant): a fresh detached worktree at
LLAMA_VERSION 0ed235ea2c17a19fc8238668653946721ed136fd with every on-disk
patches/paged/0*.patch applied in numeric order (git apply) stages to tree
097c862c6834b7d8b90419b305b8402155ef8373, byte-identical to fork HEAD
1edddc8fe's tree. Series is 43 patches (0001-0047 unchanged + 0048-0052).

Gated on GB10 sm_121a: default md5 MoE 8cb0ce23 / dense 5951a5b4 unchanged;
opt-in md5-clean; MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46,
MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6; six mmq_moe_quantized_raw
markers with zero sorted launches on the opt-in sentinel.

Assisted-by: Claude:opus-4.8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…gram)

Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict
to a ggml-execution-architecture-conditional one: same-silicon 2-3x is
software architecture, not silicon. Add EXECUTION_REARCH_SCOPE.md, a phased
additive program (P1 bf16-native stream, P2 expert-major fused MoE region,
P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve
GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path
md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery
arithmetic grounded in the both-engine nsys buckets, and upstream-clash
analysis. Point the README docs list and PARITY_HANDOFF forward-direction at
it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Adversarial verification against the canonical fork mudler/llama.cpp:localai-paged
HEAD 1edddc8fe found the scope doc's section-3 seam references were anchored to
the abandoned pre-trim tree 237ad9b96, which the immediately-preceding commit
b529cc5 reset away. Two classes of defect, both corrected:

- Phantom scaffolding (honesty): the doc claimed "the team has already started
  scaffolding P1 and P3" citing four commits (237ad9b96 bf16 GDN state cache,
  afc2c7030 act-quant trace, ea0875d14 LLAMA_BF16_CUBLAS_F32_OUT, 7967ad47f
  W4A16 direct-A stub) that b529cc5 TRIMMED - none exist at 1edddc8fe (git
  cat-file: not a valid object). w4a16-policy.h, test-cuda-w4a16-policy.cpp and
  ggml_cuda_mul_mat_id_w4a16_grouped_direct_a are absent from the tree. Reworded
  P1 plank-1 and the P3 mechanism/files/effort to say these must be re-introduced
  on top of the surviving grouped W4A16 path (patch 0035), not "finished".

- Stale line numbers (additivity): every file:line was off (computed against the
  larger 237ad9b96 tree). Re-anchored to 1edddc8fe: ggml_cuda_try_fuse 4232 (was
  4661), capture loop 4908 (was 5444), moe whole-pattern matcher 4157 (was 4678),
  routed_ffn_poc moe-ffn.cu:275 (was 456), grouped W4A16 hook ggml-cuda.cu:2797
  (was 3093/3188; the direct-A hooks 3085/3171 never existed), concurrent_event
  machinery 4769 (was 5305-5318), continuous-batch budget server-context.cpp
  3083-3135 with LLAMA_MAX_BATCH_TOKENS at 3105 / prefill_budget_step at 3113
  (was 3122-3200).

Numbers (attribution table, recovery arithmetic), the six P0 kill-gates, and the
unreachable-floor honesty were verified sound and left unchanged.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Additive regen mirroring fork mudler/llama.cpp:localai-paged HEAD 653bb2f3d
(base 1edddc8fe + 3 P1 commits). Patches 0001-0052 are untouched.

- 0053 residual-segment executor + norm-bf16.{cu,cuh} + LLAMA_BF16_CUBLAS_F32_OUT
- 0054 bf16 residual-add + rope op-variants
- 0055 BF16_STREAM_SEGMENT test-backend-ops sentinel

Kill-gate: a fresh detached worktree at pin 0ed235ea2c17a19fc8238668653946721ed136fd
applied all 46 on-disk patches in numeric order (strict git apply) and staged
tree 6cf1523047e0e38679baff20844bdc9e6829eb22, byte-for-byte == fork HEAD tree.
All default-off (LLAMA_BF16_STREAM); default md5 canonical both models
(MoE 8cb0ce23777bf55f92f63d0292c756b0, dense 5951a5b4d624ce891e22ab5fca9bc439).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
P1 of the EXECUTION_REARCH_SCOPE additive program landed: LLAMA_BF16_STREAM
(default-off) bf16-resident residual-segment executor for the q36 MoE model's
projection boundaries.

- EXECUTION_REARCH_SCOPE.md: dated "P1 RESULT" subsection (P0 kill-gate GO,
  full build-out deltas, KL, correctness gates, honest magnitude, provenance).
- PARITY_HANDOFF.md: chronology note (verdict, engagement, prefill/KL numbers,
  fork commits, deferred-not-failed measurements).

Key reframe recorded: q36 GDN/attention projections are BF16 weights (not
NVFP4), so bf16-stream is a MoE-model prefill lever; the dense model quantizes
those projections to NVFP4 and engages nothing (stays bit-identical). Prefill
MoE @512 +1.99% (reproducible, at noise floor), KL delta -0.00052 (KL-improving),
all md5 + test-backend-ops gates green. Fork HEAD 653bb2f3d, tree 6cf1523047.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…gap)

P2 (expert-major fused routed-FFN region executor, LLAMA_MOE_REGION_EXECUTOR,
default-off) is recorded as NO-GO on two independent signals; nothing built
beyond the P0 kill-gate, nothing landed, fork localai-paged HEAD untouched at
653bb2f3d (LocalAI series stays at 46 patches, 0001-0055).

(1) Primary GO metric flat: n=257 MOE_SWIGLU_DOWN region 1022.15 us vs
grouped-MMQ control 1021.61 us = -0.05% (needed >5% faster); n=128 -0.34%;
MUL_MAT_ID_RAGGED_MOE +0.48%/+0.28% (region never engages). All inside the
5-sample spread - reproduces the six prior one-boundary transplants
(phases 113/114/122/123/125/127). A compact expert-major layout + single sort,
both GEMMs still ragged grouped-MMQ, does not move the ragged-tile tax; that
needs P3 Marlin persistent-CTA, not a P2 layout swap.

(2) Decisive structural blocker: q36-35b-a3b-nvfp4 ships separate
ffn_gate_exps/ffn_up_exps (+ per-tensor .scale) with ggml_swiglu_split, not the
merged gate_up->VIEW->VIEW->SWIGLU->down shape the whole-pattern matcher
requires; the matcher, region executor, and pre-existing POC/fused-quant all
engage 0x on q36 in prefill and decode. KL delta 0.000000 is vacuous (0
engagement). Default md5 canonical both models (MoE 8cb0ce23, dense 5951a5b4);
test-backend-ops all green both arms.

Prerequisite handoff (gates P2 and P3): rebuild the seam for q36's
separate/scaled/swiglu-split FFN shape before any MoE-region lever can engage,
then re-evaluate a fused two-GEMM region (not a layout swap). Topic branch
p2-moe-region retained on the fork for forensics at 2d87564dd (base 653bb2f3d),
not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
P4 (token-granular continuous-batching scheduler, LLAMA_CONTINUOUS_BATCH_V2,
default-off) stopped honestly at the P0 perf kill-gate. The kill-gate subset
(per-seq chunked-prefill cursors + adaptive decode bucketing, server-side only,
zero ggml/ files, ~68 LOC + a new unit-tested server-admission-policy.h) was
implemented and correctness-proven green (canonical md5 both models default-off
AND cbv2-on: MoE 8cb0ce23, dense 5951a5b4; test-backend-ops MUL_MAT 1146/1146,
MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46; cursor-interleave PROVEN via
LLAMA_CBV2_TRACE with decode+prefill co-batched and per-seq cursors advancing
across steps, dbucket==n_decode no-pad; determinism-NEUTRAL: CBv2 diverges from
control no more than control diverges from itself, the paged concurrent-greedy
path being inherently non-deterministic run-to-run in the baseline too).

The kill-gate GO criterion - a >20% TTFT-under-load drop with md5 green and
serving-aggregate not regressed - was NOT demonstrated: the staggered/burst TTFT
A/B was force-terminated by the harness mid-run (CONTROL-only, 30/60 raws), so
the TTFT deltas are not-yet-measured placeholders, not measured neutrality. Per
the phased contract go=false was the kill-gate default: nothing built beyond P0
(no SLOT_STATE_PREEMPTED, no aging/starvation-freedom), nothing landed. This is
the scope-anticipated outcome - P4 is a GB10 TTFT/fairness/enabler lever, not a
throughput lever (decode is GPU-compute-bound), so a NO-GO on the TTFT gate is
expected and any throughput payoff is non-GB10.

Records the honest rejection in EXECUTION_REARCH_SCOPE.md (P4 RESULT subsection)
and PARITY_HANDOFF.md chronology, including the re-score path: read the finalized
DGX ~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md once the CANDIDATE arm
completes; a genuine >20% staggered-TTFT drop clearing max(2%, 3*stdev) re-scores
go=true and triggers the full P4 build-out. Fork localai-paged untouched at
653bb2f3d; LocalAI series stays at 46 patches; topic branch p4-cbv2 retained on
the DGX fork at ebb649335 (base 653bb2f3d, not pushed).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The forced-report placeholders are replaced with the completed 60/60-raw A/B
from dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md: NO-GO confirmed by
measurement, and stronger than flat. CBv2 fair-share chunked prefill regresses
TTFT under staggered load (N=32 p50 +33.6%, N=128 p50 +15.5%) and regresses
aggregate/decode -6.9% beyond noise at staggered N=128. Analysis recorded:
processor-sharing delays near-uniform prompt completion by construction; the
scheduler-shaped-TTFT premise is partially refuted for GB10 (patch 0016 already
captures the schedulable win); TTFT parity routes through P3/P5 prefill compute.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…med shared-hardware floor

P5 ported the full six-kernel vLLM-FLA chunk_gated_delta_rule_fwd pipeline
(cumsum, chunk_scaled_dot_kkt, blocked merge_16x16_to_64x64 solve_tril,
recompute_w_u, register/smem-resident chunk_gated_delta_rule_fwd_h with the
chunk loop in-kernel, chunk_fwd_o) to CUDA tf32 mma, per-kernel validated vs
host fp64 (o NMSE 2.2e-7, final-state 1.2e-7), integrated behind
LLAMA_GDN_FLA_CHUNK=1 (default-off), and A/B'd in-backend vs the shipped M5
chunked scan.

P0 perf kill-gate FAILED decisively: nsys --cuda-graph-trace=node, MoE
q36-35b-a3b, npp2048 M5 56.31 vs FLA 119.46 us/tok (FLA 2.12x SLOWER,
gdn_delta_pct -112.1); npp512 2.29x slower; end-to-end S_PP -13.33%/-13.12%.
GO required FLA >10% faster at npp2048.

Novel decomposition: the blocked solve_tril is only ~2.8% of the FLA bucket
(55.6 ms); fwd_h 46.2% (903 ms) + fwd_o 31.5% (617 ms) dominate. The cost is
the state-recurrence GEMMs plus per-chunk h-state materialization to global
LPDDR5x that FLA's split-kernel structure forces; the fused M5 single kernel
keeps the 128x128 state resident in smem and never materializes per-chunk h,
so it is 2.1x faster on GB10's low-bandwidth memory. So the GDN prefill bucket
(+59.2, the single largest prefill lever) is a confirmed shared-hardware /
memory-bandwidth floor, NOT recoverable by the blocked-solve algorithm.
Extends Phase74 (standalone blocked-inverse 0.59x) and bf16-C64 (-18.75%).

Gates: SMEM PASS (max 96KB < 99KB cap). KL band GREEN (FLA KLD 0.137028 vs
control 0.136563, delta +0.000465 < 0.01; same-top-p 84.61%). DEFAULT path
untouched (canonical md5 GREEN both models default-off AND FLA-on; MoE
8cb0ce23, dense 5951a5b4; GATED_DELTA_NET DEFAULT 46/46). Nothing landed;
series stays at 46 patches. WIP on DGX fork branch p5-fla-gdn (2d64c37f0),
NOT pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…e ceiling

P6 (final program phase) could not run its kill-gate: the DGX/GB10 was
unreachable for the entire window (cloudflared access via prem-vm returned
HTTP 530 / websocket bad-handshake on every probe; re-confirmed with 5 fresh
probes). Stage 0a (measured nsys graph-node decode ceiling) and Stage 0b
(fp8-e4m3 kernel + kill-gate A/B) were physically impossible with no GPU.

Records the honest infra-block (NOT a measured NO-GO, NOT a NO-GO-by-ceiling)
plus the load-bearing artifact: the analytical fp8-KV decode ceiling table.
fp8 halves KV bytes -> theoretical-max decode saving = 0.5 x flash-attn share:
ctx256 0.65% (standard shape hard NO-GO), ctx1024 2.55%, ctx2048 4.98% (first
crosses +3%), ctx4096 9.49%, ctx8192 17.34%. The win, if realizable, lives
only at ctx>=2048; the hybrid-GDN structure (10/40 layers carry KV, 30 GDN
layers hold fixed-size recurrent state with no KV) caps what any KV-dtype
lever can save. The dominant null stands unrefuted: Q8_0 KV was a measured
+7.8% decode regression on GB10. Notes the capacity-play framing (fp8-KV as a
memory feature remains open even if throughput-flat).

Fork localai-paged untouched at 653bb2f3d; series stays at 46 patches
(0001-0055); P3's p3-w4a16-direct work undisturbed. Docs-only; no code, no
topic branch, no patches. Not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…fill conclusion

P3 (the last big prefill lever) is a decisive NO-GO. The direct-A W4A16 Marlin
path was re-created per the section-3 contract, engaged behind
LLAMA_W4A16_DIRECT_A, and A/B'd against the FP4-MMQ default: -46.9/-48.0/-49.1%
at M=512/1024/2048 (MoE q36-35b-a3b, 3-iter medians). The forensics retry is
REFUTED - the integration tax it blamed was genuinely removed (act-quant
18.92 -> ~0 us/tok; host expert-sort + src1-gather + separate cast eliminated)
and direct-A still lost. nsys graph-node decomposition: the mature bf16
grouped-W4A16 GEMM = 323.90 us/tok = 1.97x the FP4-MMQ int8 GEMM (164.6) =
exactly bf16 = half int8/FP4 tensor-core peak on sm_121. Bucket 2 (GEMM tiling,
+56.5) is now a CONFIRMED FP4-MMQ-optimal floor on GB10, joining bucket 1 (GDN
scan, P5-confirmed). Novel sub-finding: fusing the A-gather in-kernel is a NET
pessimization vs a separate bf16 pre-cast (+128 > ~63 tax removed), a
GB10-specific inversion of the no-round-trips heuristic. KL in-band and better
than control (KLD 0.130260 / same-top-p 85.172%); default md5s green both models;
engagement proven (7680 env-on, 0 default). Nothing built beyond P0, nothing
landed; fork localai-paged HEAD untouched at 653bb2f3d, series stays 46 patches;
topic branch p3-w4a16-direct retained on the DGX fork at 8eef7ba43 (NOT pushed).

Because P3 is the last major lever, this also writes the program-level
conclusion into EXECUTION_REARCH_SCOPE.md section 4a (dated) and corrects the
pre-execution projection to measured reality: six phases gated, exactly one
landed (P1 +2% MoE prefill, bucket-3 projection boundary); P2/P3/P4/P5 rejected,
P6 blocked-on-infra. Prefill closes to ~50-51% of vLLM (not ~55-65%),
serving-agg stays ~60.7% (not ~80%), decode-GPU-steady stays ~86% (not ~95%),
TTFT stays ~3.4x - because the two largest prefill buckets (1+2 = +115.7 of the
198.9 gap) are confirmed silicon/bandwidth floors that lift only on datacenter
Blackwell. This confirms and strengthens the standing conclusion that GB10
throughput-parity is unreachable by exhaustion; the paged fork's precision
parity + memory advantage stand. Default path untouched; canonical md5s green.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…apacity-play open

Retry of P6 unblocked the prior infra-block (DGX reachable via ssh dgx.casa) and
ran the kill-gate. Two measured artifacts replace the analytical estimates:

Stage 0a decode ceiling (v2 per-kernel decode-isolation, cross-checked within
0.3% of the batched-bench wall t_tg): fp8-KV theoretical-MAX decode saving
(fa-only) tops at +8.81% at ctx8192 x npl8 and clears +3% only at long context;
standard npl128 serving shapes reach +2.2/+3.4%. This refutes the earlier
analytical prior (0.65% std, +17.34% ctx8192) in both directions.

Stage 0b zero-code Q8_0-KV A/B proxy at the highest-ceiling shape (5 reps/arm):
dense ctx8192 +0.37% decode (flat), moe ctx8192 -2.63% decode REGRESSION. Even
Q8_0 - which wins on the integer DP4A fattn-vec dot that e4m3 cannot use -
realizes ~none of the ceiling; dequant-in-attention eats the KV-read BW saving,
re-confirming the historical Q8_0 +7.8% null. e4m3's KQ path is strictly worse
than Q8_0's, so the e4m3 throughput kernel is a definitive NO-GO and was not
built. The capacity-play (halving the 10/40 attention layers' KV footprint)
stays open as a footprint feature.

Default path measured green on the byte-identical worktree (canonical greedy-md5
re-run: MoE 8cb0ce23, dense 5951a5b4, paged). Fork localai-paged untouched at
653bb2f3d; topic branch p6-fp8-kv retained on the DGX, not pushed; series stays
46 patches (0001-0055). P3's landed program conclusion is preserved; only the
now-stale P6 status descriptors in it were corrected to the measured NO-GO.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants