perf(pymllm): speed up Qwen3-VL Jetson decode/prefill and refine benchmarking#680
Conversation
Lift cfg.server.max_running_requests to max(batch_size) before creating the ModelRunner, mirroring SGLang main() which sets server_args.cuda_graph_max_bs = max(bench_args.batch_size). In pymllm the CUDA graph capture set and req_to_token_pool size both derive from ModelRunner.max_running_requests. Without this, sweeping a batch size larger than the default capture set makes decode silently fall off the graph path and run eager, biasing decode latency vs SGLang. No effect on batch_size=1 runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The decode loop read req_pool_indices[i].item() and seq_lens[i].item() per request, each a CUDA->CPU sync inside the timed decode region. SGLang and pymllm's own production decode path (orchestrator/model_runner_process) write the KV mapping without any per-request sync. Replace the loop with a single vectorized scatter req_to_token[req_pool_indices, seq_lens - 1] = out_cache_loc, removing 2*batch_size syncs per decode step. Negligible at batch_size=1; removes a linearly growing, SGLang-absent bias for batch_size > 1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a SGLang-style capacity guard: when batch_size exceeds max_total_num_tokens // (input_len + output_len), log a skip and return without running, instead of hard-failing mid-sweep on KV slot allocation. run_benchmark now treats a None result as a skip and continues. Mirrors ModelRunner.max_batch_size / latency_test_run_once in SGLang's bench_one_batch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing#4) Rework profiling to match SGLang's bench_one_batch: - decode uses a single continuous profiler spanning [profile_start_step, profile_stop_step) -> one decode trace, instead of wrapping each step in its own profiler context (one file per step); - default profile start step is output_len // 2 (was (output_len-1)//2); - torch profiler runs with with_stack=True so kernels map to Python source; - traces are written as .trace.json.gz; - print the key_averages table on stop; - support --profile-activities CUDA_PROFILER to drive nsys via cudaProfilerStart/Stop. Replace the _maybe_profile context manager with start/stop helpers mirroring SGLang's start_profile/stop_profile. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run_single_setting wrapped both stages in profiling unconditionally, so --profile would also profile the non-recorded warmup run, wasting time and emitting extra traces. Add an allow_profile flag (default True) and pass allow_profile=False for the warmup call, matching SGLang which hardcodes profile=False during warmup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add --correct (--correctness-test): load the tokenizer, encode real prompts, run one full prefill at batch_size=1, greedy-decode output_len tokens, and print the decoded text. main() branches to run_correctness when the flag is set. This is the single-stage variant. SGLang's --correct additionally does a cut_len two-stage prefill to exercise prefix-KV reuse; with greedy decoding the printed per-prompt text is identical either way. The two-stage variant can be layered on later (prepare_forward_batch_extend already accepts extend_prefix_lens > 0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite the pymllm_runtime web docs for accuracy and readability. - setup_and_usage: condense overview, drop docker-only assumption, fix launch args (mem_fraction_static/max_total_tokens/dtype defaults), refresh bench_one_batch (text/vit/multimodal modes) and profile (torch.profiler supported, nsys experimental); align mem_fraction_static semantics with the model-weights+KV-pool static budget - runtime_design / models_and_quantization / kernels_and_acceleration / developer_guide: humanize prose, trim AI-style scaffolding, keep content Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Slim down README.md and README-ZH.md to match the pymllm_runtime setup_and_usage doc: fix launch args (drop stale flags, mem_fraction_static 0.8 / max_total_tokens 4096), align mem_fraction_static semantics, refresh bench_one_batch (text / vit / multimodal modes) and profile guidance, and drop redundant status / forward-path / microbench / known-limitations sections. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds Triton‑gated fused M‑RoPE and FlashInfer device‑property patching, integrated‑GPU memory profiling and richer KV diagnostics, a greedy sampling fast path, multimodal benchmarking and correctness mode, synchronized vision timing/ViT metrics, broad docs/README updates, and matching tests. ChangesRuntime & Acceleration
Multimodal Benchmarking & Correctness
Documentation & READMEs
Tests
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pymllm/executor/model_runner.py (1)
142-153: ⚡ Quick winAdd a docstring for
MemoryProfileResult.This new class is part of runtime diagnostics and should document field meaning/units to keep error reporting maintainable.
Suggested update
`@dataclass` class MemoryProfileResult: + """Snapshot of KV-cache memory profiling inputs and computed limits. + + All memory values are in GiB unless otherwise noted. + """ pre_model_available_gb: float available_gb: float mem_fraction: floatAs per coding guidelines, “Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pymllm/executor/model_runner.py` around lines 142 - 153, Add a clear docstring to the MemoryProfileResult dataclass explaining its purpose (runtime memory profiling result used for token budgeting) and document each field with short descriptions and units: pre_model_available_gb (GB available before model allocation), available_gb (GB available after model allocation), mem_fraction (fraction of available memory to use), static_kv_budget_gb (GB reserved for static KV cache), cell_size_bytes (bytes per model cell), profiled_max_tokens (max tokens observed during profiling), requested_max_total_tokens (optional requested max total tokens), effective_max_tokens (final token limit used), and gdn_pool_gb (GB for GDN pool, default 0.0); keep it concise and place it immediately above the MemoryProfileResult declaration in model_runner.py.Source: Coding guidelines
pymllm/layers/rope.py (1)
549-575: 💤 Low valueFallback path does not modify tensors in-place despite
_suffix convention.The trailing underscore in
apply_mrope_fused_signals in-place modification, but the fallback branch (lines 568-575) returns new tensors fromapply_mroperather than modifying the inputs in-place. While the current callers correctly use the returned values, this inconsistency could cause subtle bugs if callers assume in-place semantics based on the naming convention.Consider either:
- Documenting this clearly in the docstring, or
- Copying fallback results back into the original tensors for consistent in-place behavior
📝 Option 2: Make fallback truly in-place
if not _can_use_mrope_fused(q, k, positions, cos_sin_cache): - return apply_mrope( + q_out, k_out = apply_mrope( q, k, positions, cos_sin_cache, mrope_section, mrope_interleaved, ) + q.copy_(q_out) + k.copy_(k_out) + return q, k🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pymllm/layers/rope.py` around lines 549 - 575, The fallback branch in apply_mrope_fused_ currently returns new tensors from apply_mrope instead of performing in-place updates despite the trailing underscore; change the fallback to call apply_mrope(q, k, positions, cos_sin_cache, mrope_section, mrope_interleaved), then copy the returned tensors back into the original q and k (e.g., using q.copy_(returned_q) and k.copy_(returned_k)) and finally return the original q and k so callers see in-place semantics; ensure you preserve device/dtype/shape expectations and use the unique symbols apply_mrope_fused_, apply_mrope, q, and k to locate the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pymllm/bench_one_batch.py`:
- Around line 1218-1222: DEFAULT_CORRECTNESS_PROMPTS contains a typo; update the
second tuple entry in DEFAULT_CORRECTNESS_PROMPTS from "The capital of the
United Kindom is" to "The capital of the United Kingdom is" so the prompt reads
correctly; modify the string literal in the DEFAULT_CORRECTNESS_PROMPTS tuple to
the corrected spelling.
---
Nitpick comments:
In `@pymllm/executor/model_runner.py`:
- Around line 142-153: Add a clear docstring to the MemoryProfileResult
dataclass explaining its purpose (runtime memory profiling result used for token
budgeting) and document each field with short descriptions and units:
pre_model_available_gb (GB available before model allocation), available_gb (GB
available after model allocation), mem_fraction (fraction of available memory to
use), static_kv_budget_gb (GB reserved for static KV cache), cell_size_bytes
(bytes per model cell), profiled_max_tokens (max tokens observed during
profiling), requested_max_total_tokens (optional requested max total tokens),
effective_max_tokens (final token limit used), and gdn_pool_gb (GB for GDN pool,
default 0.0); keep it concise and place it immediately above the
MemoryProfileResult declaration in model_runner.py.
In `@pymllm/layers/rope.py`:
- Around line 549-575: The fallback branch in apply_mrope_fused_ currently
returns new tensors from apply_mrope instead of performing in-place updates
despite the trailing underscore; change the fallback to call apply_mrope(q, k,
positions, cos_sin_cache, mrope_section, mrope_interleaved), then copy the
returned tensors back into the original q and k (e.g., using q.copy_(returned_q)
and k.copy_(returned_k)) and finally return the original q and k so callers see
in-place semantics; ensure you preserve device/dtype/shape expectations and use
the unique symbols apply_mrope_fused_, apply_mrope, q, and k to locate the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9f479274-833c-43ff-a40b-3d4c75a9071b
⛔ Files ignored due to path filters (2)
bench_assets/two_cats.jpgis excluded by!**/*.jpgbench_assets/two_cats_480p.jpgis excluded by!**/*.jpg
📒 Files selected for processing (20)
docs/pymllm_runtime/developer_guide.rstdocs/pymllm_runtime/kernels_and_acceleration.rstdocs/pymllm_runtime/models_and_quantization.rstdocs/pymllm_runtime/runtime_design.rstdocs/pymllm_runtime/setup_and_usage.rstpymllm/README-ZH.mdpymllm/README.mdpymllm/bench_one_batch.pypymllm/executor/model_runner.pypymllm/layers/__init__.pypymllm/layers/rms_norm.pypymllm/layers/rope.pypymllm/models/qwen3_vl.pypymllm/orchestrator/model_runner_process.pypymllm/tests/test_bench_one_batch.pypymllm/tests/test_model_runner_memory_pool.pypymllm/tests/test_model_runner_sampling.pypymllm/tests/test_mrope_triton.pypymllm/tests/test_qwen3_vl_deepstack.pypymllm/tests/test_rms_norm.py
There was a problem hiding this comment.
🧹 Nitpick comments (2)
README.md (1)
40-40: ⚡ Quick winAdd alt text to images for accessibility.
The embedded images are missing alt text attributes, which are essential for screen readers and accessibility compliance. Consider adding descriptive alt text.
♻️ Proposed fix to add alt text
<div align="center"> - <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%"> + <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" alt="Jetson Orin pymllm vs llama.cpp speedup summary (input_len=2048)" width="90%"> </div> <div align="center"> - <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%"> + <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" alt="Jetson Orin pymllm prefill throughput (input_len=2048)" width="90%"> </div>Also applies to: 44-44
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` at line 40, Image tags in README lack alt text which harms accessibility; update each <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%"> (and the other similar <img> entries) to include a meaningful alt attribute (e.g., alt="pymllm Jetson speedup summary") by editing the README image elements so they read with an alt="..." describing the image content for screen readers.README-ZH.md (1)
41-41: ⚡ Quick winAdd alt text to images for accessibility.
The embedded images are missing alt text attributes, which are essential for screen readers and accessibility compliance. Consider adding descriptive alt text.
♻️ Proposed fix to add alt text
<div align="center"> - <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%"> + <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" alt="Jetson Orin pymllm vs llama.cpp 加速比汇总 (input_len=2048)" width="90%"> </div> <div align="center"> - <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%"> + <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" alt="Jetson Orin pymllm prefill 吞吐率 (input_len=2048)" width="90%"> </div>Also applies to: 45-45
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README-ZH.md` at line 41, The image tags referencing "./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" (and the other image at the same block) lack alt text; update the <img> elements in README-ZH.md (the img referencing that filename) to include a descriptive alt="..." attribute (e.g., alt="PyMLLM on Jetson: speedup summary") for screen-reader accessibility and repeat for the similar image at the other location.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@README-ZH.md`:
- Line 41: The image tags referencing
"./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" (and the other image at
the same block) lack alt text; update the <img> elements in README-ZH.md (the
img referencing that filename) to include a descriptive alt="..." attribute
(e.g., alt="PyMLLM on Jetson: speedup summary") for screen-reader accessibility
and repeat for the similar image at the other location.
In `@README.md`:
- Line 40: Image tags in README lack alt text which harms accessibility; update
each <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg"
width="90%"> (and the other similar <img> entries) to include a meaningful alt
attribute (e.g., alt="pymllm Jetson speedup summary") by editing the README
image elements so they read with an alt="..." describing the image content for
screen readers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d93aebb8-6766-4872-8795-f7d5a1c42927
⛔ Files ignored due to path filters (2)
assets/jetson/pymllm-jetson-prefill-throughput-2048.jpgis excluded by!**/*.jpgassets/jetson/pymllm-jetson-speedup-summary-2048.jpgis excluded by!**/*.jpg
📒 Files selected for processing (2)
README-ZH.mdREADME.md
Summary
This PR builds on #670 and focuses on Qwen3-VL Jetson decode / prefill performance improvements, benchmark metric cleanup, and KV cache memory profiling in
pymllm.The main changes optimize the Qwen3-VL inference hot path, extend
bench_one_batchfor multimodal benchmarking, add real-image VIT timing, add multimodal prefill length sweeps, improve CUDA graph batch-size handling and profiler output, and safely skip settings that exceed KV pool capacity. This PR also fixes the static memory-budget semantics ofmem_fraction_static.What Changed
Improve Qwen3-VL decode / prefill hot paths:
shared_memory_per_block_optin.is_all_greedymetadata.bench_one_batchto avoid per-request.item()synchronization.Extend
pymllm.bench_one_batch:multimodal_prefill_*metrics for full multimodal prefill, while keepingvit_prefill_*for vision-only timing..trace.json.gzoutputkey_averagesloggingCUDA_PROFILERsupport for nsys capture--correct/--correctness-testas a single-stage smoke correctness check.Fix static KV memory profiling:
server.mem_fraction_staticas the static memory budget fraction for model weights plus KV cache pool.server.max_total_tokensas an upper bound after profiling, not a fixed value that bypasses memory profiling.Add test coverage for:
mem_fraction_static/max_total_tokensstatic memory-budget behavior.Refresh docs:
pymllmREADME / README-ZH around the current runtime setup.docs/pymllm_runtime/*for setup, runtime design, kernels, quantization, and developer guidance.Key Implementation Notes
bench_one_batch --imagenow has two modes:--input-lenis omitted, it uses the actual multimodal prompt length produced by the processor.--input-lenis provided, it means target total multimodal prefill length: image placeholder tokens + text tokens.vit_prefill_*only measuresself.visual(...).multimodal_prefill_*measures the full multimodal prefill path.--correctis intentionally a single-stage smoke check. It is not equivalent to a two-stage prefix-KV correctness test.mem_fraction_staticnow represents the static memory budget fraction shared by model weights and KV cache pool. Older commands that relied on a very lowmem_fraction_staticplus a fixedmax_total_tokensmay need a highermem_fraction_static.Validation
bench_one_batch2048/128 on AGX Orin now reaches the target prefill / decode performance range.mem_fraction_staticsemantics, Orin NX can profile tomax_tokens=4096.Notes for Reviewers
bench_one_batchis still a directModelRunnerbenchmark. Its results should not be mixed with HTTP serving metrics such as TTFT, ITL, TPOT, or E2E latency.mem_fraction_staticthan directbench_one_batchbecause tokenizer, scheduler, detokenizer, HTTP workers, IPC buffers, and Jetson unified-memory pressure are part of the real serving process.Summary by CodeRabbit
Documentation
New Features
Performance
Bug Fixes / Reliability
Tests