perf(pymllm): speed up Qwen3-VL Jetson decode/prefill and refine benchmarking by jialilve · Pull Request #680 · UbiquitousLearning/mllm

jialilve · 2026-06-08T12:44:41Z

Summary

This PR builds on #670 and focuses on Qwen3-VL Jetson decode / prefill performance improvements, benchmark metric cleanup, and KV cache memory profiling in pymllm.

The main changes optimize the Qwen3-VL inference hot path, extend bench_one_batch for multimodal benchmarking, add real-image VIT timing, add multimodal prefill length sweeps, improve CUDA graph batch-size handling and profiler output, and safely skip settings that exceed KV pool capacity. This PR also fixes the static memory-budget semantics of mem_fraction_static.

What Changed

Improve Qwen3-VL decode / prefill hot paths:
- Add a Triton in-place fused kernel for Qwen3-VL M-RoPE.
- Reuse converted RoPE cos/sin cache instead of repeating dtype/device conversions per layer and token.
- Patch FlashInfer RMSNorm device properties on Jetson PyTorch builds that miss shared_memory_per_block_optin.
- Avoid CUDA synchronization in greedy sampling by passing CPU-side is_all_greedy metadata.
- Tensorize batched decode KV mapping writes in bench_one_batch to avoid per-request .item() synchronization.
Extend pymllm.bench_one_batch:
- Add a real-image multimodal benchmark mode with VIT timing.
- Add explicit multimodal prefill input length sweeps.
- Add multimodal_prefill_* metrics for full multimodal prefill, while keeping vit_prefill_* for vision-only timing.
- Cover benchmark sweep batch sizes in CUDA graph capture batch-size configuration.
- Skip benchmark settings that exceed KV pool capacity instead of failing mid-run.
- Improve profiler behavior:
  - continuous decode profiling window
  - no warmup profiling
  - gzipped .trace.json.gz output
  - key_averages logging
  - CUDA_PROFILER support for nsys capture
- Add --correct / --correctness-test as a single-stage smoke correctness check.
Fix static KV memory profiling:
- Interpret server.mem_fraction_static as the static memory budget fraction for model weights plus KV cache pool.
- Treat server.max_total_tokens as an upper bound after profiling, not a fixed value that bypasses memory profiling.
- Add detailed memory profile diagnostics for KV cache allocation failures.
- Use system available memory for Jetson integrated GPUs when estimating available memory.
Add test coverage for:
- Triton fused M-RoPE correctness and in-place behavior.
- FlashInfer RMSNorm Jetson device-property patching.
- Greedy sampling without tensor reduction.
- Multimodal VIT timing and multimodal input length resizing.
- Gzipped profiler trace path generation.
- Batched decode KV mapping writes.
- mem_fraction_static / max_total_tokens static memory-budget behavior.
Refresh docs:
- Rewrite pymllm README / README-ZH around the current runtime setup.
- Rewrite docs/pymllm_runtime/* for setup, runtime design, kernels, quantization, and developer guidance.
- Document Qwen3 projection implementation details and clarify the current benchmark correctness scope.

Key Implementation Notes

bench_one_batch --image now has two modes:
- If --input-len is omitted, it uses the actual multimodal prompt length produced by the processor.
- If --input-len is provided, it means target total multimodal prefill length: image placeholder tokens + text tokens.
Multimodal input resizing only pads or truncates text tokens. It never truncates image placeholder tokens, and it raises an error if the target length is smaller than the image-token count.
vit_prefill_* only measures self.visual(...).
multimodal_prefill_* measures the full multimodal prefill path.
--correct is intentionally a single-stage smoke check. It is not equivalent to a two-stage prefix-KV correctness test.
mem_fraction_static now represents the static memory budget fraction shared by model weights and KV cache pool. Older commands that relied on a very low mem_fraction_static plus a fixed max_total_tokens may need a higher mem_fraction_static.

Validation

Added unit tests for the new benchmark, memory profiling, sampling, RMSNorm, M-RoPE, and multimodal timing behavior.
Jetson records in the project notes show:
- Qwen3-VL W8A8 bench_one_batch 2048/128 on AGX Orin now reaches the target prefill / decode performance range.
- After fixing mem_fraction_static semantics, Orin NX can profile to max_tokens=4096.
- W8A8 keeps a clear advantage in long multimodal prefill sweeps.
- Qwen3-VL-4B AWQ and W8A8 were validated on AGX Orin with both synthetic benchmarks and multimodal prefill benchmarks.

Notes for Reviewers

This PR does not reintroduce the Qwen3 Jetson BF16, W4A16, and W8A8 serving foundation added in feat(pymllm): support Qwen3 Jetson BF16, W4A16, and W8A8 serving #670. It builds on that work to improve performance, benchmarking, and memory profiling.
bench_one_batch is still a direct ModelRunner benchmark. Its results should not be mixed with HTTP serving metrics such as TTFT, ITL, TPOT, or E2E latency.
Server mode may need a higher mem_fraction_static than direct bench_one_batch because tokenizer, scheduler, detokenizer, HTTP workers, IPC buffers, and Jetson unified-memory pressure are part of the real serving process.
Qwen3-VL ViT, lm_head, embeddings, and LayerNorm remain outside the current W8A8 quantized scope.

Summary by CodeRabbit

Documentation
- Major rewrites across setup, architecture, kernels, models, quantization, and usage with clearer setup steps, validation checklists, benchmark guidance, and Jetson Orin–specific instructions; README updates and launch/benchmark examples.
New Features
- Multimodal benchmarking (image+text) with vision prefill timing and a correctness/smoke mode; improved benchmark CLI and profiling controls.
Performance
- Fused M‑RoPE/Triton path and device-compatibility patch for norm kernels to improve stability and throughput; vision timing metrics.
Bug Fixes / Reliability
- Better GPU memory discovery for integrated GPUs and richer KV-cache allocation diagnostics.
Tests
- Expanded unit and CUDA tests for multimodal bench, memory profiling, sampling, and fused-path correctness.

Lift cfg.server.max_running_requests to max(batch_size) before creating the ModelRunner, mirroring SGLang main() which sets server_args.cuda_graph_max_bs = max(bench_args.batch_size). In pymllm the CUDA graph capture set and req_to_token_pool size both derive from ModelRunner.max_running_requests. Without this, sweeping a batch size larger than the default capture set makes decode silently fall off the graph path and run eager, biasing decode latency vs SGLang. No effect on batch_size=1 runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The decode loop read req_pool_indices[i].item() and seq_lens[i].item() per request, each a CUDA->CPU sync inside the timed decode region. SGLang and pymllm's own production decode path (orchestrator/model_runner_process) write the KV mapping without any per-request sync. Replace the loop with a single vectorized scatter req_to_token[req_pool_indices, seq_lens - 1] = out_cache_loc, removing 2*batch_size syncs per decode step. Negligible at batch_size=1; removes a linearly growing, SGLang-absent bias for batch_size > 1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a SGLang-style capacity guard: when batch_size exceeds max_total_num_tokens // (input_len + output_len), log a skip and return without running, instead of hard-failing mid-sweep on KV slot allocation. run_benchmark now treats a None result as a skip and continues. Mirrors ModelRunner.max_batch_size / latency_test_run_once in SGLang's bench_one_batch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ing#4) Rework profiling to match SGLang's bench_one_batch: - decode uses a single continuous profiler spanning [profile_start_step, profile_stop_step) -> one decode trace, instead of wrapping each step in its own profiler context (one file per step); - default profile start step is output_len // 2 (was (output_len-1)//2); - torch profiler runs with with_stack=True so kernels map to Python source; - traces are written as .trace.json.gz; - print the key_averages table on stop; - support --profile-activities CUDA_PROFILER to drive nsys via cudaProfilerStart/Stop. Replace the _maybe_profile context manager with start/stop helpers mirroring SGLang's start_profile/stop_profile. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

run_single_setting wrapped both stages in profiling unconditionally, so --profile would also profile the non-recorded warmup run, wasting time and emitting extra traces. Add an allow_profile flag (default True) and pass allow_profile=False for the warmup call, matching SGLang which hardcodes profile=False during warmup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add --correct (--correctness-test): load the tokenizer, encode real prompts, run one full prefill at batch_size=1, greedy-decode output_len tokens, and print the decoded text. main() branches to run_correctness when the flag is set. This is the single-stage variant. SGLang's --correct additionally does a cut_len two-stage prefill to exercise prefix-KV reuse; with greedy decoding the printed per-prompt text is identical either way. The two-stage variant can be layered on later (prepare_forward_batch_extend already accepts extend_prefix_lens > 0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Rewrite the pymllm_runtime web docs for accuracy and readability. - setup_and_usage: condense overview, drop docker-only assumption, fix launch args (mem_fraction_static/max_total_tokens/dtype defaults), refresh bench_one_batch (text/vit/multimodal modes) and profile (torch.profiler supported, nsys experimental); align mem_fraction_static semantics with the model-weights+KV-pool static budget - runtime_design / models_and_quantization / kernels_and_acceleration / developer_guide: humanize prose, trim AI-style scaffolding, keep content Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Slim down README.md and README-ZH.md to match the pymllm_runtime setup_and_usage doc: fix launch args (drop stale flags, mem_fraction_static 0.8 / max_total_tokens 4096), align mem_fraction_static semantics, refresh bench_one_batch (text / vit / multimodal modes) and profile guidance, and drop redundant status / forward-path / microbench / known-limitations sections. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-08T12:44:55Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d9380c7-75bc-4add-a62f-4ae5dc4ca9e9

📥 Commits

Reviewing files that changed from the base of the PR and between 32662c0 and 9ed367e.

📒 Files selected for processing (1)

README.md

📝 Walkthrough

Walkthrough

Adds Triton‑gated fused M‑RoPE and FlashInfer device‑property patching, integrated‑GPU memory profiling and richer KV diagnostics, a greedy sampling fast path, multimodal benchmarking and correctness mode, synchronized vision timing/ViT metrics, broad docs/README updates, and matching tests.

Changes

Runtime & Acceleration

Layer / File(s)	Summary
Fused M‑RoPE and FlashInfer norm patch `pymllm/layers/rope.py`, `pymllm/layers/rms_norm.py`, `pymllm/layers/__init__.py`	Add Triton‑gated fused M‑RoPE kernel with PyTorch fallback, export `apply_mrope_fused_`, and monkeypatch CUDA device properties for FlashInfer RMSNorm paths.
Qwen3‑VL RoPE wiring and vision timing `pymllm/models/qwen3_vl.py`, `pymllm/tests/test_qwen3_vl_deepstack.py`	Wire attention to fused RoPE with cache dtype/device alignment, add synchronized wall‑clock vision timing helper, record ViT prefill metrics, and add timing tests.
ModelRunner memory profiling (integrated GPU fallback) `pymllm/executor/model_runner.py`	Add system‑memory fallback (psutil/os.sysconf) for integrated devices, capture pre‑model‑load available GB, introduce `MemoryProfileResult`, and improve KV‑cache allocation diagnostics.
Sampling optimization: is_all_greedy `pymllm/executor/model_runner.py`, `pymllm/orchestrator/model_runner_process.py`, `pymllm/tests/test_model_runner_sampling.py`	Introduce `is_all_greedy` argument to `ModelRunner.sample()` and thread it through scheduler to avoid tensor reductions when all requests are greedy; add CPU test.

Multimodal Benchmarking & Correctness

Layer / File(s)	Summary
Multimodal contracts and metrics `pymllm/bench_one_batch.py`, `pymllm/tests/test_bench_one_batch.py`	Add `ExtendResult`, `MultimodalBenchInput`, `MultimodalProcessorBundle`, multimodal input construction, image token handling, and ViT/multimodal metric helpers with tests.
Runner extend API & profiling refactor `pymllm/bench_one_batch.py`	Update `PymllmBenchRunner.extend()` to accept multimodal tensors and `benchmark_vision_timing`; replace profiling context with explicit `_start_profile`/`_stop_profile` supporting torch.profiler and CUDA profiler (nsys).
Execution loop, KV writes, correctness mode `pymllm/bench_one_batch.py`	Vectorize decode KV writes, add `_max_batch_size_for()` capacity skips, adjust `generate_settings()` for image mode defaults, and implement tokenizer‑driven correctness test flow; update trace filename extension.

Documentation & READMEs

Layer / File(s)	Summary
Runtime design and transport modes `docs/pymllm_runtime/runtime_design.rst`	Rewrite architecture overview, request lifecycle, ModelRunner init/forward modes, and complete `tensor_transport_mode` behaviors.
Kernels and quantization docs `docs/pymllm_runtime/kernels_and_acceleration.rst`, `docs/pymllm_runtime/models_and_quantization.rst`	Clarify attention backend scope, CUDA Graph replay semantics, W4A16/W8A8 execution and CUTLASS requirements, layout conversion timing, and LinearMethod lifecycle.
Setup, usage, developer guide `docs/pymllm_runtime/setup_and_usage.rst`, `docs/pymllm_runtime/developer_guide.rst`	Add explicit install recipes (triton/flashinfer), document W8A8 CUTLASS JIT cache behavior, rewrite benchmark guidance for multimodal modes, and provide structured contribution guidance.
README EN/ZH modernization `pymllm/README.md`, `pymllm/README-ZH.md`	Consolidate launch entry (`pymllm.server.launch`), update environment/version tables, refresh examples (absolute image path rules), emphasize Jetson Orin/INT8/BF16 support, and expand benchmark documentation.

Tests

Layer / File(s)	Summary
Memory, RoPE, sampling, norm tests `pymllm/tests/test_model_runner_memory_pool.py`, `pymllm/tests/test_mrope_triton.py`, `pymllm/tests/test_model_runner_sampling.py`, `pymllm/tests/test_rms_norm.py`	Unit tests for memory profiling, integrated GPU behavior, fused M‑RoPE numerical correctness and in‑place semantics, sampling greedy fast path, and device‑property patching.
Bench multimodal & decode tests `pymllm/tests/test_bench_one_batch.py`	Tests for multimodal CLI/setting generation, ViT/multimodal metrics, multimodal input shaping (pad/truncate), extend()/decode() integration, and profile trace filename behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

UbiquitousLearning/mllm#653: Updates to README and supported‑models table that overlap README/README‑ZH changes in this PR.

Suggested reviewers

oreomaker
liang1232018
yirongjie
chenghuaWang

Poem

In carrot code I nibble lines of speed,
I weave a rope where Triton does the deed,
I count the tokens, guard the memory pool,
Greedy hops skip work — that’s my tiny rule,
🐇 — a rabbit cheering Orin’s lead.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 16.28% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main performance optimization changes focused on Qwen3-VL Jetson decode/prefill and benchmark refinements.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering summary, detailed changes, implementation notes, validation, and reviewer guidance.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chenghuaWang

LGTM

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

pymllm/executor/model_runner.py (1)
142-153: ⚡ Quick win

Add a docstring for MemoryProfileResult.

This new class is part of runtime diagnostics and should document field meaning/units to keep error reporting maintainable.
Suggested update
 `@dataclass`
 class MemoryProfileResult:
+    """Snapshot of KV-cache memory profiling inputs and computed limits.
+
+    All memory values are in GiB unless otherwise noted.
+    """
     pre_model_available_gb: float
     available_gb: float
     mem_fraction: float
As per coding guidelines, “Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pymllm/executor/model_runner.py` around lines 142 - 153, Add a clear
docstring to the MemoryProfileResult dataclass explaining its purpose (runtime
memory profiling result used for token budgeting) and document each field with
short descriptions and units: pre_model_available_gb (GB available before model
allocation), available_gb (GB available after model allocation), mem_fraction
(fraction of available memory to use), static_kv_budget_gb (GB reserved for
static KV cache), cell_size_bytes (bytes per model cell), profiled_max_tokens
(max tokens observed during profiling), requested_max_total_tokens (optional
requested max total tokens), effective_max_tokens (final token limit used), and
gdn_pool_gb (GB for GDN pool, default 0.0); keep it concise and place it
immediately above the MemoryProfileResult declaration in model_runner.py.
Source: Coding guidelines
pymllm/layers/rope.py (1)
549-575: 💤 Low value

Fallback path does not modify tensors in-place despite _ suffix convention.

The trailing underscore in apply_mrope_fused_ signals in-place modification, but the fallback branch (lines 568-575) returns new tensors from apply_mrope rather than modifying the inputs in-place. While the current callers correctly use the returned values, this inconsistency could cause subtle bugs if callers assume in-place semantics based on the naming convention.

Consider either:

Documenting this clearly in the docstring, or

Copying fallback results back into the original tensors for consistent in-place behavior
📝 Option 2: Make fallback truly in-place
     if not _can_use_mrope_fused(q, k, positions, cos_sin_cache):
-        return apply_mrope(
+        q_out, k_out = apply_mrope(
             q,
             k,
             positions,
             cos_sin_cache,
             mrope_section,
             mrope_interleaved,
         )
+        q.copy_(q_out)
+        k.copy_(k_out)
+        return q, k
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pymllm/layers/rope.py` around lines 549 - 575, The fallback branch in
apply_mrope_fused_ currently returns new tensors from apply_mrope instead of
performing in-place updates despite the trailing underscore; change the fallback
to call apply_mrope(q, k, positions, cos_sin_cache, mrope_section,
mrope_interleaved), then copy the returned tensors back into the original q and
k (e.g., using q.copy_(returned_q) and k.copy_(returned_k)) and finally return
the original q and k so callers see in-place semantics; ensure you preserve
device/dtype/shape expectations and use the unique symbols apply_mrope_fused_,
apply_mrope, q, and k to locate the code.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pymllm/bench_one_batch.py`:
- Around line 1218-1222: DEFAULT_CORRECTNESS_PROMPTS contains a typo; update the
second tuple entry in DEFAULT_CORRECTNESS_PROMPTS from "The capital of the
United Kindom is" to "The capital of the United Kingdom is" so the prompt reads
correctly; modify the string literal in the DEFAULT_CORRECTNESS_PROMPTS tuple to
the corrected spelling.

---

Nitpick comments:
In `@pymllm/executor/model_runner.py`:
- Around line 142-153: Add a clear docstring to the MemoryProfileResult
dataclass explaining its purpose (runtime memory profiling result used for token
budgeting) and document each field with short descriptions and units:
pre_model_available_gb (GB available before model allocation), available_gb (GB
available after model allocation), mem_fraction (fraction of available memory to
use), static_kv_budget_gb (GB reserved for static KV cache), cell_size_bytes
(bytes per model cell), profiled_max_tokens (max tokens observed during
profiling), requested_max_total_tokens (optional requested max total tokens),
effective_max_tokens (final token limit used), and gdn_pool_gb (GB for GDN pool,
default 0.0); keep it concise and place it immediately above the
MemoryProfileResult declaration in model_runner.py.

In `@pymllm/layers/rope.py`:
- Around line 549-575: The fallback branch in apply_mrope_fused_ currently
returns new tensors from apply_mrope instead of performing in-place updates
despite the trailing underscore; change the fallback to call apply_mrope(q, k,
positions, cos_sin_cache, mrope_section, mrope_interleaved), then copy the
returned tensors back into the original q and k (e.g., using q.copy_(returned_q)
and k.copy_(returned_k)) and finally return the original q and k so callers see
in-place semantics; ensure you preserve device/dtype/shape expectations and use
the unique symbols apply_mrope_fused_, apply_mrope, q, and k to locate the code.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f479274-833c-43ff-a40b-3d4c75a9071b

📥 Commits

Reviewing files that changed from the base of the PR and between 729ca4c and d442aae.

⛔ Files ignored due to path filters (2)

bench_assets/two_cats.jpg is excluded by !**/*.jpg
bench_assets/two_cats_480p.jpg is excluded by !**/*.jpg

📒 Files selected for processing (20)

docs/pymllm_runtime/developer_guide.rst
docs/pymllm_runtime/kernels_and_acceleration.rst
docs/pymllm_runtime/models_and_quantization.rst
docs/pymllm_runtime/runtime_design.rst
docs/pymllm_runtime/setup_and_usage.rst
pymllm/README-ZH.md
pymllm/README.md
pymllm/bench_one_batch.py
pymllm/executor/model_runner.py
pymllm/layers/__init__.py
pymllm/layers/rms_norm.py
pymllm/layers/rope.py
pymllm/models/qwen3_vl.py
pymllm/orchestrator/model_runner_process.py
pymllm/tests/test_bench_one_batch.py
pymllm/tests/test_model_runner_memory_pool.py
pymllm/tests/test_model_runner_sampling.py
pymllm/tests/test_mrope_triton.py
pymllm/tests/test_qwen3_vl_deepstack.py
pymllm/tests/test_rms_norm.py

coderabbitai

🧹 Nitpick comments (2)

README.md (1)

40-40: ⚡ Quick win

Add alt text to images for accessibility.

The embedded images are missing alt text attributes, which are essential for screen readers and accessibility compliance. Consider adding descriptive alt text.

♻️ Proposed fix to add alt text

 <div align="center">
-  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%">
+  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" alt="Jetson Orin pymllm vs llama.cpp speedup summary (input_len=2048)" width="90%">
 </div>

 <div align="center">
-  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%">
+  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" alt="Jetson Orin pymllm prefill throughput (input_len=2048)" width="90%">
 </div>

Also applies to: 44-44

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 40, Image tags in README lack alt text which harms
accessibility; update each <img
src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%"> (and
the other similar <img> entries) to include a meaningful alt attribute (e.g.,
alt="pymllm Jetson speedup summary") by editing the README image elements so
they read with an alt="..." describing the image content for screen readers.

README-ZH.md (1)

41-41: ⚡ Quick win

Add alt text to images for accessibility.

The embedded images are missing alt text attributes, which are essential for screen readers and accessibility compliance. Consider adding descriptive alt text.

♻️ Proposed fix to add alt text

 <div align="center">
-  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%">
+  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" alt="Jetson Orin pymllm vs llama.cpp 加速比汇总 (input_len=2048)" width="90%">
 </div>

 <div align="center">
-  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%">
+  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" alt="Jetson Orin pymllm prefill 吞吐率 (input_len=2048)" width="90%">
 </div>

Also applies to: 45-45

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README-ZH.md` at line 41, The image tags referencing
"./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" (and the other image at
the same block) lack alt text; update the <img> elements in README-ZH.md (the
img referencing that filename) to include a descriptive alt="..." attribute
(e.g., alt="PyMLLM on Jetson: speedup summary") for screen-reader accessibility
and repeat for the similar image at the other location.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@README-ZH.md`:
- Line 41: The image tags referencing
"./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" (and the other image at
the same block) lack alt text; update the <img> elements in README-ZH.md (the
img referencing that filename) to include a descriptive alt="..." attribute
(e.g., alt="PyMLLM on Jetson: speedup summary") for screen-reader accessibility
and repeat for the similar image at the other location.

In `@README.md`:
- Line 40: Image tags in README lack alt text which harms accessibility; update
each <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg"
width="90%"> (and the other similar <img> entries) to include a meaningful alt
attribute (e.g., alt="pymllm Jetson speedup summary") by editing the README
image elements so they read with an alt="..." describing the image content for
screen readers.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d93aebb8-6766-4872-8795-f7d5a1c42927

📥 Commits

Reviewing files that changed from the base of the PR and between d442aae and f346552.

⛔ Files ignored due to path filters (2)

assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg is excluded by !**/*.jpg
assets/jetson/pymllm-jetson-speedup-summary-2048.jpg is excluded by !**/*.jpg

📒 Files selected for processing (2)

README-ZH.md
README.md

jialilve and others added 19 commits April 30, 2026 02:19

docs(pymllm): document qwen3 projection alignment

8699786

perf(qwen3-vl): fuse M-RoPE and reuse RoPE cache

ec95ddd

feat(bench): add multimodal VIT timing benchmark mode

d837252

perf(rmsnorm): patch Jetson FlashInfer device properties

f9be384

perf(sampling): avoid GPU sync for greedy decode

0e93e6d

test(bench): expect gzipped profiler traces

e33a5ef

test(bench): cover batched decode kv mapping

8dcb2a1

docs(bench): label correct mode as smoke check

e7ed949

feat(bench): sweep multimodal prefill input length

75a67e5

perf(memory): refine static KV pool profiling

2dae7ce

perf(memory): improve KV cache budget diagnostics

7eb9502

jialilve requested review from chenghuaWang, liang1232018, oreomaker, xumengwei and yirongjie as code owners June 8, 2026 12:44

chenghuaWang approved these changes Jun 8, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread pymllm/bench_one_batch.py

jialilve added 2 commits June 8, 2026 16:16

docs(readme): highlight Jetson Orin pymllm performance

f346552

docs(pymllm): fix benchmark prompt typo

32662c0

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

chenghuaWang approved these changes Jun 8, 2026

View reviewed changes

Merge branch 'main' into feature/jetson-qwen3-family-bf16-w4a16-w8a8

9ed367e

chenghuaWang merged commit 4f3ed31 into UbiquitousLearning:main Jun 8, 2026
1 of 2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(pymllm): speed up Qwen3-VL Jetson decode/prefill and refine benchmarking#680

perf(pymllm): speed up Qwen3-VL Jetson decode/prefill and refine benchmarking#680
chenghuaWang merged 22 commits into
UbiquitousLearning:mainfrom
jialilve:feature/jetson-qwen3-family-bf16-w4a16-w8a8

jialilve commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

chenghuaWang left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jialilve commented Jun 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Changed

Key Implementation Notes

Validation

Notes for Reviewers

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

chenghuaWang left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jialilve commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading