Skip to content

perf(pymllm): speed up Qwen3-VL Jetson decode/prefill and refine benchmarking#680

Merged
chenghuaWang merged 22 commits into
UbiquitousLearning:mainfrom
jialilve:feature/jetson-qwen3-family-bf16-w4a16-w8a8
Jun 8, 2026
Merged

perf(pymllm): speed up Qwen3-VL Jetson decode/prefill and refine benchmarking#680
chenghuaWang merged 22 commits into
UbiquitousLearning:mainfrom
jialilve:feature/jetson-qwen3-family-bf16-w4a16-w8a8

Conversation

@jialilve

@jialilve jialilve commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR builds on #670 and focuses on Qwen3-VL Jetson decode / prefill performance improvements, benchmark metric cleanup, and KV cache memory profiling in pymllm.

The main changes optimize the Qwen3-VL inference hot path, extend bench_one_batch for multimodal benchmarking, add real-image VIT timing, add multimodal prefill length sweeps, improve CUDA graph batch-size handling and profiler output, and safely skip settings that exceed KV pool capacity. This PR also fixes the static memory-budget semantics of mem_fraction_static.

What Changed

  • Improve Qwen3-VL decode / prefill hot paths:

    • Add a Triton in-place fused kernel for Qwen3-VL M-RoPE.
    • Reuse converted RoPE cos/sin cache instead of repeating dtype/device conversions per layer and token.
    • Patch FlashInfer RMSNorm device properties on Jetson PyTorch builds that miss shared_memory_per_block_optin.
    • Avoid CUDA synchronization in greedy sampling by passing CPU-side is_all_greedy metadata.
    • Tensorize batched decode KV mapping writes in bench_one_batch to avoid per-request .item() synchronization.
  • Extend pymllm.bench_one_batch:

    • Add a real-image multimodal benchmark mode with VIT timing.
    • Add explicit multimodal prefill input length sweeps.
    • Add multimodal_prefill_* metrics for full multimodal prefill, while keeping vit_prefill_* for vision-only timing.
    • Cover benchmark sweep batch sizes in CUDA graph capture batch-size configuration.
    • Skip benchmark settings that exceed KV pool capacity instead of failing mid-run.
    • Improve profiler behavior:
      • continuous decode profiling window
      • no warmup profiling
      • gzipped .trace.json.gz output
      • key_averages logging
      • CUDA_PROFILER support for nsys capture
    • Add --correct / --correctness-test as a single-stage smoke correctness check.
  • Fix static KV memory profiling:

    • Interpret server.mem_fraction_static as the static memory budget fraction for model weights plus KV cache pool.
    • Treat server.max_total_tokens as an upper bound after profiling, not a fixed value that bypasses memory profiling.
    • Add detailed memory profile diagnostics for KV cache allocation failures.
    • Use system available memory for Jetson integrated GPUs when estimating available memory.
  • Add test coverage for:

    • Triton fused M-RoPE correctness and in-place behavior.
    • FlashInfer RMSNorm Jetson device-property patching.
    • Greedy sampling without tensor reduction.
    • Multimodal VIT timing and multimodal input length resizing.
    • Gzipped profiler trace path generation.
    • Batched decode KV mapping writes.
    • mem_fraction_static / max_total_tokens static memory-budget behavior.
  • Refresh docs:

    • Rewrite pymllm README / README-ZH around the current runtime setup.
    • Rewrite docs/pymllm_runtime/* for setup, runtime design, kernels, quantization, and developer guidance.
    • Document Qwen3 projection implementation details and clarify the current benchmark correctness scope.

Key Implementation Notes

  • bench_one_batch --image now has two modes:
    • If --input-len is omitted, it uses the actual multimodal prompt length produced by the processor.
    • If --input-len is provided, it means target total multimodal prefill length: image placeholder tokens + text tokens.
  • Multimodal input resizing only pads or truncates text tokens. It never truncates image placeholder tokens, and it raises an error if the target length is smaller than the image-token count.
  • vit_prefill_* only measures self.visual(...).
  • multimodal_prefill_* measures the full multimodal prefill path.
  • --correct is intentionally a single-stage smoke check. It is not equivalent to a two-stage prefix-KV correctness test.
  • mem_fraction_static now represents the static memory budget fraction shared by model weights and KV cache pool. Older commands that relied on a very low mem_fraction_static plus a fixed max_total_tokens may need a higher mem_fraction_static.

Validation

  • Added unit tests for the new benchmark, memory profiling, sampling, RMSNorm, M-RoPE, and multimodal timing behavior.
  • Jetson records in the project notes show:
    • Qwen3-VL W8A8 bench_one_batch 2048/128 on AGX Orin now reaches the target prefill / decode performance range.
    • After fixing mem_fraction_static semantics, Orin NX can profile to max_tokens=4096.
    • W8A8 keeps a clear advantage in long multimodal prefill sweeps.
    • Qwen3-VL-4B AWQ and W8A8 were validated on AGX Orin with both synthetic benchmarks and multimodal prefill benchmarks.

Notes for Reviewers

  • This PR does not reintroduce the Qwen3 Jetson BF16, W4A16, and W8A8 serving foundation added in feat(pymllm): support Qwen3 Jetson BF16, W4A16, and W8A8 serving #670. It builds on that work to improve performance, benchmarking, and memory profiling.
  • bench_one_batch is still a direct ModelRunner benchmark. Its results should not be mixed with HTTP serving metrics such as TTFT, ITL, TPOT, or E2E latency.
  • Server mode may need a higher mem_fraction_static than direct bench_one_batch because tokenizer, scheduler, detokenizer, HTTP workers, IPC buffers, and Jetson unified-memory pressure are part of the real serving process.
  • Qwen3-VL ViT, lm_head, embeddings, and LayerNorm remain outside the current W8A8 quantized scope.

Summary by CodeRabbit

  • Documentation

    • Major rewrites across setup, architecture, kernels, models, quantization, and usage with clearer setup steps, validation checklists, benchmark guidance, and Jetson Orin–specific instructions; README updates and launch/benchmark examples.
  • New Features

    • Multimodal benchmarking (image+text) with vision prefill timing and a correctness/smoke mode; improved benchmark CLI and profiling controls.
  • Performance

    • Fused M‑RoPE/Triton path and device-compatibility patch for norm kernels to improve stability and throughput; vision timing metrics.
  • Bug Fixes / Reliability

    • Better GPU memory discovery for integrated GPUs and richer KV-cache allocation diagnostics.
  • Tests

    • Expanded unit and CUDA tests for multimodal bench, memory profiling, sampling, and fused-path correctness.

jialilve and others added 19 commits April 30, 2026 02:19
Lift cfg.server.max_running_requests to max(batch_size) before creating
the ModelRunner, mirroring SGLang main() which sets
server_args.cuda_graph_max_bs = max(bench_args.batch_size).

In pymllm the CUDA graph capture set and req_to_token_pool size both
derive from ModelRunner.max_running_requests. Without this, sweeping a
batch size larger than the default capture set makes decode silently
fall off the graph path and run eager, biasing decode latency vs SGLang.
No effect on batch_size=1 runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The decode loop read req_pool_indices[i].item() and seq_lens[i].item()
per request, each a CUDA->CPU sync inside the timed decode region. SGLang
and pymllm's own production decode path (orchestrator/model_runner_process)
write the KV mapping without any per-request sync.

Replace the loop with a single vectorized scatter
req_to_token[req_pool_indices, seq_lens - 1] = out_cache_loc, removing
2*batch_size syncs per decode step. Negligible at batch_size=1; removes a
linearly growing, SGLang-absent bias for batch_size > 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a SGLang-style capacity guard: when batch_size exceeds
max_total_num_tokens // (input_len + output_len), log a skip and return
without running, instead of hard-failing mid-sweep on KV slot allocation.
run_benchmark now treats a None result as a skip and continues.

Mirrors ModelRunner.max_batch_size / latency_test_run_once in SGLang's
bench_one_batch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing#4)

Rework profiling to match SGLang's bench_one_batch:
- decode uses a single continuous profiler spanning
  [profile_start_step, profile_stop_step) -> one decode trace, instead of
  wrapping each step in its own profiler context (one file per step);
- default profile start step is output_len // 2 (was (output_len-1)//2);
- torch profiler runs with with_stack=True so kernels map to Python source;
- traces are written as .trace.json.gz;
- print the key_averages table on stop;
- support --profile-activities CUDA_PROFILER to drive nsys via
  cudaProfilerStart/Stop.

Replace the _maybe_profile context manager with start/stop helpers mirroring
SGLang's start_profile/stop_profile.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run_single_setting wrapped both stages in profiling unconditionally, so
--profile would also profile the non-recorded warmup run, wasting time and
emitting extra traces. Add an allow_profile flag (default True) and pass
allow_profile=False for the warmup call, matching SGLang which hardcodes
profile=False during warmup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add --correct (--correctness-test): load the tokenizer, encode real
prompts, run one full prefill at batch_size=1, greedy-decode output_len
tokens, and print the decoded text. main() branches to run_correctness
when the flag is set.

This is the single-stage variant. SGLang's --correct additionally does a
cut_len two-stage prefill to exercise prefix-KV reuse; with greedy decoding
the printed per-prompt text is identical either way. The two-stage variant
can be layered on later (prepare_forward_batch_extend already accepts
extend_prefix_lens > 0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite the pymllm_runtime web docs for accuracy and readability.

- setup_and_usage: condense overview, drop docker-only assumption,
  fix launch args (mem_fraction_static/max_total_tokens/dtype defaults),
  refresh bench_one_batch (text/vit/multimodal modes) and profile
  (torch.profiler supported, nsys experimental); align mem_fraction_static
  semantics with the model-weights+KV-pool static budget
- runtime_design / models_and_quantization / kernels_and_acceleration /
  developer_guide: humanize prose, trim AI-style scaffolding, keep content

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Slim down README.md and README-ZH.md to match the pymllm_runtime
setup_and_usage doc: fix launch args (drop stale flags, mem_fraction_static
0.8 / max_total_tokens 4096), align mem_fraction_static semantics, refresh
bench_one_batch (text / vit / multimodal modes) and profile guidance, and
drop redundant status / forward-path / microbench / known-limitations
sections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d9380c7-75bc-4add-a62f-4ae5dc4ca9e9

📥 Commits

Reviewing files that changed from the base of the PR and between 32662c0 and 9ed367e.

📒 Files selected for processing (1)
  • README.md

📝 Walkthrough

Walkthrough

Adds Triton‑gated fused M‑RoPE and FlashInfer device‑property patching, integrated‑GPU memory profiling and richer KV diagnostics, a greedy sampling fast path, multimodal benchmarking and correctness mode, synchronized vision timing/ViT metrics, broad docs/README updates, and matching tests.

Changes

Runtime & Acceleration

Layer / File(s) Summary
Fused M‑RoPE and FlashInfer norm patch
pymllm/layers/rope.py, pymllm/layers/rms_norm.py, pymllm/layers/__init__.py
Add Triton‑gated fused M‑RoPE kernel with PyTorch fallback, export apply_mrope_fused_, and monkeypatch CUDA device properties for FlashInfer RMSNorm paths.
Qwen3‑VL RoPE wiring and vision timing
pymllm/models/qwen3_vl.py, pymllm/tests/test_qwen3_vl_deepstack.py
Wire attention to fused RoPE with cache dtype/device alignment, add synchronized wall‑clock vision timing helper, record ViT prefill metrics, and add timing tests.
ModelRunner memory profiling (integrated GPU fallback)
pymllm/executor/model_runner.py
Add system‑memory fallback (psutil/os.sysconf) for integrated devices, capture pre‑model‑load available GB, introduce MemoryProfileResult, and improve KV‑cache allocation diagnostics.
Sampling optimization: is_all_greedy
pymllm/executor/model_runner.py, pymllm/orchestrator/model_runner_process.py, pymllm/tests/test_model_runner_sampling.py
Introduce is_all_greedy argument to ModelRunner.sample() and thread it through scheduler to avoid tensor reductions when all requests are greedy; add CPU test.

Multimodal Benchmarking & Correctness

Layer / File(s) Summary
Multimodal contracts and metrics
pymllm/bench_one_batch.py, pymllm/tests/test_bench_one_batch.py
Add ExtendResult, MultimodalBenchInput, MultimodalProcessorBundle, multimodal input construction, image token handling, and ViT/multimodal metric helpers with tests.
Runner extend API & profiling refactor
pymllm/bench_one_batch.py
Update PymllmBenchRunner.extend() to accept multimodal tensors and benchmark_vision_timing; replace profiling context with explicit _start_profile/_stop_profile supporting torch.profiler and CUDA profiler (nsys).
Execution loop, KV writes, correctness mode
pymllm/bench_one_batch.py
Vectorize decode KV writes, add _max_batch_size_for() capacity skips, adjust generate_settings() for image mode defaults, and implement tokenizer‑driven correctness test flow; update trace filename extension.

Documentation & READMEs

Layer / File(s) Summary
Runtime design and transport modes
docs/pymllm_runtime/runtime_design.rst
Rewrite architecture overview, request lifecycle, ModelRunner init/forward modes, and complete tensor_transport_mode behaviors.
Kernels and quantization docs
docs/pymllm_runtime/kernels_and_acceleration.rst, docs/pymllm_runtime/models_and_quantization.rst
Clarify attention backend scope, CUDA Graph replay semantics, W4A16/W8A8 execution and CUTLASS requirements, layout conversion timing, and LinearMethod lifecycle.
Setup, usage, developer guide
docs/pymllm_runtime/setup_and_usage.rst, docs/pymllm_runtime/developer_guide.rst
Add explicit install recipes (triton/flashinfer), document W8A8 CUTLASS JIT cache behavior, rewrite benchmark guidance for multimodal modes, and provide structured contribution guidance.
README EN/ZH modernization
pymllm/README.md, pymllm/README-ZH.md
Consolidate launch entry (pymllm.server.launch), update environment/version tables, refresh examples (absolute image path rules), emphasize Jetson Orin/INT8/BF16 support, and expand benchmark documentation.

Tests

Layer / File(s) Summary
Memory, RoPE, sampling, norm tests
pymllm/tests/test_model_runner_memory_pool.py, pymllm/tests/test_mrope_triton.py, pymllm/tests/test_model_runner_sampling.py, pymllm/tests/test_rms_norm.py
Unit tests for memory profiling, integrated GPU behavior, fused M‑RoPE numerical correctness and in‑place semantics, sampling greedy fast path, and device‑property patching.
Bench multimodal & decode tests
pymllm/tests/test_bench_one_batch.py
Tests for multimodal CLI/setting generation, ViT/multimodal metrics, multimodal input shaping (pad/truncate), extend()/decode() integration, and profile trace filename behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • oreomaker
  • liang1232018
  • yirongjie
  • chenghuaWang

Poem

In carrot code I nibble lines of speed,
I weave a rope where Triton does the deed,
I count the tokens, guard the memory pool,
Greedy hops skip work — that’s my tiny rule,
🐇 — a rabbit cheering Orin’s lead.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.28% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main performance optimization changes focused on Qwen3-VL Jetson decode/prefill and benchmark refinements.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering summary, detailed changes, implementation notes, validation, and reviewer guidance.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chenghuaWang chenghuaWang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
pymllm/executor/model_runner.py (1)

142-153: ⚡ Quick win

Add a docstring for MemoryProfileResult.

This new class is part of runtime diagnostics and should document field meaning/units to keep error reporting maintainable.

Suggested update
 `@dataclass`
 class MemoryProfileResult:
+    """Snapshot of KV-cache memory profiling inputs and computed limits.
+
+    All memory values are in GiB unless otherwise noted.
+    """
     pre_model_available_gb: float
     available_gb: float
     mem_fraction: float

As per coding guidelines, “Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pymllm/executor/model_runner.py` around lines 142 - 153, Add a clear
docstring to the MemoryProfileResult dataclass explaining its purpose (runtime
memory profiling result used for token budgeting) and document each field with
short descriptions and units: pre_model_available_gb (GB available before model
allocation), available_gb (GB available after model allocation), mem_fraction
(fraction of available memory to use), static_kv_budget_gb (GB reserved for
static KV cache), cell_size_bytes (bytes per model cell), profiled_max_tokens
(max tokens observed during profiling), requested_max_total_tokens (optional
requested max total tokens), effective_max_tokens (final token limit used), and
gdn_pool_gb (GB for GDN pool, default 0.0); keep it concise and place it
immediately above the MemoryProfileResult declaration in model_runner.py.

Source: Coding guidelines

pymllm/layers/rope.py (1)

549-575: 💤 Low value

Fallback path does not modify tensors in-place despite _ suffix convention.

The trailing underscore in apply_mrope_fused_ signals in-place modification, but the fallback branch (lines 568-575) returns new tensors from apply_mrope rather than modifying the inputs in-place. While the current callers correctly use the returned values, this inconsistency could cause subtle bugs if callers assume in-place semantics based on the naming convention.

Consider either:

  1. Documenting this clearly in the docstring, or
  2. Copying fallback results back into the original tensors for consistent in-place behavior
📝 Option 2: Make fallback truly in-place
     if not _can_use_mrope_fused(q, k, positions, cos_sin_cache):
-        return apply_mrope(
+        q_out, k_out = apply_mrope(
             q,
             k,
             positions,
             cos_sin_cache,
             mrope_section,
             mrope_interleaved,
         )
+        q.copy_(q_out)
+        k.copy_(k_out)
+        return q, k
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pymllm/layers/rope.py` around lines 549 - 575, The fallback branch in
apply_mrope_fused_ currently returns new tensors from apply_mrope instead of
performing in-place updates despite the trailing underscore; change the fallback
to call apply_mrope(q, k, positions, cos_sin_cache, mrope_section,
mrope_interleaved), then copy the returned tensors back into the original q and
k (e.g., using q.copy_(returned_q) and k.copy_(returned_k)) and finally return
the original q and k so callers see in-place semantics; ensure you preserve
device/dtype/shape expectations and use the unique symbols apply_mrope_fused_,
apply_mrope, q, and k to locate the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pymllm/bench_one_batch.py`:
- Around line 1218-1222: DEFAULT_CORRECTNESS_PROMPTS contains a typo; update the
second tuple entry in DEFAULT_CORRECTNESS_PROMPTS from "The capital of the
United Kindom is" to "The capital of the United Kingdom is" so the prompt reads
correctly; modify the string literal in the DEFAULT_CORRECTNESS_PROMPTS tuple to
the corrected spelling.

---

Nitpick comments:
In `@pymllm/executor/model_runner.py`:
- Around line 142-153: Add a clear docstring to the MemoryProfileResult
dataclass explaining its purpose (runtime memory profiling result used for token
budgeting) and document each field with short descriptions and units:
pre_model_available_gb (GB available before model allocation), available_gb (GB
available after model allocation), mem_fraction (fraction of available memory to
use), static_kv_budget_gb (GB reserved for static KV cache), cell_size_bytes
(bytes per model cell), profiled_max_tokens (max tokens observed during
profiling), requested_max_total_tokens (optional requested max total tokens),
effective_max_tokens (final token limit used), and gdn_pool_gb (GB for GDN pool,
default 0.0); keep it concise and place it immediately above the
MemoryProfileResult declaration in model_runner.py.

In `@pymllm/layers/rope.py`:
- Around line 549-575: The fallback branch in apply_mrope_fused_ currently
returns new tensors from apply_mrope instead of performing in-place updates
despite the trailing underscore; change the fallback to call apply_mrope(q, k,
positions, cos_sin_cache, mrope_section, mrope_interleaved), then copy the
returned tensors back into the original q and k (e.g., using q.copy_(returned_q)
and k.copy_(returned_k)) and finally return the original q and k so callers see
in-place semantics; ensure you preserve device/dtype/shape expectations and use
the unique symbols apply_mrope_fused_, apply_mrope, q, and k to locate the code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f479274-833c-43ff-a40b-3d4c75a9071b

📥 Commits

Reviewing files that changed from the base of the PR and between 729ca4c and d442aae.

⛔ Files ignored due to path filters (2)
  • bench_assets/two_cats.jpg is excluded by !**/*.jpg
  • bench_assets/two_cats_480p.jpg is excluded by !**/*.jpg
📒 Files selected for processing (20)
  • docs/pymllm_runtime/developer_guide.rst
  • docs/pymllm_runtime/kernels_and_acceleration.rst
  • docs/pymllm_runtime/models_and_quantization.rst
  • docs/pymllm_runtime/runtime_design.rst
  • docs/pymllm_runtime/setup_and_usage.rst
  • pymllm/README-ZH.md
  • pymllm/README.md
  • pymllm/bench_one_batch.py
  • pymllm/executor/model_runner.py
  • pymllm/layers/__init__.py
  • pymllm/layers/rms_norm.py
  • pymllm/layers/rope.py
  • pymllm/models/qwen3_vl.py
  • pymllm/orchestrator/model_runner_process.py
  • pymllm/tests/test_bench_one_batch.py
  • pymllm/tests/test_model_runner_memory_pool.py
  • pymllm/tests/test_model_runner_sampling.py
  • pymllm/tests/test_mrope_triton.py
  • pymllm/tests/test_qwen3_vl_deepstack.py
  • pymllm/tests/test_rms_norm.py

Comment thread pymllm/bench_one_batch.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
README.md (1)

40-40: ⚡ Quick win

Add alt text to images for accessibility.

The embedded images are missing alt text attributes, which are essential for screen readers and accessibility compliance. Consider adding descriptive alt text.

♻️ Proposed fix to add alt text
 <div align="center">
-  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%">
+  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" alt="Jetson Orin pymllm vs llama.cpp speedup summary (input_len=2048)" width="90%">
 </div>

 <div align="center">
-  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%">
+  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" alt="Jetson Orin pymllm prefill throughput (input_len=2048)" width="90%">
 </div>

Also applies to: 44-44

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 40, Image tags in README lack alt text which harms
accessibility; update each <img
src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%"> (and
the other similar <img> entries) to include a meaningful alt attribute (e.g.,
alt="pymllm Jetson speedup summary") by editing the README image elements so
they read with an alt="..." describing the image content for screen readers.
README-ZH.md (1)

41-41: ⚡ Quick win

Add alt text to images for accessibility.

The embedded images are missing alt text attributes, which are essential for screen readers and accessibility compliance. Consider adding descriptive alt text.

♻️ Proposed fix to add alt text
 <div align="center">
-  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" width="90%">
+  <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" alt="Jetson Orin pymllm vs llama.cpp 加速比汇总 (input_len=2048)" width="90%">
 </div>

 <div align="center">
-  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" width="90%">
+  <img src="./assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg" alt="Jetson Orin pymllm prefill 吞吐率 (input_len=2048)" width="90%">
 </div>

Also applies to: 45-45

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README-ZH.md` at line 41, The image tags referencing
"./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" (and the other image at
the same block) lack alt text; update the <img> elements in README-ZH.md (the
img referencing that filename) to include a descriptive alt="..." attribute
(e.g., alt="PyMLLM on Jetson: speedup summary") for screen-reader accessibility
and repeat for the similar image at the other location.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@README-ZH.md`:
- Line 41: The image tags referencing
"./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg" (and the other image at
the same block) lack alt text; update the <img> elements in README-ZH.md (the
img referencing that filename) to include a descriptive alt="..." attribute
(e.g., alt="PyMLLM on Jetson: speedup summary") for screen-reader accessibility
and repeat for the similar image at the other location.

In `@README.md`:
- Line 40: Image tags in README lack alt text which harms accessibility; update
each <img src="./assets/jetson/pymllm-jetson-speedup-summary-2048.jpg"
width="90%"> (and the other similar <img> entries) to include a meaningful alt
attribute (e.g., alt="pymllm Jetson speedup summary") by editing the README
image elements so they read with an alt="..." describing the image content for
screen readers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d93aebb8-6766-4872-8795-f7d5a1c42927

📥 Commits

Reviewing files that changed from the base of the PR and between d442aae and f346552.

⛔ Files ignored due to path filters (2)
  • assets/jetson/pymllm-jetson-prefill-throughput-2048.jpg is excluded by !**/*.jpg
  • assets/jetson/pymllm-jetson-speedup-summary-2048.jpg is excluded by !**/*.jpg
📒 Files selected for processing (2)
  • README-ZH.md
  • README.md

@chenghuaWang chenghuaWang merged commit 4f3ed31 into UbiquitousLearning:main Jun 8, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants