feat: add vLLM prefix cache and preemption metrics by puneeshkhanna · Pull Request #2843 · NVIDIA-NeMo/RL

puneeshkhanna · 2026-06-16T12:58:12Z

What does this PR do ?

Tracks prefix_cache_queries, prefix_cache_hits, prefix_cache_hit_rate, and num_preemptions as windowed deltas from vLLM cumulative counters. Also adds optional scalar aggregates (mean/max/p95) and a toggle to disable heavy timeline image plots in wandb logging.
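
A minimal sketch of the windowed-delta idea, for readers who want to see the arithmetic. This is not the code in this PR: `WindowedCounters` and `hit_rate` are hypothetical names, and treating the first reading as a zero-delta baseline is an assumption. Each snapshot of the cumulative counters is subtracted from the previous one, and the hit rate is computed from the deltas of the same window.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WindowedCounters:
    """Turns cumulative counter readings into per-window deltas (hypothetical helper)."""

    prev: dict[str, float] = field(default_factory=dict)

    def update(self, cumulative: dict[str, float]) -> dict[str, float]:
        deltas = {}
        for name, value in cumulative.items():
            # Assumption: the first reading only sets the baseline, so its delta is 0.
            deltas[name] = value - self.prev.get(name, value)
            self.prev[name] = value
        return deltas


def hit_rate(deltas: dict[str, float]) -> float:
    """Hits divided by queries within one window; 0.0 when the window saw no queries."""
    queries = deltas.get("prefix_cache_queries", 0.0)
    hits = deltas.get("prefix_cache_hits", 0.0)
    return hits / queries if queries > 0 else 0.0


tracker = WindowedCounters()
# First snapshot establishes the baseline.
tracker.update({"prefix_cache_queries": 100, "prefix_cache_hits": 40, "num_preemptions": 2})
# Second snapshot yields the activity inside the window.
window = tracker.update({"prefix_cache_queries": 300, "prefix_cache_hits": 190, "num_preemptions": 3})
print(window, hit_rate(window))
```

Computing the rate from per-window deltas rather than from the cumulative totals means each point reflects only recent traffic, so a change in cache behaviour mid-run is not averaged away by earlier history.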

Issues

NA

Usage

Add the following config parameters to your YAML:

vllm_cfg:
    # --- vLLM metrics logger ---
    enable_vllm_metrics_logger: true
    vllm_metrics_logger_interval: 1.0   # seconds between metric snapshots
    # Log per-step scalar aggregates (mean/max/p95) as W&B line charts, trackable across steps.
    vllm_metrics_log_scalars: true
    # Heavy per-step matplotlib IMAGE plots (one figure per metric per step), slow to upload/render.
    # Off: rely on the lightweight scalars above. Set true only if you want the per-worker timelines.
    vllm_metrics_log_timeline_plots: false

This gives you prefix cache hit rate, preemption counts, generation tokens, KV cache usage, etc. as lightweight scalar line charts in W&B, without the heavy image plots.
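
As a rough illustration of what a per-step scalar aggregate could look like, the sketch below reduces one metric's snapshots to mean/max/p95. It is not the implementation in this PR: `percentile` and `scalar_aggregates` are hypothetical helpers, the `name/stat` key layout is an assumption, and the p95 uses linear interpolation between ranks, which may differ from what the PR computes.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks, q in [0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sequence")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    # Interpolate between the two neighbouring samples.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def scalar_aggregates(name: str, samples: list[float]) -> dict[str, float]:
    """Reduce one metric's snapshots for a step to three scalars (hypothetical key layout)."""
    if not samples:
        return {}  # nothing was recorded in this step, so log nothing
    return {
        f"{name}/mean": sum(samples) / len(samples),
        f"{name}/max": max(samples),
        f"{name}/p95": percentile(samples, 95),
    }


# Five snapshots of a metric taken during one step.
aggregates = scalar_aggregates("num_preemptions", [1.0, 2.0, 3.0, 4.0, 5.0])
print(aggregates)
```

Three numbers per metric per step are cheap to log and plot as line charts, which is the trade-off against the per-worker timeline images described above.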

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

I think that no new tests are required to cover this.
W&B plots from a single step on a small model (Qwen3-1.7B):

copy-pr-bot · 2026-06-16T12:58:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

puneeshkhanna · 2026-06-16T15:50:23Z

/review-pr --deep

Branch: feat/vllm_prefix_counters (vs main)
Files changed: 5

--- Findings (scored >= 80) ---
None.

--- Filtered (scored < 80) ---
8 low-confidence issues omitted:
- [BUG 5] vllm_worker_async.py:405: Wrong vLLM metric names
  → FALSE POSITIVE. Verified against vLLM 0.20.0 (project target): names are correct. Subagent was checking vLLM 0.9.1.
- [GUIDELINE 15] Keyword-only args for two adjacent bools
  → Not genuinely confusable; all callers already use keyword syntax.
- [GUIDELINE ~30] Config .get() defaults at call sites
  → Pre-existing pattern for all vLLM metrics logger fields.
- [BUG ~20] if val: truthiness drops empty lists
  → Pre-existing pattern, correct behavior for this use case.
- [BUG ~10] numpy import missing
  → Already imported at line 22.
- [BUG ~10] stop_event unused
  → Pre-existing, not introduced by this diff.
- [BUG ~15] Gauge/counter lock gap
  → Pre-existing pattern.
- [BUG ~10] Negative delta clamping
  → By design, documented in docstring.

LGTM: deep review found no actionable issues. All candidates were either false positives or pre-existing patterns consistent with surrounding code. The vLLM metric names are confirmed correct for the target version (0.20.0).
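
For context on the "Negative delta clamping" item, here is a minimal sketch of what clamping a windowed delta at zero can look like. The docstring referenced in the review is not reproduced here, so both the helper name `clamped_delta` and the motivating scenario (a cumulative counter that restarts from a lower value) are assumptions, not a description of the PR's code.

```python
from __future__ import annotations


def clamped_delta(prev: float | None, current: float) -> float:
    """Windowed delta that never goes negative (hypothetical helper)."""
    if prev is None:
        return 0.0  # no baseline yet
    # If the counter moved backwards, report 0 rather than a negative count.
    return max(current - prev, 0.0)


# Assumed scenario: the counter restarts between the 2nd and 3rd reading.
readings = [10.0, 25.0, 3.0, 8.0]
prev = None
deltas = []
for reading in readings:
    deltas.append(clamped_delta(prev, reading))
    prev = reading
print(deltas)
```

The window that spans the restart is reported as zero, which undercounts that one interval but keeps every logged value non-negative.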

Linting and formatting

All previously identified issues have been fixed:

┌────────────────────────────────────────────────────┬────────┐
│ Check │ Status │
├────────────────────────────────────────────────────┼────────┤
│ Naming (_k/_v → key/val) │ Fixed │
├────────────────────────────────────────────────────┼────────┤
│ Reflection (getattr/setattr → _prev_counters dict) │ Fixed │
├────────────────────────────────────────────────────┼────────┤
│ Type syntax (Optional[int] → int | None) │ Fixed │
├────────────────────────────────────────────────────┼────────┤
│ 4-space indentation │ Clean │
├────────────────────────────────────────────────────┼────────┤
│ snake_case naming throughout │ Clean │
├────────────────────────────────────────────────────┼────────┤
│ Google-style docstrings │ Clean │
├────────────────────────────────────────────────────┼────────┤
│ No unexplained commented-out code │ Clean │
├────────────────────────────────────────────────────┼────────┤
│ No reflection │ Clean │
└────────────────────────────────────────────────────┴────────┘

Track prefix_cache_queries, prefix_cache_hits, prefix_cache_hit_rate, and num_preemptions as windowed deltas from vLLM cumulative counters. Add optional per-engine scalar aggregates (mean/max/p95) and a toggle to disable heavy timeline image plots in wandb logging. Signed-off-by: Puneesh Khanna <puneesh.khanna@tii.ae>

puneeshkhanna requested review from a team as code owners June 16, 2026 12:58

github-actions Bot added the community-request label Jun 16, 2026

puneeshkhanna force-pushed the feat/vllm_prefix_counters branch from e3a6909 to d86e4e3 Compare June 17, 2026 05:32

puneeshkhanna force-pushed the feat/vllm_prefix_counters branch from d86e4e3 to fb0293b Compare June 17, 2026 05:35

svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 18, 2026

puneeshkhanna wants to merge 1 commit into NVIDIA-NeMo:main from puneeshkhanna:feat/vllm_prefix_counters
