
feat: add vLLM prefix cache and preemption metrics#2843

Open
puneeshkhanna wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
puneeshkhanna:feat/vllm_prefix_counters

Conversation

@puneeshkhanna


What does this PR do?

Tracks prefix_cache_queries, prefix_cache_hits, prefix_cache_hit_rate, and num_preemptions as windowed deltas computed from vLLM's cumulative counters. Also adds optional scalar aggregates (mean/max/p95) and a toggle to disable the heavy timeline image plots in wandb logging.
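
To illustrate what "windowed deltas" means here, below is a minimal sketch and not the code from this PR. The class name `CounterWindow`, the helper `hit_rate`, and the choice to treat the first reading as a baseline are assumptions made for illustration; only the `_prev_counters` dict and the clamping of negative deltas are mentioned elsewhere in this conversation.

```python
from __future__ import annotations


class CounterWindow:
    """Turn cumulative counter readings into per-window deltas (illustrative only)."""

    def __init__(self) -> None:
        # Last cumulative value seen for each counter name.
        self._prev_counters: dict[str, float] = {}

    def delta(self, name: str, cumulative: float) -> float:
        """Return the increase since the previous reading, clamped at zero."""
        # Assumption: the first reading only sets the baseline, so its delta is 0.
        prev = self._prev_counters.get(name, cumulative)
        self._prev_counters[name] = cumulative
        # A counter that goes backwards (for example after a reset) would give
        # a negative delta; clamp it so the window reports 0 instead.
        return max(cumulative - prev, 0.0)


def hit_rate(hits: float, queries: float) -> float | None:
    """Hit rate for one window, or None when the window saw no queries."""
    return hits / queries if queries > 0 else None


window = CounterWindow()
window.delta("prefix_cache_queries", 100)   # baseline reading
window.delta("prefix_cache_hits", 40)       # baseline reading
queries = window.delta("prefix_cache_queries", 160)
hits = window.delta("prefix_cache_hits", 70)
rate = hit_rate(hits, queries)
```

Computing the rate from the two deltas, rather than from the cumulative totals, keeps the reported hit rate specific to the most recent window.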

Issues

NA

Usage

Add the following config parameters to your YAML:

vllm_cfg:
    # --- vLLM metrics logger ---
    enable_vllm_metrics_logger: true
    vllm_metrics_logger_interval: 1.0   # seconds between metric snapshots
    # Log per-step scalar aggregates (mean/max/p95) as W&B line charts, trackable across steps.
    vllm_metrics_log_scalars: true
    # Heavy per-step matplotlib IMAGE plots (one figure per metric per step); slow to upload/render.
    # Off: rely on the lightweight scalars above. Set true only if you want the per-worker timelines.
    vllm_metrics_log_timeline_plots: false

This gives you prefix cache hit rate, preemption counts, generation tokens, KV cache usage, etc. as lightweight scalar line charts in W&B, without the heavy image plots.
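
As a rough sketch of what the per-step scalar aggregates could look like, the snippet below reduces one step's metric samples to mean/max/p95. The function name `summarize` and the linear-interpolation percentile are assumptions for illustration; the PR's actual aggregation code is not shown in this conversation and may differ (for example, by using numpy).

```python
from __future__ import annotations

import math


def summarize(samples: list[float]) -> dict[str, float]:
    """Reduce one step's samples to mean, max, and p95 (illustrative only)."""
    if not samples:
        # Nothing was sampled during this step, so there is nothing to log.
        return {}
    ordered = sorted(samples)
    # 95th percentile by linear interpolation between the two nearest ranks.
    rank = 0.95 * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    p95 = ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
    return {
        "mean": sum(ordered) / len(ordered),
        "max": ordered[-1],
        "p95": p95,
    }


# Hypothetical KV cache usage samples collected during a single step.
kv_cache_usage = [0.10, 0.25, 0.40, 0.35, 0.20]
scalars = summarize(kv_cache_usage)
```

Each step then contributes three numbers per metric, which is what makes them cheap to plot as line charts across steps compared with one image per metric per step.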

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

I don't think new tests are required to cover this change.

W&B plots from a single step on a small model (Qwen3-1.7B):
[image]

@puneeshkhanna puneeshkhanna requested review from a team as code owners June 16, 2026 12:58
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@puneeshkhanna

Author

/review-pr --deep

Branch: feat/vllm_prefix_counters (vs main)
Files changed: 5

--- Findings (scored >= 80) ---
None.

--- Filtered (scored < 80) ---
8 low-confidence issues omitted:
- [BUG 5] vllm_worker_async.py:405: Wrong vLLM metric names
  → FALSE POSITIVE. Verified against vLLM 0.20.0 (project target): names are correct. Subagent was checking vLLM 0.9.1.
- [GUIDELINE 15] Keyword-only args for two adjacent bools
  → Not genuinely confusable; all callers already use keyword syntax.
- [GUIDELINE ~30] Config .get() defaults at call sites
  → Pre-existing pattern for all vLLM metrics logger fields.
- [BUG ~20] if val: truthiness drops empty lists
  → Pre-existing pattern, correct behavior for this use case.
- [BUG ~10] numpy import missing
  → Already imported at line 22.
- [BUG ~10] stop_event unused
  → Pre-existing, not introduced by this diff.
- [BUG ~15] Gauge/counter lock gap
  → Pre-existing pattern.
- [BUG ~10] Negative delta clamping
  → By design, documented in docstring.

LGTM — deep review found no actionable issues. All candidates were
either false positives or pre-existing patterns consistent with
surrounding code. The vLLM metric names are confirmed correct for
the target version (0.20.0).

Linting and formatting

All previously identified issues have been fixed:

┌────────────────────────────────────────────────────┬────────┐
│ Check │ Status │
├────────────────────────────────────────────────────┼────────┤
│ Naming (_k/_v → key/val) │ Fixed │
├────────────────────────────────────────────────────┼────────┤
│ Reflection (getattr/setattr → _prev_counters dict) │ Fixed │
├────────────────────────────────────────────────────┼────────┤
│ Type syntax (Optional[int] → int | None) │ Fixed │
├────────────────────────────────────────────────────┼────────┤
│ 4-space indentation │ Clean │
├────────────────────────────────────────────────────┼────────┤
│ snake_case naming throughout │ Clean │
├────────────────────────────────────────────────────┼────────┤
│ Google-style docstrings │ Clean │
├────────────────────────────────────────────────────┼────────┤
│ No unexplained commented-out code │ Clean │
├────────────────────────────────────────────────────┼────────┤
│ No reflection │ Clean │
└────────────────────────────────────────────────────┴────────┘

@puneeshkhanna puneeshkhanna force-pushed the feat/vllm_prefix_counters branch from e3a6909 to d86e4e3 Compare June 17, 2026 05:32
Track prefix_cache_queries, prefix_cache_hits, prefix_cache_hit_rate,
and num_preemptions as windowed deltas from vLLM cumulative counters.
Add optional per-engine scalar aggregates (mean/max/p95) and a toggle
to disable heavy timeline image plots in wandb logging.

Signed-off-by: Puneesh Khanna <puneesh.khanna@tii.ae>
@puneeshkhanna puneeshkhanna force-pushed the feat/vllm_prefix_counters branch from d86e4e3 to fb0293b Compare June 17, 2026 05:35
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 18, 2026

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond


2 participants