Add EPYC CPU serving skill (vLLM + zentorch) #76

Open
amd-lalithnc wants to merge 1 commit into amd:main from amd-lalithnc:add-serving-llms-on-epyc

@amd-lalithnc

What

Adds serving-llms-on-epyc: a skill that brings up a single vLLM OpenAI endpoint on an
AMD EPYC CPU host with the zentorch backend, in a container (Docker/Podman) or a conda env.

Flow

  1. Detect the CPU: vendor, EPYC generation + Zen arch, AVX-512, physical cores, NUMA, RAM (detect.py).
  2. Validate the environment (validate.py): container runtime (docker/podman) or conda fallback;
    whether the image is present and, if it is already pulled, whether import vllm, zentorch
    succeeds inside it; host perf libraries (tcmalloc / OpenMP via LD_PRELOAD); HF_TOKEN; RAM.
  3. Resolve + check the model (check_model.py): confirm vLLM supports the architecture via its
    model registry (text or multimodal); reject pooling and other non-LLM models, which cannot
    back a chat endpoint. Gated models require HF_TOKEN + license acceptance.
  4. Check RAM fit (estimate_memory.py): weights + KV cache + headroom ≤ host RAM.
  5. Size the runtime from the hardware (cpu_tune.py): bind to socket 0's physical cores and
    set VLLM_CPU_KVCACHE_SPACE; no memory binding by default (NPS2/NPS4 get a perf note).
  6. Confirm: present a sized plan and wait for the user to confirm before launching.
  7. Launch: run vllm serve (never pass --device cpu on vLLM ≥ 0.20).
  8. Verify + hand over: poll /health, validate the /v1/chat/completions endpoint, then print a
    connection table.
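
The RAM fit rule in step 4 (weights + KV cache + headroom ≤ host RAM) can be pictured as a small calculation. This is a minimal sketch and not the code in estimate_memory.py: the function name, the 2 bytes per parameter (bf16 weights), and the 10% headroom fraction are illustrative assumptions.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class FitResult:
    required_gib: float
    host_gib: float
    fits: bool


def check_ram_fit(num_params: float, kv_cache_gib: float, host_ram_gib: float,
                  bytes_per_param: int = 2, headroom_frac: float = 0.10) -> FitResult:
    """Return whether weights + KV cache + headroom fit in host RAM.

    bytes_per_param=2 assumes bf16 weights (an assumption, not from the PR).
    headroom_frac is a hypothetical safety margin, as a fraction of host RAM.
    """
    weights_gib = num_params * bytes_per_param / GIB
    headroom_gib = host_ram_gib * headroom_frac
    required = weights_gib + kv_cache_gib + headroom_gib
    return FitResult(required, host_ram_gib, required <= host_ram_gib)


if __name__ == "__main__":
    # Hypothetical 8B-parameter model, 16 GiB KV cache, 64 GiB host.
    print(check_ram_fit(8e9, 16, 64))
```

A result with fits=False would correspond to the skill reporting the shortfall and stopping before launch.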

Single instance. On any failure it reports the cause and the logs, then stops: no retry, no debugging loop.
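
The /health poll in step 8 combines naturally with the stop-on-failure rule: the server is probed until a deadline, and a timeout is reported as a failure rather than triggering a relaunch. This is a sketch under assumptions, not the skill's implementation; the function name, the 600 s timeout, the 5 s interval, and the injectable probe/sleep/clock parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(base_url: str, timeout_s: float = 600.0, interval_s: float = 5.0,
                    probe=None, sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll GET {base_url}/health until it answers 200 or the deadline passes.

    Returns True once healthy, False on timeout. The caller is expected to
    report the failure and stop; nothing here restarts the server.
    """
    def http_probe() -> bool:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or HTTP error: not ready yet.
            return False

    probe = probe or http_probe
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```

The probe, sleep, and clock arguments exist only so the loop can be exercised without a live server.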

Contents

  • SKILL.md, reference.md, skill-card.md, data/epyc.json
  • scripts: detect.py, validate.py, check_model.py, estimate_memory.py, cpu_tune.py
  • behavioral eval: eval/behavioral/tests/test_serving_llms_on_epyc.py
  • registered in .claude-plugin/marketplace.json (+ regenerated Cursor manifest)

Notes / scope

  • Uses the amdih/zendnn_zentorch image on Docker Hub.
  • KV cache is bf16-only on zentorch CPU; TORCHINDUCTOR_FREEZING=1 requires VLLM_USE_AOT_COMPILE=0.
  • OMP_NUM_THREADS and VLLM_CPU_NUM_OF_RESERVED_CPU are intentionally left unset; vLLM derives
    them (from the bind list / its own default).
  • NUMA default: socket 0's physical cores, no memory binding.
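
The notes above can be illustrated with a sketch of how a bind list and launch environment might be assembled. This is not the code in cpu_tune.py. Both function names are hypothetical, the input is assumed to be the text of `lscpu -p=CPU,CORE,SOCKET`, and VLLM_CPU_OMP_THREADS_BIND is assumed to be the variable that carries the bind list and to accept a comma-separated CPU list.

```python
def socket0_physical_cpus(lscpu_parse_output: str) -> list[int]:
    """Pick one logical CPU per physical core on socket 0.

    Expects the text of `lscpu -p=CPU,CORE,SOCKET`: comment lines start with
    '#', data lines look like '0,0,0'. SMT siblings of a core are skipped.
    """
    seen_cores = set()
    cpus = []
    for line in lscpu_parse_output.splitlines():
        if not line or line.startswith("#"):
            continue
        cpu, core, socket = (int(field) for field in line.split(","))
        if socket == 0 and core not in seen_cores:
            seen_cores.add(core)
            cpus.append(cpu)
    return sorted(cpus)


def build_env(cpus: list[int], kv_cache_gib: int, freezing: bool = True) -> dict[str, str]:
    """Build the launch environment described in the notes (illustrative only)."""
    env = {
        # Assumption: this variable carries the bind list.
        "VLLM_CPU_OMP_THREADS_BIND": ",".join(str(c) for c in cpus),
        "VLLM_CPU_KVCACHE_SPACE": str(kv_cache_gib),
    }
    if freezing:
        # Per the notes: freezing requires AOT compile to be disabled.
        env["TORCHINDUCTOR_FREEZING"] = "1"
        env["VLLM_USE_AOT_COMPILE"] = "0"
    # OMP_NUM_THREADS and VLLM_CPU_NUM_OF_RESERVED_CPU are deliberately not
    # set, and no memory binding is applied.
    return env
```

On a host with SMT enabled, the first function keeps only the first logical CPU seen for each socket-0 core, which matches the "physical cores" wording in the NUMA default.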

Testing

  • Structural gate (check.sh): passes (0 errors).
  • Behavioral eval (LLM-judged, sonnet): 13/13. It exercises detect → validate → check_model →
    estimate → cpu_tune → confirm, plus the guardrails. Live launch/serve is left to the manual /
    integration tier on a real EPYC host.

Change-Id: I1dc2362e0983326658b6618015a161ecd44f40e6

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
@danielholanda danielholanda requested a review from Mahdi-CV June 25, 2026 22:24
@danielholanda

Collaborator

@Mahdi-CV Can you help review this?

2 participants