Add EPYC CPU serving skill (vLLM + zentorch) by amd-lalithnc · Pull Request #76 · amd/skills

amd-lalithnc · 2026-06-24T12:17:00Z

What

Adds serving-llms-on-epyc: a skill that brings up a single vLLM OpenAI endpoint on an
AMD EPYC CPU host with the zentorch backend, in a container (Docker/Podman) or a conda env.

Flow

Detect the CPU (detect.py): vendor, EPYC generation + Zen arch, AVX-512 support, physical cores,
NUMA layout, RAM.
Validate the environment (validate.py): container runtime (docker/podman) or conda fallback;
image present and, if already pulled, a check that import vllm, zentorch succeeds inside it;
host perf libraries (tcmalloc / OpenMP via LD_PRELOAD); HF_TOKEN; RAM.
Resolve + check the model (check_model.py): confirm vLLM supports the architecture via its
model registry (text or multimodal); reject pooling / non-LLM models, which cannot back a chat
endpoint. Gated models require HF_TOKEN + license acceptance.
Check RAM fit (estimate_memory.py): weights + KV cache + headroom ≤ host RAM.
Size the runtime from the hardware (cpu_tune.py): bind to socket 0's physical cores and
set VLLM_CPU_KVCACHE_SPACE; no memory binding by default (NPS2/NPS4 hosts get a perf note).
Confirm: present the sized plan and wait for the user to confirm before launching.
Launch: vllm serve (never pass --device cpu on vLLM ≥ 0.20).
Verify + hand over: poll /health, validate the /v1/chat/completions endpoint, then print a
connection table.
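
The RAM-fit step above (weights + KV cache + headroom ≤ host RAM) can be sketched roughly as follows. This is an illustration only, not the contents of estimate_memory.py: the names MemoryPlan, estimate_weights_gib and plan_memory, the 2-bytes-per-parameter bf16 weight assumption, and the 8 GiB default headroom are all hypothetical.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass
class MemoryPlan:
    """Hypothetical container for the three terms of the fit check."""
    weights_gib: float
    kv_cache_gib: float
    headroom_gib: float
    host_ram_gib: float

    @property
    def required_gib(self) -> float:
        # weights + KV cache + headroom
        return self.weights_gib + self.kv_cache_gib + self.headroom_gib

    @property
    def fits(self) -> bool:
        # The plan fits when the total does not exceed host RAM.
        return self.required_gib <= self.host_ram_gib


def estimate_weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Weight footprint in GiB; 2 bytes/param assumes bf16 weights."""
    return num_params * bytes_per_param / GIB


def plan_memory(num_params: float, kv_cache_gib: float,
                host_ram_gib: float, headroom_gib: float = 8.0) -> MemoryPlan:
    """Build a plan; the 8 GiB headroom default is an assumed placeholder."""
    return MemoryPlan(
        weights_gib=estimate_weights_gib(num_params),
        kv_cache_gib=kv_cache_gib,
        headroom_gib=headroom_gib,
        host_ram_gib=host_ram_gib,
    )
```

For example, plan_memory(8e9, kv_cache_gib=40, host_ram_gib=128).fits is true under these assumptions, while a 70e9-parameter model with the same settings does not fit.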

Single instance. On any failure it reports the cause + logs and stops: no retry, no debugging loop.
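
A minimal sketch of the "poll /health, then stop on failure" behavior, assuming a fixed deadline and polling interval. The helper names http_ok and wait_until_healthy, the 600 s deadline, and the 5 s interval are assumptions for illustration; the skill's actual verification code may differ.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and non-2xx responses all count as "not ready".
        return False


def wait_until_healthy(probe: Callable[[], bool],
                       deadline_s: float = 600.0,
                       interval_s: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll probe() until it succeeds or the deadline passes.

    Returns False on timeout; the caller reports and stops rather than relaunching.
    """
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A caller might use wait_until_healthy(lambda: http_ok("http://localhost:8000/health")), where port 8000 is an assumed default, and on a False result print the cause and logs and exit without relaunching.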

Notes / scope

Uses the amdih/zendnn_zentorch image on Docker Hub.
KV cache is bf16-only on zentorch CPU; TORCHINDUCTOR_FREEZING=1 requires VLLM_USE_AOT_COMPILE=0.
OMP_NUM_THREADS and VLLM_CPU_NUM_OF_RESERVED_CPU are intentionally left unset — vLLM derives
them (from the bind list / its own default).
NUMA default: socket 0's physical cores, no memory binding.
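
The notes above can be illustrated with a small sketch that derives a bind list from socket 0's physical cores and builds the launch environment. It assumes the "bind list" is passed through vLLM's VLLM_CPU_OMP_THREADS_BIND variable and that topology rows come from something like lscpu -p=CPU,CORE,SOCKET; the function names are hypothetical and this is not the contents of cpu_tune.py.

```python
from typing import Dict, Iterable, List, Tuple


def socket0_physical_cpus(topology: Iterable[Tuple[int, int, int]]) -> List[int]:
    """Pick one logical CPU per physical core on socket 0.

    Each row is (logical_cpu, core_id, socket_id); SMT siblings share a core_id.
    """
    first_cpu_per_core: Dict[int, int] = {}
    for cpu, core, socket in sorted(topology):
        if socket == 0 and core not in first_cpu_per_core:
            first_cpu_per_core[core] = cpu  # lowest-numbered sibling wins
    return sorted(first_cpu_per_core.values())


def to_range_string(cpus: List[int]) -> str:
    """Compress a sorted CPU list into a range string such as '0-3,8'."""
    if not cpus:
        return ""
    parts: List[str] = []
    start = prev = cpus[0]
    for cpu in cpus[1:]:
        if cpu != prev + 1:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = cpu
        prev = cpu
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def build_env(topology: Iterable[Tuple[int, int, int]], kv_cache_gib: int) -> Dict[str, str]:
    """Environment for the launch; OMP_NUM_THREADS and
    VLLM_CPU_NUM_OF_RESERVED_CPU are deliberately not set here."""
    return {
        "VLLM_CPU_OMP_THREADS_BIND": to_range_string(socket0_physical_cpus(topology)),
        "VLLM_CPU_KVCACHE_SPACE": str(kv_cache_gib),  # interpreted by vLLM in GiB
        "TORCHINDUCTOR_FREEZING": "1",
        "VLLM_USE_AOT_COMPILE": "0",  # paired with freezing, per the note above
    }
```

No memory-binding setting appears in the returned environment, matching the stated default.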

Testing

Structural gate (check.sh): passes (0 errors).
Behavioral eval (LLM-judged, sonnet): 13/13. Exercises detect → validate → check_model →
estimate → cpu_tune → confirm, plus the guardrails. Live launch/serve belongs to the manual /
integration tier on a real EPYC host and is not covered by this eval.

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: I1dc2362e0983326658b6618015a161ecd44f40e6

danielholanda · 2026-06-25T22:24:52Z

@Mahdi-CV Can you help review this?

add serving-llms-on-epyc skill (vLLM + zentorch CPU serving)

f62dd74

danielholanda requested a review from Mahdi-CV June 25, 2026 22:24

amd-lalithnc wants to merge 1 commit into amd:main from amd-lalithnc:add-serving-llms-on-epyc
