Add hipGraph capture/replay for decode eval (gated to fp16) #5019
aditya-dl wants to merge 1 commit
Decode re-enqueues the per-token kernel sequence one launch at a time (program::eval generic_eval loop -> one hipExtModuleLaunchKernel per op), adding host dispatch overhead and a GPU-clock throttle from the per-launch bubbles.
Add the HIP-graph analog: capture the single-context eval kernel loop into a hipGraph once and replay it with a single hipGraphLaunch per subsequent eval. Opt-in via MIGRAPHX_ENABLE_HIPGRAPH (default off = byte-identical to the prior path).
The gpu context owns the hipGraph_t/hipGraphExec_t (RAII) and the begin/end_graph_capture / replay_graph / execute() entry points; program::eval routes the single-context path through context::execute() and caches the output arguments for the replay path (valid because static-shape decode reuses fixed device buffers). The type-erased context gains an execute() hook that by default just runs the loop.
Capture is gated to fp16: hipGraph capture/replay regresses low-bit-quantized (int4/fp4) decode substantially (up to ~2x slower than eager on discrete GPUs). fuse_mlir marks a program non-capturable when it contains any quantized/low-bit op (unpack_int4, unpack_fp4, dequantizelinear, quant_dot) -- a cheap pre-lowering instruction-name scan, recorded on the gpu context -- and context::is_graph_enabled() keeps those programs on the eager path while fp16 still captures. No added compile-time or per-token cost, and none when the feature is off.
Thank you for your contribution! Since this is an external pull request, a maintainer must review the PR and add the "ok-to-test" label if it is approved for testing.
So there is no reason to modify the Either way, see #4956, which implements hip graph and handles the case where the pointer changes.
Pull request overview
This PR introduces an opt-in HIP Graph capture/replay path for GPU execution to reduce per-token host dispatch overhead by capturing the single-context eval kernel sequence once and replaying it with a single hipGraphLaunch on subsequent evals.
Changes:
- Add HIP Graph state + capture/replay plumbing to the GPU context, exposed via a new `context::execute()` hook.
- Route single-context `program::eval()` through `context::execute()` and cache output arguments for the replay path.
- Add an MLIR-pass-time gate that disables capture when low-bit/quantized ops are present, plus a changelog entry for the new env var.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/targets/gpu/include/migraphx/gpu/context.hpp | Adds MIGRAPHX_ENABLE_HIPGRAPH, graph RAII state, capture/replay methods, and context::execute() implementation. |
| src/targets/gpu/fuse_mlir.cpp | Marks programs non-capturable when certain low-bit/quantized ops are present (gate for hipGraph path). |
| src/program.cpp | Routes single-context eval through context::execute() and caches results for replay. |
| src/include/migraphx/context.hpp | Extends the type-erased context interface with an optional execute() hook and a default eager implementation. |
| CHANGELOG.md | Documents the new MIGRAPHX_ENABLE_HIPGRAPH feature and gating behavior. |
```cpp
auto status = hipStreamEndCapture(get_stream().get(), &g);
if(status != hipSuccess or g == nullptr)
    return false;
captured_graph = hip_graph_ptr{g};
```
```cpp
    captured_graph.reset();
    return false;
}
graph_exec = hip_graph_exec_ptr{exec};
```
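The `hip_graph_ptr` and `hip_graph_exec_ptr` names in the snippets above suggest owning smart-pointer wrappers. A minimal sketch of that RAII pattern, assuming a `std::unique_ptr` with a custom deleter. `fake_graph`, `fake_graph_destroy`, `graph_deleter`, `graph_ptr`, and `build` are hypothetical stand-ins so the example compiles without the HIP runtime; real code would presumably call `hipGraphDestroy` or `hipGraphExecDestroy` in the deleter.

```cpp
#include <memory>

// Hypothetical stand-in for an opaque HIP handle such as hipGraph_t.
struct fake_graph
{
    int id;
};

// Counts destroy calls so the ownership behaviour is observable.
inline int& destroyed_count()
{
    static int n = 0;
    return n;
}

// Stand-in for a C-style destroy function.
inline void fake_graph_destroy(fake_graph* g)
{
    ++destroyed_count();
    delete g;
}

// Deleter object: lets unique_ptr release the handle automatically.
struct graph_deleter
{
    void operator()(fake_graph* g) const { fake_graph_destroy(g); }
};

using graph_ptr = std::unique_ptr<fake_graph, graph_deleter>;

// Takes ownership of a new handle; if a later step fails, reset() drops it
// so no half-built graph state is left behind.
inline bool build(graph_ptr& captured, bool instantiate_ok)
{
    captured = graph_ptr{new fake_graph{1}};
    if(!instantiate_ok)
    {
        captured.reset();
        return false;
    }
    return true;
}
```

With this shape, every early-return path releases the handle without an explicit destroy call at each exit.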
```cpp
// First eval: capture the loop into a graph. NOTE under hipStreamBeginCapture
// the kernel launches are RECORDED, not executed -- so run_kernels() here
// produces no output; we must launch the instantiated graph once to actually
// compute this first token. If capture/instantiate fails, fall back to a real
// eager run so the first token is still correct (and future evals stay eager
// since graph_exec remains null).
```
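The control flow that comment describes (record on the first eval, launch once, replay afterwards, fall back to eager on failure) can be sketched without HIP. This is an illustration under stated assumptions, not the PR's implementation: `mock_context`, `launch`, `replay`, and the `capture_ok` switch are hypothetical, and "recording" is modelled as storing callables.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GPU context with stream-capture semantics.
struct mock_context
{
    bool capture_ok;            // whether capture/instantiate "succeeds"
    bool capturing      = false;
    bool have_graph     = false;
    bool capture_failed = false;
    int graph_launches  = 0;
    std::vector<std::function<void()>> recorded;

    explicit mock_context(bool ok) : capture_ok(ok) {}

    // While capturing, a launch is only recorded; otherwise it runs now.
    void launch(std::function<void()> kernel)
    {
        if(capturing)
            recorded.push_back(std::move(kernel));
        else
            kernel();
    }

    // Stand-in for a single graph launch: runs every recorded kernel.
    void replay()
    {
        ++graph_launches;
        for(const auto& k : recorded)
            k();
    }

    void execute(const std::function<void()>& run_kernels)
    {
        if(have_graph)
        {
            replay(); // steady state: one graph launch per eval
            return;
        }
        if(!capture_failed)
        {
            capturing = true;
            run_kernels(); // launches are recorded, no output yet
            capturing = false;
            if(capture_ok)
            {
                have_graph = true;
                replay(); // launch once so the first eval has output
                return;
            }
            recorded.clear();
            capture_failed = true;
        }
        run_kernels(); // eager fallback: launches execute immediately
    }
};
```

The point of the sketch is the ordering: the callback runs once under capture, and output only appears after the explicit first replay or the eager fallback.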
```cpp
// ~2x slower than the eager path on discrete GPUs). Allowlist-by-absence: any quantized
// program (int4 nibble unpack, fp4, or the int8/int4 dequantize+quant_dot path that has
// no nibble unpack) takes the eager path; only fp16 (none of these ops) captures.
```
```cpp
static const std::array<std::string, 4> low_bit_ops = {
    {"unpack_int4", "unpack_fp4", "dequantizelinear", "quant_dot"}};
for(const auto& ins : mpm.get_module())
{
    if(contains(low_bit_ops, ins.name()))
    {
        ctx->set_graph_not_capturable();
        break;
    }
}
```
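The name scan above can be reproduced in isolation. A minimal standard-library sketch, assuming the instruction names are available as plain strings; `low_bit_ops()` and `is_capturable` are hypothetical helpers, and the `names` vector stands in for iterating a module and calling `ins.name()`.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Op names taken from the snippet above.
inline const std::array<std::string, 4>& low_bit_ops()
{
    static const std::array<std::string, 4> ops = {
        {"unpack_int4", "unpack_fp4", "dequantizelinear", "quant_dot"}};
    return ops;
}

// Returns true when no instruction name is on the low-bit list.
inline bool is_capturable(const std::vector<std::string>& names)
{
    const auto& ops = low_bit_ops();
    return std::none_of(names.begin(), names.end(), [&](const std::string& n) {
        return std::find(ops.begin(), ops.end(), n) != ops.end();
    });
}
```

Because the check is by absence, an empty program or one with only unlisted ops counts as capturable; any new quantized op name would have to be added to the list by hand.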
```cpp
contexts.front().execute([&] {
    ret = generic_eval(*this, contexts, params, [&](auto&&, auto f) { return f(); });
    impl->graph_cached_results = ret;
});
if(ret.empty())
    ret = impl->graph_cached_results;
```
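The caller-side pattern in that hunk relies on the callback not being invoked on the replay path, so the result vector stays empty and the cached outputs are returned instead. A small sketch of just that interaction, with hypothetical names (`replaying_context`, `evaluator`) and integers standing in for output arguments. In the PR the cached values refer to fixed device buffers, which is what would make reusing them valid.

```cpp
#include <functional>
#include <vector>

// Hypothetical context whose execute() skips the callback once a graph exists.
struct replaying_context
{
    bool have_graph = false;

    void execute(const std::function<void()>& run)
    {
        if(have_graph)
            return; // replay path: the eager loop is not re-run
        run();
        have_graph = true;
    }
};

struct evaluator
{
    replaying_context ctx;
    std::vector<int> cached_results; // stand-in for cached output arguments
    int loop_runs = 0;

    std::vector<int> eval()
    {
        std::vector<int> ret;
        ctx.execute([&] {
            ++loop_runs;
            ret            = {1, 2, 3}; // stand-in for generic_eval(...)
            cached_results = ret;
        });
        if(ret.empty())
            ret = cached_results; // callback skipped: reuse cached outputs
        return ret;
    }
};
```

The sketch also makes the limitation visible: if the outputs could legitimately differ between evals (for example, changed buffer pointers), the cached copy would be stale.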
```cpp
}
else if(contexts.size() == 1)
{
    // route the single-context eval (the EP decode path, async or not)
```
```cpp
// regresses low-bit-quantized (int4/fp4) decode substantially (up to ~2x slower than
// the eager per-op path on discrete GPUs), so a program that fuses any low-bit dequant
// op is marked non-capturable and runs the eager path; fp16 still captures.
bool is_graph_capturable() const { return graph_capturable; }
```
This shouldn't occur and signals a bug. We should be agnostic to what the lower-level library is doing during the capture. Since MIGraphX works at a higher level than MLIR and other libraries, we should see a reduction in launch execution overall.
We do fusions to further optimize the model and keep the GPU queue full so it doesn't go idle.
```cpp
// no nibble unpack) takes the eager path; only fp16 (none of these ops) captures.
// Scanned here -- before fusion consumes these into a code_object whose name no longer
// reveals them -- and recorded on the (shared) context so the hipGraph path (gated by
// context::is_graph_enabled) skips capture.
```
We want to capture all kernels, not just certain non-fused ones; otherwise you'll still get idle bubbles in the pipeline of runs.
We need to discuss this with @pfultz2. He's already started work on this in another draft PR he's mentioned, and it seems we need to discuss the architecture of how we'd like to do this. I've already got changes in rocm/onnxruntime and am porting these to the GPU EP side.
We shouldn't be picking/dropping kernels from the capture, as that causes a bunch of other pointer dependencies/math in the GPU.
In my mind hipGraph is simple:
Input run -> capture -> Model capture
Input -> Replay Model capture -> output
Adding any sort of tap dancing will lead to future failure, say if we have an increasing list of "uncapturable" kernels. We don't have an infinite amount of addresses in a GPU that may or may not be running concurrent workloads or parallel instances for the given address range.
Let's pick a time next week to discuss this and the architecture in detail, as it seems you have requirements for how you'd like this to work, and I know Paul and I have our own in terms of MIGraphX and Onnxruntime specifically.
LLM decode re-enqueues the per-token kernel sequence one launch at a time: `program::eval`'s `generic_eval` loop issues one `hipExtModuleLaunchKernel` per op (~50+ launches/token for a transformer decode step). The per-launch host overhead, and the GPU-clock throttling caused by the resulting dispatch bubbles, are a meaningful fraction of per-token latency on memory-light decode workloads. This PR adds the HIP-graph analog of what other backends already do (CUDA Graphs / D3D12 command-list replay): capture the single-context eval kernel loop into a `hipGraph` once, then replay it with a single `hipGraphLaunch` per subsequent eval.
The feature is opt-in (`MIGRAPHX_ENABLE_HIPGRAPH`, default off) and, when off, the code path is byte-identical to before. When on, it is restricted to FP16 programs; quantized (int4/fp4) decode is automatically excluded because graph capture regresses it (see "Gating" below).
Motivation
For decode, the bottleneck on discrete GPUs is not kernel compute - it is host dispatch overhead plus the clock throttle that per-launch gaps induce. Collapsing the per-token launch sequence into one captured/replayed graph removes both. This is a runtime-execution change, not a kernel change, so it benefits any model whose decode step is dispatch-bound.
Technical Details
- GPU `context`: owns the `hipGraph_t`/`hipGraphExec_t` (RAII) and exposes `begin_graph_capture()`/`end_graph_capture()`/`replay_graph()` and a single `execute(run_kernels)` entry point.
- `program::eval` routes the single-context path through `context::execute()` and caches the output arguments for the replay path (valid because static-shape decode reuses fixed device buffers; the terminal stream sync stays outside the capture region).
- The type-erased `context` interface gains an `execute()` hook whose default simply runs the loop, so non-GPU targets are unaffected.
Performance (measured)
fp16 steady-state decode throughput, hipGraph off vs on, interleaved same-session A/B with md5-verified builds:
Safety / compatibility
- With `MIGRAPHX_ENABLE_HIPGRAPH` unset, behavior is byte-identical to the prior path (the single-context branch still runs the eager loop).
Validation
How to enable
MIGRAPHX_ENABLE_HIPGRAPH=1
(plus static shapes / fixed buffers, which decode with a shared KV buffer already provides).
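For illustration, the two conditions described in this PR (the opt-in environment variable and the fp16-only capturability flag) combine into a single gate. A hedged sketch: `env_flag_enabled` and the free function `is_graph_enabled` are hypothetical, and treating only the value `1` as "on" is an assumption here, not MIGraphX's actual environment-variable parsing.

```cpp
#include <cstdlib>
#include <string>

// Reads an environment variable as a boolean flag.
// Assumption: only the exact value "1" enables the feature.
inline bool env_flag_enabled(const char* name)
{
    const char* value = std::getenv(name);
    return value != nullptr && std::string(value) == "1";
}

// Hypothetical combined gate: capture only when the user opted in AND the
// program was not marked non-capturable by the quantized-op scan.
inline bool is_graph_enabled(bool hipgraph_requested, bool graph_capturable)
{
    return hipgraph_requested && graph_capturable;
}
```

A caller would evaluate `env_flag_enabled("MIGRAPHX_ENABLE_HIPGRAPH")` once and pass the result in, so the per-eval check is two boolean reads.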
Notes for reviewers
- `program::eval` is the only core change; everything else lives in the GPU target.
- The gate is set in `fuse_mlir`, so a non-default configuration with MLIR disabled would not set the gate. This is not a concern for the standard build (MLIR is on, and the int4 fusion path requires it), but noting it for completeness.
Changelog Category
Add a `CHANGELOG.md` entry for any option other than `Not Applicable`