[FEAT] Add replay from trace strategy #620
VincentG1234 wants to merge 5 commits into vllm-project:main
Conversation
Force-pushed 008633f to a66034b
This pull request has merge conflicts that must be resolved before it can be merged.
Add trace replay capability to GuideLLM for reproducing real-world request patterns from trace files. This enables time-based request rate replay and synthetic prompt generation matching trace token counts.

- Add `TraceReplayStrategy` for scheduling requests at precise timestamps
- Add `ReplayProfile` for configuring trace-based benchmarking
- Add `TraceSyntheticDatasetDeserializer` for generating prompts from traces
- Support `max_requests` truncation to limit trace length

This is a minimal implementation to address issue #597. Full Mooncake format support, E2E tests, and documentation will follow in subsequent PRs.

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
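As a rough illustration of what timestamp-based replay means here, the sketch below dispatches requests at offsets relative to the first trace timestamp. This is not GuideLLM's actual `TraceReplayStrategy`; all names in it are made up for this example.

```python
# Minimal sketch (illustrative, not GuideLLM's real code): replay requests
# at the relative offsets recorded in a trace.
import asyncio
import time


def relative_offsets(timestamps: list[float]) -> list[float]:
    """Convert absolute trace timestamps to offsets from the first request."""
    ordered = sorted(timestamps)
    start = ordered[0]
    return [t - start for t in ordered]


async def replay(timestamps: list[float], send) -> None:
    """Sleep until each request's offset relative to replay start, then fire it."""
    start = time.monotonic()
    for i, offset in enumerate(relative_offsets(timestamps)):
        delay = offset - (time.monotonic() - start)
        if delay > 0:
            await asyncio.sleep(delay)
        await send(i)


async def main():
    fired = []

    async def send(i):
        fired.append(i)

    # Three requests 50 ms apart in the original trace.
    await replay([1000.0, 1000.05, 1000.10], send)
    return fired


print(asyncio.run(main()))  # [0, 1, 2]
```

The key property is that inter-arrival gaps from the trace are preserved, rather than being derived from a synthetic rate.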
Force-pushed 7f893fb to 780be20
It would be great to get an example of "How to get the JSONL", because I can't find solutions in LiteLLM, for example.
Yeah, that’s true, most frameworks won’t produce this exact JSONL directly. That’s kind of intentional. The idea here is to define a minimal, framework-agnostic canonical replay format, not something tied to a specific tracing stack. In practice, the required fields already exist almost everywhere (timestamp, input token count, output token count), just under slightly different names, so a small mapping step is usually enough.

I agree it’s not the best UX on its own, but it felt like the right minimal base for the feature. Then we can iterate on top of it with helpers / converters for common sources like LiteLLM or Langfuse, and we can extend it later (e.g. optional prompt field, multiple timestamp formats, richer metadata) without breaking the core idea. But happy to adjust the direction if maintainers prefer something more opinionated or integrated from the start.
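To make that mapping step concrete, here is a hedged sketch. The source field names (`created_at`, `prompt_tokens`, `completion_tokens`) are hypothetical stand-ins for whatever your tracing stack emits; `timestamp`, `input_length`, and `output_length` are the fields the trace reader in this PR expects.

```python
# Sketch: convert a generic request log into the trace JSONL format
# (timestamp, input_length, output_length). The source-side field names
# below are illustrative, not from any specific framework.
import json


def to_trace_row(record: dict) -> dict:
    return {
        "timestamp": record["created_at"],
        "input_length": record["prompt_tokens"],
        "output_length": record["completion_tokens"],
    }


source = [
    {"created_at": 1700000000.0, "prompt_tokens": 128, "completion_tokens": 64},
    {"created_at": 1700000000.4, "prompt_tokens": 256, "completion_tokens": 32},
]

# One JSON object per line, i.e. a .jsonl trace file.
lines = [json.dumps(to_trace_row(r)) for r in source]
print(lines[0])
# {"timestamp": 1700000000.0, "input_length": 128, "output_length": 64}
```

Writing the `lines` list to a file with one entry per line yields a trace file the replay feature can consume.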
sjmonson left a comment
Sorry for the silence on this. There are a few things with this PR that break other use-cases. I am still working on a more complete review, but here are a few low-hanging problems.
Since this file is used by data, scheduler, and benchmark, move it to utils.
```python
    SynchronousStrategy,
    ThroughputStrategy,
    TraceReplayStrategy,
    load_relative_timestamps,
```
Not a method of this submodule.

```diff
-    load_relative_timestamps,
```
```python
    "UnserializableConstraintInitializer",
    "WorkerProcess",
    "WorkerProcessGroup",
    "load_relative_timestamps",
```
Not a method of this submodule.

```diff
-    "load_relative_timestamps",
```
```python
    "SynchronousStrategy",
    "ThroughputStrategy",
    "TraceReplayStrategy",
    "load_relative_timestamps",
```
Not a method of this submodule.

```diff
-    "load_relative_timestamps",
```
```python
from pydantic import Field, NonNegativeFloat, NonNegativeInt, PositiveInt, PrivateAttr

from guidellm.data.trace_io import load_relative_timestamps
```
See comment on trace_io.py.

```diff
-from guidellm.data.trace_io import load_relative_timestamps
+from guidellm.utils.trace_io import load_relative_timestamps
```
```python
# When max_requests is set, limit the first data source to that many rows at load
if max_requests is not None and data:
    if max_requests < 1:
        raise ValueError(
            "max_requests must be >= 1 when set for data truncation, "
            f"got {max_requests}"
        )
    data_args = list(data_args) if data_args else [{} for _ in data]
    if len(data_args) >= 1:
        data_args[0] = {**data_args[0], "max_rows": max_requests}
```
Drop this; `max_requests` is a constraint on the number of requests that are allowed to complete. To limit the data source, use `--data-samples`.

```diff
-# When max_requests is set, limit the first data source to that many rows at load
-if max_requests is not None and data:
-    if max_requests < 1:
-        raise ValueError(
-            "max_requests must be >= 1 when set for data truncation, "
-            f"got {max_requests}"
-        )
-    data_args = list(data_args) if data_args else [{} for _ in data]
-    if len(data_args) >= 1:
-        data_args[0] = {**data_args[0], "max_rows": max_requests}
```
```python
# For replay profile: resolve profile first to apply max_seconds filtering,
# then use the filtered count for the data loader. This ensures the data
# loader and scheduler both work with the same filtered request count.
if args.profile == "replay":
```
Unless I am missing something, this conditional should be unnecessary. There is no reason to do loader then profile other than it's the way things were done before.
```python
effective_max_requests = (
    profile.constraints.get("max_requests")
    if profile.constraints
    else args.max_requests
)
```
Not a huge fan of this, and I also think it's unnecessary. The profile can trigger a benchmark end based on the number of requests. It's fine if the request loader reads too many requests ahead.
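The pattern the reviewer describes can be sketched as follows (illustrative only, not GuideLLM's API; the function name is made up): the request loader may be an unbounded iterator that reads ahead, and a completion-count constraint is what actually ends the run.

```python
# Sketch: the loader may over-read; a completion-count constraint stops the run.
from itertools import count


def run_until_max_requests(requests, max_requests: int) -> list:
    """Consume requests from a possibly unbounded loader; stop once
    max_requests have completed, regardless of how much data remains."""
    completed = []
    for request in requests:
        completed.append(request)
        if len(completed) >= max_requests:
            break
    return completed


# The loader here is infinite, but the constraint ends the benchmark.
print(run_until_max_requests(count(), 3))  # [0, 1, 2]
```

With this shape, no special-casing of the data loader is needed: truncation is a scheduling concern, not a loading concern.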
```python
__all__ = ["load_relative_timestamps", "load_trace_rows"]


def load_trace_rows(
```
Replace this function with a call to `datasets.load_dataset`. Can basically be a copy of JSONFileDatasetDeserializer, i.e.:

```python
return load_dataset("json", data_files=str(path), **data_kwargs)
```

```python
        path,
        required_columns=[timestamp_column],
    )
    timestamps = sorted([float(row[timestamp_column]) for row in raw])
```
If using datasets, you can do `raw.sort(timestamp_column)`.
Thanks a lot for the detailed review, I really appreciate your time. I’m fully aligned with your feedback, especially on the replay handling in the entrypoint, which is a key part of the PR. I agree that introducing a special case here is not ideal and should be avoided. I’ll refactor this to make it cleaner and better aligned with the existing design.
Summary

- A `replay` benchmarking strategy that reproduces real-world request patterns from trace log files (`.jsonl`)
- `max_requests` and `max_seconds` CLI options to limit the number of requests processed from a trace

Motivation
This change addresses issue #597 by enabling users to benchmark their vLLM servers using real production traces. Instead of synthetic load patterns, users can now replay exact request arrival times and token distributions from their actual workloads for more realistic performance testing.
Changes

- `TraceReplayStrategy` scheduler strategy for timestamp-based request dispatching
- `ReplayProfile` class for configuring trace-based benchmarking parameters
- `TraceSyntheticDatasetDeserializer` to generate prompts matching trace input/output lengths
- `TraceReader` utility for reading `.jsonl` trace files with `timestamp`, `input_length`, `output_length` fields
- Entrypoint updated to handle replay profile and dataset configuration
- `max_requests` and `max_seconds` truncation support to limit trace replay length

Testing
- `pytest tests/unit/scheduler/test_trace_replay.py` (pass)
- `pytest tests/unit/benchmark/test_replay_profile.py` (pass)
- `pytest tests/unit/data/deserializers/test_trace_synthetic.py` (pass)
- Added tests: scheduling accuracy, boundary conditions, malformed trace handling, empty trace cases, `max_requests` truncation
- Quick practical test with a Colab notebook
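For intuition on the prompt-generation piece, here is a minimal sketch. It counts whitespace-separated words only; the real deserializer would target token counts under the model's actual tokenizer, and the names and vocabulary below are illustrative.

```python
# Sketch: generate synthetic prompts sized to each trace row's input_length.
# Whitespace "tokens" stand in for real tokenizer tokens.
import random


def synthetic_prompt(n_tokens: int, rng: random.Random) -> str:
    """Build a prompt of exactly n_tokens whitespace-separated words."""
    vocab = ["alpha", "beta", "gamma", "delta", "epsilon"]
    return " ".join(rng.choice(vocab) for _ in range(n_tokens))


# One synthetic prompt per trace row, sized to the row's input_length field.
rows = [{"input_length": 4}, {"input_length": 2}]
rng = random.Random(0)
prompts = [synthetic_prompt(row["input_length"], rng) for row in rows]
print([len(p.split()) for p in prompts])  # [4, 2]
```

The point is that replay does not need the original prompt text, only prompts that exercise the server with matching input and output lengths.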
Next Steps (this PR)
Out of Scope (future PRs or not)