feat: agentx #348

Open
cquil11 wants to merge 32 commits into master from feat/agentx

cquil11 (Contributor) commented May 14, 2026

Adds agentic trace-replay benchmarks (AgentX) to the dashboard: closed-loop replays of real multi-turn agent conversations (with subagent fan-out), alongside the existing fixed-seq-len sweeps. History is squashed into the logical commit groups below.

Database schema (feat(db): agentic benchmark schema)

New migrations (007_agentic.sql through 009_dataset_request_stats.sql):

  • benchmark_results extended for agentic rows
    • benchmark_type = 'agentic_traces'; isl/osl nullable (traces have no fixed sequence length)
    • offload_mode column added to the unique key (nulls not distinct) — the same config can run with KV offload on/off
    • trace_replay_id FK into the new sidecar table
    • metrics JSONB gains trace-derived keys: full latency percentile ladders (mean/median/p75/p90/p95/p99/p99.9 × ttft/tpot/itl/e2el), *_intvty (interactivity = 1/ITL), per-GPU throughputs, gpu_kv_cache_usage_pct, server/theoretical cache-hit rates, token totals
  • agentic_trace_replay — one row per benchmark point, keeps 100 MB+ payloads out of the hot query path
    • raw blobs: profile_export_jsonl_gz (every request the load generator sent) and server_metrics_json_gz (vLLM/SGLang Prometheus scrape summary)
    • precomputed, version-stamped JSONBs derived from those blobs at ingest time (see “time-series data” below): chart_series, request_timeline, aggregate_stats
  • datasets / dataset_conversations — the source trace datasets (HF id, slug, variant) and per-conversation structure: turn/subagent tree, token counts, timing, cache-hit distributions
  • run_datasets — links each workflow run to the dataset it replayed, so timeline → dataset deep links resolve
  • latest_benchmarks materialized view updated to include agentic rows (one run per line)
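
The metrics keys above pair a percentile ladder with an interactivity figure defined as 1/ITL. A minimal sketch of how such a ladder could be derived from per-request samples follows. The helper names (`percentile`, `ladder`, `interactivity`), the nearest-rank percentile method, and ITL measured in seconds are all assumptions for illustration, not the PR's ETL code; p99.9 is omitted for brevity.

```typescript
// Hypothetical helpers; nearest-rank percentiles are an assumption.
type Ladder = {
  mean: number;
  median: number;
  p75: number;
  p90: number;
  p95: number;
  p99: number;
};

// Nearest-rank percentile over an ascending-sorted, non-empty array:
// the smallest sample with at least p% of samples at or below it.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function ladder(samples: number[]): Ladder {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return {
    mean,
    median: percentile(sorted, 50),
    p75: percentile(sorted, 75),
    p90: percentile(sorted, 90),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}

// Interactivity is 1/ITL, so invert each positive sample
// (ITL assumed to be in seconds, giving tokens/sec per user).
function interactivity(itlSeconds: number[]): number[] {
  return itlSeconds.filter((x) => x > 0).map((x) => 1 / x);
}
```

Because inversion reverses ordering, a high interactivity percentile corresponds to a low ITL percentile; computing the ladder over the inverted samples, as sketched here, avoids mixing the two.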

Time-series & per-request data ingested per run

  • chart_series (1 Hz server-side time series, one point per second of the run): KV-cache utilization (aggregate, per-DP-rank/engine, and host/CPU-offload for hicache), prefix-cache hit rate and hit-rate tokens/sec, queue depth (running/waiting), prefill and decode tokens/sec, prompt tokens by source — plus the same series grouped by server source for disagg runs
  • request_timeline (one record per HTTP request, ns-resolution): conversation id + turn index, worker id, subagent depth, warmup/profiling phase, credit-issued/start/ack/end timestamps, TTFT/TPOT, ISL/OSL, cancellation flag, and raw-source provenance (srcTrace/srcOuter/srcInner/srcKind) mapping every replayed request back to the exact original dataset request
  • aggregate_stats — percentile envelopes precomputed across sibling configs so the cross-config aggregate views don’t re-stream blobs
  • Payloads are versioned (CHART_SERIES_VERSION, REQUEST_TIMELINE_VERSION, STATS_VERSION): version-mismatched rows recompute from the raw blob on demand, and db:backfill-* CLIs re-materialize in bulk after an algorithm change
  • Per run this also ingests: server logs per point, availability rows, changelog entries, and the run→dataset link
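
The version-stamped payload behaviour described above (serve the stored JSONB when its version matches, otherwise recompute from the raw blob) can be sketched roughly as below. The `Versioned` shape and the `readOrRecompute` helper are hypothetical names; only the idea of comparing a stored version against a constant such as CHART_SERIES_VERSION comes from the description.

```typescript
// Hypothetical shape: a stored payload stamped with the version that produced it.
interface Versioned<T> {
  version: number;
  data: T;
}

// Returns the stored payload when it matches the current version; otherwise
// falls back to recomputing (in the PR's terms, from the raw gzipped blob).
function readOrRecompute<T>(
  stored: Versioned<T> | null,
  currentVersion: number,
  recompute: () => T,
): { data: T; recomputed: boolean } {
  if (stored !== null && stored.version === currentVersion) {
    return { data: stored.data, recomputed: false };
  }
  return { data: recompute(), recomputed: true };
}
```

A bulk backfill would then amount to running the recompute branch for every mismatched row, rather than waiting for each row to be requested.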

Remaining commit groups

  • chore: deps/toolchain — stream-json (streaming 100 MB+ blobs without ERR_STRING_TOO_LONG), adm-zip, audit overrides
  • feat(db): ETL — maps aiperf artifacts across all three agg-schema generations (v3: nested request/server metrics, pre-inverted interactivity, cluster:-scoped hw ids, results/server.log); computes the three JSONB payloads at ingest
  • feat(db): query layer + CLIs — fast-path queries over precomputed JSONBs with version-checked blob fallback (guarding Neon’s 64 MiB response cap), shared backfill runner behind the four db:backfill-* scripts, ingest-weka-dataset for loading source datasets
  • ci: ingest workflow - repository_dispatch: ingest-agentic-results so the main InferenceX repo triggers ingestion after an AgentX sweep (mirrors the fixed-seq-len flow; 60-min timeout for blob uploads; Slack alert when a run references a dataset missing from datasets)
  • feat(api): endpoints + hooks — bulk-id v1 routes (aggregates, derived metrics, request timelines, server metrics, histograms, siblings, datasets) with shared route/hook factories for a uniform error contract and caching; matching React Query hooks
  • feat(datasets): dataset browser — dataset list/detail with token & cache-hit distributions; per-conversation flamegraph showing the turn/subagent timing tree; accepts ?turn/raw/inner/sa deep-link params to highlight the exact request
  • feat(agentic): per-point detail — summary cards; Gantt request timeline (per-conversation or per-worker rows, subagent/aux lane nesting, stable row colors across phase toggle, shift+scroll zoom, click-through to the original dataset request); rolling/cumulative time-series charts (interactivity, TTFT/E2E, KV usage, throughput, in-flight ISL/OSL); cross-config aggregates. Exists to debug why identical configs diverge under sustained load
  • feat(inference): dashboard integration — Agentic Traces as a scenario in the main chart: trace-derived x-axis modes, agentic-aware rooflines/tooltips/labels, click-through to point detail; works for official runs and ?unofficialrun= overlays
  • test: coverage + docs — Cypress e2e/component specs for timeline, time-series, datasets, and dashboard integration; data-pipeline docs

Diff breakdown

| Category | Added | Removed |
| --- | --- | --- |
| Code (app + db + constants) | +15,815 | −764 |
| Tests (vitest + Cypress) | +5,129 | −133 |
| CI / agent docs | +376 | −0 |
| Docs | +12 | −0 |
| Lockfile | +36 | −0 |

~24% of the diff is test coverage.

Verification

  • Typecheck, oxlint, oxfmt, 2,727 unit tests, production build, Cypress component (163) and targeted e2e — all pass
  • ETL recompute verified byte-identical against stored payloads for three trace replays (no backfill required)
  • Known pre-existing e2e failures (not introduced by this PR, tracked for follow-up):
    • url-params.cy.ts (2): Historical Trends tab defaults to Agentic Traces, which has no trend data, so the legend/high-contrast toggle never renders
    • agentic-point-time-series.cy.ts (3): stale point-count/label expectations from earlier chart changes

Note

High Risk
Large cross-cutting release touching production ingest workflows, Neon writes, and versioned blob caches; regressions could corrupt live benchmark data, leave it stale, or break the main inference chart for models with mixed agentic and fixed-seq data.

Overview
Delivers Agentic Traces as a first-class benchmark scenario: closed-loop replays of real multi-turn agent conversations, with datasets, per-point drill-down, and main-chart integration alongside fixed sequence-length sweeps.

Ingest & CI — Adds ingest-agentic-results (repository dispatch, 60m timeout, multi-target DB secrets, Slack on failure/unmapped entities) plus an ingest agent doc for manual Neon ingests (changelog mandatory, cache purge, interactivity normalization). Data-pipeline docs now cover orchestrator adapters for server metrics and automatic run_datasets linking from artifact provenance.

API layer — New /api/v1/* routes for datasets (list/detail/conversations), agentic blobs (aggregates, derived metrics, request timeline, server metrics, histograms), trace availability, and benchmark siblings; shared id-routes factories and version-derived blob cache keys (with tests so backfills don’t serve stale JSON forever).

Datasets UI: /datasets registry and detail pages with distribution cards (P50–P95 guides), paginated conversation search (100-char cap), and a trace flamegraph (subagent expand/collapse, parallel-overlap brackets, deep links from agentic timeline via ?turn/raw/inner/sa).

Inference dashboard: Renames sequence → scenario selector; gates fetches on sequenceResolved so models without agentic data don’t flash Agentic Traces; adds x-axis modes (interactivity, TTFT, E2E, normalized E2E @ 400 tokens), a legend points table, /inference/agentic/[id], unofficial-run rows carrying benchmark_type/offload_mode, and chart default tweaks (high contrast on, parallelism labels on, line labels off). Agentic detail and GPU-compare tooltips expose normal links to point charts.

Tests & tooling — Large Cypress component/e2e coverage (datasets, flamegraph, agentic time series, orchestrator metric sources, overlay x-axis); .eslintignore for .claude/worktrees/; optional NEXT_DIST_DIR for a second dev server.

Reviewed by Cursor Bugbot for commit 08ea28f. Bugbot is set up for automated code reviews on this repo. Configure here.

vercel Bot commented May 14, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| inferencemax-app | Ready | Preview, Comment | Jul 3, 2026 3:55am |

Comment thread packages/app/src/components/GlobalFilterContext.tsx
Comment thread packages/app/src/components/inference/hooks/useChartData.ts
cquil11 and others added 6 commits July 2, 2026 15:10
Master shipped its own 007 (latest_benchmarks single-run-per-line,
#491) while this branch carried 007_agentic — two migrations with the
same number. Renumber the branch set to 008_agentic /
009_latest_benchmarks_single_run_per_line / 010_dataset_request_stats
so a fresh deploy applies them strictly after master's lineage; 009
supersedes master's 007 with the offload_mode-aware view definition.
Each inference legend row gets a table icon (visible on hover/focus,
faint otherwise) that opens a dialog listing every currently-visible
point for that hardware/framework series: concurrency, parallelism,
offload, tput/GPU, p50/p90 interactivity and TTFT, sorted by
concurrency with sortable columns. Rows link the same way scatter
points do — agentic points to their per-point detail page, fixed-seq
points to the GitHub Actions run — as real anchors so open-in-new-tab
works. Unofficial-run overlay series get the same table (metrics only;
overlay points have no stored benchmark rows) respecting
activeOverlayHwTypes and overlayRunColor.
extractTurn guarded isl<=0 but not osl<=0, so cancelled/empty-output
turns collapsed the whole decode window into one ITL interval and the
@400-token projection became ttft + 399x(latency-ttft) — ~386x
inflation baked into stored p75/p90 aggregates (seeded repro: p90
1104.78s -> 6.01s). STATS_VERSION bumped 4->5 so stored payloads
recompute via the version fallback. Adds regression test.
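
The failure mode in the commit above can be illustrated with a small sketch of the @400-token projection. The `Turn` shape, the `projectE2eAt400` name, and the choice to return null for single-token turns are assumptions made for illustration; only the isl/osl guards and the `ttft + 399 x ITL` form come from the commit message.

```typescript
// Hypothetical turn shape; field names are illustrative, times in seconds.
interface Turn {
  isl: number;
  osl: number;
  ttft: number;
  latency: number;
}

const PROJECTED_TOKENS = 400;

// Projects end-to-end latency at 400 output tokens, or returns null when
// the turn cannot support an inter-token-latency estimate.
function projectE2eAt400(t: Turn): number | null {
  // Guard both sides: empty prompts and cancelled/empty-output turns.
  if (t.isl <= 0 || t.osl <= 0) return null;
  // osl tokens span (osl - 1) inter-token intervals; treating a
  // single-token turn as unmeasurable is an assumption of this sketch.
  const intervals = t.osl - 1;
  if (intervals <= 0) return null;
  const itl = (t.latency - t.ttft) / intervals;
  return t.ttft + (PROJECTED_TOKENS - 1) * itl;
}
```

Without the osl guard, a turn with no output would be treated as one interval covering the whole decode window, which is how the projection degenerates to `ttft + 399 x (latency - ttft)`.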
DISTINCT ON (config_id, conc, isl, osl) collapsed agentic offload
on/off variants (isl/osl both NULL) into one arbitrary winner, so run
views silently dropped half the sweep (seeded repro: 2 rows -> 4).
Adds offload_mode to the SQL DISTINCT ON + ORDER BY and to the
json-provider dedup key (normalized ?? 'off' to match lineKey).
Every other selection path already keyed on it. Adds 4 regression
tests.
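
A rough in-memory analogue of the dedup fix above: when isl and osl are both null, only offload_mode distinguishes the on/off variants, so it has to be part of the key. The `Row` shape, the `date` ordering field, and the helper names are hypothetical; the `?? 'off'` normalization is taken from the commit message.

```typescript
// Hypothetical row shape; `date` stands in for whatever ordering picks the winner.
interface Row {
  config_id: number;
  conc: number;
  isl: number | null;
  osl: number | null;
  offload_mode: string | null;
  date: string;
}

// Key includes offload_mode, normalized so null and 'off' compare equal.
function dedupKey(r: Row): string {
  return [r.config_id, r.conc, r.isl ?? "", r.osl ?? "", r.offload_mode ?? "off"].join("|");
}

// Keeps the most recent row per key (ISO date strings compare lexically).
function latestPerKey(rows: Row[]): Row[] {
  const best = new Map<string, Row>();
  for (const r of rows) {
    const key = dedupKey(r);
    const current = best.get(key);
    if (current === undefined || r.date > current.date) best.set(key, r);
  }
  return Array.from(best.values());
}
```

Dropping `offload_mode` from `dedupKey` reproduces the bug: both agentic variants share one key and a single arbitrary row survives.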
Four blob-cached agentic routes had unversioned cache keys; blobSet is
write-once and backfills never purge, so payload-version bumps served
stale blobs indefinitely (the DB version check is bypassed on blob
hits). Keys now derive from the governing VERSION constants
(STATS/REQUEST_TIMELINE/CHART_SERIES), asserted by tests.

Stale/missing recomputes now persist their result via a best-effort
fire-and-forget ::jsonb write-back (no-ops on read replicas), so one
request self-heals a row instead of re-gunzipping the raw blob until a
manual backfill. STATS_VERSION moves to the dependency-free
agentic-shared leaf to avoid an import cycle. Live-verified: stored
payloads healed 4->5 / 11->12 / 4->5 on a single query.
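
The cache-key change above can be sketched as follows: if the key embeds the governing version constant, a version bump produces a new key and a write-once blob store can no longer serve the old payload. The key format, the `blobCacheKey` name, and the version numbers other than STATS_VERSION = 5 are illustrative assumptions.

```typescript
type Payload = "stats" | "request_timeline" | "chart_series";

// STATS_VERSION 5 is from the commit message; the other values are placeholders.
const VERSIONS: Record<Payload, number> = {
  stats: 5,
  request_timeline: 1,
  chart_series: 1,
};

// Hypothetical key format: payload kind, version, then the row id.
function blobCacheKey(
  payload: Payload,
  id: number,
  versions: Record<Payload, number> = VERSIONS,
): string {
  return `agentic:${payload}:v${versions[payload]}:${id}`;
}
```

Old blobs are simply orphaned under the previous key, which suits a store that never purges.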
Comment thread packages/db/src/etl/benchmark-mapper.ts
Comment thread packages/app/src/lib/url-state.ts
Comment thread .github/workflows/ingest-agentic-results.yml Fixed
Comment thread packages/app/src/app/api/unofficial-run/route.ts
<td className="px-3 py-2">
<Link
href={`/datasets/${slug}/conversations/${c.conv_id}`}
onClick={() => track('datasets_conversation_clicked', { slug })}

Conversation links break special IDs

Medium Severity

The dataset conversation table builds href values by interpolating conv_id directly into the path without encoding. Conversation IDs that contain %, /, or other reserved characters produce malformed routes or wrong IDs after routing, so deep links from the list can 404 or open the wrong conversation.

Reviewed by Cursor Bugbot for commit bd0a490. Configure here.
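
One way to address the comment above is to percent-encode each dynamic segment when building the href. This is a minimal sketch; `conversationHref` is a hypothetical helper, not code from the PR.

```typescript
// Builds the conversation link with each dynamic segment percent-encoded,
// so reserved characters such as '/' or '%' stay inside their segment.
function conversationHref(slug: string, convId: string): string {
  return `/datasets/${encodeURIComponent(slug)}/conversations/${encodeURIComponent(convId)}`;
}
```

For example, an ID of `a/b%c` becomes the single segment `a%2Fb%25c` instead of being split across two path segments.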

Comment thread packages/app/src/lib/benchmark-transform.ts
Comment thread .github/workflows/ingest-agentic-results.yml
cquil11 added 2 commits July 2, 2026 18:47
…ions

The head commit message usually describes an unrelated code change;
the workflow display name describes the sweep itself.
Comment thread packages/app/src/components/inference/utils/legend-points-table.ts
cquil11 and others added 6 commits July 2, 2026 20:03
The failed-run guard required num_requests_total > 0, so a config whose
server never came up (total = 0, e.g. dep4 conc32 in run 28617267459)
slipped through as a dataless point. Any row explicitly reporting zero
successful requests is a failure regardless of how many were issued.
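
The guard change above can be sketched like this. `num_requests_total` appears in the commit message; `num_requests_successful` and the function names are assumed for illustration.

```typescript
// Hypothetical subset of a result row; both counters may be absent.
interface PointCounts {
  num_requests_total?: number;
  num_requests_successful?: number;
}

// Old behaviour as described: only flagged when some requests were issued.
function isFailedPointOld(c: PointCounts): boolean {
  return (c.num_requests_total ?? 0) > 0 && c.num_requests_successful === 0;
}

// New behaviour: an explicit zero-success report is a failure regardless
// of how many requests were issued (including zero).
function isFailedPoint(c: PointCounts): boolean {
  return c.num_requests_successful === 0;
}
```

A row that omits the success counter entirely is not treated as a failure here, since the commit only covers rows that explicitly report zero.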
The documented DUMP_DIR mode 500'd on every new surface: the four new
tables (agentic_trace_replay, datasets, dataset_conversations,
run_datasets) were missing from TABLE_INSERT_ORDER so dumps never
carried them, json-provider had no mirrors, and ten routes called
getDb() with no JSON_MODE guard.

Tables added in FK-safe order; bytea blobs round-trip through dump/load
(Buffer JSON encoding, ::bytea decode); agentic_trace_replay lazy-loads
like server_logs; mirrors reuse the same pure compute helpers as the
SQL paths for version-stale fallbacks; all ten routes gain the standard
JSON_MODE branch. Verified end-to-end: dump-mode server serves all ten
endpoints 200, byte-identical to Postgres on 9/10 (remaining diffs are
pre-existing benchmarks-mirror nuances). Adds 21 mirror tests.
The AgenticTraces default resolved before availability loaded (static
SEQUENCE_OPTIONS fallback), so fixed-seq-only models flashed 'Agentic
Traces', fired a wasted agentic fetch, then snapped to 1k/1k. New pure
resolveEffectiveSequence helper (mirrors default-precisions pattern)
returns the real scenario only once availability is known; benchmark
fetching gates on the new sequenceResolved flag; non-agentic models
fall back to 8k/1k (master's default) when available.

Fixes the url-params and historical-trends e2e failures the PR
description labels 'pre-existing' — they were caused by this default
and now pass with no assertion changes. ttft-x-axis-toggle gets
spec-scoped agentic intercepts (shared fixtures have no agentic rows).
Verified live: llama70b -> 8K/1K, zero agentic calls, one benchmarks
fetch; dsr1 -> Agentic Traces, one fetch.
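
A pure helper in the spirit of the one described above might look like the sketch below. The `resolveEffectiveSequence` and `sequenceResolved` names come from the commit message, but the signature, the string-union `Sequence` type (standing in for the app's own enum), and the fallback order are assumptions.

```typescript
// Stand-in for the app's Sequence enum; values are illustrative.
type Sequence = "agentic-traces" | "8k/1k" | "1k/1k";

interface Resolution {
  sequence: Sequence;
  sequenceResolved: boolean;
}

// Placeholder shown while availability is still loading.
const PRE_AVAILABILITY_SEQUENCE: Sequence = "8k/1k";

function resolveEffectiveSequence(
  available: Sequence[] | undefined,
  preferred: Sequence = "agentic-traces",
): Resolution {
  // Availability not loaded yet: report a placeholder and stay unresolved,
  // so consumers gating on sequenceResolved do not fetch.
  if (available === undefined) {
    return { sequence: PRE_AVAILABILITY_SEQUENCE, sequenceResolved: false };
  }
  if (available.includes(preferred)) {
    return { sequence: preferred, sequenceResolved: true };
  }
  if (available.includes("8k/1k")) {
    return { sequence: "8k/1k", sequenceResolved: true };
  }
  // Nothing preferred is available: take the first option if there is one.
  if (available.length > 0) {
    return { sequence: available[0], sequenceResolved: true };
  }
  return { sequence: PRE_AVAILABILITY_SEQUENCE, sequenceResolved: false };
}
```

Keeping the helper pure means the "flash then snap" behaviour can be unit-tested without rendering the chart.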
The public conversation search embedded user input in ILIKE unescaped
and uncapped: '%' matched every row and long stacked-wildcard patterns
could push Neon to statement timeout (500s). escapeLikePattern escapes
backslash-first then %/_ so searches are literal substring matches
(now agreeing exactly with the dump-mode mirror's .includes semantics);
the route trims and rejects >100 chars with 400 before touching the DB.
Live: ?search=%25 30 -> 0 rows; 150-char input -> 400; real searches
unchanged. Adds 14 tests.
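
The escaping order described above (backslash first, then `%` and `_`) can be sketched as below. `escapeLikePattern` is named in the commit message; its body here, the `validateSearch` wrapper, and the result shape are assumptions, and the sketch presumes backslash is the LIKE escape character.

```typescript
// Escape backslashes first so the escapes added for % and _ are not
// themselves re-escaped; the result matches the input literally.
function escapeLikePattern(input: string): string {
  return input.replace(/\\/g, "\\\\").replace(/[%_]/g, (m) => `\\${m}`);
}

const MAX_SEARCH_LENGTH = 100;

type SearchResult = { ok: true; pattern: string } | { ok: false; status: 400 };

// Trim, reject over-long input before touching the database, then wrap the
// escaped text in wildcards for a substring match.
function validateSearch(raw: string): SearchResult {
  const trimmed = raw.trim();
  if (trimmed.length > MAX_SEARCH_LENGTH) return { ok: false, status: 400 };
  return { ok: true, pattern: `%${escapeLikePattern(trimmed)}%` };
}
```

The pattern should still be passed to the query as a bound parameter; escaping only controls wildcard semantics, not SQL injection.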
…overlay-mode e2e

Adds the AGENTS.md-required track() calls (agentic_siblings_navigated,
datasets_conversations_page_changed, agentic_chart_expanded) to the
three untracked interaction clusters, and the mandated overlay-path
regression coverage: ttft-x-axis-toggle gains three tests loading an
?unofficialrun= overlay, switching to the ttft x-axis mode (overlay
points still render), and asserting the normalized-e2e suppression
banner. Cypress 8/8.
Playwright MCP page snapshots contain HTML-entity-escaped class strings;
Tailwind 4's auto content detection (which respects gitignore) scanned
them and emitted unresolvable mask-image classes, 500ing the dev server.
Affects any Playwright-MCP-driven review session incl. the @claude CI
review flow.
// fixed-seq-only model while the chart shows its loading skeleton. `8k/1k` is
// the pre-agentic default for non-agentic models. Consumers that must not act on
// an unresolved sequence gate on `sequenceResolved` instead.
const PRE_AVAILABILITY_SEQUENCE = Sequence.EightK_OneK;

cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 08ea28f. Configure here.

<main className="relative">
<div className="container mx-auto px-4 pb-8 lg:px-8">
<Suspense>
<ConversationView slug={slug} convId={decodeURIComponent(convId)} />

Double decode breaks conv IDs

Medium Severity

The new dataset conversation page and its API route both call decodeURIComponent on the dynamic convId segment. App Router already supplies decoded path params, so a second decode corrupts conversation IDs that contain % and can throw URIError on malformed sequences, causing 404s or failed API lookups for otherwise valid traces.


Reviewed by Cursor Bugbot for commit 08ea28f. Configure here.
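
The double-decode problem can be reproduced in isolation. The sketch below assumes, as the comment states, that the router hands the page a param that has already been decoded once; `routerParam` and `decodeAgain` are hypothetical names used only to simulate the two steps.

```typescript
// Simulates the framework: the raw URL segment is decoded exactly once.
function routerParam(rawSegment: string): string {
  return decodeURIComponent(rawSegment);
}

// The flagged pattern: decoding an already-decoded param a second time.
// Returns null where decodeURIComponent would throw URIError.
function decodeAgain(param: string): string | null {
  try {
    return decodeURIComponent(param);
  } catch (e) {
    if (e instanceof URIError) return null;
    throw e;
  }
}
```

With an ID of `a%2Fb`, the single decode returns the ID intact, while a second decode silently turns it into `a/b`; with `100%done`, the second decode throws. Passing the param through unchanged avoids both outcomes.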
