feat: agentx by cquil11 · Pull Request #348 · SemiAnalysisAI/InferenceX-app

cquil11 · 2026-05-14T15:19:52Z

Adds agentic trace-replay benchmarks (AgentX) to the dashboard: closed-loop replays of real multi-turn agent conversations (with subagent fan-out), alongside the existing fixed-seq-len sweeps. History is squashed into the logical commit groups below.

Database schema (`feat(db): agentic benchmark schema`)

New migrations (007_agentic.sql … 009_dataset_request_stats.sql):

benchmark_results extended for agentic rows
- benchmark_type = 'agentic_traces'; isl/osl nullable (traces have no fixed sequence length)
- offload_mode column added to the unique key (nulls not distinct) — the same config can run with KV offload on/off
- trace_replay_id FK into the new sidecar table
- metrics JSONB gains trace-derived keys: full latency percentile ladders (mean/median/p75/p90/p95/p99/p99.9 × ttft/tpot/itl/e2el), *_intvty (interactivity = 1/ITL), per-GPU throughputs, gpu_kv_cache_usage_pct, server/theoretical cache-hit rates, token totals
agentic_trace_replay — one row per benchmark point, keeps 100 MB+ payloads out of the hot query path
- raw blobs: profile_export_jsonl_gz (every request the load generator sent) and server_metrics_json_gz (vLLM/SGLang Prometheus scrape summary)
- precomputed, version-stamped JSONBs derived from those blobs at ingest time (see “time-series data” below): chart_series, request_timeline, aggregate_stats
datasets / dataset_conversations — the source trace datasets (HF id, slug, variant) and per-conversation structure: turn/subagent tree, token counts, timing, cache-hit distributions
run_datasets — links each workflow run to the dataset it replayed, so timeline → dataset deep links resolve
latest_benchmarks materialized view updated to include agentic rows (one run per line)

Time-series & per-request data ingested per run

chart_series (1 Hz server-side time series, one point per second of the run): KV-cache utilization (aggregate, per-DP-rank/engine, and host/CPU-offload for hicache), prefix-cache hit rate and hit-rate tokens/sec, queue depth (running/waiting), prefill and decode tokens/sec, prompt tokens by source — plus the same series grouped by server source for disagg runs
request_timeline (one record per HTTP request, ns-resolution): conversation id + turn index, worker id, subagent depth, warmup/profiling phase, credit-issued/start/ack/end timestamps, TTFT/TPOT, ISL/OSL, cancellation flag, and raw-source provenance (srcTrace/srcOuter/srcInner/srcKind) mapping every replayed request back to the exact original dataset request
aggregate_stats — percentile envelopes precomputed across sibling configs so the cross-config aggregate views don’t re-stream blobs
Payloads are versioned (CHART_SERIES_VERSION, REQUEST_TIMELINE_VERSION, STATS_VERSION): version-mismatched rows recompute from the raw blob on demand, and db:backfill-* CLIs re-materialize in bulk after an algorithm change
Per run this also ingests: server logs per point, availability rows, changelog entries, and the run→dataset link
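
The version-stamp fallback described above can be sketched as follows. This is a minimal illustration rather than the PR's query layer: `StoredPayload`, `readVersioned`, and the constant's value are hypothetical, and only the name `CHART_SERIES_VERSION` comes from the description.

```typescript
// Assumed value for illustration; the description does not state the real one.
const CHART_SERIES_VERSION = 3;

// Hypothetical shape of a precomputed payload with its version stamp.
interface StoredPayload<T> {
  version: number;
  data: T;
}

// Serve the stored payload only when its stamp matches the current algorithm
// version; otherwise recompute from the raw blob via the supplied callback.
function readVersioned<T>(
  stored: StoredPayload<T> | null,
  currentVersion: number,
  recompute: () => T,
): { data: T; recomputed: boolean } {
  if (stored !== null && stored.version === currentVersion) {
    return { data: stored.data, recomputed: false };
  }
  return { data: recompute(), recomputed: true };
}

// A matching stamp is served as-is; a stale stamp triggers the recompute path.
const freshRead = readVersioned({ version: 3, data: [1, 2] }, CHART_SERIES_VERSION, () => [9]);
const staleRead = readVersioned({ version: 2, data: [1, 2] }, CHART_SERIES_VERSION, () => [9]);
```

In this sketch the caller decides what to do with a recomputed result; bulk re-materialization would remain the job of the `db:backfill-*` CLIs.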

Remaining commit groups

chore: deps/toolchain — stream-json (streaming 100 MB+ blobs without ERR_STRING_TOO_LONG), adm-zip, audit overrides
feat(db): ETL — maps aiperf artifacts across all three agg-schema generations (v3: nested request/server metrics, pre-inverted interactivity, cluster:-scoped hw ids, results/server.log); computes the three JSONB payloads at ingest
feat(db): query layer + CLIs — fast-path queries over precomputed JSONBs with version-checked blob fallback (guarding Neon’s 64 MiB response cap), shared backfill runner behind the four db:backfill-* scripts, ingest-weka-dataset for loading source datasets
ci: ingest workflow — repository_dispatch: ingest-agentic-results so the main InferenceX repo triggers ingestion after an AgentX sweep (mirrors the fixed-seq-len flow; 60-min timeout for blob uploads; Slack alert when a run references a dataset missing from datasets)
feat(api): endpoints + hooks — bulk-id v1 routes (aggregates, derived metrics, request timelines, server metrics, histograms, siblings, datasets) with shared route/hook factories for a uniform error contract and caching; matching React Query hooks
feat(datasets): dataset browser — dataset list/detail with token & cache-hit distributions; per-conversation flamegraph showing the turn/subagent timing tree; accepts ?turn/raw/inner/sa deep-link params to highlight the exact request
feat(agentic): per-point detail — summary cards; Gantt request timeline (per-conversation or per-worker rows, subagent/aux lane nesting, stable row colors across phase toggle, shift+scroll zoom, click-through to the original dataset request); rolling/cumulative time-series charts (interactivity, TTFT/E2E, KV usage, throughput, in-flight ISL/OSL); cross-config aggregates. Exists to debug why identical configs diverge under sustained load
feat(inference): dashboard integration — Agentic Traces as a scenario in the main chart: trace-derived x-axis modes, agentic-aware rooflines/tooltips/labels, click-through to point detail; works for official runs and ?unofficialrun= overlays
test: coverage + docs — Cypress e2e/component specs for timeline, time-series, datasets, and dashboard integration; data-pipeline docs

Diff breakdown

Category	Added	Removed
Code (app + db + constants)	+15,815	−764
Tests (vitest + Cypress)	+5,129	−133
CI / agent docs	+376	−0
Docs	+12	−0
Lockfile	+36	−0

~24% of the diff is test coverage.

Verification

Typecheck, oxlint, oxfmt, 2,727 unit tests, production build, Cypress component (163) and targeted e2e — all pass
ETL recompute verified byte-identical against stored payloads for three trace replays (no backfill required)
Known pre-existing e2e failures (not introduced by this PR, tracked for follow-up):
- url-params.cy.ts (2): Historical Trends tab defaults to Agentic Traces, which has no trend data, so the legend/high-contrast toggle never renders
- agentic-point-time-series.cy.ts (3): stale point-count/label expectations from earlier chart changes

Note

High Risk
Large cross-cutting release touching production ingest workflows, Neon writes, and versioned blob caches; regressions could corrupt live benchmark data, leave it stale, or break the main inference chart for mixed agentic/fixed-seq models.

Overview
Delivers Agentic Traces as a first-class benchmark scenario: closed-loop replays of real multi-turn agent conversations, with datasets, per-point drill-down, and main-chart integration alongside fixed sequence-length sweeps.

Ingest & CI — Adds ingest-agentic-results (repository dispatch, 60m timeout, multi-target DB secrets, Slack on failure/unmapped entities) plus an ingest agent doc for manual Neon ingests (changelog mandatory, cache purge, interactivity normalization). Data-pipeline docs now cover orchestrator adapters for server metrics and automatic run_datasets linking from artifact provenance.

API layer — New /api/v1/* routes for datasets (list/detail/conversations), agentic blobs (aggregates, derived metrics, request timeline, server metrics, histograms), trace availability, and benchmark siblings; shared id-routes factories and version-derived blob cache keys (with tests so backfills don’t serve stale JSON forever).

Datasets UI — /datasets registry and detail pages with distribution cards (P50–P95 guides), paginated conversation search (100-char cap), and a trace flamegraph (subagent expand/collapse, parallel-overlap brackets, deep links from agentic timeline via ?turn/raw/inner/sa).

Inference dashboard — Renames sequence → scenario selector; gates fetches on sequenceResolved so models without agentic data don’t flash Agentic Traces; adds x-axis modes (interactivity, TTFT, E2E, normalized E2E @ 400 tokens), legend points table → /inference/agentic/[id], unofficial-run rows carrying benchmark_type/offload_mode, and chart default tweaks (high contrast on, parallelism labels on, line labels off). Agentic detail and GPU-compare tooltips expose normal links to point charts.

Tests & tooling — Large Cypress component/e2e coverage (datasets, flamegraph, agentic time series, orchestrator metric sources, overlay x-axis); .eslintignore for .claude/worktrees/; optional NEXT_DIST_DIR for a second dev server.

Reviewed by Cursor Bugbot for commit 08ea28f. Bugbot is set up for automated code reviews on this repo.

vercel · 2026-05-14T15:19:57Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
inferencemax-app	Ready	Preview, Comment	Jul 3, 2026 3:55am

Master shipped its own 007 (latest_benchmarks single-run-per-line, #491) while this branch carried 007_agentic — two migrations with the same number. Renumber the branch set to 008_agentic / 009_latest_benchmarks_single_run_per_line / 010_dataset_request_stats so a fresh deploy applies them strictly after master's lineage; 009 supersedes master's 007 with the offload_mode-aware view definition.

Each inference legend row gets a table icon (visible on hover/focus, faint otherwise) that opens a dialog listing every currently-visible point for that hardware/framework series: concurrency, parallelism, offload, tput/GPU, p50/p90 interactivity and TTFT, sorted by concurrency with sortable columns. Rows link the same way scatter points do — agentic points to their per-point detail page, fixed-seq points to the GitHub Actions run — as real anchors so open-in-new-tab works. Unofficial-run overlay series get the same table (metrics only; overlay points have no stored benchmark rows) respecting activeOverlayHwTypes and overlayRunColor.

extractTurn guarded isl<=0 but not osl<=0, so cancelled/empty-output turns collapsed the whole decode window into one ITL interval and the @400-token projection became ttft + 399x(latency-ttft) — ~386x inflation baked into stored p75/p90 aggregates (seeded repro: p90 1104.78s -> 6.01s). STATS_VERSION bumped 4->5 so stored payloads recompute via the version fallback. Adds regression test.
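
A minimal sketch of the guard, assuming the projection is TTFT plus 399 inter-token intervals and that ITL is the decode window divided by `max(osl - 1, 1)`. The `Turn` shape and `projectE2eAt400` are illustrative names, not the PR's `extractTurn`.

```typescript
// Hypothetical per-turn record; times are in seconds.
interface Turn {
  isl: number;
  osl: number;
  ttftS: number;
  latencyS: number;
}

const TARGET_TOKENS = 400;

// Project end-to-end latency to a fixed 400-token output.
// Returns null for turns that cannot yield a meaningful ITL.
function projectE2eAt400(turn: Turn): number | null {
  // Without the osl check, osl = 0 gives one interval below, so the result
  // becomes ttft + 399 * (latency - ttft), the inflation described above.
  if (turn.isl <= 0 || turn.osl <= 0) return null;
  const intervals = Math.max(turn.osl - 1, 1);
  const itlS = (turn.latencyS - turn.ttftS) / intervals;
  return turn.ttftS + (TARGET_TOKENS - 1) * itlS;
}

// 100 intervals over a 2 s decode window: 1 + 399 * 0.02, about 8.98 s.
const normalTurn = projectE2eAt400({ isl: 1000, osl: 101, ttftS: 1, latencyS: 3 });
// A cancelled turn with no output is excluded instead of being projected.
const cancelledTurn = projectE2eAt400({ isl: 1000, osl: 0, ttftS: 1, latencyS: 3 });
```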

DISTINCT ON (config_id, conc, isl, osl) collapsed agentic offload on/off variants (isl/osl both NULL) into one arbitrary winner, so run views silently dropped half the sweep (seeded repro: 2 rows -> 4). Adds offload_mode to the SQL DISTINCT ON + ORDER BY and to the json-provider dedup key (normalized ?? 'off' to match lineKey). Every other selection path already keyed on it. Adds 4 regression tests.
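
The dedup-key change can be illustrated with a small sketch. It assumes a row shape with the columns named above plus a `date` used to pick the winner; `BenchRow`, `dedupKey`, and `latestPerKey` are hypothetical names, not the json-provider's actual code.

```typescript
// Hypothetical row shape; agentic rows carry null isl/osl.
interface BenchRow {
  config_id: number;
  conc: number;
  isl: number | null;
  osl: number | null;
  offload_mode: string | null;
  date: string; // ISO date, so string comparison orders chronologically
}

// Include offload_mode in the key, normalizing null to 'off' so that
// offload on/off variants of the same config stay distinct.
function dedupKey(row: BenchRow): string {
  return [row.config_id, row.conc, row.isl ?? 'null', row.osl ?? 'null', row.offload_mode ?? 'off'].join('|');
}

// Keep the most recent row per key.
function latestPerKey(rows: BenchRow[]): BenchRow[] {
  const best: Record<string, BenchRow> = {};
  for (const row of rows) {
    const key = dedupKey(row);
    const prev = best[key];
    if (prev === undefined || row.date > prev.date) best[key] = row;
  }
  const out: BenchRow[] = [];
  for (const key of Object.keys(best)) {
    const row = best[key];
    if (row !== undefined) out.push(row);
  }
  return out;
}

const sweepRows: BenchRow[] = [
  { config_id: 1, conc: 32, isl: null, osl: null, offload_mode: 'on', date: '2026-07-01' },
  { config_id: 1, conc: 32, isl: null, osl: null, offload_mode: 'off', date: '2026-07-01' },
  { config_id: 1, conc: 32, isl: null, osl: null, offload_mode: null, date: '2026-06-01' },
];
const dedupedRows = latestPerKey(sweepRows);
```

Here the `on` and `off` rows both survive, while the older row with a null `offload_mode` collapses into the `off` key.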

Four blob-cached agentic routes had unversioned cache keys; blobSet is write-once and backfills never purge, so payload-version bumps served stale blobs indefinitely (the DB version check is bypassed on blob hits). Keys now derive from the governing VERSION constants (STATS/REQUEST_TIMELINE/CHART_SERIES), asserted by tests. Stale/missing recomputes now persist their result via a best-effort fire-and-forget ::jsonb write-back (no-ops on read replicas), so one request self-heals a row instead of re-gunzipping the raw blob until a manual backfill. STATS_VERSION moves to the dependency-free agentic-shared leaf to avoid an import cycle. Live-verified: stored payloads healed 4->5 / 11->12 / 4->5 on a single query.
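
Why a version-derived key avoids stale hits in a write-once cache can be shown with a short sketch. The key format, `blobCacheKey`, and the in-memory `blobStore` are assumptions for illustration; only `STATS_VERSION` and the 4 to 5 bump come from the commit messages.

```typescript
const STATS_VERSION = 5; // bumped 4 -> 5 per the commit messages above

// Hypothetical key format: the payload version is part of the key.
function blobCacheKey(route: string, id: number, version: number): string {
  return `${route}:v${version}:${id}`;
}

// Stand-in for a write-once blob cache: existing keys are never overwritten.
const blobStore: Record<string, string> = {};
function blobSetOnce(key: string, value: string): void {
  if (!(key in blobStore)) blobStore[key] = value;
}

// A payload cached under the previous version...
blobSetOnce(blobCacheKey('aggregates', 42, 4), 'old-stats');

// ...is simply not found under the new version's key, so the request falls
// through to the database path instead of serving the stale blob.
const currentKey = blobCacheKey('aggregates', 42, STATS_VERSION);
const hitBeforeRecompute = blobStore[currentKey];
blobSetOnce(currentKey, 'new-stats');
const hitAfterRecompute = blobStore[currentKey];
```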

cursor · 2026-07-02T23:26:34Z

+                    <td className="px-3 py-2">
+                      <Link
+                        href={`/datasets/${slug}/conversations/${c.conv_id}`}
+                        onClick={() => track('datasets_conversation_clicked', { slug })}


Conversation links break special IDs

Medium Severity

The dataset conversation table builds href values by interpolating conv_id directly into the path without encoding. Conversation IDs that contain %, /, or other reserved characters produce malformed routes or wrong IDs after routing, so deep links from the list can 404 or open the wrong conversation.

Reviewed by Cursor Bugbot for commit bd0a490.
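
One way to address the finding above is to percent-encode each dynamic segment when building the link. This is a sketch under that assumption; `conversationHref` is a hypothetical helper, not code from the PR.

```typescript
// Build the conversation link with each dynamic segment percent-encoded,
// so reserved characters in an ID cannot change the route structure.
function conversationHref(slug: string, convId: string): string {
  return `/datasets/${encodeURIComponent(slug)}/conversations/${encodeURIComponent(convId)}`;
}

// An ID containing "/" and "%" stays a single path segment.
const specialHref = conversationHref('my-dataset', 'a/b%c');
// "/datasets/my-dataset/conversations/a%2Fb%25c"
```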

…ions The head commit message usually describes an unrelated code change; the workflow display name describes the sweep itself.

The failed-run guard required num_requests_total > 0, so a config whose server never came up (total = 0, e.g. dep4 conc32 in run 28617267459) slipped through as a dataless point. Any row explicitly reporting zero successful requests is a failure regardless of how many were issued.
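
The difference between the old and new guard can be sketched as below. The field `num_requests_successful` and both function names are hypothetical; only `num_requests_total` appears in the commit message.

```typescript
// Hypothetical counters reported for one benchmark point.
interface RequestCounts {
  num_requests_total: number;
  num_requests_successful: number; // assumed field name
}

// Old behaviour as described: only flagged when requests were actually issued.
function isFailedRunOld(c: RequestCounts): boolean {
  return c.num_requests_total > 0 && c.num_requests_successful === 0;
}

// New behaviour: zero successful requests is a failure, however many were issued.
function isFailedRun(c: RequestCounts): boolean {
  return c.num_requests_successful === 0;
}

// A server that never came up issues no requests at all.
const serverNeverCameUp: RequestCounts = { num_requests_total: 0, num_requests_successful: 0 };
const missedByOldGuard = !isFailedRunOld(serverNeverCameUp);
const caughtByNewGuard = isFailedRun(serverNeverCameUp);
```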

The documented DUMP_DIR mode 500'd on every new surface: the four new tables (agentic_trace_replay, datasets, dataset_conversations, run_datasets) were missing from TABLE_INSERT_ORDER so dumps never carried them, json-provider had no mirrors, and ten routes called getDb() with no JSON_MODE guard. Tables added in FK-safe order; bytea blobs round-trip through dump/load (Buffer JSON encoding, ::bytea decode); agentic_trace_replay lazy-loads like server_logs; mirrors reuse the same pure compute helpers as the SQL paths for version-stale fallbacks; all ten routes gain the standard JSON_MODE branch. Verified end-to-end: dump-mode server serves all ten endpoints 200, byte-identical to Postgres on 9/10 (remaining diffs are pre-existing benchmarks-mirror nuances). Adds 21 mirror tests.

The AgenticTraces default resolved before availability loaded (static SEQUENCE_OPTIONS fallback), so fixed-seq-only models flashed 'Agentic Traces', fired a wasted agentic fetch, then snapped to 1k/1k. New pure resolveEffectiveSequence helper (mirrors default-precisions pattern) returns the real scenario only once availability is known; benchmark fetching gates on the new sequenceResolved flag; non-agentic models fall back to 8k/1k (master's default) when available. Fixes the url-params and historical-trends e2e failures the PR description labels 'pre-existing' — they were caused by this default and now pass with no assertion changes. ttft-x-axis-toggle gets spec-scoped agentic intercepts (shared fixtures have no agentic rows). Verified live: llama70b -> 8K/1K, zero agentic calls, one benchmarks fetch; dsr1 -> Agentic Traces, one fetch.
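
A minimal sketch of a pure resolver in the spirit of `resolveEffectiveSequence`. The signature, the string-literal `Sequence` type, and the preference order are assumptions for illustration; the real helper and the `Sequence` enum may differ.

```typescript
// Hypothetical stand-in for the app's Sequence enum.
type Sequence = 'agentic-traces' | '8k/1k' | '1k/1k';

interface ResolvedSequence {
  sequence: Sequence;
  sequenceResolved: boolean;
}

// Placeholder shown while availability is still loading.
const PRE_AVAILABILITY_SEQUENCE: Sequence = '8k/1k';
// Assumed default order once availability is known.
const DEFAULT_PREFERENCE: Sequence[] = ['agentic-traces', '8k/1k'];

function resolveEffectiveSequence(
  selected: Sequence | null,
  available: Sequence[] | undefined,
): ResolvedSequence {
  // Availability not loaded yet: report a placeholder and keep fetches gated.
  if (available === undefined) {
    return { sequence: PRE_AVAILABILITY_SEQUENCE, sequenceResolved: false };
  }
  // Honour an explicit selection when the model actually has that scenario.
  if (selected !== null && available.indexOf(selected) !== -1) {
    return { sequence: selected, sequenceResolved: true };
  }
  // Otherwise take the first preferred scenario the model supports.
  for (const candidate of DEFAULT_PREFERENCE) {
    if (available.indexOf(candidate) !== -1) {
      return { sequence: candidate, sequenceResolved: true };
    }
  }
  return { sequence: available[0] ?? PRE_AVAILABILITY_SEQUENCE, sequenceResolved: true };
}

const whileLoading = resolveEffectiveSequence(null, undefined);
const fixedSeqOnly = resolveEffectiveSequence(null, ['8k/1k', '1k/1k']);
const withAgentic = resolveEffectiveSequence(null, ['agentic-traces', '8k/1k']);
```

Consumers would gate benchmark fetching on `sequenceResolved`, so the placeholder never triggers a request.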

The public conversation search embedded user input in ILIKE unescaped and uncapped: '%' matched every row and long stacked-wildcard patterns could push Neon to statement timeout (500s). escapeLikePattern escapes backslash-first then %/_ so searches are literal substring matches (now agreeing exactly with the dump-mode mirror's .includes semantics); the route trims and rejects >100 chars with 400 before touching the DB. Live: ?search=%25 30 -> 0 rows; 150-char input -> 400; real searches unchanged. Adds 14 tests.
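
The escaping order described above can be sketched as follows. The body of `escapeLikePattern` is a plausible reconstruction, not the PR's code, and `buildSearchPattern` with its return shape is hypothetical. The sketch relies on backslash being the escape character for the pattern, which is the PostgreSQL default for LIKE/ILIKE.

```typescript
const MAX_SEARCH_LENGTH = 100;

// Escape backslashes first, then the LIKE wildcards, so the backslashes
// added for % and _ are not themselves escaped a second time.
function escapeLikePattern(input: string): string {
  return input.replace(/\\/g, '\\\\').replace(/[%_]/g, '\\$&');
}

// Trim, enforce the length cap, then wrap the escaped text for a substring match.
function buildSearchPattern(raw: string): { status: 200 | 400; pattern: string | null } {
  const trimmed = raw.trim();
  if (trimmed.length > MAX_SEARCH_LENGTH) {
    return { status: 400, pattern: null };
  }
  return { status: 200, pattern: `%${escapeLikePattern(trimmed)}%` };
}

// A lone "%" now searches for a literal percent sign instead of matching every row.
const percentSearch = buildSearchPattern('%');
// Over-long input is rejected before any query is built.
const tooLongSearch = buildSearchPattern(new Array(151).join('a'));
```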

…overlay-mode e2e Adds the AGENTS.md-required track() calls (agentic_siblings_navigated, datasets_conversations_page_changed, agentic_chart_expanded) to the three untracked interaction clusters, and the mandated overlay-path regression coverage: ttft-x-axis-toggle gains three tests loading an ?unofficialrun= overlay, switching to the ttft x-axis mode (overlay points still render), and asserting the normalized-e2e suppression banner. Cypress 8/8.

@claude

Playwright MCP page snapshots contain HTML-entity-escaped class strings; Tailwind 4's auto content detection (which respects gitignore) scanned them and emitted unresolvable mask-image classes, 500ing the dev server. Affects any Playwright-MCP-driven review session incl. the @claude CI review flow.

+// fixed-seq-only model while the chart shows its loading skeleton. `8k/1k` is
+// the pre-agentic default for non-agentic models. Consumers that must not act on
+// an unresolved sequence gate on `sequenceResolved` instead.
+const PRE_AVAILABILITY_SEQUENCE = Sequence.EightK_OneK;


cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 08ea28f.

cursor · 2026-07-03T03:59:07Z

+    <main className="relative">
+      <div className="container mx-auto px-4 pb-8 lg:px-8">
+        <Suspense>
+          <ConversationView slug={slug} convId={decodeURIComponent(convId)} />


Double decode breaks conv IDs

Medium Severity

The new dataset conversation page and its API route both call decodeURIComponent on the dynamic convId segment. App Router already supplies decoded path params, so a second decode corrupts conversation IDs that contain % and can throw URIError on malformed sequences, causing 404s or failed API lookups for otherwise valid traces.

Additional Locations (1)

packages/app/src/app/api/v1/datasets/[slug]/conversations/[convId]/route.ts#L28-L29

Reviewed by Cursor Bugbot for commit 08ea28f.
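
The failure mode in this finding can be reproduced in isolation. The sketch assumes, as the review states, that the router hands the page an already-decoded segment; `storedId` is a made-up conversation ID and `throwsUriError` is a hypothetical helper.

```typescript
// Hypothetical conversation ID that literally contains the characters "%2F".
const storedId = 'run%2Fa';

// What a correctly encoded link puts in the URL path.
const pathSegment = encodeURIComponent(storedId); // "run%252Fa"

// One decode (done by the router, per the assumption above) restores the ID.
const routerParam = decodeURIComponent(pathSegment); // "run%2Fa"

// A second decode in the page or API route silently changes the ID.
const decodedTwice = decodeURIComponent(routerParam); // "run/a"

// An ID with a bare "%" makes the second decode throw instead.
function throwsUriError(value: string): boolean {
  try {
    decodeURIComponent(value);
    return false;
  } catch (err) {
    return err instanceof URIError;
  }
}
const bareePercentThrows = throwsUriError('100%');
```

Under that assumption, passing `convId` through unchanged is sufficient, and the explicit `decodeURIComponent` call is the part to drop.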

vercel Bot deployed to Preview May 14, 2026 15:36 View deployment

vercel Bot deployed to Preview May 15, 2026 17:26 View deployment

vercel Bot deployed to Preview May 15, 2026 17:28 View deployment

vercel Bot deployed to Preview May 15, 2026 17:31 View deployment

vercel Bot deployed to Preview May 15, 2026 17:32 View deployment

vercel Bot deployed to Preview May 15, 2026 17:33 View deployment

vercel Bot deployed to Preview May 15, 2026 17:39 View deployment

vercel Bot deployed to Preview May 15, 2026 17:42 View deployment

vercel Bot deployed to Preview May 15, 2026 17:46 View deployment

vercel Bot deployed to Preview May 15, 2026 17:48 View deployment

vercel Bot deployed to Preview May 15, 2026 22:05 View deployment

vercel Bot deployed to Preview May 15, 2026 22:08 View deployment

vercel Bot deployed to Preview May 15, 2026 22:16 View deployment

vercel Bot deployed to Preview May 20, 2026 23:54 View deployment

vercel Bot deployed to Preview May 21, 2026 00:02 View deployment

vercel Bot deployed to Preview May 21, 2026 01:10 View deployment

vercel Bot deployed to Preview May 21, 2026 01:16 View deployment

vercel Bot deployed to Preview May 21, 2026 04:19 View deployment

vercel Bot deployed to Preview May 21, 2026 04:21 View deployment

vercel Bot deployed to Preview May 21, 2026 04:43 View deployment

vercel Bot deployed to Preview May 21, 2026 04:49 View deployment

vercel Bot deployed to Preview May 21, 2026 05:12 View deployment

vercel Bot deployed to Preview May 21, 2026 05:21 View deployment

vercel Bot deployed to Preview May 21, 2026 05:24 View deployment

vercel Bot deployed to Preview May 21, 2026 05:30 View deployment

vercel Bot deployed to Preview May 21, 2026 20:41 View deployment

vercel Bot deployed to Preview May 21, 2026 21:14 View deployment

vercel Bot had a problem deploying to Preview May 21, 2026 22:04 Failure

vercel Bot deployed to Preview May 21, 2026 22:08 View deployment

cursor Bot reviewed Jul 2, 2026

Comment thread packages/app/src/components/GlobalFilterContext.tsx

Comment thread packages/app/src/components/inference/hooks/useChartData.ts

cquil11 and others added 6 commits July 2, 2026 15:10

chore: exclude package scratch dirs from typecheck

d6cf3a6

cursor Bot reviewed Jul 2, 2026

Comment thread packages/db/src/etl/benchmark-mapper.ts

ci: allow manual agentic ingest dispatch

2a14801

cursor Bot reviewed Jul 2, 2026

Comment thread packages/app/src/lib/url-state.ts

cquil11 added 2 commits July 2, 2026 18:09

ci: register agentic ingest workflow

a1e94d9

ci: use dev database for agentic ingest test

6d55b95

github-advanced-security AI found potential problems Jul 2, 2026

Comment thread .github/workflows/ingest-agentic-results.yml Fixed

ci: use dev write database for agentic ingest test

cc63a73

cursor Bot reviewed Jul 2, 2026

Comment thread packages/app/src/app/api/unofficial-run/route.ts

ci: skip ingest wait for manual dispatch

bd0a490

cursor Bot reviewed Jul 2, 2026

chore(db): log agentic ingest progress

ddd1a26

cursor Bot reviewed Jul 2, 2026

Comment thread .github/workflows/ingest-agentic-results.yml

cquil11 added 2 commits July 2, 2026 18:47

ci: select agentic ingest target

5fc051f

fix(ingest): prefer the workflow name for fallback changelog descript…

7118599

cursor Bot reviewed Jul 3, 2026

Comment thread packages/app/src/components/inference/utils/legend-points-table.ts

cquil11 and others added 6 commits July 2, 2026 20:03

github-code-quality Bot found potential problems Jul 3, 2026

cursor Bot reviewed Jul 3, 2026

cquil11 wants to merge 32 commits into master from feat/agentx

3 participants
