feat: agentx #348
Master shipped its own 007 (latest_benchmarks single-run-per-line, #491) while this branch carried 007_agentic — two migrations with the same number. Renumber the branch set to 008_agentic / 009_latest_benchmarks_single_run_per_line / 010_dataset_request_stats so a fresh deploy applies them strictly after master's lineage; 009 supersedes master's 007 with the offload_mode-aware view definition.
Each inference legend row gets a table icon (visible on hover/focus, faint otherwise) that opens a dialog listing every currently-visible point for that hardware/framework series: concurrency, parallelism, offload, tput/GPU, p50/p90 interactivity and TTFT, sorted by concurrency with sortable columns. Rows link the same way scatter points do — agentic points to their per-point detail page, fixed-seq points to the GitHub Actions run — as real anchors so open-in-new-tab works. Unofficial-run overlay series get the same table (metrics only; overlay points have no stored benchmark rows) respecting activeOverlayHwTypes and overlayRunColor.
extractTurn guarded isl<=0 but not osl<=0, so cancelled/empty-output turns collapsed the whole decode window into one ITL interval and the @400-token projection became ttft + 399x(latency-ttft) — ~386x inflation baked into stored p75/p90 aggregates (seeded repro: p90 1104.78s -> 6.01s). STATS_VERSION bumped 4->5 so stored payloads recompute via the version fallback. Adds regression test.
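A minimal sketch of the guarded projection described above, assuming a simplified turn shape. The `Turn` interface, its field names, and `projectLatencyAt400` are hypothetical and are not the repository's `extractTurn`. It also assumes that a turn with a single output token has no measurable inter-token gap.

```typescript
// Hypothetical turn shape; field names are assumptions for this sketch.
interface Turn {
  isl: number; // input tokens
  osl: number; // output tokens
  ttftS: number; // time to first token, seconds
  latencyS: number; // end-to-end latency, seconds
}

const PROJECTED_TOKENS = 400;

// Returns the projected end-to-end latency at 400 output tokens, or null when
// the turn cannot yield a meaningful inter-token latency (ITL).
function projectLatencyAt400(turn: Turn): number | null {
  // Guard both lengths: an empty-output turn has no decode window to average.
  if (turn.isl <= 0 || turn.osl <= 0) return null;
  // Assumption of this sketch: one output token means no inter-token gap.
  if (turn.osl < 2) return null;
  // Spread the decode window over the (osl - 1) gaps between output tokens.
  const itl = (turn.latencyS - turn.ttftS) / (turn.osl - 1);
  return turn.ttftS + (PROJECTED_TOKENS - 1) * itl;
}
```

Without the `osl` guard, the whole decode window is treated as a single interval, which is how the projection degenerates to ttft + 399 x (latency - ttft).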
DISTINCT ON (config_id, conc, isl, osl) collapsed agentic offload on/off variants (isl/osl both NULL) into one arbitrary winner, so run views silently dropped half the sweep (seeded repro: 2 rows -> 4). Adds offload_mode to the SQL DISTINCT ON + ORDER BY and to the json-provider dedup key (normalized ?? 'off' to match lineKey). Every other selection path already keyed on it. Adds 4 regression tests.
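The dedup-key change can be sketched as follows. The `BenchRow` shape, `legacyKey`, `dedupKey`, and `dedupeRows` are hypothetical stand-ins for the json-provider code; only the column names and the `?? 'off'` normalization come from the commit message.

```typescript
// Hypothetical row shape; only the key columns matter for this sketch.
interface BenchRow {
  config_id: number;
  conc: number;
  isl: number | null;
  osl: number | null;
  offload_mode: string | null;
}

// Old key: agentic rows (isl/osl both null) collide across offload variants.
function legacyKey(row: BenchRow): string {
  return [row.config_id, row.conc, row.isl, row.osl].join("|");
}

// New key: offload_mode participates, normalized so null and 'off' agree.
// Array.prototype.join renders null entries as empty strings.
function dedupKey(row: BenchRow): string {
  return [row.config_id, row.conc, row.isl, row.osl, row.offload_mode ?? "off"].join("|");
}

// Keeps the first row seen per key, like DISTINCT ON over ordered input.
function dedupeRows(rows: BenchRow[], keyOf: (r: BenchRow) => string = dedupKey): BenchRow[] {
  const seen = new Map<string, BenchRow>();
  for (const row of rows) {
    const key = keyOf(row);
    if (!seen.has(key)) seen.set(key, row);
  }
  return Array.from(seen.values());
}
```

With two concurrencies each run with offload on and off, `legacyKey` keeps two rows and `dedupKey` keeps all four, matching the seeded repro in the message.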
Four blob-cached agentic routes had unversioned cache keys; blobSet is write-once and backfills never purge, so payload-version bumps served stale blobs indefinitely (the DB version check is bypassed on blob hits). Keys now derive from the governing VERSION constants (STATS/REQUEST_TIMELINE/CHART_SERIES), asserted by tests. Stale/missing recomputes now persist their result via a best-effort fire-and-forget ::jsonb write-back (no-ops on read replicas), so one request self-heals a row instead of re-gunzipping the raw blob until a manual backfill. STATS_VERSION moves to the dependency-free agentic-shared leaf to avoid an import cycle. Live-verified: stored payloads healed 4->5 / 11->12 / 4->5 on a single query.
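A sketch of a version-derived cache key, assuming a simple string format. The key layout and `statsBlobKey` are hypothetical; only the idea of deriving the key from `STATS_VERSION` (now 5) comes from the commit message.

```typescript
// Value taken from the 4->5 bump described in this PR.
const STATS_VERSION = 5;

// Hypothetical key layout: the payload version is part of the key, so a bump
// changes every key and write-once blobs from older versions are never hit.
function statsBlobKey(benchmarkId: number, version: number = STATS_VERSION): string {
  return `agentic-stats:v${version}:${benchmarkId}`;
}
```

Because the store is write-once and never purged, old blobs stay in place under their old keys; they are simply no longer addressed after a bump.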
<td className="px-3 py-2">
  <Link
    href={`/datasets/${slug}/conversations/${c.conv_id}`}
    onClick={() => track('datasets_conversation_clicked', { slug })}
Conversation links break special IDs
Medium Severity
The dataset conversation table builds href values by interpolating conv_id directly into the path without encoding. Conversation IDs that contain %, /, or other reserved characters produce malformed routes or wrong IDs after routing, so deep links from the list can 404 or open the wrong conversation.
Reviewed by Cursor Bugbot for commit bd0a490.
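One way to address the finding is to encode the ID when building the path. This is a sketch only: `conversationHref` is a hypothetical helper, and it leaves `slug` untouched because the review only flags `conv_id`.

```typescript
// Builds the conversation link with the ID percent-encoded, so reserved
// characters such as '/' and '%' stay inside a single path segment.
function conversationHref(slug: string, convId: string): string {
  return `/datasets/${slug}/conversations/${encodeURIComponent(convId)}`;
}
```

For example, an ID of `a/b%c` becomes the segment `a%2Fb%25c`, which decodes back to the original ID exactly once.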
…ions The head commit message usually describes an unrelated code change; the workflow display name describes the sweep itself.
The failed-run guard required num_requests_total > 0, so a config whose server never came up (total = 0, e.g. dep4 conc32 in run 28617267459) slipped through as a dataless point. Any row explicitly reporting zero successful requests is a failure regardless of how many were issued.
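The revised rule can be sketched like this. The row shape and the `num_requests_successful` field name are assumptions; only `num_requests_total` appears in the commit message.

```typescript
// Hypothetical result row; num_requests_successful is an assumed field name.
interface ResultRow {
  num_requests_total?: number;
  num_requests_successful?: number;
}

// A row that explicitly reports zero successful requests is a failure,
// whether or not any requests were issued (total may itself be 0).
function isFailedRun(row: ResultRow): boolean {
  return row.num_requests_successful === 0;
}
```

Rows that do not report a success count at all are left to other checks in this sketch.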
The documented DUMP_DIR mode 500'd on every new surface: the four new tables (agentic_trace_replay, datasets, dataset_conversations, run_datasets) were missing from TABLE_INSERT_ORDER so dumps never carried them, json-provider had no mirrors, and ten routes called getDb() with no JSON_MODE guard. Tables added in FK-safe order; bytea blobs round-trip through dump/load (Buffer JSON encoding, ::bytea decode); agentic_trace_replay lazy-loads like server_logs; mirrors reuse the same pure compute helpers as the SQL paths for version-stale fallbacks; all ten routes gain the standard JSON_MODE branch. Verified end-to-end: dump-mode server serves all ten endpoints 200, byte-identical to Postgres on 9/10 (remaining diffs are pre-existing benchmarks-mirror nuances). Adds 21 mirror tests.
The AgenticTraces default resolved before availability loaded (static SEQUENCE_OPTIONS fallback), so fixed-seq-only models flashed 'Agentic Traces', fired a wasted agentic fetch, then snapped to 1k/1k. New pure resolveEffectiveSequence helper (mirrors default-precisions pattern) returns the real scenario only once availability is known; benchmark fetching gates on the new sequenceResolved flag; non-agentic models fall back to 8k/1k (master's default) when available. Fixes the url-params and historical-trends e2e failures the PR description labels 'pre-existing' — they were caused by this default and now pass with no assertion changes. ttft-x-axis-toggle gets spec-scoped agentic intercepts (shared fixtures have no agentic rows). Verified live: llama70b -> 8K/1K, zero agentic calls, one benchmarks fetch; dsr1 -> Agentic Traces, one fetch.
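A rough sketch of the resolution logic, under stated assumptions: the `Sequence` string values, the `ResolvedSequence` shape, and the preference order are hypothetical, and the real `resolveEffectiveSequence` signature may differ. Only the 8k/1k pre-availability default and the `sequenceResolved` gate come from the PR.

```typescript
// Hypothetical scenario identifiers for this sketch.
type Sequence = "agentic-traces" | "8k/1k" | "1k/1k";

// Placeholder shown while availability is still loading.
const PRE_AVAILABILITY_SEQUENCE: Sequence = "8k/1k";

interface ResolvedSequence {
  sequence: Sequence;
  sequenceResolved: boolean; // consumers gate fetches on this flag
}

function resolveEffectiveSequence(
  selected: Sequence | undefined,
  available: Sequence[] | undefined,
): ResolvedSequence {
  // Availability unknown: report the placeholder and mark it unresolved.
  if (available === undefined) {
    return { sequence: PRE_AVAILABILITY_SEQUENCE, sequenceResolved: false };
  }
  // An explicit selection wins when the model actually has that scenario.
  if (selected !== undefined && available.includes(selected)) {
    return { sequence: selected, sequenceResolved: true };
  }
  // Assumed preference order: agentic when present, otherwise 8k/1k.
  const preference: Sequence[] = ["agentic-traces", "8k/1k"];
  const fallback =
    preference.find((s) => available.includes(s)) ?? available[0] ?? PRE_AVAILABILITY_SEQUENCE;
  return { sequence: fallback, sequenceResolved: true };
}
```

Keeping the helper pure means the "flash then snap" behaviour can be covered by unit tests on inputs alone, without rendering the chart.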
The public conversation search embedded user input in ILIKE unescaped and uncapped: '%' matched every row and long stacked-wildcard patterns could push Neon to statement timeout (500s). escapeLikePattern escapes backslash-first then %/_ so searches are literal substring matches (now agreeing exactly with the dump-mode mirror's .includes semantics); the route trims and rejects >100 chars with 400 before touching the DB. Live: ?search=%25 30 -> 0 rows; 150-char input -> 400; real searches unchanged. Adds 14 tests.
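The escaping order and length cap might look like the sketch below. `escapeLikePattern` is named in the commit message, but this body, `buildSearchPattern`, and the result type are assumptions; it also assumes the query relies on the default backslash escape character for ILIKE.

```typescript
const MAX_SEARCH_LENGTH = 100;

// Escape backslashes first, then the LIKE wildcards, so the backslashes
// added for % and _ are not themselves escaped a second time.
function escapeLikePattern(input: string): string {
  return input.replace(/\\/g, "\\\\").replace(/[%_]/g, "\\$&");
}

type SearchPattern = { ok: true; pattern: string } | { ok: false; status: 400 };

// Trims, enforces the cap before any DB work, and wraps the escaped text in
// wildcards so the search is a literal substring match.
function buildSearchPattern(raw: string): SearchPattern {
  const trimmed = raw.trim();
  if (trimmed.length > MAX_SEARCH_LENGTH) return { ok: false, status: 400 };
  return { ok: true, pattern: `%${escapeLikePattern(trimmed)}%` };
}
```

The resulting pattern is meant to be passed as a bound parameter, never concatenated into the SQL text.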
…overlay-mode e2e Adds the AGENTS.md-required track() calls (agentic_siblings_navigated, datasets_conversations_page_changed, agentic_chart_expanded) to the three untracked interaction clusters, and the mandated overlay-path regression coverage: ttft-x-axis-toggle gains three tests loading an ?unofficialrun= overlay, switching to the ttft x-axis mode (overlay points still render), and asserting the normalized-e2e suppression banner. Cypress 8/8.
Playwright MCP page snapshots contain HTML-entity-escaped class strings; Tailwind 4's auto content detection (which respects gitignore) scanned them and emitted unresolvable mask-image classes, 500ing the dev server. Affects any Playwright-MCP-driven review session incl. the @claude CI review flow.
// fixed-seq-only model while the chart shows its loading skeleton. `8k/1k` is
// the pre-agentic default for non-agentic models. Consumers that must not act on
// an unresolved sequence gate on `sequenceResolved` instead.
const PRE_AVAILABILITY_SEQUENCE = Sequence.EightK_OneK;
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 08ea28f.
<main className="relative">
  <div className="container mx-auto px-4 pb-8 lg:px-8">
    <Suspense>
      <ConversationView slug={slug} convId={decodeURIComponent(convId)} />
Double decode breaks conv IDs
Medium Severity
The new dataset conversation page and its API route both call decodeURIComponent on the dynamic convId segment. App Router already supplies decoded path params, so a second decode corrupts conversation IDs that contain % and can throw URIError on malformed sequences, causing 404s or failed API lookups for otherwise valid traces.
Reviewed by Cursor Bugbot for commit 08ea28f.
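The failure mode can be reproduced with plain `decodeURIComponent` calls. This sketch assumes, as the review states, that the page receives an already-decoded param; `tryDecode` is a hypothetical helper used only to make the thrown error observable.

```typescript
// Returns the decoded value, or null when decodeURIComponent throws URIError.
function tryDecode(value: string): string | null {
  try {
    return decodeURIComponent(value);
  } catch {
    return null;
  }
}

// A conversation ID that literally contains the characters "%25".
const originalId = "100%25off";

// What the page would receive after the framework decodes the URL segment once.
const receivedParam = tryDecode(encodeURIComponent(originalId));

// A second decode turns "%25" into "%", which changes the ID.
const decodedTwice = receivedParam === null ? null : tryDecode(receivedParam);
```

Here `receivedParam` equals the original ID while `decodedTwice` is `100%off`; an ID ending in a bare `%` would instead make the second decode throw.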


Adds agentic trace-replay benchmarks (AgentX) to the dashboard: closed-loop replays of real multi-turn agent conversations (with subagent fan-out), alongside the existing fixed-seq-len sweeps. History is squashed into the logical commit groups below.
Database schema (feat(db): agentic benchmark schema)
New migrations (007_agentic.sql…009_dataset_request_stats.sql):
- benchmark_results extended for agentic rows
  - benchmark_type = 'agentic_traces'; isl/osl nullable (traces have no fixed sequence length)
  - offload_mode column added to the unique key (nulls not distinct), since the same config can run with KV offload on or off
  - trace_replay_id FK into the new sidecar table
  - metrics JSONB gains trace-derived keys: full latency percentile ladders (mean/median/p75/p90/p95/p99/p99.9 × ttft/tpot/itl/e2el), *_intvty (interactivity = 1/ITL), per-GPU throughputs, gpu_kv_cache_usage_pct, server/theoretical cache-hit rates, token totals
- agentic_trace_replay: one row per benchmark point; keeps 100 MB+ payloads out of the hot query path
  - profile_export_jsonl_gz (every request the load generator sent) and server_metrics_json_gz (vLLM/SGLang Prometheus scrape summary)
  - chart_series, request_timeline, aggregate_stats
- datasets / dataset_conversations: the source trace datasets (HF id, slug, variant) and per-conversation structure: turn/subagent tree, token counts, timing, cache-hit distributions
- run_datasets: links each workflow run to the dataset it replayed, so timeline → dataset deep links resolve
- latest_benchmarks materialized view updated to include agentic rows (one run per line)

Time-series & per-request data ingested per run
- chart_series (1 Hz server-side time series, one point per second of the run): KV-cache utilization (aggregate, per-DP-rank/engine, and host/CPU-offload for hicache), prefix-cache hit rate and hit-rate tokens/sec, queue depth (running/waiting), prefill and decode tokens/sec, prompt tokens by source, plus the same series grouped by server source for disagg runs
- request_timeline (one record per HTTP request, ns-resolution): conversation id + turn index, worker id, subagent depth, warmup/profiling phase, credit-issued/start/ack/end timestamps, TTFT/TPOT, ISL/OSL, cancellation flag, and raw-source provenance (srcTrace/srcOuter/srcInner/srcKind) mapping every replayed request back to the exact original dataset request
- aggregate_stats: percentile envelopes precomputed across sibling configs so the cross-config aggregate views don't re-stream blobs
- Payloads are versioned (CHART_SERIES_VERSION, REQUEST_TIMELINE_VERSION, STATS_VERSION): version-mismatched rows recompute from the raw blob on demand, and db:backfill-* CLIs re-materialize in bulk after an algorithm change

Remaining commit groups
- chore: deps/toolchain: stream-json (streaming 100 MB+ blobs without ERR_STRING_TOO_LONG), adm-zip, audit overrides
- feat(db): ETL: maps aiperf artifacts across all three agg-schema generations (v3: nested request/server metrics, pre-inverted interactivity, cluster:-scoped hw ids, results/server.log); computes the three JSONB payloads at ingest
- feat(db): query layer + CLIs: fast-path queries over precomputed JSONBs with version-checked blob fallback (guarding Neon's 64 MiB response cap), shared backfill runner behind the four db:backfill-* scripts, ingest-weka-dataset for loading source datasets
- ci: ingest workflow: repository_dispatch: ingest-agentic-results so the main InferenceX repo triggers ingestion after an AgentX sweep (mirrors the fixed-seq-len flow; 60-min timeout for blob uploads; Slack alert when a run references a dataset missing from datasets)
- feat(api): endpoints + hooks: bulk-id v1 routes (aggregates, derived metrics, request timelines, server metrics, histograms, siblings, datasets) with shared route/hook factories for a uniform error contract and caching; matching React Query hooks
- feat(datasets): dataset browser: dataset list/detail with token & cache-hit distributions; per-conversation flamegraph showing the turn/subagent timing tree; accepts ?turn/raw/inner/sa deep-link params to highlight the exact request
- feat(agentic): per-point detail: summary cards; Gantt request timeline (per-conversation or per-worker rows, subagent/aux lane nesting, stable row colors across phase toggle, shift+scroll zoom, click-through to the original dataset request); rolling/cumulative time-series charts (interactivity, TTFT/E2E, KV usage, throughput, in-flight ISL/OSL); cross-config aggregates. Exists to debug why identical configs diverge under sustained load
- feat(inference): dashboard integration: Agentic Traces as a scenario in the main chart: trace-derived x-axis modes, agentic-aware rooflines/tooltips/labels, click-through to point detail; works for official runs and ?unofficialrun= overlays
- test: coverage + docs: Cypress e2e/component specs for timeline, time-series, datasets, and dashboard integration; data-pipeline docs

Diff breakdown
~24% of the diff is test coverage.
Verification
- url-params.cy.ts (2): Historical Trends tab defaults to Agentic Traces, which has no trend data, so the legend/high-contrast toggle never renders
- agentic-point-time-series.cy.ts (3): stale point-count/label expectations from earlier chart changes

Note
High Risk
Large cross-cutting release touching production ingest workflows, Neon writes, and versioned blob caches; regressions could corrupt or stale live benchmark data or break the main inference chart for mixed agentic/fixed-seq models.
Overview
Delivers Agentic Traces as a first-class benchmark scenario: closed-loop replays of real multi-turn agent conversations, with datasets, per-point drill-down, and main-chart integration alongside fixed sequence-length sweeps.
Ingest & CI: Adds ingest-agentic-results (repository dispatch, 60m timeout, multi-target DB secrets, Slack on failure/unmapped entities) plus an ingest agent doc for manual Neon ingests (changelog mandatory, cache purge, interactivity normalization). Data-pipeline docs now cover orchestrator adapters for server metrics and automatic run_datasets linking from artifact provenance.
API layer: New /api/v1/* routes for datasets (list/detail/conversations), agentic blobs (aggregates, derived metrics, request timeline, server metrics, histograms), trace availability, and benchmark siblings; shared id-routes factories and version-derived blob cache keys (with tests so backfills don't serve stale JSON forever).
Datasets UI: /datasets registry and detail pages with distribution cards (P50–P95 guides), paginated conversation search (100-char cap), and a trace flamegraph (subagent expand/collapse, parallel-overlap brackets, deep links from agentic timeline via ?turn/raw/inner/sa).
Inference dashboard: Renames sequence → scenario selector; gates fetches on sequenceResolved so models without agentic data don't flash Agentic Traces; adds x-axis modes (interactivity, TTFT, E2E, normalized E2E @ 400 tokens), legend points table → /inference/agentic/[id], unofficial-run rows carrying benchmark_type/offload_mode, and chart default tweaks (high contrast on, parallelism labels on, line labels off). Agentic detail and GPU-compare tooltips expose normal links to point charts.
Tests & tooling: Large Cypress component/e2e coverage (datasets, flamegraph, agentic time series, orchestrator metric sources, overlay x-axis); .eslintignore for .claude/worktrees/; optional NEXT_DIST_DIR for a second dev server.
Reviewed by Cursor Bugbot for commit 08ea28f.