feat: add self-hosted LLM load test suite (ENG-1107) by pandeymangg · Pull Request #3 · formbricks/performance-test-q2-2025

pandeymangg · 2026-06-11T11:23:09Z

Summary

Load test scaffolding for the self-hosted Qwen/vLLM runtime that powers Formbricks AI (translation, chart generation, survey generation). Companion to scripts/embeddings/; mirrors its layout and operator workflow.

Implements ENG-1107. Blocked on ENG-1103 (model sizing) and ENG-1105 (EU EKS vLLM deployment) for the actual staging run — the suite itself is deployment-agnostic and runs the day staging is up.

What's exercised

| Script | Workflow | Path |
| --- | --- | --- |
| `k6/llm-direct-translation.js` | AI survey translation | direct → vLLM `/v1/chat/completions` |
| `k6/llm-direct-surveygen.js` | AI survey generation | direct → vLLM |
| `k6/llm-direct-chartgen.js` | AI chart generation | direct → vLLM |
| `k6/web-survey-gen.js` | Survey gen end-to-end | Formbricks web → vLLM |
| `run-full-suite.sh mixed` | Concurrent mixed traffic | three scenarios in parallel |
| `run-cold-start.sh` | Cold start / rollout restart | vLLM deployment timing |

The three direct-LLM scripts each test a different token shape: prefill-heavy (chart-gen), decode-heavy (survey-gen), and balanced (translation). vLLM is therefore stressed across the workload profiles it will see in production, not just one.
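To make the three token shapes concrete, here is a minimal sketch of how a request could be labelled from its token counts. The `tokenShape` helper and its 3:1 cut-off are illustrative assumptions, not part of the suite.

```javascript
// Hypothetical helper: label a request's token shape from its usage counts.
// The 3:1 ratio cut-off is an illustrative assumption, not a value from the suite.
function tokenShape(promptTokens, completionTokens, ratio = 3) {
  if (promptTokens === 0 && completionTokens === 0) return "balanced";
  // Prompt dominates: most GPU time goes to processing the input (prefill).
  if (promptTokens >= completionTokens * ratio) return "prefill-heavy";
  // Output dominates: most GPU time goes to token-by-token generation (decode).
  if (completionTokens >= promptTokens * ratio) return "decode-heavy";
  return "balanced";
}

console.log(tokenShape(4000, 200)); // long prompt, short answer
console.log(tokenShape(300, 2500)); // short prompt, long answer
console.log(tokenShape(800, 700)); // comparable sizes
```

Grouping results by a label like this is one way to check that each script really produces the shape it is meant to.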

What's in the PR

- `k6/`: four scenario scripts, profile-driven (smoke|baseline|mid|burst|stress), with per-request prompt_tokens / completion_tokens / tokens_per_sec / schema_valid / malformed_json / missing_keys metrics on top of k6's built-in latency + throughput.
- `data/`: 37 hand-crafted fixtures. System prompts copied verbatim from apps/web/modules/ee/ai-translation/lib/translate-fields.ts and apps/web/app/api/v3/surveys/generate/prompt.ts so the LLM sees exactly what production sends.
- `collectors/`: vLLM Prometheus /metrics scraper (queue depth, KV cache %, preemptions, token counters), in-pod nvidia-smi poller (GPU util/memory/temp/power), kubectl-based k8s metrics poller.
- run wrappers: single-scenario, full-suite, and cold-start. Prefer native k6, fall back to Docker.
- `report/template.md`: fill-in-the-blank report mirroring ENG-1107's "Required Outputs" and "Acceptance Criteria" 1:1, with a section→artifact mapping so the operator never has to figure out where a number lives.
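The per-request metrics listed for `k6/` could be derived roughly as below. This is a sketch under assumptions: it takes an OpenAI-compatible chat completion body (`usage.prompt_tokens`, `usage.completion_tokens`, `choices[0].message.content`), and `scoreCompletion` and its `requiredKeys` check are hypothetical stand-ins for whatever schema validation the scripts actually perform.

```javascript
// Sketch: derive per-request quality metrics from an OpenAI-compatible
// chat completion body. `requiredKeys` is a hypothetical stand-in for a
// real schema check; the suite's own validation may differ.
function scoreCompletion(body, durationMs, requiredKeys) {
  const usage = (body && body.usage) || {};
  const promptTokens = usage.prompt_tokens || 0;
  const completionTokens = usage.completion_tokens || 0;

  const choice = body && body.choices && body.choices[0];
  const content = (choice && choice.message && choice.message.content) || "";

  // The model is expected to return a JSON document as its message content.
  let parsed = null;
  let malformedJson = false;
  try {
    parsed = JSON.parse(content);
  } catch (err) {
    malformedJson = true;
  }

  const isObject =
    parsed !== null && typeof parsed === "object" && !Array.isArray(parsed);
  const missingKeys = isObject
    ? requiredKeys.filter((key) => !(key in parsed))
    : requiredKeys.slice();

  return {
    prompt_tokens: promptTokens,
    completion_tokens: completionTokens,
    // Decode throughput for this request: generated tokens per wall-clock second.
    tokens_per_sec: durationMs > 0 ? completionTokens / (durationMs / 1000) : 0,
    schema_valid: !malformedJson && isObject && missingKeys.length === 0,
    malformed_json: malformedJson,
    missing_keys: missingKeys.length,
  };
}

// Example with a made-up response body.
const sampleBody = {
  usage: { prompt_tokens: 100, completion_tokens: 50 },
  choices: [{ message: { content: '{"name":"NPS","questions":[]}' } }],
};
console.log(scoreCompletion(sampleBody, 2000, ["name", "questions"]));
```

Keeping `malformed_json` and `missing_keys` separate from `schema_valid` makes it possible to tell truncated output apart from well-formed output that omits fields.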

How an operator will use this

```bash
# Env (against staging EKS)
export LLM_URL=http://localhost:8000/v1 LLM_API_KEY=... MODEL=Qwen/Qwen2.5-7B-Instruct
export RESPONSE_FORMAT=json_schema VLLM_METRICS_URL=http://localhost:8000/metrics
export FORMBRICKS_URL=https://app.staging.formbricks.com FORMBRICKS_API_KEY=fbk_... FORMBRICKS_WORKSPACE_ID=...
export NAMESPACE=formbricks-stage POD_SELECTOR='app.kubernetes.io/name in (vllm,formbricks-web)'

cd scripts/llm
PROFILE=baseline ./run-full-suite.sh translation
PROFILE=baseline ./run-full-suite.sh surveygen
PROFILE=baseline ./run-full-suite.sh chartgen
PROFILE=baseline ./run-full-suite.sh web
PROFILE=mid      ./run-full-suite.sh mixed
PROFILE=stress   ./run-full-suite.sh surveygen
./run-cold-start.sh cache-warm

cp report/template.md report/runs/<ts>/report.md   # fill from artifacts
```

Full README covers env vars, profiles, metrics, and per-run artifact layout.
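For orientation, the endpoint behind `VLLM_METRICS_URL` serves Prometheus text-format output, which a collector has to reduce to a handful of numbers. The sketch below shows one way to do that. `parseMetrics` is a hypothetical helper, and the metric names in the sample are assumptions: vLLM's exposed names vary between versions, so check them against the live `/metrics` output.

```javascript
// Sketch: pull selected samples out of Prometheus text-format output.
// Values are summed across label sets, which is only meaningful when a
// single model (one label set per metric) is being served.
function parseMetrics(text, wanted) {
  const totals = {};
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue; // skip HELP/TYPE lines
    // Sample line layout: name{labels} value [timestamp]
    const match = trimmed.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/);
    if (!match) continue;
    const name = match[1];
    const value = Number(match[3]);
    if (!wanted.includes(name) || Number.isNaN(value)) continue;
    totals[name] = (totals[name] || 0) + value;
  }
  return totals;
}

// Made-up scrape; metric names are assumed, not confirmed by the PR.
const sampleScrape = [
  "# HELP vllm:num_requests_waiting Number of requests waiting to be processed.",
  "# TYPE vllm:num_requests_waiting gauge",
  'vllm:num_requests_waiting{model_name="Qwen/Qwen2.5-7B-Instruct"} 4.0',
  'vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen2.5-7B-Instruct"} 0.62',
  'vllm:num_preemptions_total{model_name="Qwen/Qwen2.5-7B-Instruct"} 3.0',
].join("\n");

console.log(
  parseMetrics(sampleScrape, [
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
    "vllm:num_preemptions_total",
  ])
);
```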

Note on chart-gen and translation

Both are Next.js Server Actions, not public HTTP routes. We test the LLM behavior under their workload shapes (identical prompts, identical structured-output schemas) by hitting vLLM directly. Web-pod overhead for those two is captured indirectly via the always-on k8s-metrics.sh poller.
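As an illustration of "identical prompts, identical structured-output schemas", a direct request to vLLM's `/v1/chat/completions` could be assembled as follows. This is a sketch: `buildChatRequest`, the placeholder prompt and schema, and the `temperature` and `max_tokens` values are assumptions. The `response_format` layout follows the OpenAI-compatible `json_schema` convention implied by `RESPONSE_FORMAT=json_schema`.

```javascript
// Sketch: build a chat completion request body with a structured-output schema.
// All argument values used in the example are placeholders, not the real
// production prompts or schemas.
function buildChatRequest({ model, systemPrompt, userContent, schemaName, schema, maxTokens = 1024 }) {
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt }, // copied verbatim from the app in the real suite
      { role: "user", content: userContent },
    ],
    temperature: 0, // assumption: deterministic output keeps runs comparable
    max_tokens: maxTokens,
    response_format: {
      type: "json_schema",
      json_schema: { name: schemaName, schema, strict: true },
    },
  };
}

const exampleRequest = buildChatRequest({
  model: "Qwen/Qwen2.5-7B-Instruct",
  systemPrompt: "<system prompt placeholder>",
  userContent: "<fixture payload placeholder>",
  schemaName: "example_output",
  schema: {
    type: "object",
    properties: { title: { type: "string" } },
    required: ["title"],
  },
});
console.log(JSON.stringify(exampleRequest, null, 2));
```

Because the body is built from the same prompt text and schema the Server Action would send, the LLM sees the same workload even though the web pod is bypassed.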

The web-mediated path is only exercised for survey-gen, which is the one workflow with a real HTTP route (POST /api/v3/surveys/generate). Worth filing a follow-up to expose chart-gen / translation as HTTP routes too — both for load-test coverage and for third-party integrations.

Test plan

- All k6 + lib JS files pass `node --check`
- All shell scripts pass `bash -n`
- All three fixture JSON files parse
- No leftover local-only references in README, comments, or scripts
- Once EKS staging vLLM exists: smoke run, then baseline on each scenario, then mixed, then fill report/template.md
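The smoke, baseline, and mixed runs above all go through the same profile switch. A minimal sketch of how a profile name might map to k6 `stages` is shown below. The profile names come from this PR, but every duration and target, and the `resolveStages` helper itself, are hypothetical.

```javascript
// Hypothetical profile table: names match the PR, numbers are illustrative only.
const PROFILES = {
  smoke: [{ duration: "30s", target: 1 }],
  baseline: [
    { duration: "1m", target: 5 },
    { duration: "5m", target: 5 },
  ],
  mid: [
    { duration: "2m", target: 20 },
    { duration: "5m", target: 20 },
  ],
  burst: [
    { duration: "10s", target: 50 },
    { duration: "1m", target: 50 },
    { duration: "10s", target: 0 },
  ],
  stress: [
    { duration: "5m", target: 100 },
    { duration: "10m", target: 100 },
  ],
};

// Resolve a profile name to its stages, failing loudly on typos so a
// mistyped PROFILE never silently runs the wrong load shape.
function resolveStages(name) {
  const key = name || "smoke";
  if (!Object.prototype.hasOwnProperty.call(PROFILES, key)) {
    throw new Error(
      `Unknown PROFILE "${key}"; expected one of: ${Object.keys(PROFILES).join("|")}`
    );
  }
  return PROFILES[key];
}

console.log(resolveStages("baseline"));
```

In a k6 script the name would typically come from `__ENV.PROFILE` and the result would be assigned to `options.stages`.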

End-to-end load test tooling for the Hub embeddings pipeline (TEI + hub-worker + River + pgvector), covering the four scenarios from the ENG load-test ticket:

- direct TEI baseline (no Hub in path)
- async enrichment via POST /v1/feedback-records + hub-worker
- query-embedding via POST /v1/feedback-records/search/semantic
- cold-start and rollout behavior

What's included:

- k6/*: three scenario scripts with profile-driven shapes (smoke/baseline/mid/burst/sweep/warm/busy) and a structured PASS/FAIL summary block.
- data/*: CSV multilingual augmentation (en/de/es/fr/ja from a static phrase pool) and CSV → JSON payload bucketing for k6 SharedArray.
- collectors/*: River queue-depth poller, kubectl top + events poller, post-run SQL for end-to-end embedding latency and 768-dim + model-name verification.
- run-k6.sh / run-cold-start.sh / run-full-suite.sh: wrappers that prefer a local k6 binary, fall back to Docker, and orchestrate collectors + scenarios + artifact bundling into report/runs/<ts>/.
- report/template.md: fills directly against the ticket's acceptance criteria (p50/p95/p99, throughput, error rate, queue depth, CPU/mem, restarts, cold-start, AWS cost estimate, Helm defaults verdict).

The scripts read env vars (HUB_URL, HUB_API_KEY, TEI_URL, TEI_API_KEY, MODEL, DATABASE_URL, NAMESPACE, POD_SELECTOR) so the same code drives staging and production EKS runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Add "Run the full test" section at the top with the complete command sequence so reviewers can grasp the flow without scrolling through the per-step subsections.
- Document the required CSV schema (columns + filtering rule) and remove the misleading "dataset shipped at" wording: the PR does not ship a CSV; the operator supplies one via --input.
- Add Step 5 explaining how to fill report/template.md from the artifacts in report/runs/<ts>/, with a section-to-artifact mapping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pandeymangg · 2026-06-11T11:29:15Z

@coderabbitai pls review 🙏

pandeymangg and others added 6 commits May 13, 2026 18:30

fix(embeddings): harden staging load test runner

8e2a12b

test(embeddings): add staging load test artifacts

e32272e

Merge branch 'main' into feat/llm-load-test

3e11f21

adds scripts, collectors and data for llm load test

028dff2

pandeymangg requested a review from BhagyaAmarasinghe June 11, 2026 11:23

BhagyaAmarasinghe approved these changes Jun 19, 2026

BhagyaAmarasinghe enabled auto-merge June 19, 2026 13:51

pandeymangg wants to merge 6 commits into main from feat/llm-load-test
