feat: add self-hosted LLM load test suite (ENG-1107) #3

Open
pandeymangg wants to merge 6 commits into main from feat/llm-load-test


pandeymangg (Contributor) commented Jun 11, 2026

Summary

Load test scaffolding for the self-hosted Qwen/vLLM runtime that powers Formbricks AI (translation, chart generation, survey generation). Companion to scripts/embeddings/; mirrors its layout and operator workflow.

Implements ENG-1107. Blocked on ENG-1103 (model sizing) and ENG-1105 (EU EKS vLLM deployment) for the actual staging run — the suite itself is deployment-agnostic and runs the day staging is up.

What's exercised

| Script | Workflow | Path |
| --- | --- | --- |
| k6/llm-direct-translation.js | AI survey translation | direct → vLLM /v1/chat/completions |
| k6/llm-direct-surveygen.js | AI survey generation | direct → vLLM |
| k6/llm-direct-chartgen.js | AI chart generation | direct → vLLM |
| k6/web-survey-gen.js | Survey gen end-to-end | Formbricks web → vLLM |
| run-full-suite.sh mixed | Concurrent mixed traffic | three scenarios in parallel |
| run-cold-start.sh | Cold start / rollout restart | vLLM deployment timing |

The three direct-LLM scripts each test a different token shape: prefill-heavy (chart-gen), decode-heavy (survey-gen), and balanced (translation). vLLM is therefore stressed across the workload profiles it will see in production, not just one.
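To make the token-shape distinction concrete, here is a minimal sketch of how a run's shape could be checked from the `usage` object that an OpenAI-compatible `/v1/chat/completions` response returns. The helper name `tokenShape` and the 4x ratio cut-off are illustrative assumptions, not values taken from this PR.

```javascript
// Hypothetical helper: classify a request's token shape from the `usage`
// object of an OpenAI-compatible chat-completions response.
// The default 4x ratio is an illustrative assumption, not a value from the PR.
function tokenShape(usage, ratio = 4) {
  const prompt = usage.prompt_tokens;
  const completion = usage.completion_tokens;
  // Many more input tokens than output tokens: dominated by prefill.
  if (prompt >= completion * ratio) return "prefill-heavy";
  // Many more output tokens than input tokens: dominated by decode.
  if (completion >= prompt * ratio) return "decode-heavy";
  return "balanced";
}
```

Running this over the recorded prompt_tokens / completion_tokens of each scenario would be one way to confirm that the fixtures really produce the intended shape.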

What's in the PR

  • k6/ — four scenario scripts, profile-driven (smoke|baseline|mid|burst|stress), with per-request prompt_tokens / completion_tokens / tokens_per_sec / schema_valid / malformed_json / missing_keys metrics on top of k6's built-in latency + throughput.
  • data/ — 37 hand-crafted fixtures. System prompts copied verbatim from apps/web/modules/ee/ai-translation/lib/translate-fields.ts and apps/web/app/api/v3/surveys/generate/prompt.ts so the LLM sees exactly what production sends.
  • collectors/ — vLLM Prometheus /metrics scraper (queue depth, KV cache %, preemptions, token counters), in-pod nvidia-smi poller (GPU util/memory/temp/power), kubectl-based k8s metrics poller.
  • run wrappers — single-scenario, full-suite, and cold-start. Prefer native k6, fall back to Docker.
  • report/template.md — fill-in-the-blank report mirroring ENG-1107's "Required Outputs" and "Acceptance Criteria" 1:1, with a section→artifact mapping so the operator never has to figure out where a number lives.
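The per-request metrics listed under k6/ can be derived from the response body alone. The following is a rough sketch under assumptions: `requestMetrics` and `requiredKeys` are hypothetical names, and `schema_valid` here only checks that the content is a JSON object with the expected top-level keys, which may be looser than what the actual scripts do.

```javascript
// Hypothetical sketch: derive per-request metrics from an OpenAI-compatible
// chat-completions response body. Names and the key-presence check are
// assumptions; the real k6 scripts may validate more strictly.
function requestMetrics(body, durationMs, requiredKeys) {
  const usage = body.usage || {};
  const completionTokens = usage.completion_tokens || 0;
  const choice = (body.choices || [])[0];
  const content = choice && choice.message ? choice.message.content : "";

  let parsed = null;
  let malformedJson = false;
  try {
    parsed = JSON.parse(content);
  } catch (_err) {
    malformedJson = true; // the model returned something that is not JSON
  }

  const isObject =
    parsed !== null && typeof parsed === "object" && !Array.isArray(parsed);
  const missingKeys = isObject
    ? requiredKeys.filter((key) => !(key in parsed))
    : requiredKeys.slice();

  return {
    prompt_tokens: usage.prompt_tokens || 0,
    completion_tokens: completionTokens,
    // Completion tokens divided by wall-clock request time in seconds.
    tokens_per_sec: durationMs > 0 ? completionTokens / (durationMs / 1000) : 0,
    malformed_json: malformedJson,
    missing_keys: missingKeys,
    schema_valid: !malformedJson && isObject && missingKeys.length === 0,
  };
}
```

In a k6 script the returned values would typically be fed into custom Trend, Rate, or Counter metrics; that wiring is omitted here so the function stays runnable outside k6.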

How an operator will use this

# Env (against staging EKS)
export LLM_URL=http://localhost:8000/v1 LLM_API_KEY=... MODEL=Qwen/Qwen2.5-7B-Instruct
export RESPONSE_FORMAT=json_schema VLLM_METRICS_URL=http://localhost:8000/metrics
export FORMBRICKS_URL=https://app.staging.formbricks.com FORMBRICKS_API_KEY=fbk_... FORMBRICKS_WORKSPACE_ID=...
export NAMESPACE=formbricks-stage POD_SELECTOR='app.kubernetes.io/name in (vllm,formbricks-web)'

cd scripts/llm
PROFILE=baseline ./run-full-suite.sh translation
PROFILE=baseline ./run-full-suite.sh surveygen
PROFILE=baseline ./run-full-suite.sh chartgen
PROFILE=baseline ./run-full-suite.sh web
PROFILE=mid      ./run-full-suite.sh mixed
PROFILE=stress   ./run-full-suite.sh surveygen
./run-cold-start.sh cache-warm

cp report/template.md report/runs/<ts>/report.md   # fill from artifacts

Full README covers env vars, profiles, metrics, and per-run artifact layout.
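VLLM_METRICS_URL in the environment above points at vLLM's Prometheus text endpoint. As a sketch of what a collector does with that output, the function below sums samples for a chosen set of metric names. `parseMetrics` is a hypothetical name, and the vLLM metric names in the usage note are assumptions that vary between vLLM versions, so check them against the deployed `/metrics` output.

```javascript
// Hypothetical parser for Prometheus text exposition output.
// Sums all samples (across label sets) for each metric name in `wanted`.
function parseMetrics(text, wanted) {
  const totals = {};
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comment lines.
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    // Sample line shape: name{labels} value [timestamp]
    const match = trimmed.match(/^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)/);
    if (!match) continue;
    const name = match[1];
    if (!wanted.includes(name)) continue;
    const value = Number(match[3]);
    // Prometheus "NaN" and "+Inf" do not parse to finite JS numbers; skip them.
    if (!Number.isFinite(value)) continue;
    totals[name] = (totals[name] || 0) + value;
  }
  return totals;
}
```

For example, `parseMetrics(body, ["vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc", "vllm:num_preemptions_total"])` would give queue depth, KV cache usage, and preemptions for one scrape, assuming those metric names match the running vLLM version.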

Note on chart-gen and translation

Both are Next.js Server Actions, not public HTTP routes. We test the LLM behavior under their workload shapes (identical prompts, identical structured-output schemas) by hitting vLLM directly. Web-pod overhead for those two is captured indirectly via the always-on k8s-metrics.sh poller.
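A direct request of this kind might be assembled as in the sketch below. The `response_format` shape follows the OpenAI-compatible chat-completions convention; the function name `buildChatRequest` and its parameter names are hypothetical, and the real scripts may set additional sampling parameters.

```javascript
// Hypothetical builder for a direct /v1/chat/completions request body.
// `responseFormat` mirrors the RESPONSE_FORMAT env var from the operator setup;
// all parameter names here are assumptions made for illustration.
function buildChatRequest({ model, systemPrompt, userPrompt, schemaName, schema, responseFormat }) {
  const body = {
    model,
    messages: [
      { role: "system", content: systemPrompt }, // copied verbatim from the app
      { role: "user", content: userPrompt },     // taken from a fixture
    ],
  };
  if (responseFormat === "json_schema") {
    // Structured output constrained to the same schema production uses.
    body.response_format = {
      type: "json_schema",
      json_schema: { name: schemaName, schema },
    };
  } else if (responseFormat === "json_object") {
    body.response_format = { type: "json_object" };
  }
  return body;
}
```

Keeping the prompt and schema as inputs means the same builder can serve translation, chart-gen, and survey-gen fixtures without changes.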

The web-mediated path is only exercised for survey-gen, which is the one workflow with a real HTTP route (POST /api/v3/surveys/generate). Worth filing a follow-up to expose chart-gen / translation as HTTP routes too — both for load-test coverage and for third-party integrations.

Test plan

  • All k6 + lib JS files pass node --check
  • All shell scripts pass bash -n
  • All three fixture JSON files parse
  • No leftover local-only references in README, comments, or scripts
  • Once EKS staging vLLM exists: smoke run, then baseline on each scenario, then mixed, then fill report/template.md

pandeymangg and others added 6 commits May 13, 2026 18:30
End-to-end load test tooling for the Hub embeddings pipeline (TEI +
hub-worker + River + pgvector), covering the four scenarios from the
ENG load-test ticket:

- direct TEI baseline (no Hub in path)
- async enrichment via POST /v1/feedback-records + hub-worker
- query-embedding via POST /v1/feedback-records/search/semantic
- cold-start and rollout behavior

What's included:

- k6/* — three scenario scripts with profile-driven shapes
  (smoke/baseline/mid/burst/sweep/warm/busy) and a structured
  PASS/FAIL summary block.
- data/* — CSV multilingual augmentation (en/de/es/fr/ja from a
  static phrase pool) and CSV → JSON payload bucketing for k6
  SharedArray.
- collectors/* — River queue-depth poller, kubectl top + events
  poller, post-run SQL for end-to-end embedding latency and
  768-dim + model-name verification.
- run-k6.sh / run-cold-start.sh / run-full-suite.sh — wrappers that
  prefer a local k6 binary, fall back to Docker, and orchestrate
  collectors + scenarios + artifact bundling into report/runs/<ts>/.
- report/template.md — fills directly against the ticket's
  acceptance criteria (p50/p95/p99, throughput, error rate, queue
  depth, CPU/mem, restarts, cold-start, AWS cost estimate, Helm
  defaults verdict).

The scripts read env vars (HUB_URL, HUB_API_KEY, TEI_URL,
TEI_API_KEY, MODEL, DATABASE_URL, NAMESPACE, POD_SELECTOR) so the
same code drives staging and production EKS runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add "Run the full test" section at the top with the complete command
  sequence so reviewers can grasp the flow without scrolling through
  the per-step subsections.
- Document the required CSV schema (columns + filtering rule) and
  remove the misleading "dataset shipped at" wording — the PR does
  not ship a CSV; the operator supplies one via --input.
- Add Step 5 explaining how to fill report/template.md from the
  artifacts in report/runs/<ts>/, with a section-to-artifact mapping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pandeymangg (Contributor, Author) commented

@coderabbitai pls review 🙏
