feat: add self-hosted LLM load test suite (ENG-1107) #3
Open
pandeymangg wants to merge 6 commits into
End-to-end load test tooling for the Hub embeddings pipeline (TEI + hub-worker + River + pgvector), covering the four scenarios from the ENG load-test ticket:

- direct TEI baseline (no Hub in path)
- async enrichment via POST /v1/feedback-records + hub-worker
- query-embedding via POST /v1/feedback-records/search/semantic
- cold-start and rollout behavior

What's included:

- k6/*: three scenario scripts with profile-driven shapes (smoke/baseline/mid/burst/sweep/warm/busy) and a structured PASS/FAIL summary block.
- data/*: CSV multilingual augmentation (en/de/es/fr/ja from a static phrase pool) and CSV → JSON payload bucketing for k6 SharedArray.
- collectors/*: River queue-depth poller, kubectl top + events poller, post-run SQL for end-to-end embedding latency and 768-dim + model-name verification.
- run-k6.sh / run-cold-start.sh / run-full-suite.sh: wrappers that prefer a local k6 binary, fall back to Docker, and orchestrate collectors + scenarios + artifact bundling into report/runs/<ts>/.
- report/template.md: fills directly against the ticket's acceptance criteria (p50/p95/p99, throughput, error rate, queue depth, CPU/mem, restarts, cold-start, AWS cost estimate, Helm defaults verdict).

The scripts read env vars (HUB_URL, HUB_API_KEY, TEI_URL, TEI_API_KEY, MODEL, DATABASE_URL, NAMESPACE, POD_SELECTOR) so the same code drives staging and production EKS runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
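The "profile-driven shapes" mentioned in the commit message above could be wired up roughly as follows. This is a minimal sketch, not the suite's actual code: the profile names come from the commit message, while `PROFILES`, `stagesFor`, and every duration and VU target are hypothetical placeholders.

```javascript
// Hypothetical profile table. Names are taken from the commit message;
// durations and VU targets are illustrative placeholders only.
const PROFILES = {
  smoke: [{ duration: '30s', target: 1 }],
  baseline: [
    { duration: '1m', target: 5 },
    { duration: '5m', target: 5 },
  ],
  burst: [
    { duration: '10s', target: 50 },
    { duration: '1m', target: 50 },
    { duration: '10s', target: 0 },
  ],
};

// Resolve a profile name to a list of k6 ramping stages,
// failing loudly on a typo instead of silently running nothing.
function stagesFor(profile) {
  const stages = PROFILES[profile];
  if (!stages) {
    const known = Object.keys(PROFILES).join(', ');
    throw new Error(`unknown profile "${profile}", expected one of: ${known}`);
  }
  return stages;
}

// In a k6 script the result would feed the exported options object, e.g.
//   export const options = { stages: stagesFor(__ENV.PROFILE || 'smoke') };
```

Keeping the shapes in one table means a wrapper script only has to pass a profile name through an env var, and an unknown name aborts the run before any load is generated.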
- Add "Run the full test" section at the top with the complete command sequence so reviewers can grasp the flow without scrolling through the per-step subsections.
- Document the required CSV schema (columns + filtering rule) and remove the misleading "dataset shipped at" wording: the PR does not ship a CSV; the operator supplies one via --input.
- Add Step 5 explaining how to fill report/template.md from the artifacts in report/runs/<ts>/, with a section-to-artifact mapping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai pls review 🙏
BhagyaAmarasinghe
approved these changes
Jun 19, 2026
Summary
Load test scaffolding for the self-hosted Qwen/vLLM runtime that powers Formbricks AI (translation, chart generation, survey generation). Companion to `scripts/embeddings/`; mirrors its layout and operator workflow.

Implements ENG-1107. Blocked on ENG-1103 (model sizing) and ENG-1105 (EU EKS vLLM deployment) for the actual staging run; the suite itself is deployment-agnostic and can run the day staging is up.
What's exercised
- `k6/llm-direct-translation.js` (`/v1/chat/completions`)
- `k6/llm-direct-surveygen.js`
- `k6/llm-direct-chartgen.js`
- `k6/web-survey-gen.js`
- `run-full-suite.sh mixed`
- `run-cold-start.sh`

The three direct-LLM scripts each test a different token shape: prefill-heavy (chart-gen), decode-heavy (survey-gen), and balanced (translation). This way vLLM is stressed across the workload profiles it'll see in production, not just one.
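To make the token-shape distinction concrete, here is a rough sketch of how a response could be labelled from its token counts. It assumes an OpenAI-style chat completion body with a `usage` block (`prompt_tokens`, `completion_tokens`); the function name `tokenShape` and the 4x ratio threshold are hypothetical and chosen only for illustration.

```javascript
// Derive token counts, decode throughput, and a coarse shape label from an
// OpenAI-style chat completion response body.
// Assumption: the 4x ratio threshold is arbitrary, not a value from the suite.
function tokenShape(body, elapsedMs) {
  const usage = body.usage || {};
  const promptTokens = usage.prompt_tokens || 0;
  const completionTokens = usage.completion_tokens || 0;

  // Completion tokens per second of wall-clock time for this request.
  const tokensPerSec = elapsedMs > 0 ? completionTokens / (elapsedMs / 1000) : 0;

  let shape = 'balanced';
  if (promptTokens === 0 && completionTokens === 0) {
    shape = 'unknown'; // no usage block, nothing to classify
  } else if (promptTokens >= 4 * completionTokens) {
    shape = 'prefill-heavy'; // long prompt, short answer
  } else if (completionTokens >= 4 * promptTokens) {
    shape = 'decode-heavy'; // short prompt, long answer
  }

  return { promptTokens, completionTokens, tokensPerSec, shape };
}
```

A long prompt with a short structured answer lands in the prefill-heavy bucket, while a short instruction that produces a long document lands in the decode-heavy one.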
What's in the PR
- […] (`smoke|baseline|mid|burst|stress`), with per-request `prompt_tokens` / `completion_tokens` / `tokens_per_sec` / `schema_valid` / `malformed_json` / `missing_keys` metrics on top of k6's built-in latency + throughput.
- […] `apps/web/modules/ee/ai-translation/lib/translate-fields.ts` and `apps/web/app/api/v3/surveys/generate/prompt.ts` so the LLM sees exactly what production sends.
- […] `/metrics` scraper (queue depth, KV cache %, preemptions, token counters), in-pod `nvidia-smi` poller (GPU util/memory/temp/power), kubectl-based k8s metrics poller.
- […] `k6`, fall back to Docker.

How an operator will use this
Full README covers env vars, profiles, metrics, and per-run artifact layout.
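As a rough illustration of what could sit behind the `schema_valid`, `malformed_json`, and `missing_keys` metrics listed under "What's in the PR", the sketch below classifies a single model reply. The function name `checkStructuredOutput` and the `requiredKeys` argument are hypothetical stand-ins for whatever schema the real scripts enforce.

```javascript
// Classify one model reply for the three structured-output counters.
// Assumption: a reply is "valid" when it parses as a JSON object and
// contains every key in `requiredKeys`; the real suite may check more.
function checkStructuredOutput(content, requiredKeys) {
  let parsed;
  try {
    parsed = JSON.parse(content);
  } catch (err) {
    // Not JSON at all: count it as malformed rather than as missing keys.
    return { schemaValid: false, malformedJson: true, missingKeys: [] };
  }

  const isObject =
    parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed);

  // A non-object (array, string, number) is treated as missing every key.
  const missingKeys = isObject
    ? requiredKeys.filter((key) => !(key in parsed))
    : requiredKeys.slice();

  return { schemaValid: missingKeys.length === 0, malformedJson: false, missingKeys };
}
```

In a k6 script, the boolean fields could be recorded with `Rate` metrics from `k6/metrics`, which would keep unparseable replies and incomplete replies as separate failure rates in the run summary.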
Note on chart-gen and translation
Both are Next.js Server Actions, not public HTTP routes. We test the LLM behavior under their workload shapes (identical prompts, identical structured-output schemas) by hitting vLLM directly. Web-pod overhead for those two is captured indirectly via the always-on `k8s-metrics.sh` poller.

The web-mediated path is only exercised for survey-gen, which is the one workflow with a real HTTP route (`POST /api/v3/surveys/generate`). Worth filing a follow-up to expose chart-gen / translation as HTTP routes too, both for load-test coverage and for third-party integrations.

Test plan
- `node --check`
- `bash -n`
- `baseline` on each scenario, then `mixed`, then fill `report/template.md`