feat(autoscaling): scale on request rate past the VPA ceiling, cap replicas at 3 by devantler · Pull Request #2030 · devantler-tech/platform

devantler · 2026-06-11T17:30:02Z

Summary

Implements "scale vertically up to a sane limit, then horizontally" and codifies the 3-pod-per-workload ceiling.

Why the PDB ask landed elsewhere

PodDisruptionBudget has no maxAvailable field — its API is only minAvailable / maxUnavailable, and a PDB only gates voluntary evictions; it cannot cap how many pods a controller creates. The repo's PDBs already follow the drain-safe maxUnavailable: 1 house pattern (completed by #1991), so no PDB was changed here. The "never more than three pods per Deployment/StatefulSet/ReplicaSet" rule is instead enforced where replica counts actually live:

- autoscaler ceilings: every scaler now tops out at 3, and
- validate-replica-ceiling: a new Audit-mode Kyverno ClusterPolicy (the inverse of validate-replica-floor) that flags any Deployment/StatefulSet/ReplicaSet declaring >3 replicas and any HPA / ScaledObject / HTTPScaledObject whose ceiling exceeds 3 (including KEDA's implicit default of 100 when the ceiling is unset). Audit, not Enforce, for the same reason as the floor policy: replicas are owned by KEDA/Flagger/HPA at runtime and an admission deny would fight them. Escape hatch: platform.devantler.tech/replica-ceiling: exempt.
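The rule the policy audits can be sketched in Python as a plain function. This is an illustration of the logic described above, not the Kyverno policy itself: the function names are hypothetical, the escape hatch is assumed to be an annotation, and the field paths (spec.replicas, spec.maxReplicas, spec.maxReplicaCount, spec.replicas.max) are the upstream ones as I understand them rather than anything copied from the repo.

```python
from typing import Any, Optional

CEILING = 3
KEDA_DEFAULT_MAX = 100  # implicit ceiling when a KEDA scaler leaves its max unset
EXEMPT_KEY = "platform.devantler.tech/replica-ceiling"  # assumed to be an annotation

WORKLOAD_KINDS = {"Deployment", "StatefulSet", "ReplicaSet"}


def effective_ceiling(obj: dict[str, Any]) -> Optional[int]:
    """Return the replica ceiling a manifest declares, or None if the kind is not covered."""
    kind = obj.get("kind")
    spec = obj.get("spec") or {}
    if kind in WORKLOAD_KINDS:
        # Kubernetes defaults spec.replicas to 1 when it is omitted.
        return spec.get("replicas", 1)
    if kind == "HorizontalPodAutoscaler":
        return spec.get("maxReplicas")
    if kind == "ScaledObject":
        return spec.get("maxReplicaCount", KEDA_DEFAULT_MAX)
    if kind == "HTTPScaledObject":
        return (spec.get("replicas") or {}).get("max", KEDA_DEFAULT_MAX)
    return None


def violates_ceiling(obj: dict[str, Any]) -> bool:
    """True when the object would be flagged by the audit rule sketched here."""
    annotations = (obj.get("metadata") or {}).get("annotations") or {}
    if annotations.get(EXEMPT_KEY) == "exempt":
        return False
    ceiling = effective_ceiling(obj)
    return ceiling is not None and ceiling > CEILING
```

The case worth noticing is the unset ceiling: a ScaledObject with an empty spec is flagged because its effective maximum is 100, not because it declares anything above 3.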

Horizontal scaling on a metric that complements VPA

The vertical axis already exists: auto-vpa right-sizes every workload in place up to maxAllowed (3 CPU / 6Gi) — that is the "sane limit". The horizontal trigger must therefore be a metric VPA does not control, or the two fight. The platform's established answer is HTTP request rate, and this PR completes it:

| Workload | Before | After |
| --- | --- | --- |
| homepage | pinned 3 replicas, no scaler | KEDA `ScaledObject` 2–3 on Coroot RPS, via Flagger `autoscalerRef` |
| umami | pinned 3 replicas, no scaler (`# No autoscalerRef`) | same pattern, 2–3 on RPS |
| whoami, fleetdm | HTTPScaledObject 0–2 | ceiling raised to 3 |
| headlamp, actual-budget, infra UIs | max 1 | unchanged: single-writer PVC / in-memory-session apps cannot run concurrent pods (vertical-only) |

For homepage/umami this uses the pattern that docs/progressive-delivery.md documented but never used: a named prometheus trigger on container_http_inbound_requests_total (Coroot's eBPF node-agent series, the same source as the Flagger MetricTemplates), referenced from the Canary via autoscalerRef. Flagger clones each ScaledObject to <name>-primary and pauses the source's scaler at 0 between rollouts, preserving today's behavior. The clone's query is rewritten for -primary- pods via primaryScalerQueries, because the source query's vowel-free hash regex cannot match -primary-. The obsolete postRenderer replicas pins were removed. The fallback: replicas: 3 setting fails high to the old static HA count if Prometheus is unreachable.

The mechanics: when sustained load outgrows what right-sized pods absorb, RPS-per-pod crosses the per-pod threshold (homepage 10 rps — single-threaded Next.js SSR; umami 50 rps — cheap /api/send ingest) and the 2→3 scale-out fires. Thresholds are marked for tuning against live coroot-prometheus data. Two replicas still survive a node drain (PDB maxUnavailable: 1 + topology spread); the steady-state drop from 3→2 returns memory on a memory-tight cluster.
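The 2→3 arithmetic can be sketched as follows. This assumes the trigger query returns total inbound RPS for the workload and that the prometheus trigger is left at KEDA's default AverageValue metric type, so the HPA divides the total by the per-pod threshold; it ignores the HPA's tolerance band and stabilization windows. The function names are hypothetical; the thresholds and the 2–3 bounds are the ones given above.

```python
import math


def desired_replicas(total_rps: float, per_pod_threshold: float,
                     min_replicas: int = 2, max_replicas: int = 3) -> int:
    """Replica count for an AverageValue target, clamped to the scaler's bounds."""
    raw = math.ceil(total_rps / per_pod_threshold)
    return max(min_replicas, min(max_replicas, raw))


def disruptions_allowed(healthy_pods: int, expected_pods: int,
                        max_unavailable: int = 1) -> int:
    """Voluntary evictions a maxUnavailable PDB permits at this moment."""
    desired_healthy = expected_pods - max_unavailable
    return max(0, healthy_pods - desired_healthy)
```

Under these assumptions homepage stays at 2 replicas up to 20 rps total and scales to 3 above that, while umami's 50 rps threshold puts the same boundary at 100 rps. With 2 healthy pods and maxUnavailable: 1, disruptions_allowed(2, 2) is 1, which is why a node drain can still proceed at the lower steady-state count.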

Network path: allow-keda gains egress to observability:9090 and allow-coroot gains the matching ingress — mirroring the existing Flagger/OpenCost rules.

Open question on the "5"

The request said maxAvailable: 5 but also "never more than three pods" — since the field doesn't exist, I implemented the explicit 3-pod rule throughout. If 5 was meant as the autoscaler ceiling instead, it's a one-line change in each scaler plus the policy's three value: 3 comparisons.

Validation

- ksail workload validate: green (304 files)
- ksail --config ksail.prod.yaml workload validate: sole failure is the pre-existing upstream datreeio CRDs-catalog coroot schema gap (notificationIntegrations, fixed by CRDs-catalog#896), unrelated
- kubectl kustomize of both cluster overlays + providers/hetzner/{apps,infrastructure} and bases/infrastructure/controllers: all build; ScaledObjects, autoscalerRefs, netpol rules, and the new ClusterPolicy verified in the rendered output
- Existing prod state audited: nothing currently exceeds 3 replicas (kyverno HA = 3, overprovisioning RS = 1), so the Audit policy starts clean

⚠️ Watch the first prod reconcile: Flagger will create the two -primary ScaledObjects and KEDA takes over primary replica counts (expected steady-state 2 unless RPS demands 3). The PromQL is written against Coroot's documented schema but not validated against live data — same caveat as the existing MetricTemplates.

🤖 Generated with Claude Code

…plicas at 3

- homepage/umami: KEDA ScaledObjects (min 2 / max 3) on Coroot's eBPF inbound request-rate series, wired through Flagger's autoscalerRef so the primary scales and the canary source stays paused at 0 between rollouts; drop the now-obsolete postRenderer replicas pins
- whoami/fleetdm: raise HTTPScaledObject ceilings 2 -> 3
- netpols: KEDA operator -> coroot-prometheus:9090 (egress + ingress)
- validate-replica-ceiling (Audit): flags workloads declaring >3 replicas and autoscaler ceilings >3. A PDB cannot express this (it has no maxAvailable field and only gates voluntary evictions), so the cap lives on the autoscalers and is audited by policy.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

botantler · 2026-06-11T18:00:50Z

🎉 This PR is included in version 1.52.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

github-project-automation Bot added this to 🌊 Project Board Jun 11, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 11, 2026

devantler temporarily deployed to ci June 11, 2026 17:30 — with GitHub Actions Inactive

devantler marked this pull request as ready for review June 11, 2026 17:35

devantler enabled auto-merge June 11, 2026 17:35

devantler added this pull request to the merge queue Jun 11, 2026

Merged via the queue into main with commit 88827c6 Jun 11, 2026
10 checks passed

devantler deleted the claude/recursing-lederberg-86ebd6 branch June 11, 2026 18:00

github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 11, 2026

botantler Bot added the released label Jun 11, 2026

devantler merged 1 commit into main from claude/recursing-lederberg-86ebd6
