fix(keda): spread http-add-on interceptor and scaler replicas across nodes#1992

Merged
devantler merged 2 commits into main from claude/repo-assist-keda-http-addon-topology-spread on Jun 11, 2026

@devantler

Contributor

🤖 Generated by the Daily AI Assistant

Problem

On prod, 2 of 3 keda-add-ons-http-interceptor replicas were co-located on a single short-lived autoscaler node (autoscale-cx33-78d4e4c568a5d877). The interceptor sits on the request path of every scale-to-zero UI, and the external scaler's queue pinger polls every interceptor admin endpoint — it reports NOT_SERVING (and gets liveness-killed) while any one of them is unreachable. Losing that one node would therefore have degraded all scale-to-zero routing at once.
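To make the blast radius concrete, here is a minimal Python sketch of the situation described above. It is only a model of the behaviour stated in this PR, not the add-on's actual code: `survivors_after_worst_node_loss` and `scaler_status` are hypothetical helpers, and `node-b` and `node-c` are made-up node names.

```python
from collections import Counter

def survivors_after_worst_node_loss(pod_nodes):
    """Replicas left if the node hosting the most replicas disappears."""
    if not pod_nodes:
        return 0
    return len(pod_nodes) - max(Counter(pod_nodes).values())

def scaler_status(reachable):
    """Simplified model: every interceptor admin endpoint must answer
    for the external scaler to report SERVING."""
    return "SERVING" if all(reachable) else "NOT_SERVING"

AUTOSCALER_NODE = "autoscale-cx33-78d4e4c568a5d877"

# Observed layout: 2 of 3 interceptors on the autoscaler node (node-b is hypothetical).
observed = [AUTOSCALER_NODE, AUTOSCALER_NODE, "node-b"]
# Desired layout: one interceptor per node (node-b and node-c are hypothetical).
spread = [AUTOSCALER_NODE, "node-b", "node-c"]

# Reachability of each interceptor once the autoscaler node is gone.
status_after_loss = scaler_status([node != AUTOSCALER_NODE for node in observed])

print(survivors_after_worst_node_loss(observed))  # replicas left, observed layout
print(survivors_after_worst_node_loss(spread))    # replicas left, spread layout
print(status_after_loss)
```

In this model the observed layout keeps one interceptor after the node loss, while the spread layout keeps two.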

Context: the external scaler's chronic restarts (~2-4/hr since June 7, 233 total) stopped after #1975's cilium node-encryption fix rolled out (zero restarts 18:36–21:21 UTC, confirmed via Coroot Prometheus). This PR addresses the remaining availability gap, not the (already fixed) network root cause.

Fix

Add soft (whenUnsatisfiable: ScheduleAnyway) hostname topologySpreadConstraints to both the interceptor and scaler via chart values, so replicas land on distinct nodes whenever capacity allows. The default scheduler spread (maxSkew 3) tolerates the observed co-location; maxSkew 1 does not. Soft strength keeps single-node local/CI clusters schedulable.
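The maxSkew comparison above can be checked with a small Python sketch. It simplifies the scheduler's definition to "pods on the fullest node minus pods on the emptiest eligible node". With ScheduleAnyway the scheduler only uses skew to score nodes, so `within_max_skew` is an illustration of the threshold, not a model of scheduling decisions. The node names and helper names are hypothetical.

```python
from collections import Counter

def hostname_skew(pod_nodes, eligible_nodes):
    """Simplified skew: busiest node's pod count minus the emptiest eligible node's."""
    counts = Counter(pod_nodes)
    per_node = [counts.get(node, 0) for node in eligible_nodes]
    return max(per_node) - min(per_node)

def within_max_skew(max_skew, pod_nodes, eligible_nodes):
    """True if the placement stays inside the given maxSkew."""
    return hostname_skew(pod_nodes, eligible_nodes) <= max_skew

nodes = ["node-a", "node-b", "node-c"]     # hypothetical eligible nodes
observed = ["node-a", "node-a", "node-b"]  # 2 of 3 replicas co-located
spread = ["node-a", "node-b", "node-c"]    # one replica per node

print(hostname_skew(observed, nodes))       # skew of the observed layout
print(within_max_skew(3, observed, nodes))  # default maxSkew 3
print(within_max_skew(1, observed, nodes))  # this PR's maxSkew 1
print(within_max_skew(1, spread, nodes))
```

The observed layout has a skew of 2, which sits inside the default threshold of 3 but outside 1. Only the one-per-node layout satisfies maxSkew 1.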

Validation

  • ksail workload validate — ✅ 315 files validated
  • ksail --config ksail.prod.yaml workload validate — ✅ except the pre-existing, unrelated datreeio coroot_v1 schema failure (upstream: "Update coroot.com/coroot_v1 schema from coroot-operator 0.9.7", datreeio/CRDs-catalog#896)
  • helm template against keda-add-ons-http@0.14.1 confirms both Deployments render the constraints with matching pod labels (app.kubernetes.io/component: interceptor|scaler, app.kubernetes.io/instance: keda-http-add-on verified against live prod pod labels)
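The label check in the last bullet matters because a spread constraint whose labelSelector matches no pods has no effect. Below is a rough Python sketch of that check, assuming the rendered constraint selects on the two labels quoted above. The `hostname_constraint` and `selector_matches` helpers and the extra `app.kubernetes.io/name` pod label are illustrative assumptions, not taken from the chart.

```python
def hostname_constraint(component):
    """Soft per-node spread constraint in the shape of a pod spec entry (assumed selector)."""
    return {
        "maxSkew": 1,
        "topologyKey": "kubernetes.io/hostname",
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {
            "matchLabels": {
                "app.kubernetes.io/component": component,
                "app.kubernetes.io/instance": "keda-http-add-on",
            }
        },
    }

def selector_matches(constraint, pod_labels):
    """True if every matchLabels entry is present on the pod with the same value."""
    match_labels = constraint["labelSelector"]["matchLabels"]
    return all(pod_labels.get(key) == value for key, value in match_labels.items())

# Pod labels as they might appear on a rendered interceptor pod.
# The app.kubernetes.io/name entry is a hypothetical extra label.
interceptor_pod_labels = {
    "app.kubernetes.io/component": "interceptor",
    "app.kubernetes.io/instance": "keda-http-add-on",
    "app.kubernetes.io/name": "http-add-on",
}

print(selector_matches(hostname_constraint("interceptor"), interceptor_pod_labels))
print(selector_matches(hostname_constraint("scaler"), interceptor_pod_labels))
```

The interceptor constraint matches the interceptor pod and the scaler constraint does not, so each Deployment's replicas are counted only against their own component.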

🤖 Generated with Claude Code

…nodes

The default scheduler skew tolerance (maxSkew 3) co-located 2 of 3
interceptor replicas on a single short-lived autoscaler node on prod.
The interceptor is on the request path of every scale-to-zero UI, and
the external scaler goes NOT_SERVING (liveness kill) while any
interceptor admin endpoint is unreachable — so one node loss degrades
all scale-to-zero routing at once.

Add soft (ScheduleAnyway) hostname topology spread constraints to both
components so replicas land on distinct nodes when possible, while
single-node local/CI clusters still schedule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

Contributor Author

🤖 Generated by the Daily AI Assistant

The System Test failure here is pre-existing on main, not caused by this values-only change: the same failure hit the three Renovate PRs and claude/repo-assist-kyverno-mutate-create-only in the same window (22:14–22:47 UTC), all wedged on the infrastructure health gate waiting on PersistentVolumeClaim/openbao/vault-snapshots (WaitForFirstConsumer, no consumer until the 03:30 CronJob) plus the downstream OpenBao seeding chain.

#2002 fixes exactly this (initial snapshot Job so the PVC binds at deploy time) — its branch passed the full System Test at 23:44 UTC and it is in the merge queue now. I'll update this branch once it lands to rerun CI.

@devantler devantler added this pull request to the merge queue Jun 11, 2026
Merged via the queue into main with commit 84b5b43 Jun 11, 2026
10 checks passed
@devantler devantler deleted the claude/repo-assist-keda-http-addon-topology-spread branch June 11, 2026 06:34
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 11, 2026
@botantler

botantler Bot commented Jun 11, 2026

Contributor

🎉 This PR is included in version 1.49.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler botantler Bot added the released label Jun 11, 2026