fix(keda): spread http-add-on interceptor and scaler replicas across nodes #1992
…nodes

The default scheduler skew tolerance (maxSkew 3) allowed 2 of 3 interceptor replicas to be co-located on a single short-lived autoscaler node on prod. The interceptor is on the request path of every scale-to-zero UI, and the external scaler goes NOT_SERVING (liveness kill) while any interceptor admin endpoint is unreachable, so one node loss degrades all scale-to-zero routing at once.

Add soft (ScheduleAnyway) hostname topology spread constraints to both components so replicas land on distinct nodes when possible, while single-node local/CI clusters still schedule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The System Test failure here is pre-existing on main and is not caused by this values-only change: the same failure hit the three Renovate PRs. #2002 fixes exactly this (an initial snapshot Job so the PVC binds at deploy time); its branch passed the full System Test at 23:44 UTC and is in the merge queue now. I'll update this branch once it lands to rerun CI.
🎉 This PR is included in version 1.49.1 🎉
The release is available on GitHub release
Your semantic-release bot 📦🚀
Problem
On prod, 2 of 3 `keda-add-ons-http-interceptor` replicas were co-located on a single short-lived autoscaler node (`autoscale-cx33-78d4e4c568a5d877`). The interceptor sits on the request path of every scale-to-zero UI, and the external scaler's queue pinger polls every interceptor admin endpoint: it reports `NOT_SERVING` (and gets liveness-killed) while any one of them is unreachable. Losing that one node would therefore have degraded all scale-to-zero routing at once.

Context: the external scaler's chronic restarts (~2-4/hr since June 7, 233 total) stopped after #1975's cilium node-encryption fix rolled out (zero restarts 18:36–21:21 UTC, confirmed via Coroot Prometheus). This PR addresses the remaining availability gap, not the (already fixed) network root cause.
Fix
Add soft (`whenUnsatisfiable: ScheduleAnyway`) hostname `topologySpreadConstraints` to both the interceptor and scaler via chart values, so replicas land on distinct nodes whenever capacity allows. The default scheduler spread (maxSkew 3) tolerates the observed co-location; maxSkew 1 does not. Soft strength keeps single-node local/CI clusters schedulable.

Validation
- `ksail workload validate`: ✅ 315 files validated
- `ksail --config ksail.prod.yaml workload validate`: ✅ except the pre-existing, unrelated datreeio coroot_v1 schema failure (upstream: "Update coroot.com/coroot_v1 schema from coroot-operator 0.9.7", datreeio/CRDs-catalog#896)
- `helm template` against `keda-add-ons-http@0.14.1` confirms both Deployments render the constraints with matching pod labels (`app.kubernetes.io/component: interceptor|scaler`, `app.kubernetes.io/instance: keda-http-add-on`, verified against live prod pod labels)

🤖 Generated with Claude Code
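
For readers without the diff open, the constraint described under Fix might look roughly like the values fragment below. This is a sketch, not the actual change: the value keys (`interceptor.topologySpreadConstraints`, `scaler.topologySpreadConstraints`) are assumed from the description and should be checked against the chart's `values.yaml`. The labels, `maxSkew: 1`, the hostname topology and `ScheduleAnyway` come from the text above. As a worked case, with 2 of 3 replicas on one node and at least one other eligible node holding none, the hostname skew is 2 - 0 = 2, which is within maxSkew 3 but exceeds maxSkew 1.

```yaml
# Hypothetical values fragment for keda-add-ons-http; key names are assumed,
# not copied from this PR's diff.
interceptor:
  topologySpreadConstraints:
    - maxSkew: 1                              # prefer at most 1 replica of difference between nodes
      topologyKey: kubernetes.io/hostname     # spread per node
      whenUnsatisfiable: ScheduleAnyway       # soft: single-node clusters still schedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: interceptor
          app.kubernetes.io/instance: keda-http-add-on
scaler:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: scaler
          app.kubernetes.io/instance: keda-http-add-on
```

Because the constraint is soft, the scheduler treats it as a scoring preference rather than a filter, so it reduces the chance of co-location but does not rule it out when other nodes lack capacity.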