fix: Slow status_check probes to 10s and increase worker-io pool #322
Conversation
🔍 Kubeconform Validation Results
✅ All cloud provider configurations passed Kubernetes API schema validation!
The rendered Kubernetes manifests conform to the Kubernetes API specification across all cloud providers.
Force-pushed from fe32e14 to 9c46457.
Force-pushed from 9c46457 to 51e3818.
Summary
Dial back the server `/status_check` probe rate from 5s to 10s and raise the IO temporal worker's per-pod thread count (`TEMPORAL_MAX_CONCURRENCY`) from 5 to 35. The earlier version of this PR scaled replicas (`keda.maxReplicas`) instead; that was the wrong lever, since the bottleneck is threads per pod, not pod count.

ENG-3818

Companion PR: datafold-operator#101 (submodule pointer bump; semantic-release publishes the `1.1.70` operator image this PR pre-references).

Why

After #321 (ENG-3739) lowered the server's `startupProbe` and `readinessProbe` to `periodSeconds: 5`, and after datafold#12284 added direct Postgres + ClickHouse Alembic migration-version queries to `/status_check`, the server now executes those heavy checks every 5 seconds via two concurrent probes per pod. Bumping the period to 10s halves that query rate while keeping the rollout-safety wins from #321.

Separately, `worker-io` was running with `TEMPORAL_MAX_CONCURRENCY=5` (the subchart default from `charts/datafold/charts/worker-temporal/values.yaml`) → `ThreadPoolExecutor(max_workers=5)` and `Worker(max_concurrent_activities=5)` per pod (`datafold/temporal/temporal_worker.py:138,155,181`). With `keda.maxReplicas: 10` (subchart default), the in-flight IO ceiling is 50. The actual constraint is per-pod concurrency: thread-pool overhead is cheap for I/O-bound activities (mostly DB/git ops), and bumping threads is the right lever before scaling replicas.

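For orientation, the subchart defaults described above amount to roughly the following. This is a minimal sketch: the two values and the resulting 50-activity ceiling come from this PR description, while the surrounding key layout of `charts/datafold/charts/worker-temporal/values.yaml` is assumed.

```yaml
# charts/datafold/charts/worker-temporal/values.yaml -- sketch of the defaults
# referenced above; key layout assumed, values taken from this PR description.
temporal:
  maxConcurrency: "5"   # rendered into each worker pod as TEMPORAL_MAX_CONCURRENCY=5
keda:
  maxReplicas: 10       # KEDA scales the worker 0 -> 10 pods on queue depth
# pre-PR in-flight IO ceiling: 5 threads/pod x 10 pods = 50 activities
```
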
Changes

- `charts/datafold/charts/server/templates/deployment.yaml` (the changed fields are sketched in YAML after this list)
  - `startupProbe.periodSeconds`: 5 → 10, `failureThreshold`: 12 → 6 (keeps the same ~60s startup budget)
  - `readinessProbe.periodSeconds`: 5 → 10
  - `livenessProbe` unchanged (already at 15s)
- `charts/datafold/values.yaml`
  - `worker-io.temporal.maxConcurrency: "35"` (new field; previously inherited the worker-temporal subchart default `"5"`)
  - `keda.maxReplicas` is no longer overridden in the worker-io block; clusters keep the subchart-default cap of 10. The "scaledobject relationship" the ticket called out stays intact: KEDA still scales 0→10 pods based on queue depth; each pod now has 7× the thread capacity.
- `charts/datafold/Chart.yaml`: 0.10.82 → 0.10.83; chart-releaser cuts a new `datafold` chart release with the probe + values changes.
- `charts/datafold-manager/Chart.yaml`: 0.1.99 → 0.1.100; chart-releaser cuts a new `datafold-manager` chart release because `values.yaml` changed.
- `charts/datafold-manager/values.yaml`: `operator.image.tag`: `"1.1.69"` → `"1.1.70"`; predicts the next operator semver-release tag from the companion PR.

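Taken together, the touched fields look roughly like this. It is a sketch of only the changed keys; probe endpoints and everything else in both files are omitted.

```yaml
# charts/datafold/charts/server/templates/deployment.yaml -- probe fields after
# this PR (sketch; the rest of the container spec is omitted)
startupProbe:
  periodSeconds: 10     # was 5
  failureThreshold: 6   # was 12; keeps the same ~60s startup budget
readinessProbe:
  periodSeconds: 10     # was 5
livenessProbe:
  periodSeconds: 15     # unchanged

# charts/datafold/values.yaml -- worker-io block after this PR (sketch)
worker-io:
  temporal:
    maxConcurrency: "35"   # new; previously inherited the subchart default "5"
  # keda.maxReplicas is deliberately no longer set here, so the
  # subchart-default cap of 10 replicas applies
```
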
Not bumped: `appVersion` in `charts/datafold/charts/operator/Chart.yaml`. That subchart is the legacy kopf-based operator at `{global.datafoldRepository}/operator:<appVersion>`; no build pipeline I could find produces a new image tag for it, so the bump would reference a non-existent image.

No CRD changes: `TemporalWorkerSettings.MaxConcurrency` is already in the schema, and clusters can still override `temporal.maxConcurrency` per `DatafoldApplication` (e.g. set saas-eu higher, leave dedicated clouds on the chart default).

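A per-cluster override could then look something like the following. This is purely illustrative: only the `temporal.maxConcurrency` key is confirmed by this PR, while the `apiVersion` and the spec layout around the worker-io settings are assumptions.

```yaml
# Hypothetical DatafoldApplication override, for illustration only.
# Only temporal.maxConcurrency is confirmed by this PR; the apiVersion and the
# spec nesting around it are assumed.
apiVersion: datafold.com/v1alpha1   # assumed group/version
kind: DatafoldApplication
metadata:
  name: saas-eu
spec:
  worker-io:                        # assumed nesting for the worker-io settings
    temporal:
      maxConcurrency: "50"          # e.g. raise saas-eu above the chart default "35"
```
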
Merge order

1. Merge datafold-operator#101; semantic-release publishes the `v1.1.70` operator image.
2. Merge this PR; then pull `datafold-operator` locally and run `update-df-mgr` per operator-managed cluster.

Test plan

- `v1.1.70` operator image exists
- `update-df-mgr` per cluster pulls the new manager chart and runs cleanly
- `kubectl exec datafold-worker-io-<pod> -n <ns> -- env | grep TEMPORAL_MAX_CONCURRENCY` returns `35` (worker-io-enabled clusters: saas-eu, ecolab, disney, test4)
- `kubectl get deploy datafold-server -n <ns> -o yaml | yq '.spec.template.spec.containers[0].readinessProbe.periodSeconds'` returns `10`
- `/status_check` request rate (Datadog APM) drops ~50%