
fix: Slow status_check probes to 10s and increase workers io pool #322

Merged

gtoonstra merged 1 commit into main from chiel-eng-3818-improve-datafold-stability on May 12, 2026

Conversation

@cfernhout (Collaborator) commented May 11, 2026

Summary

Dial back the server /status_check probe period from 5s to 10s and raise the IO temporal worker's per-pod thread count (TEMPORAL_MAX_CONCURRENCY) from 5 to 35. An earlier version of this PR scaled replicas (keda.maxReplicas) instead; that was the wrong lever, since the bottleneck is threads per pod, not pod count.

ENG-3818

Companion PR: datafold-operator#101 (submodule pointer bump; semantic-release publishes the 1.1.70 operator image this PR pre-references).

Why

After #321 (ENG-3739) lowered the server's startupProbe and readinessProbe to periodSeconds: 5, and after datafold#12284 added direct Postgres + ClickHouse Alembic migration-version queries to /status_check, the server now executes those heavy checks every 5 seconds via two concurrent probes per pod. Bumping the period to 10s halves that query rate while keeping the rollout-safety wins from #321.
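The resulting probe settings can be sketched as below. Only the periodSeconds and failureThreshold values come from this PR; the endpoint path, port name, and surrounding fields are assumptions for illustration:

```yaml
# Sketch of the server probes in
# charts/datafold/charts/server/templates/deployment.yaml after this PR.
startupProbe:
  httpGet:
    path: /status_check   # assumed endpoint
    port: http            # assumed port name
  periodSeconds: 10       # was 5
  failureThreshold: 6     # was 12; 10s x 6 preserves the ~60s startup budget
readinessProbe:
  httpGet:
    path: /status_check
    port: http
  periodSeconds: 10       # was 5
livenessProbe:
  httpGet:
    path: /status_check
    port: http
  periodSeconds: 15       # unchanged
```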

Separately, worker-io was running with TEMPORAL_MAX_CONCURRENCY=5 (subchart default from charts/datafold/charts/worker-temporal/values.yaml) → ThreadPoolExecutor(max_workers=5) and Worker(max_concurrent_activities=5) per pod (datafold/temporal/temporal_worker.py:138,155,181). With keda.maxReplicas: 10 (subchart default), the in-flight IO ceiling is 50. The actual constraint is per-pod concurrency: thread-pool overhead is cheap for I/O-bound activities (mostly DB/git ops), and bumping threads is the right lever before scaling replicas.

Changes

  • charts/datafold/charts/server/templates/deployment.yaml
    • startupProbe.periodSeconds: 5 → 10, failureThreshold: 12 → 6 (keeps the same ~60s startup budget)
    • readinessProbe.periodSeconds: 5 → 10
    • livenessProbe unchanged (already at 15s)
  • charts/datafold/values.yaml
    • worker-io.temporal.maxConcurrency: "35" (new field; previously inherited the worker-temporal subchart default "5")
    • keda.maxReplicas no longer overridden in worker-io block — clusters keep the subchart-default cap of 10. The "scaledobject relationship" the ticket called out stays intact: KEDA still scales 0→10 pods based on queue depth; each pod now has 7× the thread capacity.
  • Version bumps:
    • charts/datafold/Chart.yaml: 0.10.82 → 0.10.83 — chart-releaser cuts a new datafold chart release with the probe + values changes.
    • charts/datafold-manager/Chart.yaml: 0.1.99 → 0.1.100 — chart-releaser cuts a new datafold-manager chart release because values.yaml changed.
    • charts/datafold-manager/values.yaml operator.image.tag: "1.1.69" → "1.1.70" — predicts the next operator semver-release tag from the companion PR.
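The values.yaml side of the change can be sketched as the fragment below; the key layout is assumed from the worker-temporal subchart defaults, and only temporal.maxConcurrency is actually new:

```yaml
# Sketch of the worker-io block in charts/datafold/values.yaml after this PR
# (surrounding key layout assumed; only temporal.maxConcurrency is new).
worker-io:
  temporal:
    maxConcurrency: "35"  # previously inherited the subchart default "5"
  # keda.maxReplicas is intentionally not set here: clusters keep the
  # subchart-default cap of 10, so the fleet-wide in-flight IO ceiling
  # rises from 5 * 10 = 50 to 35 * 10 = 350.
```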

Not bumped: charts/datafold/charts/operator/Chart.yaml appVersion. That subchart is the legacy kopf-based operator at {global.datafoldRepository}/operator:<appVersion>; no build pipeline I could find produces a new image tag for it, so the bump would reference a non-existent image.

No CRD changes — TemporalWorkerSettings.MaxConcurrency is already in the schema, and clusters can still override temporal.maxConcurrency per DatafoldApplication (e.g. set saas-eu higher, leave dedicated clouds on chart default).
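A per-cluster override of that knob might look like the hypothetical DatafoldApplication fragment below; the apiVersion, metadata, and spec layout are all assumptions, with only the temporal.maxConcurrency field path taken from this PR:

```yaml
# Hypothetical per-cluster override via the DatafoldApplication CRD.
# apiVersion, metadata, and spec layout are illustrative assumptions.
apiVersion: datafold.com/v1
kind: DatafoldApplication
metadata:
  name: saas-eu
spec:
  worker-io:
    temporal:
      maxConcurrency: "50"  # example: raise saas-eu above the chart default
```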

Merge order

  1. Review + merge this PR.
  2. datafold-operator#101: re-bump the submodule pointer to this PR's merge commit, then merge it. semantic-release publishes v1.1.70 operator image.
  3. Pull latest datafold-operator locally; run update-df-mgr per operator-managed cluster.

Test plan

  • CI lint + kubeconform AWS/GCP/Azure green
  • After this merges, datafold-operator#101 gets re-bumped to the merge commit and lands
  • After datafold-operator#101 merges, v1.1.70 operator image exists
  • update-df-mgr per cluster pulls the new manager chart and runs cleanly
  • Per cluster verification:
    • kubectl exec datafold-worker-io-<pod> -n <ns> -- env | grep TEMPORAL_MAX_CONCURRENCY returns 35 (worker-io-enabled clusters: saas-eu, ecolab, disney, test4)
    • kubectl get deploy datafold-server -n <ns> -o yaml | yq '.spec.template.spec.containers[0].readinessProbe.periodSeconds' returns 10
    • /status_check request rate (Datadog APM) drops ~50%
    • Worker-io memory usage doesn't approach the 4Gi limit under load (thread pool is cheap, but worth verifying)

@github-actions

🔍 Kubeconform Validation Results

All cloud provider configurations passed Kubernetes API schema validation!

| Cloud Provider | Status |
| --- | --- |
| AWS | ✅ Passed |
| GCP | ✅ Passed |
| Azure | ✅ Passed |

The rendered Kubernetes manifests conform to the Kubernetes API specification across all cloud providers.


cfernhout changed the title from "[ENG-3818]: Slow status_check probes to 10s and lift worker-io KEDA cap to 35" to "[ENG-3818]: Slow status_check probes to 10s and bump worker-io thread concurrency to 35" on May 11, 2026
gtoonstra force-pushed the chiel-eng-3818-improve-datafold-stability branch from fe32e14 to 9c46457 on May 12, 2026 13:24

gtoonstra force-pushed the chiel-eng-3818-improve-datafold-stability branch from 9c46457 to 51e3818 on May 12, 2026 13:25
gtoonstra changed the title from "[ENG-3818]: Slow status_check probes to 10s and bump worker-io thread concurrency to 35" to "fix: Slow status_check probes to 10s and increase workers io pool" on May 12, 2026

gtoonstra merged commit 6cc5091 into main on May 12, 2026 (8 checks passed).
gtoonstra deleted the chiel-eng-3818-improve-datafold-stability branch on May 12, 2026 13:27.