
fix: Slow status_check probes to 10s and increase workers io pool #322

Merged

gtoonstra merged 1 commit into main from chiel-eng-3818-improve-datafold-stability on May 12, 2026

Conversation

@cfernhout (Collaborator) commented May 11, 2026

Summary

Dial back the server /status_check probe period from 5s to 10s and raise the IO temporal worker's per-pod thread count (TEMPORAL_MAX_CONCURRENCY) from 5 to 35. An earlier version of this PR scaled replicas (keda.maxReplicas) instead; that was the wrong lever, since the bottleneck is threads per pod, not pod count.

ENG-3818

Companion PR: datafold-operator#101 (submodule pointer bump; semantic-release publishes the 1.1.70 operator image this PR pre-references).

Why

After #321 (ENG-3739) lowered the server's startupProbe and readinessProbe to periodSeconds: 5, and after datafold#12284 added direct Postgres + ClickHouse Alembic migration-version queries to /status_check, the server now executes those heavy checks every 5 seconds via two concurrent probes per pod. Bumping the period to 10s halves that query rate while keeping the rollout-safety wins from #321.
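The resulting probe settings can be sketched as below. Only the periodSeconds and failureThreshold values come from this PR; the endpoint path, port name, and surrounding fields are assumptions for illustration:

```yaml
# Sketch of the server probes in
# charts/datafold/charts/server/templates/deployment.yaml after this PR.
startupProbe:
  httpGet:
    path: /status_check   # assumed endpoint
    port: http            # assumed port name
  periodSeconds: 10       # was 5
  failureThreshold: 6     # was 12; 10s x 6 preserves the ~60s startup budget
readinessProbe:
  httpGet:
    path: /status_check
    port: http
  periodSeconds: 10       # was 5
livenessProbe:
  httpGet:
    path: /status_check
    port: http
  periodSeconds: 15       # unchanged
```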

Separately, worker-io was running with TEMPORAL_MAX_CONCURRENCY=5 (subchart default from charts/datafold/charts/worker-temporal/values.yaml) → ThreadPoolExecutor(max_workers=5) and Worker(max_concurrent_activities=5) per pod (datafold/temporal/temporal_worker.py:138,155,181). With keda.maxReplicas: 10 (subchart default), the in-flight IO ceiling is 50. The actual constraint is per-pod concurrency: thread-pool overhead is cheap for I/O-bound activities (mostly DB/git ops), and bumping threads is the right lever before scaling replicas.

Changes

  • charts/datafold/charts/server/templates/deployment.yaml
    • startupProbe.periodSeconds: 5 → 10, failureThreshold: 12 → 6 (keeps the same ~60s startup budget)
    • readinessProbe.periodSeconds: 5 → 10
    • livenessProbe unchanged (already at 15s)
  • charts/datafold/values.yaml
    • worker-io.temporal.maxConcurrency: "35" (new field; previously inherited the worker-temporal subchart default "5")
    • keda.maxReplicas no longer overridden in worker-io block — clusters keep the subchart-default cap of 10. The "scaledobject relationship" the ticket called out stays intact: KEDA still scales 0→10 pods based on queue depth; each pod now has 7× the thread capacity.
  • Version bumps:
    • charts/datafold/Chart.yaml: 0.10.82 → 0.10.83 — chart-releaser cuts a new datafold chart release with the probe + values changes.
    • charts/datafold-manager/Chart.yaml: 0.1.99 → 0.1.100 — chart-releaser cuts a new datafold-manager chart release because values.yaml changed.
    • charts/datafold-manager/values.yaml operator.image.tag: "1.1.69" → "1.1.70" — predicts the next operator semver-release tag from the companion PR.
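The values.yaml side of the change can be sketched as the fragment below; the key layout is assumed from the worker-temporal subchart defaults, and only temporal.maxConcurrency is actually new:

```yaml
# Sketch of the worker-io block in charts/datafold/values.yaml after this PR
# (surrounding key layout assumed; only temporal.maxConcurrency is new).
worker-io:
  temporal:
    maxConcurrency: "35"  # previously inherited the subchart default "5"
  # keda.maxReplicas is intentionally not set here: clusters keep the
  # subchart-default cap of 10, so the fleet-wide in-flight IO ceiling
  # rises from 5 * 10 = 50 to 35 * 10 = 350.
```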

Not bumped: charts/datafold/charts/operator/Chart.yaml appVersion. That subchart is the legacy kopf-based operator at {global.datafoldRepository}/operator:<appVersion>; no build pipeline I could find produces a new image tag for it, so the bump would reference a non-existent image.

No CRD changes — TemporalWorkerSettings.MaxConcurrency is already in the schema, and clusters can still override temporal.maxConcurrency per DatafoldApplication (e.g. set saas-eu higher, leave dedicated clouds on chart default).
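A per-cluster override of that knob might look like the hypothetical DatafoldApplication fragment below; the apiVersion, metadata, and spec layout are all assumptions, with only the temporal.maxConcurrency field path taken from this PR:

```yaml
# Hypothetical per-cluster override via the DatafoldApplication CRD.
# apiVersion, metadata, and spec layout are illustrative assumptions.
apiVersion: datafold.com/v1
kind: DatafoldApplication
metadata:
  name: saas-eu
spec:
  worker-io:
    temporal:
      maxConcurrency: "50"  # example: raise saas-eu above the chart default
```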

Merge order

  1. Review + merge this PR.
  2. datafold-operator#101: re-bump the submodule pointer to this PR's merge commit, then merge it. semantic-release publishes v1.1.70 operator image.
  3. Pull latest datafold-operator locally; run update-df-mgr per operator-managed cluster.

Test plan

  • CI lint + kubeconform AWS/GCP/Azure green
  • After this merges, datafold-operator#101 gets re-bumped to the merge commit and lands
  • After datafold-operator#101 merges, v1.1.70 operator image exists
  • update-df-mgr per cluster pulls the new manager chart and runs cleanly
  • Per cluster verification:
    • kubectl exec datafold-worker-io-<pod> -n <ns> -- env | grep TEMPORAL_MAX_CONCURRENCY returns 35 (worker-io-enabled clusters: saas-eu, ecolab, disney, test4)
    • kubectl get deploy datafold-server -n <ns> -o yaml | yq '.spec.template.spec.containers[0].readinessProbe.periodSeconds' returns 10
    • /status_check request rate (Datadog APM) drops ~50%
    • Worker-io memory usage doesn't approach the 4Gi limit under load (thread pool is cheap, but worth verifying)

@github-actions

🔍 Kubeconform Validation Results

All cloud provider configurations passed Kubernetes API schema validation!

| Cloud Provider | Status |
| --- | --- |
| AWS | ✅ Passed |
| GCP | ✅ Passed |
| Azure | ✅ Passed |

The rendered Kubernetes manifests conform to the Kubernetes API specification across all cloud providers.


cfernhout changed the title from "[ENG-3818]: Slow status_check probes to 10s and lift worker-io KEDA cap to 35" to "[ENG-3818]: Slow status_check probes to 10s and bump worker-io thread concurrency to 35" on May 11, 2026
gtoonstra force-pushed the chiel-eng-3818-improve-datafold-stability branch from fe32e14 to 9c46457 on May 12, 2026 13:24

gtoonstra force-pushed the chiel-eng-3818-improve-datafold-stability branch from 9c46457 to 51e3818 on May 12, 2026 13:25
gtoonstra changed the title from "[ENG-3818]: Slow status_check probes to 10s and bump worker-io thread concurrency to 35" to "fix: Slow status_check probes to 10s and increase workers io pool" on May 12, 2026

gtoonstra merged commit 6cc5091 into main on May 12, 2026 (8 checks passed).
gtoonstra deleted the chiel-eng-3818-improve-datafold-stability branch on May 12, 2026 13:27.