Skip to content

Helm chart improvements for GA readiness #381

@WentingWu666666

Description

@WentingWu666666

Summary

Pre-GA audit of operator/documentdb-helm-chart. The chart works for the current demo/preview use cases, but several gaps will block enterprise adoption (multi-install collisions, Pod Security Admission rejection, broad RBAC, no CRD upgrade story, missing scheduling/security knobs).

This is a tracking issue. Individual fixes can be split into sub-PRs; some items may have better designs than what is suggested here.

Related: #380 (CNPG version compatibility for BYO mode).


Critical (GA blockers)

  • C1. Cluster-scoped resources have hardcoded names chart cannot be installed twice in the same cluster. ClusterRole/documentdb-operator-cluster-role, ClusterRoleBinding/..., ValidatingWebhookConfiguration/documentdb-validating-webhook, Service/documentdb-webhook-service, all three Deployments, and the wal-replica RBAC are all hardcoded. Needs a proper documentdb-operator.fullname helper and consistent use across every cluster-scoped resource. templates/10_documentdb_webhook.yaml even has a comment admitting this.
  • C2. No CRD upgrade story. CRDs live in crds/ (Helm 3 special dir) installed on first install, never updated by helm upgrade. Pick a strategy: hook-based install in templates/, separate documentdb-operator-crds chart (Strimzi/CNPG pattern), or documented manual kubectl apply step in release notes.
  • C3. Deployment.spec.selector.matchLabels: { app: {{ .Release.Name }} } selectors are immutable; coupling to release name risks helm upgrade failures. Standardize on app.kubernetes.io/* labels with selectorLabels / labels helpers (what helm create ships).
  • C4. No resources: requests/limits on operator, sidecar-injector, or wal-replica deployments. LimitRange/ResourceQuota enforcement will reject the install. Expose via values.yaml with sane defaults.
  • C5. ClusterRole is over-broad. templates/05_clusterrole.yaml grants cluster-wide CRUD on secrets, namespaces, rolebindings, clusterrolebindings, serviceaccounts, configmaps. Several are not justifiable for a DocumentDB operator and will fail security review. Audit each rule against actual controller code.
  • C6. wal-replica-manager ClusterRole has a literal # TODO trim this down to what's actually needed comment (templates/03_documentdb_wal_replica.yaml:145). Ships permissions the author already flagged.
  • C7. CNPG subchart version drift. Chart.yaml / Chart.lock / bundled charts/cloudnative-pg-0.27.0.tgz are pinned to 0.27.0, but AGENTS.md compatibility matrix declares CNPG 1.28.0. Resolve to a single source of truth and add CI to keep them in sync.

Major

  • M1. No templates/NOTES.txt no post-install guidance, verification, or first-step instructions.
  • M2. No values.schema.json every value typo becomes a runtime error instead of an install-time validation error.
  • M3. No helm template rendering CI. helm lint is wired up but doesn't catch templates that fail to render under different value combinations (e.g., walReplica: true, custom namespace, missing values). Add CI that renders + kubectl apply --dry-run=server.
  • M4. cert-manager is an implicit hard dependency, never declared. Chart unconditionally creates cert-manager.io/v1 resources. If cert-manager isn't installed the chart "succeeds" but the operator never becomes ready. Add a preflight: {{- if not (.Capabilities.APIVersions.Has "cert-manager.io/v1") }}{{- fail "..." }}{{- end }}, and/or make it optional via certManager.enabled.
  • M5. No affinity, nodeSelector, tolerations, topologySpreadConstraints, priorityClassName exposed. Standard scheduling controls absent.
  • M6. No imagePullSecrets air-gapped / private-registry customers can't use the chart.
  • M7. pullPolicy: Always for pinned tags for appVersion-derived tags, IfNotPresent is the convention. Always only makes sense for floating tags.
  • M8. Missing securityContext / podSecurityContext. No runAsNonRoot, seccompProfile, readOnlyRootFilesystem, allowPrivilegeEscalation: false. Pod Security Admission with restricted rejects these pods. Borrow defaults from the upstream CNPG chart's values.yaml.
  • M9. No PodDisruptionBudget for the operator. replicaCount: 1 with no PDB operator stops reconciling during node drains.
  • M10. Three snowflake webhook/cert mechanisms (operator, sidecar injector, wal-replica) each with its own Issuer/Certificate/Secret/dnsName. Consolidate into a shared template.
  • M11. Numbered template filenames (01_, 02_, ..., 10_) Kustomize convention, meaningless in Helm. Rename to descriptive names (deployment.yaml, rbac.yaml, webhook.yaml, ...).
  • M12. Sidecar injector and wal-replica deployments have no probes marked Ready as soon as the container starts. Add readiness/liveness probes that actually verify the gRPC plugin endpoint.
  • M13. creationTimestamp: null artifacts in templates/02_documentdb_sidecar_injector.yaml:37 and templates/03_documentdb_wal_replica.yaml:87 leftover from kubectl get -o yaml. Remove.
  • M14. Wrong app.kubernetes.io/managed-by: kustomize labels in templates/03_documentdb_wal_replica.yaml. Helm sets this to Helm automatically; the hardcoded kustomize value misleads GitOps tooling and dashboards.

Minor / polish

  • N1. UNIFIED-VERSION-CONTROL.md in the chart dir is unreferenced. Link from README or move under docs/.
  • N2. No chart README.md Artifact Hub renders this; without it the listing page is empty. Consider helm-docs (auto-generates from values.yaml comments).
  • N3. Missing trailing newlines in Chart.yaml, templates/01_namespace.yaml, templates/_helpers.tpl. helm lint --strict would catch.
  • N4. _helpers.tpl is anemic single helper returning a literal. GA charts usually have name / fullname / chart / labels / selectorLabels / serviceAccountName / namespace.
  • N5. namespace: value + .Release.Namespace fallback two ways to set namespace is confusing. Either drop the chart value (let Helm own it via --namespace) or document precedence.
  • N6. documentDbVersion (values.yaml) vs. CRD field documentDBVersion casing mismatch for the same concept across two surfaces.
  • N7. walReplica as top-level boolean. If wal-replica grows config (image/resources/replicas) the API has to be restructured. Prefer walReplica.enabled: false from day one.
  • N8. RBAC verbs granularity most rules effectively grant *. Narrow each rule to what controller code actually uses.

Suggested phasing

Pre-GA must-fix: C1, C2, C5, C6, C7, M4, M8 these move the chart from "works in maintainers' demo cluster" to "passes enterprise security review and installs cleanly in customer environments."

Pre-GA strongly recommended: C3, C4, M2, M5, M6, M7, M9, M12, M3.

GA+1 polish: M1, M10, M11, M13, M14, all N items.

Happy to break this into per-area sub-issues if that helps assignment.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    In progress

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions