You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Pre-GA audit of operator/documentdb-helm-chart. The chart works for the current demo/preview use cases, but several gaps will block enterprise adoption (multi-install collisions, Pod Security Admission rejection, broad RBAC, no CRD upgrade story, missing scheduling/security knobs).
This is a tracking issue. Individual fixes can be split into sub-PRs; some items may have better designs than what is suggested here.
Related: #380 (CNPG version compatibility for BYO mode).
Critical (GA blockers)
C1. Cluster-scoped resources have hardcoded names chart cannot be installed twice in the same cluster. ClusterRole/documentdb-operator-cluster-role, ClusterRoleBinding/..., ValidatingWebhookConfiguration/documentdb-validating-webhook, Service/documentdb-webhook-service, all three Deployments, and the wal-replica RBAC are all hardcoded. Needs a proper documentdb-operator.fullname helper and consistent use across every cluster-scoped resource. templates/10_documentdb_webhook.yaml even has a comment admitting this.
C2. No CRD upgrade story. CRDs live in crds/ (Helm 3 special dir) installed on first install, never updated by helm upgrade. Pick a strategy: hook-based install in templates/, separate documentdb-operator-crds chart (Strimzi/CNPG pattern), or documented manual kubectl apply step in release notes.
C3. Deployment.spec.selector.matchLabels: { app: {{ .Release.Name }} } selectors are immutable; coupling to release name risks helm upgrade failures. Standardize on app.kubernetes.io/* labels with selectorLabels / labels helpers (what helm create ships).
C4. No resources: requests/limits on operator, sidecar-injector, or wal-replica deployments. LimitRange/ResourceQuota enforcement will reject the install. Expose via values.yaml with sane defaults.
C5. ClusterRole is over-broad.templates/05_clusterrole.yaml grants cluster-wide CRUD on secrets, namespaces, rolebindings, clusterrolebindings, serviceaccounts, configmaps. Several are not justifiable for a DocumentDB operator and will fail security review. Audit each rule against actual controller code.
C6. wal-replica-manager ClusterRole has a literal # TODO trim this down to what's actually needed comment (templates/03_documentdb_wal_replica.yaml:145). Ships permissions the author already flagged.
C7. CNPG subchart version drift.Chart.yaml / Chart.lock / bundled charts/cloudnative-pg-0.27.0.tgz are pinned to 0.27.0, but AGENTS.md compatibility matrix declares CNPG 1.28.0. Resolve to a single source of truth and add CI to keep them in sync.
Major
M1. No templates/NOTES.txt no post-install guidance, verification, or first-step instructions.
M2. No values.schema.json every value typo becomes a runtime error instead of an install-time validation error.
M3. No helm template rendering CI.helm lint is wired up but doesn't catch templates that fail to render under different value combinations (e.g., walReplica: true, custom namespace, missing values). Add CI that renders + kubectl apply --dry-run=server.
M4. cert-manager is an implicit hard dependency, never declared. Chart unconditionally creates cert-manager.io/v1 resources. If cert-manager isn't installed the chart "succeeds" but the operator never becomes ready. Add a preflight: {{- if not (.Capabilities.APIVersions.Has "cert-manager.io/v1") }}{{- fail "..." }}{{- end }}, and/or make it optional via certManager.enabled.
M5. No affinity, nodeSelector, tolerations, topologySpreadConstraints, priorityClassName exposed. Standard scheduling controls absent.
M6. No imagePullSecrets air-gapped / private-registry customers can't use the chart.
M7. pullPolicy: Always for pinned tags for appVersion-derived tags, IfNotPresent is the convention. Always only makes sense for floating tags.
M8. Missing securityContext / podSecurityContext. No runAsNonRoot, seccompProfile, readOnlyRootFilesystem, allowPrivilegeEscalation: false. Pod Security Admission with restricted rejects these pods. Borrow defaults from the upstream CNPG chart's values.yaml.
M9. No PodDisruptionBudget for the operator. replicaCount: 1 with no PDB operator stops reconciling during node drains.
M10. Three snowflake webhook/cert mechanisms (operator, sidecar injector, wal-replica) each with its own Issuer/Certificate/Secret/dnsName. Consolidate into a shared template.
M12. Sidecar injector and wal-replica deployments have no probes marked Ready as soon as the container starts. Add readiness/liveness probes that actually verify the gRPC plugin endpoint.
M13. creationTimestamp: null artifacts in templates/02_documentdb_sidecar_injector.yaml:37 and templates/03_documentdb_wal_replica.yaml:87 leftover from kubectl get -o yaml. Remove.
M14. Wrong app.kubernetes.io/managed-by: kustomize labels in templates/03_documentdb_wal_replica.yaml. Helm sets this to Helm automatically; the hardcoded kustomize value misleads GitOps tooling and dashboards.
Minor / polish
N1. UNIFIED-VERSION-CONTROL.md in the chart dir is unreferenced. Link from README or move under docs/.
N2. No chart README.md Artifact Hub renders this; without it the listing page is empty. Consider helm-docs (auto-generates from values.yaml comments).
N3. Missing trailing newlines in Chart.yaml, templates/01_namespace.yaml, templates/_helpers.tpl. helm lint --strict would catch.
N4. _helpers.tpl is anemic single helper returning a literal. GA charts usually have name / fullname / chart / labels / selectorLabels / serviceAccountName / namespace.
N5. namespace: value + .Release.Namespace fallback two ways to set namespace is confusing. Either drop the chart value (let Helm own it via --namespace) or document precedence.
N6. documentDbVersion (values.yaml) vs. CRD field documentDBVersion casing mismatch for the same concept across two surfaces.
N7. walReplica as top-level boolean. If wal-replica grows config (image/resources/replicas) the API has to be restructured. Prefer walReplica.enabled: false from day one.
N8. RBAC verbs granularity most rules effectively grant *. Narrow each rule to what controller code actually uses.
Suggested phasing
Pre-GA must-fix: C1, C2, C5, C6, C7, M4, M8 these move the chart from "works in maintainers' demo cluster" to "passes enterprise security review and installs cleanly in customer environments."
Summary
Pre-GA audit of
operator/documentdb-helm-chart. The chart works for the current demo/preview use cases, but several gaps will block enterprise adoption (multi-install collisions, Pod Security Admission rejection, broad RBAC, no CRD upgrade story, missing scheduling/security knobs).This is a tracking issue. Individual fixes can be split into sub-PRs; some items may have better designs than what is suggested here.
Related: #380 (CNPG version compatibility for BYO mode).
Critical (GA blockers)
ClusterRole/documentdb-operator-cluster-role,ClusterRoleBinding/...,ValidatingWebhookConfiguration/documentdb-validating-webhook,Service/documentdb-webhook-service, all three Deployments, and the wal-replica RBAC are all hardcoded. Needs a properdocumentdb-operator.fullnamehelper and consistent use across every cluster-scoped resource.templates/10_documentdb_webhook.yamleven has a comment admitting this.crds/(Helm 3 special dir) installed on first install, never updated byhelm upgrade. Pick a strategy: hook-based install intemplates/, separatedocumentdb-operator-crdschart (Strimzi/CNPG pattern), or documented manualkubectl applystep in release notes.Deployment.spec.selector.matchLabels: { app: {{ .Release.Name }} }selectors are immutable; coupling to release name riskshelm upgradefailures. Standardize onapp.kubernetes.io/*labels withselectorLabels/labelshelpers (whathelm createships).resources:requests/limits on operator, sidecar-injector, or wal-replica deployments. LimitRange/ResourceQuota enforcement will reject the install. Expose viavalues.yamlwith sane defaults.templates/05_clusterrole.yamlgrants cluster-wide CRUD onsecrets,namespaces,rolebindings,clusterrolebindings,serviceaccounts,configmaps. Several are not justifiable for a DocumentDB operator and will fail security review. Audit each rule against actual controller code.wal-replica-managerClusterRole has a literal# TODO trim this down to what's actually neededcomment (templates/03_documentdb_wal_replica.yaml:145). Ships permissions the author already flagged.Chart.yaml/Chart.lock/ bundledcharts/cloudnative-pg-0.27.0.tgzare pinned to0.27.0, butAGENTS.mdcompatibility matrix declares CNPG1.28.0. Resolve to a single source of truth and add CI to keep them in sync.Major
templates/NOTES.txtno post-install guidance, verification, or first-step instructions.values.schema.jsonevery value typo becomes a runtime error instead of an install-time validation error.helm templaterendering CI.helm lintis wired up but doesn't catch templates that fail to render under different value combinations (e.g.,walReplica: true, custom namespace, missing values). Add CI that renders +kubectl apply --dry-run=server.cert-manager.io/v1resources. If cert-manager isn't installed the chart "succeeds" but the operator never becomes ready. Add a preflight:{{- if not (.Capabilities.APIVersions.Has "cert-manager.io/v1") }}{{- fail "..." }}{{- end }}, and/or make it optional viacertManager.enabled.affinity,nodeSelector,tolerations,topologySpreadConstraints,priorityClassNameexposed. Standard scheduling controls absent.imagePullSecretsair-gapped / private-registry customers can't use the chart.pullPolicy: Alwaysfor pinned tags forappVersion-derived tags,IfNotPresentis the convention.Alwaysonly makes sense for floating tags.securityContext/podSecurityContext. NorunAsNonRoot,seccompProfile,readOnlyRootFilesystem,allowPrivilegeEscalation: false. Pod Security Admission withrestrictedrejects these pods. Borrow defaults from the upstream CNPG chart's values.yaml.PodDisruptionBudgetfor the operator.replicaCount: 1with no PDB operator stops reconciling during node drains.01_,02_, ...,10_) Kustomize convention, meaningless in Helm. Rename to descriptive names (deployment.yaml,rbac.yaml,webhook.yaml, ...).creationTimestamp: nullartifacts intemplates/02_documentdb_sidecar_injector.yaml:37andtemplates/03_documentdb_wal_replica.yaml:87leftover fromkubectl get -o yaml. Remove.app.kubernetes.io/managed-by: kustomizelabels intemplates/03_documentdb_wal_replica.yaml. Helm sets this toHelmautomatically; the hardcodedkustomizevalue misleads GitOps tooling and dashboards.Minor / polish
UNIFIED-VERSION-CONTROL.mdin the chart dir is unreferenced. Link from README or move underdocs/.README.mdArtifact Hub renders this; without it the listing page is empty. Considerhelm-docs(auto-generates from values.yaml comments).Chart.yaml,templates/01_namespace.yaml,templates/_helpers.tpl.helm lint --strictwould catch._helpers.tplis anemic single helper returning a literal. GA charts usually havename/fullname/chart/labels/selectorLabels/serviceAccountName/namespace.namespace:value +.Release.Namespacefallback two ways to set namespace is confusing. Either drop the chart value (let Helm own it via--namespace) or document precedence.documentDbVersion(values.yaml) vs. CRD fielddocumentDBVersioncasing mismatch for the same concept across two surfaces.walReplicaas top-level boolean. If wal-replica grows config (image/resources/replicas) the API has to be restructured. PreferwalReplica.enabled: falsefrom day one.*. Narrow each rule to what controller code actually uses.Suggested phasing
Pre-GA must-fix: C1, C2, C5, C6, C7, M4, M8 these move the chart from "works in maintainers' demo cluster" to "passes enterprise security review and installs cleanly in customer environments."
Pre-GA strongly recommended: C3, C4, M2, M5, M6, M7, M9, M12, M3.
GA+1 polish: M1, M10, M11, M13, M14, all N items.
Happy to break this into per-area sub-issues if that helps assignment.