refactor(secrets): stop SOPS-seeding non-bootstrap secrets, feed them via OpenBao #2034

Merged
devantler merged 1 commit into main from claude/exciting-kirch-7f48ff on Jun 12, 2026

@devantler

🤖 Generated by the Daily AI Assistant

Summary

Implements the SOPS bootstrap-only policy: SOPS seeds only what the cluster needs to boot and recover; everything else is fed into OpenBao by an operator (upstream tokens) or by ESO generators (random values), persisting via the raft snapshot mirror (#1996).

Removed from SOPS → user-fed OpenBao writes

| Key | New home | Consumers |
| --- | --- | --- |
| cloudflare_api_token | bao kv put secret/infrastructure/dns/cloudflare api_token=… | cert-manager DNS01 (already an ExternalSecret), external-dns (converted here) |
| fleetdm_license_key | bao kv put secret/apps/fleetdm/license license-key=… | fleetdm (currently disabled) |
| base r2_* pair | deleted (dead weight) | none (stale truncated placeholder; live creds are per-env in variables-cluster) |

Keys were pruned with sops unset (values never decrypted). The seed-cloudflare / seed-fleetdm PushSecrets are removed; PushSecret's default deletionPolicy: None means the existing prod KV entries persist, so nothing breaks on merge — the values just stop being repo-managed.

external-dns: controllers → infrastructure layer

Its token now arrives via an OpenBao ExternalSecret (same KV entry as the cert-manager solver). In the wait-gated infrastructure-controllers layer that ExternalSecret would deadlock bootstrap (the openbao ClusterSecretStore is an infrastructure-layer resource), so the whole dir moves — the OpenCost precedent. external-dns is not bootstrap-critical: existing DNS records keep serving while it waits.
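The ordering argument can be made concrete with a small model. This is only an illustrative sketch: it assumes two wait-gated layers applied in the order infrastructure-controllers, then infrastructure, and the `Resource` type, the `find_deadlocks` helper, and the resource names are hypothetical, not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    layer: str
    needs: tuple = ()  # names of resources that must exist before this one is ready


def find_deadlocks(layer_order, resources):
    """Return (resource, dependency) pairs where the dependency lives in a later layer.

    In a wait-gated rollout a layer is only applied once the previous layer is
    ready, so a resource waiting on something from a later layer never becomes ready.
    """
    rank = {layer: i for i, layer in enumerate(layer_order)}
    by_name = {r.name: r for r in resources}
    deadlocks = []
    for res in resources:
        for dep in res.needs:
            if rank[by_name[dep].layer] > rank[res.layer]:
                deadlocks.append((res.name, dep))
    return deadlocks


# Assumed layer order and hypothetical resource names, for illustration only.
LAYERS = ["infrastructure-controllers", "infrastructure"]
STORE = Resource("openbao-cluster-secret-store", "infrastructure")

# ExternalSecret left in the controllers layer: it waits on a later-layer store.
before = [STORE, Resource("external-dns-token", "infrastructure-controllers", (STORE.name,))]
# ExternalSecret moved alongside the store: no cross-layer wait remains.
after = [STORE, Resource("external-dns-token", "infrastructure", (STORE.name,))]

print(find_deadlocks(LAYERS, before))  # [('external-dns-token', 'openbao-cluster-secret-store')]
print(find_deadlocks(LAYERS, after))   # []
```

The model ignores everything except layer order, which is the only property the deadlock argument relies on.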

Deliberately KEPT in SOPS (bootstrap seeds)

  • hcloud_token, per-env r2_* (DR chicken-and-egg: they make the BSL available before any restore can run)
  • OIDC quartet (dex/flux-web/oauth2-proxy/github client secrets — consumed via Flux substitution by controllers-layer workloads; post-2026-06-10 single-source design)
  • alertmanager_heartbeat_url (the one ungated external dead-man) and alertmanager_webhook_url (inline Coroot CR field, no secretRef support)
  • ghcr_dockerconfigjson, reclassified as bootstrap during implementation: the verify-image-signatures ClusterPolicy needs the kyverno ghcr-auth Secret to fetch signature manifests of the private ksail-operator image in every cluster, including ephemeral CI ones where no operator exists to feed the vault (verified: anonymous GHCR token grant for devantler-tech/ksail-operator is DENIED). Its now-unused substituteFrom entries are dropped; the Secret is seed-only.

Docs truth-up

  • runbook scenario 7 + velero-cnpg.md pointed R2 rotation at the stale base placeholders; now point at the real per-env variables-cluster files (+ fully-qualified backups.velero.io).
  • runbook scenario 4 gains step 4b: re-feed user-fed secrets after a fresh-vault rebuild (no-op when the #1996 snapshot restore ran).
  • fleetdm README: license rotation is now a vault write.

Rollout notes

  • Prod: zero immediate behaviour change (KV entries already populated). external-dns briefly reinstalls as Flux ownership moves between Kustomizations — DNS records persist (policy: sync, txt registry), worst case a few minutes of paused reconciliation.
  • From-zero rebuild without a vault restore now requires the two bao kv put writes before cert-manager DNS01 / external-dns / fleetdm go green (documented in runbook 4b). With the #1996 mirror restored, no manual steps are needed.
  • Left as observed: seed-github-app pushes infrastructure/oidc/github which nothing currently reads back — kept as a durability mirror of a bootstrap-seeded value.
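The from-zero case above can be sketched as a simple lookup of which user-fed entries are still missing and which consumers they block. The paths, field names, and consumers come from the table in this description; the `USER_FED` mapping and `missing_writes` helper are hypothetical names, and the sketch does not talk to OpenBao itself.

```python
# User-fed KV entries, their field, and their consumers, as listed in this PR.
USER_FED = {
    "secret/infrastructure/dns/cloudflare": ("api_token", ["cert-manager DNS01", "external-dns"]),
    "secret/apps/fleetdm/license": ("license-key", ["fleetdm"]),
}


def missing_writes(present):
    """Map each missing KV path to the consumers it blocks.

    `present` maps a KV path to the set of field names currently stored there,
    for example as gathered by hand after a rebuild.
    """
    report = {}
    for path, (field, consumers) in USER_FED.items():
        if field not in present.get(path, set()):
            report[path] = consumers
    return report


# Fresh vault without a snapshot restore: both writes are still needed.
fresh_vault = {}
# Vault restored from a snapshot mirror: entries already exist.
restored_vault = {
    "secret/infrastructure/dns/cloudflare": {"api_token"},
    "secret/apps/fleetdm/license": {"license-key"},
}

print(missing_writes(fresh_vault))
print(missing_writes(restored_vault))  # {}
```

An empty result corresponds to the "no manual steps" case; any remaining key names a write that runbook step 4b would cover.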

Validation

  • ksail workload validate — ✅ 305 files (includes all moved external-dns manifests)
  • ksail --config ksail.prod.yaml workload validate: sole failure is the known pre-existing coroot notificationIntegrations schema gap (upstream datreeio/CRDs-catalog#896, "Update coroot.com/coroot_v1 schema from coroot-operator 0.9.7"), unrelated
  • kubectl kustomize — local, prod, hetzner infra, hetzner controllers, docker infra all build
  • Reference sweep: zero remaining references to cloudflare_api_token / fleetdm_license_key / seed-cloudflare / seed-fleetdm
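A reference sweep of this kind could be reproduced with a short script. The following is a minimal sketch, not the tooling actually used for this PR: the `sweep` helper is hypothetical, it does plain substring matching over UTF-8 text files, and it skips anything under a `.git` directory.

```python
from pathlib import Path

# Identifiers this PR removes; any remaining hit would be a stale reference.
REMOVED = ("cloudflare_api_token", "fleetdm_license_key", "seed-cloudflare", "seed-fleetdm")


def sweep(root, needles=REMOVED):
    """Return (path, line number, needle) for every occurrence under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            for needle in needles:
                if needle in line:
                    hits.append((str(path), lineno, needle))
    return hits
```

Calling `sweep(".")` from the repository root and getting an empty list would correspond to the "zero remaining references" result reported above.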

🤖 Generated with Claude Code

… via OpenBao

SOPS now seeds bootstrap-critical values only. Upstream credentials whose
consumers can safely wait are user-fed into OpenBao once and persist via
the raft snapshot mirror:

- Remove the seed-cloudflare and seed-fleetdm PushSecrets; document the
  bao kv put paths in the push-secrets.yaml header and DR runbook.
- Prune cloudflare_api_token plus the dead r2_* placeholder keys from
  variables-base, and fleetdm_license_key from both variables-cluster
  files (sops unset; values never decrypted).
- Move external-dns from the infrastructure-controllers layer to the
  hetzner infrastructure layer and source its Cloudflare token from an
  OpenBao ExternalSecret (the same KV entry cert-manager DNS01 reads)
  instead of Flux substitution; in the wait-gated controllers layer the
  ExternalSecret would deadlock before the ClusterSecretStore exists.
- Drop the now-unused variables-base Secret substituteFrom entries; the
  Secret remains seed-only (ghcr_dockerconfigjson stays SOPS-seeded:
  Kyverno image verification of the private ksail-operator image must
  work in ephemeral CI clusters where no operator can feed the vault).
- Truth up the R2 rotation docs: the live creds are per-environment in
  variables-cluster, not the (now removed) stale base placeholders.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler devantler marked this pull request as ready for review June 12, 2026 17:13
@devantler devantler added this pull request to the merge queue Jun 12, 2026
Merged via the queue into main with commit 5a50e65 Jun 12, 2026
10 checks passed
@devantler devantler deleted the claude/exciting-kirch-7f48ff branch June 12, 2026 17:20
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 12, 2026
@botantler

botantler Bot commented Jun 13, 2026

🎉 This PR is included in version 1.56.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler botantler Bot added the released label Jun 13, 2026