feat(dr): one-button prod rebuild workflow (runbook scenario 4, executable) #1997
Merged
…table) Adds .github/workflows/dr-rebuild.yaml, a workflow_dispatch encoding of the full-cluster-rebuild runbook, so recovery needs no operator machine, no local credentials and no stale CI secrets:

1. ksail cluster create (fresh kubeconfig/talosconfig ON the runner, so the workflow is immune to the stale KUBE_CONFIG/TALOS_CONFIG problem that breaks the regular deploy pipeline after a rebuild)
2. workload push + reconcile, then wait for all four Flux Kustomizations
3. (restore=true) Velero resource restore from the newest Completed backup synced from R2 (PartiallyFailed accepted: most resources already exist on the freshly-converged cluster)
4. (restore=true) OpenBao data recovery: an in-cluster pod fetches the newest raft snapshot from the R2 openbao-snapshots/ mirror onto the vault-snapshots PVC (creds never leave the cluster: it reuses the vault-snapshot-r2 Secret and the CronJob's NetworkPolicy via its app label); the pre-incident openbao-unseal Secret is restored from the Velero backup; the fresh vault is reset (HR suspended, sts scaled to 0, data PVCs deleted) and the vault-config Job's automated snapshot-restore path brings the old vault back; ExternalSecrets + PushSecrets are force-synced afterwards
5. KUBE_CONFIG/TALOS_CONFIG refreshed automatically when a DR_GH_ADMIN_TOKEN secret is configured, with a loud warning otherwise

Guard rails: requires typing REBUILD-PROD, shares the prod-deploy concurrency group with ci.yaml/cd.yaml so it can never race a deploy, and aborts the OpenBao surgery before touching data PVCs if the old unseal Secret did not come back.

Runbook truth-ups: scenario 4 leads with the workflow and documents that DNS is NOT a manual step (external-dns policy:sync repoints Cloudflare records automatically; the old step predated external-dns); scenario 9 documents the automatic secret refresh; the OpenBao artifacts row points at the snapshot mirror + automated restore.
Depends on #1996 (vault-snapshot-r2 Secret, R2 mirror, vault-config auto-restore path).

Validated: workflow YAML parses (yq), all 7 step scripts pass bash -n, embedded Pod manifest parses as valid YAML.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
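The two script-level guard rails described above (the typed confirmation and the abort before touching data PVCs) could be sketched roughly as follows. This is a minimal illustration, not the workflow's actual step scripts: the function names and the namespace argument are hypothetical, and the prod-deploy concurrency group is a workflow-level setting that a shell snippet cannot show.

```shell
# Guard 1: refuse to continue unless the operator typed the exact phrase.
# Returns non-zero (instead of exiting) so the caller decides how to fail.
require_confirmation() {
  typed="${1:-}"
  if [ "$typed" != "REBUILD-PROD" ]; then
    # "::error::" is the GitHub Actions workflow command for an error annotation.
    echo "::error::confirmation must be exactly REBUILD-PROD" >&2
    return 1
  fi
  return 0
}

# Guard 2: succeed only if the pre-incident openbao-unseal Secret exists.
# The namespace is passed in because the source does not name it (assumption).
unseal_secret_restored() {
  namespace="${1:?namespace required}"
  kubectl get secret openbao-unseal -n "$namespace" >/dev/null 2>&1
}

# Hypothetical use inside a step script, before any PVC is deleted:
#   require_confirmation "$CONFIRM_INPUT" || exit 1
#   unseal_secret_restored "$VAULT_NAMESPACE" || { echo "unseal Secret missing, aborting" >&2; exit 1; }
```

Returning a status rather than calling exit inside the functions keeps the checks reusable across steps; the destructive reset would only run after both have passed.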
🎉 This PR is included in version 1.48.0 🎉
The release is available on GitHub release.
Your semantic-release bot 📦🚀
Applied the lesson from the restore drill's first real run on #1995: |
Summary
Makes runbook Scenario 4 (full cluster rebuild from zero) a single workflow_dispatch run instead of a sequence of local manual commands, completing the "GitHub Actions can recover the platform, including persisted data and the vault" goal.

Depends on #1996 (the vault-snapshot-r2 Secret, the R2 snapshot mirror, and the vault-config auto-restore path).

What the workflow does

- Requires typing REBUILD-PROD; shares the prod-deploy concurrency group with ci/cd so it can never race a deploy
- ksail cluster create → workload push → reconcile → wait for all four Flux Kustomizations. Fresh kubeconfig/talosconfig are written on the runner, so the workflow is immune to the stale KUBE_CONFIG/TALOS_CONFIG problem that breaks the regular pipeline after a rebuild
- (restore=true) Velero resource restore from the newest Completed backup synced from R2; PartiallyFailed accepted (most resources already exist on the freshly-converged cluster; the restore's job is what Flux doesn't own)
- (restore=true) OpenBao data recovery: an in-cluster pod fetches the newest raft snapshot from the R2 openbao-snapshots/ mirror onto the PVC (credentials never leave the cluster: reuses the vault-snapshot-r2 Secret and the CronJob's NetworkPolicy via its app label) → pre-incident openbao-unseal restored from the Velero backup → fresh vault reset (HR suspended, sts→0, data PVCs deleted) → vault-config's automated snapshot-restore brings the old vault back → ExternalSecrets/PushSecrets force-synced
- KUBE_CONFIG/TALOS_CONFIG refreshed automatically when a DR_GH_ADMIN_TOKEN secret (fine-grained PAT, environment-secrets write) is configured; loud warning otherwise

Safety properties

- … dr-r2-target ConfigMap are cleaned up either way.
- restore=false input gives a bare rebuild (fresh vault, SOPS-seeded) with no restore steps.

Runbook truth-ups

- DNS is not a manual step: external-dns (policy: sync, gateway-httproute source) repoints the Cloudflare records automatically after a rebuild; the runbook step predated external-dns. Now it's "verify, intervene only if external-dns is broken".

Honest limits (stated in the workflow header + runbook)

- … kubectl cnpg restore).
- … bash -n, but treat the first real run as supervised.

Validation

- Workflow YAML parses (yq); all 7 step scripts pass bash -n; embedded Pod manifest parses as valid YAML.

🤖 Generated with Claude Code
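As an illustration of the "newest Completed backup" selection in the Velero restore step, here is a small sketch. It is an assumption about how such a selection could be written, not the workflow's actual script: the function name, the three-column input format and the velero namespace in the usage comment are all hypothetical.

```shell
# Reads "<name> <phase> <creationTimestamp>" lines on stdin and prints the
# name of the newest backup whose phase is exactly "Completed".
# RFC 3339 UTC timestamps, as Kubernetes writes them, sort lexicographically,
# so a plain sort puts the newest one last. Prints nothing if none qualify.
newest_completed_backup() {
  awk '$2 == "Completed" { print $3, $1 }' | sort | tail -n 1 | awk '{ print $2 }'
}

# Hypothetical usage against a live cluster (namespace "velero" is assumed):
#   kubectl get backups.velero.io -n velero \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{.metadata.creationTimestamp}{"\n"}{end}' \
#     | newest_completed_backup
```

A caller would still need to treat empty output as "no usable backup" and stop before creating a restore.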