
feat(dr): one-button prod rebuild workflow (runbook scenario 4, executable) #1997

Merged: devantler merged 1 commit into main from claude/dr-rebuild-workflow on Jun 10, 2026


@devantler


🤖 Generated by the Daily AI Assistant

Summary

Makes runbook Scenario 4 (full cluster rebuild from zero) a single workflow_dispatch run instead of a sequence of local manual commands — completing the "GitHub Actions can recover the platform, including persisted data and the vault" goal.

Depends on #1996 (the vault-snapshot-r2 Secret, the R2 snapshot mirror, and the vault-config auto-restore path).

What the workflow does

| Phase | Detail |
| --- | --- |
| Confirm | Requires typing REBUILD-PROD; shares the prod-deploy concurrency group with ci/cd so it can never race a deploy |
| Rebuild | ksail cluster create → workload push → reconcile → wait for all four Flux Kustomizations. Fresh kubeconfig/talosconfig are written on the runner, so the workflow is immune to the stale KUBE_CONFIG/TALOS_CONFIG problem that breaks the regular pipeline after a rebuild |
| Velero restore | Newest Completed backup synced from R2; PartiallyFailed accepted (most resources already exist on the freshly converged cluster; the restore's job is what Flux doesn't own) |
| OpenBao recovery | In-cluster pod fetches the newest raft snapshot from the R2 openbao-snapshots/ mirror onto the PVC (credentials never leave the cluster: it reuses the vault-snapshot-r2 Secret and the CronJob's NetworkPolicy via its app label) → pre-incident openbao-unseal restored from the Velero backup → fresh vault reset (HR suspended, sts→0, data PVCs deleted) → vault-config's automated snapshot-restore brings the old vault back → ExternalSecrets/PushSecrets force-synced |
| Credentials | KUBE_CONFIG/TALOS_CONFIG refreshed automatically when a DR_GH_ADMIN_TOKEN secret (fine-grained PAT, environment-secrets write) is configured; loud warning otherwise |
| Summary | LB IP, external-dns note, CNPG + per-app data pointers |
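
The "PartiallyFailed accepted" rule in the Velero restore phase can be sketched as a small phase check. This is an illustration only: `restore_phase_ok`, the `velero` namespace and `RESTORE_NAME` are assumptions, not names taken from the workflow.

```shell
#!/usr/bin/env bash
# Sketch only: restore_phase_ok is a hypothetical helper, not the workflow's code.

# Return 0 when a Velero restore phase counts as success for this rebuild.
# PartiallyFailed is tolerated because most resources already exist on the
# freshly converged cluster by the time the restore runs.
restore_phase_ok() {
  case "$1" in
    Completed|PartiallyFailed) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumed usage against a live cluster (namespace and restore name are placeholders):
#   phase="$(kubectl -n velero get restores.velero.io "$RESTORE_NAME" \
#     -o jsonpath='{.status.phase}')"
#   restore_phase_ok "$phase" || { echo "::error::restore ended in phase ${phase}"; exit 1; }
```

Any other phase, including an empty one from a restore that never started, is treated as a failure.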

Safety properties

  • Aborts the OpenBao surgery before touching data PVCs if the old unseal Secret didn't come back from the backup.
  • The snapshot-fetch pod and dr-r2-target ConfigMap are cleaned up either way.
  • restore=false input gives a bare rebuild (fresh vault, SOPS-seeded) with no restore steps.
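
The first two properties (abort before the destructive step, clean up either way) could be combined roughly as follows. This is a minimal sketch under stated assumptions: the function names, the `OPENBAO_NS` default and the `dr-snapshot-fetch` pod name are hypothetical, and `reset_fresh_vault` is a placeholder for the real reset commands.

```shell
#!/usr/bin/env bash
# Sketch only. Function names, the namespace default and the pod name are
# hypothetical; the workflow's real step script may be structured differently.

OPENBAO_NS="${OPENBAO_NS:-openbao}"   # assumed namespace

# True when the pre-incident unseal Secret exists in the cluster.
unseal_secret_present() {
  kubectl -n "$OPENBAO_NS" get secret openbao-unseal >/dev/null 2>&1
}

# Placeholder for the destructive part (suspend HR, scale sts to 0, delete data PVCs).
reset_fresh_vault() {
  echo "resetting vault in ${OPENBAO_NS}"
}

# Remove the temporary DR objects; tolerate them being absent already.
cleanup_dr_helpers() {
  kubectl -n "$OPENBAO_NS" delete configmap dr-r2-target --ignore-not-found >/dev/null 2>&1 || true
  kubectl -n "$OPENBAO_NS" delete pod dr-snapshot-fetch --ignore-not-found >/dev/null 2>&1 || true
}

recover_openbao() {
  # A subshell scopes the EXIT trap, so cleanup runs on success and on abort.
  (
    trap cleanup_dr_helpers EXIT
    if ! unseal_secret_present; then
      echo "::error::openbao-unseal was not restored; leaving data PVCs untouched" >&2
      exit 1
    fi
    reset_fresh_vault
  )
}
```

Putting the guard before `reset_fresh_vault` means a missing Secret leaves the fresh vault's data PVCs in place, which is the state the first bullet describes.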

Runbook truth-ups

  • Scenario 4 leads with the workflow; the manual path stays as the GitHub-Actions-is-down fallback.
  • The manual DNS step is gone: external-dns (policy: sync, gateway-httproute source) repoints the Cloudflare records automatically after a rebuild — the runbook step predated external-dns. Now it's "verify, intervene only if external-dns is broken".
  • Scenario 9 documents the automatic secret refresh; the OpenBao artifacts row points at the snapshot mirror + automated restore.

Honest limits (stated in the workflow header + runbook)

  • Per-app PVC data (headlamp, actual-budget) isn't rehydrated into already-running pods — Velero skips existing resources; per-app reset per Scenario 5.
  • CNPG (umami-db) recovers from its own barman/R2 backups (Scenario 5's kubectl cnpg restore).
  • The workflow can't be end-to-end tested without sacrificing a prod cluster; every step mirrors the runbook commands and all step scripts pass bash -n, but treat the first real run as supervised.

Validation

  • Workflow YAML parses (yq); all 7 step scripts pass bash -n; embedded Pod manifest parses as valid YAML.
  • No k8s manifests changed → no overlay rebuild needed (ci.yaml path filter doesn't trigger system-test for this file).
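
For reference, a `bash -n` pass over step scripts might look like the sketch below. It assumes the scripts have already been extracted from the workflow into `*.sh` files in one directory; `check_scripts` and that layout are hypothetical, not how the PR's validation was actually run.

```shell
#!/usr/bin/env bash
# Sketch only: check_scripts and the directory layout are hypothetical.

# Syntax-check every *.sh file in a directory with `bash -n`, which parses
# a script without executing it. Returns non-zero if any file fails.
check_scripts() {
  local dir="$1" failed=0 script
  for script in "$dir"/*.sh; do
    [ -e "$script" ] || continue   # the glob matched nothing
    if bash -n "$script" 2>/dev/null; then
      echo "ok   ${script##*/}"
    else
      echo "FAIL ${script##*/}"
      failed=1
    fi
  done
  return "$failed"
}
```

`bash -n` only catches parse errors such as an unclosed `if`; it says nothing about whether the commands behave correctly at run time, which is why the first real run should still be supervised.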

🤖 Generated with Claude Code


Adds .github/workflows/dr-rebuild.yaml — a workflow_dispatch encoding of
the full-cluster-rebuild runbook so recovery needs no operator machine,
no local credentials and no stale CI secrets:

1. ksail cluster create (fresh kubeconfig/talosconfig ON the runner, so
   the workflow is immune to the stale KUBE_CONFIG/TALOS_CONFIG problem
   that breaks the regular deploy pipeline after a rebuild)
2. workload push + reconcile, then wait for all four Flux Kustomizations
3. (restore=true) Velero resource restore from the newest Completed
   backup synced from R2 (PartiallyFailed accepted: most resources
   already exist on the freshly-converged cluster)
4. (restore=true) OpenBao data recovery: an in-cluster pod fetches the
   newest raft snapshot from the R2 openbao-snapshots/ mirror onto the
   vault-snapshots PVC (creds never leave the cluster — it reuses the
   vault-snapshot-r2 Secret and the CronJob's NetworkPolicy via its
   app label); the pre-incident openbao-unseal Secret is restored from
   the Velero backup; the fresh vault is reset (HR suspended, sts scaled
   to 0, data PVCs deleted) and the vault-config Job's automated
   snapshot-restore path brings the old vault back; ExternalSecrets +
   PushSecrets are force-synced afterwards
5. KUBE_CONFIG/TALOS_CONFIG refreshed automatically when a
   DR_GH_ADMIN_TOKEN secret is configured, with a loud warning otherwise
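
Step 5 could be sketched along these lines. `refresh_ci_secret` and its arguments are hypothetical, and the environment name passed to `gh secret set --env` is an assumption; only the secret names and the warn-when-missing behaviour come from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: refresh_ci_secret is a hypothetical helper.

# Upload FILE as environment secret NAME when an admin token is available;
# otherwise emit a GitHub Actions warning annotation and carry on.
# Args: NAME FILE ENVIRONMENT
refresh_ci_secret() {
  local name="$1" file="$2" environment="$3"
  if [ -z "${DR_GH_ADMIN_TOKEN:-}" ]; then
    echo "::warning::DR_GH_ADMIN_TOKEN is not set; update ${name} manually before the next deploy"
    return 0
  fi
  # gh reads the secret value from stdin when no --body flag is given.
  GH_TOKEN="$DR_GH_ADMIN_TOKEN" gh secret set "$name" --env "$environment" < "$file"
}

# Assumed usage (file paths and environment name are placeholders):
#   refresh_ci_secret KUBE_CONFIG  "$HOME/.kube/config"  prod
#   refresh_ci_secret TALOS_CONFIG "$HOME/.talos/config" prod
```

Returning 0 in the no-token case keeps the rebuild green while still surfacing the stale-secret risk in the run summary.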

Guard rails: requires typing REBUILD-PROD, shares the prod-deploy
concurrency group with ci.yaml/cd.yaml so it can never race a deploy,
and aborts the OpenBao surgery before touching data PVCs if the old
unseal Secret did not come back.
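
The typed-confirmation guard might be as small as the sketch below. `require_confirmation` and the `CONFIRM` variable are hypothetical names; only the REBUILD-PROD phrase comes from the description.

```shell
#!/usr/bin/env bash
# Sketch only: require_confirmation is a hypothetical helper.

# Fail unless the operator typed the exact confirmation phrase.
require_confirmation() {
  local typed="$1" expected="REBUILD-PROD"
  if [ "$typed" != "$expected" ]; then
    echo "::error::confirmation mismatch: type ${expected} to proceed" >&2
    return 1
  fi
}

# Assumed usage in a step script, with the workflow input passed via an
# environment variable (here CONFIRM) rather than interpolated into the script:
#   require_confirmation "${CONFIRM:-}" || exit 1
```

The comparison is exact and case-sensitive, so a lowercase or padded value is rejected.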

Runbook truth-ups: scenario 4 leads with the workflow and documents that
DNS is NOT a manual step (external-dns policy:sync repoints Cloudflare
records automatically — the old step predated external-dns); scenario 9
documents the automatic secret refresh; the OpenBao artifacts row points
at the snapshot mirror + automated restore.

Depends on #1996 (vault-snapshot-r2 Secret, R2 mirror, vault-config
auto-restore path).

Validated: workflow YAML parses (yq), all 7 step scripts pass bash -n,
embedded Pod manifest parses as valid YAML.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
devantler marked this pull request as ready for review June 10, 2026 22:38
devantler added this pull request to the merge queue Jun 10, 2026
Merged via the queue into main with commit 6d2f3e8 Jun 10, 2026
10 checks passed
devantler deleted the claude/dr-rebuild-workflow branch June 10, 2026 22:39
github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 10, 2026
@botantler

botantler Bot commented Jun 10, 2026


🎉 This PR is included in version 1.48.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler botantler Bot added the released label Jun 10, 2026
@devantler


🤖 Generated by the Daily AI Assistant

Applied the lesson from the restore drill's first real run on #1995: kubectl get backups resolves to CNPG's backups.postgresql.cnpg.io on this cluster, not Velero's. In a real DR, this workflow's backup-discovery loop would have listed CNPG Backups, found no Completed ones, and aborted the entire restore with 'No Completed Velero backup found'. All velero get/describe/wait calls now use fully-qualified *.velero.io names. All step scripts still pass bash -n.
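
A backup-discovery step using the fully-qualified resource name could look like this sketch. `newest_completed_backup` and the `velero` namespace are assumptions; the real loop in the workflow may differ.

```shell
#!/usr/bin/env bash
# Sketch only: newest_completed_backup is a hypothetical helper.

# Read "NAME PHASE CREATED" rows on stdin and print the name of the newest
# backup whose phase is Completed (empty output when there is none).
# RFC 3339 UTC timestamps sort correctly as plain strings.
newest_completed_backup() {
  awk '$2 == "Completed" { print $3, $1 }' | LC_ALL=C sort | tail -n 1 | awk '{ print $2 }'
}

# Assumed usage; backups.velero.io avoids resolving to CNPG's
# backups.postgresql.cnpg.io when both CRDs are installed:
#   kubectl -n velero get backups.velero.io --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,CREATED:.metadata.creationTimestamp \
#     | newest_completed_backup
```

With the short name `backups`, the same pipeline would have been fed CNPG Backup objects and returned nothing, which is the failure mode described above.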
