feat(dr): one-button prod rebuild workflow (runbook scenario 4, executable) by devantler · Pull Request #1997 · devantler-tech/platform

devantler · 2026-06-10T22:36:11Z

🤖 Generated by the Daily AI Assistant

Summary

Makes runbook Scenario 4 (full cluster rebuild from zero) a single workflow_dispatch run instead of a sequence of local manual commands — completing the "GitHub Actions can recover the platform, including persisted data and the vault" goal.

Depends on #1996 (the vault-snapshot-r2 Secret, the R2 snapshot mirror, and the vault-config auto-restore path).

What the workflow does

| Phase | Detail |
| --- | --- |
| Confirm | Requires typing `REBUILD-PROD`; shares the `prod-deploy` concurrency group with ci/cd so it can never race a deploy |
| Rebuild | `ksail cluster create` → `workload push` → `reconcile` → wait for all four Flux Kustomizations. Fresh kubeconfig/talosconfig are written on the runner, so the workflow is immune to the stale `KUBE_CONFIG`/`TALOS_CONFIG` problem that breaks the regular pipeline after a rebuild |
| Velero restore | Newest `Completed` backup synced from R2; `PartiallyFailed` accepted (most resources already exist on the freshly-converged cluster; the restore's job is what Flux doesn't own) |
| OpenBao recovery | In-cluster pod fetches the newest raft snapshot from the R2 `openbao-snapshots/` mirror onto the PVC (credentials never leave the cluster: it reuses the `vault-snapshot-r2` Secret and the CronJob's NetworkPolicy via its `app` label) → pre-incident `openbao-unseal` restored from the Velero backup → fresh vault reset (HR suspended, sts→0, data PVCs deleted) → vault-config's automated snapshot-restore brings the old vault back → ExternalSecrets/PushSecrets force-synced |
| Credentials | `KUBE_CONFIG`/`TALOS_CONFIG` refreshed automatically when a `DR_GH_ADMIN_TOKEN` secret (fine-grained PAT, environment-secrets write) is configured; loud warning otherwise |
| Summary | LB IP, external-dns note, CNPG + per-app data pointers |
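
The Confirm phase can be pictured as a small gate at the top of the first step. This is a minimal sketch, not the script from the PR: the phrase `REBUILD-PROD` comes from the description above, while the function name `confirm_rebuild` and the way the typed value reaches the script are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of a typed-confirmation gate.
# CONFIRM_PHRASE is taken from the PR text; everything else is illustrative.
CONFIRM_PHRASE="REBUILD-PROD"

confirm_rebuild() {
  # $1: the value the operator typed into the workflow_dispatch input
  local typed="${1:-}"
  if [ "$typed" != "$CONFIRM_PHRASE" ]; then
    # ::error:: is the GitHub Actions annotation syntax for a failed check
    echo "::error::Confirmation must be exactly '$CONFIRM_PHRASE'" >&2
    return 1
  fi
  echo "Confirmation accepted"
}
```

A step would call `confirm_rebuild "$TYPED_INPUT" || exit 1` before anything destructive runs, where `TYPED_INPUT` is a hypothetical environment variable carrying the dispatch input. The comparison is exact and case-sensitive, so a near miss such as `rebuild-prod` is rejected.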

Safety properties

Aborts the OpenBao surgery before touching data PVCs if the old unseal Secret didn't come back from the backup.
The snapshot-fetch pod and dr-r2-target ConfigMap are cleaned up either way.
restore=false input gives a bare rebuild (fresh vault, SOPS-seeded) with no restore steps.
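
The first safety property amounts to a precondition check that runs before any destructive command. The sketch below shows one way to express it under stated assumptions: the Secret name `openbao-unseal` is from the PR, but the namespace `openbao`, the function names, and the overridable `KUBECTL` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of an "abort before touching data PVCs" guard.
# KUBECTL is overridable so the guard can be exercised without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

unseal_secret_restored() {
  # $1: namespace (assumed default "openbao"; the PR does not name it)
  "$KUBECTL" get secret openbao-unseal -n "${1:-openbao}" >/dev/null 2>&1
}

guard_vault_reset() {
  if ! unseal_secret_restored "${1:-openbao}"; then
    echo "::error::openbao-unseal was not restored; leaving data PVCs untouched" >&2
    return 1
  fi
  echo "openbao-unseal present; safe to proceed with the vault reset"
}
```

Placing `guard_vault_reset || exit 1` ahead of the suspend, scale-down and PVC-deletion commands keeps the fresh vault intact whenever the old unseal keys are missing, since a restored snapshot could not be unsealed without them.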

Runbook truth-ups

Scenario 4 leads with the workflow; the manual path stays as the GitHub-Actions-is-down fallback.
The manual DNS step is gone: external-dns (policy: sync, gateway-httproute source) repoints the Cloudflare records automatically after a rebuild — the runbook step predated external-dns. Now it's "verify, intervene only if external-dns is broken".
Scenario 9 documents the automatic secret refresh; the OpenBao artifacts row points at the snapshot mirror + automated restore.
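
The automatic secret refresh mentioned for Scenario 9 could look roughly like the following. This is a sketch under assumptions rather than the workflow's actual step: `gh secret set` with `--env` and a value read from stdin is real GitHub CLI usage, but the environment name `prod`, the function name, and the overridable `GH` variable are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of "refresh when DR_GH_ADMIN_TOKEN is configured, warn loudly otherwise".
# GH is overridable for dry runs; "prod" as the environment name is an assumption.
GH="${GH:-gh}"

refresh_ci_secret() {
  # $1: secret name, $2: file holding the new value, $3: environment name
  local name="$1" file="$2" env_name="${3:-prod}"
  if [ -z "${DR_GH_ADMIN_TOKEN:-}" ]; then
    # ::warning:: surfaces as an annotation on the workflow run
    echo "::warning::DR_GH_ADMIN_TOKEN not set; update $name by hand"
    return 0
  fi
  # gh reads the secret value from stdin when no --body is given
  GH_TOKEN="$DR_GH_ADMIN_TOKEN" "$GH" secret set "$name" --env "$env_name" < "$file"
}
```

Returning success on the warning path means a missing token never fails the rebuild itself; it only leaves the stale `KUBE_CONFIG`/`TALOS_CONFIG` values for an operator to replace.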

Honest limits (stated in the workflow header + runbook)

Per-app PVC data (headlamp, actual-budget) isn't rehydrated into already-running pods — Velero skips existing resources; per-app reset per Scenario 5.
CNPG (umami-db) recovers from its own barman/R2 backups (Scenario 5's kubectl cnpg restore).
The workflow can't be end-to-end tested without sacrificing a prod cluster; every step mirrors the runbook commands and all step scripts pass bash -n, but treat the first real run as supervised.
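
Because Velero skips resources that already exist, the restore on a freshly converged cluster is expected to end in a phase other than a clean `Completed`. A minimal sketch of the acceptance check follows; the accepted phases come from the PR description, while the function name and the `velero` namespace in the comment are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of the restore-phase acceptance rule.
# The phase could be read with something like:
#   kubectl get restores.velero.io "$name" -n velero -o jsonpath='{.status.phase}'
# (namespace "velero" is an assumption).
restore_phase_ok() {
  # $1: the .status.phase of the Velero Restore
  case "$1" in
    Completed|PartiallyFailed) return 0 ;;
    *) return 1 ;;
  esac
}
```

Any other phase, including an empty value from a Restore that never started, is treated as a failure.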

Validation

Workflow YAML parses (yq); all 7 step scripts pass bash -n; embedded Pod manifest parses as valid YAML.
No k8s manifests changed → no overlay rebuild needed (ci.yaml path filter doesn't trigger system-test for this file).
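
The `bash -n` check referred to above parses a script without executing it, which is why it is safe to run against destructive DR steps. A sketch of a batch check is shown here; the function name is hypothetical, and how the step scripts are extracted from the workflow YAML is not shown in the PR, so the function simply takes file paths.

```shell
#!/usr/bin/env bash
# Sketch of a syntax-only validation pass over extracted step scripts.
check_step_scripts() {
  # Runs `bash -n` (parse only, no execution) on each file given.
  # Returns non-zero if any file fails to parse.
  local failed=0 f
  for f in "$@"; do
    if bash -n "$f" 2>/dev/null; then
      echo "ok   $f"
    else
      echo "FAIL $f"
      failed=1
    fi
  done
  return "$failed"
}
```

A syntax check catches unbalanced quotes or a missing `fi`, but not wrong resource names or flags, which is consistent with the advice to treat the first real run as supervised.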

🤖 Generated with Claude Code

…table)

Adds .github/workflows/dr-rebuild.yaml, a workflow_dispatch encoding of the full-cluster-rebuild runbook so recovery needs no operator machine, no local credentials and no stale CI secrets:

1. ksail cluster create (fresh kubeconfig/talosconfig ON the runner, so the workflow is immune to the stale KUBE_CONFIG/TALOS_CONFIG problem that breaks the regular deploy pipeline after a rebuild)
2. workload push + reconcile, then wait for all four Flux Kustomizations
3. (restore=true) Velero resource restore from the newest Completed backup synced from R2 (PartiallyFailed accepted: most resources already exist on the freshly-converged cluster)
4. (restore=true) OpenBao data recovery: an in-cluster pod fetches the newest raft snapshot from the R2 openbao-snapshots/ mirror onto the vault-snapshots PVC (creds never leave the cluster: it reuses the vault-snapshot-r2 Secret and the CronJob's NetworkPolicy via its app label); the pre-incident openbao-unseal Secret is restored from the Velero backup; the fresh vault is reset (HR suspended, sts scaled to 0, data PVCs deleted) and the vault-config Job's automated snapshot-restore path brings the old vault back; ExternalSecrets + PushSecrets are force-synced afterwards
5. KUBE_CONFIG/TALOS_CONFIG refreshed automatically when a DR_GH_ADMIN_TOKEN secret is configured, with a loud warning otherwise

Guard rails: requires typing REBUILD-PROD, shares the prod-deploy concurrency group with ci.yaml/cd.yaml so it can never race a deploy, and aborts the OpenBao surgery before touching data PVCs if the old unseal Secret did not come back.

Runbook truth-ups: scenario 4 leads with the workflow and documents that DNS is NOT a manual step (external-dns policy:sync repoints Cloudflare records automatically; the old step predated external-dns); scenario 9 documents the automatic secret refresh; the OpenBao artifacts row points at the snapshot mirror + automated restore.

Depends on #1996 (vault-snapshot-r2 Secret, R2 mirror, vault-config auto-restore path).

Validated: workflow YAML parses (yq), all 7 step scripts pass bash -n, embedded Pod manifest parses as valid YAML.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

botantler · 2026-06-10T22:39:47Z

🎉 This PR is included in version 1.48.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

devantler · 2026-06-11T18:16:46Z

🤖 Generated by the Daily AI Assistant

Applied the lesson from the restore drill's first real run on #1995: on this cluster, kubectl get backups resolves to CNPG's backups.postgresql.cnpg.io, not Velero's backups.velero.io. In a real DR, this workflow's backup-discovery loop would have listed CNPG Backups, found no Completed ones, and aborted the entire restore with 'No Completed Velero backup found'. All get/describe/wait calls against Velero resources now use fully qualified *.velero.io names. All step scripts still pass bash -n.
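
The fix can be illustrated with a discovery helper that names the Velero CRD explicitly. This is a sketch, not the loop from the workflow: `backups.velero.io` and the `status.phase`/`status.completionTimestamp` fields are Velero's own, while the `velero` namespace and both function names are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of backup discovery using a fully qualified resource name.
list_velero_backups() {
  # backups.velero.io cannot be shadowed by backups.postgresql.cnpg.io.
  # Namespace "velero" is an assumption.
  kubectl get backups.velero.io -n velero \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.completionTimestamp}{"\n"}{end}'
}

newest_completed_backup() {
  # stdin: "name<TAB>phase<TAB>completionTimestamp" lines.
  # RFC 3339 UTC timestamps sort correctly as plain strings.
  awk -F'\t' '$2 == "Completed" { print $3 "\t" $1 }' \
    | LC_ALL=C sort \
    | tail -n 1 \
    | cut -f2
}
```

A caller would run `list_velero_backups | newest_completed_backup` and abort when the result is empty. Keeping the selection logic separate from the `kubectl` call lets it be checked against canned input without a cluster.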

github-project-automation Bot added this to 🌊 Project Board Jun 10, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 10, 2026

devantler marked this pull request as ready for review June 10, 2026 22:38

devantler added this pull request to the merge queue Jun 10, 2026

Merged via the queue into main with commit 6d2f3e8 Jun 10, 2026
10 checks passed

devantler deleted the claude/dr-rebuild-workflow branch June 10, 2026 22:39

github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 10, 2026

botantler Bot added the released label Jun 10, 2026

devantler mentioned this pull request Jun 11, 2026

fix(dr): fully qualify Velero resource names in the rebuild workflow #2031

Merged

devantler merged 1 commit into main from claude/dr-rebuild-workflow
