ci: implement the documented Velero restore drill in the system-test job#1995
Conversation
docs/dr/restore-drill.md has documented a CI restore drill since it was written, but no workflow ever implemented it -- the backup -> data-loss -> restore path was never regression-tested, so a Velero chart bump, RBAC drift or MinIO credential break would only surface during a real disaster (the worst possible time, as the 2026-06-10 vault incident demonstrated for the adjacent snapshot path). Implement the drill as steps inside the existing system-test job, reusing the Talos+Docker cluster it just reconciled (a separate job would pay a second 10-minute cluster bootstrap for no extra signal): 1. wait for BackupStorageLocation/default to be Available (Velero -> in-cluster MinIO, the local R2 stand-in) 2. create a dr-drill namespace + marker ConfigMap carrying run-id/sha 3. Backup CR scoped to the namespace, wait for Completed (fail fast on Failed/PartiallyFailed/FailedValidation) 4. delete the namespace and assert it is gone 5. Restore CR from the backup, wait for Completed 6. assert the restored ConfigMap's run-id matches GITHUB_RUN_ID Velero CRs are created with kubectl (no velero CLI install, no version drift). On any drill failure the step dumps BSL/Backup/Restore state and the Velero server log before exiting. Also truth up the docs: restore-drill.md described a standalone job with its own cluster and a timeout-minutes: 240 budget that never existed; runbook.md claimed the drill asserts etcd encryption-at-rest, which restore-drill.md itself explicitly scopes out. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
System-test status: the first run failed in ~2 min on a transient schema-fetch error during |
The drill's first real run (after main's reconcile wedge was fixed)
exposed a resource-name collision: CNPG also defines a 'backups'
resource, and kubectl resolves an unqualified 'backup' to
backups.postgresql.cnpg.io on this cluster -- so wait_phase polled the
wrong API group for its entire 300s timeout while the actual Velero
backup ran unobserved ('backups.postgresql.cnpg.io "dr-drill" not
found'). Qualify every get/describe/wait in the drill with the
velero.io group so the resolution can never be ambiguous, and note the
collision in a comment.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The drill's first real run (now that main's reconcile wedge is fixed) exposed a genuine bug — exactly the kind of regression this drill exists to catch, just in itself this time: CNPG also defines a Fixed in |
|
🎉 This PR is included in version 1.52.1 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
Summary
docs/dr/restore-drill.md has documented a CI restore drill since it was written — but no workflow ever implemented it. The backup → data-loss → restore path was never regression-tested, so a Velero chart bump, RBAC drift, or MinIO credential break would only surface during a real disaster (as 2026-06-10 demonstrated for the adjacent vault-snapshot path, which turned out to be a health check in a trench coat).
This implements the drill as steps inside the existing
system-testjob, reusing the Talos+Docker cluster it just reconciled — a separate job (as the doc originally described) would pay a second ~10-minute cluster bootstrap for no extra signal. Added wall-clock: ~2-3 minutes.The drill
BackupStorageLocation/default→Available(Velero → in-cluster MinIO, the local R2 stand-in — same S3 code path as prod)dr-drillnamespace + marker ConfigMap carryingrun-id/shaBackupCR scoped to the namespace; wait forCompleted, failing fast onFailed/PartiallyFailed/FailedValidationRestoreCR from the backup; wait forCompletedrun-idmatchesGITHUB_RUN_IDVelero CRs are created with
kubectl(no velero CLI install, can't drift from the deployed Velero version). On failure the step dumps BSL/Backup/Restore state + the Velero server log.Doc truth-ups in the same change
restore-drill.mddescribed a standalone job with its own cluster and atimeout-minutes: 240budget that never existed → now describes the implemented steps.runbook.mdclaimed the drill asserts etcd encryption-at-rest;restore-drill.mditself explicitly scopes that out → corrected.Validation
bash -n; the heredoc Backup/Restore payloads verified as valid YAML at column 0.🤖 Generated with Claude Code