Skip to content

ci: implement the documented Velero restore drill in the system-test job#1995

Merged
devantler merged 4 commits into
mainfrom
claude/ci-restore-drill
Jun 11, 2026
Merged

ci: implement the documented Velero restore drill in the system-test job#1995
devantler merged 4 commits into
mainfrom
claude/ci-restore-drill

Conversation

@devantler

Copy link
Copy Markdown
Contributor

🤖 Generated by the Daily AI Assistant

Summary

docs/dr/restore-drill.md has documented a CI restore drill since it was written — but no workflow ever implemented it. The backup → data-loss → restore path was never regression-tested, so a Velero chart bump, RBAC drift, or MinIO credential break would only surface during a real disaster (as 2026-06-10 demonstrated for the adjacent vault-snapshot path, which turned out to be a health check in a trench coat).

This implements the drill as steps inside the existing system-test job, reusing the Talos+Docker cluster it just reconciled — a separate job (as the doc originally described) would pay a second ~10-minute cluster bootstrap for no extra signal. Added wall-clock: ~2-3 minutes.

The drill

  1. Wait for BackupStorageLocation/defaultAvailable (Velero → in-cluster MinIO, the local R2 stand-in — same S3 code path as prod)
  2. Create dr-drill namespace + marker ConfigMap carrying run-id/sha
  3. Backup CR scoped to the namespace; wait for Completed, failing fast on Failed/PartiallyFailed/FailedValidation
  4. Delete the namespace, assert it's gone
  5. Restore CR from the backup; wait for Completed
  6. Assert the restored ConfigMap's run-id matches GITHUB_RUN_ID

Velero CRs are created with kubectl (no velero CLI install, can't drift from the deployed Velero version). On failure the step dumps BSL/Backup/Restore state + the Velero server log.

Doc truth-ups in the same change

  • restore-drill.md described a standalone job with its own cluster and a timeout-minutes: 240 budget that never existed → now describes the implemented steps.
  • runbook.md claimed the drill asserts etcd encryption-at-rest; restore-drill.md itself explicitly scopes that out → corrected.

Validation

  • Workflow YAML parses (yq); the 90-line drill script passes bash -n; the heredoc Backup/Restore payloads verified as valid YAML at column 0.
  • ⚠️ Expected CI status: system-test is currently red repo-wide from the pre-existing OpenBao label-patch 422 regression (being probed by ci: probe the OpenBao label-patch 422 with kubectl on system-test failure #1990); the drill steps run after reconcile, so this PR can't go green until that's resolved — the drill itself is unaffected.

🤖 Generated with Claude Code

docs/dr/restore-drill.md has documented a CI restore drill since it was
written, but no workflow ever implemented it -- the backup -> data-loss ->
restore path was never regression-tested, so a Velero chart bump, RBAC
drift or MinIO credential break would only surface during a real
disaster (the worst possible time, as the 2026-06-10 vault incident
demonstrated for the adjacent snapshot path).

Implement the drill as steps inside the existing system-test job,
reusing the Talos+Docker cluster it just reconciled (a separate job
would pay a second 10-minute cluster bootstrap for no extra signal):

1. wait for BackupStorageLocation/default to be Available (Velero ->
   in-cluster MinIO, the local R2 stand-in)
2. create a dr-drill namespace + marker ConfigMap carrying run-id/sha
3. Backup CR scoped to the namespace, wait for Completed (fail fast on
   Failed/PartiallyFailed/FailedValidation)
4. delete the namespace and assert it is gone
5. Restore CR from the backup, wait for Completed
6. assert the restored ConfigMap's run-id matches GITHUB_RUN_ID

Velero CRs are created with kubectl (no velero CLI install, no version
drift). On any drill failure the step dumps BSL/Backup/Restore state and
the Velero server log before exiting.

Also truth up the docs: restore-drill.md described a standalone job with
its own cluster and a timeout-minutes: 240 budget that never existed;
runbook.md claimed the drill asserts etcd encryption-at-rest, which
restore-drill.md itself explicitly scopes out.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

Copy link
Copy Markdown
Contributor Author

🤖 Generated by the Daily AI Assistant

System-test status: the first run failed in ~2 min on a transient schema-fetch error during ksail workload validate (bases/apps/actual-budget ... validation failed: EOF — also hit other PRs and branches with no manifest changes). On rerun, validate passed and the job ran the full ~53 min, failing at reconcile with infrastructure/apps NotReady — the known repo-wide OpenBao label-patch 422 wedge being probed in #1990 (same signature and duration as #1987/#1989). The drill steps added by this PR run after reconcile, so they haven't executed yet; this PR can go green once the 422 regression is fixed.

The drill's first real run (after main's reconcile wedge was fixed)
exposed a resource-name collision: CNPG also defines a 'backups'
resource, and kubectl resolves an unqualified 'backup' to
backups.postgresql.cnpg.io on this cluster -- so wait_phase polled the
wrong API group for its entire 300s timeout while the actual Velero
backup ran unobserved ('backups.postgresql.cnpg.io "dr-drill" not
found'). Qualify every get/describe/wait in the drill with the
velero.io group so the resolution can never be ambiguous, and note the
collision in a comment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

Copy link
Copy Markdown
Contributor Author

🤖 Generated by the Daily AI Assistant

The drill's first real run (now that main's reconcile wedge is fixed) exposed a genuine bug — exactly the kind of regression this drill exists to catch, just in itself this time: CNPG also defines a backups resource, and kubectl resolves an unqualified backup to backups.postgresql.cnpg.io on this cluster. wait_phase therefore polled the wrong API group for its whole 300s timeout (backups.postgresql.cnpg.io \"dr-drill\" not found) while the actual Velero backup ran unobserved.

Fixed in 65815e7f: every get/describe/wait in the drill now uses the fully-qualified *.velero.io resource names, with a comment documenting the collision. The next system-test run will exercise the full drill with correct polling.

@devantler devantler added this pull request to the merge queue Jun 11, 2026
Merged via the queue into main with commit 68d075e Jun 11, 2026
10 checks passed
@devantler devantler deleted the claude/ci-restore-drill branch June 11, 2026 18:15
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 11, 2026
@botantler

botantler Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

🎉 This PR is included in version 1.52.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler botantler Bot added the released label Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

1 participant