ci: implement the documented Velero restore drill in the system-test job by devantler · Pull Request #1995 · devantler-tech/platform

devantler · 2026-06-10T22:21:10Z

🤖 Generated by the Daily AI Assistant

Summary

docs/dr/restore-drill.md has documented a CI restore drill since it was written — but no workflow ever implemented it. The backup → data-loss → restore path was never regression-tested, so a Velero chart bump, RBAC drift, or MinIO credential break would only surface during a real disaster (as 2026-06-10 demonstrated for the adjacent vault-snapshot path, which turned out to be a health check in a trench coat).

This implements the drill as steps inside the existing system-test job, reusing the Talos+Docker cluster it just reconciled — a separate job (as the doc originally described) would pay a second ~10-minute cluster bootstrap for no extra signal. Added wall-clock: ~2-3 minutes.

The drill

1. Wait for BackupStorageLocation/default → Available (Velero → in-cluster MinIO, the local R2 stand-in; same S3 code path as prod)
2. Create the dr-drill namespace + a marker ConfigMap carrying run-id/sha
3. Create a Backup CR scoped to the namespace; wait for Completed, failing fast on Failed/PartiallyFailed/FailedValidation
4. Delete the namespace, assert it's gone
5. Create a Restore CR from the backup; wait for Completed
6. Assert the restored ConfigMap's run-id matches GITHUB_RUN_ID
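
The polling and assertion steps above could look roughly like the following. This is a minimal sketch, not the script from this PR: `wait_phase` is named in the later comments, but its signature here, the `assert_run_id` helper, the `velero` namespace, and the `dr-drill-marker` ConfigMap name are all assumptions made for illustration.

```shell
# Sketch only. Assumed (not taken from the PR): the function signatures, the
# "velero" namespace, and the ConfigMap name "dr-drill-marker".

# Poll a Velero resource until .status.phase is Completed. Fail fast on a
# terminal failure phase; give up after $3 seconds (default 300).
wait_phase() {
  local resource="$1" name="$2" timeout="${3:-300}" interval="${4:-5}"
  local deadline phase
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    # A missing object or empty status yields an empty phase; keep polling.
    phase="$(kubectl -n velero get "$resource" "$name" \
      -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    case "$phase" in
      Completed)
        return 0 ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "$resource/$name reached terminal phase: $phase" >&2
        return 1 ;;
    esac
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $resource/$name (last phase: ${phase:-none})" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Compare the restored marker's run-id against the expected value.
assert_run_id() {
  local expected="$1" actual
  actual="$(kubectl -n dr-drill get configmap dr-drill-marker \
    -o "jsonpath={.data['run-id']}" 2>/dev/null || true)"
  if [ "$actual" != "$expected" ]; then
    echo "run-id mismatch: expected '$expected', got '$actual'" >&2
    return 1
  fi
}
```

In a workflow step this would be called as `wait_phase backups.velero.io "$BACKUP_NAME"` followed, after the restore, by `assert_run_id "$GITHUB_RUN_ID"`, where `BACKUP_NAME` is again a hypothetical variable.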

Velero CRs are created with kubectl (no velero CLI install, can't drift from the deployed Velero version). On failure the step dumps BSL/Backup/Restore state + the Velero server log.
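
As an illustration of creating the CRs without the velero CLI, the heredoc payloads might be rendered along these lines. The helper names and the `velero` namespace are assumptions; `spec.includedNamespaces` and `spec.backupName` are the standard Velero fields for scoping a backup and pointing a restore at it.

```shell
# Sketch only. render_backup/render_restore are hypothetical helpers, and the
# "velero" namespace is an assumption about where Velero is installed.

# Print a Backup manifest scoped to a single namespace.
render_backup() {
  local name="$1" namespace="$2"
  cat <<EOF
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ${name}
  namespace: velero
spec:
  includedNamespaces:
    - ${namespace}
EOF
}

# Print a Restore manifest that restores from the named backup.
render_restore() {
  local name="$1" backup="$2"
  cat <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ${name}
  namespace: velero
spec:
  backupName: ${backup}
EOF
}
```

Each manifest can then be piped to the cluster, for example `render_backup "dr-drill-$GITHUB_RUN_ID" dr-drill | kubectl apply -f -`. Because only kubectl and the CRD schema are involved, there is no client binary whose version has to track the deployed Velero server.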

Doc truth-ups in the same change

restore-drill.md described a standalone job with its own cluster and a timeout-minutes: 240 budget that never existed → now describes the implemented steps.
runbook.md claimed the drill asserts etcd encryption-at-rest; restore-drill.md itself explicitly scopes that out → corrected.

Validation

Workflow YAML parses (yq); the 90-line drill script passes bash -n; the heredoc Backup/Restore payloads verified as valid YAML at column 0.
⚠️ Expected CI status: system-test is currently red repo-wide from the pre-existing OpenBao label-patch 422 regression (being probed by ci: probe the OpenBao label-patch 422 with kubectl on system-test failure #1990); the drill steps run after reconcile, so this PR can't go green until that's resolved — the drill itself is unaffected.

🤖 Generated with Claude Code

docs/dr/restore-drill.md has documented a CI restore drill since it was written, but no workflow ever implemented it -- the backup -> data-loss -> restore path was never regression-tested, so a Velero chart bump, RBAC drift or MinIO credential break would only surface during a real disaster (the worst possible time, as the 2026-06-10 vault incident demonstrated for the adjacent snapshot path).

Implement the drill as steps inside the existing system-test job, reusing the Talos+Docker cluster it just reconciled (a separate job would pay a second 10-minute cluster bootstrap for no extra signal):

1. wait for BackupStorageLocation/default to be Available (Velero -> in-cluster MinIO, the local R2 stand-in)
2. create a dr-drill namespace + marker ConfigMap carrying run-id/sha
3. Backup CR scoped to the namespace, wait for Completed (fail fast on Failed/PartiallyFailed/FailedValidation)
4. delete the namespace and assert it is gone
5. Restore CR from the backup, wait for Completed
6. assert the restored ConfigMap's run-id matches GITHUB_RUN_ID

Velero CRs are created with kubectl (no velero CLI install, no version drift). On any drill failure the step dumps BSL/Backup/Restore state and the Velero server log before exiting.

Also truth up the docs: restore-drill.md described a standalone job with its own cluster and a timeout-minutes: 240 budget that never existed; runbook.md claimed the drill asserts etcd encryption-at-rest, which restore-drill.md itself explicitly scopes out.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

devantler · 2026-06-10T23:34:35Z

🤖 Generated by the Daily AI Assistant

System-test status: the first run failed in ~2 min on a transient schema-fetch error during ksail workload validate (bases/apps/actual-budget ... validation failed: EOF — also hit other PRs and branches with no manifest changes). On rerun, validate passed and the job ran the full ~53 min, failing at reconcile with infrastructure/apps NotReady — the known repo-wide OpenBao label-patch 422 wedge being probed in #1990 (same signature and duration as #1987/#1989). The drill steps added by this PR run after reconcile, so they haven't executed yet; this PR can go green once the 422 regression is fixed.

The drill's first real run (after main's reconcile wedge was fixed) exposed a resource-name collision: CNPG also defines a 'backups' resource, and kubectl resolves an unqualified 'backup' to backups.postgresql.cnpg.io on this cluster -- so wait_phase polled the wrong API group for its entire 300s timeout while the actual Velero backup ran unobserved ('backups.postgresql.cnpg.io "dr-drill" not found').

Qualify every get/describe/wait in the drill with the velero.io group so the resolution can never be ambiguous, and note the collision in a comment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

devantler · 2026-06-11T17:44:15Z

🤖 Generated by the Daily AI Assistant

The drill's first real run (now that main's reconcile wedge is fixed) exposed a genuine bug, exactly the kind of regression this drill exists to catch, just in itself this time: CNPG also defines a backups resource, and kubectl resolves an unqualified backup to backups.postgresql.cnpg.io on this cluster. wait_phase therefore polled the wrong API group for its whole 300s timeout (backups.postgresql.cnpg.io "dr-drill" not found) while the actual Velero backup ran unobserved.

Fixed in 65815e7f: every get/describe/wait in the drill now uses the fully-qualified *.velero.io resource names, with a comment documenting the collision. The next system-test run will exercise the full drill with correct polling.
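
One way to keep every call unambiguous is to route resource names through a small lookup, as sketched below. `velero_resource` is a hypothetical helper, not something taken from commit 65815e7f; the `<plural>.<group>` names it returns are the standard form kubectl accepts for disambiguating resources that share a short name.

```shell
# Sketch only. velero_resource is a hypothetical helper for illustration.
# It maps a short kind to its fully-qualified <plural>.<group> resource name
# so kubectl cannot resolve it to another API group (for example CNPG's
# backups.postgresql.cnpg.io).
velero_resource() {
  case "$1" in
    backup|backups)
      echo "backups.velero.io" ;;
    restore|restores)
      echo "restores.velero.io" ;;
    bsl|backupstoragelocation|backupstoragelocations)
      echo "backupstoragelocations.velero.io" ;;
    *)
      echo "unknown Velero kind: $1" >&2
      return 1 ;;
  esac
}
```

A call site would then read `kubectl -n velero get "$(velero_resource backup)" dr-drill`, with the `velero` namespace assumed. Running `kubectl api-resources --api-group=velero.io` on the cluster lists the names that group actually serves, which is a quick check when a collision like this is suspected.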

botantler · 2026-06-11T18:48:32Z

🎉 This PR is included in version 1.52.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

github-project-automation Bot added this to 🌊 Project Board Jun 10, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 10, 2026

devantler had a problem deploying to ci June 10, 2026 22:21 — with GitHub Actions Failure

devantler marked this pull request as ready for review June 10, 2026 22:23

devantler enabled auto-merge June 10, 2026 22:23

devantler had a problem deploying to ci June 10, 2026 22:40 — with GitHub Actions Failure

devantler mentioned this pull request Jun 10, 2026

feat(ha): add drain-safe PDBs to the remaining critical controllers and apps #1991

Open

Merge branch 'main' into claude/ci-restore-drill

ec9a888

devantler had a problem deploying to ci June 11, 2026 15:35 — with GitHub Actions Failure

botantler Bot approved these changes Jun 11, 2026


Merge branch 'main' into claude/ci-restore-drill

650b4ef

devantler had a problem deploying to ci June 11, 2026 17:10 — with GitHub Actions Failure

botantler Bot approved these changes Jun 11, 2026


devantler temporarily deployed to ci June 11, 2026 17:44 — with GitHub Actions Inactive

botantler Bot approved these changes Jun 11, 2026


devantler added this pull request to the merge queue Jun 11, 2026

Merged via the queue into main with commit 68d075e Jun 11, 2026
10 checks passed

devantler deleted the claude/ci-restore-drill branch June 11, 2026 18:15

github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 11, 2026

This was referenced Jun 11, 2026

feat(dr): one-button prod rebuild workflow (runbook scenario 4, executable) #1997

Merged

fix(dr): fully qualify Velero resource names in the rebuild workflow #2031

Merged

botantler Bot added the released label Jun 11, 2026

devantler merged 4 commits into main from claude/ci-restore-drill
