devantler-tech · devantler · Jun 11, 2026 · Jun 10, 2026 · Jun 11, 2026 · Jun 11, 2026
@@ -82,6 +82,108 @@ jobs:
           reconcile: "true"
           delete: "false"
 
+      - name: 💾 DR restore drill (Velero backup → delete → restore)
+        # Validates the full backup → data-loss → restore cycle against the
+        # in-cluster MinIO (the local R2 stand-in) on every k8s PR, so the
+        # Velero code path is regression-tested before changes reach prod.
+        # Reuses the cluster the System Test step just reconciled; adds ~2-3
+        # minutes. See docs/dr/restore-drill.md for design + manual run.
+        run: |
+          set -euo pipefail
+
+          # Resource names are fully qualified with the velero.io group
+          # throughout: CNPG also defines a `backups` resource, and kubectl
+          # resolves an unqualified `backup` to backups.postgresql.cnpg.io on
+          # this cluster — the drill's first run polled the wrong API group
+          # for its entire timeout while the actual Velero backup completed.
+          dump_velero_state() {
+            echo "::group::Velero state (drill failure)"
+            kubectl -n velero get backupstoragelocations.velero.io,backups.velero.io,restores.velero.io -o wide || true
+            kubectl -n velero describe backups.velero.io dr-drill || true
+            kubectl -n velero describe restores.velero.io dr-drill || true
+            kubectl -n velero logs deploy/velero --tail=200 || true
+            echo "::endgroup::"
+          }
+          trap dump_velero_state ERR
+
+          wait_phase() {
+            # wait_phase <kind> <name> <timeout-seconds> — poll a Velero CR
+            # until .status.phase is Completed; fail fast on a terminal
+            # failure phase instead of burning the whole timeout.
+            local kind="$1" name="$2" timeout="$3" phase=""
+            local deadline=$((SECONDS + timeout))
+            while [ "$SECONDS" -lt "$deadline" ]; do
+              phase=$(kubectl -n velero get "$kind" "$name" -o jsonpath='{.status.phase}' 2>/dev/null || true)
+              case "$phase" in
+                Completed) echo "$kind/$name: Completed"; return 0 ;;
+                Failed|PartiallyFailed|FailedValidation)
+                  echo "::error::$kind/$name entered terminal phase $phase"
+                  return 1 ;;
+                *) sleep 5 ;;
+              esac
+            done
+            echo "::error::$kind/$name did not complete within ${timeout}s (last phase: ${phase:-<none>})"
+            return 1
+          }
+
+          echo "::group::Wait for BackupStorageLocation default to be Available"
+          kubectl -n velero wait backupstoragelocations.velero.io/default \
+            --for=jsonpath='{.status.phase}'=Available --timeout=10m
+          echo "::endgroup::"
+
+          echo "::group::Create marker namespace + ConfigMap"
+          kubectl create namespace dr-drill
+          kubectl -n dr-drill create configmap dr-marker \
+            --from-literal=run-id="${GITHUB_RUN_ID}" \
+            --from-literal=sha="${GITHUB_SHA}"
+          echo "::endgroup::"
+
+          echo "::group::Back up the marker namespace"
+          kubectl -n velero create -f - <<EOF
+          apiVersion: velero.io/v1
+          kind: Backup
+          metadata:
+            name: dr-drill
+            namespace: velero
+          spec:
+            includedNamespaces:
+              - dr-drill
+            storageLocation: default
+            ttl: 1h0m0s
+          EOF
+          wait_phase backups.velero.io dr-drill 300
+          echo "::endgroup::"
+
+          echo "::group::Simulate data loss (delete the namespace)"
+          kubectl delete namespace dr-drill --wait=true --timeout=2m
+          if kubectl get namespace dr-drill >/dev/null 2>&1; then
+            echo "::error::namespace dr-drill still exists after deletion"
+            exit 1
+          fi
+          echo "::endgroup::"
+
+          echo "::group::Restore from the backup"
+          kubectl -n velero create -f - <<EOF
+          apiVersion: velero.io/v1
+          kind: Restore
+          metadata:
+            name: dr-drill
+            namespace: velero
+          spec:
+            backupName: dr-drill
+          EOF
+          wait_phase restores.velero.io dr-drill 300
+          echo "::endgroup::"
+
+          echo "::group::Verify restored marker"
+          restored=$(kubectl -n dr-drill get configmap dr-marker -o jsonpath='{.data.run-id}')
+          if [ "${restored}" != "${GITHUB_RUN_ID}" ]; then
+            echo "::error::restored run-id '${restored}' does not match expected '${GITHUB_RUN_ID}'"
+            dump_velero_state; exit 1  # an explicit exit bypasses the ERR trap
+          fi
+          echo "✅ Restore drill passed: marker ConfigMap restored with matching run-id."
+          echo "::endgroup::"
+
       - name: 🩺 Diagnose Flux on failure
         if: failure()
         run: |
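
The fail-fast behaviour of `wait_phase` in the drill step above can be exercised without a cluster. The following is a minimal sketch, not part of the workflow: it copies the polling loop, makes the sleep configurable through a hypothetical `POLL_INTERVAL` variable (the workflow hard-codes 5 seconds), and replaces `kubectl` with a made-up shell-function stub that plays back a scripted list of phases. `PHASES`, `PHASE_FILE`, and `run_case` are illustrative names only.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical stub state: the index of the next phase to report lives in a
# file, because wait_phase calls kubectl inside a command substitution
# (a subshell), so a plain variable would not survive between polls.
PHASE_FILE=$(mktemp)
trap 'rm -f "$PHASE_FILE"' EXIT
PHASES=(New)

# Stub standing in for the real kubectl: ignores its arguments and prints the
# next scripted phase, repeating the last one once the list is exhausted.
kubectl() {
  local idx last
  idx=$(<"$PHASE_FILE")
  last=$(( ${#PHASES[@]} - 1 ))
  echo $(( idx < last ? idx + 1 : last )) > "$PHASE_FILE"
  printf '%s' "${PHASES[$idx]}"
}

# Same loop as the workflow, except for the configurable poll interval.
wait_phase() {
  local kind="$1" name="$2" timeout="$3" phase=""
  local deadline=$((SECONDS + timeout))
  while [ "$SECONDS" -lt "$deadline" ]; do
    phase=$(kubectl -n velero get "$kind" "$name" -o jsonpath='{.status.phase}' 2>/dev/null || true)
    case "$phase" in
      Completed) echo "$kind/$name: Completed"; return 0 ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "::error::$kind/$name entered terminal phase $phase"
        return 1 ;;
      *) sleep "${POLL_INTERVAL:-5}" ;;
    esac
  done
  echo "::error::$kind/$name did not complete within ${timeout}s (last phase: ${phase:-<none>})"
  return 1
}

# run_case <phase>...: reset the stub, then poll with a 5-second budget.
run_case() {
  PHASES=("$@")
  echo 0 > "$PHASE_FILE"
  wait_phase backups.velero.io dr-drill 5
}

POLL_INTERVAL=0
run_case New InProgress Completed;       echo "exit=$?"
run_case New InProgress PartiallyFailed; echo "exit=$?"
```

The first case should report `Completed` with exit status 0; the second should stop on the first terminal phase with exit status 1, well before the 5-second budget runs out.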

@@ -1,27 +1,33 @@
 # DR restore drill (CI)
 
-`.github/workflows/ci.yaml` runs a `restore-drill` job on every PR that
-touches `k8s/**` or the cluster configs. The job validates the full
-backup → data-loss → restore cycle end-to-end on a local Talos+Docker
-cluster, so the Velero code path is regression-tested **before** changes
-reach `prod`.
+`.github/workflows/ci.yaml` runs restore-drill steps inside the
+`system-test` job on every PR that touches `k8s/**` or the cluster
+configs. The drill validates the full backup → data-loss → restore cycle
+end-to-end on the local Talos+Docker cluster the job just reconciled, so
+the Velero code path is regression-tested **before** changes reach
+`prod`. (Reusing the system-test cluster instead of creating a second
+one keeps the added wall-clock time to roughly 2-3 minutes.)
 
 ## What it does
 
-1. `ksail cluster create` and reconcile all workloads.
-2. Wait for **Velero** + **MinIO** (the local R2 stand-in) to be ready
-   and `BackupStorageLocation/default` `Available`.
+1. Reuse the cluster the `system-test` job created and reconciled.
+2. Wait for `BackupStorageLocation/default` to report `Available`
+   (Velero validates against **MinIO**, the local R2 stand-in).
 3. Create a marker `Namespace`/`ConfigMap` carrying the GitHub
    `run-id` and `sha` (so identity can be proved later).
-4. `velero backup create` against the marker namespace, `--wait` for
-   `Completed`.
+4. Create a `Backup` CR scoped to the marker namespace and wait for phase
+   `Completed` (fail fast on `Failed`/`PartiallyFailed`/`FailedValidation`).
 5. **Simulate data loss**: delete the marker namespace (`kubectl delete
    namespace`).
 6. Assert the marker namespace does **not** exist after deletion.
-7. `velero restore create --from-backup ... --wait` for `Completed`.
+7. Create a `Restore` CR from the backup and wait for `Completed`.
 8. Assert the marker `ConfigMap` is back and `data.run-id` matches the
    current `GITHUB_RUN_ID`.
-9. Tear down the cluster (`if: always()`).
+9. The job tears down the cluster (`if: always()`) as usual.
+
+> Velero CRs are created with `kubectl` rather than the `velero` CLI, so
+> the drill needs no extra tool install and there is no CLI version to
+> keep in sync with the deployed Velero version.
 
 > **Why namespace deletion instead of full cluster rebuild?** MinIO runs
 > in-cluster with ephemeral storage, so destroying the cluster would also
@@ -31,11 +37,13 @@ reach `prod`.
 
 ## Wall-clock budget
 
-`timeout-minutes: 240` on the job — matches the **4 h RTO** documented
-in [`runbook.md`](./runbook.md). In practice the drill runs in ~15 min.
-The 4 h ceiling is the operator promise for the manual prod path; CI
-keeps that promise honest by failing fast if the local round trip
-explodes.
+The drill itself is bounded: 10 min for the `BackupStorageLocation` to
+go `Available`, 5 min each for the backup and the restore to reach
+`Completed` (terminal failure phases abort immediately), and 2 min for
+the namespace deletion. In practice the whole sequence takes ~2-3 minutes
+on top of the system test. The **4 h RTO** in [`runbook.md`](./runbook.md)
+is the operator promise for the manual prod path; CI keeps that promise
+honest by failing as soon as the local round trip exceeds these bounds.
 
 ## What this catches
 

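The wall-clock budget in `restore-drill.md` can be cross-checked against the timeouts in the workflow step. The sketch below simply sums the explicit timeouts from the drill (`--timeout=10m`, the two `wait_phase ... 300` calls, and the `--timeout=2m` namespace deletion) and compares the total with the 4 h RTO. It is an approximate upper bound under stated assumptions: `kubectl` calls without an explicit timeout (`create`, `get`) are ignored, and `wait_phase` may overshoot its deadline by up to one poll. The variable names are illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Timeouts copied from the drill step, in seconds.
bsl_wait=$((10 * 60))     # kubectl wait ... --timeout=10m
backup_wait=300           # wait_phase backups.velero.io dr-drill 300
delete_wait=$((2 * 60))   # kubectl delete namespace ... --timeout=2m
restore_wait=300          # wait_phase restores.velero.io dr-drill 300

worst_case=$((bsl_wait + backup_wait + delete_wait + restore_wait))
rto=$((4 * 60 * 60))      # 4 h RTO from runbook.md

echo "worst case: ${worst_case}s ($((worst_case / 60)) min)"
echo "share of RTO: $((worst_case * 100 / rto))%"
```

Under these assumptions the explicit timeouts add up to 1320 s (22 min), roughly 9% of the 4 h RTO, while the typical run is the ~2-3 minutes quoted in the docs.
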
@@ -302,7 +302,10 @@ etcdctl --endpoints unix:///tmp/etcd.snapshot \
 # Kubernetes Secret YAML, the EncryptionConfiguration was lost.
 ```
 
-This check is also asserted by the CI restore drill (see [restore-drill.md](./restore-drill.md)).
+This check is deliberately **not** part of the CI restore drill: Talos
+verifies the encryption key at install time, so a CI assertion would add
+complexity for a structurally enforced property (see
+[restore-drill.md](./restore-drill.md) for the full rationale).
 
 ---