ci: probe the OpenBao label-patch 422 with kubectl on system-test failure #1990
Queued
devantler wants to merge 2 commits into
…lure

The 2026-06-10 diagnostics (#1986) narrowed the repo-wide system-test failure to OpenBao's Kubernetes service registration: the initial label write succeeds, but every state UPDATE patch is rejected by the apiserver with HTTP 422 forever (150+ consecutive retries observed), freezing the pod at openbao-active=false so the active Service never gains endpoints and the whole vault seeding chain times out. OpenBao logs only the status code, not the response body, so WHAT is invalid remains unknown.

Replay the same RFC 6902 replace (single label and the full openbao-* batch, using current values so the probe is a semantic no-op) with kubectl, which prints the apiserver's full error message, and add a server-side dry-run re-apply of the unmodified pod to surface latent object-validation failures.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
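The replay described in this commit message could look roughly like the sketch below. It is not the script from this PR: the namespace default, the helper names (`label_replace_op`, `json_patch`, `current_label`, `probe`), the use of `jq`, and the choice of `kubectl apply --dry-run=server` for the re-apply step are all assumptions made for illustration. Only the pod name `openbao-0` and the `openbao-active` / `openbao-sealed` / `openbao-initialized` labels come from this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical defaults: the thread names the pod but not the namespace.
NAMESPACE="${NAMESPACE:-openbao}"
POD="${POD:-openbao-0}"

# Emit one RFC 6902 "replace" operation for a label.
# The key is escaped per RFC 6901: "~" becomes "~0", then "/" becomes "~1".
# Assumes the value is a valid label value, so it needs no JSON escaping.
label_replace_op() {
  local key value="$2"
  key="$(printf '%s' "$1" | sed -e 's/~/~0/g' -e 's#/#~1#g')"
  printf '{"op":"replace","path":"/metadata/labels/%s","value":"%s"}' "$key" "$value"
}

# Join any number of operations into a JSON Patch array.
json_patch() {
  local IFS=,
  printf '[%s]' "$*"
}

# Read the current value of a label, so the replay changes nothing (requires jq).
current_label() {
  kubectl get pod "$POD" -n "$NAMESPACE" -o json | jq -r --arg k "$1" '.metadata.labels[$k]'
}

probe() {
  local active sealed initialized
  active="$(current_label openbao-active)"
  sealed="$(current_label openbao-sealed)"
  initialized="$(current_label openbao-initialized)"

  # 1. Single-label replay; kubectl prints the apiserver's full rejection message.
  kubectl patch pod "$POD" -n "$NAMESPACE" --type=json \
    -p "$(json_patch "$(label_replace_op openbao-active "$active")")" || true

  # 2. Batch replay of the openbao-* labels in one patch.
  kubectl patch pod "$POD" -n "$NAMESPACE" --type=json \
    -p "$(json_patch \
          "$(label_replace_op openbao-active "$active")" \
          "$(label_replace_op openbao-sealed "$sealed")" \
          "$(label_replace_op openbao-initialized "$initialized")")" || true

  # 3. Server-side dry-run re-apply of the unmodified pod.
  kubectl get pod "$POD" -n "$NAMESPACE" -o yaml \
    | kubectl apply --dry-run=server -f - || true
}

# Only touch the cluster when explicitly asked to.
if [[ "${1:-}" == "--run" ]]; then
  probe
fi
```

Each kubectl call is followed by `|| true` so that one rejected request does not stop the remaining probes, since in a diagnostics step the rejections are the output of interest.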
🎯 The probe delivered. The 422 is not the label patch — the apiserver diff shows the |
devantler added a commit that referenced this pull request on Jun 10, 2026
The add-security-context ClusterPolicy matched Pods without an operations scope, so Kyverno also applied the securityContext mutation on every pod UPDATE. Pod spec is immutable: for any pod created while the policy/webhook was not yet active (exactly what fresh-cluster bring-up ordering produces), every later update gets the mutation bolted on and the apiserver rejects the whole request with HTTP 422 'pod updates may not change fields other than image...'.

That bricked OpenBao's Kubernetes service registration in CI: the label-state updates (openbao-active/sealed/initialized) 422'd forever, the openbao-active Service never gained endpoints, the entire vault seeding chain timed out, and every system test since the 2026-06-10 active-service cutover failed. Probe evidence in #1990's run: the 422 diff shows the webhook's own securityContext injection, not the label patch. Prod was unaffected only because its pods happened to be recreated while the policy was live (mutation already in the spec -> no-op on update).

Scope both rules to operations: [CREATE]. Pods created before the policy stay unmutated until their next natural recreation, which is strictly better than being permanently un-updatable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
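As an illustration of the scoping change, the sketch below prints a minimal ClusterPolicy with a single rule restricted to CREATE. It is not the policy from this repository: the rule name and the mutation body are placeholders, only one of the two rules is shown, and the `render_scoped_policy` helper is hypothetical. It also assumes a Kyverno release that supports the `operations` field under `match`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print a hypothetical, trimmed-down version of the scoped policy.
render_scoped_policy() {
  cat <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-security-context
spec:
  rules:
    - name: add-pod-security-context   # hypothetical rule name
      match:
        any:
          - resources:
              kinds:
                - Pod
              operations:
                - CREATE               # mutate at admission of new pods only, never on UPDATE
      mutate:
        patchStrategicMerge:
          spec:
            securityContext:
              runAsNonRoot: true       # placeholder mutation, not taken from the source
EOF
}
```

Against a cluster with Kyverno installed, something like `render_scoped_policy | kubectl apply --dry-run=server -f -` would be one way to check that the manifest is accepted before committing it.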
Any commits made after this event will not be merged.
What #1986's first dump revealed
The repo-wide system-test failure is now narrowed to one precise mechanism:
- openbao-0 is initialized + unsealed, and the initial service-registration label write succeeds (openbao-active=false, openbao-sealed=true, … all present; the standby Service has endpoints)
- every subsequent state update patch is rejected by the apiserver with HTTP 422, retried every 5s forever
- so the labels freeze at their startup values, openbao-active never gains endpoints (<unset> in the EndpointSlice dump), and the whole vault seeding chain times out with connect: no route to host

OpenBao logs only the status code, not the 422 response body that names the invalid field. Prod (k8s v1.36.1 / Talos 1.13.3) accepts identical patches; the CI cluster (v1.36.0 / Talos 1.13.4) rejects them.
This PR
Extends the OpenBao state diagnostics group with a probe that replays the same RFC 6902 replace via kubectl (single label + the full openbao-* batch, using current values → semantically a no-op), which prints the apiserver's full explanation, plus a server-side dry-run re-apply of the unmodified pod to catch latent object-validation errors. This PR's own failing system test will produce the answer.
Validation
🤖 Generated with Claude Code