ci: probe the OpenBao label-patch 422 with kubectl on system-test failure #1990

Queued
devantler wants to merge 2 commits into main from claude/repo-assist-ci-probe-label-patch

@devantler

🤖 Generated by the Daily AI Assistant

What #1986's first dump revealed

The repo-wide system-test failure is now narrowed to one precise mechanism:

  • openbao-0 is initialized + unsealed, and the initial service-registration label write succeeds (openbao-active=false, openbao-sealed=true, … all present; the standby Service has endpoints)

  • every subsequent state update patch is rejected by the apiserver with HTTP 422, retried every 5s forever:

    [WARN] service_registration.kubernetes: unable to update state for due to PATCH
    https://10.96.0.1:443/api/v1/namespaces/openbao/pods/openbao-0 ... resp statuscode: 422, will retry
    
  • so the labels freeze at their startup values, the openbao-active Service never gains endpoints (<unset> in the EndpointSlice dump), and the whole vault seeding chain times out with connect: no route to host
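
To make the last step of that chain concrete, here is a minimal sketch of why a frozen label leaves the active Service without endpoints. It assumes the active Service selects on `openbao-active: "true"` (the actual Service manifest is not shown in this thread), and it ignores pod readiness, which also affects endpoint membership.

```python
def selects(selector: dict, pod_labels: dict) -> bool:
    """A Service selector matches a pod when every selector pair appears in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Assumption: selector of the active Service; not taken from the repository.
ACTIVE_SELECTOR = {"openbao-active": "true"}

# Startup values reported in the dump; they never change because every update 422s.
frozen_labels = {"openbao-active": "false", "openbao-sealed": "true"}

# Illustrative values the pod would carry if the state updates had gone through.
intended_labels = {"openbao-active": "true", "openbao-sealed": "false"}

if __name__ == "__main__":
    print("frozen pod selected:", selects(ACTIVE_SELECTOR, frozen_labels))
    print("intended pod selected:", selects(ACTIVE_SELECTOR, intended_labels))
```

With the labels stuck at their startup values the selector matches no pod, which is consistent with the `<unset>` endpoints in the EndpointSlice dump.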

OpenBao logs only the status code, not the 422 response body that names the invalid field. Prod (k8s v1.36.1 / Talos 1.13.3) accepts identical patches; the CI cluster (v1.36.0 / Talos 1.13.4) rejects them.

This PR

Extends the OpenBao state diagnostics group with a probe that replays the same RFC 6902 replace via kubectl: once for a single label and once for the full openbao-* batch, both using the labels' current values so the patch is semantically a no-op. Unlike OpenBao, kubectl prints the apiserver's full error message. The probe also runs a server-side dry-run re-apply of the unmodified pod to catch latent object-validation errors. This PR's own failing system test will produce the answer.
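
As a rough illustration of what such a no-op replay looks like, the sketch below builds the RFC 6902 `replace` operations from a pod's current labels and assembles the matching `kubectl patch --type=json` command line. The helper names and the sample label values are hypothetical; the workflow step added by this PR may be written differently.

```python
import json
import shlex


def escape_pointer_token(token: str) -> str:
    """Escape one JSON Pointer token per RFC 6901: '~' becomes '~0', '/' becomes '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def build_label_replace_patch(labels: dict, prefix: str = "openbao-") -> list:
    """Build 'replace' ops that rewrite each matching label to the value it already has."""
    return [
        {
            "op": "replace",
            "path": "/metadata/labels/" + escape_pointer_token(key),
            "value": value,
        }
        for key, value in sorted(labels.items())
        if key.startswith(prefix)
    ]


def kubectl_patch_argv(namespace: str, pod: str, patch: list) -> list:
    """Assemble the kubectl invocation; running it is left to the caller."""
    return ["kubectl", "patch", "pod", pod, "-n", namespace,
            "--type=json", "-p", json.dumps(patch)]


if __name__ == "__main__":
    # Illustrative labels; in CI they would be read from the live pod.
    current = {
        "openbao-active": "false",
        "openbao-sealed": "true",
        "app.kubernetes.io/name": "openbao",
    }
    patch = build_label_replace_patch(current)
    print(shlex.join(kubectl_patch_argv("openbao", "openbao-0", patch)))
```

Because every operation writes back the value that is already present, a successful run changes nothing, while a rejected one makes kubectl print the apiserver's error message instead of only the status code.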

Validation

  • YAML parses (yq) ✅

🤖 Generated with Claude Code

…lure

The 2026-06-10 diagnostics (#1986) narrowed the repo-wide system-test
failure to OpenBao's Kubernetes service registration: the initial label
write succeeds, but every state UPDATE patch is rejected by the
apiserver with HTTP 422 forever (150+ consecutive retries observed),
freezing the pod at openbao-active=false so the active Service never
gains endpoints and the whole vault seeding chain times out. OpenBao
logs only the status code, not the response body, so WHAT is invalid
remains unknown.

Replay the same RFC 6902 replace (single label and the full openbao-*
batch, using current values so the probe is a semantic no-op) with
kubectl, which prints the apiserver's full error message, and add a
server-side dry-run re-apply of the unmodified pod to surface latent
object-validation failures.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

🤖 Generated by the Daily AI Assistant

🎯 The probe delivered. The 422 is not caused by the label patch: the apiserver diff shows the add-security-context Kyverno policy injecting capabilities.drop/runAsNonRoot/readOnlyRootFilesystem/seccompProfile into the immutable pod spec on UPDATE, because the policy has no operations: scope. Any pod created before the webhook was active becomes permanently un-updatable, which bricked OpenBao's service registration in every fresh CI cluster. Root fix: #1999 (operations: [CREATE]). This PR's probe + state dump are worth keeping as permanent diagnostics.

devantler added a commit that referenced this pull request Jun 10, 2026
The add-security-context ClusterPolicy matched Pods without an
operations scope, so Kyverno also applied the securityContext mutation
on every pod UPDATE. Pod spec is immutable: for any pod created while
the policy/webhook was not yet active (exactly what fresh-cluster
bring-up ordering produces), every later update gets the mutation
bolted on and the apiserver rejects the whole request with HTTP 422
'pod updates may not change fields other than image...'.

That bricked OpenBao's Kubernetes service registration in CI: the
label-state updates (openbao-active/sealed/initialized) 422'd forever,
the openbao-active Service never gained endpoints, the entire vault
seeding chain timed out, and every system test since the 2026-06-10
active-service cutover failed. Probe evidence in #1990's run: the 422
diff shows the webhook's own securityContext injection, not the label
patch. Prod was unaffected only because its pods happened to be
recreated while the policy was live (mutation already in the spec ->
no-op on update).

Scope both rules to operations: [CREATE]. Pods created before the
policy stay unmutated until their next natural recreation, which is
strictly better than being permanently un-updatable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
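
The mechanism described in that commit message can be reproduced with a small toy model. This is only a sketch under stated assumptions: it is not Kyverno or apiserver code, the injected values are guesses (only the field names come from the comment above), and it treats the whole pod spec as immutable, whereas the real apiserver allows a few fields such as the image to change.

```python
import copy

# Field names from the PR comment; the values are assumptions for illustration.
INJECTED = {
    "runAsNonRoot": True,
    "readOnlyRootFilesystem": True,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}


def mutate(pod: dict, operation: str, policy_operations) -> dict:
    """Toy mutating webhook; policy_operations=None models a rule with no operations scope."""
    if policy_operations is not None and operation not in policy_operations:
        return pod
    mutated = copy.deepcopy(pod)
    for container in mutated["spec"]["containers"]:
        security_context = container.setdefault("securityContext", {})
        for key, value in INJECTED.items():
            security_context.setdefault(key, copy.deepcopy(value))
    return mutated


def admit_update(stored: dict, requested: dict, policy_operations) -> int:
    """Return 200 if the spec is unchanged after mutation, else 422 (simplified immutability check)."""
    final = mutate(requested, "UPDATE", policy_operations)
    return 200 if final["spec"] == stored["spec"] else 422


def with_label(pod: dict, key: str, value: str) -> dict:
    """Copy the pod and change a single label, as the service registration does."""
    updated = copy.deepcopy(pod)
    updated["metadata"]["labels"][key] = value
    return updated


if __name__ == "__main__":
    # CI case: pod created before the webhook was active, so its spec is unmutated.
    ci_pod = {
        "metadata": {"labels": {"openbao-active": "false"}},
        "spec": {"containers": [{"name": "openbao"}]},
    }
    request = with_label(ci_pod, "openbao-active", "true")
    print("unscoped policy:", admit_update(ci_pod, request, None))
    print("scoped to CREATE:", admit_update(ci_pod, request, ["CREATE"]))

    # Prod case: pod created while the policy was live, so the mutation is already stored.
    prod_pod = mutate(ci_pod, "CREATE", None)
    prod_request = with_label(prod_pod, "openbao-active", "true")
    print("prod, unscoped policy:", admit_update(prod_pod, prod_request, None))
```

In this model the unscoped rule turns a labels-only update into a spec change for any pod that predates the webhook, while the same update passes once the rule is limited to CREATE or when the stored spec already contains the injected fields.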
devantler added this pull request to the merge queue Jun 11, 2026
Any commits made after this event will not be merged.