fix(cluster-policies): port image verification to ImageValidatingPolicy so cosign v3 bundles verify #2038

Merged
devantler merged 2 commits into main from claude/reverent-noyce-aa5a1f
Jun 12, 2026


@devantler

Contributor

🤖 Generated by the Daily AI Assistant

Root cause — why every merge-queue run fails

The verify-image-signatures ClusterPolicy (#2020, merged 2026-06-11 17:04) can never pass for any matched image: both signing pipelines (ksail cd.yaml, reusable-workflows publish-app.yaml) run cosign v3, which attaches signatures in the new Sigstore bundle format as OCI referrers (sha256-<digest> fallback tags on GHCR, descriptor artifactType application/vnd.oci.empty.v1+json) and writes no legacy .sig tags. ClusterPolicy verifyImages reads neither variant — type: Cosign only looks for legacy tags, and type: SigstoreBundle filters referrer descriptors by a bundle artifactType cosign v3 doesn't set (both verified empirically, see below).
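To make the two tag schemes in the paragraph above concrete, here is a minimal sketch that derives both from an image digest. The function names are hypothetical and the digest is a placeholder. The formats assumed are the OCI referrers tag schema (`sha256-<hex>`) and cosign's legacy signature tag (`sha256-<hex>.sig`).

```python
def referrers_fallback_tag(digest: str) -> str:
    """Tag used by the OCI referrers tag schema when the referrers API is unavailable."""
    algorithm, _, hex_value = digest.partition(":")
    if not algorithm or not hex_value:
        raise ValueError(f"not a digest: {digest!r}")
    return f"{algorithm}-{hex_value}"


def legacy_cosign_sig_tag(digest: str) -> str:
    """Tag where cosign's legacy (pre-bundle) format stores signatures."""
    return referrers_fallback_tag(digest) + ".sig"


digest = "sha256:" + "ab" * 32  # placeholder digest, not a real image
print(referrers_fallback_tag(digest))  # what cosign v3 leaves on GHCR
print(legacy_cosign_sig_tag(digest))   # what type: Cosign looks for, never written by v3
```

The only difference between the two tags is the `.sig` suffix, which is why a verifier that looks only for the legacy tag finds nothing on an image signed by cosign v3.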

The policy merged green because it only bites at Pod CREATE. That evening node churn evicted the tenant-app pods, and from the 20:15 queue run onward every first-party pod admission is denied (mutate.kyverno.svc-fail), the wedding-app and ascoachingogvaner Deployments stall as Failed, the prod apps Kustomization health check times out after 20m, and every merge-queue Deploy-to-Prod run fails (5 consecutive failures; nothing has merged since #2025).

Fix

Replace the ClusterPolicy with two policies.kyverno.io/v1 ImageValidatingPolicies — the Kyverno engine path with cosign v3 bundle support (kyverno#14652) — preserving the design exactly: same two keyless identities, pods-only (no autogen), admission-only (no background), ghcr-auth credentials, no digest pinning, fail-open for third-party images. Scoping notes baked into comments:

  • the app policy's match excludes ksail images via a CEL ref. expression — matched images are verified before the validation expression runs, so a glob that swallowed ksail images would fail them against the wrong identity;
  • validation expressions cover containers + initContainers + ephemeralContainers;
  • no spec.webhookConfiguration: a per-policy timeout flips Kyverno to per-policy "finegrained" webhooks, which are broken for multiple IVPOLs in v1.18.1 (concatenated URL path → every admission fails "policy not evaluated").
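The scoping rules above can be modelled in a few lines. This is an illustrative sketch only, not Kyverno's implementation: the globs, the `policy_for` and `admit` helpers, and the `verified` set are all hypothetical stand-ins for the real match expressions and signature checks.

```python
from fnmatch import fnmatch
from typing import Optional

# Hypothetical image globs; the real policies' patterns are not shown in this PR text.
KSAIL_GLOB = "ghcr.io/example-org/ksail*"
APP_GLOB = "ghcr.io/example-org/*"


def policy_for(image: str) -> Optional[str]:
    """Return which policy claims an image; ksail is carved out of the app glob."""
    if fnmatch(image, KSAIL_GLOB):
        return "ksail"
    if fnmatch(image, APP_GLOB):
        return "app"
    return None  # third-party: no policy matches, so it is admitted unverified


def pod_images(pod_spec: dict) -> list:
    """Collect images from all three container lists the validations cover."""
    images = []
    for field in ("containers", "initContainers", "ephemeralContainers"):
        images += [c["image"] for c in pod_spec.get(field, [])]
    return images


def admit(pod_spec: dict, verified: set) -> bool:
    """Admit only if every matched image verifies against its own policy's identity."""
    for image in pod_images(pod_spec):
        policy = policy_for(image)
        if policy is not None and (policy, image) not in verified:
            return False
    return True
```

Checking the ksail glob first mirrors the exclusion in the app policy's match: without it, a ksail image would be claimed by the app policy and checked against the wrong identity.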

Comment-only updates in the kyverno netpol and the kyverno ghcr-auth ExternalSecret to point at the IVPOLs.

Validation

On a throwaway kind cluster running the exact prod chart (kyverno 3.8.1 / v1.18.1) with the shipped file:

| case | result |
| --- | --- |
| wedding-app:1.10.0 (signed, private) | ✅ admitted |
| ksail:v7.55.0 (signed) | ✅ admitted |
| wedding-app:sha-1c279c9 (pre-signing tag) | ❌ denied with policy message |
| unsigned first-party initContainer | ❌ denied |
| busybox:1.36 (third-party) | ✅ admitted unverified (fail-open) |

Also: cosign verify passes for all three deployed images against the exact policy identities; kubectl kustomize local+prod build; ksail workload validate green (304 files); prod-config validate fails only on the pre-existing datreeio coroot schema gap (datreeio/CRDs-catalog#896). The datreeio catalog already carries the imagevalidatingpolicy_v1.json schema.

Risk & contingency

Verification adds ~1–5s to first-party pod admission (registry + TUF round-trips; bounded by the shared 10s ivpol webhook timeout, fail-closed — same trade-off the ClusterPolicy design accepted). Prod evidence this fits: source-controller performs the identical sigstore bundle verification against GHCR on every OCIRepository reconcile. If prod admissions were ever to exceed the timeout persistently, the fallback lever is failurePolicy: Ignore on the IVPOLs.
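The timeout trade-off can be illustrated with a small model. This is a sketch under stated assumptions, not Kyverno or API-server code: `admission_outcome` is a hypothetical helper, and the 10s default reflects the shared webhook timeout mentioned above.

```python
def admission_outcome(verify_seconds: float, verified: bool,
                      timeout_seconds: float = 10.0,
                      failure_policy: str = "Fail") -> str:
    """Model how a webhook timeout interacts with failurePolicy (Fail or Ignore)."""
    if verify_seconds > timeout_seconds:
        # The webhook call did not answer in time, so failurePolicy decides.
        if failure_policy == "Ignore":
            return "admitted (unverified)"
        return "denied (timeout)"
    return "admitted" if verified else "denied (policy)"


print(admission_outcome(3.0, verified=True))
print(admission_outcome(12.0, verified=True))
print(admission_outcome(12.0, verified=True, failure_policy="Ignore"))
```

Switching to `Ignore` trades enforcement for availability: a slow registry no longer blocks pods, but a timed-out verification admits the image unchecked.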

Merging this PR heals prod via the queue's own deploy: Flux applies the IVPOLs, prunes the ClusterPolicy, the stalled ReplicaSets' next retry admits, and the apps health gate goes green.

Closes the merge-queue outage; once merged, re-queue #2037 #2036 #2035 #2033 #2023 #1996 #1991 (and let Renovate re-queue #2032).

🤖 Generated with Claude Code

…cy so cosign v3 bundles verify

The verify-image-signatures ClusterPolicy (#2020) can never pass: both
signing pipelines run cosign v3, which attaches signatures as Sigstore
bundle OCI referrers (no legacy .sig tags), a format ClusterPolicy
verifyImages cannot read with either type Cosign or type SigstoreBundle.
Every first-party Pod CREATE is denied at admission, wedding-app and
ascoachingogvaner Deployments stall, the prod apps Kustomization health
check times out, and every merge-queue Deploy-to-Prod run fails.

Replace the ClusterPolicy with two ImageValidatingPolicies (the engine
path with cosign v3 support), same two keyless identities, pods-only,
admission-only, fail-open for third-party images.

Verified on a throwaway kind cluster running the exact prod chart
(kyverno 3.8.1 / v1.18.1): signed app + ksail images admit, unsigned
first-party tags and initContainers are denied, third-party images
admit unverified.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler devantler marked this pull request as ready for review June 12, 2026 14:32
@devantler devantler enabled auto-merge June 12, 2026 14:44
@devantler devantler added this pull request to the merge queue Jun 12, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 12, 2026
@devantler devantler added this pull request to the merge queue Jun 12, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 12, 2026
…edirect

In prod the IVPOLs denied every first-party pod with "no signatures
found" / webhook deadline exceeded, while the identical policy verified
the same digests on an unrestricted-egress test cluster. GHCR answers
the OCI referrers API with a 303 redirect to
https://github.com/-/v2/packages/..., and github.com is not in the
FQDN-pinned kyverno egress (#2019) — the redirect-following lookup dies
on the dropped connection instead of falling back to the
sha256-<digest> tag scheme. Confirmed live: kyverno logs show
"Get https://ghcr.io/v2/: context canceled" and "no signatures found"
for digests that cosign verifies, with apid/auth/secret all healthy.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

Contributor Author

🤖 Generated by the Daily AI Assistant

Two more failure modes diagnosed and addressed after the first queue attempt:

  1. Run 27423267474 (14:49), transient: the Talos config apply to cp-1/cp-2 stalled (authentication handshake failed: context deadline exceeded). apid logs prove the runner's connections never arrived (cp-2 logged zero connections 08:52→14:57) while cp-3 + workers applied fine in the same run, which points to a per-flow Azure→Hetzner drop, amplified by ksail's single-shot, no-retry per-node apply. The remedy was a re-queue; a ksail-side retry improvement is tracked separately.

  2. Run 27423913017 (15:00), real and prod-only: the IVPOLs deployed and denied first-party pods with no signatures found / Get "https://ghcr.io/v2/": context canceled, although the same digests verify with cosign locally and were admitted on the kind test cluster. Root cause: GHCR answers the OCI referrers API with a 303 redirect to https://github.com/-/v2/packages/..., and github.com was not in the FQDN-pinned kyverno egress (#2019, fix(security): pin world:443 egress to explicit FQDN allow-lists). With the redirect target dropped by Cilium, go-containerregistry's referrers lookup dies instead of falling back to the sha256-<digest> tag scheme. The kind cluster passed because its egress was unrestricted (redirect → harmless 404 → tag fallback). Fixed in c6cd9b8 by allowing github.com:443 in the kyverno CNP.
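The difference between the kind cluster and prod in the second failure mode can be sketched as follows. This is a simplified model of the behaviour described above, not go-containerregistry's actual code; every name in it is hypothetical.

```python
class ConnectionDropped(Exception):
    """Stands in for a request that never completes, e.g. blocked egress."""


def find_signatures(referrers_api, tag_lookup):
    """Fall back to the sha256-<digest> tag scheme only on a 404 answer."""
    status, descriptors = referrers_api()  # may raise ConnectionDropped
    if status == 404:
        return tag_lookup()
    return descriptors


def unrestricted_egress():
    # Redirect is followed and the target answers with a harmless 404.
    return 404, []


def pinned_egress():
    # Redirect target is not in the allow-list, so the connection is dropped.
    raise ConnectionDropped("redirect target not in the egress allow-list")


def tag_scheme():
    # The bundle that cosign v3 stored under the fallback tag.
    return ["sigstore-bundle"]


def outcome(referrers_api) -> str:
    try:
        found = find_signatures(referrers_api, tag_scheme)
    except ConnectionDropped:
        return "no signatures found"  # the fallback is never reached
    return "verified" if found else "no signatures found"


print(outcome(unrestricted_egress))  # kind cluster
print(outcome(pinned_egress))        # prod before c6cd9b8
```

In this model the fallback is triggered by an HTTP answer rather than by a transport error, which is why a dropped connection surfaces as a missing signature instead of a retry through the tag scheme.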

@devantler devantler added this pull request to the merge queue Jun 12, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 12, 2026
@devantler devantler added this pull request to the merge queue Jun 12, 2026
Merged via the queue into main with commit 4b24a27 Jun 12, 2026
12 of 14 checks passed
@devantler devantler deleted the claude/reverent-noyce-aa5a1f branch June 12, 2026 16:30
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 12, 2026
@botantler

botantler Bot commented Jun 12, 2026

Contributor

🎉 This PR is included in version 1.52.4 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler botantler Bot added the released label Jun 12, 2026