fix(cluster-policies): port image verification to ImageValidatingPolicy so cosign v3 bundles verify#2038
Conversation
…cy so cosign v3 bundles verify The verify-image-signatures ClusterPolicy (#2020) can never pass: both signing pipelines run cosign v3, which attaches signatures as Sigstore bundle OCI referrers (no legacy .sig tags), a format ClusterPolicy verifyImages cannot read with either type Cosign or type SigstoreBundle. Every first-party Pod CREATE is denied at admission, wedding-app and ascoachingogvaner Deployments stall, the prod apps Kustomization health check times out, and every merge-queue Deploy-to-Prod run fails. Replace the ClusterPolicy with two ImageValidatingPolicies (the engine path with cosign v3 support), same two keyless identities, pods-only, admission-only, fail-open for third-party images. Verified on a throwaway kind cluster running the exact prod chart (kyverno 3.8.1 / v1.18.1): signed app + ksail images admit, unsigned first-party tags and initContainers are denied, third-party images admit unverified. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…edirect In prod the IVPOLs denied every first-party pod with "no signatures found" / webhook deadline exceeded, while the identical policy verified the same digests on an unrestricted-egress test cluster. GHCR answers the OCI referrers API with a 303 redirect to https://github.com/-/v2/packages/..., and github.com is not in the FQDN-pinned kyverno egress (#2019) — the redirect-following lookup dies on the dropped connection instead of falling back to the sha256-<digest> tag scheme. Confirmed live: kyverno logs show "Get https://ghcr.io/v2/: context canceled" and "no signatures found" for digests that cosign verifies, with apid/auth/secret all healthy. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two more failure modes diagnosed and addressed after the first queue attempt:
|
|
🎉 This PR is included in version 1.52.4 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
Root cause — why every merge-queue run fails
The
verify-image-signaturesClusterPolicy (#2020, merged 2026-06-11 17:04) can never pass for any matched image: both signing pipelines (ksailcd.yaml, reusable-workflowspublish-app.yaml) run cosign v3, which attaches signatures in the new Sigstore bundle format as OCI referrers (sha256-<digest>fallback tags on GHCR, descriptor artifactTypeapplication/vnd.oci.empty.v1+json) and writes no legacy.sigtags. ClusterPolicyverifyImagesreads neither variant —type: Cosignonly looks for legacy tags, andtype: SigstoreBundlefilters referrer descriptors by a bundle artifactType cosign v3 doesn't set (both verified empirically, see below).The policy merged green because it only bites at Pod CREATE. That evening node churn evicted the tenant-app pods, and from the 20:15 queue run onward every first-party pod admission is denied (
mutate.kyverno.svc-fail),wedding-app/ascoachingogvanerDeployments stallFailed, the prodappsKustomization health check times out after 20m, and every merge-queue Deploy-to-Prod run fails (5 consecutive failures, nothing has merged since #2025).Fix
Replace the ClusterPolicy with two
policies.kyverno.io/v1ImageValidatingPolicies — the Kyverno engine path with cosign v3 bundle support (kyverno#14652) — preserving the design exactly: same two keyless identities, pods-only (no autogen), admission-only (no background),ghcr-authcredentials, no digest pinning, fail-open for third-party images. Scoping notes baked into comments:ref.expression — matched images are verified before the validation expression runs, so a glob that swallowed ksail images would fail them against the wrong identity;containers + initContainers + ephemeralContainers;spec.webhookConfiguration: a per-policy timeout flips Kyverno to per-policy "finegrained" webhooks, which are broken for multiple IVPOLs in v1.18.1 (concatenated URL path → every admission fails "policy not evaluated").Comment-only updates in the kyverno netpol and the kyverno
ghcr-authExternalSecret to point at the IVPOLs.Validation
On a throwaway kind cluster running the exact prod chart (kyverno 3.8.1 / v1.18.1) with the shipped file:
wedding-app:1.10.0(signed, private)ksail:v7.55.0(signed)wedding-app:sha-1c279c9(pre-signing tag)busybox:1.36(third-party)Also:
cosign verifypasses for all three deployed images against the exact policy identities;kubectl kustomizelocal+prod build;ksail workload validategreen (304 files); prod-config validate fails only on the pre-existing datreeio coroot schema gap (datreeio/CRDs-catalog#896). The datreeio catalog already carries theimagevalidatingpolicy_v1.jsonschema.Risk & contingency
Verification adds ~1–5s to first-party pod admission (registry + TUF round-trips; bounded by the shared 10s ivpol webhook timeout, fail-closed — same trade-off the ClusterPolicy design accepted). Prod evidence this fits: source-controller performs the identical sigstore bundle verification against GHCR on every OCIRepository reconcile. If prod admissions were ever to exceed the timeout persistently, the fallback lever is
failurePolicy: Ignoreon the IVPOLs.Merging this PR heals prod via the queue's own deploy: Flux applies the IVPOLs, prunes the ClusterPolicy, the stalled ReplicaSets' next retry admits, and the
appshealth gate goes green.Closes the merge-queue outage; once merged, re-queue #2037 #2036 #2035 #2033 #2023 #1996 #1991 (and let Renovate re-queue #2032).
🤖 Generated with Claude Code