🤖 Generated by the Daily AI Assistant
What happened today (2026-06-11)
PR #1991 entered the merge queue. ci.yaml's deploy-prod job (which runs on merge_group, i.e. on the speculative merge ref, before the PR actually merges) pushed the OCI artifact and reconciled prod at ~16:08. The merge group then failed → the PR was ejected, but prod kept reconciling the unmerged artifact: the kyverno HelmRelease picked up #1991's then-broken PDB values, the upgrade failed every interval (Cannot set both .minAvailable and .maxUnavailable), and the whole infrastructure-controllers → infrastructure → apps chain sat blocked for hours. Nothing on main reflected the state prod was running.
The general flaw
deploy-prod on merge_group is a deliberate and useful gate (a deploy that fails prevents the merge). The gap is the failure path: any merge-group failure after workload push leaves latest pointing at code that never landed on main, until some future merge overwrites it. Prod state silently diverges from Git — the core thing GitOps is supposed to prevent.
Options (maintainer decision)
- Heal on failure (keeps the gate): add an if: failure() job/step to the merge-group run that checks out main again and re-runs ksail --config ksail.prod.yaml workload push + reconcile, so an ejected PR's artifact is immediately replaced by main's.
- Tag/push artifacts by SHA + promote on merge: push the merge-group artifact under a non-latest tag, and only re-tag to latest after the merge actually completes (needs a small post-merge workflow).
- Accept the risk (status quo): the next successful merge self-heals; document the failure mode in AGENTS.md / the DR runbook so the on-call knows that a red merge queue can mean prod is running unmerged code.
Option 1 is the smallest change that closes the gap without giving up the pre-merge deploy gate. Happy to PR whichever direction is preferred.
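For discussion, a rough sketch of what Option 1 could look like as an extra job in ci.yaml. This is an assumption-heavy outline, not the repo's actual config: the job name heal-prod, the runner label, the needs list, and the exact ksail reconcile subcommand are all hypothetical and would need to match how deploy-prod is really wired. Installing ksail and supplying registry/cluster credentials are omitted and would presumably mirror deploy-prod.

```yaml
# Hypothetical job appended under `jobs:` in ci.yaml (names are illustrative).
heal-prod:
  # failure() is true when any job listed in `needs` (or one of its ancestors)
  # failed. Restrict to merge-queue runs so ordinary PR runs never touch prod.
  if: ${{ failure() && github.event_name == 'merge_group' }}
  # Assumption: list deploy-prod plus every job that can still fail after
  # `workload push`, otherwise a later failure will not trigger the heal.
  needs: [deploy-prod]
  runs-on: ubuntu-latest
  steps:
    # Check out main instead of the speculative merge ref.
    - uses: actions/checkout@v4
      with:
        ref: main

    # Assumption: ksail is installed and authenticated as in deploy-prod.
    - name: Re-push main's artifact over the ejected one
      run: ksail --config ksail.prod.yaml workload push

    # Assumption: reconcile is exposed as `workload reconcile`.
    - name: Reconcile prod back to main
      run: ksail --config ksail.prod.yaml workload reconcile
```

One caveat with this shape: failure() only sees jobs in the same workflow run. If the merge group is ejected because a required check in a different workflow fails, this job would not fire, so that case would still need Option 2 or a documented manual step.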