Merge-queue deploys leave prod running unmerged code when the merge group fails #2026

@devantler

🤖 Generated by the Daily AI Assistant

What happened today (2026-06-11)

PR #1991 entered the merge queue. ci.yaml's deploy-prod job runs on merge_group, i.e. against the speculative merge ref, before the PR actually merges. It pushed the OCI artifact and reconciled prod at ~16:08. The merge group then failed and the PR was ejected, but prod kept reconciling the unmerged artifact: the kyverno HelmRelease picked up #1991's then-broken PDB values, the upgrade failed on every reconcile interval (Cannot set both .minAvailable and .maxUnavailable), and the whole infrastructure-controllers → infrastructure → apps chain sat blocked for hours. Nothing on main reflected the state prod was running.

The general flaw

deploy-prod on merge_group is a deliberate and useful gate (a deploy that fails prevents the merge). The gap is the failure path: any merge-group failure after workload push leaves the latest tag pointing at code that never landed on main, until some future merge overwrites it. Prod state silently diverges from Git, which is the core thing GitOps is supposed to prevent.

Options (maintainer decision)

  1. Heal on failure (keeps the gate): add an if: failure() job/step to the merge-group run that checks out main again and re-runs ksail --config ksail.prod.yaml workload push + reconcile, so an ejected PR's artifact is immediately replaced by main's.
  2. Tag/push artifacts by SHA + promote on merge: push the merge-group artifact under a non-latest tag, and only re-tag to latest after the merge actually completes (needs a small post-merge workflow).
  3. Accept the risk (status quo): the next successful merge self-heals; document the failure mode in AGENTS.md / the DR runbook so the on-call knows that a red merge queue can mean prod is running unmerged code.

Option 1 is the smallest change that closes the gap without giving up the pre-merge deploy gate. Happy to PR whichever direction is preferred.
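
For discussion, a minimal sketch of what option 1 could look like in ci.yaml. Only deploy-prod, the merge_group trigger, and the ksail push command come from this issue. The heal-prod job name, the runner, the needs list, and the exact form of the reconcile command are assumptions, and the sketch presumes ksail is installed and authenticated the same way deploy-prod already does it.

```yaml
jobs:
  # ... existing jobs, including deploy-prod, stay unchanged ...

  heal-prod:  # hypothetical job name
    # failure() is true when any job listed in needs has failed.
    if: ${{ github.event_name == 'merge_group' && failure() }}
    # Assumption: list every job whose failure ejects the PR from the queue.
    needs: [deploy-prod]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: main  # deploy what actually landed, not the speculative merge ref
      # Assumption: ksail setup and registry/cluster credentials are
      # configured here exactly as in deploy-prod (omitted from this sketch).
      - run: ksail --config ksail.prod.yaml workload push
      # Assumption: reconcile is invoked the same way as push.
      - run: ksail --config ksail.prod.yaml workload reconcile
```

If deploy-prod itself fails before pushing anything, this job would simply re-push main's artifact, which should be harmless. It may also be worth putting heal-prod and deploy-prod in a shared concurrency group so a heal cannot race a later merge group's deploy.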
