Skip to content

Decide correct behaviour after a failed promotion (primary unhealthy, canary healthy) #1932

Description

@pedrampdd

Background

#1898 reported that when a primary pod fails to initialize after a canary
promotion, Flagger routes all traffic to the broken primary and scales the
healthy canary to zero, causing an outage.

#1931 fixes the immediate safety problem: when the primary is not ready and the
canary is in the Promoting/Finalising phase, Flagger now halts the
promotion, keeps the canary running, and leaves routing untouched instead of
performing the destructive rollback. The rollout is marked Failed and alerts.

That is a minimal, non-destructive stop-gap. This issue is to decide the
correct long-term behaviour, which is not obvious.

Key consideration

A promotion only starts after the canary analysis succeeds. That means the
canary pods running the new revision are healthy — it is the primary's
copy that fails to come up. The primary's pod template is not identical to the
canary's: getPrimaryDeploymentTemplateSpec rewrites config/secret refs and
affinity/topology labels to their -primary variants, the primary may use a
different update strategy, land on different nodes, etc. The two real cases in
#1898 (image pull slower than the timeout; a sidecar failing only on the
primary) are both "new revision is fine, primary's copy hiccupped".

So "revert the primary to the last-known-good spec" would roll back a
revision that just passed canary analysis and is running fine on the canary
.
That may be the wrong trade in exactly the cases that triggered #1898.

State left after #1931 (what's still imperfect)

  • The primary deployment still has the new (failing) pod spec applied;
    Promote() overwrites the primary in place and Flagger keeps no copy of the
    previous spec (only a hash in status.lastPromotedSpec).
  • The canary stays up and is marked Failed, rather than the normal invariant
    of "healthy primary serving, canary scaled to zero".
  • In RollingUpdate mode the orphaned failing replica on the primary is not
    cleaned up. Note this can't be removed without reverting the primary template
    (the Deployment still targets the new spec), so cleanup and revert are coupled.

Options under consideration

A. Self-heal + keep healthy canary (extend #1931). Keep the healthy canary
serving and let the primary keep retrying so transient failures (slow pull)
recover and promotion eventually finishes; alert on a stalled promotion.
Pro: serves the analysis-passing revision; transient failures self-heal;
small, low-risk. Con: doesn't clean up the orphan replica; can sit off the
model invariant indefinitely (relies on alerting).

B. Revert the primary to last-known-good. Snapshot the previous primary
template before Promote; on failure revert primary, wait healthy, route back,
scale canary to zero, mark Failed. Pro: restores the model invariant and
removes the orphan. Con: discards an analysis-passing revision; no self-heal;
largest change (must persist the full previous template — annotation vs new
status field; multi-reconcile orchestration that must not re-enter the success
path and loop; also DaemonSet + promoted ConfigMaps/Secrets; in Recreate the
revert itself has a gap while the old pods come back).

C. Make revert opt-in. Default to A; add a Canary spec field to opt into B.
Pro: per-workload choice, safe default. Con: most surface area (all of B
plus a CRD field, validation, docs, two tested paths).

Implementation notes (if B/C is chosen)

  • Promote() (pkg/canary/deployment_controller.go) does an in-place Update
    of the primary; reverting needs the previous primary template stored
    somewhere (annotation on the primary, or a new status field — Flagger keeps
    only a hash today). The Deployment's ReplicaSet revision history (rollout-undo
    style) is an alternative but is awkward since Flagger manages the spec
    declaratively.
  • Equivalent path needed for daemonset_controller.go, promoted
    ConfigMaps/Secrets, the Finalising phase, and skipAnalysis promotions.

Would appreciate maintainer steer on A vs B vs C before implementation.

Related: #1898, #1931

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions