Decide correct behaviour after a failed promotion (primary unhealthy, canary healthy)

### Background

#1898 reported that when a primary pod fails to initialize *after* a canary
promotion, Flagger routes all traffic to the broken primary and scales the
healthy canary to zero, causing an outage.

#1931 fixes the immediate safety problem: when the primary is not ready and the
canary is in the `Promoting`/`Finalising` phase, Flagger now halts the
promotion, keeps the canary running, and leaves routing untouched instead of
performing the destructive rollback. The rollout is marked `Failed` and alerts.

That is a minimal, non-destructive stop-gap. This issue is to decide the
*correct* long-term behaviour, which is not obvious.

### Key consideration

A promotion only starts **after** the canary analysis succeeds. That means the
canary pods running the new revision are healthy — it is the **primary's**
copy that fails to come up. The primary's pod template is not identical to the
canary's: `getPrimaryDeploymentTemplateSpec` rewrites config/secret refs and
affinity/topology labels to their `-primary` variants, the primary may use a
different update strategy, land on different nodes, etc. The two real cases in
#1898 (image pull slower than the timeout; a sidecar failing only on the
primary) are both "new revision is fine, primary's copy hiccupped".

So "revert the primary to the last-known-good spec" would **roll back a
revision that just passed canary analysis and is running fine on the canary**.
That may be the wrong trade in exactly the cases that triggered #1898.

### State left after #1931 (what's still imperfect)

- The primary deployment still has the new (failing) pod spec applied;
  `Promote()` overwrites the primary in place and Flagger keeps no copy of the
  previous spec (only a hash in `status.lastPromotedSpec`).
- The canary stays up and is marked `Failed`, rather than the normal invariant
  of "healthy primary serving, canary scaled to zero".
- In `RollingUpdate` mode the orphaned failing replica on the primary is not
  cleaned up. Note this can't be removed without reverting the primary template
  (the Deployment still targets the new spec), so cleanup and revert are coupled.

### Options under consideration

**A. Self-heal + keep healthy canary (extend #1931).** Keep the healthy canary
serving and let the primary keep retrying so transient failures (slow pull)
recover and promotion eventually finishes; alert on a stalled promotion.
*Pro:* serves the analysis-passing revision; transient failures self-heal;
small, low-risk. *Con:* doesn't clean up the orphan replica; can sit off the
model invariant indefinitely (relies on alerting).

**B. Revert the primary to last-known-good.** Snapshot the previous primary
template before `Promote`; on failure revert primary, wait healthy, route back,
scale canary to zero, mark `Failed`. *Pro:* restores the model invariant and
removes the orphan. *Con:* discards an analysis-passing revision; no self-heal;
largest change (must persist the full previous template — annotation vs new
status field; multi-reconcile orchestration that must not re-enter the success
path and loop; also DaemonSet + promoted ConfigMaps/Secrets; in `Recreate` the
revert itself has a gap while the old pods come back).

**C. Make revert opt-in.** Default to A; add a Canary spec field to opt into B.
*Pro:* per-workload choice, safe default. *Con:* most surface area (all of B
plus a CRD field, validation, docs, two tested paths).

### Implementation notes (if B/C is chosen)

- `Promote()` (`pkg/canary/deployment_controller.go`) does an in-place `Update`
  of the primary; reverting needs the previous primary template stored
  somewhere (annotation on the primary, or a new status field — Flagger keeps
  only a hash today). The Deployment's ReplicaSet revision history (rollout-undo
  style) is an alternative but is awkward since Flagger manages the spec
  declaratively.
- Equivalent path needed for `daemonset_controller.go`, promoted
  ConfigMaps/Secrets, the `Finalising` phase, and `skipAnalysis` promotions.

Would appreciate maintainer steer on A vs B vs C before implementation.

Related: #1898, #1931


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decide correct behaviour after a failed promotion (primary unhealthy, canary healthy) #1932

Background

Key consideration

State left after #1931 (what's still imperfect)

Options under consideration

Implementation notes (if B/C is chosen)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Decide correct behaviour after a failed promotion (primary unhealthy, canary healthy) #1932

Description

Background

Key consideration

State left after #1931 (what's still imperfect)

Options under consideration

Implementation notes (if B/C is chosen)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions