Background
#1898 reported that when a primary pod fails to initialize after a canary
promotion, Flagger routes all traffic to the broken primary and scales the
healthy canary to zero, causing an outage.
#1931 fixes the immediate safety problem: when the primary is not ready and the
canary is in the Promoting/Finalising phase, Flagger now halts the
promotion, keeps the canary running, and leaves routing untouched instead of
performing the destructive rollback. The rollout is marked Failed and alerts.
That is a minimal, non-destructive stop-gap. This issue is to decide the
correct long-term behaviour, which is not obvious.
Key consideration
A promotion only starts after the canary analysis succeeds. That means the
canary pods running the new revision are healthy — it is the primary's
copy that fails to come up. The primary's pod template is not identical to the
canary's: getPrimaryDeploymentTemplateSpec rewrites config/secret refs and
affinity/topology labels to their -primary variants, the primary may use a
different update strategy, land on different nodes, etc. The two real cases in
#1898 (image pull slower than the timeout; a sidecar failing only on the
primary) are both "new revision is fine, primary's copy hiccupped".
So "revert the primary to the last-known-good spec" would roll back a
revision that just passed canary analysis and is running fine on the canary.
That may be the wrong trade in exactly the cases that triggered #1898.
State left after #1931 (what's still imperfect)
- The primary deployment still has the new (failing) pod spec applied;
Promote() overwrites the primary in place and Flagger keeps no copy of the
previous spec (only a hash in status.lastPromotedSpec).
- The canary stays up and is marked
Failed, rather than the normal invariant
of "healthy primary serving, canary scaled to zero".
- In
RollingUpdate mode the orphaned failing replica on the primary is not
cleaned up. Note this can't be removed without reverting the primary template
(the Deployment still targets the new spec), so cleanup and revert are coupled.
Options under consideration
A. Self-heal + keep healthy canary (extend #1931). Keep the healthy canary
serving and let the primary keep retrying so transient failures (slow pull)
recover and promotion eventually finishes; alert on a stalled promotion.
Pro: serves the analysis-passing revision; transient failures self-heal;
small, low-risk. Con: doesn't clean up the orphan replica; can sit off the
model invariant indefinitely (relies on alerting).
B. Revert the primary to last-known-good. Snapshot the previous primary
template before Promote; on failure revert primary, wait healthy, route back,
scale canary to zero, mark Failed. Pro: restores the model invariant and
removes the orphan. Con: discards an analysis-passing revision; no self-heal;
largest change (must persist the full previous template — annotation vs new
status field; multi-reconcile orchestration that must not re-enter the success
path and loop; also DaemonSet + promoted ConfigMaps/Secrets; in Recreate the
revert itself has a gap while the old pods come back).
C. Make revert opt-in. Default to A; add a Canary spec field to opt into B.
Pro: per-workload choice, safe default. Con: most surface area (all of B
plus a CRD field, validation, docs, two tested paths).
Implementation notes (if B/C is chosen)
Promote() (pkg/canary/deployment_controller.go) does an in-place Update
of the primary; reverting needs the previous primary template stored
somewhere (annotation on the primary, or a new status field — Flagger keeps
only a hash today). The Deployment's ReplicaSet revision history (rollout-undo
style) is an alternative but is awkward since Flagger manages the spec
declaratively.
- Equivalent path needed for
daemonset_controller.go, promoted
ConfigMaps/Secrets, the Finalising phase, and skipAnalysis promotions.
Would appreciate maintainer steer on A vs B vs C before implementation.
Related: #1898, #1931
Background
#1898 reported that when a primary pod fails to initialize after a canary
promotion, Flagger routes all traffic to the broken primary and scales the
healthy canary to zero, causing an outage.
#1931 fixes the immediate safety problem: when the primary is not ready and the
canary is in the
Promoting/Finalisingphase, Flagger now halts thepromotion, keeps the canary running, and leaves routing untouched instead of
performing the destructive rollback. The rollout is marked
Failedand alerts.That is a minimal, non-destructive stop-gap. This issue is to decide the
correct long-term behaviour, which is not obvious.
Key consideration
A promotion only starts after the canary analysis succeeds. That means the
canary pods running the new revision are healthy — it is the primary's
copy that fails to come up. The primary's pod template is not identical to the
canary's:
getPrimaryDeploymentTemplateSpecrewrites config/secret refs andaffinity/topology labels to their
-primaryvariants, the primary may use adifferent update strategy, land on different nodes, etc. The two real cases in
#1898 (image pull slower than the timeout; a sidecar failing only on the
primary) are both "new revision is fine, primary's copy hiccupped".
So "revert the primary to the last-known-good spec" would roll back a
revision that just passed canary analysis and is running fine on the canary.
That may be the wrong trade in exactly the cases that triggered #1898.
State left after #1931 (what's still imperfect)
Promote()overwrites the primary in place and Flagger keeps no copy of theprevious spec (only a hash in
status.lastPromotedSpec).Failed, rather than the normal invariantof "healthy primary serving, canary scaled to zero".
RollingUpdatemode the orphaned failing replica on the primary is notcleaned up. Note this can't be removed without reverting the primary template
(the Deployment still targets the new spec), so cleanup and revert are coupled.
Options under consideration
A. Self-heal + keep healthy canary (extend #1931). Keep the healthy canary
serving and let the primary keep retrying so transient failures (slow pull)
recover and promotion eventually finishes; alert on a stalled promotion.
Pro: serves the analysis-passing revision; transient failures self-heal;
small, low-risk. Con: doesn't clean up the orphan replica; can sit off the
model invariant indefinitely (relies on alerting).
B. Revert the primary to last-known-good. Snapshot the previous primary
template before
Promote; on failure revert primary, wait healthy, route back,scale canary to zero, mark
Failed. Pro: restores the model invariant andremoves the orphan. Con: discards an analysis-passing revision; no self-heal;
largest change (must persist the full previous template — annotation vs new
status field; multi-reconcile orchestration that must not re-enter the success
path and loop; also DaemonSet + promoted ConfigMaps/Secrets; in
Recreatetherevert itself has a gap while the old pods come back).
C. Make revert opt-in. Default to A; add a Canary spec field to opt into B.
Pro: per-workload choice, safe default. Con: most surface area (all of B
plus a CRD field, validation, docs, two tested paths).
Implementation notes (if B/C is chosen)
Promote()(pkg/canary/deployment_controller.go) does an in-placeUpdateof the primary; reverting needs the previous primary template stored
somewhere (annotation on the primary, or a new status field — Flagger keeps
only a hash today). The Deployment's ReplicaSet revision history (rollout-undo
style) is an alternative but is awkward since Flagger manages the spec
declaratively.
daemonset_controller.go, promotedConfigMaps/Secrets, the
Finalisingphase, andskipAnalysispromotions.Would appreciate maintainer steer on A vs B vs C before implementation.
Related: #1898, #1931