Fix rolling update deadlock when pods are stuck in non-running state #3051
annielzy wants to merge 5 commits into zalando:master from
Conversation
Thanks @annielzy for this nice contribution. We were also aware of this limitation for a cluster to repair itself. Could you test the following scenario? Pick a Spilo docker image that does not exist. The first pod to be replaced should fail to come back because of the "broken" image. What would happen on the next SYNC cycle? I think the operator would then continue with the next pod until the cascading failure is complete. However, with your code a revert of the image would then work because the pods are getting recreated, which is a big plus.
Problem
The postgres-operator can enter a permanent deadlock state where pods that need
recreation are never recreated because the Patroni API safety checks fail on
non-running pods.
This occurs when:

- The operator's secret reference changes (e.g. from `postgres-operator-<VERSION A>-secret` to `postgres-operator-<VERSION B>-secret` after an operator upgrade)
- Pods carry the `zalando-postgres-operator-rolling-update-required` annotation, so the operator must call `recreatePods` to complete the rolling update
- The pods are stuck in `CreateContainerConfigError` because the old secret no longer exists
- `syncPatroniConfig` and `restartInstances` fail because they cannot reach the Patroni API on the non-running pod
- This sets `isSafeToRecreatePods = false`, which blocks `recreatePods`

This affects both single-node and multi-node clusters. In a 3-node cluster, one broken pod blocks the rolling update of all pods, including the healthy ones.
Root Cause
`syncStatefulSet` unconditionally sets `isSafeToRecreatePods = false` when `syncPatroniConfig` or `restartInstances` returns any error. It does not distinguish between:

- errors from running pods, where an unreachable Patroni API signals a genuine problem, and
- errors from non-running pods, where Patroni API failures are expected and recreating the pods is precisely the fix
Fix
pod.go

- Add a `podIsNotRunning()` helper that checks if a pod is stuck in a non-running state (e.g. `CreateContainerConfigError`, `CrashLoopBackOff`, `ImagePullBackOff`, terminated containers, non-Running phase)
- Add an `allPodsRunning()` helper, used by `syncStatefulSet` to determine whether Patroni API errors are expected
- In `recreatePods`, skip the switchover attempt when the master pod is not running, since the Patroni API is unreachable and Patroni has likely already triggered automatic failover
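A minimal sketch of how these two helpers might look. It uses simplified stand-in types rather than the real `corev1.Pod`, and the field names below are assumptions for illustration; the actual pod.go code inspects `corev1.PodStatus` and `corev1.ContainerStatus`:

```go
package main

// ContainerStatus is a simplified stand-in for corev1.ContainerStatus.
type ContainerStatus struct {
	WaitingReason string // e.g. "CrashLoopBackOff"; empty when not waiting
	Terminated    bool   // container has terminated
}

// Pod is a simplified stand-in for corev1.Pod.
type Pod struct {
	Phase             string // "Running", "Pending", "Failed", ...
	ContainerStatuses []ContainerStatus
}

// podIsNotRunning reports whether the pod is stuck in a non-running state:
// a non-Running phase, a terminated container, or a container waiting on a
// known stuck reason.
func podIsNotRunning(p Pod) bool {
	if p.Phase != "Running" {
		return true
	}
	for _, cs := range p.ContainerStatuses {
		if cs.Terminated {
			return true
		}
		switch cs.WaitingReason {
		case "CreateContainerConfigError", "CrashLoopBackOff", "ImagePullBackOff":
			return true
		}
	}
	return false
}

// allPodsRunning is true only when no pod in the list is stuck.
func allPodsRunning(pods []Pod) bool {
	for _, p := range pods {
		if podIsNotRunning(p) {
			return false
		}
	}
	return true
}
```

With this shape, one pod in `CreateContainerConfigError` makes `allPodsRunning` false for the whole list, which is exactly the signal `syncStatefulSet` needs.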
sync.go

- When `syncPatroniConfig` or `restartInstances` fails, only set `isSafeToRecreatePods = false` if all pods are actually running. If some pods are not running, the Patroni API errors are expected and should not block the pod recreation that would fix them.
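The changed decision in sync.go can be distilled as follows. `decideSafety` is a hypothetical name for illustration, not a function in the operator, and `pod` here is a toy stand-in for the real pod status checks:

```go
package main

import "errors"

// pod is a minimal stand-in; the real code inspects corev1.Pod status.
type pod struct{ running bool }

func allPodsRunning(pods []pod) bool {
	for _, p := range pods {
		if !p.running {
			return false
		}
	}
	return true
}

// decideSafety distills the new rule: a failure from syncPatroniConfig or
// restartInstances only marks recreation unsafe when every pod is running,
// i.e. when the Patroni API should have been reachable.
func decideSafety(patroniErr error, pods []pod) (isSafeToRecreatePods bool) {
	if patroniErr != nil && allPodsRunning(pods) {
		// Unexpected Patroni failure on healthy pods: block recreation.
		return false
	}
	// No error, or errors are expected because some pods are broken.
	return true
}

var errPatroni = errors.New("could not reach Patroni API")
```

The previous behavior was effectively `if patroniErr != nil { return false }`, which is what turned one broken pod into a cluster-wide deadlock.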
Behavior change summary
When some pods are not running, Patroni API errors no longer block pod recreation; for clusters where all pods are running, the safety check behaves as before.
Testing
- `TestPodIsNotRunning` — verifies detection of various non-running states (`CreateContainerConfigError`, `CrashLoopBackOff`, `ImagePullBackOff`, terminated containers, mixed container states, pending/failed phase)
- `TestAllPodsRunning` — verifies correct behavior with all-running, mixed, all-broken, and empty pod lists
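Tests like these are typically table-driven in Go. A compressed sketch of the pattern, using a toy `podIsNotRunning` over simplified fixtures (the real tests presumably build `corev1.Pod` objects and run under the `testing` package):

```go
package main

// Pod is a toy fixture type for the sketch.
type Pod struct {
	Phase         string
	WaitingReason string // container waiting reason, "" if none
}

// Toy version of the helper under test, for illustration only.
func podIsNotRunning(p Pod) bool {
	return p.Phase != "Running" || p.WaitingReason != ""
}

// Table-driven cases in the spirit of TestPodIsNotRunning.
var cases = []struct {
	name string
	pod  Pod
	want bool
}{
	{"healthy", Pod{Phase: "Running"}, false},
	{"config error", Pod{Phase: "Running", WaitingReason: "CreateContainerConfigError"}, true},
	{"image pull backoff", Pod{Phase: "Running", WaitingReason: "ImagePullBackOff"}, true},
	{"pending phase", Pod{Phase: "Pending"}, true},
	{"failed phase", Pod{Phase: "Failed"}, true},
}

// runCases returns how many table entries disagree with the helper.
func runCases() (failures int) {
	for _, tc := range cases {
		if podIsNotRunning(tc.pod) != tc.want {
			failures++
		}
	}
	return
}
```

The table makes it cheap to add the remaining states the PR mentions (terminated containers, mixed container states) as extra rows.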