Skip dockercfg secret wait when image-registry pods are unhealthy #30782
weinliu wants to merge 1 commit into openshift:main
On debug/development clusters, the service account token controller may be broken, which prevents dockercfg secrets from being created. Even though the ImageRegistry capability reports as enabled, the image-registry pods are not Running and Ready. This causes setupProject() to wait 3 minutes per SA for secrets that will never appear, then fail with a timeout.

Add a pod health check after the ImageRegistry capability check: if image-registry pods exist but none are Running and Ready, skip the dockercfg secret and role binding wait entirely. This fixes the 3-minute timeout seen in Prow CI debug builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
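The skip rule described in the commit message can be sketched in isolation. This is a minimal illustration, not the code in this PR: `Pod`, `anyRunningAndReady`, and `shouldSkipDockercfgWait` are hypothetical names, and `Pod` is a simplified stand-in for the Kubernetes pod type so that the example runs with the standard library only.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod.
// It keeps only the two facts this sketch needs.
type Pod struct {
	Name  string
	Phase string // for example "Running" or "Pending"
	Ready bool   // stand-in for the pod's Ready condition being true
}

// anyRunningAndReady reports whether at least one pod is both Running and Ready.
func anyRunningAndReady(pods []Pod) bool {
	for _, p := range pods {
		if p.Phase == "Running" && p.Ready {
			return true
		}
	}
	return false
}

// shouldSkipDockercfgWait follows the rule from the commit message:
// skip only when image-registry pods exist but none are Running and Ready.
// With no pods at all, the sketch does not skip.
func shouldSkipDockercfgWait(pods []Pod) bool {
	return len(pods) > 0 && !anyRunningAndReady(pods)
}

func main() {
	broken := []Pod{{Name: "image-registry-a", Phase: "Pending", Ready: false}}
	healthy := []Pod{{Name: "image-registry-b", Phase: "Running", Ready: true}}

	fmt.Println("broken registry, skip wait:", shouldSkipDockercfgWait(broken))
	fmt.Println("healthy registry, skip wait:", shouldSkipDockercfgWait(healthy))
	fmt.Println("no registry pods, skip wait:", shouldSkipDockercfgWait(nil))
}
```

Keeping the decision in a pure function like this makes the three cases (unhealthy pods, healthy pods, no pods) easy to reason about separately from the cluster lookup.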
Summary
Fixes WINC-1578
On debug/development clusters (e.g., Prow CI `debug-winc-*` jobs), the service account token controller may be broken, which means dockercfg secrets are never created. Even though the ImageRegistry capability reports as enabled, the image-registry pods are not Running and Ready.

This causes `setupProject()` to wait 3 minutes per service account for dockercfg secrets that will never appear, then fail with a timeout.

All 4 `debug-winc-*` Prow CI jobs (AWS, Azure, GCP, vSphere) have been failing continuously because of this issue. Every test case times out at 3m5s.

Root Cause Analysis

1. `compat_otp.NewCLIWithoutNamespace("default")` calls origin's `exutil.NewCLI()`, which registers `setupProject()` in `BeforeEach`
2. `setupProject()` checks `IsCapabilityEnabled(ImageRegistry)` → returns `true` (capability is enabled)
3. It then calls `WaitForServiceAccountWithSecret()` for the "default" and "builder" SAs
4. `WaitForServiceAccountWithSecret()` polls for 3 minutes waiting for `-dockercfg-` in `sa.ImagePullSecrets`
5. `NewCLI`/`NewCLIWithoutNamespace` fails after 3m5s

The problem is that `setupProject()` only checks whether the ImageRegistry capability is enabled; it does not verify whether the image-registry pods are actually healthy.

Changes
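For reference, the wait that the new check bypasses can be sketched roughly as follows. This is an illustrative model under stated assumptions, not origin's `WaitForServiceAccountWithSecret()`: `hasDockercfgSecret`, `waitForDockercfg`, and the `fetch` callback are hypothetical, and the intervals are shortened so the example finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errTimedOut reuses the message seen in the failure logs.
var errTimedOut = errors.New("timed out waiting for the condition")

// hasDockercfgSecret reports whether any image pull secret name
// contains the "-dockercfg-" marker.
func hasDockercfgSecret(imagePullSecrets []string) bool {
	for _, name := range imagePullSecrets {
		if strings.Contains(name, "-dockercfg-") {
			return true
		}
	}
	return false
}

// waitForDockercfg polls fetch until a dockercfg secret appears or the
// timeout elapses. fetch is a hypothetical callback that returns the
// service account's current image pull secret names.
func waitForDockercfg(fetch func() []string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if hasDockercfgSecret(fetch()) {
			return nil
		}
		if time.Now().After(deadline) {
			return errTimedOut
		}
		time.Sleep(interval)
	}
}

func main() {
	// Broken token controller: the secret never shows up, so the wait
	// can only end by timing out.
	never := func() []string { return nil }
	fmt.Println("broken cluster:", waitForDockercfg(never, 10*time.Millisecond, 50*time.Millisecond))

	// Healthy cluster: the secret is already listed, so the wait returns at once.
	present := func() []string { return []string{"default-dockercfg-abc12"} }
	fmt.Println("healthy cluster:", waitForDockercfg(present, 10*time.Millisecond, 50*time.Millisecond))
}
```

On a cluster where the secret can never be created, a loop of this shape always consumes its full timeout, which is why checking registry health first avoids the delay.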
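Added a pod health check in `setupProject()` after the ImageRegistry capability check:

- Checks pods in the `openshift-image-registry` namespace with label `docker-registry=default`
- If pods exist but none are Running and Ready, skips the dockercfg secret and role binding wait

Impact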
Evidence
Prow CI failure logs from `openshift-tests-private` PR #29169:

- `timed out waiting for the condition (3m5s)` at `client.go:424`
- `pull-ci-openshift-openshift-tests-private-main-debug-winc-vsphere-ipi`

Note: Direct verification via `openshift-tests-private` Prow CI is not possible because the `debug-winc-*` test steps use the `tests-private` image from the release payload (not `tests-private-pr` built from the PR), so code changes in `openshift-tests-private` PRs do not affect these test runs.

Test plan