Skip dockercfg secret wait when image-registry pods are unhealthy #30782

Open
weinliu wants to merge 1 commit into openshift:main from weinliu:fix-winc-1578-dockercfg-timeout

weinliu commented Feb 13, 2026

Summary

Fixes WINC-1578

On debug/development clusters (e.g., Prow CI debug-winc-* jobs), the service account token controller may be broken, which means dockercfg secrets are never created. Even though the ImageRegistry capability reports as enabled, the image-registry pods are not Running and Ready.

This causes setupProject() to wait 3 minutes per service account for dockercfg secrets that will never appear, then fail with a timeout:

fail [github.com/openshift/origin/test/extended/util/client.go:424]:
timed out waiting for the condition (3m5s)

All 4 debug-winc-* Prow CI jobs (AWS, Azure, GCP, vSphere) have been continuously failing due to this issue. Every single test case times out at 3m5s.

Root Cause Analysis

  1. compat_otp.NewCLIWithoutNamespace("default") calls origin's exutil.NewCLI(), which registers setupProject() in BeforeEach
  2. setupProject() checks IsCapabilityEnabled(ImageRegistry) → returns true (capability is enabled)
  3. Calls WaitForServiceAccountWithSecret() for "default" and "builder" SAs
  4. WaitForServiceAccountWithSecret() polls for 3 minutes waiting for -dockercfg- in sa.ImagePullSecrets
  5. On debug clusters, SA token controller is broken → dockercfg secrets never created → 3-minute timeout per SA
  6. Result: every test that uses NewCLI/NewCLIWithoutNamespace fails after 3m5s
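
The wait in steps 3 to 5 can be sketched as a plain polling loop. This is a minimal illustration, not origin's WaitForServiceAccountWithSecret(): the serviceAccount type, hasDockercfgSecret, and waitForDockercfg are hypothetical names, and the demo uses a 50 ms timeout in place of the 3-minute wait described above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// serviceAccount is a hypothetical stand-in for a Kubernetes ServiceAccount;
// only the names of its image pull secrets matter for this sketch.
type serviceAccount struct {
	Name             string
	ImagePullSecrets []string
}

// hasDockercfgSecret reports whether any image pull secret name contains
// "-dockercfg-", the marker the wait in step 4 looks for.
func hasDockercfgSecret(sa serviceAccount) bool {
	for _, name := range sa.ImagePullSecrets {
		if strings.Contains(name, "-dockercfg-") {
			return true
		}
	}
	return false
}

var errTimedOut = errors.New("timed out waiting for the condition")

// waitForDockercfg calls get() every interval until a dockercfg secret
// shows up or the timeout elapses.
func waitForDockercfg(get func() serviceAccount, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if hasDockercfgSecret(get()) {
			return nil
		}
		if time.Now().After(deadline) {
			return errTimedOut
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulates a cluster where the secret is never created (step 5).
	broken := func() serviceAccount { return serviceAccount{Name: "default"} }
	err := waitForDockercfg(broken, 10*time.Millisecond, 50*time.Millisecond)
	fmt.Println("default SA:", err)
}
```

Because the condition can never become true on such a cluster, the loop always runs for the full timeout, which is why each service account costs the whole wait before the test fails.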

The problem is that setupProject() only checks whether the ImageRegistry capability is enabled, but does not verify whether the image-registry pods are actually healthy.

Changes

Added a pod health check in setupProject() after the ImageRegistry capability check:

  • Lists pods in openshift-image-registry namespace with label docker-registry=default
  • Checks if at least one pod is Running AND has Ready condition
  • If pods exist but none are healthy → skip the dockercfg secret and role binding wait
  • If pods are healthy → existing behavior unchanged
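
The check described in the bullets above might look roughly like the following. It is a sketch under assumptions, not the code in this PR: the pod type is a trimmed-down, hypothetical stand-in for the Kubernetes Pod object, the pod list is passed in rather than fetched from the openshift-image-registry namespace, and the function names are invented for illustration.

```go
package main

import "fmt"

// pod is a hypothetical, simplified view of a Kubernetes Pod.
type pod struct {
	Phase string // pod phase, e.g. "Running" or "Pending"
	Ready bool   // true when the pod's Ready condition is "True"
}

// hasHealthyPod reports whether at least one pod is Running AND Ready.
func hasHealthyPod(pods []pod) bool {
	for _, p := range pods {
		if p.Phase == "Running" && p.Ready {
			return true
		}
	}
	return false
}

// shouldSkipDockercfgWait returns true only when pods exist but none are
// healthy. An empty list returns false, leaving that case to whatever
// logic already handles it.
func shouldSkipDockercfgWait(pods []pod) bool {
	return len(pods) > 0 && !hasHealthyPod(pods)
}

func main() {
	debugCluster := []pod{{Phase: "Pending"}, {Phase: "Running", Ready: false}}
	normalCluster := []pod{{Phase: "Running", Ready: true}}

	fmt.Println("debug cluster skip:", shouldSkipDockercfgWait(debugCluster))
	fmt.Println("normal cluster skip:", shouldSkipDockercfgWait(normalCluster))
}
```

Requiring both the Running phase and the Ready condition matters because a pod can be Running while still failing its readiness probe.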

Impact

| Scenario | Before | After |
| --- | --- | --- |
| Normal cluster (pods healthy) | Wait for dockercfg ✅ | Wait for dockercfg ✅ (no change) |
| ImageRegistry disabled | Skip wait ✅ | Skip wait ✅ (no change) |
| No image-registry pods | Skip wait ✅ | Skip wait ✅ (no change) |
| Debug cluster (pods unhealthy) | Wait 3min → timeout ❌ | Skip wait ✅ (fixed) |
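
One way to read the four scenarios is as two predicates over the same inputs. The sketch below simply encodes the Before and After columns as stated; the scenario type, its fields, and both functions are hypothetical and do not reflect how setupProject() is actually structured.

```go
package main

import "fmt"

// scenario mirrors one row of the Impact table (hypothetical type).
type scenario struct {
	name              string
	capabilityEnabled bool // ImageRegistry capability reported as enabled
	podCount          int  // image-registry pods found
	healthyCount      int  // pods that are Running and Ready
}

// waitsBefore encodes the "Before" column: the wait happens whenever the
// capability is enabled and registry pods exist, healthy or not.
func waitsBefore(s scenario) bool {
	return s.capabilityEnabled && s.podCount > 0
}

// waitsAfter encodes the "After" column: the wait additionally requires
// at least one healthy pod.
func waitsAfter(s scenario) bool {
	return s.capabilityEnabled && s.healthyCount > 0
}

func main() {
	rows := []scenario{
		{"Normal cluster (pods healthy)", true, 1, 1},
		{"ImageRegistry disabled", false, 0, 0},
		{"No image-registry pods", true, 0, 0},
		{"Debug cluster (pods unhealthy)", true, 1, 0},
	}
	for _, s := range rows {
		fmt.Printf("%-32s before=%-5v after=%v\n", s.name, waitsBefore(s), waitsAfter(s))
	}
}
```

Only the last row differs between the two predicates, which matches the single behavioral change claimed in the table.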

Evidence

Prow CI failure logs from openshift-tests-private PR #29169:

  • All 35 winc test cases fail with identical timed out waiting for the condition (3m5s) at client.go:424
  • Job link: pull-ci-openshift-openshift-tests-private-main-debug-winc-vsphere-ipi

Note: Direct verification via openshift-tests-private Prow CI is not possible because the debug-winc-* test steps use the tests-private image from the release payload (not tests-private-pr built from the PR), so code changes in openshift-tests-private PRs do not affect these test runs.

Test plan

  • Verify on normal clusters: dockercfg wait behavior unchanged (image-registry pods Running+Ready → still waits for secrets)
  • Verify on debug clusters: no 3-minute timeout when image-registry pods are unhealthy

On debug/development clusters, the service account token controller
may be broken, which prevents dockercfg secrets from being created.
Even though the ImageRegistry capability reports as enabled, the
image-registry pods are not Running and Ready. This causes
setupProject() to wait 3 minutes per SA for secrets that will never
appear, then fail with a timeout.

Add a pod health check after the ImageRegistry capability check: if
image-registry pods exist but none are Running and Ready, skip the
dockercfg secret and role binding wait entirely. This fixes the
3-minute timeout seen in Prow CI debug builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode


openshift-ci bot commented Feb 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weinliu
Once this PR has been reviewed and has the lgtm label, please assign dgoodwin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi


openshift-ci bot commented Feb 13, 2026

@weinliu: The following test failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-ovn-upgrade | 5b5344d | link | true | /test e2e-gcp-ovn-upgrade |
