fix: proxy-webhook selector matches operator pods #3228

Draft
bowling233 wants to merge 1 commit into tektoncd:main from ZJUSCT:main

@bowling233

Changes

Fixes #3227

Both the tekton-operator and tekton-operator-proxy-webhook Deployments
label their Pods with name: tekton-operator. The
tekton-operator-proxy-webhook Service uses this same label as its only
selector, so it inadvertently load-balances traffic across both Deployments.
Because tekton-operator pods do not serve on port 8443, ~50% of admission
webhook requests fail with connection refused. Since the
MutatingWebhookConfiguration has failurePolicy: Fail, each failure
immediately rejects TaskRun Pod creation.
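
The routing failure can be modelled in a few lines. This is a simplified sketch of equality-based selection, not the Kubernetes implementation: the pod names and replica counts are hypothetical, and it assumes requests are spread evenly across the selected pods.

```python
def selects(selector, labels):
    """Equality-based match: every selector key must be present with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


def failure_fraction(selector, pods):
    """Fraction of selected pods that do not serve the webhook port.

    Assumes traffic is spread evenly across the selected pods.
    """
    selected = [pod for pod in pods if selects(selector, pod["labels"])]
    if not selected:
        return 1.0  # no endpoints at all: treat every request as failing
    broken = [pod for pod in selected if not pod["serves_webhook"]]
    return len(broken) / len(selected)


# Hypothetical pods: one replica per Deployment, both carrying the shared labels.
shared = {"app": "tekton-operator", "name": "tekton-operator"}
pods = [
    {"name": "operator-pod", "labels": dict(shared), "serves_webhook": False},
    {"name": "proxy-webhook-pod", "labels": dict(shared), "serves_webhook": True},
]

service_selector = {"name": "tekton-operator"}
print(failure_fraction(service_selector, pods))  # 0.5 with one replica of each
```

With unequal replica counts the share shifts accordingly, so "~50%" corresponds to one replica of each Deployment under this model.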

Changes:

  • cmd/kubernetes/operator/kodata/webhook/webhook.yaml: rename the
    proxy-webhook Deployment's matchLabels selector and pod template label
    from name: tekton-operator to name: tekton-operator-proxy-webhook;
    update the Service selector to match.
  • cmd/openshift/operator/kodata/webhook/webhook.yaml: same change for the
    OpenShift manifest.

The existing app: tekton-operator label is preserved on both Deployments.
No other resources are affected.

Alternative considered: adding a set-based (NotIn) expression to the
Service selector to exclude tekton-operator pods. This was not viable
because a Service's spec.selector is a plain key/value map that supports
only equality-based matching; set-based expressions (matchExpressions)
are not available on Services.
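
The limitation can be made concrete with a small model. In the sketch below, `matches_expression` and `as_service_selector` are hypothetical helpers: the first evaluates one set-based requirement, the second shows that only a single-value `In` requirement can be written as the flat string-to-string map a Service selector holds.

```python
def matches_expression(labels, key, operator, values):
    """Evaluate a single set-based requirement (In / NotIn only in this sketch)."""
    if operator == "In":
        return labels.get(key) in values
    if operator == "NotIn":
        # As in Kubernetes, a pod without the key satisfies NotIn.
        return labels.get(key) not in values
    raise ValueError(f"unsupported operator: {operator}")


def as_service_selector(requirements):
    """Convert requirements to a flat map, which is all a Service selector can hold."""
    selector = {}
    for key, operator, values in requirements:
        if operator != "In" or len(values) != 1:
            raise ValueError(f"{key} {operator} {values} has no equality-based form")
        selector[key] = values[0]
    return selector


print(as_service_selector([("name", "In", ["tekton-operator-proxy-webhook"])]))
try:
    as_service_selector([("name", "NotIn", ["tekton-operator"])])
except ValueError as err:
    print("rejected:", err)
```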

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Note on tests: This bug only manifests at the Service routing layer
(i.e., ~50% of requests land on a pod with no server). There is no
in-tree unit or integration test that exercises which pods a Service
selects. A targeted e2e test verifying that the proxy-webhook Service
endpoints do not include tekton-operator pods would be a good addition,
but is left for a follow-up.
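
Such a follow-up check could take roughly the following shape. This is a sketch over plain data rather than a real e2e test: `foreign_endpoints`, the pod names, and the IPs are hypothetical, and a real test would read the Service endpoints and the pod list from the cluster.

```python
def foreign_endpoints(endpoint_ips, pods, required_labels):
    """Return names of pods that back the Service but lack the required labels."""
    by_ip = {pod["ip"]: pod for pod in pods}
    foreign = []
    for ip in endpoint_ips:
        pod = by_ip.get(ip)
        if pod is None:
            continue  # endpoint not backed by a known pod; ignored in this sketch
        labels = pod["labels"]
        if any(labels.get(key) != value for key, value in required_labels.items()):
            foreign.append(pod["name"])
    return foreign


# Hypothetical post-fix state: only the proxy-webhook pod carries the new label.
pods = [
    {"name": "operator-pod", "ip": "10.0.0.1",
     "labels": {"app": "tekton-operator", "name": "tekton-operator"}},
    {"name": "proxy-webhook-pod", "ip": "10.0.0.2",
     "labels": {"app": "tekton-operator", "name": "tekton-operator-proxy-webhook"}},
]
required = {"name": "tekton-operator-proxy-webhook"}

print(foreign_endpoints(["10.0.0.2"], pods, required))              # healthy case
print(foreign_endpoints(["10.0.0.1", "10.0.0.2"], pods, required))  # regression case
```

The test would pass only when the returned list is empty.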

Release Notes

Fix: the tekton-operator-proxy-webhook Service selector incorrectly matched
tekton-operator pods in addition to proxy-webhook pods, causing ~50% of
admission webhook requests to fail with "connection refused" and TaskRun Pod
creation to be rejected. Users on v0.78.1 can work around this until upgrading
by adding `pod-template-hash: <webhook-pod-hash>` to the Service selector.
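
One way to apply that workaround is a merge patch on the Service. The sketch below only builds the patch body. The helper name is hypothetical, and the hash value must be read from the labels of the running proxy-webhook pod. Because `pod-template-hash` changes whenever the Deployment's pod template changes, the patch would need to be reapplied after each rollout.

```python
import json


def selector_patch(pod_template_hash):
    """Build a merge-patch body that adds pod-template-hash to the Service selector.

    A merge patch only adds or overwrites the keys it names, so the existing
    `name: tekton-operator` selector entry is left in place.
    """
    if not pod_template_hash or "<" in pod_template_hash:
        raise ValueError("replace the placeholder with the real pod-template-hash value")
    return {"spec": {"selector": {"pod-template-hash": pod_template_hash}}}


patch = selector_patch("abc123")  # "abc123" is a made-up hash for illustration
print(json.dumps(patch))
```

The printed JSON could then be passed to `kubectl patch service tekton-operator-proxy-webhook --type merge -p '<json>'` in the namespace where the operator is installed.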

Both the `tekton-operator` and `tekton-operator-proxy-webhook`
Deployments label their Pods with `name: tekton-operator`. The
`tekton-operator-proxy-webhook` Service uses this same label as
its only selector, so it inadvertently load-balances traffic
across both Deployments. Because `tekton-operator` pods do not
serve on port 8443, ~50% of admission webhook requests fail:

  failed calling webhook "proxy.operator.tekton.dev":
  Post ".../tekton-operator-proxy-webhook.../defaulting":
  dial tcp <ClusterIP>:443: connect: connection refused

Because MutatingWebhookConfiguration has `failurePolicy: Fail`,
each such failure immediately rejects TaskRun Pod creation.

Rename the proxy-webhook Deployment's selector matchLabels and
pod template label from `name: tekton-operator` to
`name: tekton-operator-proxy-webhook`, and update the Service
selector to match. The `app: tekton-operator` label is left
unchanged. Applies to both Kubernetes and OpenShift manifests.

Adding a set-based (NotIn) expression to the Service selector
instead was not viable: a Service's `spec.selector` is a plain
key/value map and supports only equality-based matching.
Copilot AI review requested due to automatic review settings February 19, 2026 10:19
@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 19, 2026
@linux-foundation-easycla

linux-foundation-easycla bot commented Feb 19, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: bowling233 / name: Baolin Zhu (49cacf1)

@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign anithapriyanatarajan after the PR has been reviewed.
You can assign the PR to them by writing /assign @anithapriyanatarajan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 19, 2026

Copilot AI left a comment

Pull request overview

Fixes a production routing bug where the tekton-operator-proxy-webhook Service selector unintentionally matched both proxy-webhook and main operator pods, causing intermittent admission webhook failures and rejected TaskRun Pod creation.

Changes:

  • Update the proxy-webhook Deployment selector + pod template label to use name: tekton-operator-proxy-webhook.
  • Update the proxy-webhook Service selector to match the new pod label (Kubernetes + OpenShift manifests).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| cmd/kubernetes/operator/kodata/webhook/webhook.yaml | Aligns proxy-webhook Deployment/Service selectors to target only proxy-webhook pods on Kubernetes. |
| cmd/openshift/operator/kodata/webhook/webhook.yaml | Same selector/label fix for the OpenShift manifest. |

@jkhelil
Member

jkhelil commented Feb 22, 2026

@bowling233, thanks for your PR.

  • Can you check what happens to existing clusters during an upgrade (install 0.78.1, then apply your change)?
    Please describe the result and post proof that the upgrade works and is not broken.

@anithapriyanatarajan
Contributor

@bowling233 - Request you to address the review comment, if you are still pursuing this PR. Thank you. 🙇‍♀️

@bowling233
Author

Hi @anithapriyanatarajan,

So sorry for the late response! I won't have the bandwidth to properly validate these changes until next month.

I can confirm this approach has side effects—specifically, the HorizontalPodAutoscaler is hitting FailedGetResourceMetric errors because it's incorrectly picking up the main operator pods, which lack the expected CPU requests in the tekton-operator-lifecycle container.

This PR definitely needs more refinement to handle the selector immutability and the HPA configuration. Should I move this to a Draft for now, or would you prefer I close this and resubmit once I've validated a full fix?
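
The immutability concern can be stated as a small check. In `apps/v1`, a Deployment's `spec.selector` cannot be changed after creation, so a manifest that alters `matchLabels` cannot be applied as an in-place update. The helper below is a hypothetical pre-upgrade sketch, not part of the operator.

```python
def upgrade_action(installed_match_labels, desired_match_labels):
    """Decide how a Deployment can move to the desired selector.

    An unchanged selector allows a normal update; a changed one requires
    deleting and recreating the Deployment, since the field is immutable.
    """
    if installed_match_labels == desired_match_labels:
        return "update-in-place"
    return "delete-and-recreate"


installed = {"name": "tekton-operator"}
desired = {"name": "tekton-operator-proxy-webhook"}
print(upgrade_action(installed, desired))
```

For the rename proposed here the result is `delete-and-recreate`, which is the case the upgrade test from 0.78.1 would need to cover.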

@bowling233 bowling233 marked this pull request as draft March 10, 2026 14:34
@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 10, 2026
Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

tekton-operator-proxy-webhook Service selector matches operator pods, causing ~50% webhook admission failures

5 participants