Skip to content

Add a metric and an alert for conditional risk condtions#1318

Open
hongkailiu wants to merge 4 commits intoopenshift:mainfrom
hongkailiu:cluster_version_risk_condition
Open

Add a metric and an alert for conditional risk condtions#1318
hongkailiu wants to merge 4 commits intoopenshift:mainfrom
hongkailiu:cluster_version_risk_condition

Conversation

@hongkailiu
Copy link
Member

@hongkailiu hongkailiu commented Feb 15, 2026

Follow up #1284 (comment)

The new metic is cluster_version_risk_conditions{name, condition, risk} and the alert is firing if some risk applies to the cluster.

The samples for cluster_version_risk_conditions will be collected
only when its operator shouldReconcileAcceptRisks. It means, e.g.,
on a TechPreview disabled cluster the metric is still defined but
has no samples.

The alert is defined in a new
PrometheusRule/cluster-version-operator-tech-preview
which is installed only on a TechPreview cluster.

@coderabbitai
Copy link

coderabbitai bot commented Feb 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a Prometheus GaugeVec metric cluster_version_risk_conditions (labels: name, condition, risk), wires its collection into the operator metrics flow, adds unit tests for the collector, and introduces a PrometheusRule alert that fires on risk conditions.

Changes

Cohort / File(s) Summary
Metrics Implementation
pkg/cvo/metrics.go
Adds clusterVersionRiskConditions *prometheus.GaugeVec to operatorMetrics, initializes it in newOperatorMetrics, exposes its descriptor in Describe, adds collectConditionalUpdateRisks to convert ConditionalUpdateRisk entries to metrics, and calls it from the per-ClusterVersion and main Collect flows.
Metrics Testing
pkg/cvo/metrics_test.go
Adds Test_collectConditionalUpdateRisks to assert emitted metrics for various ConditionalUpdateRisk cases. The test appears duplicated (two identical definitions) in the file.
Alert Rule
install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml
Adds a PrometheusRule with a RiskApplies alert that fires when cluster_version_risk_conditions indicates an Applies risk, grouped by namespace/name/risk, with a 10-minute window and warning severity.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@install/0000_90_cluster-version-operator_02_servicemonitor.yaml`:
- Around line 86-95: The runbook_url for the alert RiskApplies points to a
non-existent path; update the runbook_url field in the RiskApplies alert stanza
to use the cluster-version-operator runbook path used by similar alerts (e.g.,
match the URL used by ClusterVersionOperatorDown) — i.e., replace
https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
with
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md
or alternatively create the missing runbook at the original path so the current
URL remains valid.

In `@pkg/cvo/metrics_test.go`:
- Line 976: The test function named Test_collectConditionalUpdates is testing
collectConditionalUpdateRisks (not collectConditionalUpdates), which is
misleading and conflicts with TestCollectUnknownConditionalUpdates; rename the
test to something like Test_collectConditionalUpdateRisks (or
TestCollectConditionalUpdateRisks) and update any references so the test name
clearly matches the function under test (collectConditionalUpdateRisks) and does
not collide with the existing TestCollectUnknownConditionalUpdates that targets
collectConditionalUpdates.

@hongkailiu hongkailiu force-pushed the cluster_version_risk_condition branch 2 times, most recently from 3b6c582 to 129bc80 Compare February 15, 2026 13:52
@hongkailiu
Copy link
Member Author

Testing with 129bc80:

launch 4.22,openshift/cluster-version-operator#1318 aws,techpreview

Then

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://fauxinnati-fauxinnati.apps.ota-stage.q2z4.p1.openshiftapps.com/api/upgrades_info/graph"}}' --type=merge

$ oc adm upgrade channel risks-always

The alert is Pending.

Screenshot 2026-02-15 at 10 46 12

and

Screenshot 2026-02-15 at 10 46 27

@hongkailiu
Copy link
Member Author

/hold

This feature has to be gated by TP.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 16, 2026
@hongkailiu hongkailiu force-pushed the cluster_version_risk_condition branch 4 times, most recently from 71cca53 to 9138c46 Compare February 16, 2026 21:35
Follow up [1].

The samples for `cluster_version_risk_conditions` will be collected
only when its operator `shouldReconcileAcceptRisks`. It means, e.g.,
on a TechPreview disabled cluster the metric is still defined but
has no samples.

[1]. openshift#1284 (comment)
The alert is defined in a new
`PrometheusRule/cluster-version-operator-tech-preview`
which is installed only on a TechPreview cluster.
@hongkailiu hongkailiu force-pushed the cluster_version_risk_condition branch from 9138c46 to 2822b83 Compare February 16, 2026 22:07
@hongkailiu
Copy link
Member Author

hongkailiu commented Feb 16, 2026

Test with 2822b83

launch 4.22,openshift/cluster-version-operator#1318 aws,techpreview
$ oc get prometheusrule -n openshift-cluster-version cluster-version-operator-tech-preview
NAME                                    AGE
cluster-version-operator-tech-preview   37m

Repeating the steps in #1318 (comment) get the same result.


launch 4.22,openshift/cluster-version-operator#1318 gcp,single-node
$ oc get prometheusrule -n openshift-cluster-version cluster-version-operator-tech-preview
Error from server (NotFound): prometheusrules.monitoring.coreos.com "cluster-version-operator-tech-preview" not found

and no data found for cluster_version_risk_conditions.

Screenshot 2026-02-16 at 23 10 44

@hongkailiu
Copy link
Member Author

/retest-required

@hongkailiu
Copy link
Member Author

Still holding.
Wait for #1323 to get in.

continue
}

g := m.clusterVersionRiskConditions.WithLabelValues("version", condition.Type, risk.Name)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that the name label value is always version. Any future plan to have more values?

BTW the same question applies to the condition label which is always internal.ConditionalUpdateRiskConditionTypeApplies?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have only one ClusterVersion object on the cluster.

$ oc get clusterversion --kubeconfig /tmp/ota-stage.c
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.20.10   True        False         33d     Cluster version is 4.20.10

I do not see we have more to come and thus removed the "name" label with 28563c5.

Applies is the one we care for now.
But we may have more in the future.

// +kubebuilder:validation:XValidation:rule="self.exists_one(x, x.type == 'Applies')",message="must contain a condition of type 'Applies'"
// +kubebuilder:validation:MaxItems=8
// +kubebuilder:validation:MinItems=1

I would like to keep this label.

Copy link

@simonpasquier simonpasquier Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure it sounds good if you plan to have more conditions in the future.

Copy link
Member Author

@hongkailiu hongkailiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. @simonpasquier

continue
}

g := m.clusterVersionRiskConditions.WithLabelValues("version", condition.Type, risk.Name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have only one ClusterVersion object on the cluster.

$ oc get clusterversion --kubeconfig /tmp/ota-stage.c
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.20.10   True        False         33d     Cluster version is 4.20.10

I do not see we have more to come and thus removed the "name" label with 28563c5.

Applies is the one we care for now.
But we may have more in the future.

// +kubebuilder:validation:XValidation:rule="self.exists_one(x, x.type == 'Applies')",message="must contain a condition of type 'Applies'"
// +kubebuilder:validation:MaxItems=8
// +kubebuilder:validation:MinItems=1

I would like to keep this label.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 17, 2026

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/unit db8a35a link true /test unit
ci/prow/e2e-hypershift db8a35a link true /test e2e-hypershift
ci/prow/e2e-hypershift-conformance db8a35a link true /test e2e-hypershift-conformance

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants