Add a metric and an alert for conditional risk condtions#1318
Add a metric and an alert for conditional risk condtions#1318hongkailiu wants to merge 4 commits intoopenshift:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a Prometheus GaugeVec metric Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes ✨ Finishing touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hongkailiu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@install/0000_90_cluster-version-operator_02_servicemonitor.yaml`:
- Around line 86-95: The runbook_url for the alert RiskApplies points to a
non-existent path; update the runbook_url field in the RiskApplies alert stanza
to use the cluster-version-operator runbook path used by similar alerts (e.g.,
match the URL used by ClusterVersionOperatorDown) — i.e., replace
https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
with
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md
or alternatively create the missing runbook at the original path so the current
URL remains valid.
In `@pkg/cvo/metrics_test.go`:
- Line 976: The test function named Test_collectConditionalUpdates is testing
collectConditionalUpdateRisks (not collectConditionalUpdates), which is
misleading and conflicts with TestCollectUnknownConditionalUpdates; rename the
test to something like Test_collectConditionalUpdateRisks (or
TestCollectConditionalUpdateRisks) and update any references so the test name
clearly matches the function under test (collectConditionalUpdateRisks) and does
not collide with the existing TestCollectUnknownConditionalUpdates that targets
collectConditionalUpdates.
install/0000_90_cluster-version-operator_02_servicemonitor.yaml
Outdated
Show resolved
Hide resolved
3b6c582 to
129bc80
Compare
|
Testing with 129bc80: Then The alert is
and
|
|
/hold This feature has to be gated by TP. |
71cca53 to
9138c46
Compare
Follow up [1]. The samples for `cluster_version_risk_conditions` will be collected only when its operator `shouldReconcileAcceptRisks`. It means, e.g., on a TechPreview disabled cluster the metric is still defined but has no samples. [1]. openshift#1284 (comment)
The alert is defined in a new `PrometheusRule/cluster-version-operator-tech-preview` which is installed only on a TechPreview cluster.
9138c46 to
2822b83
Compare
|
Test with 2822b83 Repeating the steps in #1318 (comment) get the same result. $ oc get prometheusrule -n openshift-cluster-version cluster-version-operator-tech-preview
Error from server (NotFound): prometheusrules.monitoring.coreos.com "cluster-version-operator-tech-preview" not foundand no data found for
|
|
/retest-required |
|
Still holding. |
pkg/cvo/metrics.go
Outdated
| continue | ||
| } | ||
|
|
||
| g := m.clusterVersionRiskConditions.WithLabelValues("version", condition.Type, risk.Name) |
There was a problem hiding this comment.
it seems that the name label value is always version. Any future plan to have more values?
BTW the same question applies to the condition label which is always internal.ConditionalUpdateRiskConditionTypeApplies?
There was a problem hiding this comment.
We have only one ClusterVersion object on the cluster.
$ oc get clusterversion --kubeconfig /tmp/ota-stage.c
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.20.10 True False 33d Cluster version is 4.20.10I do not see we have more to come and thus removed the "name" label with 28563c5.
Applies is the one we care for now.
But we may have more in the future.
I would like to keep this label.
There was a problem hiding this comment.
sure it sounds good if you plan to have more conditions in the future.
hongkailiu
left a comment
There was a problem hiding this comment.
Thanks for the review. @simonpasquier
pkg/cvo/metrics.go
Outdated
| continue | ||
| } | ||
|
|
||
| g := m.clusterVersionRiskConditions.WithLabelValues("version", condition.Type, risk.Name) |
There was a problem hiding this comment.
We have only one ClusterVersion object on the cluster.
$ oc get clusterversion --kubeconfig /tmp/ota-stage.c
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.20.10 True False 33d Cluster version is 4.20.10I do not see we have more to come and thus removed the "name" label with 28563c5.
Applies is the one we care for now.
But we may have more in the future.
I would like to keep this label.
|
@hongkailiu: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |



Follow up #1284 (comment)
The new metic is
cluster_version_risk_conditions{name, condition, risk}and the alert is firing if some risk applies to the cluster.The samples for
cluster_version_risk_conditionswill be collectedonly when its operator
shouldReconcileAcceptRisks. It means, e.g.,on a TechPreview disabled cluster the metric is still defined but
has no samples.
The alert is defined in a new
PrometheusRule/cluster-version-operator-tech-previewwhich is installed only on a TechPreview cluster.