Generate riskConditions in a better way #1324
hongkailiu wants to merge 2 commits into openshift:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Risk evaluation was reworked: risk-version mapping and ordering were extracted, and risk conditions are preloaded during availability calculation.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/cvo/availableupdates.go (1)
629-632: Consider graceful error handling instead of panic. If `riskCondition` is nil due to any data inconsistency (race condition, bug in helper functions, etc.), the operator will crash. While this "should never happen" given the current code flow, defensive error handling would improve reliability.

♻️ Suggested alternative

```diff
 riskCondition := meta.FindStatusCondition(riskConditions[risk.Name], internal.ConditionalUpdateRiskConditionTypeApplies)
 if riskCondition == nil {
-	panic(fmt.Sprintf("failed to find conditional update risk %s in risk conditions map", risk.Name))
+	klog.Errorf("BUG: failed to find conditional update risk %s in risk conditions map, treating as unknown", risk.Name)
+	recommended.Status = newRecommendedStatus(recommended.Status, metav1.ConditionUnknown)
+	recommended.Reason = newRecommendedReason(recommended.Reason, recommendedReasonEvaluationFailed)
+	errorMessages = append(errorMessages, fmt.Sprintf("Internal error evaluating risk %s", risk.Name))
+	continue
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cvo/availableupdates.go` around lines 629 - 632, Replace the panic when meta.FindStatusCondition(...) returns nil with graceful handling: check if riskCondition is nil and, instead of panicking, log an error including risk.Name and the fact FindStatusCondition returned nil (include internal.ConditionalUpdateRiskConditionTypeApplies in the message for context) and then skip further processing of that risk (continue the loop) so the operator does not crash; update the code paths around riskCondition, meta.FindStatusCondition, and the loop iterating risks to handle the skipped case safely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cvo/availableupdates_test.go`:
- Around line 872-900: The test is flaky because risksInOrder builds a slice
from the map and sorts only by version, so equal versions (e.g., riskB and
riskC) rely on non-deterministic map iteration; update the risksInOrder function
to add a deterministic secondary sort key (e.g., compare the risk name) when
versions are equal (use the existing riskWithVersion entries and sort.Slice or
sort.SliceStable comparator to first compare version, then compare risk.Name or
the map key) so ordering is stable, or alternatively change the test to not
expect a specific order for equal-version risks.
In `@pkg/cvo/availableupdates.go`:
- Around line 392-394: The inline comment on the sort is incorrect:
sort.SliceStable(temp, func(i, j int) bool { return
temp[i].version.GT(temp[j].version) }) actually sorts in descending order
(higher versions first); either change the comparator to use version.LT (or flip
indices) to produce ascending order, or update the comment to "Sort in
descending order" to match the current behavior—locate the sort.SliceStable call
and fix the comparator or the comment accordingly.
---
Nitpick comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 629-632: Replace the panic when meta.FindStatusCondition(...)
returns nil with graceful handling: check if riskCondition is nil and, instead
of panicking, log an error including risk.Name and the fact FindStatusCondition
returned nil (include internal.ConditionalUpdateRiskConditionTypeApplies in the
message for context) and then skip further processing of that risk (continue the
loop) so the operator does not crash; update the code paths around
riskCondition, meta.FindStatusCondition, and the loop iterating risks to handle
the skipped case safely.
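For illustration, here is a minimal, self-contained Go sketch of the deterministic-ordering fix described in the prompt above: sort primarily by version (descending, matching the existing comparator) and break ties by risk name. The `riskWithVersion` struct and field names are assumptions drawn from the review comments, not the PR's exact code.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/blang/semver/v4"
)

type riskWithVersion struct {
	name    string
	version semver.Version
}

func risksInOrder(riskVersions map[string]semver.Version) []string {
	temp := make([]riskWithVersion, 0, len(riskVersions))
	for name, v := range riskVersions {
		temp = append(temp, riskWithVersion{name: name, version: v})
	}
	sort.SliceStable(temp, func(i, j int) bool {
		// Primary key: higher version first (descending order).
		if !temp[i].version.EQ(temp[j].version) {
			return temp[i].version.GT(temp[j].version)
		}
		// Secondary key: risk name, so equal versions order deterministically
		// regardless of map iteration order.
		return temp[i].name < temp[j].name
	})
	names := make([]string, 0, len(temp))
	for _, t := range temp {
		names = append(names, t.name)
	}
	return names
}

func main() {
	fmt.Println(risksInOrder(map[string]semver.Version{
		"riskC": semver.MustParse("4.16.0"),
		"riskB": semver.MustParse("4.16.0"),
		"riskA": semver.MustParse("4.17.0"),
	}))
	// Output: [riskA riskB riskC]
}
```

With the name tie-break, risks that share a version (the riskB/riskC case in the flaky test) come out in the same order on every run.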
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files: Approvers can indicate their approval by writing `/approve` in a comment.
e6bcf40 to fdd9f72 (Compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 334-353: Risk evaluation currently runs while holding statusLock
and calls conditionRegistry.Match via loadRiskConditions using
context.Background(), which can block the lock and ignores cancellation; to fix,
inside the critical section (statusLock) copy the minimal state needed
(u.ConditionalUpdates, u.AcceptRisks, u.ShouldReconcileAcceptRisks, and any risk
lists generated by loadRiskVersions/risksInOrder) into local variables, release
the lock, then call loadRiskConditions/evaluateConditionalUpdate using a proper
context (propagate the caller ctx or create a bounded timeout context instead of
context.Background()); finally, re-acquire the lock only to write back the
computed RiskConditions and to update u.ConditionalUpdates via
u.addUpdate/u.removeUpdate and meta.SetStatusCondition so the lock is held only
for quick state mutation.
---
Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 385-394: risksInOrder builds a slice from the riskVersions map and
sorts only by version, causing nondeterministic order when versions are equal;
update the sort.SliceStable comparator in risksInOrder to use a secondary key
(the risk name) when versions compare equal (e.g., if
temp[i].version.EQ(temp[j].version) then return temp[i].name < temp[j].name) so
ordering is deterministic; keep the primary version comparison as-is and ensure
the comparator returns a consistent boolean for all cases.
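For reference, a rough, self-contained Go sketch of the locking pattern the statusLock comment above asks for: copy the needed inputs while holding the lock, run the (potentially slow, Prometheus-backed) risk evaluation outside it with a bounded context, then re-acquire the lock only to write results back. The types, field names, and the `evaluateRisks` stub are illustrative stand-ins, not the operator's actual structures.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type availableUpdates struct {
	statusLock         sync.Mutex
	ConditionalUpdates []string            // stand-in for the real conditional-update type
	RiskConditions     map[string][]string // stand-in for the evaluated risk conditions
}

// evaluateRisks stands in for loadRiskConditions/conditionRegistry.Match, which
// may block on network calls and therefore must not run under the lock.
func evaluateRisks(ctx context.Context, updates []string) map[string][]string {
	out := map[string][]string{}
	for _, u := range updates {
		if ctx.Err() != nil {
			return out // honor cancellation instead of ignoring it
		}
		out[u] = []string{"evaluated"}
	}
	return out
}

func (u *availableUpdates) refresh(ctx context.Context) {
	// 1. Copy the minimal state needed while holding the lock.
	u.statusLock.Lock()
	updates := append([]string(nil), u.ConditionalUpdates...)
	u.statusLock.Unlock()

	// 2. Do the slow evaluation outside the lock, with a bounded context
	//    instead of context.Background().
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	conditions := evaluateRisks(ctx, updates)

	// 3. Re-acquire the lock only for the quick write-back of results.
	u.statusLock.Lock()
	u.RiskConditions = conditions
	u.statusLock.Unlock()
}

func main() {
	u := &availableUpdates{ConditionalUpdates: []string{"4.17.2", "4.16.9"}}
	u.refresh(context.Background())
	fmt.Println(u.RiskConditions)
}
```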
🧹 Nitpick comments (1)
pkg/cvo/availableupdates.go (1)
634-637: Consider graceful degradation instead of panic. The `panic` at line 636 will crash the operator if a risk name is missing from the precomputed `riskConditions` map. While this "should never happen" given the current flow, defensive error handling (e.g., returning `ConditionUnknown` similar to status.go lines 442-449) would be more resilient to future refactoring or edge cases.

♻️ Suggested defensive handling

```diff
 riskCondition := meta.FindStatusCondition(riskConditions[risk.Name], internal.ConditionalUpdateRiskConditionTypeApplies)
 if riskCondition == nil {
-	panic(fmt.Sprintf("failed to find conditional update risk %s in risk conditions map", risk.Name))
+	klog.Errorf("Internal error: missing risk condition for %s, treating as unknown", risk.Name)
+	riskCondition = &metav1.Condition{
+		Type:   internal.ConditionalUpdateRiskConditionTypeApplies,
+		Status: metav1.ConditionUnknown,
+		Reason: "InternalErrorNoConditionCollected",
+	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cvo/availableupdates.go` around lines 634 - 637, Don't panic when meta.FindStatusCondition(riskConditions[risk.Name], internal.ConditionalUpdateRiskConditionTypeApplies) returns nil; instead create a defensive "unknown" condition and continue. Replace the panic with logic that assigns riskCondition to an explicit unknown condition (Type = internal.ConditionalUpdateRiskConditionTypeApplies, Status = metav1.ConditionUnknown, and a short Reason/Message like "MissingPrecomputed") or call the existing status helper that returns ConditionUnknown used elsewhere, so callers that expect riskCondition can proceed without crashing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 334-354: Risk evaluation (loadRiskConditions -> which calls
conditionRegistry.Match) is being executed while statusLock is held, causing
potential stalls; move the work off the critical section by computing
riskVersions (loadRiskVersions), risks (risksInOrder) and calling
loadRiskConditions outside the statusLock, then acquire the lock only to set
u.RiskConditions and to update u.ConditionalUpdates based on the
already-evaluated conditions; specifically, ensure conditionRegistry.Match is
invoked outside the lock and that evaluateConditionalUpdate and
meta.SetStatusCondition still update shared state only while holding statusLock.
---
Nitpick comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 634-637: Don't panic when
meta.FindStatusCondition(riskConditions[risk.Name],
internal.ConditionalUpdateRiskConditionTypeApplies) returns nil; instead create
a defensive "unknown" condition and continue. Replace the panic with logic that
assigns riskCondition to an explicit unknown condition (Type =
internal.ConditionalUpdateRiskConditionTypeApplies, Status =
metav1.ConditionUnknown, and a short Reason/Message like "MissingPrecomputed")
or call the existing status helper that returns ConditionUnknown used elsewhere,
so callers that expect riskCondition can proceed without crashing.
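Since the comments above lean on the apimachinery condition helpers, here is a small runnable sketch of the behavior they rely on: `meta.FindStatusCondition` returns nil when the condition type is absent (the case the panic guards against), and `meta.SetStatusCondition` upserts by `Type`. The `"Applies"` string is a stand-in for `internal.ConditionalUpdateRiskConditionTypeApplies`.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	var conditions []metav1.Condition

	// Absent type: FindStatusCondition returns nil, so callers must handle it,
	// e.g. by falling back to an Unknown condition instead of panicking.
	if meta.FindStatusCondition(conditions, "Applies") == nil {
		meta.SetStatusCondition(&conditions, metav1.Condition{
			Type:   "Applies",
			Status: metav1.ConditionUnknown,
			Reason: "MissingPrecomputed",
		})
	}

	// Upsert: setting the same Type again overwrites the existing entry
	// rather than appending a duplicate.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type:   "Applies",
		Status: metav1.ConditionFalse,
		Reason: "NotExposed",
	})

	fmt.Println(len(conditions), conditions[0].Status) // 1 False
}
```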
fdd9f72 to 11092ac (Compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cvo/availableupdates_test.go`:
- Around line 944-947: The test call to loadRiskConditions is missing the
required context parameter; create a context (e.g., ctx := context.Background()
or context.TODO()) before calling loadRiskConditions and pass it as the first
argument (change loadRiskConditions(tt.risks, tt.riskVersions, registry) to
loadRiskConditions(ctx, tt.risks, tt.riskVersions, registry)); also add the
"context" import to the test file if it isn’t already imported.
---
Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 329-355: The risk evaluation (loadRiskConditions) must not run
while holding the status lock; instead, under the lock copy the minimal state
needed (e.g., optr.availableUpdates.ConditionalUpdates or their DeepCopy
results, riskVersions from loadRiskVersions, and u.ConditionRegistry reference),
then release the lock and call loadRiskConditions(ctx, risks, riskVersions,
copiedConditionRegistry) outside the lock, and finally re-acquire the lock only
to assign u.ConditionalUpdates, u.RiskConditions and to apply results (the loop
that calls evaluateConditionalUpdate, meta.SetStatusCondition and updates via
u.addUpdate/u.removeUpdate and u.ConditionalUpdates[i].Conditions). Ensure you
reference loadRiskConditions, loadRiskVersions, evaluateConditionalUpdate,
u.ConditionRegistry, u.ConditionalUpdates, and u.RiskConditions when making the
change.
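A minimal, runnable sketch of the test-side fix described above: construct a context in the test and pass it as the first argument. The `loadRiskConditions` function here is a trivial stub with the shape the comment implies, not the real helper from pkg/cvo.

```go
package main

import (
	"context"
	"fmt"
)

// Stub with the argument order the review describes: ctx first, then the
// risks, their versions, and a registry-like argument (unused here).
func loadRiskConditions(ctx context.Context, risks []string, riskVersions map[string]string, registry any) map[string]string {
	out := map[string]string{}
	for _, r := range risks {
		if ctx.Err() != nil {
			break // honor cancellation from the caller-provided context
		}
		out[r] = riskVersions[r]
	}
	return out
}

func main() {
	ctx := context.Background() // the operator itself would propagate or bound a real context
	fmt.Println(loadRiskConditions(ctx, []string{"riskA"}, map[string]string{"riskA": "4.17.0"}, nil))
}
```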
11092ac to c300606 (Compare)
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323
4e66b13 to bc68d14 (Compare)
bc68d14 to 01b9700 (Compare)
Follow up [1]. The issues are:

* It goes down to the per-target level when managing the riskConditions cache, which can be simplified by collecting risks while going through the targets.
* The cache may not get refreshed after re-evaluation of the risk.
* We self-throttle to avoid overloading Prometheus. It is a better strategy if we evaluate risks for the latest versions, because a cluster admin is more likely to upgrade to them.

With the pull, `riskConditions` is generated by:

* `loadRiskVersions` stores the risks and the latest version to which each is exposed.
* `risksInOrder` gets a list of risk names sorted by their latest version.
* `loadRiskConditions` evaluates the risks according to the order above.

Since all risks have been evaluated, the Recommended condition can be determined easily.

[1]. https://github.com/openshift/cluster-version-operator/pull/1284/changes#r2734918038
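To make the flow concrete, here is a hedged, self-contained Go sketch of the three steps. The function names mirror the description (`loadRiskVersions`, `risksInOrder`, `loadRiskConditions`), but the types, the evaluation stub, and the exact ordering details are illustrative assumptions, not the PR's implementation.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/blang/semver/v4"
)

type target struct {
	version semver.Version
	risks   []string
}

// loadRiskVersions: for every risk, remember the latest target version that is
// exposed to it, collected in a single pass over the targets.
func loadRiskVersions(targets []target) map[string]semver.Version {
	latest := map[string]semver.Version{}
	for _, t := range targets {
		for _, r := range t.risks {
			if v, ok := latest[r]; !ok || t.version.GT(v) {
				latest[r] = t.version
			}
		}
	}
	return latest
}

// risksInOrder: risk names sorted so risks on the latest versions come first,
// since an admin is most likely to update to those versions.
func risksInOrder(latest map[string]semver.Version) []string {
	names := make([]string, 0, len(latest))
	for n := range latest {
		names = append(names, n)
	}
	sort.SliceStable(names, func(i, j int) bool {
		if !latest[names[i]].EQ(latest[names[j]]) {
			return latest[names[i]].GT(latest[names[j]])
		}
		return names[i] < names[j] // deterministic tie-break
	})
	return names
}

// loadRiskConditions: evaluate each risk once, in the order above; the real
// code consults a condition registry (e.g. PromQL checks) instead of this stub.
func loadRiskConditions(risks []string) map[string]string {
	out := map[string]string{}
	for _, r := range risks {
		out[r] = "evaluated"
	}
	return out
}

func main() {
	targets := []target{
		{version: semver.MustParse("4.17.2"), risks: []string{"RiskA"}},
		{version: semver.MustParse("4.16.9"), risks: []string{"RiskA", "RiskB"}},
	}
	order := risksInOrder(loadRiskVersions(targets))
	fmt.Println(order)                     // [RiskA RiskB]
	fmt.Println(loadRiskConditions(order)) // map[RiskA:evaluated RiskB:evaluated]
}
```

Because every risk is evaluated up front, assembling the Recommended condition for each conditional update then reduces to looking up the already-computed risk conditions.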
01b9700 to 0a8b4d0 (Compare)
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

/retest-required

@hongkailiu: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

/test all