Skip to content

Generate riskConditions in a better way#1324

Open
hongkailiu wants to merge 2 commits intoopenshift:mainfrom
hongkailiu:evaluateConditionalUpdate
Open

Generate riskConditions in a better way#1324
hongkailiu wants to merge 2 commits intoopenshift:mainfrom
hongkailiu:evaluateConditionalUpdate

Conversation

@hongkailiu
Copy link
Member

@hongkailiu hongkailiu commented Feb 17, 2026

Follow up [1].

The issues are

  • It goes down to the per-target level when managing the
    riskConditions cache which can be simplified by collecting
    risks while going through the targets.
  • The cache may not get refreshed after re-evaluation
    of the risk.
  • We self-throttle to avoid overloading Prometheus. It is a
    better strategy if we evaluate risks for latest versions
    because a cluster admin is more likely to upgrade to them.

With the pull, riskConditions is generated by

  • loadRiskVersions stores the risks and the latest version
    to which it is exposed.
  • risksInOrder gets a list of risk names sorted by its latest
    version.
  • loadRiskConditions evaluates the risks according the order
    above.

Since all risk have been evaluated, the Recommended condition
can be determined easily.

[1]. https://github.com/openshift/cluster-version-operator/pull/1284/changes#r2734918038

@coderabbitai
Copy link

coderabbitai bot commented Feb 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Risk evaluation was reworked: risk-version mapping and ordering were extracted, risk conditions are preloaded during availability calculation, evaluateConditionalUpdate signature was simplified (removing context/registry), and conditional-update status mutations now operate on a defensive deep copy.

Changes

Cohort / File(s) Summary
Risk-evaluation core
pkg/cvo/availableupdates.go
Added riskWithVersion type and helpers loadRiskVersions, risksInOrder, loadRiskConditions; compute riskVersions, risks, and RiskConditions inside getAvailableUpdates; removed in-file RiskConditions init and the old evaluateConditionalUpdates flow; updated evaluateConditionalUpdate signature (removed context and conditionRegistry) and wired callers to pass riskConditions.
Tests and test helpers
pkg/cvo/availableupdates_test.go
Updated tests to the new evaluateConditionalUpdate signature (no registry), removed PromQL registry setup, pass riskConditions directly; added tests for loadRiskVersions, risksInOrder, and loadRiskConditions; import adjustments (added semver/v4).
Status mutation safety
pkg/cvo/status.go
Make a defensive deep copy of the conditional-updates slice and apply all mutations (Risks, RiskNames, RiskConditions) to the copy instead of the original.
Constants
pkg/internal/constants.go
Added exported constant ConditionalUpdateRiskConditionReasonInternalErrorFoundNoRiskCondition = "InternalErrorFoundNoRiskCondition".

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@hongkailiu
Copy link
Member Author

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/cvo/availableupdates.go (1)

629-632: Consider graceful error handling instead of panic.

If riskCondition is nil due to any data inconsistency (race condition, bug in helper functions, etc.), the operator will crash. While this "should never happen" given the current code flow, defensive error handling would improve reliability.

♻️ Suggested alternative
 riskCondition := meta.FindStatusCondition(riskConditions[risk.Name], internal.ConditionalUpdateRiskConditionTypeApplies)
 if riskCondition == nil {
-    panic(fmt.Sprintf("failed to find conditional update risk %s in risk conditions map", risk.Name))
+    klog.Errorf("BUG: failed to find conditional update risk %s in risk conditions map, treating as unknown", risk.Name)
+    recommended.Status = newRecommendedStatus(recommended.Status, metav1.ConditionUnknown)
+    recommended.Reason = newRecommendedReason(recommended.Reason, recommendedReasonEvaluationFailed)
+    errorMessages = append(errorMessages, fmt.Sprintf("Internal error evaluating risk %s", risk.Name))
+    continue
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cvo/availableupdates.go` around lines 629 - 632, Replace the panic when
meta.FindStatusCondition(...) returns nil with graceful handling: check if
riskCondition is nil and, instead of panicking, log an error including risk.Name
and the fact FindStatusCondition returned nil (include
internal.ConditionalUpdateRiskConditionTypeApplies in the message for context)
and then skip further processing of that risk (continue the loop) so the
operator does not crash; update the code paths around riskCondition,
meta.FindStatusCondition, and the loop iterating risks to handle the skipped
case safely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cvo/availableupdates_test.go`:
- Around line 872-900: The test is flaky because risksInOrder builds a slice
from the map and sorts only by version, so equal versions (e.g., riskB and
riskC) rely on non-deterministic map iteration; update the risksInOrder function
to add a deterministic secondary sort key (e.g., compare the risk name) when
versions are equal (use the existing riskWithVersion entries and sort.Slice or
sort.SliceStable comparator to first compare version, then compare risk.Name or
the map key) so ordering is stable, or alternatively change the test to not
expect a specific order for equal-version risks.

In `@pkg/cvo/availableupdates.go`:
- Around line 392-394: The inline comment on the sort is incorrect:
sort.SliceStable(temp, func(i, j int) bool { return
temp[i].version.GT(temp[j].version) }) actually sorts in descending order
(higher versions first); either change the comparator to use version.LT (or flip
indices) to produce ascending order, or update the comment to "Sort in
descending order" to match the current behavior—locate the sort.SliceStable call
and fix the comparator or the comment accordingly.

---

Nitpick comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 629-632: Replace the panic when meta.FindStatusCondition(...)
returns nil with graceful handling: check if riskCondition is nil and, instead
of panicking, log an error including risk.Name and the fact FindStatusCondition
returned nil (include internal.ConditionalUpdateRiskConditionTypeApplies in the
message for context) and then skip further processing of that risk (continue the
loop) so the operator does not crash; update the code paths around
riskCondition, meta.FindStatusCondition, and the loop iterating risks to handle
the skipped case safely.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 17, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2026
@hongkailiu hongkailiu force-pushed the evaluateConditionalUpdate branch 2 times, most recently from e6bcf40 to fdd9f72 Compare February 17, 2026 21:47
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 334-353: Risk evaluation currently runs while holding statusLock
and calls conditionRegistry.Match via loadRiskConditions using
context.Background(), which can block the lock and ignores cancellation; to fix,
inside the critical section (statusLock) copy the minimal state needed
(u.ConditionalUpdates, u.AcceptRisks, u.ShouldReconcileAcceptRisks, and any risk
lists generated by loadRiskVersions/risksInOrder) into local variables, release
the lock, then call loadRiskConditions/evaluateConditionalUpdate using a proper
context (propagate the caller ctx or create a bounded timeout context instead of
context.Background()); finally, re-acquire the lock only to write back the
computed RiskConditions and to update u.ConditionalUpdates via
u.addUpdate/u.removeUpdate and meta.SetStatusCondition so the lock is held only
for quick state mutation.

---

Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 385-394: risksInOrder builds a slice from the riskVersions map and
sorts only by version, causing nondeterministic order when versions are equal;
update the sort.SliceStable comparator in risksInOrder to use a secondary key
(the risk name) when versions compare equal (e.g., if
temp[i].version.EQ(temp[j].version) then return temp[i].name < temp[j].name) so
ordering is deterministic; keep the primary version comparison as-is and ensure
the comparator returns a consistent boolean for all cases.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/cvo/availableupdates.go (1)

634-637: Consider graceful degradation instead of panic.

The panic at line 636 will crash the operator if a risk name is missing from the precomputed riskConditions map. While this "should never happen" given the current flow, defensive error handling (e.g., returning ConditionUnknown similar to status.go lines 442-449) would be more resilient to future refactoring or edge cases.

♻️ Suggested defensive handling
 riskCondition := meta.FindStatusCondition(riskConditions[risk.Name], internal.ConditionalUpdateRiskConditionTypeApplies)
 if riskCondition == nil {
-    panic(fmt.Sprintf("failed to find conditional update risk %s in risk conditions map", risk.Name))
+    klog.Errorf("Internal error: missing risk condition for %s, treating as unknown", risk.Name)
+    riskCondition = &metav1.Condition{
+        Type:   internal.ConditionalUpdateRiskConditionTypeApplies,
+        Status: metav1.ConditionUnknown,
+        Reason: "InternalErrorNoConditionCollected",
+    }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cvo/availableupdates.go` around lines 634 - 637, Don't panic when
meta.FindStatusCondition(riskConditions[risk.Name],
internal.ConditionalUpdateRiskConditionTypeApplies) returns nil; instead create
a defensive "unknown" condition and continue. Replace the panic with logic that
assigns riskCondition to an explicit unknown condition (Type =
internal.ConditionalUpdateRiskConditionTypeApplies, Status =
metav1.ConditionUnknown, and a short Reason/Message like "MissingPrecomputed")
or call the existing status helper that returns ConditionUnknown used elsewhere,
so callers that expect riskCondition can proceed without crashing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 334-354: Risk evaluation (loadRiskConditions -> which calls
conditionRegistry.Match) is being executed while statusLock is held, causing
potential stalls; move the work off the critical section by computing
riskVersions (loadRiskVersions), risks (risksInOrder) and calling
loadRiskConditions outside the statusLock, then acquire the lock only to set
u.RiskConditions and to update u.ConditionalUpdates based on the
already-evaluated conditions; specifically, ensure conditionRegistry.Match is
invoked outside the lock and that evaluateConditionalUpdate and
meta.SetStatusCondition still update shared state only while holding statusLock.

---

Nitpick comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 634-637: Don't panic when
meta.FindStatusCondition(riskConditions[risk.Name],
internal.ConditionalUpdateRiskConditionTypeApplies) returns nil; instead create
a defensive "unknown" condition and continue. Replace the panic with logic that
assigns riskCondition to an explicit unknown condition (Type =
internal.ConditionalUpdateRiskConditionTypeApplies, Status =
metav1.ConditionUnknown, and a short Reason/Message like "MissingPrecomputed")
or call the existing status helper that returns ConditionUnknown used elsewhere,
so callers that expect riskCondition can proceed without crashing.

@hongkailiu hongkailiu force-pushed the evaluateConditionalUpdate branch from fdd9f72 to 11092ac Compare February 17, 2026 22:04
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cvo/availableupdates_test.go`:
- Around line 944-947: The test call to loadRiskConditions is missing the
required context parameter; create a context (e.g., ctx := context.Background()
or context.TODO()) before calling loadRiskConditions and pass it as the first
argument (change loadRiskConditions(tt.risks, tt.riskVersions, registry) to
loadRiskConditions(ctx, tt.risks, tt.riskVersions, registry)); also add the
"context" import to the test file if it isn’t already imported.

---

Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 329-355: The risk evaluation (loadRiskConditions) must not run
while holding the status lock; instead, under the lock copy the minimal state
needed (e.g., optr.availableUpdates.ConditionalUpdates or their DeepCopy
results, riskVersions from loadRiskVersions, and u.ConditionRegistry reference),
then release the lock and call loadRiskConditions(ctx, risks, riskVersions,
copiedConditionRegistry) outside the lock, and finally re-acquire the lock only
to assign u.ConditionalUpdates, u.RiskConditions and to apply results (the loop
that calls evaluateConditionalUpdate, meta.SetStatusCondition and updates via
u.addUpdate/u.removeUpdate and u.ConditionalUpdates[i].Conditions). Ensure you
reference loadRiskConditions, loadRiskVersions, evaluateConditionalUpdate,
u.ConditionRegistry, u.ConditionalUpdates, and u.RiskConditions when making the
change.

@hongkailiu hongkailiu force-pushed the evaluateConditionalUpdate branch from 11092ac to c300606 Compare February 17, 2026 22:21
@hongkailiu
Copy link
Member Author

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

@hongkailiu hongkailiu force-pushed the evaluateConditionalUpdate branch 3 times, most recently from 4e66b13 to bc68d14 Compare February 18, 2026 00:37
@hongkailiu hongkailiu force-pushed the evaluateConditionalUpdate branch from bc68d14 to 01b9700 Compare February 18, 2026 13:05
Follow up [1].

The issues are

* It goes down to the per-target level when managing the
  riskConditions cache which can be simplified by collecting
  risks while going through the targets.
* The cache may not get refreshed after re-evaluation
  of the risk.
* We self-throttle to avoid overloading Prometheus. It is a
  better strategy if we evaluate risks for latest versions
  because a cluster admin is more likely to upgrade to them.

With the pull, `riskConditions` is generated by

* `loadRiskVersions` stores the risks and the latest version
to which it is exposed.
* `risksInOrder` gets a list of risk names sorted by its latest
version.
* `loadRiskConditions` evaluates the risks according the order
above.

Since all risk have been evaluated, the Recommended condition
can be determined easily.

[1]. https://github.com/openshift/cluster-version-operator/pull/1284/changes#r2734918038
@hongkailiu hongkailiu force-pushed the evaluateConditionalUpdate branch from 01b9700 to 0a8b4d0 Compare February 18, 2026 13:08
@hongkailiu
Copy link
Member Author

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

@hongkailiu
Copy link
Member Author

/retest-required

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 18, 2026

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hongkailiu
Copy link
Member Author

/test all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments