Generate riskConditions in a better way #1324
hongkailiu wants to merge 2 commits into openshift:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Risk evaluation was reworked: risk-version mapping and ordering were extracted, and risk conditions are preloaded during availability calculation.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/cvo/availableupdates.go (1)
629-632: Consider graceful error handling instead of panic. If `riskCondition` is nil due to any data inconsistency (race condition, bug in helper functions, etc.), the operator will crash. While this "should never happen" given the current code flow, defensive error handling would improve reliability.

♻️ Suggested alternative

```diff
 riskCondition := meta.FindStatusCondition(riskConditions[risk.Name], internal.ConditionalUpdateRiskConditionTypeApplies)
 if riskCondition == nil {
-	panic(fmt.Sprintf("failed to find conditional update risk %s in risk conditions map", risk.Name))
+	klog.Errorf("BUG: failed to find conditional update risk %s in risk conditions map, treating as unknown", risk.Name)
+	recommended.Status = newRecommendedStatus(recommended.Status, metav1.ConditionUnknown)
+	recommended.Reason = newRecommendedReason(recommended.Reason, recommendedReasonEvaluationFailed)
+	errorMessages = append(errorMessages, fmt.Sprintf("Internal error evaluating risk %s", risk.Name))
+	continue
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cvo/availableupdates.go` around lines 629 - 632, Replace the panic when meta.FindStatusCondition(...) returns nil with graceful handling: check if riskCondition is nil and, instead of panicking, log an error including risk.Name and the fact FindStatusCondition returned nil (include internal.ConditionalUpdateRiskConditionTypeApplies in the message for context) and then skip further processing of that risk (continue the loop) so the operator does not crash; update the code paths around riskCondition, meta.FindStatusCondition, and the loop iterating risks to handle the skipped case safely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cvo/availableupdates_test.go`:
- Around line 872-900: The test is flaky because risksInOrder builds a slice
from the map and sorts only by version, so equal versions (e.g., riskB and
riskC) rely on non-deterministic map iteration; update the risksInOrder function
to add a deterministic secondary sort key (e.g., compare the risk name) when
versions are equal (use the existing riskWithVersion entries and sort.Slice or
sort.SliceStable comparator to first compare version, then compare risk.Name or
the map key) so ordering is stable, or alternatively change the test to not
expect a specific order for equal-version risks.
In `@pkg/cvo/availableupdates.go`:
- Around line 392-394: The inline comment on the sort is incorrect:
sort.SliceStable(temp, func(i, j int) bool { return
temp[i].version.GT(temp[j].version) }) actually sorts in descending order
(higher versions first); either change the comparator to use version.LT (or flip
indices) to produce ascending order, or update the comment to "Sort in
descending order" to match the current behavior—locate the sort.SliceStable call
and fix the comparator or the comment accordingly.
---
Nitpick comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 629-632: Replace the panic when meta.FindStatusCondition(...)
returns nil with graceful handling: check if riskCondition is nil and, instead
of panicking, log an error including risk.Name and the fact FindStatusCondition
returned nil (include internal.ConditionalUpdateRiskConditionTypeApplies in the
message for context) and then skip further processing of that risk (continue the
loop) so the operator does not crash; update the code paths around
riskCondition, meta.FindStatusCondition, and the loop iterating risks to handle
the skipped case safely.
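For illustration, here is a minimal, self-contained Go sketch of the deterministic-ordering fix described in the prompt above: sort primarily by version (descending, matching the existing comparator) and break ties by risk name. The `riskWithVersion` struct and field names are assumptions drawn from the review comments, not the PR's exact code.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/blang/semver/v4"
)

type riskWithVersion struct {
	name    string
	version semver.Version
}

func risksInOrder(riskVersions map[string]semver.Version) []string {
	temp := make([]riskWithVersion, 0, len(riskVersions))
	for name, v := range riskVersions {
		temp = append(temp, riskWithVersion{name: name, version: v})
	}
	sort.SliceStable(temp, func(i, j int) bool {
		// Primary key: higher version first (descending order).
		if !temp[i].version.EQ(temp[j].version) {
			return temp[i].version.GT(temp[j].version)
		}
		// Secondary key: risk name, so equal versions order deterministically
		// regardless of map iteration order.
		return temp[i].name < temp[j].name
	})
	names := make([]string, 0, len(temp))
	for _, t := range temp {
		names = append(names, t.name)
	}
	return names
}

func main() {
	fmt.Println(risksInOrder(map[string]semver.Version{
		"riskC": semver.MustParse("4.16.0"),
		"riskB": semver.MustParse("4.16.0"),
		"riskA": semver.MustParse("4.17.0"),
	}))
	// Output: [riskA riskB riskC]
}
```

With the name tie-break, risks that share a version (the riskB/riskC case in the flaky test) come out in the same order on every run.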
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files: Approvers can indicate their approval by writing `/approve` in a comment.
e6bcf40 to fdd9f72 (Compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 334-353: Risk evaluation currently runs while holding statusLock
and calls conditionRegistry.Match via loadRiskConditions using
context.Background(), which can block the lock and ignores cancellation; to fix,
inside the critical section (statusLock) copy the minimal state needed
(u.ConditionalUpdates, u.AcceptRisks, u.ShouldReconcileAcceptRisks, and any risk
lists generated by loadRiskVersions/risksInOrder) into local variables, release
the lock, then call loadRiskConditions/evaluateConditionalUpdate using a proper
context (propagate the caller ctx or create a bounded timeout context instead of
context.Background()); finally, re-acquire the lock only to write back the
computed RiskConditions and to update u.ConditionalUpdates via
u.addUpdate/u.removeUpdate and meta.SetStatusCondition so the lock is held only
for quick state mutation.
---
Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 385-394: risksInOrder builds a slice from the riskVersions map and
sorts only by version, causing nondeterministic order when versions are equal;
update the sort.SliceStable comparator in risksInOrder to use a secondary key
(the risk name) when versions compare equal (e.g., if
temp[i].version.EQ(temp[j].version) then return temp[i].name < temp[j].name) so
ordering is deterministic; keep the primary version comparison as-is and ensure
the comparator returns a consistent boolean for all cases.
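For reference, a rough, self-contained Go sketch of the locking pattern the statusLock comment above asks for: copy the needed inputs while holding the lock, run the (potentially slow, Prometheus-backed) risk evaluation outside it with a bounded context, then re-acquire the lock only to write results back. The types, field names, and the `evaluateRisks` stub are illustrative stand-ins, not the operator's actual structures.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type availableUpdates struct {
	statusLock         sync.Mutex
	ConditionalUpdates []string            // stand-in for the real conditional-update type
	RiskConditions     map[string][]string // stand-in for the evaluated risk conditions
}

// evaluateRisks stands in for loadRiskConditions/conditionRegistry.Match, which
// may block on network calls and therefore must not run under the lock.
func evaluateRisks(ctx context.Context, updates []string) map[string][]string {
	out := map[string][]string{}
	for _, u := range updates {
		if ctx.Err() != nil {
			return out // honor cancellation instead of ignoring it
		}
		out[u] = []string{"evaluated"}
	}
	return out
}

func (u *availableUpdates) refresh(ctx context.Context) {
	// 1. Copy the minimal state needed while holding the lock.
	u.statusLock.Lock()
	updates := append([]string(nil), u.ConditionalUpdates...)
	u.statusLock.Unlock()

	// 2. Do the slow evaluation outside the lock, with a bounded context
	//    instead of context.Background().
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	conditions := evaluateRisks(ctx, updates)

	// 3. Re-acquire the lock only for the quick write-back of results.
	u.statusLock.Lock()
	u.RiskConditions = conditions
	u.statusLock.Unlock()
}

func main() {
	u := &availableUpdates{ConditionalUpdates: []string{"4.17.2", "4.16.9"}}
	u.refresh(context.Background())
	fmt.Println(u.RiskConditions)
}
```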
🧹 Nitpick comments (1)
pkg/cvo/availableupdates.go (1)
634-637: Consider graceful degradation instead of panic. The `panic` at line 636 will crash the operator if a risk name is missing from the precomputed `riskConditions` map. While this "should never happen" given the current flow, defensive error handling (e.g., returning `ConditionUnknown` similar to status.go lines 442-449) would be more resilient to future refactoring or edge cases.

♻️ Suggested defensive handling

```diff
 riskCondition := meta.FindStatusCondition(riskConditions[risk.Name], internal.ConditionalUpdateRiskConditionTypeApplies)
 if riskCondition == nil {
-	panic(fmt.Sprintf("failed to find conditional update risk %s in risk conditions map", risk.Name))
+	klog.Errorf("Internal error: missing risk condition for %s, treating as unknown", risk.Name)
+	riskCondition = &metav1.Condition{
+		Type:   internal.ConditionalUpdateRiskConditionTypeApplies,
+		Status: metav1.ConditionUnknown,
+		Reason: "InternalErrorNoConditionCollected",
+	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cvo/availableupdates.go` around lines 634 - 637, Don't panic when meta.FindStatusCondition(riskConditions[risk.Name], internal.ConditionalUpdateRiskConditionTypeApplies) returns nil; instead create a defensive "unknown" condition and continue. Replace the panic with logic that assigns riskCondition to an explicit unknown condition (Type = internal.ConditionalUpdateRiskConditionTypeApplies, Status = metav1.ConditionUnknown, and a short Reason/Message like "MissingPrecomputed") or call the existing status helper that returns ConditionUnknown used elsewhere, so callers that expect riskCondition can proceed without crashing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 334-354: Risk evaluation (loadRiskConditions -> which calls
conditionRegistry.Match) is being executed while statusLock is held, causing
potential stalls; move the work off the critical section by computing
riskVersions (loadRiskVersions), risks (risksInOrder) and calling
loadRiskConditions outside the statusLock, then acquire the lock only to set
u.RiskConditions and to update u.ConditionalUpdates based on the
already-evaluated conditions; specifically, ensure conditionRegistry.Match is
invoked outside the lock and that evaluateConditionalUpdate and
meta.SetStatusCondition still update shared state only while holding statusLock.
---
Nitpick comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 634-637: Don't panic when
meta.FindStatusCondition(riskConditions[risk.Name],
internal.ConditionalUpdateRiskConditionTypeApplies) returns nil; instead create
a defensive "unknown" condition and continue. Replace the panic with logic that
assigns riskCondition to an explicit unknown condition (Type =
internal.ConditionalUpdateRiskConditionTypeApplies, Status =
metav1.ConditionUnknown, and a short Reason/Message like "MissingPrecomputed")
or call the existing status helper that returns ConditionUnknown used elsewhere,
so callers that expect riskCondition can proceed without crashing.
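Since the comments above lean on the apimachinery condition helpers, here is a small runnable sketch of the behavior they rely on: `meta.FindStatusCondition` returns nil when the condition type is absent (the case the panic guards against), and `meta.SetStatusCondition` upserts by `Type`. The `"Applies"` string is a stand-in for `internal.ConditionalUpdateRiskConditionTypeApplies`.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	var conditions []metav1.Condition

	// Absent type: FindStatusCondition returns nil, so callers must handle it,
	// e.g. by falling back to an Unknown condition instead of panicking.
	if meta.FindStatusCondition(conditions, "Applies") == nil {
		meta.SetStatusCondition(&conditions, metav1.Condition{
			Type:   "Applies",
			Status: metav1.ConditionUnknown,
			Reason: "MissingPrecomputed",
		})
	}

	// Upsert: setting the same Type again overwrites the existing entry
	// rather than appending a duplicate.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type:   "Applies",
		Status: metav1.ConditionFalse,
		Reason: "NotExposed",
	})

	fmt.Println(len(conditions), conditions[0].Status) // 1 False
}
```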
fdd9f72 to 11092ac (Compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cvo/availableupdates_test.go`:
- Around line 944-947: The test call to loadRiskConditions is missing the
required context parameter; create a context (e.g., ctx := context.Background()
or context.TODO()) before calling loadRiskConditions and pass it as the first
argument (change loadRiskConditions(tt.risks, tt.riskVersions, registry) to
loadRiskConditions(ctx, tt.risks, tt.riskVersions, registry)); also add the
"context" import to the test file if it isn’t already imported.
---
Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 329-355: The risk evaluation (loadRiskConditions) must not run
while holding the status lock; instead, under the lock copy the minimal state
needed (e.g., optr.availableUpdates.ConditionalUpdates or their DeepCopy
results, riskVersions from loadRiskVersions, and u.ConditionRegistry reference),
then release the lock and call loadRiskConditions(ctx, risks, riskVersions,
copiedConditionRegistry) outside the lock, and finally re-acquire the lock only
to assign u.ConditionalUpdates, u.RiskConditions and to apply results (the loop
that calls evaluateConditionalUpdate, meta.SetStatusCondition and updates via
u.addUpdate/u.removeUpdate and u.ConditionalUpdates[i].Conditions). Ensure you
reference loadRiskConditions, loadRiskVersions, evaluateConditionalUpdate,
u.ConditionRegistry, u.ConditionalUpdates, and u.RiskConditions when making the
change.
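A minimal, runnable sketch of the test-side fix described above: construct a context in the test and pass it as the first argument. The `loadRiskConditions` function here is a trivial stub with the shape the comment implies, not the real helper from pkg/cvo.

```go
package main

import (
	"context"
	"fmt"
)

// Stub with the argument order the review describes: ctx first, then the
// risks, their versions, and a registry-like argument (unused here).
func loadRiskConditions(ctx context.Context, risks []string, riskVersions map[string]string, registry any) map[string]string {
	out := map[string]string{}
	for _, r := range risks {
		if ctx.Err() != nil {
			break // honor cancellation from the caller-provided context
		}
		out[r] = riskVersions[r]
	}
	return out
}

func main() {
	ctx := context.Background() // the operator itself would propagate or bound a real context
	fmt.Println(loadRiskConditions(ctx, []string{"riskA"}, map[string]string{"riskA": "4.17.0"}, nil))
}
```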
11092ac to c300606 (Compare)
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323
4e66b13 to bc68d14 (Compare)
bc68d14 to 01b9700 (Compare)
Follow up [1]. The issues are:

* It goes down to the per-target level when managing the riskConditions cache, which can be simplified by collecting risks while going through the targets.
* The cache may not get refreshed after re-evaluation of the risk.
* We self-throttle to avoid overloading Prometheus. It is a better strategy if we evaluate risks for the latest versions, because a cluster admin is more likely to upgrade to them.

With the pull, `riskConditions` is generated by:

* `loadRiskVersions` stores the risks and the latest version to which each is exposed.
* `risksInOrder` gets a list of risk names sorted by their latest version.
* `loadRiskConditions` evaluates the risks according to the order above.

Since all risks have been evaluated, the Recommended condition can be determined easily.

[1]. https://github.com/openshift/cluster-version-operator/pull/1284/changes#r2734918038
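To make the flow concrete, here is a hedged, self-contained Go sketch of the three steps. The function names mirror the description (`loadRiskVersions`, `risksInOrder`, `loadRiskConditions`), but the types, the evaluation stub, and the exact ordering details are illustrative assumptions, not the PR's implementation.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/blang/semver/v4"
)

type target struct {
	version semver.Version
	risks   []string
}

// loadRiskVersions: for every risk, remember the latest target version that is
// exposed to it, collected in a single pass over the targets.
func loadRiskVersions(targets []target) map[string]semver.Version {
	latest := map[string]semver.Version{}
	for _, t := range targets {
		for _, r := range t.risks {
			if v, ok := latest[r]; !ok || t.version.GT(v) {
				latest[r] = t.version
			}
		}
	}
	return latest
}

// risksInOrder: risk names sorted so risks on the latest versions come first,
// since an admin is most likely to update to those versions.
func risksInOrder(latest map[string]semver.Version) []string {
	names := make([]string, 0, len(latest))
	for n := range latest {
		names = append(names, n)
	}
	sort.SliceStable(names, func(i, j int) bool {
		if !latest[names[i]].EQ(latest[names[j]]) {
			return latest[names[i]].GT(latest[names[j]])
		}
		return names[i] < names[j] // deterministic tie-break
	})
	return names
}

// loadRiskConditions: evaluate each risk once, in the order above; the real
// code consults a condition registry (e.g. PromQL checks) instead of this stub.
func loadRiskConditions(risks []string) map[string]string {
	out := map[string]string{}
	for _, r := range risks {
		out[r] = "evaluated"
	}
	return out
}

func main() {
	targets := []target{
		{version: semver.MustParse("4.17.2"), risks: []string{"RiskA"}},
		{version: semver.MustParse("4.16.9"), risks: []string{"RiskA", "RiskB"}},
	}
	order := risksInOrder(loadRiskVersions(targets))
	fmt.Println(order)                     // [RiskA RiskB]
	fmt.Println(loadRiskConditions(order)) // map[RiskA:evaluated RiskB:evaluated]
}
```

Because every risk is evaluated up front, assembling the Recommended condition for each conditional update then reduces to looking up the already-computed risk conditions.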
01b9700 to 0a8b4d0 (Compare)
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

/retest-required

@hongkailiu: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

/test all