[SDCICD-1830] Add retry mechanism with fallback model for LLM analysis engines by varunraokadaparthi · Pull Request #3231 · openshift/osde2e

varunraokadaparthi · 2026-05-18T23:41:37Z

Summary

Add exponential backoff retry logic (1m → 2m → 4m, 3 retries) for Gemini API calls in both analysis engines (internal/analysisengine, pkg/krknai/analysisengine)
On primary model (gemini-3.1-pro-preview) exhaustion, automatically switches to fallback model (gemini-2.5-pro) with the same retry strategy
Retryable error classification uses a status code allowlist (429, 500, 502, 503) via genai.APIError inspection; non-retryable errors (4xx, context cancellation, tool failures) short-circuit immediately
Both clients are pre-initialized at engine construction to avoid allocation and auth overhead on the retry path
Sentinel errors replace string matching for SDK-originated error conditions

Test plan

15 unit tests in internal/llm/retry_test.go covering: success paths, retryable/non-retryable status codes, primary exhaustion with fallback, both models exhausted, context cancellation, network timeouts, wrapped errors
Existing pkg/krknai/analysisengine tests pass unchanged
make build succeeds

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Analysis operations now automatically retry on transient failures with exponential backoff.
- Fallback model added to improve reliability when the primary model encounters issues.
- Enhanced handling of network timeouts, API errors, and other transient conditions.

…s engines Add exponential backoff retry logic (1m -> 2m -> 4m, 3 retries) for Gemini API calls in both analysis engines. On primary model exhaustion, automatically switches to gemini-2.5-pro as fallback with the same retry strategy. Non-retryable errors (4xx, context cancellation, tool failures) short-circuit immediately without retry or fallback. Retryable error classification uses a status code allowlist (429, 500, 502, 503) via genai.APIError inspection. Both fallback and primary clients are pre-initialized at engine construction to avoid repeated allocation and auth overhead on the retry path. Co-Authored-By: Claude Code <noreply@anthropic.com>

openshift-merge-bot · 2026-05-18T23:41:40Z

There are test jobs defined for this repository which are not configured to run automatically. Comment /test ? to see a list of all defined jobs. Review these jobs and use /test <job> to manually trigger jobs most likely to be impacted by the proposed changes.Comment /pipeline required to trigger all required & necessary jobs.

coderabbitai · 2026-05-18T23:41:45Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bd7c3136-74ab-4f69-9a97-d771fff846d8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Walkthrough

The PR implements a resilient LLM analysis pipeline with a primary/fallback model strategy. It introduces sentinel errors to classify Gemini failures, adds exponential-backoff retry orchestration that promotes to a fallback model on exhaustion, integrates both patterns into internal and public engines, and validates behavior across success paths, transient failures, API errors, and edge cases.

Changes

LLM Retry and Fallback Pattern

Layer / File(s)	Summary
Sentinel Errors and Fallback Model `internal/llm/errors.go`, `internal/llm/gemini.go`	Defines exported sentinel error variables (`ErrNoResponseCandidates`, `ErrNoContentInResponse`, `ErrToolCallFailed`, `ErrMaxIterations`) and adds `FallbackModel` constant for a second Gemini model variant.
Gemini Error Classification `internal/llm/gemini.go`	Updates error handling in response parsing and tool execution to emit sentinel errors instead of formatted strings; wraps tool failures with `ErrToolCallFailed` and returns `ErrMaxIterations` on max-iteration termination.
Retry Orchestration with Primary/Fallback `internal/llm/retry.go`	Implements `AnalyzeWithRetry` to run primary analysis with exponential-backoff retries; on exhaustion, switches to fallback function with identical retry logic; returns aggregated error if both fail. Includes `isRetryable` predicate for API codes, context cancellation, network timeouts, and sentinel errors.
Engine Integration `internal/analysisengine/engine.go`, `pkg/krknai/analysisengine/engine.go`	Both engines add `fallbackLLMClient` field and initialize a second Gemini client with `FallbackModel` in `New`; `Run` replaces direct primary-client calls with `llm.AnalyzeWithRetry`, passing context logger and closures for primary and fallback analysis.
Comprehensive Retry Tests `internal/llm/retry_test.go`	Validates primary-only success, transient failure recovery, primary exhaustion with fallback promotion, dual exhaustion with aggregated errors, non-retryable error handling, HTTP status code retryability, context cancellation, network timeouts, sentinel error classification, and error wrapping preservation through multiple layers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 9.52% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and clearly summarizes the primary changes: adding retry mechanism with fallback model support to LLM analysis engines.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	This PR uses Go's standard testing package, not Ginkgo. All test names are stable identifiers with no dynamic content, timestamps, UUIDs, or pod/node/namespace names that change between runs.
Test Structure And Quality	✅ Passed	Custom check targets Ginkgo tests (It/Describe/BeforeEach blocks). The PR adds standard Go tests using testing.T. No Ginkgo tests present, so check is not applicable.
Microshift Test Compatibility	✅ Passed	No Ginkgo e2e tests are added in this PR. All test additions are standard Go unit tests in internal/llm/retry_test.go using testing.T, not e2e tests. The check is not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR does not introduce new Ginkgo e2e tests. All changes are internal LLM logic and unit tests using Go's standard testing framework, which are not subject to SNO compatibility requirements.
Topology-Aware Scheduling Compatibility	✅ Passed	Check not applicable. PR contains only Go library code—LLM retry logic and error handling—with no Kubernetes deployment manifests, operator code, or scheduling constraints.
Ote Binary Stdout Contract	✅ Passed	PR modifies only library packages. No main packages, init/TestMain functions, or direct stdout writes. Uses logr.Logger for logging. OTE contract satisfied.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests are added. PR adds only standard Go unit tests to internal/llm/retry_test.go using testing.T. The custom check for IPv6/disconnected network compatibility is not applicable.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-05-18T23:41:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: varunraokadaparthi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [varunraokadaparthi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

varunraokadaparthi · 2026-05-18T23:42:55Z

@coderabbitai review

coderabbitai · 2026-05-18T23:43:01Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/llm/gemini.go (1)

109-117: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return ErrMaxIterations instead of success on the final tool-call iteration.

Line 109-114 returns nil error on the last iteration, so the max-iteration failure path at Line 117 is bypassed.

Suggested fix

 func (g *GeminiClient) handleConversationWithTools(ctx context.Context, contents []*genai.Content, genConfig *genai.GenerateContentConfig, toolRegistry *tools.Registry) (*AnalysisResult, error) {
 	const maxIterations = 5
 	var toolCalls []*genai.FunctionCall
+	var lastTextContent string
 
 	for i := range maxIterations {
 		resp, err := g.client.Models.GenerateContent(ctx, g.model, contents, genConfig)
 		if err != nil {
 			return nil, fmt.Errorf("gemini API error: %w", err)
@@
 		textContent, functionCalls := g.processCandidateParts(candidate)
+		lastTextContent = textContent
@@
-		// Return partial result if we hit max iterations
-		if i == maxIterations-1 {
-			return &AnalysisResult{
-				Content:   textContent,
-				ToolCalls: toolCalls,
-			}, nil
-		}
+		_ = i
 	}
 
-	return &AnalysisResult{ToolCalls: toolCalls}, ErrMaxIterations
+	return &AnalysisResult{
+		Content:   lastTextContent,
+		ToolCalls: toolCalls,
+	}, ErrMaxIterations
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/llm/gemini.go` around lines 109 - 117, The final iteration branch
currently returns (&AnalysisResult{Content: textContent, ToolCalls: toolCalls},
nil) which bypasses the ErrMaxIterations path; update the branch in the loop
that checks if i == maxIterations-1 to return the AnalysisResult along with
ErrMaxIterations instead of nil so the caller sees the max-iteration failure
(referencing AnalysisResult, ErrMaxIterations, i, maxIterations, textContent,
and toolCalls).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/llm/retry_test.go`:
- Line 15: The file internal/llm/retry_test.go is not gofumpt-formatted (issue
at line 15); run the gofumpt formatter on that file to rewrite it to gofumpt
style, stage the resulting changes, and re-run the linter so the test file
(retry_test.go) conforms to the project's gofumpt formatting rules.
- Around line 317-320: The test's fallbackFn currently returns an invalid (nil,
nil) tuple after calling t.Fatal; change fallbackFn in
internal/llm/retry_test.go so it returns a non-nil error instead of nil for the
error value (e.g., return nil, errors.New("fallback called unexpectedly") or
fmt.Errorf(...)) to avoid nilnil; ensure you add the required import (errors or
fmt) if not already present.

In `@internal/llm/retry.go`:
- Around line 37-52: Before switching to the fallback, save the original primary
error (the err returned from retryLoop with primaryFn) into a distinct variable
(e.g., primaryErr) so it isn't lost; then when the fallback retryLoop (with
fallbackFn) also fails, return a combined error that preserves both failures
(use errors.Join(primaryErr, err) or fmt.Errorf with both messages) and include
both in the logger.Error context so diagnostics show both primary and fallback
failures.
- Around line 101-104: The errors.As check uses a value-typed target
(genai.APIError) which won't match the SDK's pointer error type; change the
target to a pointer (e.g., declare apiErr as *genai.APIError) and call
errors.As(err, &apiErr) so the match succeeds and retryableStatusCodes is
consulted for apiErr.Code; update the logic around the variable name `apiErr` in
retry.go to use the pointer type and keep the existing return of
retryableStatusCodes[apiErr.Code].

---

Outside diff comments:
In `@internal/llm/gemini.go`:
- Around line 109-117: The final iteration branch currently returns
(&AnalysisResult{Content: textContent, ToolCalls: toolCalls}, nil) which
bypasses the ErrMaxIterations path; update the branch in the loop that checks if
i == maxIterations-1 to return the AnalysisResult along with ErrMaxIterations
instead of nil so the caller sees the max-iteration failure (referencing
AnalysisResult, ErrMaxIterations, i, maxIterations, textContent, and toolCalls).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2bc7f5b6-0d3e-4dd5-86d0-245f9f6591a7

📥 Commits

Reviewing files that changed from the base of the PR and between 8c5716a and 9b2d95d.

📒 Files selected for processing (6)

internal/analysisengine/engine.go
internal/llm/errors.go
internal/llm/gemini.go
internal/llm/retry.go
internal/llm/retry_test.go
pkg/krknai/analysisengine/engine.go

…xhaustion Rename err to primaryErr/fallbackErr in AnalyzeWithRetry for clarity. Use errors.Join to preserve both primary and fallback errors when both models are exhausted, so callers and logs see the full failure picture. Co-Authored-By: Claude Code <noreply@anthropic.com>

Co-Authored-By: Claude Code <noreply@anthropic.com>

ritmun · 2026-05-19T14:02:56Z

Could wait.PollUntil functions be used anywhere?

YiqinZhang · 2026-05-19T14:50:24Z

 			return &AnalysisResult{
 				Content:   textContent,
 				ToolCalls: toolCalls,
 			}, nil


Suggested change

}, ErrMaxIterations

…, integration tests - 2 retries on primary model, 1 attempt on fallback, graceful error message - Flat 30s retry delay (no exponential backoff) - Integration tests with mock Gemini server for all error code scenarios - Store failed analysis result so LLM errors appear in Slack notifications Co-Authored-By: Claude Code <noreply@anthropic.com>

openshift-ci · 2026-05-21T22:56:42Z

@varunraokadaparthi: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from christophermancini and ritmun May 18, 2026 23:41

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread internal/llm/retry_test.go Outdated

Comment thread internal/llm/retry_test.go

Comment thread internal/llm/retry.go Outdated

Comment thread internal/llm/retry.go

varunraokadaparthi and others added 2 commits May 18, 2026 17:00

[SDCICD-1830] Fix gofumpt formatting

8217630

Co-Authored-By: Claude Code <noreply@anthropic.com>

YiqinZhang reviewed May 19, 2026

View reviewed changes

varunraokadaparthi force-pushed the retry-mechanism branch from 5d3d505 to f4b0b75 Compare May 21, 2026 22:34

varunraokadaparthi force-pushed the retry-mechanism branch from f4b0b75 to 7dbc4df Compare May 21, 2026 22:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SDCICD-1830] Add retry mechanism with fallback model for LLM analysis engines#3231

[SDCICD-1830] Add retry mechanism with fallback model for LLM analysis engines#3231
varunraokadaparthi wants to merge 4 commits into
openshift:mainfrom
varunraokadaparthi:retry-mechanism

varunraokadaparthi commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-merge-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Review skipped

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented May 18, 2026

Uh oh!

varunraokadaparthi commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ritmun commented May 19, 2026

Uh oh!

YiqinZhang May 19, 2026

Uh oh!

openshift-ci Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

varunraokadaparthi commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

openshift-merge-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented May 18, 2026

Uh oh!

varunraokadaparthi commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ritmun commented May 19, 2026

Uh oh!

YiqinZhang May 19, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

varunraokadaparthi commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading