Skip to content

[SDCICD-1830] Add retry mechanism with fallback model for LLM analysis engines#3231

Open
varunraokadaparthi wants to merge 4 commits into
openshift:mainfrom
varunraokadaparthi:retry-mechanism
Open

[SDCICD-1830] Add retry mechanism with fallback model for LLM analysis engines#3231
varunraokadaparthi wants to merge 4 commits into
openshift:mainfrom
varunraokadaparthi:retry-mechanism

Conversation

@varunraokadaparthi
Copy link
Copy Markdown
Contributor

@varunraokadaparthi varunraokadaparthi commented May 18, 2026

Summary

  • Add exponential backoff retry logic (1m → 2m → 4m, 3 retries) for Gemini API calls in both analysis engines (internal/analysisengine, pkg/krknai/analysisengine)
  • On primary model (gemini-3.1-pro-preview) exhaustion, automatically switches to fallback model (gemini-2.5-pro) with the same retry strategy
  • Retryable error classification uses a status code allowlist (429, 500, 502, 503) via genai.APIError inspection; non-retryable errors (4xx, context cancellation, tool failures) short-circuit immediately
  • Both clients are pre-initialized at engine construction to avoid allocation and auth overhead on the retry path
  • Sentinel errors replace string matching for SDK-originated error conditions

Test plan

  • 15 unit tests in internal/llm/retry_test.go covering: success paths, retryable/non-retryable status codes, primary exhaustion with fallback, both models exhausted, context cancellation, network timeouts, wrapped errors
  • Existing pkg/krknai/analysisengine tests pass unchanged
  • make build succeeds

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features
    • Analysis operations now automatically retry on transient failures with exponential backoff.
    • Fallback model added to improve reliability when the primary model encounters issues.
    • Enhanced handling of network timeouts, API errors, and other transient conditions.

…s engines

Add exponential backoff retry logic (1m -> 2m -> 4m, 3 retries) for
Gemini API calls in both analysis engines. On primary model exhaustion,
automatically switches to gemini-2.5-pro as fallback with the same
retry strategy. Non-retryable errors (4xx, context cancellation, tool
failures) short-circuit immediately without retry or fallback.

Retryable error classification uses a status code allowlist (429, 500,
502, 503) via genai.APIError inspection.

Both fallback and primary clients are pre-initialized at engine
construction to avoid repeated allocation and auth overhead on the
retry path.

Co-Authored-By: Claude Code <noreply@anthropic.com>
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

There are test jobs defined for this repository which are not configured to run automatically. Comment /test ? to see a list of all defined jobs. Review these jobs and use /test <job> to manually trigger jobs most likely to be impacted by the proposed changes.Comment /pipeline required to trigger all required & necessary jobs.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bd7c3136-74ab-4f69-9a97-d771fff846d8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

The PR implements a resilient LLM analysis pipeline with a primary/fallback model strategy. It introduces sentinel errors to classify Gemini failures, adds exponential-backoff retry orchestration that promotes to a fallback model on exhaustion, integrates both patterns into internal and public engines, and validates behavior across success paths, transient failures, API errors, and edge cases.

Changes

LLM Retry and Fallback Pattern

Layer / File(s) Summary
Sentinel Errors and Fallback Model
internal/llm/errors.go, internal/llm/gemini.go
Defines exported sentinel error variables (ErrNoResponseCandidates, ErrNoContentInResponse, ErrToolCallFailed, ErrMaxIterations) and adds FallbackModel constant for a second Gemini model variant.
Gemini Error Classification
internal/llm/gemini.go
Updates error handling in response parsing and tool execution to emit sentinel errors instead of formatted strings; wraps tool failures with ErrToolCallFailed and returns ErrMaxIterations on max-iteration termination.
Retry Orchestration with Primary/Fallback
internal/llm/retry.go
Implements AnalyzeWithRetry to run primary analysis with exponential-backoff retries; on exhaustion, switches to fallback function with identical retry logic; returns aggregated error if both fail. Includes isRetryable predicate for API codes, context cancellation, network timeouts, and sentinel errors.
Engine Integration
internal/analysisengine/engine.go, pkg/krknai/analysisengine/engine.go
Both engines add fallbackLLMClient field and initialize a second Gemini client with FallbackModel in New; Run replaces direct primary-client calls with llm.AnalyzeWithRetry, passing context logger and closures for primary and fallback analysis.
Comprehensive Retry Tests
internal/llm/retry_test.go
Validates primary-only success, transient failure recovery, primary exhaustion with fallback promotion, dual exhaustion with aggregated errors, non-retryable error handling, HTTP status code retryability, context cancellation, network timeouts, sentinel error classification, and error wrapping preservation through multiple layers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.52% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the primary changes: adding retry mechanism with fallback model support to LLM analysis engines.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR uses Go's standard testing package, not Ginkgo. All test names are stable identifiers with no dynamic content, timestamps, UUIDs, or pod/node/namespace names that change between runs.
Test Structure And Quality ✅ Passed Custom check targets Ginkgo tests (It/Describe/BeforeEach blocks). The PR adds standard Go tests using testing.T. No Ginkgo tests present, so check is not applicable.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All test additions are standard Go unit tests in internal/llm/retry_test.go using testing.T, not e2e tests. The check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR does not introduce new Ginkgo e2e tests. All changes are internal LLM logic and unit tests using Go's standard testing framework, which are not subject to SNO compatibility requirements.
Topology-Aware Scheduling Compatibility ✅ Passed Check not applicable. PR contains only Go library code—LLM retry logic and error handling—with no Kubernetes deployment manifests, operator code, or scheduling constraints.
Ote Binary Stdout Contract ✅ Passed PR modifies only library packages. No main packages, init/TestMain functions, or direct stdout writes. Uses logr.Logger for logging. OTE contract satisfied.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests are added. PR adds only standard Go unit tests to internal/llm/retry_test.go using testing.T. The custom check for IPv6/disconnected network compatibility is not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 18, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: varunraokadaparthi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026
@varunraokadaparthi
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/llm/gemini.go (1)

109-117: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return ErrMaxIterations instead of success on the final tool-call iteration.

Line 109-114 returns nil error on the last iteration, so the max-iteration failure path at Line 117 is bypassed.

Suggested fix
 func (g *GeminiClient) handleConversationWithTools(ctx context.Context, contents []*genai.Content, genConfig *genai.GenerateContentConfig, toolRegistry *tools.Registry) (*AnalysisResult, error) {
 	const maxIterations = 5
 	var toolCalls []*genai.FunctionCall
+	var lastTextContent string
 
 	for i := range maxIterations {
 		resp, err := g.client.Models.GenerateContent(ctx, g.model, contents, genConfig)
 		if err != nil {
 			return nil, fmt.Errorf("gemini API error: %w", err)
@@
 		textContent, functionCalls := g.processCandidateParts(candidate)
+		lastTextContent = textContent
@@
-		// Return partial result if we hit max iterations
-		if i == maxIterations-1 {
-			return &AnalysisResult{
-				Content:   textContent,
-				ToolCalls: toolCalls,
-			}, nil
-		}
+		_ = i
 	}
 
-	return &AnalysisResult{ToolCalls: toolCalls}, ErrMaxIterations
+	return &AnalysisResult{
+		Content:   lastTextContent,
+		ToolCalls: toolCalls,
+	}, ErrMaxIterations
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/llm/gemini.go` around lines 109 - 117, The final iteration branch
currently returns (&AnalysisResult{Content: textContent, ToolCalls: toolCalls},
nil) which bypasses the ErrMaxIterations path; update the branch in the loop
that checks if i == maxIterations-1 to return the AnalysisResult along with
ErrMaxIterations instead of nil so the caller sees the max-iteration failure
(referencing AnalysisResult, ErrMaxIterations, i, maxIterations, textContent,
and toolCalls).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/llm/retry_test.go`:
- Line 15: The file internal/llm/retry_test.go is not gofumpt-formatted (issue
at line 15); run the gofumpt formatter on that file to rewrite it to gofumpt
style, stage the resulting changes, and re-run the linter so the test file
(retry_test.go) conforms to the project's gofumpt formatting rules.
- Around line 317-320: The test's fallbackFn currently returns an invalid (nil,
nil) tuple after calling t.Fatal; change fallbackFn in
internal/llm/retry_test.go so it returns a non-nil error instead of nil for the
error value (e.g., return nil, errors.New("fallback called unexpectedly") or
fmt.Errorf(...)) to avoid nilnil; ensure you add the required import (errors or
fmt) if not already present.

In `@internal/llm/retry.go`:
- Around line 37-52: Before switching to the fallback, save the original primary
error (the err returned from retryLoop with primaryFn) into a distinct variable
(e.g., primaryErr) so it isn't lost; then when the fallback retryLoop (with
fallbackFn) also fails, return a combined error that preserves both failures
(use errors.Join(primaryErr, err) or fmt.Errorf with both messages) and include
both in the logger.Error context so diagnostics show both primary and fallback
failures.
- Around line 101-104: The errors.As check uses a value-typed target
(genai.APIError) which won't match the SDK's pointer error type; change the
target to a pointer (e.g., declare apiErr as *genai.APIError) and call
errors.As(err, &apiErr) so the match succeeds and retryableStatusCodes is
consulted for apiErr.Code; update the logic around the variable name `apiErr` in
retry.go to use the pointer type and keep the existing return of
retryableStatusCodes[apiErr.Code].

---

Outside diff comments:
In `@internal/llm/gemini.go`:
- Around line 109-117: The final iteration branch currently returns
(&AnalysisResult{Content: textContent, ToolCalls: toolCalls}, nil) which
bypasses the ErrMaxIterations path; update the branch in the loop that checks if
i == maxIterations-1 to return the AnalysisResult along with ErrMaxIterations
instead of nil so the caller sees the max-iteration failure (referencing
AnalysisResult, ErrMaxIterations, i, maxIterations, textContent, and toolCalls).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2bc7f5b6-0d3e-4dd5-86d0-245f9f6591a7

📥 Commits

Reviewing files that changed from the base of the PR and between 8c5716a and 9b2d95d.

📒 Files selected for processing (6)
  • internal/analysisengine/engine.go
  • internal/llm/errors.go
  • internal/llm/gemini.go
  • internal/llm/retry.go
  • internal/llm/retry_test.go
  • pkg/krknai/analysisengine/engine.go

Comment thread internal/llm/retry_test.go Outdated
Comment thread internal/llm/retry_test.go
Comment thread internal/llm/retry.go Outdated
Comment thread internal/llm/retry.go
varunraokadaparthi and others added 2 commits May 18, 2026 17:00
…xhaustion

Rename err to primaryErr/fallbackErr in AnalyzeWithRetry for clarity.
Use errors.Join to preserve both primary and fallback errors when both
models are exhausted, so callers and logs see the full failure picture.

Co-Authored-By: Claude Code <noreply@anthropic.com>
Co-Authored-By: Claude Code <noreply@anthropic.com>
@ritmun
Copy link
Copy Markdown
Contributor

ritmun commented May 19, 2026

Could wait.PollUntil functions be used anywhere?

Comment thread internal/llm/gemini.go
return &AnalysisResult{
Content: textContent,
ToolCalls: toolCalls,
}, nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}, ErrMaxIterations

…, integration tests

- 2 retries on primary model, 1 attempt on fallback, graceful error message
- Flat 30s retry delay (no exponential backoff)
- Integration tests with mock Gemini server for all error code scenarios
- Store failed analysis result so LLM errors appear in Slack notifications

Co-Authored-By: Claude Code <noreply@anthropic.com>
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 21, 2026

@varunraokadaparthi: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants