Improve CSI recover loop observability with consistent logging by mrhapile · Pull Request #5664 · fluid-cloudnative/fluid

mrhapile · 2026-02-04T10:56:17Z

This PR improves observability of the CSI recover loop by standardizing log
levels and adding structured context to recovery-related logs, without
changing recovery behavior.

Problem

The CSI recovery logic emitted logs with inconsistent severity levels and
limited contextual information, making it difficult for operators to
understand recovery state transitions and diagnose issues in production.

What this PR does

Aligns log levels with Kubernetes conventions:
- Info: normal flow and recovery state transitions
- Warning: recoverable mount and unmount failures
- Error: unexpected or API-level failures
Adds structured context to logs (mountPath, source path, mount count,
thresholds) for easier debugging
Logs recovery start, skip, cleanup, and success paths consistently
Keeps changes strictly log-only (no behavior or retry logic changes)

Design notes

No public API changes
No behavior change on successful or failed recoveries
Suitable for long-running CSI daemonsets
Improves operator experience without increasing log noise

Tests

Existing CSI recovery tests continue to pass
Some mount-dependent tests are skipped on non-Linux platforms due to
k8s.io/utils/mount limitations (pre-existing behavior)

Verification

go test ./pkg/csi/recover/... -v

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>

fluid-e2e-bot · 2026-02-04T10:56:25Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ronggu for approval by writing /assign @ronggu in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist · 2026-02-04T10:56:32Z

Summary of Changes

Hello @mrhapile, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the diagnostic capabilities of the CSI recovery mechanism. By standardizing log levels and enriching log messages with structured contextual information, it provides operators with a clearer, more consistent view into the recovery process. This enhancement aims to simplify troubleshooting and monitoring of CSI mount operations without introducing any changes to the core recovery logic.

Highlights

Standardized Log Levels: Log levels for the CSI recovery loop have been aligned with Kubernetes conventions: 'Info' for normal flow and state transitions, 'Warning' for recoverable mount/unmount failures, and 'Error' for unexpected or API-level failures.
Enhanced Log Context: Structured context, such as mountPath, source path, mount count, and thresholds, has been added to recovery-related logs to facilitate easier debugging and understanding of recovery states.
Consistent Recovery Path Logging: Logging for recovery start, skip, cleanup, and success paths is now consistent, providing a clearer picture of the recovery process.
No Behavioral Changes: The changes are strictly log-only, ensuring that there are no alterations to the recovery behavior or retry logic.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

pkg/csi/recover/recover.go
- Updated error logging for GetBrokenMountPoints to provide more specific details.
- Added an informational log when broken mount points are detected, indicating a state transition.
- Changed mount failure logs from Errorf to Warningf and included structured context like mountPath and source.
- Changed unmount duplicate failure logs from Errorf to Warningf and included structured context like mountPath.
- Updated error logging for API calls (GetNamespacedNameByVolumeId, GetDataset) to be more specific about the failure and include relevant identifiers.
- Modified log messages for skipped recoveries (due to lock contention or already cleaned up mounts) to include structured context and appropriate log levels.
- Added an informational log at the start of a recovery attempt, detailing mountPath, source, and mountCount.
- Updated the warning log for excessive mount counts to include structured context like mountPath, count, and threshold.
- Added an informational log upon successful recovery, indicating a state transition from broken to healthy.

Activity

This pull request addresses and fixes issue Standardize CSI recover loop logging for clearer observability #5663.
The author has provided a detailed problem statement, outlining the difficulties operators faced with inconsistent logging.
The PR description clearly defines what changes are made, focusing on log standardization and context enrichment.
Design notes confirm no public API or behavior changes, making it suitable for long-running CSI daemonsets.
Verification steps include running go test ./pkg/csi/recover/... -v to confirm functionality.
Existing CSI recovery tests continue to pass, ensuring no regressions.
Some mount-dependent tests are noted to be skipped on non-Linux platforms due to pre-existing k8s.io/utils/mount limitations.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

fluid-e2e-bot · 2026-02-04T10:56:36Z

Hi @mrhapile. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sonarqubecloud · 2026-02-04T10:57:11Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

gemini-code-assist

Code Review

This pull request does a great job of improving the observability of the CSI recover loop by standardizing log levels and adding structured context. The changes are clear and align well with the stated goals.

I've found one issue related to error handling in recoverBrokenMount where an error from r.Mount is not propagated. This prevents FuseRecoverFailed events from being recorded, which is contrary to the goal of improving observability. Fixing this would be a behavior change, but it's a bug fix that I believe is important for the correctness of the recovery logic's reporting. Please see my detailed comment.

gemini-code-assist · 2026-02-04T10:58:13Z

pkg/csi/recover/recover.go

-	glog.V(3).Infof("FuseRecovery: Start exec cmd: mount %s %s -o %v \n", point.SourcePath, point.MountPath, mountOption)
+	// Info: Attempting recovery action
+	glog.V(3).Infof("FuseRecovery: attempting bind mount, source=%s mountPath=%s options=%v", point.SourcePath, point.MountPath, mountOption)
 	if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {


The use of := here shadows the named return variable err. This causes the function to always return nil, even if r.Mount fails. Consequently, the caller doRecover never receives an error, and the FuseRecoverFailed event is never recorded, which undermines the goal of improving observability. To fix this, you should use = to assign to the existing err variable so the error can be propagated.

Suggested change

if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {

if err = r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {

Copilot

Pull request overview

This PR refines logging in the CSI Fuse recovery loop to improve observability of recovery state and mount/unmount behavior without intentionally changing control flow. It brings log levels in line with Kubernetes-style conventions and adds structured context (paths, counts, thresholds) around recovery decisions.

Changes:

Standardizes log severity for recovery-related operations (info for normal transitions, warnings for recoverable mount/unmount issues, errors for unexpected or API failures).
Adds structured, contextual logging around mount detection, recovery start/skip, duplicate unmount cleanup, and successful recoveries.
Refines error logs in eventRecord to carry more precise context (dataset name/namespace, volume ID) for debugging API-level failures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-04T11:05:29Z

pkg/csi/recover/recover.go

-	glog.V(3).Infof("FuseRecovery: Start exec cmd: mount %s %s -o %v \n", point.SourcePath, point.MountPath, mountOption)
+	// Info: Attempting recovery action
+	glog.V(3).Infof("FuseRecovery: attempting bind mount, source=%s mountPath=%s options=%v", point.SourcePath, point.MountPath, mountOption)
 	if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {


recoverBrokenMount declares a named return value err but never assigns to it; the short variable declaration in the if err := r.Mount(...); err != nil block shadows the named return and the function always returns nil. As a result, callers cannot distinguish success from failure based on this return value, which is misleading given the signature; consider either returning the actual mount error (e.g., by assigning to the named err instead of shadowing) or changing the function signature to not return an error at all.

Suggested change

if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {

err = r.Mount(point.SourcePath, point.MountPath, "none", mountOption)

if err != nil {

Copilot · 2026-02-04T11:05:29Z

pkg/csi/recover/recover.go

 	if err := r.recoverBrokenMount(point); err != nil {
+		// Warning logged inside recoverBrokenMount, just record event
 		r.eventRecord(point, corev1.EventTypeWarning, common.FuseRecoverFailed)
 		return
 	}


The if err := r.recoverBrokenMount(point); err != nil branch is effectively dead code because recoverBrokenMount never returns a non-nil error (its named return value is never set). This means the FuseRecoverFailed event is never emitted even when the mount operation fails, which weakens observability and contradicts the intent of this logging-focused change; once recoverBrokenMount is updated to return real errors, this branch should be exercised by tests to ensure failure events are recorded.

Suggested change

if err := r.recoverBrokenMount(point); err != nil {

// Warning logged inside recoverBrokenMount, just record event

r.eventRecord(point, corev1.EventTypeWarning, common.FuseRecoverFailed)

return

}

r.recoverBrokenMount(point)

csi recover loop

461ab61

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>

Copilot AI review requested due to automatic review settings February 4, 2026 10:56

fluid-e2e-bot bot added the needs-ok-to-test label Feb 4, 2026

gemini-code-assist bot reviewed Feb 4, 2026

View reviewed changes

Copilot started reviewing on behalf of mrhapile February 4, 2026 10:59 View session

Copilot AI reviewed Feb 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve CSI recover loop observability with consistent logging#5664

Improve CSI recover loop observability with consistent logging#5664
mrhapile wants to merge 1 commit intofluid-cloudnative:masterfrom
mrhapile:csi-recover-loop

mrhapile commented Feb 4, 2026

Uh oh!

fluid-e2e-bot bot commented Feb 4, 2026

Uh oh!

gemini-code-assist bot commented Feb 4, 2026

Uh oh!

fluid-e2e-bot bot commented Feb 4, 2026

Uh oh!

sonarqubecloud bot commented Feb 4, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Feb 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 4, 2026

Uh oh!

Copilot AI Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
	if err = r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {

	if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
	err = r.Mount(point.SourcePath, point.MountPath, "none", mountOption)
	if err != nil {

Conversation

mrhapile commented Feb 4, 2026

Problem

What this PR does

Design notes

Tests

Verification

Uh oh!

fluid-e2e-bot bot commented Feb 4, 2026

Uh oh!

gemini-code-assist bot commented Feb 4, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

fluid-e2e-bot bot commented Feb 4, 2026

Uh oh!

sonarqubecloud bot commented Feb 4, 2026

Quality Gate passed

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant