Skip to content

Improve CSI recover loop observability with consistent logging#5664

Open
mrhapile wants to merge 1 commit intofluid-cloudnative:masterfrom
mrhapile:csi-recover-loop
Open

Improve CSI recover loop observability with consistent logging#5664
mrhapile wants to merge 1 commit intofluid-cloudnative:masterfrom
mrhapile:csi-recover-loop

Conversation

@mrhapile
Copy link

@mrhapile mrhapile commented Feb 4, 2026

Fixes #5663

This PR improves observability of the CSI recover loop by standardizing log
levels and adding structured context to recovery-related logs, without
changing recovery behavior.

Problem

The CSI recovery logic emitted logs with inconsistent severity levels and
limited contextual information, making it difficult for operators to
understand recovery state transitions and diagnose issues in production.

What this PR does

  • Aligns log levels with Kubernetes conventions:
    • Info: normal flow and recovery state transitions
    • Warning: recoverable mount and unmount failures
    • Error: unexpected or API-level failures
  • Adds structured context to logs (mountPath, source path, mount count,
    thresholds) for easier debugging
  • Logs recovery start, skip, cleanup, and success paths consistently
  • Keeps changes strictly log-only (no behavior or retry logic changes)

Design notes

  • No public API changes
  • No behavior change on successful or failed recoveries
  • Suitable for long-running CSI daemonsets
  • Improves operator experience without increasing log noise

Tests

  • Existing CSI recovery tests continue to pass
  • Some mount-dependent tests are skipped on non-Linux platforms due to
    k8s.io/utils/mount limitations (pre-existing behavior)

Verification

go test ./pkg/csi/recover/... -v

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
Copilot AI review requested due to automatic review settings February 4, 2026 10:56
@fluid-e2e-bot
Copy link

fluid-e2e-bot bot commented Feb 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ronggu for approval by writing /assign @ronggu in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @mrhapile, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the diagnostic capabilities of the CSI recovery mechanism. By standardizing log levels and enriching log messages with structured contextual information, it provides operators with a clearer, more consistent view into the recovery process. This enhancement aims to simplify troubleshooting and monitoring of CSI mount operations without introducing any changes to the core recovery logic.

Highlights

  • Standardized Log Levels: Log levels for the CSI recovery loop have been aligned with Kubernetes conventions: 'Info' for normal flow and state transitions, 'Warning' for recoverable mount/unmount failures, and 'Error' for unexpected or API-level failures.
  • Enhanced Log Context: Structured context, such as mountPath, source path, mount count, and thresholds, has been added to recovery-related logs to facilitate easier debugging and understanding of recovery states.
  • Consistent Recovery Path Logging: Logging for recovery start, skip, cleanup, and success paths is now consistent, providing a clearer picture of the recovery process.
  • No Behavioral Changes: The changes are strictly log-only, ensuring that there are no alterations to the recovery behavior or retry logic.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • pkg/csi/recover/recover.go
    • Updated error logging for GetBrokenMountPoints to provide more specific details.
    • Added an informational log when broken mount points are detected, indicating a state transition.
    • Changed mount failure logs from Errorf to Warningf and included structured context like mountPath and source.
    • Changed unmount duplicate failure logs from Errorf to Warningf and included structured context like mountPath.
    • Updated error logging for API calls (GetNamespacedNameByVolumeId, GetDataset) to be more specific about the failure and include relevant identifiers.
    • Modified log messages for skipped recoveries (due to lock contention or already cleaned up mounts) to include structured context and appropriate log levels.
    • Added an informational log at the start of a recovery attempt, detailing mountPath, source, and mountCount.
    • Updated the warning log for excessive mount counts to include structured context like mountPath, count, and threshold.
    • Added an informational log upon successful recovery, indicating a state transition from broken to healthy.
Activity
  • This pull request addresses and fixes issue Standardize CSI recover loop logging for clearer observability #5663.
  • The author has provided a detailed problem statement, outlining the difficulties operators faced with inconsistent logging.
  • The PR description clearly defines what changes are made, focusing on log standardization and context enrichment.
  • Design notes confirm no public API or behavior changes, making it suitable for long-running CSI daemonsets.
  • Verification steps include running go test ./pkg/csi/recover/... -v to confirm functionality.
  • Existing CSI recovery tests continue to pass, ensuring no regressions.
  • Some mount-dependent tests are noted to be skipped on non-Linux platforms due to pre-existing k8s.io/utils/mount limitations.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@fluid-e2e-bot
Copy link

fluid-e2e-bot bot commented Feb 4, 2026

Hi @mrhapile. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 4, 2026

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request does a great job of improving the observability of the CSI recover loop by standardizing log levels and adding structured context. The changes are clear and align well with the stated goals.

I've found one issue related to error handling in recoverBrokenMount where an error from r.Mount is not propagated. This prevents FuseRecoverFailed events from being recorded, which is contrary to the goal of improving observability. Fixing this would be a behavior change, but it's a bug fix that I believe is important for the correctness of the recovery logic's reporting. Please see my detailed comment.

glog.V(3).Infof("FuseRecovery: Start exec cmd: mount %s %s -o %v \n", point.SourcePath, point.MountPath, mountOption)
// Info: Attempting recovery action
glog.V(3).Infof("FuseRecovery: attempting bind mount, source=%s mountPath=%s options=%v", point.SourcePath, point.MountPath, mountOption)
if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The use of := here shadows the named return variable err. This causes the function to always return nil, even if r.Mount fails. Consequently, the caller doRecover never receives an error, and the FuseRecoverFailed event is never recorded, which undermines the goal of improving observability. To fix this, you should use = to assign to the existing err variable so the error can be propagated.

Suggested change
if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
if err = r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines logging in the CSI Fuse recovery loop to improve observability of recovery state and mount/unmount behavior without intentionally changing control flow. It brings log levels in line with Kubernetes-style conventions and adds structured context (paths, counts, thresholds) around recovery decisions.

Changes:

  • Standardizes log severity for recovery-related operations (info for normal transitions, warnings for recoverable mount/unmount issues, errors for unexpected or API failures).
  • Adds structured, contextual logging around mount detection, recovery start/skip, duplicate unmount cleanup, and successful recoveries.
  • Refines error logs in eventRecord to carry more precise context (dataset name/namespace, volume ID) for debugging API-level failures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

glog.V(3).Infof("FuseRecovery: Start exec cmd: mount %s %s -o %v \n", point.SourcePath, point.MountPath, mountOption)
// Info: Attempting recovery action
glog.V(3).Infof("FuseRecovery: attempting bind mount, source=%s mountPath=%s options=%v", point.SourcePath, point.MountPath, mountOption)
if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recoverBrokenMount declares a named return value err but never assigns to it; the short variable declaration in the if err := r.Mount(...); err != nil block shadows the named return and the function always returns nil. As a result, callers cannot distinguish success from failure based on this return value, which is misleading given the signature; consider either returning the actual mount error (e.g., by assigning to the named err instead of shadowing) or changing the function signature to not return an error at all.

Suggested change
if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
err = r.Mount(point.SourcePath, point.MountPath, "none", mountOption)
if err != nil {

Copilot uses AI. Check for mistakes.
Comment on lines 285 to 289
if err := r.recoverBrokenMount(point); err != nil {
// Warning logged inside recoverBrokenMount, just record event
r.eventRecord(point, corev1.EventTypeWarning, common.FuseRecoverFailed)
return
}
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if err := r.recoverBrokenMount(point); err != nil branch is effectively dead code because recoverBrokenMount never returns a non-nil error (its named return value is never set). This means the FuseRecoverFailed event is never emitted even when the mount operation fails, which weakens observability and contradicts the intent of this logging-focused change; once recoverBrokenMount is updated to return real errors, this branch should be exercised by tests to ensure failure events are recorded.

Suggested change
if err := r.recoverBrokenMount(point); err != nil {
// Warning logged inside recoverBrokenMount, just record event
r.eventRecord(point, corev1.EventTypeWarning, common.FuseRecoverFailed)
return
}
r.recoverBrokenMount(point)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Standardize CSI recover loop logging for clearer observability

1 participant