Improve CSI recover loop observability with consistent logging #5664
mrhapile wants to merge 1 commit into fluid-cloudnative:master
Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
Summary of Changes (Gemini Code Assist)
This pull request significantly improves the diagnostic capabilities of the CSI recovery mechanism. By standardizing log levels and enriching log messages with structured contextual information, it provides operators with a clearer, more consistent view into the recovery process. This enhancement aims to simplify troubleshooting and monitoring of CSI mount operations without introducing any changes to the core recovery logic.
Hi @mrhapile. Thanks for your PR. I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test.
Code Review
This pull request does a great job of improving the observability of the CSI recover loop by standardizing log levels and adding structured context. The changes are clear and align well with the stated goals.
I've found one issue related to error handling in recoverBrokenMount where an error from r.Mount is not propagated. This prevents FuseRecoverFailed events from being recorded, which is contrary to the goal of improving observability. Fixing this would be a behavior change, but it's a bug fix that I believe is important for the correctness of the recovery logic's reporting. Please see my detailed comment.
glog.V(3).Infof("FuseRecovery: Start exec cmd: mount %s %s -o %v \n", point.SourcePath, point.MountPath, mountOption)
// Info: Attempting recovery action
glog.V(3).Infof("FuseRecovery: attempting bind mount, source=%s mountPath=%s options=%v", point.SourcePath, point.MountPath, mountOption)
if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
The use of := here shadows the named return variable err. This causes the function to always return nil, even if r.Mount fails. Consequently, the caller doRecover never receives an error, and the FuseRecoverFailed event is never recorded, which undermines the goal of improving observability. To fix this, you should use = to assign to the existing err variable so the error can be propagated.
Suggested change:
- if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
+ if err = r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
Pull request overview
This PR refines logging in the CSI Fuse recovery loop to improve observability of recovery state and mount/unmount behavior without intentionally changing control flow. It brings log levels in line with Kubernetes-style conventions and adds structured context (paths, counts, thresholds) around recovery decisions.
Changes:
- Standardizes log severity for recovery-related operations (info for normal transitions, warnings for recoverable mount/unmount issues, errors for unexpected or API failures).
- Adds structured, contextual logging around mount detection, recovery start/skip, duplicate unmount cleanup, and successful recoveries.
- Refines error logs in eventRecord to carry more precise context (dataset name/namespace, volume ID) for debugging API-level failures.
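As an illustration of the key=value message shape described in the list above, the following sketch builds one such line with the standard library only. `formatMountLog` is a hypothetical helper, not part of the PR, and the paths in `main` are made-up examples; the PR itself passes an equivalent format string straight to glog.

```go
package main

import "fmt"

// formatMountLog is a hypothetical helper that renders a recovery log
// message with key=value context, matching the shape of the
// "attempting bind mount" line shown in the diff.
func formatMountLog(action, source, mountPath string, options []string) string {
	return fmt.Sprintf("FuseRecovery: %s, source=%s mountPath=%s options=%v",
		action, source, mountPath, options)
}

func main() {
	// Illustrative paths only; real values come from the mount point.
	fmt.Println(formatMountLog("attempting bind mount", "/src", "/dst", []string{"bind"}))
}
```

Keeping the keys stable across messages makes the lines easy to grep or parse when following a single mount path through a recovery cycle.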
glog.V(3).Infof("FuseRecovery: Start exec cmd: mount %s %s -o %v \n", point.SourcePath, point.MountPath, mountOption)
// Info: Attempting recovery action
glog.V(3).Infof("FuseRecovery: attempting bind mount, source=%s mountPath=%s options=%v", point.SourcePath, point.MountPath, mountOption)
if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
recoverBrokenMount declares a named return value err but never assigns to it; the short variable declaration in the if err := r.Mount(...); err != nil block shadows the named return and the function always returns nil. As a result, callers cannot distinguish success from failure based on this return value, which is misleading given the signature; consider either returning the actual mount error (e.g., by assigning to the named err instead of shadowing) or changing the function signature to not return an error at all.
Suggested change:
- if err := r.Mount(point.SourcePath, point.MountPath, "none", mountOption); err != nil {
+ err = r.Mount(point.SourcePath, point.MountPath, "none", mountOption)
+ if err != nil {
if err := r.recoverBrokenMount(point); err != nil {
	// Warning logged inside recoverBrokenMount, just record event
	r.eventRecord(point, corev1.EventTypeWarning, common.FuseRecoverFailed)
	return
}
The if err := r.recoverBrokenMount(point); err != nil branch is effectively dead code because recoverBrokenMount never returns a non-nil error (its named return value is never set). This means the FuseRecoverFailed event is never emitted even when the mount operation fails, which weakens observability and contradicts the intent of this logging-focused change; once recoverBrokenMount is updated to return real errors, this branch should be exercised by tests to ensure failure events are recorded.
Suggested change:
- if err := r.recoverBrokenMount(point); err != nil {
- 	// Warning logged inside recoverBrokenMount, just record event
- 	r.eventRecord(point, corev1.EventTypeWarning, common.FuseRecoverFailed)
- 	return
- }
+ r.recoverBrokenMount(point)



Fixes #5663
This PR improves observability of the CSI recover loop by standardizing log
levels and adding structured context to recovery-related logs, without
changing recovery behavior.
Problem
The CSI recovery logic emitted logs with inconsistent severity levels and
limited contextual information, making it difficult for operators to
understand recovery state transitions and diagnose issues in production.
What this PR does
thresholds) for easier debugging
Design notes
Tests
k8s.io/utils/mount limitations (pre-existing behavior)
Verification
go test ./pkg/csi/recover/... -v