HDDS-13590. Refactor HealthyPipelineSafeModeRule to not use PipelineReportFromDatanode #9651

priyeshkaratha · 2026-01-20T07:17:04Z

What changes were proposed in this pull request?

Refactored HealthyPipelineSafeModeRule to make it simpler and more reliable.
Earlier, it depended on events and internal state to track pipelines. Now, it directly checks the current state from the PipelineManager whenever validation is needed.

HealthyPipelineSafeModeRule.java

Removed event-based tracking: Eliminated processedPipelineIDs and unProcessedPipelineSet HashSets that tracked pipeline events incrementally
Direct validation approach: Modified validate() method to directly query PipelineManager for current open pipelines instead of relying on accumulated event state
New health check logic: Added isPipelineHealthy() helper method that validates:
Pipeline has exactly 3 nodes
All nodes are in healthy state (NodeStatus.inServiceHealthy())
All nodes are registered with the node manager
Simplified event processing: The process() method now only logs events instead of maintaining state, as validation happens directly during validate() calls
Cleaner lifecycle management: Updated cleanup() and initializeRule() methods to remove tracking state management

SafeModeMetrics.java

Changed currentHealthyPipelinesCount from counter (MutableCounterLong) to gauge (MutableGaugeLong)
Replaced incCurrentHealthyPipelinesCount() with setNumCurrentHealthyPipelines(long)
This aligns with the new approach where the value is set directly based on current state rather than incremented on events

Test Updates

TestHealthyPipelineSafeModeRule.java: Removed assertions that relied on event processing sequence; updated expected log messages
TestSCMSafeModeManager.java: Significant updates to handle immediate availability of healthy pipeline counts and added logic to handle edge cases where pipeline thresholds exceed available pipelines

What is the link to the Apache JIRA

HDDS-13590

How was this patch tested?

Tested using existing testcases written for safemode validation rules and Safemode Manager.
Validated using forked CI.

sumitagrawl

@priyeshkaratha Have few comments, logs will increase too much with change and rule once satisfied may not be satisfied next time, which is change in behavior. Check the impact.

sumitagrawl · 2026-01-28T18:10:43Z

...erver-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java

+
+    LOG.info("Found {} open RATIS/THREE pipelines", openPipelines.size());
+    currentHealthyPipelineCount = (int) openPipelines.stream()
+        .filter(this::isPipelineHealthy)


Earlier, for every pipeline event, pipeline info is logged once. But now it will be logged for all pipeline for every event, can result in a lot of INFO logs for pipeline only.
Also will be triggered for EC pipeline type to be logged which is not logged earlier.

sumitagrawl · 2026-01-28T18:23:56Z

...erver-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java

+    // Verify pipeline has all 3 nodes
+    List<DatanodeDetails> nodes = pipeline.getNodes();
+    if (nodes.size() != 3) {
+      LOG.info("Pipeline {} is not healthy: has {} nodes instead of 3",


this logging will happen now 3 times for every pipeline event, 2 times in validate() as called pre-process() and post-process(), and one time in getStatus(). And this will be done for every pipeline in the SCM for each event.

sumitagrawl · 2026-01-28T18:32:26Z

...erver-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java

          "finalization. Bypassing healthy pipeline safemode rule.");
      return true;
+    }
+    // Query PipelineManager directly for healthy pipeline count


In current implementation, if Rule is satisfied once, its marked exit. Never enter again. But with having dynamic rule check, one time may be success and other time may be failure.
This may not satisfy the purpose.

HDDS-13590. Refactoring HealthyPipelineSafeModeRule

efb72e1

priyeshkaratha marked this pull request as ready for review January 20, 2026 10:17

ChenSammi self-requested a review January 21, 2026 03:35

sumitagrawl reviewed Jan 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-13590. Refactor HealthyPipelineSafeModeRule to not use PipelineReportFromDatanode #9651

HDDS-13590. Refactor HealthyPipelineSafeModeRule to not use PipelineReportFromDatanode #9651

priyeshkaratha commented Jan 20, 2026 •

edited

Loading

Uh oh!

sumitagrawl left a comment

Uh oh!

sumitagrawl Jan 28, 2026

Uh oh!

sumitagrawl Jan 28, 2026

Uh oh!

sumitagrawl Jan 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HDDS-13590. Refactor HealthyPipelineSafeModeRule to not use PipelineReportFromDatanode #9651

Are you sure you want to change the base?

HDDS-13590. Refactor HealthyPipelineSafeModeRule to not use PipelineReportFromDatanode #9651

Conversation

priyeshkaratha commented Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

sumitagrawl left a comment

Choose a reason for hiding this comment

Uh oh!

sumitagrawl Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

sumitagrawl Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

sumitagrawl Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

priyeshkaratha commented Jan 20, 2026 •

edited

Loading