Skip to content

Conversation

@priyeshkaratha
Copy link
Contributor

@priyeshkaratha priyeshkaratha commented Jan 20, 2026

What changes were proposed in this pull request?

Refactored HealthyPipelineSafeModeRule to make it simpler and more reliable.
Earlier, it depended on events and internal state to track pipelines. Now, it directly checks the current state from the PipelineManager whenever validation is needed.

HealthyPipelineSafeModeRule.java

  • Removed event-based tracking: Eliminated processedPipelineIDs and unProcessedPipelineSet HashSets that tracked pipeline events incrementally
  • Direct validation approach: Modified validate() method to directly query PipelineManager for current open pipelines instead of relying on accumulated event state
  • New health check logic: Added isPipelineHealthy() helper method that validates:
    Pipeline has exactly 3 nodes
    All nodes are in healthy state (NodeStatus.inServiceHealthy())
    All nodes are registered with the node manager
  • Simplified event processing: The process() method now only logs events instead of maintaining state, as validation happens directly during validate() calls
  • Cleaner lifecycle management: Updated cleanup() and initializeRule() methods to remove tracking state management

SafeModeMetrics.java

  • Changed currentHealthyPipelinesCount from counter (MutableCounterLong) to gauge (MutableGaugeLong)
    Replaced incCurrentHealthyPipelinesCount() with setNumCurrentHealthyPipelines(long)
  • This aligns with the new approach where the value is set directly based on current state rather than incremented on events

Test Updates

  • TestHealthyPipelineSafeModeRule.java: Removed assertions that relied on event processing sequence; updated expected log messages
  • TestSCMSafeModeManager.java: Significant updates to handle immediate availability of healthy pipeline counts and added logic to handle edge cases where pipeline thresholds exceed available pipelines

What is the link to the Apache JIRA

HDDS-13590

How was this patch tested?

Tested using existing testcases written for safemode validation rules and Safemode Manager.
Validated using forked CI.

@priyeshkaratha priyeshkaratha marked this pull request as ready for review January 20, 2026 10:17
@ChenSammi ChenSammi self-requested a review January 21, 2026 03:35
Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@priyeshkaratha Have few comments, logs will increase too much with change and rule once satisfied may not be satisfied next time, which is change in behavior. Check the impact.


LOG.info("Found {} open RATIS/THREE pipelines", openPipelines.size());
currentHealthyPipelineCount = (int) openPipelines.stream()
.filter(this::isPipelineHealthy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, for every pipeline event, pipeline info is logged once. But now it will be logged for all pipeline for every event, can result in a lot of INFO logs for pipeline only.
Also will be triggered for EC pipeline type to be logged which is not logged earlier.

// Verify pipeline has all 3 nodes
List<DatanodeDetails> nodes = pipeline.getNodes();
if (nodes.size() != 3) {
LOG.info("Pipeline {} is not healthy: has {} nodes instead of 3",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logging will happen now 3 times for every pipeline event, 2 times in validate() as called pre-process() and post-process(), and one time in getStatus(). And this will be done for every pipeline in the SCM for each event.

"finalization. Bypassing healthy pipeline safemode rule.");
return true;
}
// Query PipelineManager directly for healthy pipeline count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In current implementation, if Rule is satisfied once, its marked exit. Never enter again. But with having dynamic rule check, one time may be success and other time may be failure.
This may not satisfy the purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants