Skip to content

Conversation

@b-wu26
Copy link

@b-wu26 b-wu26 commented Dec 11, 2025

While using cortex, my team recently ran into an issue with the alert managers where a corrupt config was synced and extended across the entire alert manager ring. This led to our entire fleet being stuck in a crash loop back off and prompted a need to limit the blast radius of potential issues to the replication factor

What this PR does:

  • creates a feature flag which disables the replica set extension capabilities called: DisableReplicaSetExtension

Which issue(s) this PR fixes:
#7200

Checklist

  • [x ] Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@b-wu26 b-wu26 changed the title Fixing the shuffle sharding algorithm and doing api improments Fixing the shuffle sharding algorithm and doing api improvements Dec 23, 2025
randN := rand.Intn(len(replicationSet.Instances))
instances = replicationSet.Instances[randN : randN+1]
// For POST requests, add retry logic to PutSilence
if d.isUnaryWritePath(r.URL.Path) {
Copy link
Contributor

@rajagopalanand rajagopalanand Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be just a || in the if condition? for example if req.GetMethod() == "GET" || req.GetMethod() == "DELETE" || d.isUnaryWritePath(r.URL.Path)?

receivers = append(receivers, parsed...)
}

merged := mergeV2Receivers(receivers)
Copy link
Contributor

@rajagopalanand rajagopalanand Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be done as part of the for loop? I mean can we maintain a map and dedup as we go

@b-wu26 b-wu26 force-pushed the fix-shuffle-shard branch 2 times, most recently from e8e8590 to 2e1bdf7 Compare January 8, 2026 18:13
)

// RingOp is the operation used for reading/writing to the alertmanagers.
// Original ring operations (with extension enabled for backward compatibility)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call it Original. We are not deprecating this one and DisableReplicaSetExtension is false by default.

// Original ring operations (with extension enabled for backward compatibility)
var RingOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, func(s ring.InstanceState) bool {
// Only ACTIVE Alertmanager get requests. If instance is not ACTIVE, we need to find another Alertmanager.
// Original behavior: extend replica set if instance is not ACTIVE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove all original behavior from this PR

@b-wu26 b-wu26 force-pushed the fix-shuffle-shard branch 2 times, most recently from 38b63a6 to 74784e6 Compare January 8, 2026 19:57
@b-wu26 b-wu26 changed the title Fixing the shuffle sharding algorithm and doing api improvements Creating feature flag for disabling replica set extension Jan 8, 2026
@b-wu26 b-wu26 force-pushed the fix-shuffle-shard branch 2 times, most recently from 146915c to 3b363bf Compare January 8, 2026 20:07
@b-wu26 b-wu26 force-pushed the fix-shuffle-shard branch from 3b363bf to 78ced0f Compare January 8, 2026 21:37
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants