Flaky test fixes #580
base: master
Conversation
m1kola
left a comment
There is a moment during the operator upgrade where the resource has the status of AppDB and OM set to running. This happens very briefly before the operator starts reconciling OM and sets the OM status to Pending.
Great observation! I think we should address this root cause instead of hacking around and making tests pass.
It will likely take more time, so I suggest separating this work from the other flake fixes (where we rightfully increase timeouts).
@m1kola Do you think it is worth investing the time to look into the code for setting different status values? Is it something critical for customers? Otherwise, I would opt for fixing the test, which is a cheap but important fix for our CI stability.
class MongoDBCommon:
    @TRACER.start_as_current_span("wait_for")
-   def wait_for(self, fn, timeout=None, should_raise=True):
+   def wait_for(self, fn, timeout=None, should_raise=True, persist_for=1):
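For context, here is a minimal sketch of what a persist_for-style check could look like inside a polling helper. The retry interval, clock handling, and exception semantics are assumptions for illustration, not the actual implementation in this PR:

```python
import time

def wait_for(fn, timeout=300, should_raise=True, persist_for=1, interval=3):
    """Poll fn until it returns True persist_for times in a row, or the timeout expires.

    Hypothetical sketch: persist_for counts required consecutive successes,
    so persist_for=1 keeps the original "first success wins" behaviour.
    """
    deadline = time.time() + timeout
    consecutive = 0
    while time.time() < deadline:
        if fn():
            consecutive += 1
            if consecutive >= persist_for:
                return True
        else:
            # A single failure resets the streak of successes.
            consecutive = 0
        time.sleep(interval)
    if should_raise:
        raise AssertionError(f"Condition did not hold {persist_for} consecutive times within {timeout}s")
    return False
```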
is persist_for essentially equivalent to "required consecutive successes to pass"?
If yes, then please leave a brief comment or rename it; it took me a bit of thinking to get the meaning of this param.
And one of the reasons was that "persist for" suggests that the problem persists, or that the goal is not achieved yet and the situation persists. When something succeeds, it's not often described as a situation that "persists". But I might be nitpicking here.
is persist_for essentially equivalent to "required consecutive successes to pass"?
Yes
Not the best name, I know; I couldn't find a better one. I can add a comment, but do you have a better option? no_of_successful_passes?
While I'm not a fan of Gomega and Ginkgo, I think we should separate "wait for something" from "consistently meets something", like they do:
- Wait for a condition - https://onsi.github.io/gomega/#eventually
- Ensure a condition passes consistently - https://onsi.github.io/gomega/#consistently (edited: corrected the link)
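As an illustration of that separation, a hedged Python analog of Gomega's two primitives; the names eventually/consistently and the polling details are assumptions for the sake of the example, not code from this repository:

```python
import time

def eventually(fn, timeout=300, interval=3):
    """Return True as soon as fn() succeeds once within the timeout (Gomega's Eventually)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if fn():
            return True
        time.sleep(interval)
    return False

def consistently(fn, duration=30, interval=3):
    """Return True only if fn() succeeds on every poll for the whole duration (Gomega's Consistently)."""
    deadline = time.time() + duration
    while time.time() < deadline:
        if not fn():
            return False
        time.sleep(interval)
    return True
```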
As we discussed, I created a new method assert_persist_phase and opened a ticket to track the root issue.
Please take another look.
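For reference, a hedged sketch of what such an assert_persist_phase helper might look like on top of a polling primitive. The method name comes from the discussion, but the signature, defaults, and status accessor are assumptions:

```python
import time

class MongoDBCommon:
    def assert_persist_phase(self, phase, persist_for=5, interval=3):
        """Assert that the resource is in `phase` and stays there for
        persist_for consecutive checks, taken `interval` seconds apart.

        Hypothetical sketch: any single observation of a different phase
        fails the assertion, which rules out a brief status blip.
        """
        for _ in range(persist_for):
            current = self.get_status_phase()  # assumed accessor, not from this PR
            assert current == phase, f"expected phase {phase}, got {current}"
            time.sleep(interval)
```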
Thanks! I like that we now have the two high-level methods assert_reaches_phase and assert_persist_phase separated.
I also like how we now have assert_reaches_phase and assert_persist_phase, but still find wait_for to be a bit confusing. It's difficult to get what it does just by looking at the signature.
I'm not going to block on this, but if you can think of a way to leave wait_for as it was before and introduce another low-level function with a simple signature, that would be awesome.
We are discussing this in Slack at the moment.
I'm in favor of merging this because we have too many flaky tests at the moment and this is annoying. I agree that the proper thing to do is to fix the root cause, but it's unclear when we will have time dedicated to that.
m1kola
left a comment
Thanks for opening the ticket for the root cause as we discussed.
I would suggest opening separate PRs for separate fixes next time. For example, the timeout increases could have been merged much sooner without having to wait for a bigger discussion.
Summary
This PR aims to reduce the flakiness of the following tests:
- e2e_multi_cluster_sharded_snippets: increased the timeout of test_running, since in failing runs, by the time the diagnostics are collected, the resources have become ready.
- e2e_multi_cluster_appdb_upgrade_downgrade_v1_27_to_mck: increased the timeout of test_scale_appdb. Similarly, the assertion on the AppDB status fails, but by the time diagnostics are collected, the resource has become ready.
- e2e_appdb_tls_operator_upgrade_v1_32_to_mck: this test has a race condition. There is a moment during the operator upgrade where the resource has the status of both AppDB and OM set to Running. This happens very briefly, before the operator starts reconciling OM and sets the OM status to Pending. In that moment, the test very quickly passes both assertions and moves on to assert healthiness by connecting to OM, which fails since OM was not actually ready.

To fix this, I added a persist_for flag to our assertion methods. It makes sure that the phase we are currently asserting is reached and then persists for a number of retries, as shown in the sketch below.
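To make the race concrete, a hedged before/after sketch of the assertion pattern in the upgrade test; the fixture names, Phase values, timeouts, and the persist_for value are illustrative placeholders rather than the actual test code:

```python
# Before: a single momentary "Running" blip during the upgrade was enough to pass.
ops_manager.appdb_status().assert_reaches_phase(Phase.Running, timeout=600)
ops_manager.om_status().assert_reaches_phase(Phase.Running, timeout=600)

# After: the phase must also persist across consecutive checks, so the
# short-lived Running status seen before OM reconciliation no longer counts.
ops_manager.appdb_status().assert_reaches_phase(Phase.Running, timeout=600)
ops_manager.om_status().assert_persist_phase(Phase.Running, persist_for=5)
```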
Proof of Work
Retried the above tests a few times, and all pass:
https://spruce.mongodb.com/version/6911c25146ed0e00077796e3/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC
Checklist
- Add the skip-changelog label if not needed