Flaky test fixes #580
base: master
Conversation
m1kola
left a comment
There is a moment during the operator upgrade where the resource has the status of AppDB and OM set to running. This happens very briefly before the operator starts reconciling OM and sets the OM status to Pending.
Great observation! I think we should address this root cause instead of hacking around and making tests pass.
It will likely take more time, so I suggest separating this work from the other flake fixes (where we rightfully increase timeouts).
@m1kola Do you think it is worth investing the time to look into the code for setting different status values? Is it something critical for customers? Otherwise, I would opt for fixing the test, which is a cheap but important fix for our CI stability.
class MongoDBCommon:
    @TRACER.start_as_current_span("wait_for")
-   def wait_for(self, fn, timeout=None, should_raise=True):
+   def wait_for(self, fn, timeout=None, should_raise=True, persist_for=1):
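For context, here is a minimal sketch of what a persist_for-style check could look like inside a polling helper. The retry interval, clock handling, and exception semantics are assumptions for illustration, not the actual implementation in this PR:

```python
import time

def wait_for(fn, timeout=300, should_raise=True, persist_for=1, interval=3):
    """Poll fn until it returns True persist_for times in a row, or the timeout expires.

    Hypothetical sketch: persist_for counts required consecutive successes,
    so persist_for=1 keeps the original "first success wins" behaviour.
    """
    deadline = time.time() + timeout
    consecutive = 0
    while time.time() < deadline:
        if fn():
            consecutive += 1
            if consecutive >= persist_for:
                return True
        else:
            # A single failure resets the streak of successes.
            consecutive = 0
        time.sleep(interval)
    if should_raise:
        raise AssertionError(f"Condition did not hold {persist_for} consecutive times within {timeout}s")
    return False
```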
is persist_for essentially equivalent to "required consecutive successes to pass"?
If yes, then please leave a brief comment or rename it; it took me a bit of thinking to get the meaning of this param.
And one of the reasons was that "persist for" suggests that the problem persists, or that the goal is not achieved yet and the situation persists. When something succeeds, it's not often described as a situation that "persists". But I might be nitpicking here.
is persist_for essentially equivalent to "required consecutive successes to pass"?
Yes
Not the best name, I know; I couldn't find a better one. I can add a comment, but do you have a better option? no_of_successful_passes?
While I'm not a fan of Gomega and Ginkgo, I think we should separate "wait for something" from "consistently meets something", like they do:
- Wait for a condition - https://onsi.github.io/gomega/#eventually
- Ensure a condition passes consistently - https://onsi.github.io/gomega/#consistently (edited: corrected the link)
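As an illustration of that separation, a hedged Python analog of Gomega's two primitives; the names eventually/consistently and the polling details are assumptions for the sake of the example, not code from this repository:

```python
import time

def eventually(fn, timeout=300, interval=3):
    """Return True as soon as fn() succeeds once within the timeout (Gomega's Eventually)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if fn():
            return True
        time.sleep(interval)
    return False

def consistently(fn, duration=30, interval=3):
    """Return True only if fn() succeeds on every poll for the whole duration (Gomega's Consistently)."""
    deadline = time.time() + duration
    while time.time() < deadline:
        if not fn():
            return False
        time.sleep(interval)
    return True
```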
As we discussed, I created a new method assert_persist_phase and opened a ticket to track the root issue.
Please take another look.
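For reference, a hedged sketch of what such an assert_persist_phase helper might look like on top of a polling primitive. The method name comes from the discussion, but the signature, defaults, and status accessor are assumptions:

```python
import time

class MongoDBCommon:
    def assert_persist_phase(self, phase, persist_for=5, interval=3):
        """Assert that the resource is in `phase` and stays there for
        persist_for consecutive checks, taken `interval` seconds apart.

        Hypothetical sketch: any single observation of a different phase
        fails the assertion, which rules out a brief status blip.
        """
        for _ in range(persist_for):
            current = self.get_status_phase()  # assumed accessor, not from this PR
            assert current == phase, f"expected phase {phase}, got {current}"
            time.sleep(interval)
```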
Thanks! I like that we now have the two high-level methods assert_reaches_phase and assert_persist_phase separated.
I also like how we now have assert_reaches_phase and assert_persist_phase, but still find wait_for to be a bit confusing. It's difficult to get what it does just by looking at the signature.
I'm not going to block on this, but if you can think of a way to leave wait_for as it was before and introduce another low-level function with a simple signature, that would be awesome.
We are discussing this in Slack at the moment.
I'm in favor of merging this because we have too many flaky tests at the moment and this is annoying. I agree that the proper thing to do is to fix the root cause, but it's unclear when we will have time dedicated to that.
m1kola
left a comment
Thanks for opening the ticket for the root cause as we discussed.
I would suggest opening separate PRs for separate fixes next time. For example, the timeout increases could have been merged much sooner without having to wait for a bigger discussion.
Summary
This PR aims to reduce the flakiness of the following tests:
- e2e_multi_cluster_sharded_snippets: increased the timeout of test_running, since in failing runs, by the time the diagnostics are collected, the resources have become ready.
- e2e_multi_cluster_appdb_upgrade_downgrade_v1_27_to_mck: increased the timeout of test_scale_appdb. Similarly, the assertion on the AppDB status fails, but by the time diagnostics are collected, the resource has become ready.
- e2e_appdb_tls_operator_upgrade_v1_32_to_mck: this test has a race condition. There is a moment during the operator upgrade where the resource has the status of both AppDB and OM set to Running. This happens very briefly, before the operator starts reconciling OM and sets the OM status to Pending. In that moment, the test very quickly passes both assertions and moves on to assert healthiness by connecting to OM, which fails since OM was not actually ready.

To fix this, I added a persist_for flag to our assertion methods. It makes sure that the phase we are currently asserting is reached and then persists for a number of retries, as shown in the sketch below.
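To make the race concrete, a hedged before/after sketch of the assertion pattern in the upgrade test; the fixture names, Phase values, timeouts, and the persist_for value are illustrative placeholders rather than the actual test code:

```python
# Before: a single momentary "Running" blip during the upgrade was enough to pass.
ops_manager.appdb_status().assert_reaches_phase(Phase.Running, timeout=600)
ops_manager.om_status().assert_reaches_phase(Phase.Running, timeout=600)

# After: the phase must also persist across consecutive checks, so the
# short-lived Running status seen before OM reconciliation no longer counts.
ops_manager.appdb_status().assert_reaches_phase(Phase.Running, timeout=600)
ops_manager.om_status().assert_persist_phase(Phase.Running, persist_for=5)
```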
Proof of Work
Retried the above tests a few times, and all pass:
https://spruce.mongodb.com/version/6911c25146ed0e00077796e3/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC
Checklist
- Add the skip-changelog label if not needed