[RELEASE]: Stabilize the integration test runs for distribution builds, no manual sign-offs anymore #4588
Comments
I support this proposal. As far as I know, manual sign-off caused delays in releasing 5+ minor versions (as additional time was required for the manual sign-off) and resulted in 7 patch releases (as bugs slipped through the manual sign-off) in 2023.
Here's a link to what's being manually signed off in 2.13: #4433 (comment)
I am supportive - thanks @reta for pursuing this. However, given the number of manual sign-offs and flaky tests, can we agree on a % reduction goal for 2.14, followed by targets for 2.15 and beyond, so that we can get down to 0 manual sign-offs in 3-4 months? All teams are impacted and are burning down their flaky tests, but the pace is slower than we expect.
The components below have the most flaky tests. If we are setting a % reduction goal, we should set it at the component level.
OpenSearch:
OpenSearch Dashboards:
Thanks @bbarani - can we take a goal of a 50% reduction for each component by 2.14? It will be difficult to manage different reduction targets across different components.
Why 50% and not 100%? In 2.14 there were only 4 plugins that didn't manually sign off. Any team that doesn't have stable tests should be focused on that rather than on adding new features.
I have created child issues for the components that manually signed off in 2.13.0. See the links above. Thanks!
@bbarani You mentioned ISM as one of the components with a high number of flaky tests. This isn't accurate, and it may not be accurate for some of the other plugins either. As I also called out in the release meeting, the primary reason for the ISM failures was an update that put the wrong security certificates in the repo. This certificate update was originally suggested to plugins by the security team to extend the expiry, and it was later fixed by another PR, opensearch-project/index-management#1147, by the security team itself.
Hi @vikasvb90, even before the security changes were merged, the index management plugin had seen a lot of flaky tests; see the issues opened and closed multiple times in past releases. Re-running Jenkins would not have fixed it, because a new RC would have had to be generated: the integration test framework checks out the exact commit that was used to build the RC. We also recently matched the GitHub Actions specifications on Jenkins; see opensearch-project/opensearch-ci#412.
@gaiksaya I didn't mean that there are no flaky tests in ISM; I mentioned that in my comment as well. We do have some flaky tests, and we have been making improvements. I am only pointing out the reason for the failures in the last release, which is what that statement was based on.
@vikasvb90, thanks for your comment. This issue mainly focuses on remediating all flaky tests rather than on any specific released OpenSearch version. The goal is to identify flaky tests and fix them before the 2.14.0 release.
OpenSearch Dashboards did not pass the build CI for 2.14 even though it passed these tests locally and in its own component CI. I will work with the build team to address it.
Thanks @ashwin-pc for getting the Dashboards test cases to run locally - we need all test cases running as part of CI in 2.15, without exceptions.
Update on passing the tests in the Build CI to avoid a manual sign-off for 2.15: we need to split the OSD tests into groups. In 2.14 we had tests pass when run individually but fail when all the tests were run together. It appears that the issue doesn't lie with individual tests for OpenSearch Dashboards, but rather with running all the tests together. The OpenSearch Dashboards GitHub CI does not face this issue because it uses CI groups to run the tests. The Functional Test Repo team has also observed that splitting the tests into separate CI groups helped resolve the failures.

We need to implement a similar approach for the Build CI. Instead of running all the tests together, we should partition them into smaller groups or batches. This way, if any test leaves behind residual data or fails to clean up properly, it won't affect the tests in subsequent batches. While this may not address the root cause of the flakiness, it aligns with the previously successful solution of partitioning tests. Historically we have learned that if OSD tests run sequentially, we should expect failures. Investigating and fixing the underlying issues would require significant research effort, which may not be feasible given the large number of inherited tests. By partitioning the tests, we can mitigate the impact of flakiness and ensure more reliable test runs.
One alternative is to use @AMoo-Miki's pipeline, which has already implemented this grouping; it can take a built artifact and run the tests in groups.
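To make the grouping idea above concrete, here is a minimal, hypothetical sketch (not the actual Build CI or @AMoo-Miki's pipeline) of partitioning functional test specs into fixed-size batches and running each batch as a separate runner invocation, so residual state from one batch cannot leak into later batches. The paths, group size, and runner command are assumptions for illustration only.

```python
# Hypothetical sketch: partition functional test specs into fixed-size groups and
# run each group in its own runner process. Paths, group size, and the runner
# command are placeholders, not the actual Build CI configuration.
import glob
import subprocess
import sys

GROUP_SIZE = 20  # specs per group (tuning assumption)

def partition(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def main():
    specs = sorted(glob.glob("cypress/integration/**/*.js", recursive=True))
    failed_groups = []
    for idx, group in enumerate(partition(specs, GROUP_SIZE), start=1):
        print(f"Running group {idx} ({len(group)} specs)")
        # Each group gets a fresh runner process; a fresh cluster per group
        # would isolate state even further.
        result = subprocess.run(["npx", "cypress", "run", "--spec", ",".join(group)])
        if result.returncode != 0:
            failed_groups.append(idx)
    if failed_groups:
        print(f"Failed groups: {failed_groups}")
        sys.exit(1)

if __name__ == "__main__":
    main()
```

A failing group then points to a much smaller set of specs to rerun and debug than a single monolithic run.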
I think that, as a workaround, grouping looks like a step in the right direction, but one may still face flakiness within a single group, since the cause (why sequential test runs are flaky) is still not understood.
Another concern I have is that the chance of detecting a valid regression is reduced when running in groups. For example, if a feature that causes a performance regression is put in one group and happens to be exercised by the last test of that group, then the subsequent groups will not detect the issue. As of now, if that happens, the subsequent tests fail with timeouts or other related errors.
I believe @SuZhou-Joe is looking into the flaky test failures and has made some discoveries about their probable cause.
Thanks @rishabh6788. Also waiting for @SuZhou-Joe to provide a possible fix to the root problem. The initial changes on GitHub Actions seem to avoid the issue rather than fix it permanently. Thanks.
@reta Flakiness within a single group is manageable, since tests won't interfere with each other as much, and all the author of a test needs to ensure is that its group runs successfully. And because the groups are smaller, validating fixes to a group should be a lot quicker and more reproducible. Every fix made to the flaky tests so far has been about tests interfering with each other, not about the core functionality or a regression itself. And when each test run takes hours to complete, debugging and addressing these tests is much harder. Also, we do know why the grouping fixes the tests: the CI uses a single low-powered OpenSearch instance to run all the tests. OpenSearch Dashboards has far more tests than the plugins, so when we run all the tests sequentially, even though we clean up the data after the tests, OpenSearch does not clear its cache in time and starts to throttle the requests. @SuZhou-Joe came up with a solution to clear the cache for now, but relying on deep dives like that isn't scalable.
@rishabh6788 I think this is unnecessary. Each of the fixes so far, including @SuZhou-Joe's latest fix for running the tests sequentially, has addressed tests interfering with each other. His fix actually just asks OpenSearch to clear its cache so that the next set of tests can run, which is itself a workaround. UI tests expect the application to be in a specific state in order not to be flaky, and expecting them to work sequentially only makes this more fragile. Also, if these kinds of regressions are really a concern, why aren't we running all functional tests (including those of other plugins) sequentially? If we really want to catch accidental regressions, we should introduce fuzz tests, which by definition will be flaky but can catch errors like the ones you are concerned about. As for the other integration tests, they are meant to validate a specific function and aren't written to validate behavior across tests.
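For context on the cache-clearing workaround mentioned above, here is a minimal sketch of what such a step between test batches could look like, using the standard OpenSearch index clear-cache API. The cluster address, credentials, and the place where this hook would run are assumptions for illustration, not the actual fix referenced in this thread.

```python
# Hypothetical sketch: clear OpenSearch caches between test batches so cached
# state from one batch does not build up and throttle the next batch.
# Host and credentials are placeholders for a local test cluster.
import requests

OPENSEARCH_URL = "https://localhost:9200"  # placeholder cluster address

def clear_all_caches():
    # POST /_cache/clear clears the query, request, and fielddata caches
    # on all indices of the cluster.
    response = requests.post(
        f"{OPENSEARCH_URL}/_cache/clear",
        auth=("admin", "admin"),  # placeholder credentials for a test cluster
        verify=False,             # test clusters often use self-signed certificates
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example: call clear_all_caches() in the test harness after each batch completes.
```

As noted in the discussion, this treats the symptom rather than the root cause of the interference.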
Tests interfering with one another (either through concurrent execution or leaving modified state after test runs) is definitely a common problem. We should continue working towards a real solution to this problem so that tests are independent and reliable regardless of the context in which they run. Changing the test run grouping may be a good mitigation for now, but if we don't actually solve the root problem of test interference then we're likely to continue struggling with this going forward.
Fuzz tests should not be flaky. They should be deterministically reproducible with a given random seed, and any failures they uncover would be real bugs that need to be fixed.
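To illustrate the point about determinism, here is a minimal sketch of a seed-driven fuzz test: the inputs are generated from an explicit seed, so any failure can be reproduced exactly by re-running with the same seed. The function under test (normalize_title) and its property are made up for illustration.

```python
# Minimal sketch of a deterministic fuzz test: all randomness comes from an
# explicit seed, so a failure report (seed + iteration) reproduces the exact input.
import random

def normalize_title(s: str) -> str:
    """Toy function under test: trim and collapse internal whitespace."""
    return " ".join(s.split())

def fuzz_normalize_title(seed: int, iterations: int = 1000) -> None:
    rng = random.Random(seed)  # same seed => same sequence of inputs
    alphabet = "ab \t\n"
    for i in range(iterations):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 40)))
        out = normalize_title(s)
        # Properties: no leading/trailing whitespace, no doubled spaces.
        assert out == out.strip(), f"seed={seed} iteration={i} input={s!r}"
        assert "  " not in out, f"seed={seed} iteration={i} input={s!r}"

if __name__ == "__main__":
    fuzz_normalize_title(seed=42)
```

Any assertion failure here would point to a real bug, reproducible by re-running with the reported seed.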
CI is using m5.8xlarge instances as far as I know. Is this not correct? That is a 32-core instance with 128 GB of memory. If using a more powerful machine would help, I think we can explore doing that.
Is your feature request related to a problem? Please describe
At the moment, a recurrent practice during the release process is to collect manual sign-offs for numerous integration test failures across different components. Some of them are constant offenders; others pop up from release to release. The flakiness of the test runs takes a lot of the release team's time to collect the go/no-go decision and significantly lowers confidence in the release bundles.
Describe the solution you'd like
For 2.14.0, we should stabilize the integration test runs for distribution builds and no longer accept manual sign-offs. Integration test failures mean no-go by definition.
Describe alternatives you've considered
Continue collecting manual sign-offs and keep burning time on that.
Additional context
All previous releases (2.13.0 / 2.12.0 / 2.11.0 / ...)