Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Infra] Add CI groups for OSD core test cases to avoid flaky test #1352

Merged
merged 23 commits into from
Jun 7, 2024

Conversation

SuZhou-Joe
Copy link
Member

@SuZhou-Joe SuZhou-Joe commented Jun 4, 2024

Description

In the release of 2.14, OSD core has 3 flaky tests listed in Issues Resolved section. This PR splits the test cases into CI groups to make those cases pass.

Issues Resolved

#1323
#1322
#1298

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

SuZhou-Joe added 14 commits June 5, 2024 10:36
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
Signed-off-by: SuZhou-Joe <[email protected]>
@rishabh6788
Copy link

rishabh6788 commented Jun 6, 2024

Before we merge this PR can you please help answer below questions?

  1. If the reasoning behind splitting into groups is resource constraint or data cleanup, then:
    a. Did you check running with bigger cluster with appropriate heap size?
    b. If it is data clean up issue then is there any root cause analysis for it?
  2. Is there any logic behind grouping or is it random?
  3. It will be helpful if you could elaborate on the logic used to group the tests.
  4. Please move the grouping logic outside of workflow yml file, so that the same can be used by CI integ test logic.
    @peterzhuamazon @getsaurabh02 @gaiksaya @prudhvigodithi

@peterzhuamazon
Copy link
Member

peterzhuamazon commented Jun 6, 2024

Add up to the last point raised by @rishabh6788 , we should consider using the existing test_finder and integtest shell script to apply the logic. We probably need to make some adjustment on the test manifest as well / or the integTest jenkinsfile to define the container assignment.

Our Python code can then apply these logics with parameters and onboard to Jenkins with smaller effort.

Thanks.

@gaiksaya
Copy link
Member

gaiksaya commented Jun 6, 2024

  1. We did not get the bandwidth to find the root cause and splitting into CI groups is the quick solution to unblock the release. And from @kavilla 's observation, it has something to do with the data clean up.
    2/3. I feel like the grouping logic is random, but @kavilla @ashwin-pc please correct me if I was wrong.

Can we invest sometime in root causing before jumping to quick solutions and conclusions? I understand the timeline concern but changes on the infra side also won't be as straight forward. Need to plan it properly not to break 1x line compatibility and also accommodate other components. Wondering if major concern here is data clean or CPU/resources constraints or flaky tests.

@SuZhou-Joe
Copy link
Member Author

  1. From @kavilla 's observation, it has something to do with the data clean up. Addressing the root cause will be a following item.

2/3. I feel like the grouping logic is random, but @kavilla @ashwin-pc please correct me if I was wrong.

  1. Yeah, the logic to format and get the specs under specific CI groups can be reused. Will optimize that.

@ruanyl ruanyl merged commit bd5e0fb into opensearch-project:2.x Jun 7, 2024
52 of 54 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 7, 2024
)

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: run all

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: revert the deletion of windows flow

Signed-off-by: SuZhou-Joe <[email protected]>

* fix: integration test failure

Signed-off-by: SuZhou-Joe <[email protected]>

* fix: integration test failure

Signed-off-by: SuZhou-Joe <[email protected]>

---------

Signed-off-by: SuZhou-Joe <[email protected]>
Co-authored-by: Yulong Ruan <[email protected]>
(cherry picked from commit bd5e0fb)
@rishabh6788
Copy link

@ruanyl We were still waiting on getting our questions answered and proposed a few changes. Any reason why this was merged before getting those questions answered?
@SuZhou-Joe @getsaurabh02 @peterzhuamazon

@getsaurabh02
Copy link
Member

getsaurabh02 commented Jun 7, 2024

If this PR was to fix the flaky test case failures through grouping, CI for this PR should not have failed? 😖

@ashwin-pc should we root cause why is CI still failing post grouping : https://github.com/opensearch-project/opensearch-dashboards-functional-test/actions/runs/9410026571/job/25921007182?pr=1368

@ruanyl
Copy link
Member

ruanyl commented Jun 7, 2024

@rishabh6788 My bad, I wasn't aware of the open questions. Shall we revert the PR?

@gaiksaya
Copy link
Member

gaiksaya commented Jun 7, 2024

If this PR was to fix the flaky test case failures through grouping, CI for this PR should not have failed? 😖

@ashwin-pc should we root cause why is CI still failing post grouping : https://github.com/opensearch-project/opensearch-dashboards-functional-test/actions/runs/9410026571/job/25921007182?pr=1368

To add to this, are we sure that the test cases are exactly same between OSD repo and FT repo?

@SuZhou-Joe
Copy link
Member Author

@ruanyl Never mind, it won't affect the existing infra pipeline and after merging the code to 2.x, and we could ask other PRs to ensure all the flows pass before merging into 2.x.

@peterzhuamazon
Copy link
Member

If this PR was to fix the flaky test case failures through grouping, CI for this PR should not have failed? 😖

@ashwin-pc should we root cause why is CI still failing post grouping : https://github.com/opensearch-project/opensearch-dashboards-functional-test/actions/runs/9410026571/job/25921007182?pr=1368

We need to understand if this is a code / resource / cleanup issues and set out a plan to fix it permanently if possible.

SuZhou-Joe added a commit that referenced this pull request Jun 11, 2024
)

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: run all

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: revert the deletion of windows flow

Signed-off-by: SuZhou-Joe <[email protected]>

* fix: integration test failure

Signed-off-by: SuZhou-Joe <[email protected]>

* fix: integration test failure

Signed-off-by: SuZhou-Joe <[email protected]>

---------

Signed-off-by: SuZhou-Joe <[email protected]>
Co-authored-by: Yulong Ruan <[email protected]>
(cherry picked from commit bd5e0fb)
SuZhou-Joe added a commit that referenced this pull request Jun 12, 2024
…d flaky test (#1368)

* [Infra] Add CI groups for OSD core test cases to avoid flaky test (#1352)

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: run all

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: add ci-groups

Signed-off-by: SuZhou-Joe <[email protected]>

* feat: revert the deletion of windows flow

Signed-off-by: SuZhou-Joe <[email protected]>

* fix: integration test failure

Signed-off-by: SuZhou-Joe <[email protected]>

* fix: integration test failure

Signed-off-by: SuZhou-Joe <[email protected]>

---------

Signed-off-by: SuZhou-Joe <[email protected]>
Co-authored-by: Yulong Ruan <[email protected]>
(cherry picked from commit bd5e0fb)

* feat: merge

Signed-off-by: SuZhou-Joe <[email protected]>

---------

Signed-off-by: SuZhou-Joe <[email protected]>
Co-authored-by: SuZhou-Joe <[email protected]>
@kavilla
Copy link
Member

kavilla commented Jun 13, 2024

Sorry about the delay.

The grouping with OSD cigroups weren't random. It split out the Discover the tests because each group is could be really slow. It also somewhat ensured that the ciGroups that are ran together would avoid causing failures in another group. We attempted to avoid couple tests in the same group that used another tests datasets.

Then we added each other test were we see fit (somewhat related) and avoiding the Discover tests.

For the resource clean up we see this problem in hosted environments anyways where then users have to force a refresh. Taking a random repo issue: opensearch-project/notifications#790

We mark things for deletion but that's not guaranteeing that these items are actually being deleted in the cluster. Then we move on to the next expensive test that inserts more data and tries to query that data and potentially a search phase execution error occurs because JVM has too much pressure.

But refreshing after every delete is also a very expensive process so there could be runtime issues.

So I'm really not sure of a better solution then to make sure to use ciGroups at least for OpenSearch Dashboards.

OpenSearch Dashboards on main has 58 plugins alone. When we forked the selenium tests required they be ran in CiGroups and we have plans to migrate to our selenium tests to Cypress. When we migrated most of the Discover selenium tests to cypress we were easily able to choke the cluster with the amount of times we write a bunch of data and delete unless we are allocated a production-like cluster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants