[ACTION NEEDED] Fix flaky integration tests at distribution level #1670
Comments
@RyanL1997 @ps48 Can you please provide your inputs?
We're working on it. A while back I asked about the failures in opensearch-project/opensearch-build#4635; as far as I can tell, the distribution failures aren't coming from our tests but from somewhere in the pipeline. I've marked our distribution issues with "help wanted" where applicable.
It also looks like many of the manifests are still showing an unavailable status.
Tagging @zelinh here to provide his inputs.
Here are some reasons that it may show as unavailable:
E.g. for the 2.14 integration tests autocut, of the three most recent manifests at the time of writing, two are unavailable (most recent, second most recent (available), third most recent).
I saw these in both of the unavailable runs. It seems the process was terminated by the timeout limit when we ran the integ tests for observabilityDashboards; as a result, it didn't run through the full test-recording process.
https://build.ci.opensearch.org/job/integ-test-opensearch-dashboards/5856/consoleFull
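To sketch what that would mean (a hypothetical harness only; the command, arguments, and timeout value are all invented): if the outer timeout hard-kills the runner, the report files never get flushed, which would explain why the recordings are missing. Sending SIGTERM before SIGKILL at least gives the runner a chance to write its results:

```ts
// Hypothetical CI harness sketching the failure mode: a hard kill at the
// timeout means the test runner never flushes its reports to disk.
import { spawn } from "node:child_process";

function runWithTimeout(cmd: string, args: string[], limitMs: number): Promise<void> {
  return new Promise((resolve, reject) => {
    const proc = spawn(cmd, args, { stdio: "inherit" });
    let killTimer: ReturnType<typeof setTimeout> | undefined;

    const termTimer = setTimeout(() => {
      proc.kill("SIGTERM"); // polite stop: the runner can still write reports
      killTimer = setTimeout(() => proc.kill("SIGKILL"), 30_000); // last resort
    }, limitMs);

    proc.on("exit", (code) => {
      clearTimeout(termTimer);
      if (killTimer !== undefined) clearTimeout(killTimer);
      code === 0 ? resolve() : reject(new Error(`runner exited with code ${code}`));
    });
    proc.on("error", reject);
  });
}

// e.g. a two-hour budget for the observabilityDashboards suite (value invented)
runWithTimeout("yarn", ["cypress", "run"], 2 * 60 * 60 * 1000)
  .catch(() => process.exit(1));
```

Whether the actual Jenkins job does a soft or a hard kill at the timeout is exactly the open question here.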
Hypothesis: the failing tests are flaky, and the timeouts only happen when the tests pass (i.e., something later in the test suite is consuming all the time); we only get the failure message when an earlier test fails and cuts the run short. Based on this hypothesis I opened opensearch-project/opensearch-dashboards-functional-test#1250 to fix the flakiness, but I'm still not sure what's causing the timeouts.
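The usual shape of this kind of flakiness fix in a Cypress spec (a sketch only; the selector, text, and timeout below are invented, not the actual change in #1250) is to replace fixed sleeps with retryable assertions:

```ts
// Flaky: a fixed sleep races against slow rendering, so the next command
// sometimes runs before the panel exists.
cy.wait(5000);
cy.get(".observability-panel").contains("Logs").click();

// Sturdier: .should() retries until the element is visible (up to the
// given timeout), so timing no longer decides the outcome.
cy.get(".observability-panel", { timeout: 30000 })
  .should("be.visible")
  .contains("Logs")
  .click();
```

With the retryable form, a slow render just consumes more of the timeout budget instead of flipping the test result.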
For completeness, I've checked the recent pipeline logs after the flakiness fix was merged, and I'm not seeing any integ-test failures for observability. https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/integ-test-opensearch-dashboards/runs/5899/log/?start=0 I can find the interruption exception, but no indication of what specifically is being interrupted (is some test hanging?).
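If something is hanging, one mitigation (a sketch, assuming the suite is Cypress-based like opensearch-dashboards-functional-test; all values are arbitrary) is to keep the runner's own timeouts well below the pipeline's, so a stuck test fails by name in the report instead of surfacing only as an interruption:

```ts
// cypress.config.ts (values invented): keep the runner's budgets well
// below the pipeline's overall timeout so a stuck test fails in the
// report rather than stalling the whole run.
import { defineConfig } from "cypress";

export default defineConfig({
  defaultCommandTimeout: 30_000, // each cy.* command must settle within 30s
  pageLoadTimeout: 60_000,       // full page loads get 60s
  requestTimeout: 15_000,        // requests awaited via cy.wait() get 15s
});
```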
Tagging @rishabh6788 to look into the above failure ^
Currently just held up by #1822.
What is the bug?
It was observed in 2.13.0 and several previous releases that this component manually signed off on the release despite failing integration tests. See opensearch-project/opensearch-build#4433 (comment)
The flakiness of the test runs takes a lot of the release team's time when collecting the go/no-go decision and significantly lowers confidence in the release bundles.
How can one reproduce the bug?
Steps to reproduce the behavior:
What is the expected behavior?
Tests should be consistently passing.
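One way to make "consistently passing" measurable (purely illustrative; the command, run count, and threshold are assumptions, not part of any existing tooling) is to rerun the suite several times and treat anything below a 100% pass rate as flaky:

```ts
// Hypothetical flakiness probe: run the suite N times and report the
// pass rate; any failed rerun marks the suite as flaky.
import { spawnSync } from "node:child_process";

const RUNS = 5; // arbitrary
let passes = 0;
for (let i = 0; i < RUNS; i++) {
  const { status } = spawnSync("yarn", ["cypress", "run"], { stdio: "inherit" });
  if (status === 0) passes++;
}
console.log(`pass rate: ${passes}/${RUNS}`);
process.exit(passes === RUNS ? 0 : 1); // non-zero exit signals flakiness
```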
Do you have any additional context?
Please note that this is a hard blocker for the 2.14.0 release, as per the discussion here.