Randomizing and repeating functional tests to detect flakiness #53105

Closed
wants to merge 1 commit into from

Conversation

@jpodivin jpodivin commented Jun 12, 2024

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 12, 2024
Contributor

openshift-ci bot commented Jun 12, 2024

Hi @jpodivin. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot requested review from lewisdenny and rabi June 12, 2024 10:12
@jpodivin
Author

/pj-rehearse

@openshift-ci-robot
Contributor

@jpodivin: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot
Contributor

@jpodivin: needs-ok-to-test label found, no rehearsals will be run

@bogdando bogdando left a comment

lgtm, good idea

@dprince
Contributor

dprince commented Jun 13, 2024

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 13, 2024
@lewisdenny
Contributor

/pj-rehearse

@openshift-ci-robot
Contributor

@lewisdenny: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@lewisdenny
Contributor

/pj-rehearse ack

@openshift-ci-robot
Contributor

@lewisdenny: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 19, 2024
@@ -64,7 +64,7 @@ tests:
 mkdir -p ../operator && cp -r . ../operator
 cd ../operator
 export GOFLAGS=
-make test GINKGO_ARGS='--no-color'
+make test GINKGO_ARGS='--no-color --repeat=5 --randomize-all'
Contributor

I would not add --repeat=5 in CI.

If you want to support repeated runs for local execution, I would add a new ENV var and default it to 1.
I do agree with --randomize-all.
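
A minimal sketch of the environment-variable approach suggested above. The variable name GINKGO_REPEAT and the helper build_ginkgo_args are hypothetical and do not come from the PR; only the --no-color, --randomize-all and --repeat flags appear in the diff. The sketch passes --repeat only when the variable is set, so local runs stay unchanged unless someone opts in.

```shell
# Build the Ginkgo argument string from an optional environment variable.
# GINKGO_REPEAT is a hypothetical name chosen for this sketch.
build_ginkgo_args() {
  args="--no-color --randomize-all"
  # Only pass --repeat when the caller set the variable to a non-empty value.
  if [ -n "${GINKGO_REPEAT:-}" ]; then
    args="$args --repeat=${GINKGO_REPEAT}"
  fi
  printf '%s\n' "$args"
}
```

A CI step could then export GINKGO_REPEAT=5 and run make test GINKGO_ARGS="$(build_ginkgo_args)", while a developer who leaves the variable unset would get a single randomized run.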

Author

Running a single randomized test run wouldn't give us as much benefit. The purpose of this PR is to prevent merging flaky code, which can slip by thanks to a "lucky" sequence of test cases. If we run multiple randomized test runs, the probability of that happening drops considerably.

For obvious reasons I can't say by how much, since I don't know how many "lucky" combinations there are. But with successive runs, the probability of hitting only "lucky" combinations drops toward zero quickly.
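
The argument above can be written as a short worked formula. This is only an illustration under stated assumptions: q is an assumed, unknown probability that a single randomized run passes despite an order-dependent defect, and the n runs are assumed to be independent with independently chosen orderings. The values q = 0.5 and n = 5 are made up for the example and are not measurements from this project.

```latex
% q: assumed probability that one randomized run passes despite an order-dependent defect
% n: number of independent randomized runs
P(\text{all } n \text{ runs pass}) = q^{n},
\qquad \text{e.g. } q = 0.5,\ n = 5 \;\Rightarrow\; 0.5^{5} = 0.03125 \approx 3.1\%
```

Under these assumptions the chance of a defect slipping through shrinks geometrically with the number of runs, although the actual benefit depends on the unknown q.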

Contributor

@fmount fmount Jun 27, 2024

-1 (to clarify a few things): I think hardcoding the value to 5 is not a good default (unless you have good reasons to think that 5 is the right default).
I agree with @SeanMooney that we should use an ENV var that we can override in the future without requiring a new change here (and we can test different combinations with a DNM change in the operator itself).
Also, are you going to propose this for the service operators, or does this change apply to openstack-operator only? And considering that the dataplane operator has been merged into openstack-operator, shouldn't we just remove and clean up this part?

Author

@jpodivin jpodivin Jun 27, 2024

Adding a variable to the Makefile can help, but it would force repeated runs locally as well. That would quickly become annoying and would likely lead people to override the variable. I want to avoid that by setting it here, where it only affects CI and nothing else.

Furthermore, setting repetitions to 1, as @SeanMooney proposed, won't do us any good unless we get very lucky.

As for the dataplane operator merger: yes, that is something I want to do. This PR has been here since before the operators were merged.

Contributor

@fmount fmount Jun 27, 2024

ack, and I agree that 1 would definitely invalidate this patch.
If the rest of the team is OK with 5 as the default, then the remaining work here is to clean up the dataplane-operator part.

Author

Turned the setting into a variable in the openstack-operator repo. This means that the PR now has a dependency it has to pull in order to execute, but it's the only way to satisfy the conditions laid out.

The leftover from the dataplane repo was removed some time ago.

@openshift-ci-robot openshift-ci-robot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 27, 2024
Contributor

openshift-ci bot commented Jun 27, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jpodivin
Once this PR has been reviewed and has the lgtm label, please assign slagle for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This patch changes the target executed by the CI to one that uses flags for repetition and randomization.

Signed-off-by: Jiri Podivin <[email protected]>
@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@jpodivin: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-openstack-k8s-operators-openstack-operator-main-functional | openstack-k8s-operators/openstack-operator | presubmit | Ci-operator config changed |

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse abort to abort all active rehearsals

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@bogdando

lgtm

@openshift-bot
Contributor

Issues in openshift/release go stale after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 15d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 14, 2024
@openshift-bot
Contributor

Stale issues in openshift/release rot after 15d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 15d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 1, 2024
Contributor

openshift-ci bot commented Sep 13, 2024

@jpodivin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/build-farm/build11-dry | cbdf4d8 | link | true | /test build11-dry |
| ci/prow/check-cluster-profiles-config | cbdf4d8 | link | true | /test check-cluster-profiles-config |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Rotten issues in openshift/release close after 15d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Sep 29, 2024
Contributor

openshift-ci bot commented Sep 29, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten issues in openshift/release close after 15d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

8 participants