Access to external services (e.g. quay) randomly fails during cluster-up steps #1330

Open
EdDev opened this issue Nov 25, 2024 · 3 comments
Labels
kind/bug kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@EdDev
Member

EdDev commented Nov 25, 2024

What happened:

In kubevirt/kubevirt job runs, access failures to quay are seen from time to time [1].
These occurrences fail the e2e jobs at startup.

Here is an example:

 ./hack/cluster-up.sh
12:15:34: selecting podman as container runtime
12:15:34: Trying to pull quay.io/kubevirtci/gocli:2402231446-3191285...
12:17:34: Error: initializing source docker://quay.io/kubevirtci/gocli:2402231446-3191285: Get "https://quay.io/v2/auth?scope=repository%3Akubevirtci%2Fgocli%3Apull&service=quay.io": net/http: TLS handshake timeout
12:17:34: ./cluster-up/up.sh: line 34: pop_var_context: head of shell_variables not a function context
make: *** [Makefile:152: cluster-up] Error 125 

[1] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/13184/pull-kubevirt-e2e-k8s-1.29-sig-network-1.2/1856305515764649984#1:build-log.txt%3A330

What you expected to happen:

Internet connectivity and the service (quay) should not be assumed to be 100% available; some flakes will occur due to many factors.
Therefore, such access attempts are expected to be retried with a backoff and a defined timeout.
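
To make the expectation concrete, here is a minimal sketch of a retry wrapper with exponential backoff and an overall deadline. The function name `retry_with_backoff`, its argument order, and the values in the usage comment are hypothetical and are not taken from kubevirtci:

```shell
# Hypothetical helper, not part of kubevirtci: retry a command with
# exponential backoff until it succeeds, attempts run out, or the next
# wait would exceed an overall deadline.
# Usage: retry_with_backoff <max_attempts> <initial_delay_s> <deadline_s> <cmd...>
retry_with_backoff() {
    local max_attempts="$1" delay="$2" deadline="$3"
    shift 3
    local attempt=1 start now
    start=$(date +%s)

    while true; do
        # Run the command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        now=$(date +%s)
        # Do not start a wait that would end past the deadline.
        if [ $(( now - start + delay )) -gt "$deadline" ]; then
            echo "giving up: next retry would exceed the ${deadline}s deadline: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s: $*" >&2
        sleep "$delay"
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
}

# Example call (illustrative values only):
# retry_with_backoff 5 2 300 podman pull quay.io/kubevirtci/gocli:2402231446-3191285
```

Note that the deadline here only limits when a new retry may start; it does not interrupt an attempt that is already hanging (the TLS handshake in the log above took two minutes to time out). Bounding a single attempt would need something extra, for example wrapping the command in coreutils `timeout`.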

How to reproduce it (as minimally and precisely as possible):
Random.

Additional context:

Environment:

@EdDev EdDev added the kind/bug label Nov 25, 2024
@EdDev
Member Author

EdDev commented Nov 25, 2024

/cc @akalenyu

@dosubot dosubot bot added the kind/flake Categorizes issue or PR as related to a flaky test. label Nov 25, 2024
@oshoval
Contributor

oshoval commented Nov 25, 2024

Btw, if we had a way to compare the expected hash with the one that exists in the CI cache (if any),
then we would not need quay at all once the image is in the cache (which is most of the time), and could reduce flakes even more
when quay is unreachable for periods or flaky.

But it is more complicated of course.

(atm, even if we have the exact requested hash, we still contact quay as far as I remember, even if just for headers)

EDIT
I wonder if using a sha instead of a tag would improve it a bit, but I am not sure: even if the image is in the cache by sha, the pull might still access quay (which is what the above solution tries to solve)
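
A rough sketch of the "skip quay when the image is already cached" idea, under assumptions: `pull_if_missing` and the `CONTAINER_RUNTIME` variable are hypothetical names, and the check only looks at the local image store of the selected runtime (via `image inspect`, which both podman and docker provide). It does not verify that a cached tag still matches what quay serves, so it is only safe if tags are never re-pushed or if images are referenced by digest:

```shell
# Hypothetical helper, not part of kubevirtci: pull an image only when it is
# missing from the local image store, so a cached image needs no registry access.
# CONTAINER_RUNTIME is an assumed variable name; it defaults to podman.
pull_if_missing() {
    local runtime="${CONTAINER_RUNTIME:-podman}" image="$1"

    # "image inspect" exits non-zero when the image is not present locally.
    if "$runtime" image inspect "$image" >/dev/null 2>&1; then
        echo "using cached image: $image" >&2
        return 0
    fi
    "$runtime" pull "$image"
}

# Example call (illustrative only):
# pull_if_missing quay.io/kubevirtci/gocli:2402231446-3191285
```

Whether this actually avoids all registry traffic in the CI setup depends on how the image is used afterwards, which this sketch does not cover.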

@oshoval
Contributor

oshoval commented Nov 25, 2024

12:17:34: ./cluster-up/up.sh: line 34: pop_var_context: head of shell_variables not a function context

This can also be fixed, but it is not relevant to this ticket. It seems to happen on every cluster-up failure since the refactor.
