Access to external services (e.g. quay) randomly fail during cluster-up steps #1330

EdDev · 2024-11-25T09:14:46Z

What happened:

On kubevirt/kubevirt job runs, access failures to quay are seen from time to time [1].
These occurrences fail the e2e jobs on start.

Here is an example:

 ./hack/cluster-up.sh
12:15:34: selecting podman as container runtime
12:15:34: Trying to pull quay.io/kubevirtci/gocli:2402231446-3191285...
12:17:34: Error: initializing source docker://quay.io/kubevirtci/gocli:2402231446-3191285: Get "https://quay.io/v2/auth?scope=repository%3Akubevirtci%2Fgocli%3Apull&service=quay.io": net/http: TLS handshake timeout
12:17:34: ./cluster-up/up.sh: line 34: pop_var_context: head of shell_variables not a function context
make: *** [Makefile:152: cluster-up] Error 125

[1] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/13184/pull-kubevirt-e2e-k8s-1.29-sig-network-1.2/1856305515764649984#1:build-log.txt%3A330

What you expected to happen:

We should assume that internet connectivity and the service (quay) may not be 100% available, and that some flakes will occur due to many factors.
Therefore, such attempts should be retried with a backoff and a defined timeout.
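
A minimal sketch of what such a retry could look like in bash. The function name `retry_with_backoff` is hypothetical and not part of kubevirtci, and the attempt count, initial delay, and per-attempt timeout in the usage comment are assumed values rather than tuned ones:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubevirtci): retry a command with
# exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt=1
    while true; do
        # Run the command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        if (( attempt >= max_attempts )); then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
        sleep "${delay}"
        # Double the wait before the next attempt.
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
}

# Example (image reference taken from the log above; 5 attempts, a 2s initial
# delay, and a 120s per-attempt timeout are assumed values):
# retry_with_backoff 5 2 timeout 120 podman pull quay.io/kubevirtci/gocli:2402231446-3191285
```

Wrapping each attempt in coreutils `timeout` bounds a single hung TLS handshake, while the attempt limit bounds the total time spent before the job fails.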

How to reproduce it (as minimally and precisely as possible):
Random.

Additional context:

Environment:

EdDev · 2024-11-25T09:14:58Z

/cc @akalenyu

oshoval · 2024-11-25T09:33:22Z

Btw, if we had a way to compare the expected hash against the one in the CI cache (if any),
then we would not need quay at all once the image is in the cache (which is most of the time), and could reduce flakes even more
when quay is unreachable for periods or flakes.

But it is more complicated of course.

(atm, even if we have the exact requested hash, we still contact quay as far as I remember, even if just for headers)

EDIT
I wonder if using a sha instead of a tag would improve it a bit, but I'm not sure, because even if the image is in the cache by sha, it might still access quay (which is what the solution above tries to solve).
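
The cache-first idea could be sketched roughly as follows, assuming podman is the container runtime (as in the log) and that the image is referenced by digest. The function name `pull_if_missing` is hypothetical, and this does not show how the expected digest would be obtained without contacting quay, which is the harder part:

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull an image only when it is missing from local storage.
# Usage: pull_if_missing <image-reference>
pull_if_missing() {
    local image=$1
    # `podman image exists` exits 0 when the image is in local storage,
    # so no registry access is needed in that case.
    if podman image exists "${image}"; then
        echo "using cached ${image}" >&2
        return 0
    fi
    # Not cached: fall back to a regular pull from the registry.
    podman pull "${image}"
}

# Example (the digest is a placeholder, not a real value):
# pull_if_missing quay.io/kubevirtci/gocli@sha256:<digest>
```

A digest identifies exact content, so a local hit can be trusted without asking the registry whether a tag has moved.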

oshoval · 2024-11-25T09:43:35Z

12:17:34: ./cluster-up/up.sh: line 34: pop_var_context: head of shell_variables not a function context

This can also be fixed, but it is not relevant to this ticket. It seems to happen on every cluster-up failure since the refactor.

EdDev added the kind/bug label Nov 25, 2024

dosubot bot added the kind/flake Categorizes issue or PR as related to a flaky test. label Nov 25, 2024
