Autoscaling recommendations with GitLab #439

Open
zifeo opened this issue Nov 20, 2024 · 4 comments

Comments

@zifeo

zifeo commented Nov 20, 2024

Following the advice in the README, a newly added node can experience some delay before images can be pulled through the proxy. This may work for classic deployments thanks to retries, but it seems to cause issues when GitLab CI manages the job. Is there any other recommendation for such a setup?

WARNING: Event retrieved from the cluster: 0/7 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }, 1 node(s) had untolerated taint {nvidia.com/gpu: present}, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/7 nodes are available: 1 No preemption victims found for incoming pod, 6 Preemption is not helpful for scheduling.
WARNING: Event retrieved from the cluster: Failed to pull image "localhost:7439/registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.3.1": failed to pull and unpack image "localhost:7439/registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.3.1": failed to resolve reference "localhost:7439/registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.3.1": failed to do request: Head "http://localhost:7439/v2/registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper/manifests/x86_64-v17.3.1": dial tcp [::1]:7439: connect: connection refused
WARNING: Event retrieved from the cluster: Error: ErrImagePull
WARNING: Event retrieved from the cluster: Error: ImagePullBackOff
WARNING: Failed to pull image "localhost:7439/registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.3.1" with policy "": image pull failed: Back-off pulling image "localhost:7439/registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.3.1"
ERROR: Job failed: prepare environment: waiting for pod running: pulling image "localhost:7439/registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.3.1": image pull failed: Back-off pulling image "localhost:7439/registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.3.1". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
@zifeo zifeo changed the title Autoscaling recommendation with GitLab Autoscaling recommendations with GitLab Nov 20, 2024
@plaffitt
Contributor

I'm not sure I understand. What is GitLab CI doing here?

@zifeo
Author

zifeo commented Nov 22, 2024

@paullaffitte GitLab Runner launches CI jobs on demand on Kubernetes. When there are too many jobs and Kubernetes decides to scale up the node count, there is a race condition between the new job and the proxy becoming available on the new node. This usually works with classic workloads because of the automatic retry. However, for a job managed by GitLab Runner, the failure on the init container is not retried. Have you faced a similar situation, and what else could be tried?
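
To illustrate the race described above, here is a minimal sketch of the kind of retry that lets classic workloads succeed: it polls the proxy port until it accepts TCP connections. The host and port come from the log above (`localhost:7439`). The function name `wait_for_proxy` and its `attempts` and `delay` parameters are hypothetical, and this is not something GitLab Runner or the proxy provides.

```python
import socket
import time


def wait_for_proxy(host="localhost", port=7439, attempts=10, delay=1.0):
    """Return True once a TCP connection to the proxy succeeds.

    Hypothetical helper: tries up to `attempts` times, sleeping `delay`
    seconds after each refused or failed connection.
    """
    for _ in range(attempts):
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Covers "connection refused" as seen in the pull error above.
            time.sleep(delay)
    return False
```

A check like this only tells you that the port is open, not that the proxy can already serve images, so treat it as a rough readiness signal.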

@plaffitt
Contributor

Did you try setting a pull policy? https://docs.gitlab.com/runner/executors/kubernetes/#set-a-pull-policy

@gen-xu

gen-xu commented Dec 25, 2024

I think this can be achieved by adding a taint like node=waiting:NoSchedule to the newly autoscaled nodes. Once the daemonset is ready, something has to be responsible for removing the taint so that new pods can be scheduled on the node.
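
A minimal sketch of the taint-removal step, assuming the startup taint is exactly `node=waiting:NoSchedule` as in the comment above. The helper names `without_taint` and `taint_patch` are hypothetical. The code only builds the patch body and leaves out how it is sent to the API server and how daemonset readiness is detected.

```python
def without_taint(taints, key="node", value="waiting", effect="NoSchedule"):
    """Return a copy of the taint list with the startup taint removed."""
    return [
        t for t in taints
        if not (
            t.get("key") == key
            and t.get("value") == value
            and t.get("effect") == effect
        )
    ]


def taint_patch(taints):
    """Build a merge-patch body that replaces the node's taint list.

    Assumption: the caller passes the node's current spec.taints and
    applies the result as a JSON merge patch on the Node object.
    """
    return {"spec": {"taints": without_taint(taints)}}
```

For a one-off manual removal, `kubectl taint nodes <node-name> node=waiting:NoSchedule-` has the same effect. The trailing `-` removes the taint.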
