Kubernetes 1.26 upgrade #2458

Open
wants to merge 38 commits into
base: master
@tfriedel tfriedel commented Jun 13, 2023

This PR updates cortex to Kubernetes 1.26 and also updates most components to newer versions as described in versions.md.

We also attempted an upgrade to Kubernetes 1.27, but it failed because of an open issue in the amazon-vpc-cni-k8s plugin that affects the Prometheus adapter.

Notes:

  • Docker (dockershim) support was removed in Kubernetes 1.24, so the cluster now uses containerd as its container runtime. The only place Docker was used in the k8s cluster was to check whether images can be fetched. We disabled this check because it's not essential and would be non-trivial to port. If someone wants to fix this, please feel free to submit a patch.
  • We created the AMI mappings using go run build/generate_ami_mapping.go manager/manifests/ami.json public. However, our AWS account cannot access all regions, so we had to comment out the regions that were not supported. Again, if someone wants to submit a patch for this, it would be appreciated.
  • New EC2 instance types were added, but servicequotas.go and validateInstanceType() were not touched. Anyone interested in using the newer instance types may have to look into this. To be on the safe side, don't use the new instance types for now.
  • cluster-autoscaler was forked and patches were applied to a recent version. We switched to the autoscaling/v2 API (from v2beta1 / v2beta2).
  • We had to install the EBS CSI driver.
  • AWS started using minimal base images that don't include shell commands like cat, tar, sh, etc. (e.g. kube-proxy:v1.26.2-minimal-eksbuild.1). Because of that, we had to adapt some scripts to get configuration directly from Kubernetes instead of the file system.
  • we ran into this issue: CustomResourceDefinition.apiextensions.k8s.io "prometheuses.monitoring.coreos.com" is invalid: metadata.annotations: Too long: must have at most 262144 bytes. We fixed it by using kubectl apply --server-side.
  • In the linter script we had to disable looppointer, as it was giving us errors we couldn't resolve. They looked like:
    looppointer: internal error: package "math" without types was imported
  • During cluster creation, a few error messages appear that look like ○ configuring metrics E0610 20:44:44.408766 1054 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E0610 20:44:44.563268. We assume it's a false alarm, as metric collection seems to work, but if anyone has any insight into this, please let us know.

We ran 'make lint' and 'make test' and tested manually with our model server over 2 days, and have not noticed any issues so far. We did not run the e2e tests in the Makefile.

For some reason the CircleCI script doesn't find the linter, even though it's installed and the PATH modification looks correct. If someone knows how to fix it, please let us know.

Please test this thoroughly yourself before using it in production.

To use this version you will have to build self-hosted images. Follow the steps in
CONTRIBUTING.md up to "make images-all" and also read self-hosted-images.
Use go1.20.4 linux/amd64. A user tried Go 1.21 and it didn't work with this version.
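If you want to guard against building with the wrong toolchain, a small check like the following could be run before the build. This is only a sketch: isSupportedToolchain is a hypothetical helper that is not part of the repo, and accepting every 1.20.x release is our assumption, since only go1.20.4 was actually used.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// isSupportedToolchain (hypothetical helper) reports whether a version string,
// in the format returned by runtime.Version(), belongs to the Go 1.20 series.
// Assumption: any 1.20.x patch release behaves like the tested go1.20.4.
func isSupportedToolchain(version string) bool {
	return version == "go1.20" || strings.HasPrefix(version, "go1.20.")
}

func main() {
	v := runtime.Version() // version of the toolchain that compiled this program
	if !isSupportedToolchain(v) {
		fmt.Printf("warning: building with %s; this PR was built with go1.20.4\n", v)
		return
	}
	fmt.Printf("toolchain %s is in the 1.20 series\n", v)
}
```

Running it with go run uses the same toolchain that would build the images, so the warning shows up before a long image build starts.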

edit: we have been running this version in production for a week and have not noticed any problems. We only use the realtime API.


checklist:

  • run make test and make lint
  • test manually (i.e. build/push all images, restart operator, and re-deploy APIs)
  • update examples
  • update docs and add any new files to summary.md (view in gitbook after merging)
  • cherry-pick into release branches if applicable
  • alert the dev team if the dev environment changed

tfriedel added 30 commits May 31, 2023 16:39
…heus operator / config reloader, fluentbit, cluster autoscaler, metrics server, neuron device plugin and scheduler
… default) and set container-runtime to containerd, as dockerd is not supported after kubernetes 1.24
…erd to containerd. While the cluster now starts, we can't use cortex deploy because it requires docker. Need to find a way to give it access to docker.
@CLAassistant

CLAassistant commented Jun 13, 2023

CLA assistant check
All committers have signed the CLA.

@aleksandr-smechov

Awesome work 😎 thanks for keeping the project alive! I'll test this out on our setup this week.

@tfriedel tfriedel marked this pull request as ready for review June 13, 2023 21:00
3 participants