cluster fails to start if multiple control plane nodes are added. #3680

Open
terryjix opened this issue Jul 11, 2024 · 6 comments
Labels
kind/support Categorizes issue or PR as a support question. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@terryjix

What happened:
the cluster fails to create if I add multiple control plane nodes to the cluster:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker

Error logs

{"level":"warn","ts":"2024-07-11T08:50:49.476205Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00062ee00/172.18.0.5:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0711 08:50:49.476269     249 etcd.go:550] [etcd] Promoting the learner 86e5aab36dbb6fb7 failed: etcdserver: can only promote a learner member which is in sync with leader
etcdserver: can only promote a learner member which is in sync with leader
error creating local etcd static pod manifest file
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runEtcdPhase
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go:156
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:183
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/[email protected]/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/[email protected]/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/[email protected]/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:52
main.main
        k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        runtime/proc.go:271
runtime.goexit
        runtime/asm_amd64.s:1695
error execution phase control-plane-join/etcd
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:260
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:183
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/[email protected]/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/[email protected]/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/[email protected]/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:52
main.main
        k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        runtime/proc.go:271
runtime.goexit
        runtime/asm_amd64.s:169

What you expected to happen:
kind supports creating a Kubernetes cluster with multiple control plane nodes.

How to reproduce it (as minimally and precisely as possible):
Use the following configuration to launch a cluster:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
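
The cluster is then created from this file with the kind CLI; the file name kind-config.yaml below is only an assumption for illustration:

# assuming the configuration above is saved as kind-config.yaml
kind create cluster --config kind-config.yaml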

Anything else we need to know?:

Environment:

  • kind version:
    kind version 0.23.0
  • Runtime info: (use docker info, podman info or nerdctl info):
    Client:
    Version: 25.0.3
    Context: default
    Debug Mode: false
    Plugins:
    buildx: Docker Buildx (Docker Inc.)
    Version: v0.0.0+unknown
    Path: /usr/libexec/docker/cli-plugins/docker-buildx

Server:
Containers: 21
Running: 0
Paused: 0
Stopped: 21
Images: 78
Server Version: 25.0.3
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 64b8a811b07ba6288238eefc14d898ee0b5b99ba
runc version: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.1.94-99.176.amzn2023.x86_64
Operating System: Amazon Linux 2023.5.20240701
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.629GiB
Name: ip-172-31-18-230.eu-west-1.compute.internal
ID: c3b0373c-7367-45d1-8e7b-12a0ff695616
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
binglj.people.aws.dev:443
127.0.0.0/8
Live Restore Enabled: false

  • OS (e.g. from /etc/os-release):
    Amazon Linux 2023
  • Kubernetes version: (use kubectl version):
    1.30.0
  • Any proxies or other special environment settings?:
@terryjix terryjix added the kind/bug Categorizes issue or PR as related to a bug. label Jul 11, 2024
@neolit123
Member

I0711 08:50:49.476269 249 etcd.go:550] [etcd] Promoting the learner 86e5aab36dbb6fb7 failed: etcdserver: can only promote a learner member which is in sync with leader
etcdserver: can only promote a learner member which is in sync with leader
error creating local etcd static pod manifest file

@pacoxu didn't we wait for sync to happen before promote?

@terryjix
Author

I added the following arguments to the configuration file:

  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    featureGates:
      EtcdLearnerMode: false
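
For reference, this is roughly how the patch sits alongside the node list in a single config file (a sketch assembled from the snippets in this thread, not copied from it):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  featureGates:
    EtcdLearnerMode: false
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker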

With this patch applied, the kubelet fails to start with a different error message:

Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.281034     379 factory.go:221] Registration of the systemd container factory successfully
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.281297     379 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.290080     379 factory.go:221] Registration of the containerd container factory successfully
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: E0711 09:46:25.290422     379 manager.go:294] Registration of the raw container factory failed: inotify_init: too many open files
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: E0711 09:46:25.290542     379 kubelet.go:1530] "Failed to start cAdvisor" err="inotify_init: too many open files"

@neolit123
Member

Failed to start cAdvisor" err="inotify_init: too many open files"

maybe an ulimit problem:
#2744 (comment)
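
If this is the inotify limit issue, one way to raise those limits on the host is via sysctl; a minimal sketch (the values are commonly suggested ones, not taken from this thread):

# raise inotify limits on the host (temporary, until reboot)
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
# to persist, add the same keys to /etc/sysctl.conf or a file under /etc/sysctl.d/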

@terryjix
Author

terryjix commented Jul 11, 2024

There is no ulimit issue if I only add one control-plane node to the cluster.
I'm trying to find a way to update the sysctl configuration.

@BenTheElder BenTheElder added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 11, 2024
@BenTheElder
Member

BenTheElder commented Jul 11, 2024

This is a lot of nodes, do you need them? For what purpose?

Most development should prefer single-node clusters. Each node consumes resources from the host, and unlike a "real" cluster, adding more nodes does not actually add more resources (it only appears to). You are almost certainly hitting resource limits on the host (see the known-issues doc re: inotify above, though this may not be the only limit you're hitting).

@BenTheElder BenTheElder added the triage/needs-information Indicates an issue needs more information in order to work on it. label Oct 16, 2024