cluster fails to start if multiple control plane nodes are added. #3680

Open
terryjix opened this issue Jul 11, 2024 · 6 comments
Labels
kind/support Categorizes issue or PR as a support question. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@terryjix

What happened:
the cluster fails to create if I add multiple control plane nodes to the cluster:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker

Error logs

{"level":"warn","ts":"2024-07-11T08:50:49.476205Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00062ee00/172.18.0.5:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0711 08:50:49.476269     249 etcd.go:550] [etcd] Promoting the learner 86e5aab36dbb6fb7 failed: etcdserver: can only promote a learner member which is in sync with leader
etcdserver: can only promote a learner member which is in sync with leader
error creating local etcd static pod manifest file
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runEtcdPhase
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go:156
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:183
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/[email protected]/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/[email protected]/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/[email protected]/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:52
main.main
        k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        runtime/proc.go:271
runtime.goexit
        runtime/asm_amd64.s:1695
error execution phase control-plane-join/etcd
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:260
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:183
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/[email protected]/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/[email protected]/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/[email protected]/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:52
main.main
        k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        runtime/proc.go:271
runtime.goexit
        runtime/asm_amd64.s:169

What you expected to happen:
kind supports creating a Kubernetes cluster with multiple control plane nodes.

How to reproduce it (as minimally and precisely as possible):
Use the following configuration to launch a cluster:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
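
The cluster is then created from this file with the kind CLI; the file name kind-config.yaml below is only an assumption for illustration:

# assuming the configuration above is saved as kind-config.yaml
kind create cluster --config kind-config.yaml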

Anything else we need to know?:

Environment:

  • kind version:
    kind version 0.23.0
  • Runtime info: (use docker info, podman info or nerdctl info):
    Client:
    Version: 25.0.3
    Context: default
    Debug Mode: false
    Plugins:
    buildx: Docker Buildx (Docker Inc.)
    Version: v0.0.0+unknown
    Path: /usr/libexec/docker/cli-plugins/docker-buildx

Server:
Containers: 21
Running: 0
Paused: 0
Stopped: 21
Images: 78
Server Version: 25.0.3
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 64b8a811b07ba6288238eefc14d898ee0b5b99ba
runc version: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.1.94-99.176.amzn2023.x86_64
Operating System: Amazon Linux 2023.5.20240701
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.629GiB
Name: ip-172-31-18-230.eu-west-1.compute.internal
ID: c3b0373c-7367-45d1-8e7b-12a0ff695616
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
binglj.people.aws.dev:443
127.0.0.0/8
Live Restore Enabled: false

  • OS (e.g. from /etc/os-release):
    Amazon Linux 2023
  • Kubernetes version: (use kubectl version):
    1.30.0
  • Any proxies or other special environment settings?:
@terryjix terryjix added the kind/bug Categorizes issue or PR as related to a bug. label Jul 11, 2024
@neolit123
Member

I0711 08:50:49.476269 249 etcd.go:550] [etcd] Promoting the learner 86e5aab36dbb6fb7 failed: etcdserver: can only promote a learner member which is in sync with leader
etcdserver: can only promote a learner member which is in sync with leader
error creating local etcd static pod manifest file

@pacoxu didn't we wait for sync to happen before promote?

@terryjix
Author

I added the following arguments to the configuration file:

  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    featureGates:
      EtcdLearnerMode: false
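
For reference, this is roughly how the patch sits alongside the node list in a single config file (a sketch assembled from the snippets in this thread, not copied from it):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  featureGates:
    EtcdLearnerMode: false
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker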

With this patch applied, the kubelet fails to start with a different error message:

Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.281034     379 factory.go:221] Registration of the systemd container factory successfully
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.281297     379 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.290080     379 factory.go:221] Registration of the containerd container factory successfully
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: E0711 09:46:25.290422     379 manager.go:294] Registration of the raw container factory failed: inotify_init: too many open files
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: E0711 09:46:25.290542     379 kubelet.go:1530] "Failed to start cAdvisor" err="inotify_init: too many open files"

@neolit123
Member

Failed to start cAdvisor" err="inotify_init: too many open files"

maybe an ulimit problem:
#2744 (comment)
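
If this is the inotify limit issue, one way to raise those limits on the host is via sysctl; a minimal sketch (the values are commonly suggested ones, not taken from this thread):

# raise inotify limits on the host (temporary, until reboot)
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
# to persist, add the same keys to /etc/sysctl.conf or a file under /etc/sysctl.d/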

@terryjix
Author

terryjix commented Jul 11, 2024

There is no ulimit issue if I only add one control-plane node to the cluster.
I'm trying to find a way to update the sysctl configuration.

@BenTheElder BenTheElder added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 11, 2024
@BenTheElder
Member

BenTheElder commented Jul 11, 2024

This is a lot of nodes, do you need them? For what purpose?

Most development should prefer single-node clusters. Each node consumes resources from the host, and unlike a "real" cluster, adding more nodes does not actually add more resources (it only appears to). You are almost certainly hitting resource limits on the host (see the known-issues doc re: inotify above, though this may not be the only limit you're hitting).

@BenTheElder BenTheElder added the triage/needs-information Indicates an issue needs more information in order to work on it. label Oct 16, 2024