Fix kubeadm race #4

Open
discordianfish opened this issue Jan 15, 2018 · 1 comment

Comments

@discordianfish
Member

Sometimes kubeadm fails, probably because it comes up before etcd has reached quorum (it can be restarted afterwards).

@discordianfish
Member Author

We have kubeadm.service run with After=etcd-member.service, so it starts after the first attempt to start etcd, whether or not that attempt succeeds. etcd may fail on its first one or two starts for various reasons (e.g. the SRV record has not been updated yet), which causes kubeadm to fail. Since kubeadm.service is Type=oneshot, systemd can't restart it.

Here is the log from a rolling upgrade showing the problem:

Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.362023 I | embed: listening for client requests on 0.0.0.0:2379
Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.388040 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-180-32.ec2.internal:2380
Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.388746 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-182-220.ec2.internal:2380
Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.411176 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-185-127.ec2.internal:2380
Jan 16 11:49:11 ip-172-20-181-150 etcd-wrapper[822]: 2018-01-16 11:49:11.442449 C | etcdmain: error setting up initial cluster: cannot find local etcd member "ip-172-20-181-150.ec2.internal" in SRV records
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: etcd-member.service: Main process exited, code=exited, status=1/FAILURE
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: Failed to start etcd (System Application Container).
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: etcd-member.service: Unit entered failed state.
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: etcd-member.service: Failed with result 'exit-code'.
Jan 16 11:49:11 ip-172-20-181-150 systemd[1]: Starting Kubeadm init...
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [kubeadm] WARNING: kubeadm is currently in beta
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [init] Using Kubernetes version: v1.8.4
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [init] Using Authorization modes: [Node RBAC]
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [init] WARNING: For cloudprovider integrations to work --cloud-provider must be set for all kubelets in the cluster.
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]:         (/etc/systemd/system/kubelet.service.d/10-kubeadm.conf should be edited for this purpose)
Jan 16 11:49:11 ip-172-20-181-150 kubeadm[858]: [preflight] Running pre-flight checks.
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING KubeletVersion]: couldn't get kubelet version: exec: "kubelet": executable file not found in $PATH
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING FileExisting-socat]: socat not found in system path
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING FileExisting-crictl]: crictl not found in system path
Jan 16 11:49:15 ip-172-20-181-150 kubeadm[858]:         [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
Jan 16 11:49:22 ip-172-20-181-150 systemd[1]: etcd-member.service: Service hold-off time over, scheduling restart.
Jan 16 11:49:22 ip-172-20-181-150 systemd[1]: Stopped etcd (System Application Container).
Jan 16 11:49:22 ip-172-20-181-150 systemd[1]: Starting etcd (System Application Container)...
Jan 16 11:49:22 ip-172-20-181-150 rkt[974]: rm: unable to resolve UUID from file: open /var/lib/coreos/etcd-member-wrapper.uuid: no such file or directory
Jan 16 11:49:22 ip-172-20-181-150 rkt[974]: rm: failed to remove one or more pods
Jan 16 11:49:22 ip-172-20-181-150 etcd-member-add[984]: Adding ourself to cluster
Jan 16 11:49:24 ip-172-20-181-150 etcd-member-add[984]: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://ip-172-20-185-127.ec2.internal.:2379 has no leader
Jan 16 11:49:24 ip-172-20-181-150 etcd-member-add[984]: ; error #1: client: etcd member https://ip-172-20-182-220.ec2.internal.:2379 has no leader
Jan 16 11:49:24 ip-172-20-181-150 etcd-member-add[984]: ; error #2: client: endpoint https://ip-172-20-180-32.ec2.internal.:2379 exceeded header timeout
Jan 16 11:49:24 ip-172-20-181-150 etcd-wrapper[992]: ++ id -u etcd
Jan 16 11:49:24 ip-172-20-181-150 etcd-wrapper[992]: + exec /usr/bin/rkt run --volume etcd-ssl,kind=host,source=/etc/ssl/etcd --mount volume=etcd-ssl,target=/etc/ssl/etcd --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.520919 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.521409 I | pkg/flags: recognized and used environment variable ETCD_DISCOVERY_SRV=int2.example.com
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.521669 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.521911 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=int2.example.com
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.522186 I | pkg/flags: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.522432 W | pkg/flags: unrecognized environment variable ETCD_USER=etcd
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.522666 W | pkg/flags: unrecognized environment variable ETCD_IMAGE_TAG=v3.1.10
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.522916 I | etcdmain: etcd Version: 3.1.10
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.523144 I | etcdmain: Git SHA: 0520cb9
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.523368 I | etcdmain: Go Version: go1.8.3
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.523618 I | etcdmain: Go OS/Arch: linux/amd64
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.523841 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.524106 I | embed: peerTLS: cert = /etc/ssl/etcd/peer.crt, key = /etc/ssl/etcd/peer.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.525221 I | embed: listening for peers on https://172.20.181.150:2380
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.525527 I | embed: listening for client requests on 0.0.0.0:2379
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.552553 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-185-127.ec2.internal:2380
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.554191 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-180-32.ec2.internal:2380
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.555779 N | discovery: got bootstrap from DNS for etcd-server-ssl at https://ip-172-20-182-220.ec2.internal:2380
Jan 16 11:49:25 ip-172-20-181-150 etcd-wrapper[992]: 2018-01-16 11:49:25.557213 C | etcdmain: error setting up initial cluster: cannot find local etcd member "ip-172-20-181-150.ec2.internal" in SRV records
Jan 16 11:49:25 ip-172-20-181-150 systemd[1]: etcd-member.service: Main process exited, code=exited, status=1/FAILURE
Jan 16 11:49:25 ip-172-20-181-150 systemd[1]: Failed to start etcd (System Application Container).
Jan 16 11:49:25 ip-172-20-181-150 systemd[1]: etcd-member.service: Unit entered failed state.
Jan 16 11:49:25 ip-172-20-181-150 systemd[1]: etcd-member.service: Failed with result 'exit-code'.
Jan 16 11:49:30 ip-172-20-181-150 kubeadm[858]: [preflight] Some fatal errors occurred:
Jan 16 11:49:30 ip-172-20-181-150 kubeadm[858]:         [ERROR ExternalEtcdVersion]: couldn't parse external etcd version "": Version string empty
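
One possible way to close the race, as a rough sketch only and not what this repo currently does: have the unit run a small wrapper that polls etcd until it reports healthy and only then starts kubeadm. The helper names (`etcd_is_healthy`, `wait_until`, `run_kubeadm_when_ready`), the plain-HTTP URL, the retry counts and the bare `kubeadm init` command line are all assumptions for illustration. The cluster in the log above uses TLS with client certificates, so a real check would need an SSL context with those certs.

```python
import json
import subprocess
import time
import urllib.request


def etcd_is_healthy(url="http://127.0.0.1:2379/health", timeout=2.0):
    """Return True if the etcd /health endpoint reports a healthy member.

    The URL is an assumption; TLS/client-cert handling is left out.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error or non-JSON body: not ready yet.
        return False
    # etcd reports health as the string "true"; accept a boolean as well.
    return isinstance(body, dict) and body.get("health") in ("true", True)


def wait_until(check, attempts=30, delay=2.0, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final failed attempt
    return False


def run_kubeadm_when_ready():
    """Block until etcd is healthy, then run kubeadm once."""
    if not wait_until(etcd_is_healthy):
        raise SystemExit("etcd did not become healthy in time")
    # Hypothetical invocation; the unit's real kubeadm arguments are not shown in this issue.
    subprocess.run(["kubeadm", "init"], check=True)
```

A wrapper script calling `run_kubeadm_when_ready()` could then replace the direct kubeadm call in the unit. The `sleep` parameter exists only so the retry loop can be exercised without real delays.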
