tigera-operator blocks deployment if one node cannot run calico-node #9540

Open
christian-schlichtherle opened this issue Nov 28, 2024 · 0 comments

I followed the multi-node K3s instructions in your documentation, installed the tigera-operator resource manifests for version 3.29.1, and then deployed my custom resource manifests of kind=Installation and kind=APIServer.

In my test setup, there is an edge node which fails the readiness probe for its calico-node container because it can't use ipset for some reason:

2024-11-28 12:21:04.487 [ERROR][147075] felix/ipsets.go 671: Bad return code from 'ipset list -name'. error=exit status 1 family="inet" stderr="ipset v7.11: Kernel error received: Invalid argument\n"
2024-11-28 12:21:04.487 [ERROR][147075] felix/ipsets.go 409: Failed to get the list of ipsets error=exit status 1 family="inet"

This in turn blocks the tigera-operator from proceeding with the deployment, because it waits for ALL daemonset pods to report ready state. As a result, Calico is not rolled out, and it looks like the CNI on the cluster is completely broken.

Expected Behavior

From my point of view this is a bug: in a distributed system, there's always a chance that a node is not working for some reason. In our case, the node in question is an IoT device where we have only limited control over the Linux kernel (because it's part of a Linux distribution provided by the hardware vendor). The tigera-operator should tolerate the outage of nodes to some extent and just continue with the deployment.
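
To illustrate the kind of tolerance I mean, here is a minimal Python sketch. It is not the operator's actual logic, which I have not inspected. The names `DaemonSetStatus`, `rollout_may_proceed` and `max_unavailable` are hypothetical, and only the two status fields correspond to the standard Kubernetes DaemonSet status fields `desiredNumberScheduled` and `numberReady`.

```python
from dataclasses import dataclass


@dataclass
class DaemonSetStatus:
    # Hypothetical container for two fields of a Kubernetes DaemonSet status.
    desired_number_scheduled: int
    number_ready: int


def rollout_may_proceed(status: DaemonSetStatus, max_unavailable: int = 0) -> bool:
    """Return True if at most max_unavailable desired pods are not ready.

    max_unavailable is an assumed tuning knob, not an existing operator setting.
    """
    not_ready = status.desired_number_scheduled - status.number_ready
    return not_ready <= max_unavailable


# Situation from this report: the cloud node is ready, the edge node is not.
status = DaemonSetStatus(desired_number_scheduled=2, number_ready=1)

print(rollout_may_proceed(status))                     # strict policy: False
print(rollout_may_proceed(status, max_unavailable=1))  # tolerant policy: True
```

With `max_unavailable=0` the sketch reproduces the blocking behavior I observed; any positive budget would let the rollout continue on the healthy nodes while the broken node is still reported as not ready.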

In my case, I had to delete the edge node from the cluster before the tigera-operator advanced with the rollout for the cloud nodes, and then add the edge node back later.

Your Environment

  • Calico version: 3.29.1
  • Calico dataplane: iptables
$ kubectl get nodes -o wide
[...]
cloud-node-123  Ready    <none>  9d    v1.31.2+k3s1   172.18.0.16   1.2.3.4  Debian GNU/Linux 12 (bookworm)   6.1.0-28-arm64   containerd://1.7.22-k3s1
edge-node-123   Ready    <none>  89m   v1.31.2+k3s1   100.65.0.13   <none>   Ubuntu 24.04.1 LTS               5.15.137         containerd://1.7.22-k3s1