tigera-operator blocks deployment if one node cannot run calico-node #9540

christian-schlichtherle · 2024-11-28T12:36:55Z

I was following the instructions for multi-node K3s from your documentation and installed the tigera-operator resource manifests for version 3.29.1 and then deployed my custom resource manifests of kind=Installation and kind=APIServer.

In my test setup, there is an edge node which fails the readiness probe for its calico-node container because it can't use ipset for some reason:

2024-11-28 12:21:04.487 [ERROR][147075] felix/ipsets.go 671: Bad return code from 'ipset list -name'. error=exit status 1 family="inet" stderr="ipset v7.11: Kernel error received: Invalid argument\n"
2024-11-28 12:21:04.487 [ERROR][147075] felix/ipsets.go 409: Failed to get the list of ipsets error=exit status 1 family="inet"

As a result, the tigera-operator blocks the deployment because it waits for ALL daemonset pods to report ready. Consequently, Calico is not rolled out and it looks like the CNI on the cluster is completely broken.

Expected Behavior

From my point of view this is a bug: in a distributed system, there's always a chance that a node is not working for some reason. In our case, the node in question is an IoT device where we have only limited control over the Linux kernel (because it's part of a Linux distribution provided by the hardware vendor). The tigera-operator should tolerate the outage of some nodes and continue with the deployment on the remaining ones.
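
The kind of tolerance described above could be sketched roughly as follows. This is only an illustration and not the operator's actual logic: the function `rollout_may_proceed` and its `max_unavailable` parameter are hypothetical names, and the two status fields merely mirror `desiredNumberScheduled` and `numberReady` from a Kubernetes DaemonSet status.

```python
from dataclasses import dataclass


@dataclass
class DaemonSetStatus:
    """Subset of the counters Kubernetes reports in a DaemonSet's status."""
    desired_number_scheduled: int  # pods that should be running
    number_ready: int              # pods that pass their readiness probe


def rollout_may_proceed(status: DaemonSetStatus, max_unavailable: int = 1) -> bool:
    """Hypothetical gate: proceed while at most `max_unavailable` pods are not ready."""
    not_ready = status.desired_number_scheduled - status.number_ready
    return not_ready <= max_unavailable


# Two nodes (one cloud, one edge); calico-node is ready only on the cloud node.
status = DaemonSetStatus(desired_number_scheduled=2, number_ready=1)
strict = rollout_may_proceed(status, max_unavailable=0)    # wait-for-all behavior
tolerant = rollout_may_proceed(status, max_unavailable=1)  # tolerate one bad node
```

With `max_unavailable=0` the gate reproduces the blocking behavior reported here, while any positive value lets the rollout continue on the healthy nodes.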

In my case, I had to delete the edge node from the cluster to see the tigera-operator advancing with the rollout for the cloud nodes and then add the edge node back later.

Your Environment

Calico version: 3.29.1
Calico dataplane: iptables

$ kubectl get nodes -o wide
[...]
cloud-node-123  Ready    <none>  9d    v1.31.2+k3s1   172.18.0.16   1.2.3.4  Debian GNU/Linux 12 (bookworm)   6.1.0-28-arm64   containerd://1.7.22-k3s1
edge-node-123   Ready    <none>  89m   v1.31.2+k3s1   100.65.0.13   <none>   Ubuntu 24.04.1 LTS               5.15.137         containerd://1.7.22-k3s1
