
Cyclops operator creates orphan nodes when it is not able to completely drain the node #75

skaushal-splunk opened this issue May 23, 2024 · 2 comments

skaushal-splunk commented May 23, 2024

Describe the bug
When draining a node fails for some reason, that node is left running but detached from any autoscaling group, i.e. an orphan node.

To Reproduce

  1. Create a node that has pods in a pending state. This is one of the reasons node draining fails; there are other reasons as well, but the goal is simply a scenario where the drain fails.
  2. Create a CycleNodeRequest to bring up a new node and detach the old one (a minimal sketch follows this list).
  3. Create a CycleNodeStatus to drain the old node.
  4. Run both the CycleNodeRequest and the CycleNodeStatus.
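
For reference, this is roughly how the CycleNodeRequest can be created programmatically, as a sketch using the Kubernetes Python client. The atlassian.com/v1 group/version, the kube-system namespace, the node group name, the selector labels and the cycle settings below are assumptions/placeholders; adjust them to your Cyclops install.

```python
# Sketch: create a CycleNodeRequest to start cycling a node group.
# Assumptions: Cyclops CRDs are installed under group atlassian.com/v1 and the
# operator runs in kube-system; nodeGroupName and the selector labels are
# placeholders for your environment.
from kubernetes import client, config

config.load_kube_config()

cnr = {
    "apiVersion": "atlassian.com/v1",
    "kind": "CycleNodeRequest",
    "metadata": {"name": "cycle-example", "namespace": "kube-system"},
    "spec": {
        "nodeGroupName": "my-node-group",                  # placeholder node group / ASG name
        "selector": {"matchLabels": {"role": "worker"}},   # placeholder node labels
        "cycleSettings": {"method": "Drain", "concurrency": 1},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="atlassian.com",
    version="v1",
    namespace="kube-system",
    plural="cyclenoderequests",
    body=cnr,
)
```

Watching the CycleNodeRequest and CycleNodeStatus objects in that namespace then shows where the drain gets stuck.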

Current behavior
CycleNodeRequest runs first, detaches the old node and creates a new node. CycleNodeStatus is not able to drain the old node because it sees some pods in the pending (not Ready) state.

As a result we have an old node which is not attached to any autoscaling group but still has pods running on it. The new node comes up but only has daemonsets running on it.

Expected behavior

  1. The old node should be drained and the new node should end up with all the pods from the old node.
  2. There should not be any node that is not attached to an autoscaling group.

Kubernetes Cluster Version
v1.25

Cyclops Version
v1.7.0

skaushal-splunk added the bug label on May 23, 2024
awprice (Collaborator) commented May 27, 2024

@skaushal-splunk Are you able to clarify what you mean by "Create a node which has pods in the pending state"?

As far as I'm aware, this is unlikely to be possible - a pod can't be scheduled onto a node and still be pending? Is there another factor causing the pods to be in this state that you aren't mentioning?

skaushal-splunk (Author) commented Jun 13, 2024

I meant a pod which is not in the Ready state.

It could be a pod in one of these states:

  1. Completed
  2. Error
  3. CrashLoopBackOff
  4. Running but not Ready, i.e. the readiness probe fails.

In all of these scenarios, the Cyclops operator fails to drain the node but has already removed the EC2 instance from the autoscaling group.

In this state, the Kubernetes node still has pods running on it, but the corresponding EC2 instance is not part of any autoscaling group (a quick check for this is sketched below).
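
This is roughly how the orphan state can be confirmed from the AWS side; a sketch with boto3, where the instance ID is a placeholder taken from the node's spec.providerID:

```python
# Sketch: verify the EC2 instance behind the stuck node is no longer in any ASG.
# "i-0123456789abcdef0" is a placeholder; use the real ID from the node's
# spec.providerID.
import boto3

autoscaling = boto3.client("autoscaling")
resp = autoscaling.describe_auto_scaling_instances(InstanceIds=["i-0123456789abcdef0"])

if not resp["AutoScalingInstances"]:
    print("instance is not attached to any autoscaling group (orphan)")
else:
    for inst in resp["AutoScalingInstances"]:
        print("attached to", inst["AutoScalingGroupName"])
```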

At this point the entire node roll stops. We have to manually drain that node (roughly as sketched below) before the Cyclops operator continues.
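
A rough sketch of that manual drain, assuming a recent Kubernetes Python client; the node name is a placeholder. It cordons the node and evicts everything that is not a DaemonSet pod or a static (mirror) pod:

```python
# Sketch of the manual drain workaround: cordon the orphaned node, then evict
# every pod on it that is not owned by a DaemonSet and is not a mirror pod.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NODE = "ip-10-0-0-1.ec2.internal"  # placeholder: the orphaned node

# Cordon: mark the node unschedulable.
core.patch_node(NODE, {"spec": {"unschedulable": True}})

# Evict the remaining workload pods so they reschedule onto the new node.
pods = core.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
for pod in pods.items:
    owners = pod.metadata.owner_references or []
    if any(o.kind == "DaemonSet" for o in owners):
        continue  # daemonset pods stay with the node
    if (pod.metadata.annotations or {}).get("kubernetes.io/config.mirror"):
        continue  # static pods are managed by the kubelet, not evictable
    core.create_namespaced_pod_eviction(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body=client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace
            )
        ),
    )
```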
