
Cyclops operator creates orphan nodes when it is not able to completely drain the node #75

skaushal-splunk opened this issue May 23, 2024 · 2 comments

skaushal-splunk commented May 23, 2024

Describe the bug
When draining a node fails for some reason, that node is left running but detached from any autoscaling group, i.e. an orphan node.

To Reproduce

  1. Create a node that has pods in a pending state. This is one of the reasons node draining fails; there are other reasons as well, but the goal is simply a scenario where the drain fails.
  2. Create a CycleNodeRequest to bring up a new node and detach the old one (a minimal sketch follows this list).
  3. Create a CycleNodeStatus to drain the old node.
  4. Run both the CycleNodeRequest and the CycleNodeStatus.
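
For reference, this is roughly how the CycleNodeRequest can be created programmatically, as a sketch using the Kubernetes Python client. The atlassian.com/v1 group/version, the kube-system namespace, the node group name, the selector labels and the cycle settings below are assumptions/placeholders; adjust them to your Cyclops install.

```python
# Sketch: create a CycleNodeRequest to start cycling a node group.
# Assumptions: Cyclops CRDs are installed under group atlassian.com/v1 and the
# operator runs in kube-system; nodeGroupName and the selector labels are
# placeholders for your environment.
from kubernetes import client, config

config.load_kube_config()

cnr = {
    "apiVersion": "atlassian.com/v1",
    "kind": "CycleNodeRequest",
    "metadata": {"name": "cycle-example", "namespace": "kube-system"},
    "spec": {
        "nodeGroupName": "my-node-group",                  # placeholder node group / ASG name
        "selector": {"matchLabels": {"role": "worker"}},   # placeholder node labels
        "cycleSettings": {"method": "Drain", "concurrency": 1},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="atlassian.com",
    version="v1",
    namespace="kube-system",
    plural="cyclenoderequests",
    body=cnr,
)
```

Watching the CycleNodeRequest and CycleNodeStatus objects in that namespace then shows where the drain gets stuck.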

Current behavior
CycleNodeRequest runs first, detaches the old node and creates a new node. CycleNodeStatus is not able to drain the old node because it sees some pods in the pending (not Ready) state.

As a result we have an old node which is not attached to any autoscaling group but still has pods running on it. The new node comes up but only has daemonsets running on it.

Expected behavior

  1. The old node should be drained and the new node should end up with all the pods from the old node.
  2. There should not be any node that is not attached to an autoscaling group.

Kubernetes Cluster Version
v1.25

Cyclops Version
v1.7.0

skaushal-splunk added the bug label on May 23, 2024
awprice (Collaborator) commented May 27, 2024

@skaushal-splunk Are you able to clarify what you mean by "Create a node which has pods in the pending state"?

As far as I'm aware, this is unlikely to be possible - a pod can't be scheduled onto a node and still be pending? Is there another factor causing the pods to be in this state that you aren't mentioning?

skaushal-splunk (Author) commented Jun 13, 2024

I meant a pod which is not in the Ready state.

It could be a pod in one of these states:

  1. Completed
  2. Error
  3. CrashLoopBackOff
  4. Running but not Ready, i.e. the readiness probe fails.

In all of these scenarios, the Cyclops operator fails to drain the node but has already removed the EC2 instance from the autoscaling group.

In this state, the Kubernetes node still has pods running on it, but the corresponding EC2 instance is not part of any autoscaling group (a quick check for this is sketched below).
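
This is roughly how the orphan state can be confirmed from the AWS side; a sketch with boto3, where the instance ID is a placeholder taken from the node's spec.providerID:

```python
# Sketch: verify the EC2 instance behind the stuck node is no longer in any ASG.
# "i-0123456789abcdef0" is a placeholder; use the real ID from the node's
# spec.providerID.
import boto3

autoscaling = boto3.client("autoscaling")
resp = autoscaling.describe_auto_scaling_instances(InstanceIds=["i-0123456789abcdef0"])

if not resp["AutoScalingInstances"]:
    print("instance is not attached to any autoscaling group (orphan)")
else:
    for inst in resp["AutoScalingInstances"]:
        print("attached to", inst["AutoScalingGroupName"])
```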

At this point the entire node roll stops. We have to manually drain that node (roughly as sketched below) before the Cyclops operator continues.
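
A rough sketch of that manual drain, assuming a recent Kubernetes Python client; the node name is a placeholder. It cordons the node and evicts everything that is not a DaemonSet pod or a static (mirror) pod:

```python
# Sketch of the manual drain workaround: cordon the orphaned node, then evict
# every pod on it that is not owned by a DaemonSet and is not a mirror pod.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NODE = "ip-10-0-0-1.ec2.internal"  # placeholder: the orphaned node

# Cordon: mark the node unschedulable.
core.patch_node(NODE, {"spec": {"unschedulable": True}})

# Evict the remaining workload pods so they reschedule onto the new node.
pods = core.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
for pod in pods.items:
    owners = pod.metadata.owner_references or []
    if any(o.kind == "DaemonSet" for o in owners):
        continue  # daemonset pods stay with the node
    if (pod.metadata.annotations or {}).get("kubernetes.io/config.mirror"):
        continue  # static pods are managed by the kubelet, not evictable
    core.create_namespaced_pod_eviction(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body=client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace
            )
        ),
    )
```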
