Simplify the recovery when an existing node crashes during an expansion #646

Closed
adejanovski opened this issue Apr 30, 2024 · 1 comment · Fixed by #673
Labels
done Issues in the state 'done'

Comments

@adejanovski
Contributor

adejanovski commented Apr 30, 2024

There can be cases where a scale-up operation is blocked by another, crashlooping pod.
In this case the new pod carries the "Starting" label, which prevents the pre-existing pod from coming back up after a fix is applied (for example when it's rescheduled on a new worker).

One way we could solve this would be to allow pods to start right away if they host Cassandra nodes that have already bootstrapped in the past, and if they're not part of a replacement.
This way we could have faster startups overall while still protecting ourselves from concurrent bootstraps.
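A minimal sketch of the serialization that causes this blockage, assuming a hypothetical node-state label (the actual label key and values used by cass-operator may differ):

```go
package main

import "fmt"

// Hypothetical label key/values; the real names used by cass-operator may differ.
const (
	nodeStateLabel = "cassandra.datastax.com/node-state"
	stateStarting  = "Starting"
)

// pod is a simplified stand-in for a corev1.Pod.
type pod struct {
	Name   string
	Labels map[string]string
}

// anotherNodeIsStarting mimics the one-node-at-a-time gate: as long as any
// pod carries the "Starting" state, no other pod is allowed to start.
func anotherNodeIsStarting(pods []pod) bool {
	for _, p := range pods {
		if p.Labels[nodeStateLabel] == stateStarting {
			return true
		}
	}
	return false
}

func main() {
	pods := []pod{
		{Name: "dc1-rack1-sts-0", Labels: map[string]string{nodeStateLabel: "Started"}},
		// The new pod added by the expansion is stuck in "Starting"...
		{Name: "dc1-rack1-sts-3", Labels: map[string]string{nodeStateLabel: stateStarting}},
	}
	// ...so the rescheduled pre-existing pod is never allowed to come back up.
	fmt.Println("blocked:", anotherNodeIsStarting(pods))
}
```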

@adejanovski adejanovski changed the title Ensure we can replace a crashlooping pod while a scale up is happening Simplify the recovery when an existing node crashes during an expansion May 3, 2024
@adejanovski adejanovski moved this to Assess/Investigate in K8ssandra May 3, 2024
@adejanovski adejanovski added the assess Issues in the state 'assess' label May 3, 2024
@adejanovski
Contributor Author

Currently:

  • Only one node starts at a time -> one node across the whole DC, with round-robin between racks

Conditions to allow fast-lane startups (decision sketched below):

  • Pod has the Ready-To-Start label

Conditions to disallow fast-lane startups:

  • the node is planned for a replacement
  • the node has never fully bootstrapped
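
A sketch of what the fast-lane decision could look like, combining the conditions above; the type, field and function names are illustrative, not the operator's actual API:

```go
package main

import "fmt"

// startCandidate gathers the hypothetical inputs for a single pod.
type startCandidate struct {
	HasReadyToStartLabel bool // pod carries the Ready-To-Start label
	PlannedReplacement   bool // node is the target of a replace operation
	BootstrappedBefore   bool // node already completed a bootstrap (see the nodeStatuses check below)
}

// allowFastLaneStart encodes the conditions listed above: the pod must be
// ready to start, must not be a replacement target, and must have fully
// bootstrapped at least once in the past.
func allowFastLaneStart(c startCandidate) bool {
	if !c.HasReadyToStartLabel {
		return false
	}
	if c.PlannedReplacement || !c.BootstrappedBefore {
		return false
	}
	return true
}

func main() {
	// The pre-existing node that crashed during the expansion: eligible.
	fmt.Println(allowFastLaneStart(startCandidate{true, false, true}))
	// The brand new node that has never bootstrapped: not eligible.
	fmt.Println(allowFastLaneStart(startCandidate{true, false, false}))
}
```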

Things to verify:

  • If a DC is fully stopped, bringing it back up with all nodes starting concurrently works without a hiccup

Technical aspects
How do we ensure a node was part of the ring before?

The pod name appears in the cassdc .status.nodeStatuses struct together with a host ID.
cass-operator needs to ensure that an entry is added only for nodes that have successfully completed bootstrap (their state is UN).
nodeStatuses has to be a faithful representation of the DC topology: any node removal should be reflected there after a scale-in/down operation, which can be detected through the LEAVING/REMOVED states in the endpoint states.
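
A sketch of the bootstrap-history check, using simplified stand-ins for the cassdc status types (the real cass-operator structs and field names may differ):

```go
package main

import "fmt"

// Simplified stand-ins for the cassdc status types; the real cass-operator
// definitions contain more fields and may use different names.
type nodeStatus struct {
	HostID string
}

type cassdcStatus struct {
	// Keyed by pod name, mirroring .status.nodeStatuses.
	NodeStatuses map[string]nodeStatus
}

// hasBootstrappedBefore returns true when the pod already has an entry with a
// host ID in nodeStatuses. This only holds up if the operator records the
// entry exclusively once the node has reached UN, and prunes it when the node
// leaves the ring (LEAVING/REMOVED endpoint states) after a scale-down.
func hasBootstrappedBefore(status cassdcStatus, podName string) bool {
	ns, ok := status.NodeStatuses[podName]
	return ok && ns.HostID != ""
}

func main() {
	status := cassdcStatus{NodeStatuses: map[string]nodeStatus{
		"dc1-rack1-sts-0": {HostID: "11111111-2222-3333-4444-555555555555"}, // bootstrapped earlier
	}}
	fmt.Println(hasBootstrappedBefore(status, "dc1-rack1-sts-0")) // true: fast lane allowed
	fmt.Println(hasBootstrappedBefore(status, "dc1-rack1-sts-3")) // false: never bootstrapped
}
```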

@adejanovski adejanovski moved this from Assess/Investigate to Ready For Review in K8ssandra Jul 9, 2024
@adejanovski adejanovski added ready-for-review Issues in the state 'ready-for-review' and removed assess Issues in the state 'assess' labels Jul 9, 2024
@adejanovski adejanovski moved this from Ready For Review to Review in K8ssandra Jul 9, 2024
@adejanovski adejanovski added review Issues in the state 'review' and removed ready-for-review Issues in the state 'ready-for-review' discuss labels Jul 9, 2024
@github-project-automation github-project-automation bot moved this from Review to Done in K8ssandra Jul 10, 2024
@adejanovski adejanovski added done Issues in the state 'done' and removed review Issues in the state 'review' labels Jul 10, 2024