Simplify the recovery when an existing node crashes during an expansion #646

Closed
adejanovski opened this issue Apr 30, 2024 · 1 comment · Fixed by #673
Labels
done Issues in the state 'done'

Comments

@adejanovski
Contributor

adejanovski commented Apr 30, 2024

There can be cases where a scale-up operation is blocked by another, crashlooping pod.
In this case the new pod carries the "Starting" label, which prevents the pre-existing pod from coming back up after a fix is applied (for example when it's rescheduled on a new worker).

One way we could solve this would be to allow pods to start right away if they host Cassandra nodes that have already bootstrapped in the past, and if they're not part of a replacement.
This way we could have faster startups overall while still protecting ourselves from concurrent bootstraps.
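A minimal sketch of the serialization that causes this blockage, assuming a hypothetical node-state label (the actual label key and values used by cass-operator may differ):

```go
package main

import "fmt"

// Hypothetical label key/values; the real names used by cass-operator may differ.
const (
	nodeStateLabel = "cassandra.datastax.com/node-state"
	stateStarting  = "Starting"
)

// pod is a simplified stand-in for a corev1.Pod.
type pod struct {
	Name   string
	Labels map[string]string
}

// anotherNodeIsStarting mimics the one-node-at-a-time gate: as long as any
// pod carries the "Starting" state, no other pod is allowed to start.
func anotherNodeIsStarting(pods []pod) bool {
	for _, p := range pods {
		if p.Labels[nodeStateLabel] == stateStarting {
			return true
		}
	}
	return false
}

func main() {
	pods := []pod{
		{Name: "dc1-rack1-sts-0", Labels: map[string]string{nodeStateLabel: "Started"}},
		// The new pod added by the expansion is stuck in "Starting"...
		{Name: "dc1-rack1-sts-3", Labels: map[string]string{nodeStateLabel: stateStarting}},
	}
	// ...so the rescheduled pre-existing pod is never allowed to come back up.
	fmt.Println("blocked:", anotherNodeIsStarting(pods))
}
```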

@adejanovski adejanovski changed the title Ensure we can replace a crashlooping pod while a scale up is happening Simplify the recovery when an existing node crashes during an expansion May 3, 2024
@adejanovski adejanovski moved this to Assess/Investigate in K8ssandra May 3, 2024
@adejanovski adejanovski added the assess Issues in the state 'assess' label May 3, 2024
@adejanovski
Contributor Author

Currently:

  • Only one node starts at a time -> one node across the whole DC, with round-robin between racks

Conditions to allow fast-lane startups (decision sketched below):

  • Pod has the Ready-To-Start label

Conditions to disallow fast-lane startups:

  • the node is planned for a replacement
  • the node has never fully bootstrapped
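
A sketch of what the fast-lane decision could look like, combining the conditions above; the type, field and function names are illustrative, not the operator's actual API:

```go
package main

import "fmt"

// startCandidate gathers the hypothetical inputs for a single pod.
type startCandidate struct {
	HasReadyToStartLabel bool // pod carries the Ready-To-Start label
	PlannedReplacement   bool // node is the target of a replace operation
	BootstrappedBefore   bool // node already completed a bootstrap (see the nodeStatuses check below)
}

// allowFastLaneStart encodes the conditions listed above: the pod must be
// ready to start, must not be a replacement target, and must have fully
// bootstrapped at least once in the past.
func allowFastLaneStart(c startCandidate) bool {
	if !c.HasReadyToStartLabel {
		return false
	}
	if c.PlannedReplacement || !c.BootstrappedBefore {
		return false
	}
	return true
}

func main() {
	// The pre-existing node that crashed during the expansion: eligible.
	fmt.Println(allowFastLaneStart(startCandidate{true, false, true}))
	// The brand new node that has never bootstrapped: not eligible.
	fmt.Println(allowFastLaneStart(startCandidate{true, false, false}))
}
```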

Things to verify:

  • If a DC is fully stopped, bringing it back up with all nodes starting concurrently works without a hiccup

Technical aspects
How do we ensure a node was part of the ring before?

The pod name appears in the cassdc .status.nodeStatuses struct together with a host ID.
cass-operator needs to ensure that an entry is added only for nodes that have successfully completed bootstrap (their state is UN).
nodeStatuses has to be a faithful representation of the DC topology: any node removal should be reflected there after a scale-in/down operation, which can be detected through the LEAVING/REMOVED states in the endpoint states.
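
A sketch of the bootstrap-history check, using simplified stand-ins for the cassdc status types (the real cass-operator structs and field names may differ):

```go
package main

import "fmt"

// Simplified stand-ins for the cassdc status types; the real cass-operator
// definitions contain more fields and may use different names.
type nodeStatus struct {
	HostID string
}

type cassdcStatus struct {
	// Keyed by pod name, mirroring .status.nodeStatuses.
	NodeStatuses map[string]nodeStatus
}

// hasBootstrappedBefore returns true when the pod already has an entry with a
// host ID in nodeStatuses. This only holds up if the operator records the
// entry exclusively once the node has reached UN, and prunes it when the node
// leaves the ring (LEAVING/REMOVED endpoint states) after a scale-down.
func hasBootstrappedBefore(status cassdcStatus, podName string) bool {
	ns, ok := status.NodeStatuses[podName]
	return ok && ns.HostID != ""
}

func main() {
	status := cassdcStatus{NodeStatuses: map[string]nodeStatus{
		"dc1-rack1-sts-0": {HostID: "11111111-2222-3333-4444-555555555555"}, // bootstrapped earlier
	}}
	fmt.Println(hasBootstrappedBefore(status, "dc1-rack1-sts-0")) // true: fast lane allowed
	fmt.Println(hasBootstrappedBefore(status, "dc1-rack1-sts-3")) // false: never bootstrapped
}
```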

@adejanovski adejanovski moved this from Assess/Investigate to Ready For Review in K8ssandra Jul 9, 2024
@adejanovski adejanovski added ready-for-review Issues in the state 'ready-for-review' and removed assess Issues in the state 'assess' labels Jul 9, 2024
@adejanovski adejanovski moved this from Ready For Review to Review in K8ssandra Jul 9, 2024
@adejanovski adejanovski added review Issues in the state 'review' and removed ready-for-review Issues in the state 'ready-for-review' discuss labels Jul 9, 2024
@github-project-automation github-project-automation bot moved this from Review to Done in K8ssandra Jul 10, 2024
@adejanovski adejanovski added done Issues in the state 'done' and removed review Issues in the state 'review' labels Jul 10, 2024