If an existing cluster crashes before a new leader could be elected, and the leader's volume is corrupted beyond recovery, a cold start after the infrastructure is back up can get stuck.
Let's say we have nodes 1, 2, 3 and node 1 is the current leader.
If all nodes crash at the exact same time, the volume of node 1 gets corrupted so badly that it cannot be recovered, and the crash happened in the middle of a log replication so that not all nodes are on the exact same log id, then nodes 2 and 3 can end up stuck trying to re-connect to node 1 (which is dead), because they are lagging behind in log id.
If a situation like this occurs, we need a way to force another node to become the new leader, basically ignoring any log ids it has not received yet, even though it knows the old leader had a higher one just before the crash.
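To illustrate the idea, here is a minimal, self-contained Rust sketch of what such a forced recovery could look like conceptually. All names (`NodeState`, `force_recover`, the log-id fields) are hypothetical and not part of the project's actual API; the point is only that the most up-to-date survivor is chosen as the new leader and every remaining node discards knowledge of log ids that only the lost leader ever held.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch; field and function names are illustrative only.
#[derive(Debug, Clone)]
struct NodeState {
    id: u64,
    /// Highest log id actually persisted on this node's own volume.
    last_log_id: u64,
    /// Highest log id this node has heard of from the old leader; it may be
    /// higher than anything it ever stored locally.
    highest_seen_log_id: u64,
}

/// Pick the most up-to-date survivor as the forced new leader and make every
/// node forget about log ids that only the permanently lost leader ever held,
/// so the cluster can make progress instead of waiting for the dead node.
fn force_recover(survivors: &mut BTreeMap<u64, NodeState>) -> Option<u64> {
    // The best candidate is the survivor with the highest persisted log id.
    let (new_leader_id, new_leader_log) = survivors
        .values()
        .map(|n| (n.id, n.last_log_id))
        .max_by_key(|&(_, log)| log)?;

    for node in survivors.values_mut() {
        // Discard knowledge of entries that nobody alive can serve anymore.
        node.highest_seen_log_id = node.highest_seen_log_id.min(new_leader_log);
    }
    Some(new_leader_id)
}

fn main() {
    // Scenario from this issue: node 1 (the old leader, log id 42) is gone;
    // nodes 2 and 3 lag behind but still know the leader had reached id 42.
    let mut survivors = BTreeMap::new();
    survivors.insert(2, NodeState { id: 2, last_log_id: 41, highest_seen_log_id: 42 });
    survivors.insert(3, NodeState { id: 3, last_log_id: 40, highest_seen_log_id: 42 });

    let leader = force_recover(&mut survivors).expect("at least one survivor");
    println!("forced new leader: node {leader}"); // node 2, the most up-to-date survivor
    assert_eq!(survivors[&3].highest_seen_log_id, 41);
}
```

In practice this would probably be exposed as a manual admin operation rather than something automatic, since only an operator can know that the old leader's volume is truly unrecoverable.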
This situation is super rare, but I have been able to reproduce it in manual testing, even though it took a few tries to get into it on purpose. Still, a solution for something like this should exist.