
Enhanced split-brain protection #51

Open
guusdk opened this issue Sep 9, 2020 · 4 comments

Comments

@guusdk
Member

guusdk commented Sep 9, 2020

As @GregDThomas suggested:
A new enhancement, that would require you to have an odd number of cluster nodes. Basically, assuming three nodes, you have to have two communicating to get a cluster. If you only have one node, you're not clustered.

@GregDThomas: I'm assuming here that the aim of this is to have a resolution where a majority of servers dictates the resulting state?

@GregDThomas
Contributor

GregDThomas commented Sep 9, 2020

From first principles (apologies for teaching readers to suck eggs):

In Openfire terms, a split-brain occurs when two (or more) nodes in a cluster both think they are the senior node. E.g. in a two-node cluster, the network between the two nodes is lost, neither node can see that the other node is available, so both assume they are the senior. ref https://en.wikipedia.org/wiki/Split-brain_(computing).

A typical solution to this problem is to introduce the concept of a quorum value. A quorum value would be (nodecount/2+1) - e.g. 2 nodes in a 3-node cluster, 3 nodes in a 4-node cluster, 3 nodes in a 5-node cluster. ref https://en.wikipedia.org/wiki/Quorum_(distributed_computing)
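A minimal sketch of the quorum calculation described above, assuming integer division (the `Quorum` class name is hypothetical, not part of Openfire):

```java
// Hedged sketch: computes the quorum as (nodecount / 2 + 1) using
// integer division, matching the examples in the comment above.
public class Quorum {
    static int quorumFor(int configuredClusterSize) {
        if (configuredClusterSize < 1) {
            throw new IllegalArgumentException("cluster size must be positive");
        }
        // Integer division: 3 -> 2, 4 -> 3, 5 -> 3.
        return configuredClusterSize / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(quorumFor(3)); // 2
        System.out.println(quorumFor(4)); // 3
        System.out.println(quorumFor(5)); // 3
    }
}
```

Note that an even-sized cluster tolerates no more failures than the next-smaller odd-sized one, which is why odd cluster sizes are usually recommended.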

So a proposal to implement this would be:

(Note the distinction between a Hazelcast cluster and an Openfire cluster - they may be in different states)

If a quorum value is configured, when a node starts, Openfire clustering remains "starting" until the node can see the quorum number of nodes in the Hazelcast cluster. These nodes would then agree on a senior member (currently, it's the oldest member of the cluster, I don't see a need to change that).

When a node leaves the Openfire cluster and the number of remaining nodes is less than the quorum value, the remaining node(s) would disable clustering and then immediately re-enable it. Clustering would then, as above, remain "starting" until the node can see the quorum number of nodes in the cluster.
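The two steps above can be modelled as a small state machine: a node is "starting" until it can see a quorum of members, and drops back to "starting" when quorum is lost. This is a hedged illustration only; `ClusterGate`, `nodeJoined` and `nodeLeft` are hypothetical names, not Openfire or Hazelcast APIs (in a real implementation these transitions would be driven by a Hazelcast `MembershipListener`):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed quorum gating, not actual Openfire code.
public class ClusterGate {
    enum State { STARTING, RUNNING }

    private final int quorum;
    private final Set<String> visibleNodes = new HashSet<>();
    private State state = State.STARTING;

    ClusterGate(int configuredClusterSize) {
        this.quorum = configuredClusterSize / 2 + 1;
    }

    // Called when a node becomes visible in the Hazelcast cluster.
    void nodeJoined(String nodeId) {
        visibleNodes.add(nodeId);
        if (state == State.STARTING && visibleNodes.size() >= quorum) {
            state = State.RUNNING; // quorum reached: Openfire clustering may start
        }
    }

    // Called when a node leaves. Losing quorum drops back to STARTING,
    // mirroring the "disable and immediately re-enable" step above.
    void nodeLeft(String nodeId) {
        visibleNodes.remove(nodeId);
        if (state == State.RUNNING && visibleNodes.size() < quorum) {
            state = State.STARTING;
        }
    }

    State state() {
        return state;
    }
}
```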

Possible further enhancements:

- While clustering is "starting", waiting for quorum, reject any new connections (XMPP, BOSH, server-to-server, etc.) to that node.
- If clustering is disabled due to a lack of quorum, drop all existing connections.

This would further ensure that the isolated node does not carry out any actions when it is not part of the cluster.
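The connection-handling enhancements could be sketched as a guard consulted by the accept paths and notified on quorum changes. Again a hypothetical illustration, assuming names (`ConnectionGuard`, `onQuorumLost`, etc.) that do not exist in Openfire:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "reject new / drop existing connections" idea.
public class ConnectionGuard {
    private boolean quorumPresent = false;
    private final List<String> connections = new ArrayList<>();

    // Would be consulted by XMPP/BOSH/server-to-server accept paths:
    // reject new connections while quorum is absent.
    boolean accept(String connectionId) {
        if (!quorumPresent) {
            return false;
        }
        connections.add(connectionId);
        return true;
    }

    // On losing quorum, drop all existing connections so the isolated
    // node carries out no further actions.
    void onQuorumLost() {
        quorumPresent = false;
        connections.clear();
    }

    void onQuorumReached() {
        quorumPresent = true;
    }

    int activeConnections() {
        return connections.size();
    }
}
```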

@guusdk
Member Author

guusdk commented Sep 15, 2020

This seems to trade availability for consistency. I can imagine that there are scenarios in which either one is preferred. We'd need to make sure that this behavior is highly configurable.

Unless I'm misunderstanding, the suggested approach would basically reduce or remove service for the entire domain when one cluster node fails. My gut feeling says that most deployments would prefer not to lock out or log off the entire domain in such a scenario, choosing availability over consistency.

@GregDThomas
Contributor

Yes, it is a trade-off. Typically you'd need an odd number of nodes, and just under half of them can fail before you lose the whole cluster.

But to make it explicit, I was only expecting the above behaviour if a quorum number was set. If no quorum was set, behaviour is as it is today.

@guusdk
Member Author

guusdk commented Sep 15, 2020

Ah, right, I misunderstood that. My interpretation was that the entire cluster should grind to a halt when just one node disappears. That's not what you suggested: it's basically when the cluster falls below half-plus-one of the anticipated cluster size.
