-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Block the execution of cluster tasks on non existing nodes #55
base: main
Are you sure you want to change the base?
Block the execution of cluster tasks on non existing nodes #55
Conversation
e6c47be
to
399864a
Compare
src/java/org/jivesoftware/openfire/plugin/util/cluster/ClusterPacketRouter.java
Outdated
Show resolved
Hide resolved
src/java/org/jivesoftware/openfire/plugin/util/cluster/ClusterPacketRouter.java
Outdated
Show resolved
Hide resolved
src/java/org/jivesoftware/openfire/plugin/util/cluster/ClusterPacketRouter.java
Outdated
Show resolved
Hide resolved
Also catch all exceptions (not only IllegalStateException) which occur if the node is not part of the cluster
399864a
to
2b21a20
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from one small suggestion: do we want this? With this change, code that invokes this has no way of knowing that something went wrong. Is the correct way of going about ignoring the problem here?
In what scenarios does this problem occur? If it is caused by a bug in Openfire, then I do not like hiding this problem, by simply gobbling it up.
} catch (IllegalStateException e) { | ||
logger.warn("Error while routing packet to remote node: " + e); | ||
} catch (Exception e) { | ||
logger.warn("Error while routing packet to remote node",e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How exactly do we reach a state where things are routed to non-existing nodes? If this hints at a bug in the Openfire implementation, then I think it warrants logging to 'error' rather than 'warn'.
Also, this is more nitpicking, but it would be useful to log the identifier of the non-existing node that this is being routed to.
logger.warn("Error while routing packet to remote node",e); | |
logger.error("Error while routing packet to remote node: {}", nodeID == null ? "(null)" : new String(nodeID, java.nio.charset.StandardCharsets.UTF_8), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @guusdk,
the node id is already part of the exception message:
2020.11.03 17:07:45.965 ERROR [hz.openfire.cached.thread-4] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?] at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I see it would at least be logged on all other exceptions as well.
Also, we should have a changelog item (possibly linking to an issue or this PR that contains a rationale for the change) if we do choose to merge this. |
I too am fairly unsure about this. Which is partly why I've left it alone apart from a few early comment (that and a whole load of other pain, work and non-work on my plate right now). |
FWIW, I think there is definitely a case for supressing the exception that occurs in the following sequence of events;
|
That's the exception, I forgot to link the corresponding post from the forum. https://discourse.igniterealtime.org/t/null-pointer-exception-joining-a-room/88995/7?u=chp and https://discourse.igniterealtime.org/t/null-pointer-exception-joining-a-room/88995/10?u=chp
As you can see, the exception is thrown several times within a few milliseconds. I looked at it again and found that catching the IllegalArgumentException (and maybe the IllgegalStateException as before) is probably sufficient. In https://github.com/igniterealtime/openfire-hazelcast-plugin/blob/master/src/java/org/jivesoftware/openfire/plugin/util/cache/ClusteredCacheFactory.java#L384, the check whether the node is in the cluster or not is already done and just this IllegalArgumentException is thrown. I.e. the check whether the node is still in the cluster is redundant if the corresponding exception is caught. igniterealtime/Openfire#1749 is to fix one trigger of the problem. Each cluster node has a local or remote user in its maps for all MUC members. If a node goes down, an OccupantLeftEvent is sent for each member (AND ON EACH NODE) that was online on the node (one after the other). I.e. the presence that the user is offline is also sent to ALL remote users, also to the ones that were online on the offline node and for which no OccupantLeftEvent was sent yet. The exception is thrown if the cache of the routing table has not yet been updated and the route to the user still exists there. All exceptions are caught in RotuingTableImpl and only logged, so that this special exception can be suppressed here without any problems. |
Also catch the IllegalArgumentException which occurs if the node is not part of the cluster. This can happen even if the isClusterMember check before is true.