Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block the execution of cluster tasks on non existing nodes #55

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cpetzka
Copy link
Contributor

@cpetzka cpetzka commented Nov 3, 2020

Also catch the IllegalArgumentException which occurs if the node is not part of the cluster. This can happen even if the isClusterMember check before is true.

@cpetzka cpetzka force-pushed the block_execution_of_pakets_on_non_existing_nodes branch from e6c47be to 399864a Compare November 3, 2020 20:58
Also catch all exceptions (not only IllegalStateException) which occur if the node is not
part of the cluster
@cpetzka cpetzka force-pushed the block_execution_of_pakets_on_non_existing_nodes branch from 399864a to 2b21a20 Compare November 18, 2020 23:51
Copy link
Member

@guusdk guusdk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from one small suggestion: do we want this? With this change, code that invokes this has no way of knowing that something went wrong. Is the correct way of going about ignoring the problem here?

In what scenarios does this problem occur? If it is caused by a bug in Openfire, then I do not like hiding this problem, by simply gobbling it up.

} catch (IllegalStateException e) {
logger.warn("Error while routing packet to remote node: " + e);
} catch (Exception e) {
logger.warn("Error while routing packet to remote node",e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How exactly do we reach a state where things are routed to non-existing nodes? If this hints at a bug in the Openfire implementation, then I think it warrants logging to 'error' rather than 'warn'.

Also, this is more nitpicking, but it would be useful to log the identifier of the non-existing node that this is being routed to.

Suggested change
logger.warn("Error while routing packet to remote node",e);
logger.error("Error while routing packet to remote node: {}", nodeID == null ? "(null)" : new String(nodeID, java.nio.charset.StandardCharsets.UTF_8), e);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @guusdk,

the node id is already part of the exception message:

2020.11.03 17:07:45.965 ERROR [hz.openfire.cached.thread-4] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?] at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I see it would at least be logged on all other exceptions as well.

@guusdk
Copy link
Member

guusdk commented Jan 5, 2021

Also, we should have a changelog item (possibly linking to an issue or this PR that contains a rationale for the change) if we do choose to merge this.

@GregDThomas
Copy link
Contributor

I too am fairly unsure about this. Which is partly why I've left it alone apart from a few early comment (that and a whole load of other pain, work and non-work on my plate right now).

@GregDThomas
Copy link
Contributor

FWIW, I think there is definitely a case for supressing the exception that occurs in the following sequence of events;

  1. Client code sends a ClusterTask to all other nodes in the cluster (A, B) - CacheFactory#doSynchronousClusterTask(ClusterTask<T> task, boolean includeLocalMember) or CacheFactory#doClusterTask(final ClusterTask<?> task)
  2. Openfire sends ClusterTask to node A
  3. Node B leaves the cluster
  4. Openfire sends ClusterTask to node B
  5. Hazelcast throws exception that node B is not in the cluster.

@cpetzka
Copy link
Contributor Author

cpetzka commented Jan 10, 2021

That's the exception, I forgot to link the corresponding post from the forum. https://discourse.igniterealtime.org/t/null-pointer-exception-joining-a-room/88995/7?u=chp and https://discourse.igniterealtime.org/t/null-pointer-exception-joining-a-room/88995/10?u=chp

2020.11.03 17:07:45.965 ERROR [hz.openfire.cached.thread-4] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
	at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.1-SNAPSHOT.jar!/:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
	at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
	at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
2020.11.03 17:07:45.990 ERROR [hz.openfire.cached.thread-14] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
	at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.1-SNAPSHOT.jar!/:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
	at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
	at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]

As you can see, the exception is thrown several times within a few milliseconds.

I looked at it again and found that catching the IllegalArgumentException (and maybe the IllgegalStateException as before) is probably sufficient. In https://github.com/igniterealtime/openfire-hazelcast-plugin/blob/master/src/java/org/jivesoftware/openfire/plugin/util/cache/ClusteredCacheFactory.java#L384, the check whether the node is in the cluster or not is already done and just this IllegalArgumentException is thrown. I.e. the check whether the node is still in the cluster is redundant if the corresponding exception is caught.

igniterealtime/Openfire#1749 is to fix one trigger of the problem. Each cluster node has a local or remote user in its maps for all MUC members. If a node goes down, an OccupantLeftEvent is sent for each member (AND ON EACH NODE) that was online on the node (one after the other). I.e. the presence that the user is offline is also sent to ALL remote users, also to the ones that were online on the offline node and for which no OccupantLeftEvent was sent yet. The exception is thrown if the cache of the routing table has not yet been updated and the route to the user still exists there.

All exceptions are caught in RotuingTableImpl and only logged, so that this special exception can be suppressed here without any problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants