Node fails to rejoin the cluster after network reconnect when using etcd for service discovery #1133

Open
pipiaha opened this issue Dec 23, 2024 · 0 comments


Description:

I encountered an issue when using etcd as the service discovery mechanism for a ProtoActor cluster. When a node loses its connection to etcd (for example, due to network fluctuation or while paused at a breakpoint during debugging) and the network is later restored, a goroutine calls startKeepAlive to re-register the lease. Although the lease is successfully renewed, the current node does not seem to be re-added to the members list in etcd.Provider. As a result, the ActorSystem stays active but the node is no longer part of the cluster.

Steps to Reproduce:

  1. Start a node and use etcd for cluster service discovery. By default, keepAliveTTL=3s and retryInterval=1s.

         provider, _ = etcd.NewWithConfig(b.Config.ClusterBaseKey, clientv3.Config{
             Endpoints: []string{"example.etcd.addr:2379"},
             Username:  "foo",
             Password:  "bar",
             //DialKeepAliveTime:    10 * time.Second,
             //DialKeepAliveTimeout: 10 * time.Second,
         })

  2. Disconnect the node from etcd (network fluctuation, or pausing the process in a debugger).
  3. After the network recovers, let the scheduled goroutine call startKeepAlive and renew the lease.
  4. Notice that the members list in etcd.Provider does not include the current node, and as a result, the node's ActorSystem does not rejoin the cluster.

Expected Behavior:

After network recovery, the node should successfully re-register itself via startKeepAlive and be added back to the etcd.Provider members list. The node's ActorSystem should then rejoin the cluster and function normally.

Current Behavior:

The node fails to rejoin the cluster. Even though the lease is renewed, the members list in etcd.Provider is not updated to include the node, which causes the node's ActorSystem to no longer participate in the cluster.
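
To make the symptom concrete, here is a minimal toy model of one way it could arise. It is only a sketch under stated assumptions, not the provider's code: fakeStore, node, renewLeaseOnly, and reRegister are hypothetical names, and whether etcd.Provider actually follows the "lease-only" path is unverified. The one fact it relies on is that etcd deletes the keys attached to a lease when that lease expires, so obtaining a fresh lease does not by itself bring the member key back.

```go
package main

import "fmt"

// fakeStore is a hypothetical in-memory stand-in for etcd:
// every key is attached to a lease and vanishes when that lease expires.
type fakeStore struct {
	nextLease int64
	keys      map[string]int64 // key -> lease ID
}

func newFakeStore() *fakeStore {
	return &fakeStore{keys: make(map[string]int64)}
}

// Grant returns a fresh lease ID.
func (s *fakeStore) Grant() int64 {
	s.nextLease++
	return s.nextLease
}

// Put stores key under the given lease.
func (s *fakeStore) Put(key string, lease int64) { s.keys[key] = lease }

// Expire drops every key attached to the lease, mirroring what etcd
// does once a lease's TTL elapses without a keep-alive.
func (s *fakeStore) Expire(lease int64) {
	for k, l := range s.keys {
		if l == lease {
			delete(s.keys, k)
		}
	}
}

// Has reports whether key is currently present.
func (s *fakeStore) Has(key string) bool {
	_, ok := s.keys[key]
	return ok
}

// node is a hypothetical cluster member that owns one key and one lease.
type node struct {
	key   string
	lease int64
}

// register grants a lease and writes the member key under it.
func (n *node) register(s *fakeStore) {
	n.lease = s.Grant()
	s.Put(n.key, n.lease)
}

// renewLeaseOnly models the suspected failure mode (an assumption):
// a new lease is obtained, but the member key is never written again.
func (n *node) renewLeaseOnly(s *fakeStore) {
	n.lease = s.Grant()
}

// reRegister models the expected behavior: if the member key is gone,
// write it again under a fresh lease.
func (n *node) reRegister(s *fakeStore) {
	if !s.Has(n.key) {
		n.register(s)
	}
}

func main() {
	s := newFakeStore()
	n := &node{key: "members/node-1"}
	n.register(s)

	s.Expire(n.lease) // the outage outlasts the TTL, so the key is deleted

	n.renewLeaseOnly(s)
	fmt.Println("after lease-only renewal, registered:", s.Has(n.key))

	n.reRegister(s)
	fmt.Println("after re-registration, registered:", s.Has(n.key))
}
```

In this model the node only reappears once the member key is written again, which matches the Expected Behavior above.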

Environment:

  • ProtoActor-Go version: v0.0.0-20240822202345-3c0e61ca19c9
  • etcd version: v3
  • Go version: go 1.22.7

Additional Information:

  • Would it help if the keepAliveTTL configuration could be customized?

I would appreciate assistance on how to ensure the node can properly rejoin the cluster after network recovery.
