Node fails to rejoin the cluster after network reconnect when using etcd for service discovery #1133

pipiaha · 2024-12-23T11:34:01Z

Description:

I encountered an issue when using etcd as the service discovery mechanism for a ProtoActor cluster. When a node loses its connection to etcd (for example, due to network fluctuation or while paused at a breakpoint during debugging) and the network is then restored, a goroutine calls startKeepAlive to re-register the lease. Although the lease is successfully renewed, the current node does not appear to be re-added to the members list in etcd.Provider. As a result, the ActorSystem stays active, but the node is no longer part of the cluster.

Steps to Reproduce:

Start a node and use etcd for cluster service discovery. By default, keepAliveTTL=3s and retryInterval=1s.

provider, _ = etcd.NewWithConfig(b.Config.ClusterBaseKey, clientv3.Config{
	Endpoints: []string{"example.etcd.addr:2379"},
	Username:  "foo",
	Password:  "bar",
	// DialKeepAliveTime:    10 * time.Second,
	// DialKeepAliveTimeout: 10 * time.Second,
})

Disconnect the node from etcd, for example through a network interruption or by pausing in a debugger.
After the network recovers, the scheduled goroutine calls startKeepAlive and renews the lease.
Observe that the members list in etcd.Provider does not include the current node, so the node's ActorSystem does not rejoin the cluster.

Expected Behavior:

After network recovery, the node should successfully re-register itself via startKeepAlive and be added back to the etcd.Provider members list. The node's ActorSystem should then rejoin the cluster and function normally.
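To make the expected behavior concrete, here is a minimal sketch of the recovery logic I have in mind. It does not use the real etcd client or the actual etcd.Provider code: fakeStore, node, register, and keepAliveOrRejoin are hypothetical names, and the in-memory store only mimics the etcd rule that keys attached to a lease are deleted when that lease expires. The point is that once the old lease is gone, granting a new lease is not enough; the member key has to be written again.

```go
package main

import (
	"errors"
	"fmt"
)

// errLeaseNotFound mimics the error returned for an expired lease.
var errLeaseNotFound = errors.New("lease not found")

// fakeStore is a hypothetical in-memory stand-in for etcd: keys are bound to
// leases, and expiring a lease deletes every key bound to it.
type fakeStore struct {
	nextLease int64
	leases    map[int64]bool
	keys      map[string]int64 // key -> owning lease
}

func newFakeStore() *fakeStore {
	return &fakeStore{leases: map[int64]bool{}, keys: map[string]int64{}}
}

// Grant creates a new live lease and returns its ID.
func (s *fakeStore) Grant() int64 {
	s.nextLease++
	s.leases[s.nextLease] = true
	return s.nextLease
}

// Put binds key to a live lease.
func (s *fakeStore) Put(key string, lease int64) error {
	if !s.leases[lease] {
		return errLeaseNotFound
	}
	s.keys[key] = lease
	return nil
}

// KeepAlive succeeds only while the lease still exists.
func (s *fakeStore) KeepAlive(lease int64) error {
	if !s.leases[lease] {
		return errLeaseNotFound
	}
	return nil
}

// Expire simulates the TTL elapsing while the node was unreachable.
func (s *fakeStore) Expire(lease int64) {
	delete(s.leases, lease)
	for k, l := range s.keys {
		if l == lease {
			delete(s.keys, k)
		}
	}
}

// Has reports whether key is currently registered.
func (s *fakeStore) Has(key string) bool {
	_, ok := s.keys[key]
	return ok
}

// node holds the registration state a provider would track (hypothetical).
type node struct {
	store *fakeStore
	key   string
	lease int64
}

// register grants a fresh lease and writes the member key under it.
func (n *node) register() error {
	n.lease = n.store.Grant()
	return n.store.Put(n.key, n.lease)
}

// keepAliveOrRejoin renews the lease; if the lease is gone, it grants a new
// one AND writes the member key again, because the old key went with the lease.
func (n *node) keepAliveOrRejoin() error {
	err := n.store.KeepAlive(n.lease)
	if errors.Is(err, errLeaseNotFound) {
		return n.register()
	}
	return err
}

func main() {
	s := newFakeStore()
	n := &node{store: s, key: "/cluster/members/node-1"}
	_ = n.register()

	s.Expire(n.lease) // outage longer than the TTL
	fmt.Println(s.Has(n.key)) // false

	_ = n.keepAliveOrRejoin()
	fmt.Println(s.Has(n.key)) // true
}
```

Whether the provider's startKeepAlive currently skips the second step is only my guess from the symptoms; I have not confirmed it in the source.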

Current Behavior:

The node fails to rejoin the cluster. Even though the lease is renewed, the members list in etcd.Provider is not updated to include the node, which causes the node's ActorSystem to no longer participate in the cluster.

Environment:

ProtoActor-Go version: v0.0.0-20240822202345-3c0e61ca19c9
etcd version: v3
Go version: go 1.22.7

Additional Information:

Would it help if the keepAliveTTL setting could be customized?

I would appreciate assistance on how to ensure the node can properly rejoin the cluster after network recovery.
