
Investigate why MemberReplace failpoint flakes on release-3.4 #18929

Open

serathius opened this issue Nov 20, 2024 · 5 comments

@serathius (Member)


What happened?

In the last robustness meeting we identified 3 flakes for MemberReplace.

All of them happened on release-3.4, in the TestRobustnessExploratory/KubernetesHighTraffic/ClusterOfSize3/MemberReplace test.

What did you expect to happen?

The issue should not be specific to release-3.4.

How can we reproduce it (as minimally and precisely as possible)?

There is no way to select failpoints by test name, but you can modify allFailpoints in tests/robustness/failpoint/failpoint.go so that only MemberReplace remains (see the sketch below), and then run:

GO_TEST_FLAGS='-v --run TestRobustnessExploratory/KubernetesHighTraffic/ClusterOfSize3 --count 100 --failfast --timeout 1h' make test-robustness-release-3.4
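For reference, the modification might look like this; a minimal sketch, assuming allFailpoints is a plain slice literal (the commented-out entries are placeholders for whatever the file actually lists):

// tests/robustness/failpoint/failpoint.go (sketch, not the verbatim file):
// keep only MemberReplace so the exploratory test never picks another failpoint.
var allFailpoints = []Failpoint{
	// KillFailpoint,
	// ...other entries commented out...
	MemberReplace,
}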

Anything else we need to know?

No response

Etcd version (please run commands below)

release-3.4 branch

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

@joshuazh-x (Contributor)

I can take a look at this.

@joshuazh-x (Contributor) commented Nov 21, 2024

Without PR #11639, MemberList returns the local membership configuration without a linearizability guarantee, so a just-removed member may still show up in the member list response. This was fixed in 3.5 and above, so the flake should be specific to 3.4.

Release 3.4

func (cs *ClusterServer) MemberList(ctx context.Context, r *pb.MemberListRequest) (*pb.MemberListResponse, error) {
	// Served straight from the local membership cache; no linearizable read.
	membs := membersToProtoMembers(cs.cluster.Members())
	return &pb.MemberListResponse{Header: cs.header(), Members: membs}, nil
}

Release 3.5

func (cs *ClusterServer) MemberList(ctx context.Context, r *pb.MemberListRequest) (*pb.MemberListResponse, error) {
	if r.Linearizable {
		// Wait until this member has applied everything committed at the
		// time of the read, so a removed member can no longer appear below.
		if err := cs.server.LinearizableReadNotify(ctx); err != nil {
			return nil, togRPCError(err)
		}
	}
	membs := membersToProtoMembers(cs.cluster.Members())
	return &pb.MemberListResponse{Header: cs.header(), Members: membs}, nil
}
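To make the failure mode concrete, here is a hedged clientv3 sketch of the race; the endpoints and member ID are placeholders, not taken from the issue. Against a 3.4 server, the member list is answered from the local cache, so a member removed just beforehand can still appear:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local cluster
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	const removedID uint64 = 0x1234 // hypothetical member ID
	if _, err := cli.MemberRemove(ctx, removedID); err != nil {
		panic(err)
	}

	// The 3.5+ client asks for a linearizable member list, but a 3.4 server
	// ignores the flag and answers from its local membership cache, so the
	// just-removed member may still show up in the response.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		panic(err)
	}
	for _, m := range resp.Members {
		fmt.Printf("%x %s\n", m.ID, m.Name)
	}
}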

@ahrtr (Member) commented Nov 22, 2024

> Without PR #11639, MemberList returns the local membership configuration without a linearizability guarantee, so a just-removed member may still show up in the member list response.

Thanks for the analysis. One workaround for 3.4 is to issue a linearizable read request in between (see the sketch below).
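A minimal sketch of that workaround, assuming a clientv3 client as in the snippet above; the helper name and key are hypothetical:

// memberListAfterBarrier is a hypothetical helper. Range/Get requests are
// linearizable by default in etcd, so the throwaway Get forces the server to
// apply everything committed so far (including a pending member removal)
// before we trust MemberList on a 3.4 cluster.
func memberListAfterBarrier(ctx context.Context, cli *clientv3.Client) (*clientv3.MemberListResponse, error) {
	if _, err := cli.Get(ctx, "linearizability-barrier"); err != nil {
		return nil, err
	}
	return cli.MemberList(ctx)
}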

@serathius (Member, Author)

Didn't we want to backport it to v3.4? #11639 (comment)

@ahrtr (Member) commented Nov 22, 2024

Better not, since it changes the protocol buffer definition, and we haven't seen any related production issue in 3.4 so far. But I am not strongly against it, as it is a compatible change; see #11639 (comment).
