Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: use next instead of return to continue listen for events #867

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

spuun
Copy link
Member

@spuun spuun commented Dec 4, 2024

WHAT is this pull request doing?

When restarting a leader it could end up in a idling state, not turning into a follower. This because if it starts up fast enough it hasn't lost its leadership according to etcd, and the first election event will return the node as leader, and it can't be follower to itself.

Chaning the return to next will make it will wait for the next event that should be another node being promoted.

HOW can this pull request be tested?

Failed to create specs for this so testing manually for now.

@spuun spuun requested a review from a team as a code owner December 4, 2024 12:27
@carlhoerberg
Copy link
Member

But wait, I think the thinking was that if this node is the leader it should crash/close when it looses its leadership? Are we not doing that? A leader restarts, resumes its leadership, it should need to listen for other leadership changes because when it looses its leadership it should exit the process?

@spuun
Copy link
Member Author

spuun commented Dec 6, 2024

But wait, I think the thinking was that if this node is the leader it should crash/close when it looses its leadership? Are we not doing that? A leader restarts, resumes its leadership, it should need to listen for other leadership changes because when it looses its leadership it should exit the process?

Yeah, well, it's stopping for some reason. I think I may have found another bug now which is the real issue!

@spuun
Copy link
Member Author

spuun commented Dec 6, 2024

Hm, no. Because it returns from follow_leader it get stuck on leader election, and it's the follower (I'm testing with two nodes) that's elected. And since the old leader has returned from follow_leader it will never follow.

And because it's stuck in leader election without following, it means message loss if we publish to the new leader and it dies. No data has been synced and the "idling" non-follower will be leader but with old data.

@carlhoerberg
Copy link
Member

And because it's stuck in leader election without following, it means message loss if we publish to the new leader and it dies. No data has been synced and the "idling" non-follower will be leader but with old data.

how can the idling non-followr become leader? it should check the ISR before trying to become leader

@spuun
Copy link
Member Author

spuun commented Dec 6, 2024

And because it's stuck in leader election without following, it means message loss if we publish to the new leader and it dies. No data has been synced and the "idling" non-follower will be leader but with old data.

how can the idling non-followr become leader? it should check the ISR before trying to become leader

It's still in the ISR set, so it will just continue to leader election.

edit: let me check this again, i misinterpreted your comment at first...

@spuun
Copy link
Member Author

spuun commented Dec 6, 2024

I think this is the flow:

  1. Node 1 (leader) stops gracefully, but doesn't revoke leadership (fixed by Bugfix: revoke leadership on graceful shutdown #869)
  2. Node 1 instantly starts again, before a follower has been elected leader (because of lease ttl)
  3. Node 1 notice that it can't follow itself and returns from follow_leader
  4. Node 1 is still in ISR and can continue to leader election
  5. Lease TTL happens and node 2 is elected leader
  6. Node 1 won't follow becaues it returned from follow_leader
  7. Publish data to node 2 which won't be replicated to node 1
  8. Kill node 2
  9. New lease TTL, node 1 is elected leader
  10. Data loss because node 1 never synced

@spuun
Copy link
Member Author

spuun commented Dec 6, 2024

Maybe we should add a check to verify the node is in ISR when it's elected, else revoke lease and start over?

@carlhoerberg
Copy link
Member

Maybe we should add a check to verify the node is in ISR when it's elected, else revoke lease and start over?

yes!

@spuun
Copy link
Member Author

spuun commented Dec 10, 2024

I'll wait for #871 to be merged and test together with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants