Bugfix: use next instead of return to continue listen for events #867

spuun · 2024-12-04T12:27:08Z

WHAT is this pull request doing?

When a leader is restarted, it can end up in an idle state without turning into a follower. This happens because, if it starts up fast enough, it hasn't yet lost its leadership according to etcd, so the first election event returns the node itself as leader, and it can't be a follower of itself.

Changing the return to next makes it wait for the next event, which should be another node being promoted.
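
The difference can be sketched in Python, where `continue` plays the role of `next`. This is only an illustration under assumptions, not the project's code: the function names, the list of election events, and the node ids are all hypothetical, and "following" is reduced to returning the id of the leader that would be followed.

```python
def follow_leader_with_return(events, my_id):
    """Old behaviour: stop listening when the event names this node as leader."""
    for leader_id in events:
        if leader_id == my_id:
            return None  # gives up; later election events are never seen
        return leader_id  # would start following this leader
    return None


def follow_leader_with_next(events, my_id):
    """New behaviour: skip the event and keep waiting for another leader."""
    for leader_id in events:
        if leader_id == my_id:
            continue  # can't follow ourselves; wait for the next event
        return leader_id  # would start following this leader
    return None


# Hypothetical event stream: the restarted node first sees itself as leader,
# then another node is promoted once the lease expires.
events = ["node1", "node2"]
print(follow_leader_with_return(events, "node1"))  # None
print(follow_leader_with_next(events, "node1"))    # node2
```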

HOW can this pull request be tested?

I failed to write specs for this, so it is tested manually for now.

carlhoerberg · 2024-12-04T23:29:08Z

But wait, I think the thinking was that if this node is the leader, it should crash/close when it loses its leadership? Are we not doing that? A leader restarts and resumes its leadership; it shouldn't need to listen for other leadership changes, because when it loses its leadership it should exit the process?

spuun · 2024-12-06T07:22:08Z

> But wait, I think the thinking was that if this node is the leader, it should crash/close when it loses its leadership? Are we not doing that? A leader restarts and resumes its leadership; it shouldn't need to listen for other leadership changes, because when it loses its leadership it should exit the process?

Yeah, well, it's stopping for some reason. I think I may have found another bug now which is the real issue!

spuun · 2024-12-06T07:40:57Z

Hm, no. Because it returns from follow_leader, it gets stuck in leader election, and it's the follower (I'm testing with two nodes) that's elected. And since the old leader has returned from follow_leader, it will never follow.

And because it's stuck in leader election without following, it means message loss if we publish to the new leader and it dies. No data has been synced and the "idling" non-follower will be leader but with old data.

carlhoerberg · 2024-12-06T08:51:48Z

> And because it's stuck in leader election without following, it means message loss if we publish to the new leader and it dies. No data has been synced and the "idling" non-follower will be leader but with old data.

How can the idling non-follower become leader? It should check the ISR before trying to become leader.

spuun · 2024-12-06T08:59:25Z

> > And because it's stuck in leader election without following, it means message loss if we publish to the new leader and it dies. No data has been synced and the "idling" non-follower will be leader but with old data.
>
> How can the idling non-follower become leader? It should check the ISR before trying to become leader.

It's still in the ISR set, so it will just continue to leader election.

Edit: let me check this again, I misinterpreted your comment at first...

spuun · 2024-12-06T10:17:02Z

I think this is the flow:

1. Node 1 (leader) stops gracefully, but doesn't revoke leadership (fixed by Bugfix: revoke leadership on graceful shutdown #869)
2. Node 1 instantly starts again, before a follower has been elected leader (because of the lease TTL)
3. Node 1 notices that it can't follow itself and returns from follow_leader
4. Node 1 is still in the ISR and can continue to leader election
5. The lease TTL expires and node 2 is elected leader
6. Node 1 won't follow because it returned from follow_leader
7. Data is published to node 2 and won't be replicated to node 1
8. Node 2 is killed
9. The lease TTL expires again and node 1 is elected leader
10. Data loss, because node 1 never synced
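
The flow above can be replayed with a toy model. This is a hedged sketch only: the `Node` class, `publish`, and `run_scenario` are hypothetical names, replication is reduced to appending to a list, and etcd, leases, and elections are not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    following: bool = False  # True if the node is replicating from the leader
    messages: list = field(default_factory=list)


def publish(leader, others, message):
    """Store a message on the leader and replicate it to nodes that follow."""
    leader.messages.append(message)
    for node in others:
        if node.following:
            node.messages.append(message)


def run_scenario(node1_follows):
    """Return the data node 1 holds when it takes over after node 2 dies."""
    node1 = Node("node1", following=node1_follows)
    node2 = Node("node2")
    # Steps 5-7: node 2 is the leader and receives a publish.
    publish(node2, [node1], "m1")
    # Steps 8-9: node 2 is killed; node 1 is elected and serves its own data.
    return node1.messages


print(run_scenario(node1_follows=False))  # [] : the message is lost
print(run_scenario(node1_follows=True))   # ['m1']
```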

spuun · 2024-12-06T10:18:52Z

Maybe we should add a check to verify the node is in ISR when it's elected, else revoke lease and start over?
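
A rough sketch of what such a check might look like, written in Python for illustration. Everything here is an assumption: `on_elected`, the `isr` set of node ids, and the `revoke_lease` callback are hypothetical names, not the project's API.

```python
def on_elected(node_id, isr, revoke_lease):
    """Accept leadership only if this node is in the in-sync replica set.

    Returns True if the node may act as leader. Otherwise the lease is
    revoked so the election can start over, and False is returned.
    """
    if node_id in isr:
        return True
    revoke_lease()
    return False


# Usage with a stand-in callback that records the revocation.
revoked = []
print(on_elected("node1", {"node2"}, lambda: revoked.append("node1")))  # False
print(revoked)  # ['node1']
```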

carlhoerberg · 2024-12-09T09:08:40Z

> Maybe we should add a check to verify the node is in ISR when it's elected, else revoke lease and start over?

yes!

spuun · 2024-12-10T07:17:20Z

I'll wait for #871 to be merged and test together with that.

Bugfix: use next instead of return to continue to listen for events

2180d5f

spuun requested a review from a team as a code owner December 4, 2024 12:27

carlhoerberg approved these changes Dec 4, 2024

Check to be insync after elected leader

c281b0d
