ra_server: Update leaderboard after each command if cluster changed #417
Merged
Conversation
FTR, the testcase in RabbitMQ that is failing because of this: It takes a few runs locally to hit the problem.
dumbbell force-pushed the update-leaderboard-on-recovery branch from 4fd69f1 to 7f8727f on February 6, 2024 13:11
dumbbell force-pushed the update-leaderboard-on-recovery branch from f8410be to 7eb516b on February 6, 2024 14:55
dumbbell changed the title from "ra_server: Emit `update_leaderboard` effect after applying `ra_cluster_change`" to "ra_server: Update leaderboard after each command if cluster changed" on Feb 6, 2024
dumbbell force-pushed the update-leaderboard-on-recovery branch from 7eb516b to 301ca62 on February 6, 2024 14:58
kjnilsson reviewed on Feb 7, 2024
kjnilsson reviewed on Feb 7, 2024
[Why]

In a RabbitMQ test with a 5-node cluster where Erlang nodes are killed and restarted in a rolling fashion, I observed that the leaderboards might be inconsistent across the cluster.

Logs from node 1:

2024-02-06 12:29:15.792058+01:00 [debug] <0.251.0> Trying to restart local Ra server for store "rabbitmq_metadata" in Ra system "coordination"
2024-02-06 12:29:15.810508+01:00 [debug] <0.296.0> RabbitMQ metadata store: ra_log:init recovered last_index_term {60,1} first index 0
2024-02-06 12:29:15.813225+01:00 [debug] <0.296.0> RabbitMQ metadata store: post_init -> recover in term: 1 machine version: 0
2024-02-06 12:29:15.813291+01:00 [debug] <0.296.0> RabbitMQ metadata store: recovering state machine version 0:0 from index 0 to 0
2024-02-06 12:29:15.813347+01:00 [debug] <0.296.0> RabbitMQ metadata store: recovery of state machine version 0:0 from index 0 to 0 took 0ms
2024-02-06 12:29:15.813413+01:00 [info] <0.251.0> Waiting for Khepri leader for 30000 ms, 9 retries left
2024-02-06 12:29:15.813428+01:00 [debug] <0.296.0> RabbitMQ metadata store: recover -> recovered in term: 1 machine version: 0
2024-02-06 12:29:15.813484+01:00 [debug] <0.296.0> RabbitMQ metadata store: recovered -> follower in term: 1 machine version: 0
2024-02-06 12:29:15.813520+01:00 [debug] <0.296.0> RabbitMQ metadata store: is not new, setting election timeout.
2024-02-06 12:29:15.818609+01:00 [info] <0.251.0> Khepri leader elected
2024-02-06 12:29:15.820842+01:00 [debug] <0.296.0> RabbitMQ metadata store: leader call - leader not known. Command will be forwarded once leader is known.
2024-02-06 12:29:16.176876+01:00 [debug] <0.296.0> RabbitMQ metadata store: pre_vote election called for in term 1
2024-02-06 12:29:16.182866+01:00 [debug] <0.296.0> RabbitMQ metadata store: follower -> pre_vote in term: 1 machine version: 0
2024-02-06 12:29:16.183051+01:00 [debug] <0.296.0> RabbitMQ metadata store: pre_vote granted #Ref<0.2919399944.1500250113.30962> for term 1 votes 1
2024-02-06 12:29:16.183247+01:00 [debug] <0.296.0> RabbitMQ metadata store: election called for in term 2
2024-02-06 12:29:16.186666+01:00 [debug] <0.296.0> RabbitMQ metadata store: pre_vote -> candidate in term: 2 machine version: 0
2024-02-06 12:29:16.186815+01:00 [debug] <0.296.0> RabbitMQ metadata store: vote granted for term 2 votes 1
2024-02-06 12:29:16.186996+01:00 [notice] <0.296.0> RabbitMQ metadata store: candidate -> leader in term: 2 machine version: 0
2024-02-06 12:29:16.187220+01:00 [alert] <0.296.0> RA LEADER RECORD: ClusterName=rabbitmq_metadata Leader={rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'} Members=[{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'}]
2024-02-06 12:29:16.190313+01:00 [debug] <0.296.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:16.190648+01:00 [debug] <0.296.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:16.190878+01:00 [debug] <0.296.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:16.191393+01:00 [debug] <0.296.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-4-22584@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:16.200058+01:00 [debug] <0.296.0> RabbitMQ metadata store: enabling ra cluster changes in 2, index 61
2024-02-06 12:29:16.302892+01:00 [debug] <0.251.0> Khepri-based RabbitMQ metadata store ready

Logs from node 2:

2024-02-06 12:29:18.714368+01:00 [debug] <0.260.0> Trying to restart local Ra server for store "rabbitmq_metadata" in Ra system "coordination"
2024-02-06 12:29:18.732537+01:00 [debug] <0.310.0> RabbitMQ metadata store: ra_log:init recovered last_index_term {74,2} first index 0
2024-02-06 12:29:18.735899+01:00 [debug] <0.310.0> RabbitMQ metadata store: post_init -> recover in term: 2 machine version: 0
2024-02-06 12:29:18.735944+01:00 [debug] <0.310.0> RabbitMQ metadata store: recovering state machine version 0:0 from index 0 to 60
2024-02-06 12:29:18.736488+01:00 [info] <0.260.0> Waiting for Khepri leader for 30000 ms, 9 retries left
2024-02-06 12:29:18.741696+01:00 [debug] <0.310.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:18.741784+01:00 [debug] <0.310.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:18.741852+01:00 [debug] <0.310.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:18.741969+01:00 [debug] <0.310.0> RabbitMQ metadata store: applying ra cluster change to [{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-4-22584@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:18.743421+01:00 [debug] <0.310.0> RabbitMQ metadata store: recovery of state machine version 0:0 from index 0 to 60 took 8ms
2024-02-06 12:29:18.743487+01:00 [debug] <0.310.0> RabbitMQ metadata store: recover -> recovered in term: 2 machine version: 0
2024-02-06 12:29:18.748414+01:00 [debug] <0.310.0> RabbitMQ metadata store: recovered -> follower in term: 2 machine version: 0
2024-02-06 12:29:18.748445+01:00 [debug] <0.310.0> RabbitMQ metadata store: is not new, setting election timeout.
2024-02-06 12:29:18.748591+01:00 [info] <0.260.0> Khepri leader elected
2024-02-06 12:29:18.748642+01:00 [debug] <0.310.0> RabbitMQ metadata store: leader call - leader not known. Command will be forwarded once leader is known.
2024-02-06 12:29:19.036638+01:00 [debug] <0.310.0> RabbitMQ metadata store: pre_vote election called for in term 2
2024-02-06 12:29:19.040483+01:00 [debug] <0.310.0> RabbitMQ metadata store: follower -> pre_vote in term: 2 machine version: 0
2024-02-06 12:29:19.040652+01:00 [debug] <0.310.0> RabbitMQ metadata store: pre_vote granted #Ref<0.772528231.2305294338.197063> for term 2 votes 1
2024-02-06 12:29:19.043116+01:00 [debug] <0.310.0> RabbitMQ metadata store: pre_vote -> follower in term: 2 machine version: 0
2024-02-06 12:29:19.043212+01:00 [debug] <0.310.0> RabbitMQ metadata store: is not new, setting election timeout.
2024-02-06 12:29:19.044068+01:00 [debug] <0.310.0> RabbitMQ metadata store: enabling ra cluster changes in 2, index 61
2024-02-06 12:29:19.061918+01:00 [alert] <0.310.0> RA LEADER RECORD: ClusterName=rabbitmq_metadata Leader={rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'} Members=[{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-2-22488@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-3-22536@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-4-22584@localhost'},{rabbitmq_metadata,'rmq-ct-rolling_kill_restart-5-22632@localhost'}]
2024-02-06 12:29:19.062191+01:00 [info] <0.310.0> RabbitMQ metadata store: detected a new leader {rabbitmq_metadata,'rmq-ct-rolling_kill_restart-1-22440@localhost'} in term 2
2024-02-06 12:29:19.062378+01:00 [debug] <0.310.0> RabbitMQ metadata store: Leader node 'rmq-ct-rolling_kill_restart-1-22440@localhost' may be down, setting pre-vote timeout
2024-02-06 12:29:19.151724+01:00 [debug] <0.260.0> Khepri-based RabbitMQ metadata store ready

Logs on nodes 3, 4 and 5 are similar to node 2.

Note that messages starting with "RA LEADER RECORD" were added to `ra_leaderboard:record/3` during debugging and are not committed to Ra.

We can see that node 1 is elected the leader on recovery before the `ra_cluster_change` commands are applied. It stays the leader afterwards but never updates its leaderboard.

[How]

Instead of emitting an `update_leaderboard` effect after a cluster change command was appended or after some specific events, we check whether the cluster changed after a command was applied and update the leaderboard if it did.

This check is performed on the leader and all followers. It is also performed once a Ra server has finished recovery.

V2: This revision changes the approach. The first iteration updated the leaderboard after a `ra_cluster_change` command was applied, in addition to the existing updates.

V3: Refactor the approach and ensure the leaderboard is updated after recovery as well as whenever the cluster changes on any member.
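For illustration, here is a minimal Erlang sketch of the check described under [How]. The module, function and variable names are hypothetical; only `ra_leaderboard:record/3` is the real Ra call referenced above.

```erlang
%% Hypothetical sketch, not Ra's actual internals: after a command has been
%% applied, compare the membership before and after and refresh the
%% leaderboard only when it actually changed.
-module(leaderboard_sketch).
-export([maybe_update_leaderboard/4]).

maybe_update_leaderboard(ClusterName, ClusterBefore, ClusterAfter, Leader) ->
    Before = lists:sort(maps:keys(ClusterBefore)),
    After = lists:sort(maps:keys(ClusterAfter)),
    case After =:= Before of
        true ->
            %% Membership unchanged: nothing to record.
            ok;
        false ->
            %% ra_leaderboard:record/3 is the API mentioned in the
            %% description; everything else here is illustrative.
            ra_leaderboard:record(ClusterName, Leader, After)
    end.
```

The same check would also run once recovery completes, so a server that is elected before replaying its `ra_cluster_change` entries still ends up with an up-to-date leaderboard.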
In Raft the cluster is changed as soon as a cluster change command is written to the log. Hence, to correctly recover the current cluster, a state machine needs to scan to the end of the log on recovery to ensure it picks up any observed changes. Ra didn't do this and thus was vulnerable to picking up the wrong cluster configuration (which of course would affect subsequent leader elections).
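As a rough sketch of that idea (the entry and command shapes below are assumptions, not Ra's real recovery code): the effective cluster is whatever the latest cluster change entry anywhere in the log says, so recovery has to fold over entries up to the last written index rather than stopping at the last applied one.

```erlang
%% Sketch under assumed data shapes: log entries are {Idx, Term, Command} and
%% cluster change commands are tagged '$ra_cluster_change'. The last such
%% entry in the log wins, which is why the scan must reach the end of the log.
-module(recover_cluster_sketch).
-export([recover_cluster/2]).

recover_cluster(LogEntries, InitialCluster) ->
    lists:foldl(
      fun({_Idx, _Term, {'$ra_cluster_change', _Meta, NewCluster, _Reply}}, _Acc) ->
              NewCluster;
         (_OtherEntry, Acc) ->
              Acc
      end,
      InitialCluster,
      LogEntries).
```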
When the WAL detects a new writer it creates a new ETS mem table and inserts an initial record into the open_mem_tables table. This is so that the open_mem_tables table can later be updated using ets:update_element/3. However, the initial record indicated that the mem table already contained a single entry, which isn't yet true: the entries are only written a little later, during batch completion. This is an attempt to avoid that condition by initially setting the "To" field to one less than the "From" field (i.e. "From" - 1), so the record advertises zero entries until the batch completes.
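A tiny sketch of the intent (the table handle, key and tuple layout are assumptions, not the exact ra_log_wal structures; only ets:insert/2 and ets:update_element/3 are the real ETS calls):

```erlang
%% Assumed record layout: {WriterKey, From, To, Tid}. The initial record sets
%% To = From - 1, i.e. "no entries yet"; batch completion later bumps the
%% "To" element (position 3) once the entries have actually been written.
-module(wal_mem_table_sketch).
-export([register_new_writer/4, complete_batch/3]).

register_new_writer(OpenMemTables, WriterKey, From, Tid) ->
    true = ets:insert(OpenMemTables, {WriterKey, From, From - 1, Tid}).

complete_batch(OpenMemTables, WriterKey, LastWrittenIdx) ->
    true = ets:update_element(OpenMemTables, WriterKey, {3, LastWrittenIdx}).
```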
kjnilsson force-pushed the update-leaderboard-on-recovery branch from ae4a61f to 2c9a7fe on February 8, 2024 16:50