How to improve high availability when waiting in waitTillLeaderIsReadyOrStepDown #73

Open

jackjoesh opened this issue Apr 1, 2022 · 10 comments
@jackjoesh

For the leader to be truly ready, it needs to successfully apply all business data after the election.
During this period the leader cannot actually serve requests; in Gringofts, the method waitTillLeaderIsReadyOrStepDown keeps waiting.
If a lot of data needs to be applied, the wait can be long.
Does Gringofts have any high-availability mechanisms that minimize the time the service is unavailable?
Thank you.

@huyumars
Contributor

huyumars commented Apr 1, 2022

Hi, thanks for your interest.
This step waits for two things: first, the leader coming on stage, and second, applying the latest log.

  • The first one is very quick if the Raft cluster is healthy.
  • For the second one,
    • if the instance has just started up, it might take time to recover state from RocksDB and apply the remaining logs to reach the latest state.
    • if it is going from follower to leader, we have a little trick. We run two single-threaded loops: the CPL, for command processing, decodes commands into events and then commits those events to the Raft layer; the EAL, for event applying, applies committed events from the Raft log entries to maintain the latest state. The leader runs both; followers run only the EAL. When a follower becomes the new leader, it can directly use the state machine the EAL has been building as its latest state, so the switch is also fast (see the sketch below).
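
To make the two-loop design concrete, here is a minimal, hypothetical C++ sketch; the names (FakeRaftLog, eal, cpl) are illustrative stand-ins, not Gringofts' actual classes. The EAL thread runs on every node and applies committed events to the state, while the CPL exists only on the leader and commits events to the raft layer:

```cpp
// Minimal sketch of the two single-threaded loops described above
// (illustrative names, not Gringofts' actual classes).
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>

struct Event { std::string key; int delta; };

// Stand-in for the raft layer: events count as "committed" once enqueued.
class FakeRaftLog {
 public:
  void commit(const Event& e) {
    std::lock_guard<std::mutex> lk(mu_);
    committed_.push(e);
    cv_.notify_one();
  }
  bool nextCommitted(Event* out, std::atomic<bool>& stop) {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait_for(lk, std::chrono::milliseconds(50),
                 [&] { return !committed_.empty() || stop.load(); });
    if (committed_.empty()) return false;
    *out = committed_.front();
    committed_.pop();
    return true;
  }
 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<Event> committed_;
};

int main() {
  FakeRaftLog raft;
  std::unordered_map<std::string, int> state;  // the EAL-maintained state
  std::atomic<bool> stop{false};

  // EAL: runs on leader AND followers, so the state is always warm.
  std::thread eal([&] {
    Event e;
    while (!stop) {
      if (raft.nextCommitted(&e, stop)) state[e.key] += e.delta;
    }
  });

  // CPL: leader-only; decodes commands into events and commits them to raft.
  std::thread cpl([&] {
    for (int i = 0; i < 3; ++i) raft.commit(Event{"balance", 10});
  });

  cpl.join();
  std::this_thread::sleep_for(std::chrono::milliseconds(200));
  stop = true;
  eal.join();
  std::cout << "balance = " << state["balance"] << "\n";  // prints 30
  return 0;
}
```

Because the follower's EAL thread has been running all along, the state it maintains is already current the moment that follower is elected leader.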

@jackjoesh
Author


Thank you for your quick reply. Yes, I'm talking about the second one.
Suppose we restart a follower instance for a release, and at that moment the leader happens to fail, so the freshly restarted follower wins the election and becomes leader. This new leader may take a long time to load state from the RocksDB snapshot and apply the remaining logs to reach the latest state.
What should we do in this situation?

@huyumars
Contributor

huyumars commented Apr 2, 2022

Our state machine takes advantage of RocksDB. The EAL only writes to RocksDB, and the CPL reads from an in-memory cache plus the same shared RocksDB.
You can think of the RocksDB state as the latest committed state, which is also the EAL's state; the CPL's state is the RocksDB state plus the uncommitted command state held in memory.
So when a follower becomes the new leader, the swap of EAL and CPL roles simply means the CPL starts writing data in memory on top of the latest committed state in RocksDB, while the EAL keeps writing to the same RocksDB. Both are therefore up to date, and recovery is very fast (a sketch of this read path is shown below).
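
A hedged sketch of that read path: it uses the real RocksDB C++ API, but the CplView class and its methods are illustrative assumptions of mine, not Gringofts' actual cache. The EAL persists committed values into RocksDB, and the CPL reads through an in-memory overlay of uncommitted results layered on top of the same DB:

```cpp
// Sketch: CPL state = in-memory uncommitted overlay + shared committed RocksDB.
// CplView is a hypothetical name, not Gringofts' actual class.
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>
#include <rocksdb/db.h>

class CplView {
 public:
  explicit CplView(rocksdb::DB* db) : db_(db) {}

  // Results of in-flight (uncommitted) commands live only in memory.
  void stageUncommitted(const std::string& k, const std::string& v) {
    uncommitted_[k] = v;
  }

  // On commit, the EAL persists the value; the overlay entry can be dropped.
  void onCommitted(const std::string& k, const std::string& v) {
    db_->Put(rocksdb::WriteOptions(), k, v);  // the EAL's write
    uncommitted_.erase(k);
  }

  // CPL read: check the in-memory overlay first, then the committed RocksDB.
  std::optional<std::string> read(const std::string& k) {
    auto it = uncommitted_.find(k);
    if (it != uncommitted_.end()) return it->second;
    std::string v;
    auto s = db_->Get(rocksdb::ReadOptions(), k, &v);
    if (s.ok()) return v;
    return std::nullopt;  // not found in either layer
  }

 private:
  rocksdb::DB* db_;
  std::unordered_map<std::string, std::string> uncommitted_;
};

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  auto s = rocksdb::DB::Open(opts, "/tmp/cpl_view_demo", &db);
  if (!s.ok()) return 1;

  CplView view(db);
  view.onCommitted("acct", "100");          // committed base state
  view.stageUncommitted("acct", "90");      // in-flight command result
  std::cout << *view.read("acct") << "\n";  // prints 90 (overlay wins)

  view.onCommitted("acct", "90");           // the command commits
  std::cout << *view.read("acct") << "\n";  // prints 90 (now from RocksDB)
  delete db;
  return 0;
}
```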

@jackjoesh
Author

jackjoesh commented Apr 5, 2022


Thank you, I get your design point.
Because we want to store idempotency records and the original request data in RocksDB (one update may store N fund items), the data volume is very large. I think such large data will lower the snapshot frequency, which means a very large number of events will need to be applied; so when a follower becomes leader, the EAL's apply will also be very slow.
Does your actual business store idempotency and original request data in the EAL's RocksDB? How do you solve this problem? Thank you!

@huyumars
Contributor

huyumars commented Apr 7, 2022

In practice, there is no extra backlog of EAL events to apply, since the EAL should keep up with the committed index (see the sketch below).
Yes, we have production systems running Gringofts; that's why we open-sourced it and maintain it. This design has already been proven in our systems.
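
In loop form, the invariant described here might look like the following illustrative sketch (not Gringofts' actual code): the EAL keeps its last-applied index pinned to the raft commit index, so a leadership change finds no backlog to drain.

```cpp
// Illustrative EAL catch-up loop: apply every committed-but-unapplied entry.
#include <cstdint>
#include <iostream>
#include <vector>

struct Entry { int delta; };  // a decoded, committed event

struct StateMachine {
  std::uint64_t lastApplied = 0;  // index of the last applied log entry
  long value = 0;
  void apply(const Entry& e) { value += e.delta; ++lastApplied; }
};

// Drain everything up to commitIndex; the log is 1-indexed, as in raft.
void catchUp(StateMachine& sm, const std::vector<Entry>& log,
             std::uint64_t commitIndex) {
  while (sm.lastApplied < commitIndex) {
    sm.apply(log[sm.lastApplied]);  // entry at index lastApplied + 1
  }
}

int main() {
  std::vector<Entry> log{{1}, {2}, {3}};
  StateMachine sm;
  catchUp(sm, log, 3);            // the EAL is now at the committed index
  std::cout << sm.value << "\n";  // prints 6
  return 0;
}
```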

@jackjoesh
Author

Thank you, I get it.

@jackjoesh
Author


One last question: what is the maximum TPS a single Gringofts group can support (just a single group, not multiple groups)? Thank you.

@jackyjia
Contributor

> I think such large data will lower the snapshot frequency, which means a very large number of events will need to be applied; so when a follower becomes leader, the EAL's apply will also be very slow. How do you solve this problem?

The apply function should execute quickly, and it can, since most of the heavy work has already been handled in the process function; apply just installs the processed result (roughly as in the sketch below).
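
A small illustrative sketch of that split, with hypothetical command/event types rather than Gringofts' actual API: process() carries the heavy validation and computation and emits events holding the precomputed result, while apply() only installs it, so replaying a long log stays cheap.

```cpp
// Sketch of the process/apply split (illustrative types, not Gringofts' API).
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

struct TransferCommand { std::string from, to; long amount; };
struct TransferEvent   { std::string from, to; long amount; };  // precomputed

using Balances = std::unordered_map<std::string, long>;

// Heavy path: validation, business rules, computation of the result.
std::vector<TransferEvent> process(const Balances& state,
                                   const TransferCommand& cmd) {
  auto it = state.find(cmd.from);
  if (it == state.end() || it->second < cmd.amount)
    throw std::runtime_error("insufficient funds");
  return {TransferEvent{cmd.from, cmd.to, cmd.amount}};
}

// Cheap path: apply only installs the already-validated result, so catching
// up after a leader switch (or replaying the log) remains fast.
void apply(Balances& state, const TransferEvent& ev) {
  state[ev.from] -= ev.amount;
  state[ev.to]   += ev.amount;
}

int main() {
  Balances state{{"alice", 100}};
  for (const auto& ev : process(state, {"alice", "bob", 40})) apply(state, ev);
  return state["bob"] == 40 ? 0 : 1;
}
```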

@jackyjia
Contributor

jackyjia commented Apr 15, 2022


Just curious, which industry are you in and what's your use case?

@jackyjia
Contributor

> What is the maximum TPS a single Gringofts group can support (just a single group, not multiple groups)?

It depends on several factors:

  1. how complicated the process logic is
  2. the request payload size
  3. network bandwidth
  4. cluster setup: usually, the more instances in the cluster, the lower the throughput

In our production, TPS is around ~8K.
