perf: enhance the loading process of replicas particularly when a significant number of replicas are spread across multiple disks #2078

Merged: 43 commits, Dec 26, 2024

@empiredan empiredan commented Jul 18, 2024

Immediately after the replica server starts, all of the replicas under the data
directories are loaded. Currently, one loading task is launched for each replica
directory. These tasks are pushed into a partitioned thread pool (namely
THREAD_POOL_REPLICATION) disk by disk: only after all of the replica directories
on the current disk have been pushed into the thread pool does the process move
on to the next disk. Since the thread pool is partitioned and the hash for each
task is the total number of tasks that had been pushed into the pool before this
task was added, the tasks are spread over all threads and all of the replica
directories on one disk are loaded concurrently. This leads to two problems once
there are a large number of replica directories on each disk:

  • I/O usage for each disk might become saturated: its %util might reach 100%;
  • The entire loading process is blocked on one disk at a time: for a long period only
    one disk is busy while the others are idle.
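
The dispatch order described above can be modelled with a small sketch. It is only an illustration under stated assumptions: `DispatchedTask`, `dispatch_disk_by_disk` and the rule "worker = hash % worker_count" are hypothetical names and a simplified model of a partitioned pool, not the actual Pegasus task API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of where one loading task ends up.
struct DispatchedTask {
    std::string disk;       // disk the replica directory lives on
    std::size_t dir_index;  // index of the replica directory within that disk
    std::size_t worker;     // worker thread chosen by the partitioned pool
};

// Models the old behaviour: tasks are pushed disk by disk, and the hash of each
// task is the number of tasks pushed before it. Assumes worker_count > 0 and
// that a partitioned pool routes a task to worker (hash % worker_count).
std::vector<DispatchedTask> dispatch_disk_by_disk(
    const std::vector<std::pair<std::string, std::size_t>>& dirs_per_disk,
    std::size_t worker_count)
{
    std::vector<DispatchedTask> order;
    std::size_t pushed = 0;  // total number of tasks pushed so far
    for (const auto& entry : dirs_per_disk) {
        for (std::size_t i = 0; i < entry.second; ++i) {
            order.push_back({entry.first, i, pushed % worker_count});
            ++pushed;
        }
    }
    return order;
}
```

With two disks of four directories each and four workers, the first four tasks all belong to the first disk and occupy every worker, so the second disk cannot start until the first one drains.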

As a result, the replica server appears to get stuck loading replicas after it is
started. This is unacceptable and should be changed.

The improved version allows the replica directories on different disks to be loaded
simultaneously, so that every disk is busy loading replicas. Also, loading tasks are
pushed into a non-partitioned thread pool (i.e. THREAD_POOL_LOCAL_APP) instead of
the partitioned one, so that tasks are balanced automatically across threads and
no thread is starved while others are overloaded.
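
One way to keep every disk busy is to submit replica directories round-robin across disks rather than disk by disk. The sketch below only illustrates that ordering; `interleave_across_disks` is a hypothetical helper, and the actual scheduling in this PR may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns (disk, replica_dir) pairs in the order they would be submitted when
// taking one directory from each disk per round. Disks with fewer directories
// simply drop out of later rounds.
std::vector<std::pair<std::string, std::string>> interleave_across_disks(
    const std::vector<std::pair<std::string, std::vector<std::string>>>& disks)
{
    std::vector<std::pair<std::string, std::string>> order;
    for (std::size_t round = 0;; ++round) {
        bool pushed = false;
        for (const auto& disk : disks) {
            if (round < disk.second.size()) {
                order.emplace_back(disk.first, disk.second[round]);
                pushed = true;
            }
        }
        if (!pushed) {
            break;  // every disk has been exhausted
        }
    }
    return order;
}
```

Because consecutive tasks target different disks, a non-partitioned pool that picks up tasks in submission order starts working on all disks from the very beginning.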

A new parameter is added to restrict the maximum number of replicas allowed to be
loaded simultaneously for each disk, in case the I/O usage of a disk becomes saturated.
Another parameter is added to ensure that the main thread, which waits for all loading
tasks to finish, is not blocked on one task for too long once the number of tasks
loading replica directories simultaneously on a single disk has reached its limit.

Parameters are added as follows:

[replication]
+ max_replicas_on_load_for_each_disk = 256
+ load_replica_max_wait_time_ms = 10
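
How the two options could interact is sketched below. This is a minimal model, not the code in `replica_stub.cpp`: `LoadOptions` and `load_all` are hypothetical, `std::async` stands in for the Pegasus thread pool, and only the oldest in-flight task of a disk is waited on.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-ins for the two new [replication] options.
struct LoadOptions {
    std::size_t max_replicas_on_load_for_each_disk = 256;
    std::chrono::milliseconds load_replica_max_wait_time{10};
};

// disks[d] holds the loading jobs of disk d. Returns the number of finished jobs.
std::size_t load_all(const std::vector<std::vector<std::function<void()>>>& disks,
                     const LoadOptions& opts)
{
    // Treat a limit of 0 as 1 so that the loop below always makes progress.
    const std::size_t limit =
        std::max<std::size_t>(1, opts.max_replicas_on_load_for_each_disk);
    std::vector<std::size_t> next(disks.size(), 0);  // next job to start per disk
    std::vector<std::deque<std::future<void>>> in_flight(disks.size());

    std::size_t total = 0;
    for (const auto& jobs : disks) {
        total += jobs.size();
    }

    std::size_t finished = 0;
    while (finished < total) {
        for (std::size_t d = 0; d < disks.size(); ++d) {
            auto& running = in_flight[d];
            if (next[d] < disks[d].size() && running.size() < limit) {
                // Below the per-disk limit: start one more job, then visit the next disk.
                running.push_back(std::async(std::launch::async, disks[d][next[d]]));
                ++next[d];
                continue;
            }
            if (running.empty()) {
                continue;  // this disk is completely loaded
            }
            // At the limit (or draining): wait for the oldest job, but only for a
            // bounded time, so that the other disks are not left unattended.
            if (running.front().wait_for(opts.load_replica_max_wait_time) ==
                std::future_status::ready) {
                running.front().get();
                running.pop_front();
                ++finished;
            }
        }
    }
    return finished;
}
```

In this model a smaller `max_replicas_on_load_for_each_disk` lowers the I/O pressure on each disk, while a smaller `load_replica_max_wait_time_ms` makes the main thread return to the other disks sooner when one disk is at its limit.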

@github-actions github-actions bot added the cpp label Jul 18, 2024
@empiredan empiredan marked this pull request as ready for review July 30, 2024 09:13
@empiredan empiredan force-pushed the optimize-load-replica branch from 80f2a64 to 029319f Compare August 23, 2024 09:32
@empiredan empiredan force-pushed the optimize-load-replica branch from aaa9c43 to 0b047f9 Compare September 18, 2024 07:20
@empiredan empiredan force-pushed the optimize-load-replica branch 2 times, most recently from 9bffedc to 9a8e525 Compare September 23, 2024 10:49
@empiredan empiredan force-pushed the optimize-load-replica branch from 6e246bb to 018d3b4 Compare September 29, 2024 09:52
@empiredan empiredan force-pushed the optimize-load-replica branch from 018d3b4 to 1d8a460 Compare December 16, 2024 06:24
@github-actions github-actions bot removed the github label Dec 16, 2024
@empiredan empiredan added type/config-change Added or modified configuration that should be noted on release note of new version. type/performance performance optimization or tunning labels Dec 26, 2024
@empiredan empiredan merged commit d711b08 into apache:master Dec 26, 2024
108 of 110 checks passed
Labels
cpp scripts type/config-change Added or modified configuration that should be noted on release note of new version. type/performance performance optimization or tunning