issue 259/260: incremental resync #311

Merged: 10 commits into eBay:master, Feb 21, 2024
Conversation

@yamingk (Contributor) commented Feb 8, 2024

This PR relies on a sisl change to build/work: eBay/sisl#216

  1. Enable a non-recursive timer to be armed and triggered during fetch of remote data (normal I/O path) and in resync mode.
  2. Add error handling for the case where the originator is no longer there, e.g. the originator crashed right after sending the append log entries. In this case, the member being resynced is expected to receive log entries from the new leader with the corresponding originator id set in the rreq, so that the next fetch-remote attempt is directed to this new originator.
  3. Add a test case for incremental resync. Verified that batches of rreqs can be fetched from the originator and that the blocks are written to disk.
  4. Staging handling: added a dynamic config for the maximum data size that can be fetched from remote (a rough sketch of the batching idea follows this list). Added a test case with a flip point to simulate staging on the sending side. Test case passed.
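For illustration, a rough sketch of the batching idea in point 4: pending fetch requests are split into batches whose total payload stays under a configurable limit. The names here (FetchReq, fetch_batch, max_batch_size) are hypothetical stand-ins; the real code (fetch_data_from_remote in raft_repl_dev.cpp) derives the limit from a dynamic config value.

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct FetchReq { uint64_t size_bytes; };                          // stand-in for a repl_req

    void fetch_batch(std::vector< FetchReq > batch) { /* issue one remote-fetch RPC */ }

    void fetch_in_batches(std::vector< FetchReq > const& rreqs, uint64_t max_batch_size) {
        std::vector< FetchReq > next_batch;
        uint64_t total_size_to_fetch = 0;
        for (auto const& rreq : rreqs) {
            next_batch.push_back(rreq);                                // add first, then check the limit
            total_size_to_fetch += rreq.size_bytes;
            if (total_size_to_fetch >= max_batch_size) {               // flush once the limit is reached
                fetch_batch(std::move(next_batch));
                next_batch.clear();
                total_size_to_fetch = 0;
            }
        }
        if (!next_batch.empty()) { fetch_batch(std::move(next_batch)); }   // remainder
    }

Adding the request before checking the limit also covers a single request that is by itself larger than the limit, which is the edge case raised later in the review.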

Other tests:

  1. Release build passes.
  2. Sanitize build passes.

@codecov-commenter commented Feb 13, 2024

Codecov Report

Attention: 49 lines in your changes are missing coverage. Please review.

Comparison is base (5654699) 0.00% compared to head (4130eca) 47.66%.
Report is 1 commit behind head on master.

Files | Patch % | Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp | 0.00% | 49 Missing ⚠️


Additional details and impacted files
@@             Coverage Diff             @@
##           master     #311       +/-   ##
===========================================
+ Coverage    0.00%   47.66%   +47.66%     
===========================================
  Files         108      107        -1     
  Lines        9500     9309      -191     
  Branches     1257     1210       -47     
===========================================
+ Hits            0     4437     +4437     
+ Misses       9500     4396     -5104     
- Partials        0      476      +476     


"retry when new leader start appending log entries",
rreqs.front()->remote_blkid.server_id, e.error());
for (auto const& rreq : rreqs) {
handle_error(rreq, RaftReplService::to_repl_error(e.error()));
Contributor:

Not sure how we switch to the current leader if the originator is down.

Contributor:

@sanebay If the originator is down, those requests time out. But later, when a new leader is elected, that leader will send the new data, which will create a new rreq, and we will fetch the data and commit.

Contributor:

If the originator is removed, a new leader is elected, and that leader is sending resync, won't we keep timing out for each entry because we don't know who the leader is? We won't be able to move forward. One option is to first check with raft who the leader is and, if not found, ask all replicas who the leader is.

Contributor Author:

It is explained at line 485: the new leader will re-send the append entries with the originator set to the new leader. We just need to handle the failure here if the originator goes down.

Collaborator:

Is there any case where we can error out here but not get new append entries? E.g. temporary network flips, or even a restart before raft times out (so the leader doesn't change)?

Contributor Author:

Responded in the other thread for this comment (it looks like a duplicate?).

g_helper->runner().set_num_tasks(SISL_OPTIONS["num_io"].as< uint64_t >());

// before replica-1 started, issue I/O so that replica-1 is lagging behind;
auto block_size = SISL_OPTIONS["block_size"].as< uint32_t >();
Contributor:

Won't replica-1 have restarted by the time we finish waiting for 5 seconds, so that replica-1 never lags behind? Would shutting down without removing files and then starting homestore on replica-1 simulate that?

Contributor:

I think Sanal is right: we might restart replica-1, and it might have already resynced by the time the sleep completes, so this would be a regular append. If so, we can improve the restart method (restart_homestore) to add a controllable sleep between shutdown and restart, so we can make sure replica-1 is down while this happens.

Contributor Author:

The shutdown actually takes some time to complete; that is why we sleep 5 seconds here. It can be shown that, every time, replica-1 will be lagging by more than 64 append entries. Without this 5-second sleep, it would need more than 10K I/Os, which is fine but unnecessary, and the verification would also take longer, increasing the test case run time.

Collaborator:

There is a 10s sleep between shutdown and start in start_homestore, so there is no possibility of replica-1 starting before the I/O starts (i.e. it cannot avoid lagging behind).

But it seems we want to simulate resync while I/O is still going on, which is harder, since we are expecting the 300 I/Os to span longer than (10 - 5) = 5s. Better to keep the I/O going by triggering one more batch at a time until we see replica-1 has joined back, optionally do one more round of I/O, and then go to this->wait_for_all_writes(exp_entries).

Contributor Author:

I tested 3000 I/Os, and we could actually do that by default, but I thought we don't want to increase the build/test time for everyone. I think we'd better do it in the long-duration test: #319
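For reference, a rough sketch of the "continue I/O until replica-1 rejoins" flow suggested above, using hypothetical helpers (replica1_joined_back, issue_io_batch) rather than the actual test API:

    #include <cstdint>

    // Hypothetical helpers for illustration only, not the actual test API.
    bool replica1_joined_back() { return true; }                   // true once replica-1 has rejoined the group
    uint64_t issue_io_batch(uint64_t n) { return n; }              // submit n writes, return entries produced

    uint64_t keep_io_until_rejoin(uint64_t batch_size) {
        uint64_t entries = 0;
        // Keep the workload running so replica-1 has to resync while I/O is live.
        while (!replica1_joined_back()) {
            entries += issue_io_batch(batch_size);
        }
        entries += issue_io_batch(batch_size);                     // optional extra round after rejoin
        return entries;                                            // caller then calls wait_for_all_writes(entries)
    }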

@@ -184,16 +184,16 @@ void RaftReplDev::on_fetch_data_received(intrusive< sisl::GenericRpcData >& rpc_
ctx->outstanding_read_cnt = fetch_req->request()->entries()->size();
Contributor:

I don't think we need the Context and read count etc., which is reimplementing future/promise, right? I suggest creating a promise, moving the promise into the callback, and doing setValue there. Then we don't need a mutex, cv, and atomic count explicitly.

Contributor Author (@yamingk), Feb 15, 2024:

We discussed this. Somehow I was hitting a hang while using future/promise via collectAll* here (tried a few variations); the receiving side doing async_write is already calling collectAllUnsafe. I think it is because this runs on the server thread, and collectAllUnsafe() somehow blocks, which causes the hang? We can increase the server thread count to see if that resolves it.

src/lib/manager_impl.cpp (line 27):
constexpr auto grpc_server_threads = 1u;

But on the receiving side doing async_write, it is the client thread, and the batch append blocks there anyway if the data write has not yet completed.
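A minimal sketch of the promise/future pattern suggested above, assuming folly is available (the surrounding code already uses collectAll*); Entry and async_read_entry are hypothetical stand-ins for the real read path:

    #include <folly/futures/Future.h>
    #include <functional>
    #include <memory>
    #include <vector>

    struct Entry { /* blkid, buffer, ... */ };

    // Hypothetical async read; invokes cb on completion (inline here for brevity).
    void async_read_entry(Entry const&, std::function< void() > cb) { cb(); }

    folly::Future< folly::Unit > read_all_entries(std::vector< Entry > const& entries) {
        std::vector< folly::Future< folly::Unit > > futs;
        futs.reserve(entries.size());
        for (auto const& e : entries) {
            auto p = std::make_shared< folly::Promise< folly::Unit > >();
            futs.emplace_back(p->getFuture());
            // The completion callback fulfills the promise; no mutex, cv, or
            // atomic outstanding-read counter is needed.
            async_read_entry(e, [p] { p->setValue(folly::Unit{}); });
        }
        // Resolves once every per-entry read has completed.
        return folly::collectAllUnsafe(std::move(futs))
            .thenValue([](auto&&) { return folly::Unit{}; });
    }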

}
}

void RaftReplDev::fetch_data_from_remote(std::vector< repl_req_ptr_t >* rreqs) {
void RaftReplDev::fetch_data_from_remote(std::vector< repl_req_ptr_t > rreqs) {
Contributor:

Nit: we were passing a vector pointer earlier instead of a vector by value because, down the line, we capture this vector in collectAllUnsafe().thenValue() and incur expensive extra allocations. You might argue that this function is not on the critical path and the additional capture doesn't matter. My only thought is that if we reuse this function for any reason in the future, it might be helpful; but it's an optional nit.

Contributor Author (@yamingk), Feb 15, 2024:

We can still do std::move on the vector to avoid allocation (line 440 also does it), right?

It is a vector (by value) now because of the staging handling; we can't do start_index/end_index with a pointer to the vector, as that won't work with std::move of the whole vector.
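A small, self-contained illustration of the by-value plus std::move pattern discussed here; fetch_batch is a hypothetical stand-in for fetch_data_from_remote(std::vector< repl_req_ptr_t > rreqs):

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for fetch_data_from_remote(std::vector< repl_req_ptr_t > rreqs).
    void fetch_batch(std::vector< int > rreqs) {
        // Moving the whole vector into the continuation avoids copying the element
        // storage; only the vector's internal pointers change hands.
        auto continuation = [batch = std::move(rreqs)] { return batch.size(); };
        std::printf("batch size: %zu\n", continuation());
    }

    int main() {
        std::vector< int > reqs{1, 2, 3};
        fetch_batch(std::move(reqs));  // pass by value + move: no element copies
    }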

ASSERT_TRUE(leader_str != my_id_str) << "I am a follower, Leader id " << leader_str.c_str()
<< " should not equals to my ID " << my_id_str.c_str();
LOGINFO("Wait for grpc connection to replica-1 to expire and removed from raft-groups.");
std::this_thread::sleep_for(std::chrono::seconds{5});
Contributor:

I feel that instead of a 5-second sleep, we could query the repl_dev, check that the number of alive members is 2, and then proceed?

Contributor Author:

The issue I dealt with is that replica-1 is already gone, but shutdown takes a long time (taking a cp, etc.), and if we start I/O immediately after replica-1 is gone, the 300 I/Os finish too fast, before replica-1 is back. I want to simulate resync while I/Os are still ongoing.
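A small sketch of the condition-based wait suggested above, with a hypothetical predicate (for example "alive members == 2") instead of a fixed sleep:

    #include <chrono>
    #include <functional>
    #include <thread>

    // Poll until pred() holds or the timeout expires; replaces a fixed sleep.
    bool wait_for(std::function< bool() > const& pred, std::chrono::seconds timeout) {
        auto const deadline = std::chrono::steady_clock::now() + timeout;
        while (std::chrono::steady_clock::now() < deadline) {
            if (pred()) { return true; }
            std::this_thread::sleep_for(std::chrono::milliseconds{100});
        }
        return pred();
    }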

sanebay previously approved these changes Feb 16, 2024
for (auto const& rreq : *rreqs) {
auto const& size = rreq->remote_blkid.blkid.blk_count() * get_blk_size();
if ((total_size_to_fetch + size) >= max_batch_size) {
fetch_data_from_remote(std::move(next_batch_rreqs));
Collaborator:

Nit: if the first rreq is already larger than max_batch_size, we end up with a no-op fetch_data_from_remote, though there seems to be special handling for that in fetch_data_from_remote.

Collaborator:

Maybe:

  for (auto const& rreq : *rreqs) {
      auto const& size = rreq->remote_blkid.blkid.blk_count() * get_blk_size();
      total_size_to_fetch += size;
      next_batch_rreqs.emplace_back(rreq);
      if (total_size_to_fetch >= max_batch_size) {
          fetch_data_from_remote(std::move(next_batch_rreqs));
          ...
      }
  }

Contributor Author (@yamingk), Feb 20, 2024:

Normally I tend to update the size after the corresponding operation has completed, for ease of error handling, etc.

But you have a good point: if a single rreq (not necessarily the first one) has a remote_blkid addressing as much as its maximum blk count, which is 65535 blks, it could be as large as 64MB in the NuObject case (1K blk size) and 256MB for a 4KB blk size, and we have to handle that case. This will go away if we use the gRPC streaming API, which we probably also need for baseline resync. Will mark it as a TODO for now, as this could become a moot point if we switch to the streaming API.

Will come up with a new test case (in the next PR) with one rreq's remote_blkid addressing 65535 blks, which can serve as replication's limit test.

"retry when new leader start appending log entries",
rreqs.front()->remote_blkid.server_id, e.error());
for (auto const& rreq : rreqs) {
handle_error(rreq, RaftReplService::to_repl_error(e.error()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case that we can error out here but not getting new append entries? e.g temporary network flips? or even a restart before raft timing out (so that leader doesnt change)?

// step-0: do some IO before restart one member;
uint64_t exp_entries = 20;
if (g_helper->replica_num() == 0) {
g_helper->runner().set_num_tasks(20);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : use exp_entries

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data channel and log channel use the same grpc connection actually and if there is any network issue, a new leader should be elected. For the network flip, I also discussed this with Sanel the other day, that we probably should do retry, this will make the logic complicated but if this network flip (network failure, but still leader doesn't change) is a real thing, we don't have choice but to handle it. I will mark it as a TODO also (to avoid comlicating things before we are 100% sure needed).

@yamingk merged commit 2f5f94d into eBay:master on Feb 21, 2024. 20 checks passed.
@yamingk deleted the yk_inc_resync_new branch on February 21, 2024 at 01:53.
Successfully merging this pull request may close these issues:

  [Resync] Incremental resync -- receiving side (remote)
  [Resync] Incremental resync -- leader side