Error handling on pre-append et al. #361
Conversation
Major changes in this PR include:
* At present there is no provision for the consumer of replication to prevent a follower from appending logs. In the case of linked data, it is possible for followers to reject, e.g. because they do not have enough space. This PR introduces a pre-append step and allows consumers to return an error, which is then retried later. Beyond block allocation for indirect data, this PR also ensures that data is pulled before pre-append is accepted on the follower, so there are no network-related failures before the log entry is written.
* If a log entry arrives on the raft channel before its data arrives on the data channel, we currently start a timer and wait. Instead, this PR starts a global reaper thread that wakes up periodically to look for pending fetches, so the data path simply puts the fetch request on a queue. This avoids expensive timer creation in the data path.
* Added several robustness and error-injection test cases to ensure stability under all conditions.
* These tests exposed quite a few deadlocks between the fetch, wait, and grpc client threads. Fixed all of them.
* Streamlined test_raft_repl_dev so new tests can be written seamlessly.
        auto raft_status = m_state_machine->propose_to_raft(rreq);
        if (raft_status != ReplServiceError::OK) { handle_error(rreq, raft_status); }
    } else {
        HS_DBG_ASSERT(false, "Error in writing data");
        handle_error(rreq, ReplServiceError::DRIVE_WRITE_ERROR);
    }
});
} else {
Do we have a general idea what size of IO is worth the data channel? Wondering whether small IO (I don't have an idea of what "small" means here) could be inlined as well. We can find out later in performance tests, but just trying to see what is on top of your mind.
The more we inline, the less journal space we might have. Since our minimum object size is 1K, I am not sure we would inline anything in HO (maybe the 1K objects).
We now encode ShardInfo/PGInfo etc. into the header to avoid going through the data channel. That is fine, but if we inline small IO, there is no need to encode it into the header.
Point taken on less journal space, but it might be better in terms of latency? Sorry, I don't have a very strong opinion, just want to hear your thinking.
@@ -405,64 +423,83 @@ repl_req_ptr_t RaftReplDev::applier_create_req(repl_key const& rkey, sisl::blob
    // thing. So take state mutex and allocate the blk
    std::unique_lock< std::mutex > lg(rreq->state_mtx);
    if (rreq->state.load() & uint32_cast(repl_req_state_t::BLK_ALLOCATED)) { return rreq; }

    auto hints_result = m_listener->get_blk_alloc_hints(user_header, data_size);
Are we using get_blk_alloc_hints as the pre-append hook?
This might not be enough, as we might want to reject based on business logic, say if we get a put_blob but the shard doesn't exist on the replica, etc.
Yes. What's more, pre-append should also be called on the leader, since the leader might reject some requests based on some logic (business or not), for example a seal_shard and put_blob conflict.
Maybe we need some change on the nuraft side, adding a callback before appending the log (store_log_entry in nuraft).
@xiaoxichen put_blob before shard create is already covered on the follower, because everything goes through this one call to get_blk_alloc_hints, even put_blob. We could rename the API to something more meaningful and put more info on it. Let's keep it open and evolve that API.
@JacksonYao287 As long as the leader is always the proposer, I am not sure we have to do anything for the leader. After all, why would the leader propose something it needs to veto?
Actually, we use get_blk_alloc_hints as on_pre_append. Can we make it more generic, similar to on_pre_commit? Something like:
// can we have data here as well?
auto pre_append_result = m_listener->on_pre_append(user_header, data_size);
if (pre_append_result.hasError()) {
    // error handling
}
allocation_hint = pre_append_result.value();
Not blocking, we can revisit this in the next PR.
uint64_t db_num_writes() const { return m_num_commits.load(std::memory_order_relaxed); }
uint64_t db_commit_count() const {
    std::shared_lock lk(db_mtx_);
    return commit_count_;
Is there any difference between commit_count_ and m_num_commits?
You are right, it is redundant, will remove that.