Error handling on pre-append et al. #361
Conversation
Major changes in this PR include:
* At present there is no provision for the consumer of replication to prevent a follower from appending logs. In the case of linked data, it is possible for followers to reject, e.g. because they do not have enough space. This PR introduces a pre-append step and allows consumers to return an error, which is then retried later. Beyond block allocation for indirect data, this PR also ensures that data is pulled before pre-append is accepted on the follower, so there are no network-related failures before the log entry is written.
* If a log entry arrives on the raft channel before its data arrives on the data channel, we currently start a timer and wait. Instead, this PR starts a global reaper thread that wakes up periodically to look for pending fetches, so the data path simply puts the fetch request on a queue. This avoids expensive timer creation in the data path.
* Added several robustness and error-injection test cases to ensure stability under all conditions.
* These tests exposed quite a few deadlocks between the fetch, wait, and grpc client threads. Fixed all of them.
* Streamlined test_raft_repl_dev so new tests can be written seamlessly.
        auto raft_status = m_state_machine->propose_to_raft(rreq);
        if (raft_status != ReplServiceError::OK) { handle_error(rreq, raft_status); }
    } else {
        HS_DBG_ASSERT(false, "Error in writing data");
        handle_error(rreq, ReplServiceError::DRIVE_WRITE_ERROR);
    }
});
} else {
Do we have a general idea what size of IO is worth the data channel? Wondering whether small IO (I don't have an idea of what "small" means here) could be inlined as well. We can find out later in performance tests, but just trying to see what is on top of your mind.
The more we inline, the less journal space we might have. Since our minimum object size is 1K, I am not sure we would inline anything in HO (maybe the 1K objects).
We now encode ShardInfo/PGInfo etc. into the header to avoid going through the data channel. That is fine, but if we inline small IO, there is no need to encode it into the header.
Point taken on less journal space, but it might be better in terms of latency? Sorry, I don't have a very strong opinion, just want to hear your thinking.
@@ -405,64 +423,83 @@ repl_req_ptr_t RaftReplDev::applier_create_req(repl_key const& rkey, sisl::blob
    // thing. So take state mutex and allocate the blk
    std::unique_lock< std::mutex > lg(rreq->state_mtx);
    if (rreq->state.load() & uint32_cast(repl_req_state_t::BLK_ALLOCATED)) { return rreq; }

    auto hints_result = m_listener->get_blk_alloc_hints(user_header, data_size);
Are we using get_blk_alloc_hints as the pre-append hook?
This might not be enough, as we might want to reject based on business logic, say if we get a put_blob but the shard doesn't exist on the replica, etc.
Yes. What's more, pre-append should also be called on the leader, since the leader might reject some requests based on some logic (business or not), for example a seal_shard and put_blob conflict.
Maybe we need some change on the nuraft side, adding a callback before appending the log (store_log_entry in nuraft).
@xiaoxichen put_blob before shard create is already covered on the follower, because everything goes through this one call to get_blk_alloc_hints, even put_blob. We could rename the API to something more meaningful and put more info on it. Let's keep it open and evolve that API.
@JacksonYao287 As long as the leader is always the proposer, I am not sure we have to do anything for the leader. After all, why would the leader propose something it needs to veto?
Actually, we use get_blk_alloc_hints as on_pre_append. Can we make it more generic, similar to on_pre_commit? Something like:
// can we have data here as well?
auto pre_append_result = m_listener->on_pre_append(user_header, data_size);
if (pre_append_result.hasError()) {
    // error handling
}
allocation_hint = pre_append_result.value();
Not blocking, we can revisit this in the next PR.
uint64_t db_num_writes() const { return m_num_commits.load(std::memory_order_relaxed); }
uint64_t db_commit_count() const {
    std::shared_lock lk(db_mtx_);
    return commit_count_;
Is there any difference between commit_count_ and m_num_commits?
You are right, it is redundant, will remove that.