
Error handling on pre-append et al. #361

Merged · 7 commits · Mar 27, 2024

Conversation

hkadayam (Contributor)

Major changes in this PR include:

  • At present, there is no provision for a consumer of replication to prevent a follower
    from appending logs. In the case of linked data, it is possible for followers to reject,
    among other reasons because they do not have enough space. This PR introduces a
    pre-append step and allows consumers to return an error, after which the append is
    retried later. Beyond allocating blocks for indirect data, this PR also ensures that the
    data is pulled before the follower accepts the pre-append, so that there are no
    network-related failures before the log entry is written.

  • When a log entry arrives on the raft channel before the data channel, we currently start
    a timer and wait. Instead, this PR starts a global reaper thread that wakes up
    periodically to look for pending fetches, so the data path simply puts the fetch request
    on a queue. This avoids expensive timer creation in the data path.

  • Added several robustness and error-injection test cases to ensure stability under all
    conditions.

  • These tests exposed quite a few errors in the form of deadlocks between the fetch, wait,
    and grpc client threads. Fixed all of them.

  • Streamlined test_raft_repl_dev so that tests can be written seamlessly.

    auto raft_status = m_state_machine->propose_to_raft(rreq);
    if (raft_status != ReplServiceError::OK) { handle_error(rreq, raft_status); }
} else {
    HS_DBG_ASSERT(false, "Error in writing data");
    handle_error(rreq, ReplServiceError::DRIVE_WRITE_ERROR);
}
});
} else {
Collaborator
Do we have a general idea of what size of IO is worth the data channel? I'm wondering whether small IOs (I don't have a sense of what "small" means here) could be inlined as well. We can find out later in performance testing, but I'm just trying to see what is on top of your mind.

Contributor Author

The more we inline, the less journal space we might have. Since our minimum object size is 1K, I am not sure we would inline anything in HO (maybe a 1K object).

Collaborator

We now encode ShardInfo/PGInfo etc. into the header to avoid going through the data channel. That is fine, but if we inline small IOs it would not be necessary to encode them into the header.

Point taken on journal space, but might inlining be better in terms of latency? Sorry, I don't have a very strong opinion; I just want to hear your thinking.

@@ -405,64 +423,83 @@ repl_req_ptr_t RaftReplDev::applier_create_req(repl_key const& rkey, sisl::blob
// thing. So take state mutex and allocate the blk
std::unique_lock< std::mutex > lg(rreq->state_mtx);
if (rreq->state.load() & uint32_cast(repl_req_state_t::BLK_ALLOCATED)) { return rreq; }

auto hints_result = m_listener->get_blk_alloc_hints(user_header, data_size);
Collaborator

Are we using get_blk_alloc_hints as the pre-append hook?

This might not be enough, as we might want to reject based on business logic: say we get a put_blob but the shard doesn't exist on the replica, etc.

Contributor
@JacksonYao287 JacksonYao287 Mar 27, 2024

Yes. What's more, the pre-append hook should also be called on the leader, since the leader might reject some requests based on some logic (business or otherwise) -- for example, a seal_shard and put_blob conflict.

Maybe we need a change on the nuraft side: add a callback before appending the log (store_log_entry in nuraft).

Contributor Author

@xiaoxichen put_blob before shard create is already covered on the follower, because everything, including put_blob, goes through this one call, get_blk_alloc_hints. We could rename the API to something more meaningful and put more information into it. Let's keep it open and evolve that API.

@JacksonYao287 As long as the leader is always the proposer, I am not sure we have to do anything for the leader. After all, why would the leader propose something it needs to veto?

Collaborator

Actually, we use get_blk_alloc_hints as on_pre_append. Can we make it more generic, similar to on_pre_commit?

// can we have the data here as well?
auto pre_append_result = m_listener->on_pre_append(user_header, data_size);
if (pre_append_result.hasError()) {
    // error handling
}
allocation_hint = pre_append_result.value();

Not blocking; we can revisit this in the next PR.

uint64_t db_num_writes() const { return m_num_commits.load(std::memory_order_relaxed); }
uint64_t db_commit_count() const {
    std::shared_lock lk(db_mtx_);
    return commit_count_;
Contributor

Is there any difference between commit_count_ and m_num_commits?

Contributor Author

You are right; it is redundant. I will remove it.

xiaoxichen
xiaoxichen previously approved these changes Mar 27, 2024

@hkadayam hkadayam dismissed xiaoxichen’s stale review March 27, 2024 17:02

The merge-base changed after approval.

@hkadayam hkadayam merged commit 45c637e into eBay:master Mar 27, 2024
20 checks passed
@hkadayam hkadayam deleted the pre_append_reject branch March 27, 2024 17:43