
sync_flush should wait for is_completed, not is_active. #461

Closed
wants to merge 4 commits into from

Conversation

xiaoxichen
Collaborator

A log entry is added into the StreamTracker in write_async, through m_records.create(). Once created, its status is active.

Later, in HomeLogStore::on_write_completion, the tracker is updated, and in m_records.update, is_completed is set to true.

We should wait for is_completed == true to indicate that a log entry has been persisted.

@xiaoxichen
Collaborator Author

I don't know how much of this code will be removed by removing async. @JacksonYao287

@@ -405,7 +408,7 @@ void HomeLogStore::flush_sync(logstore_seq_num_t upto_seq_num) {
     if (upto_seq_num == invalid_lsn()) { upto_seq_num = m_records.active_upto(); }

     // if we have flushed already, we are done
-    if (!m_records.status(upto_seq_num).is_active) { return; }
+    if (m_records.status(upto_seq_num).is_completed) { return; }
Contributor

This status is a tri-state check.

  1. Inactive: We have not received this lsn at all
  2. Active: Received LSN, yet to flush
  3. Completed: Completed flush.

So we only need to flush if it is active. This change would flush when it is inactive as well, so I don't think we should make this change.

Collaborator Author

I get the point that we only need to flush active entries; that makes sense.
However, I am not seeing how is_active == false implies the log has been flushed. Do we reset the bit in m_active_slot_bits when doing the update? I didn't find that part of the code in SISL.

@@ -416,13 +419,13 @@ void HomeLogStore::flush_sync(logstore_seq_num_t upto_seq_num) {

     // Step 2: After marking this lsn, we again do a check, to avoid a race where completion checked for no lsn
     // and the lsn is stored in step 1 above.
-    if (!m_records.status(upto_seq_num).is_active) { return; }
+    if (m_records.status(upto_seq_num).is_completed) { return; }
Contributor

Same as above.


     // Step 3: Force a flush (with least threshold)
     m_logdev->flush_if_needed(1);

     // Step 4: Wait for completion
-    m_sync_flush_cv.wait(lk, [this, upto_seq_num] { return !m_records.status(upto_seq_num).is_active; });
+    m_sync_flush_cv.wait_for(lk, std::chrono::milliseconds(10), [this, upto_seq_num] { return m_records.status(upto_seq_num).is_completed; });
Contributor

I am assuming this is temporary, because the caller expects the data to be persisted by the end of it. On timeout, should we either return an error or assert? In any case, after the new flush changes we probably won't need this cv, so at that point it is moot.

Collaborator Author

Yes, this is temporary, to work around the race we had seen. It is not correct in general, but it ensures we don't get stuck on this CV indefinitely due to a concurrency issue.

@xiaoxichen xiaoxichen force-pushed the fix_log branch 2 times, most recently from 542fb2b to cab0a0f Compare July 15, 2024 08:15
The previous code would add `ndevices` simulated drives into the device list,
which made each replica run with one real drive plus `ndevices` simulated
drives.

Those simulated drives are identified as FAST, and meta/log were placed on them.
Due to the very limited size of the simulated drives, we can hit the size limit
in long-running tests.

Fix by honoring the input from hs_repl_test_common.

After this fix, if only one drive is passed in for a replica of
test_raft_repl_dev, that drive will be used as FAST. All services
will be started on that real drive.

Signed-off-by: Xiaoxi Chen <[email protected]>
@xiaoxichen xiaoxichen closed this Jul 17, 2024
@xiaoxichen xiaoxichen mentioned this pull request Jul 23, 2024