Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index crash recovery test cases and fixes #458

Merged
merged 8 commits into from
Aug 2, 2024

Conversation

hkadayam
Copy link
Contributor

  • Fixed the case where if cp is not flushed and homestore crashed, restart would not init certain key parameters. This is fixed by taking cp everytime we do formatting and then we persist this information to homestore first_blk.

  • Fixed the crash simulator issue where restart doesn't initialize existing HSTestHelper instances

  • Several fixes in IndexTable to make sure we can restart and reapply the logs

@@ -109,7 +109,7 @@ class AppendBlkAllocator : public BlkAllocator {
std::string to_string() const override;

void cp_flush(CP* cp) override;

void recovery_completed() override {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to be explicit notification that recovery is completed, so that it can rebuild the caches (for certain blk allocators)

return BlkAllocStatus::SUCCESS;
}

void FixedBlkAllocator::recovery_completed() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the recovery processing from first alloc to the explicit call recovery_completed()

~CrashSimulator() = default;

void crash() {
if (m_restart_cb) {
m_crashed.update([](auto* s) { *s = true; });

// We can restart on a new thread to allow other operations to continue
std::thread t([this]() { m_restart_cb(); });
std::thread t([cb = std::move(m_restart_cb)]() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not access this pointer (instead move the cb out of this variable)

@@ -149,6 +149,7 @@ class DeviceManager {

bool is_first_time_boot() const { return m_first_time_boot; }
void format_devices();
void commit_formatting();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yamingk The files device.h, device_manager.cpp, hs_super_blk.h and homestore.cpp are relevant changes regarding taking cp on first_time_boot related changes. Please review these.

}

first_block* fblk = r_cast< first_block* >(buf);
fblk->formatting_done = 0x1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of there are three pdevs, and 1 got persisted with formatting_done set to 0x1, before 2 other pdevs persist their superblock, we crash.

On reboot, what we want is we should return need_format true in homestore::start().
However, on reboot in above case, in DeviceManager::DeviceManager, when we detect 1st pdev is_valid() is true (its formatting_done already set to 0x1), we will set m_first_time_boot as false, which will return need_format as false, which is not expected, right?

We want to change DeviceManager::DeviceManager to only return m_first_time_boot as true when all pdevs is_valid() return true, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be fixed globally and decide strategically, not just for formatting, but in general its a valid device or not. Currently we need superblk from only one device to be valid and if they are, we proceed. We can choose to do it only if all (but we need to worry about new device insertion). I would just use is_valid() and let the DeviceManager::DeviceManager make a decision.

hkadayam and others added 2 commits July 31, 2024 14:44
* Fixed the case where if cp is not flushed and homestore crashed, restart
would not init certain key parameters. This is fixed by taking cp everytime
we do formatting and then we persist this information to homestore first_blk.

* Fixed the crash simulator issue where restart doesn't initialize existing
HSTestHelper instances

* Several fixes in IndexTable to make sure we can restart and reapply the logs
Add visualize
change diff and some print
@hkadayam hkadayam force-pushed the test_index_crash_recovery branch 2 times, most recently from 2afe7f8 to 12d9398 Compare August 2, 2024 01:46
@hkadayam hkadayam force-pushed the test_index_crash_recovery branch from 12d9398 to ed6bd0d Compare August 2, 2024 01:47
@hkadayam hkadayam force-pushed the test_index_crash_recovery branch from dd9c2da to e255e13 Compare August 2, 2024 05:33
hkadayam and others added 3 commits August 1, 2024 22:34
* Fixed the case where if cp is not flushed and homestore crashed, restart
would not init certain key parameters. This is fixed by taking cp everytime
we do formatting and then we persist this information to homestore first_blk.

* Fixed the crash simulator issue where restart doesn't initialize existing
HSTestHelper instances

* Several fixes in IndexTable to make sure we can restart and reapply the logs

if (buf->m_up_buffer != real_up_buf) {
real_up_buf->m_wait_for_down_buffers.increment(1);
buf->m_up_buffer = real_up_buf;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly

fmt::format_to(std::back_inserter(str), " [EMPTY] ");
return str;
// return "";
// #if 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please return it backif 0 to not break homeobject . I will fix it soon to work for it too, soon

@hkadayam hkadayam merged commit 104e54e into master Aug 2, 2024
21 checks passed
@hkadayam hkadayam deleted the test_index_crash_recovery branch August 2, 2024 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants