Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage 'erase_from' does not work #16

Open
ultrabug opened this issue Jul 25, 2017 · 8 comments
Open

storage 'erase_from' does not work #16

ultrabug opened this issue Jul 25, 2017 · 8 comments
Assignees

Comments

@ultrabug
Copy link

Hi

I have been testing multiple times to "corrupt" a log file by removing a random line from it to test the log trimming functionality and it fails to catch up to the leader last log.

https://github.com/zhebrak/raftos/blob/master/raftos/state.py#L413

On my debugging, it seems related to the last_log_index calculation or a sort of mismatch between the leader and the follower indexes but I couldn't find out the fix yet.

Would you be so kind to have a look and tell me if you can reproduce this or if you have an idea on the fix?

Thanks!

@zhebrak zhebrak self-assigned this Jul 25, 2017
@zhebrak
Copy link
Owner

zhebrak commented Jul 25, 2017

Hey, @ultrabug! Raft algorithm does not provide this kind of corruption fixing mechanism.

In your example:

  1. We commit several log entries for the follower.
  2. Followers' last log entry index and term matches correct one from the leader.
  3. We try to send a new update and follower only checks previous entry, not the entire log file.
  4. Everything commits and we have correct current state with mismatching logs in the past.

However, if you remove some log entries from the top or add couple wrong ones, it should fix it.

You can check out Raft paper, page 4 — https://raft.github.io/raft.pdf

Let me know if I'm wrong :)

@ultrabug
Copy link
Author

Thanks for your quick reply @zhebrak.

I know that the corruption fixing is up for the implementation. However I think we have a race condition in raftos erase_from logic.

If the log is small, the bad log from the follower is correctly reset but when the log is large it is not.

@zhebrak
Copy link
Owner

zhebrak commented Jul 25, 2017

Are you sure it's not the last entry or entries that you remove when it's fixing itself correctly?
Follower handles updates synchronously, so I can't see any race conditions here.

@ultrabug
Copy link
Author

Are you sure it's not the last entry or entries that you remove when it's fixing itself correctly?

I'm sorry, I don't understand what you mean ?

@zhebrak
Copy link
Owner

zhebrak commented Jul 25, 2017

Follower only fixes log entries when the last one does not match leaders' index & term. So it's not possible to erase something when current entry checks out. If the log is small the probability of removing exactly the last log entry is higher (with random choice), that's why I asked you if that's the case.

@ultrabug
Copy link
Author

ultrabug commented Jul 25, 2017

Follower only fixes log entries when the last one does not match leaders' index & term

This is not what I understand form the code actually

        # If an existing entry conflicts with a new one (same index but different terms),
        # delete the existing entry and all that follow it
        new_index = data['prev_log_index'] + 1
        try:
            if self.log[new_index]['term'] != data['term'] or (
                self.log.last_log_index != prev_log_index
            ):
                self.log.erase_from(new_index)
        except IndexError:
            pass

Looks to me that it can be triggered at any line of the log when the follower comes back up and is fast-forwarding from the leader. If for a reason line 500 of a 2000 lines log matches this condition, we would remove the last 1501 lines from the log and we should start back from 499.

I guess I'll have to provide a gist to replicate the issue so you can see it happening by yourself.

@zhebrak
Copy link
Owner

zhebrak commented Jul 25, 2017

Yes, I as said before, it fixes all inconsistency until the last followers' index and term match the one coming from the leader. It removes everything after the matching index-term pair.

@ultrabug
Copy link
Author

Ok I have created a reproducible scenario here:

http://ultrabug.fr/github/raftos_gist.tar.gz

To use it after unpacking the archive:

  • use two terminals (two nodes to run)
  • cd raftos_gist
  • on the first terminal run python3 worker.py -i 2
  • on the second terminal run python3 worker.py -i 3

On another terminal you can check the logs of the worker id 3 : logs/127.0.0.1_8003.log

  • Expected result: trimmed log file getting fast forwarded up to 1463 lines (just like logs/127.0.0.1_8002.log)
  • Experienced result: infinite loop on rewriting of the log file

You can also reproduce this by:

  • stopping the workers
  • copying logs/127.0.0.1_8002.log to logs/127.0.0.1_8003.log
  • opening logs/127.0.0.1_8003.log
  • removing a random line from logs/127.0.0.1_8003.log
  • starting both workers again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants