prevent/handle panicked thread/actor #141

Open
iwanbk opened this issue Dec 11, 2024 · 6 comments · May be fixed by #153
@iwanbk
Member

iwanbk commented Dec 11, 2024

During development, I sometimes get this kind of message:

thread 'main' panicked at zstor/src/actors/....

After that, every operation fails with this error:

Zstor error: error during waiting for async task completion: Mailbox has closed

This time the panic comes from my own code that is still under development, but I remember hitting it before and unfortunately didn't look into it more deeply.

Considering Murphy's Law:

"Anything that can go wrong will go wrong."

It looks like we could improve this by either preventing the panic or handling it.
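One plausible reading of the symptom above is that the panicking actor takes its mailbox down with it, so every later message has nowhere to go. The following is a minimal standard-library analogue, not Actix code: a `std::sync::mpsc` channel stands in for the mailbox, and the function name `send_after_worker_panic` is made up for illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawn a worker that owns the receiving end of a channel (a stand-in for an
/// actor's mailbox), let it panic on its first message, and return the result
/// of the next send.
fn send_after_worker_panic() -> Result<(), mpsc::SendError<&'static str>> {
    let (tx, rx) = mpsc::channel::<&'static str>();

    let worker = thread::spawn(move || {
        // The worker takes one message and then panics while handling it.
        let msg = rx.recv().expect("sender is still alive");
        panic!("simulated failure while handling {:?}", msg);
    });

    tx.send("first").expect("worker is still alive");

    // join() returns Err because the worker thread panicked. Unwinding has
    // dropped `rx`, so the channel is now closed.
    assert!(worker.join().is_err());

    // Every later send fails, much like the "Mailbox has closed" error.
    tx.send("second")
}

fn main() {
    match send_after_worker_panic() {
        Ok(()) => println!("unexpected: message was delivered"),
        Err(err) => println!("send failed: {}", err),
    }
}
```

The process itself keeps running after the worker dies, which matches the behaviour described in this issue: nothing crashes visibly, but no request can be served any more.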

@scottyeager

I think it would be good to recover in as many cases as possible, using whatever strategy makes the most sense.

But if I understand correctly zstor continues to run after a panic on the main thread? Even if we maximized our chances of recovering, I think it's better to exit than to stay alive in a non functional state. Then at least the process manager can restart the process and work can continue.
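The two ideas in this comment (recover where possible, otherwise exit so the process manager can restart the process) can be sketched with plain threads. This is only an illustration under assumptions: `supervise`, `max_restarts`, and the flaky job are hypothetical names, and zstor's actors are not structured this way.

```rust
use std::thread;

/// Run `job` on its own thread and restart it if it panics, up to
/// `max_restarts` times. Returns the number of restarts that were needed, or
/// Err(()) if the job panicked on every attempt.
fn supervise<F>(max_restarts: u32, job: F) -> Result<u32, ()>
where
    F: Fn() + Send + Clone + 'static,
{
    for restarts in 0..=max_restarts {
        let attempt = job.clone();
        // join() returns Err if the thread panicked.
        if thread::spawn(attempt).join().is_ok() {
            return Ok(restarts);
        }
        eprintln!("worker panicked (attempt {})", restarts + 1);
    }
    Err(())
}

fn main() {
    use std::sync::atomic::{AtomicU32, Ordering};
    use std::sync::Arc;

    // Hypothetical flaky job: panics on its first two runs, then succeeds.
    let runs = Arc::new(AtomicU32::new(0));
    let counter = Arc::clone(&runs);
    let job = move || {
        if counter.fetch_add(1, Ordering::SeqCst) < 2 {
            panic!("simulated worker failure");
        }
    };

    match supervise(3, job) {
        Ok(restarts) => println!("worker finished after {} restart(s)", restarts),
        Err(()) => {
            // Recovery failed: exit with a non-zero status so that the
            // process manager can restart the whole process.
            eprintln!("worker keeps panicking, giving up");
            std::process::exit(1);
        }
    }
}
```

Bounding the number of restarts keeps a permanently broken worker from looping forever, and the non-zero exit hands the problem to the process manager instead of leaving a non-functional process alive.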

@iwanbk
Member Author

iwanbk commented Dec 14, 2024

> But if I understand correctly zstor continues to run after a panic on the main thread?

Yes, it should, but the main thread should be very hard to crash, because its work is relatively simple compared to that of the worker threads.

> I think it's better to exit than to stay alive in a non functional state. Then at least the process manager can restart the process and work can continue.

I agree with this.

@iwanbk iwanbk self-assigned this Jan 1, 2025
@iwanbk
Member Author

iwanbk commented Jan 1, 2025

I've checked the Actix supervisor https://docs.rs/actix/latest/actix/struct.Supervisor.html
Some thoughts:

  • There is a lot of learning and research to do, because the example in the docs is very simple while our actors are quite complex, with dependencies and restart capabilities
  • Even after that much work, bugs are still possible because of the complexity
  • There is no real gain compared to simply aborting the whole program and restarting it using zinit or systemd

So I think the best approach for now is to abort the whole program.
The drawback of aborting is an unclean exit, which should be fine for now because:

  • the user could retry
  • open connections to zdb could be cleaned up eventually
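As a rough sketch of the "abort the whole program" option, a process-wide panic hook can turn a panic on any thread into a process exit. This is not necessarily what the linked pull request does; `install_panic_hook` and `abort_on_any_panic` are hypothetical helper names, and the demo in `main` uses a harmless callback so that it runs to completion.

```rust
use std::panic;
use std::process;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Install a process-wide panic hook that first runs the previous hook (so the
/// usual panic message is still printed) and then calls `on_panic`.
fn install_panic_hook<F>(on_panic: F)
where
    F: Fn() + Send + Sync + 'static,
{
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        previous(info);
        on_panic();
    }));
}

/// What a daemon could call once at startup: a panic on any thread then
/// terminates the process, and zinit or systemd can restart it.
#[allow(dead_code)]
fn abort_on_any_panic() {
    install_panic_hook(|| process::abort());
}

// Set by the demo callback below.
static HOOK_RAN: AtomicBool = AtomicBool::new(false);

fn main() {
    // Demo only: record that the hook ran instead of aborting.
    install_panic_hook(|| HOOK_RAN.store(true, Ordering::SeqCst));

    let worker = thread::spawn(|| {
        panic!("simulated actor panic");
    });

    // The hook runs on the panicking thread before join() returns.
    assert!(worker.join().is_err());
    assert!(HOOK_RAN.load(Ordering::SeqCst));
    println!("panic hook ran on the worker thread");
}
```

Another route to a similar effect is setting `panic = "abort"` in the Cargo profile, which aborts the process on any panic without custom code. Either way destructors are skipped, which is the unclean exit mentioned above.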

@iwanbk iwanbk linked a pull request Jan 1, 2025 that will close this issue
@scottyeager

This sounds good to me. In the context of qsfs, I've also added a script to periodically check for success and retry any failed store operations. Zstor isn't expected to guarantee retries in the face of general failures, like sudden power loss, either.

@iwanbk
Member Author

iwanbk commented Jan 2, 2025

> I've also added a script to periodically check for success and retry any failed store operations

How do you do this? By checking the logs periodically?

@scottyeager

> How do you do this? By checking the logs periodically?

You can see the approach here: https://github.com/threefoldtech/quantum-storage/blob/master/lib/retry-uploads.sh

It's using the check command and comparing the returned hash against a locally computed hash.
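The comparison described here can be sketched as follows. The linked script is a shell script; this Rust version only illustrates the decision logic. The function `files_to_retry` is a hypothetical name, and both hash sources are passed in as closures because the exact invocation and output format of the check command are not shown in this thread.

```rust
use std::collections::HashMap;

/// Return the paths whose stored hash is missing or differs from the locally
/// computed one. `local_hash` and `stored_hash` are supplied by the caller.
fn files_to_retry<L, R>(paths: &[&str], local_hash: L, stored_hash: R) -> Vec<String>
where
    L: Fn(&str) -> String,
    R: Fn(&str) -> Option<String>,
{
    let mut retry = Vec::new();
    for &path in paths {
        // A missing hash (None) or a mismatch both mean the store did not succeed.
        if stored_hash(path) != Some(local_hash(path)) {
            retry.push(path.to_string());
        }
    }
    retry
}

fn main() {
    // Toy stand-ins: a fake "hash" and a map playing the role of the results
    // reported by the check command. Real code would hash file contents.
    let local = |path: &str| format!("hash-of-{}", path);

    let mut reported: HashMap<String, String> = HashMap::new();
    reported.insert("a.dat".to_string(), "hash-of-a.dat".to_string());
    reported.insert("b.dat".to_string(), "stale-hash".to_string());
    let stored = |path: &str| reported.get(path).cloned();

    let retry = files_to_retry(&["a.dat", "b.dat", "c.dat"], local, stored);
    println!("files to retry: {:?}", retry);
}
```

Checking the stored result rather than the logs means the retry logic does not depend on log formats and still works if the process was killed before it could log anything.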


2 participants