prevent/handle panicked thread/actor #141

iwanbk · 2024-12-11T10:15:31Z

During development, I sometimes get this kind of message:

thread 'main' panicked at zstor/src/actors/....

and then everything fails with this error:

Zstor error: error during waiting for async task completion: Mailbox has closed

Currently it is my own code, still under development, that triggers it, but I remember seeing it before; unfortunately I didn't look into it further at the time.

Considering Murphy's Law:

"Anything that can go wrong will go wrong."

It looks like we could improve this a bit by either preventing it or handling it.

Prevent it: unwrap is one of the main sources of panics, and there are a lot of unwrap calls in the 0-stor-v2 code.
Handle it: I found the Actix Supervisor, https://docs.rs/actix/latest/actix/struct.Supervisor.html, which I need to look into further.
Restart 0-stor.
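To illustrate the "prevent it" option, here is a minimal sketch of replacing an unwrap with error propagation. The function names and the shard-count parsing are hypothetical and not taken from the 0-stor-v2 code; only the standard library is used.

```rust
use std::num::ParseIntError;

/// Panicking variant (hypothetical example): malformed input panics the
/// calling thread instead of producing an error the caller can handle.
fn parse_shard_count_unwrap(raw: &str) -> usize {
    raw.trim().parse::<usize>().unwrap()
}

/// Fallible variant: the parse error is returned to the caller.
fn parse_shard_count(raw: &str) -> Result<usize, ParseIntError> {
    raw.trim().parse::<usize>()
}

/// Callers propagate the error with `?` and can decide in one place
/// how to react (log, retry, return an error to the client, ...).
fn total_shards(data: &str, parity: &str) -> Result<usize, ParseIntError> {
    Ok(parse_shard_count(data)? + parse_shard_count(parity)?)
}
```

With the fallible variant, bad input surfaces as an `Err` value at the call site rather than as a panic inside the thread that runs the actor.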


scottyeager · 2024-12-13T23:32:51Z

I think it would be good to recover in as many cases as possible, using whatever strategy makes the most sense.

But if I understand correctly, zstor continues to run after a panic on the main thread? Even if we maximized our chances of recovering, I think it's better to exit than to stay alive in a non-functional state. Then at least the process manager can restart the process and work can continue.

iwanbk · 2024-12-14T00:35:11Z

> But if I understand correctly, zstor continues to run after a panic on the main thread?

Yes, it should, but the main thread should be very hard to crash, because its work is relatively simple compared to that of the worker threads.

> I think it's better to exit than to stay alive in a non-functional state. Then at least the process manager can restart the process and work can continue.

Agreed.

iwanbk · 2025-01-01T04:59:13Z

I've checked the Actix Supervisor: https://docs.rs/actix/latest/actix/struct.Supervisor.html
Some thoughts:

There is a lot of learning and research to do, because the example in the docs is very simple while our actors are quite complex, with dependencies and restart capabilities.
Even with that much work, bugs are still possible because of the complexity.
There is no real gain compared to simply aborting the whole program and restarting it using zinit or systemd.

So I think the best approach for now is to abort the whole program.
The drawback of aborting is an unclean exit, which should be fine for now because:

the user can retry
open connections to zdb can be cleaned up eventually
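One way to get abort-on-panic behaviour is a process-wide panic hook. The sketch below is only an illustration of that technique, not necessarily what the linked pull request does. The helper names `install_panic_hook` and `abort_on_panic` are hypothetical; `std::panic::take_hook`, `std::panic::set_hook` and `std::process::abort` are standard library functions.

```rust
use std::panic;

/// Installs a panic hook that first runs the previously installed hook
/// (so the usual panic message is still printed) and then calls `on_panic`.
/// The action is a parameter so it can be swapped out, e.g. in tests.
fn install_panic_hook<F>(on_panic: F)
where
    F: Fn() + Send + Sync + 'static,
{
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        previous(info);
        on_panic();
    }));
}

/// Intended to be called early in `main`: a panic on any thread then
/// terminates the whole process, so zinit or systemd can restart it.
fn abort_on_panic() {
    install_panic_hook(|| {
        std::process::abort();
    });
}
```

An alternative that needs no code is setting `panic = "abort"` in the relevant Cargo profile. It has the same trade-off as the hook: the exit is unclean, because aborting skips unwinding and destructors.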

scottyeager · 2025-01-01T23:57:40Z

This sounds good to me. In the context of qsfs, I've also added a script to periodically check for success and retry any failed store operations. Zstor isn't expected to guarantee retries in the face of general failures, like sudden power loss, either.

iwanbk · 2025-01-02T04:45:46Z

> I've also added a script to periodically check for success and retry any failed store operations

How do you do this? By checking the logs periodically?

scottyeager · 2025-01-09T18:30:43Z

> How do you do this? By checking the logs periodically?

You can see the approach here: https://github.com/threefoldtech/quantum-storage/blob/master/lib/retry-uploads.sh

It's using the check command and comparing the returned hash against a locally computed hash.
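The comparison step can be sketched as follows. This is not the linked shell script, only an illustration of the idea in Rust. The function `files_needing_retry` and both closures are hypothetical stand-ins: `stored_hash` represents whatever the check command returns for a file, and `local_hash` represents the locally computed hash. Treating a missing stored hash as a failed store is an assumption.

```rust
/// Returns the files whose stored hash is missing or differs from the
/// locally computed one. Both lookups are passed in as closures so the
/// sketch stays independent of any real CLI or hash algorithm.
fn files_needing_retry<'a, L, R>(
    files: &[&'a str],
    local_hash: L,
    stored_hash: R,
) -> Vec<&'a str>
where
    L: Fn(&str) -> String,
    R: Fn(&str) -> Option<String>,
{
    files
        .iter()
        .copied()
        .filter(|&f| match stored_hash(f) {
            // Stored copy exists: retry only if the hashes disagree.
            Some(stored) => stored != local_hash(f),
            // Assumption: no hash returned means the store never succeeded.
            None => true,
        })
        .collect()
}
```

A periodic job would then call the store operation again for every file in the returned list.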

iwanbk self-assigned this Jan 1, 2025

iwanbk linked a pull request Jan 1, 2025 that will close this issue

feat(panic): abort on panic. #153

Open
