Describe the bug
When using hivemind.moe.Server to host experts in a background thread, bootstrapping fails repeatedly, leading to a complete deadlock. I am forced to restart my application, sometimes upwards of 10 times, before a single bootstrap succeeds.
As you can imagine, this is very frustrating - especially during development, when I need to iterate quickly.
If there is a better way to host experts, or if you could tell me how to resolve this problem - I would GREATLY appreciate it!
To Reproduce
Run this script a few times in a row. It will ALWAYS fail, eventually:
So, it appears that using Server.create() to launch experts is much more reliable. While I can probably work with this method, I remember now why I moved away from it in the first place:
Server.create() assumes that experts accept only hidden_dim as an argument during initialization:
expert=name_to_block[expert_cls](hidden_dim)
I find this annoying, but maybe I'll find a clean way to work around the limitation.
Okay, yeah - inheriting from Server and overriding the create() method seems like a decent, not terrible solution for my needs. The other code is still buggy, but perhaps we were never supposed to use it that way. Will close this issue for now.
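For readers who want to see the shape of that workaround, here is a minimal sketch of the override pattern. It deliberately does not use hivemind: BaseServer, CustomServer, and ToyExpert are hypothetical stand-ins, and the registry only mirrors the name_to_block lookup quoted above. The real Server.create() takes many more arguments, so treat this as an illustration of the idea rather than a drop-in replacement.

```python
from typing import Any, Callable, Dict, List


class ToyExpert:
    """Hypothetical expert whose constructor needs more than hidden_dim."""

    def __init__(self, hidden_dim: int, dropout: float = 0.0):
        self.hidden_dim = hidden_dim
        self.dropout = dropout


# Stand-in registry, modeled on the name_to_block lookup quoted above.
name_to_block: Dict[str, Callable[..., Any]] = {"toy": ToyExpert}


class BaseServer:
    """Toy stand-in for a server whose create() forwards only hidden_dim."""

    def __init__(self, experts: Dict[str, Any]):
        self.experts = experts

    @classmethod
    def create(cls, expert_cls: str, hidden_dim: int, expert_uids: List[str]) -> "BaseServer":
        experts = {uid: name_to_block[expert_cls](hidden_dim) for uid in expert_uids}
        return cls(experts)


class CustomServer(BaseServer):
    """Overrides create() so extra keyword arguments reach each expert."""

    @classmethod
    def create(
        cls, expert_cls: str, hidden_dim: int, expert_uids: List[str], **expert_kwargs: Any
    ) -> "CustomServer":
        experts = {
            uid: name_to_block[expert_cls](hidden_dim, **expert_kwargs)
            for uid in expert_uids
        }
        return cls(experts)


server = CustomServer.create(
    "toy", hidden_dim=64, expert_uids=["expert.0", "expert.1"], dropout=0.1
)
```

The override changes only how each expert is constructed; everything else can stay with the parent class, which keeps the subclass small.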
I spoke too soon. The bootstrapping problem returned, even when using my customized Server.create() method. However, I think I've found the source of the problem.
For whatever reason, this piece of code is REQUIRED to prevent issues during bootstrapping:
You don't actually have to use this data. The mere act of calling dht.get_visible_maddrs() from within the server creation thread is enough to prevent bootstrapping issues. Without this code, bootstrap can fail up to 90% of the time.
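A minimal sketch of that workaround, assuming the DHT object exposes get_visible_maddrs() as described above. The helper name start_server_in_thread and the build_server callback are hypothetical and only there for illustration. The DHT is passed in as a parameter so the sketch runs without hivemind installed.

```python
import threading
from typing import Any, Callable, Dict, Tuple


def start_server_in_thread(
    dht: Any, build_server: Callable[[Any], Any]
) -> Tuple[threading.Thread, Dict[str, Any]]:
    """Build a server in a background thread, querying the DHT's visible addresses first."""
    result: Dict[str, Any] = {}

    def _target() -> None:
        # Workaround reported in this thread: call get_visible_maddrs() from the
        # same thread that creates the server. The return value is not needed.
        result["maddrs"] = [str(addr) for addr in dht.get_visible_maddrs()]
        # build_server is a hypothetical callback that constructs the server.
        result["server"] = build_server(dht)

    thread = threading.Thread(target=_target, daemon=True)
    thread.start()
    return thread, result
```

In a real setup, dht would presumably be a started hivemind.DHT instance and build_server would construct the Server from it. Why the call changes bootstrap behaviour is not established here, so this is a stopgap and not a fix.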
Clearly, this is a bug - and one that should be somewhat easy to fix.