Parallelization on multiple nodes on a HPC-cluster #2179
🚀 Feature Request
I have access to an HPC cluster, and we want to use Ax/BoTorch to optimize the hyperparameters of neural networks. Currently we use HyperOpt, which lets worker processes on different nodes communicate with a single MongoDB server. That server coordinates which hyperparameter configurations should be tried and which have already been tried, and it stores the results.
We have hundreds of nodes. With HyperOpt, we can run a worker on each of them, let it train a neural network with a given parameter configuration, and then use the result to find further promising points. We'd love to do something similar with Ax/BoTorch, but I just cannot get it to work.
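For reference, the coordination pattern described above (one shared store, many workers) can be sketched roughly as follows. This is only an illustration under assumptions: it uses Python's sqlite3 as a stand-in for the MongoDB server and random search as a stand-in for the model, and the table layout and function names (`init_store`, `suggest`, `claim_trial`, `report`, `best`, `toy_loss`) are hypothetical, not HyperOpt's or Ax's API.

```python
import math
import random
import sqlite3


def init_store(conn):
    """Create the shared trial table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trials ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "lr REAL NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "loss REAL)"
    )
    conn.commit()


def suggest(conn, rng):
    """Coordinator side: enqueue one configuration (random search stand-in)."""
    lr = 10 ** rng.uniform(-5, -1)
    cur = conn.execute("INSERT INTO trials (lr) VALUES (?)", (lr,))
    conn.commit()
    return cur.lastrowid


def claim_trial(conn):
    """Worker side: claim the oldest pending trial, or return None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, lr FROM trials WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the claim a compare-and-set: if another
        # worker took the row in the meantime, rowcount is 0.
        cur = conn.execute(
            "UPDATE trials SET status = 'running' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        if cur.rowcount == 0:
            return None
    return row


def report(conn, trial_id, loss):
    """Worker side: store the result of a finished trial."""
    with conn:
        conn.execute(
            "UPDATE trials SET status = 'done', loss = ? WHERE id = ?",
            (loss, trial_id),
        )


def best(conn):
    """Return (id, lr, loss) of the best finished trial, or None."""
    return conn.execute(
        "SELECT id, lr, loss FROM trials WHERE status = 'done' "
        "ORDER BY loss LIMIT 1"
    ).fetchone()


def toy_loss(lr):
    """Placeholder for 'train a network and return the validation loss'."""
    return (math.log10(lr) + 3.0) ** 2


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_store(conn)
    rng = random.Random(0)
    for _ in range(5):
        suggest(conn, rng)
    while (trial := claim_trial(conn)) is not None:
        trial_id, lr = trial
        report(conn, trial_id, toy_loss(lr))
    print(best(conn))
```

In a real multi-node setup the store would more likely be a database server, since SQLite's file locking is known to be unreliable on some network filesystems.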
I've tried using multiprocessing, as suggested in facebook/Ax#896, but it didn't work for me. Depending on the code I tried, I got many different error messages, such as "DataRequiredError: All trials for current model have been generated, but not enough data has been observed to fit next model. Try again when more data are available.", and many more, too many to fit them all here.
I've been looking through the documentation and thought I might use OptimizationLoop in Ax together with run_async to create a temporary file that a worker can work on and return when done. It turned out, however, that the only thing this option does is trigger an assertion:
assert not run_async, "OptimizationLoop does not yet support async."
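To make the temporary-file idea concrete, here is a minimal sketch of the kind of hand-off I have in mind, written with the standard library only. Nothing in it is an Ax or BoTorch API: the directory layout, the file suffixes and the function names (`submit`, `claim_file`, `finish`, `collect`) are all hypothetical, and it assumes the queue directory lives on a filesystem where renames are atomic.

```python
import json
import os
import tempfile
from pathlib import Path


def submit(queue_dir, trial_id, params):
    """Coordinator: publish a trial file (write to a temp name, then rename)."""
    queue_dir = Path(queue_dir)
    tmp = queue_dir / f".{trial_id}.tmp"
    tmp.write_text(json.dumps({"trial_id": trial_id, "params": params}))
    os.replace(tmp, queue_dir / f"{trial_id}.pending.json")


def claim_file(queue_dir, worker_id):
    """Worker: claim a pending trial by renaming its file, or return None."""
    for path in sorted(Path(queue_dir).glob("*.pending.json")):
        new_name = path.name.replace(
            ".pending.json", f".running.{worker_id}.json"
        )
        claimed = path.with_name(new_name)
        try:
            os.rename(path, claimed)
        except FileNotFoundError:
            continue  # another worker renamed this file first
        return claimed, json.loads(claimed.read_text())
    return None


def finish(claimed_path, trial, loss):
    """Worker: publish the result and remove the claimed file."""
    trial_id = trial["trial_id"]
    tmp = claimed_path.with_name(f".{trial_id}.done.tmp")
    tmp.write_text(json.dumps({**trial, "loss": loss}))
    os.replace(tmp, claimed_path.with_name(f"{trial_id}.done.json"))
    claimed_path.unlink()


def collect(queue_dir):
    """Coordinator: read all finished trials."""
    done_files = sorted(Path(queue_dir).glob("*.done.json"))
    return [json.loads(p.read_text()) for p in done_files]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as queue_dir:
        submit(queue_dir, 0, {"lr": 0.01})
        submit(queue_dir, 1, {"lr": 0.1})
        while (job := claim_file(queue_dir, "worker0")) is not None:
            path, trial = job
            # Placeholder for the real training run.
            finish(path, trial, loss=abs(trial["params"]["lr"] - 0.05))
        print(collect(queue_dir))
```

What is missing here is the part I cannot figure out: letting Ax or BoTorch propose the parameters passed to `submit` and consume the results returned by `collect`.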
Is there any example of how I could do that? I'd prefer BoTorch because, if I understood correctly, it offers a more abstract interface. But as long as it's possible, if someone here tells me "use Ax, it's easy with that", I'll do that as well.
In short, again, what I have and what I want:
Is there any option, or an example I was not able to find, that shows how to do that? I'd be really happy if someone could point me to a (very simple) example of how something like this could be achieved.