Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid workers waiting for training of surrogate model to finish #1170

Open
bbudescu opened this issue Nov 23, 2024 · 18 comments
Open

Avoid workers waiting for training of surrogate model to finish #1170

bbudescu opened this issue Nov 23, 2024 · 18 comments
Assignees

Comments

@bbudescu
Copy link

bbudescu commented Nov 23, 2024

Motivation

Here's how a cpu load graph looks like for a multi-objective optimization session using the multi-fidelity facade that ran for about 46 hours an a 64 core machine (no hyperthreading, n_workers=64), finishing almost 20k trials on a bit over 16k distinct configurations (two rungs).

Screenshot 2024-11-23 at 11-59-52 Instances EC2 eu-west-1

One can see that cpu utilizations decreases to less than 50% after the first 12 hours. It then drops to under 40% after another 10 hours (by this time, 12.6k trials were finished in total).

Previous Discussion

I thought that another cause for this degradation in performance might be Hyperband, and thought that using ASHA (#1169) instead would help eliminate that hypothesis, however, after @eddiebergman's #1169 (comment), I understand the problem is caused by workers waiting to get another suggestion from the surrogate model

Potential solution

  • train the random forest in a different thread / process
  • replace a version of the RF with a newly trained one only when the training is done (similar to double buffering)
  • workers should always get configs using the currently available RF, even if a new RF is training in the background
  • optionally: use an occupancy threshold, e.g., 90%, and allow worker threads to wait for training to finish only if the percentage of workers that are idle waiting for the new RF version is below 10%.
  • optionally: add gpu support to accelerate training of Random Forest
  • optionally: perhaps add the option to decrease the number of workers running the target function by 1 once the RF trainer occupies a CPU core for more than 50% of time
@dengdifan
Copy link
Contributor

Hi @bbudescu,
Thanks for raising this issue and providing potential solutions. As far as I can see, the potential solutions that you proposed are mostly related to the RF. We are also planning to replace the current RF packages: #1116

However, the new RF model might still rely on the existing packages (e.g., sklearn). Since we are only a small team maintaining SMAC, we might not have enough manpower to customize a new random forest package. However, if you have any good ideas on how to implement this efficiently (in Python) or you would like to create a new PR regarding the RF replacement stuff. We are happy to help you (on how to integrate this into SMAC).

@bbudescu
Copy link
Author

Hi @dengdifan,

Thanks for your reply above. Yes, I am aware of the #1116, however, the main point I was trying to get across in this issue was making the ask operation asynchronous w.r.t. RF model training, i.e., being able to query an existing RF model for a new configuration to try anytime (even if the model is lagging behind, i.e., it hasn't been updated with the absolute latest results), rather than being forced to wait until training finishes, because it's better to use the cpu cores rather than to keep them unoccupied more than half of the time. I mean, even if it's not looking in the best places, it still explores the config space, and that's preferable to not doing anything at all.

Now, I haven't looked into the code, but I assume this doesn't have anything to do with the choice of random forest package, but just with running RF training in a separate thread. Perhaps one of the optional suggestions I made, namely, the one where I suggested adding GPU support, might be relevant to the choice of RF implementation.

@dengdifan
Copy link
Contributor

Hi @bbudescu
Thanks for the clarification. However, first, because of global interpreter lock (GIL), it is not easy to implement multi-threading in pure python environment, if you have any idea how to do this efficiently, a PR is welcomed.

Seond, SMAC is implemented based on BO, an algorithm designed for optimizing expensive black-box functions, such as training a deep neural network. Usually, the expense of evaluating a function is much more significant than training the surrogate model. It is still unclear if directly starting a new configuration is better than waiting until the current surrogate model provides a new candidate.

Additionally, making ask operation asynchronous is equivalent to running one core with BO and the other cores with random search (since the probability that two runs end at exactly the same time is 0, meaning that once one core (core A) that runs a random configuration is free, the core that runs BO (core B) should still be occupied, and this core (core A) will continue a random sampling since B cannot provide A any useful information). In that case, you could set the argument probability argument in function get_random_design as (num_cores-1)/(num_cores): https://github.com/automl/SMAC3/blob/main/smac/facade/hyperparameter_optimization_facade.py#L172

@bbudescu
Copy link
Author

bbudescu commented Dec 5, 2024

Hi @dengdifan

a PR is welcomed.

I've been working on something, and I just added #1178 to share it here. It's not yet tested, but it's a start, I guess.

because of global interpreter lock (GIL), it is not easy to implement multi-threading in pure python environment

That's quite right :). I had to do a bit of reading before finally deciding at least on what to try

if you have any idea how to do this efficiently

Now, while looking into this I've taken a few ideas into consideration, namely, threading, multiprocessing, concurrent.futures, dask futures and asyncio. Each has its pros and cons, and I finally decided to have a stab at it with plain threads.

Other than the fact that I used threading before more than the others, so I'm familiar with the framework, so I don't have to spend as much time learning how to use another platform, I thought they would be a good choice because w.r.t. IPC for passing datasets and models to and from the training thread:

  • threading is the most straightforward to use, i.e., it's just plain python, basically (of course, with proper synchronization)
  • multiprocessing:
  • dask has the upside that it's already being used within SMAC3, but the downside that, if we're scheduling the training session the same as we do every worker process, it might be executed on a different machine, so there might be some overhead transferring the data and model (maybe scatter data)

Of course, because of the GIL, the main optimization loop gets less cpu time, but although I haven't tested, I expect that the main thread would spend most time in the training section anyway. And I don't think the other parts in the loop are very CPU intensive anyway. Or are they? Do you know if, perhaps the inference or something takes really long?

Seond, SMAC is implemented based on BO, an algorithm designed for optimizing expensive black-box functions, such as training a deep neural network. Usually, the expense of evaluating a function is much more significant than training the surrogate model.

Well, other than expanding SMAC's use cases to things like database knob tuning and other tasks that also have a low evaluation time for the target cost function, it's also an impeding the scalability of SMAC. E.g., let's say you want to train a neural net that takes, on average, 2 hours on a gpu to fit to a dataset and you need to find the optimal parameters fast. In this case you might opt to do something like renting 256 machines on aws so that you increase the chance you get a decent neural net within a tight budget like 24 hours. In this case you will face the same problem, because you'll get new results every 30 seconds anyway, not because it takes little to train neural nets, but because of the number of training processes.

It is still unclear if directly starting a new configuration is better than waiting until the current surrogate model provides a new candidate.

If you're paying for an aws instance with a gpu you'd hate it to see it stay idle. You'd even prefer it even running a random config rather than just doing nothing.

Additionally, making ask operation asynchronous is equivalent to running one core with BO and the other cores with random search

Well, I'm not sure I understand this correctly, but if I do, then I don't think that is the case other than in the beginning. So, here's an example at the opposite end: let's say that you already finished 1,000,000 trials, and trained the RF on that data. Let's now say that a worker process finishes running the 1,000,001st trial, reports results and asks for a new configuration to try. What happens in the current setting is that the worker needs to wait for the model to retrain taking into consideration that last result, so it can benefit from the extra information added by that extra sample in the RF training set.

Now, my hypothesis is that there is only a minor benefit of doing this, as I don't think the expected improvement in the cost function caused by making the suggestion based on 1,000,001 vs 1,000,000 is going to be that much different anyway. And also, when setting retrain_after most of the trial configs will be sampled from outdated models anyway.

So, to be clear, I'm not suggesting using random configurations, but, rather, to adapt the retrain_after parameter dynamically, based on resource availability and demands for retraining.

LE: Here is a comparison between the performance of some IPC methods when sharing numpy data.

@dengdifan
Copy link
Contributor

Hi @bbudescu

Now I am confusing by your two examples.

If you're paying for an aws instance with a gpu you'd hate it to see it stay idle. You'd even prefer it even running a random config rather than just doing nothing.

In your first example (the neural network training one), you suggest sampling a new configuration randomly:

Well, I'm not sure I understand this correctly, but if I do, then I don't think that is the case other than in the beginning.

This is exactly the case for training a RF in a separated thread. Assuming that you have two cores, one core already finishes its configuration evaluation. Since your RF is still under training (you can not expect that both RF training and configuration evaluation ends at exact the same time clock for both CPUs), you cannot expect how long it would take to finish the RF training in another core(thread). Then in that case you simply start a new configuration randomly (as suggested by your AWS example) even the RF training will end just 1 CPU clock after. Therefore, in this case, you always have the core 1 running BO+evaluation and core 2 running random evaluation.

So, to be clear, I'm not suggesting using random configurations, but, rather, to adapt the retrain_after parameter dynamically, based on resource availability and demands for retraining.

Here you suggest to adapt the retrain_after parameters dynamically instead of sampling new configurations randomly. However, how should this value be scheduled? To which degree should we increase this value? Should we always apply this strategy or we only apply this if the number of budgets is higher than a threshold? Since we are aiming at arbitrary budgets and number of evaluations, all these need to be carefully designed to avoid performance degeneration

@bbudescu
Copy link
Author

bbudescu commented Dec 6, 2024

Here you suggest to adapt the retrain_after parameters dynamically instead of sampling new configurations randomly.

Sorry, I tried to make a comparison. Conceptually, training in a different thread would be somewhat similar to when retrain_after is greater than 1, because you'd use the same surrogate model, without any retraining, for generating multiple trial configs. I wasn't particularly clear about this.

In your first example (the neural network training one), you suggest sampling a new configuration randomly:
You'd even prefer it even running a random config rather than just doing nothing.

Yes, the key word here is "even". I think even a random config is better than nothing, but what I propose is to use a trained random forest to suggest new configs, just we shouldn't care so much that it was trained on the absolute latest results, just as in the case when retrain_after is greater than 1.

This is exactly the case for training a RF in a separated thread. Assuming that you have two cores, one core already finishes its configuration evaluation. Since your RF is still under training (you can not expect that both RF training and configuration evaluation ends at exact the same time clock for both CPUs), you cannot expect how long it would take to finish the RF training in another core(thread). Then in that case you simply start a new configuration randomly (as suggested by your AWS example) even the RF training will end just 1 CPU clock after. Therefore, in this case, you always have the core 1 running BO+evaluation and core 2 running random evaluation.

Ok, so look, let's say we have 2 cores. One core always trains (because training takes longer than evaluation), and one core always evaluates the cost function (i.e., it's a worker that runs trials):

  1. in the beginning, you have neither a model, nor the data to train one, so first, you need to run a few random trials
  2. once you have a bunch of data points, you can train an RF model. Let's assume that there are, like, 10 random trials (n_configs parameter of InitialDesign constructor).
  3. then, thread A starts training an RF model on the 10 data points
  4. thread B requests a config suggestion, but since the surrogate model is not yet done, it needs to wait for thread A to finish training
  5. thread A finishes training model v1 (trained on 10 previous trial results), and then just idles away because it hasn't received any new data to train an RF on
  6. thread B receives a configuration (config_11) and starts evaluating the cost function
  7. thread B finishes evaluating config_11 and reports back results
  8. thread A starts training model v2 (based on 11 trial results)
  9. thread B asks for a suggestion of a config to try next
  10. DESIRED BEHAVIOR: instead of waiting for model v2 to finish training (the old behavior), get the next config to try (config_12) immediately using the already available model v1 (has been trained on 10 of the 11 available data points, so it's NOT random). Thread B can, thus, start working and doesn't have to remain idle, waiting. This behavior of using the same model (v1) to suggest multiple configs to try even when it hasn't been trained with all the available data points is somewhat similar to using a retrain_after > 1.

NOTE: Indeed, at step 5 above, it's also possible for thread A to start running another random trial, but this is NOT the subject of this feature request, but rather a potential further optimization similar to what you said above.

NOTE 2: actually, what I think would be more relevant and more in line with what you asked (and I replied to), is that at step 4 above thread B can just start a random config without waiting for the first model to finish training. However, again, this would be just an optimization, and NOT the point of this issue and associated PR

@dengdifan
Copy link
Contributor

Hi @bbudescu,

As you said, we could simply set retrain_after greater than 1 or add a scheduler for retrain_after arguments to solve this problem. In my opinion, there is no need for us to train the RF in a separate thread that might result in potential conflict or problems.

@bbudescu
Copy link
Author

bbudescu commented Dec 11, 2024

Well, the problem with setting retrain_after to something fixed and constant is a one-size-fits-all approach which might even make the worst of both worlds:

  • in the beginning, when there isn't much data to train on, it will make unnecessarily bad suggestions because it trains less often than it can
  • in the end, when training takes a lot, we might still end up waiting for the training to finish

A tradeoff is hard to find, and even the best tradeoff will be suboptimal.

And, after all, I've just pushed some code to implement this feature. I know it's a potential risk, but, as far as I have come to understand, you guys are ok with at least offering some support for integration (I'm not talking about you implementing it yourself).

LE: It's not done yet, i.e., there are no tests, I haven't added in a way to switch to the old behavior, and I still have a few questions about, e.g., what to do with the AbstractModel._rng which gets overwritten within RandomForest.__init__, but I think the implementation is, at least in a stage where it can be tested until it works fine.

@bbudescu
Copy link
Author

bbudescu commented Dec 11, 2024

or add a scheduler for retrain_after

This should also be done dynamically, because, as a user, it's kinda hard to anticipate how long training will take on the current number of finished trials, how long running a trial will take, as it also depends on the hyperparameters in the suggested config, which evolve over time etc. I'm thinking there would be value in adding estimators for all of these, but I feel that just running a second thread on the main cpu when there's some new data available is easier.

LE: The alternative for the user would be to do trial and error until you get the schedule right, but that beats the whole purpose of making the session faster / more efficient

@bbudescu
Copy link
Author

Hi @dengdifan,

I think I was able to bring the code to a state in which it can be tested.

As per my #1178 (comment), could this, perhaps, be solved by a compromise like merging the pull request and exposing the option of using concurrent background training as an experimental feature, at the user's (documented) risk?

@dengdifan
Copy link
Contributor

Hi @bbudescu
Thanks for your effort. However, my concern is still: since the RF training process in your PR is asynchronous, you cannot guarantee that once a surrogate model is trained (and a new set of candidate acquisition function values are proposed), multiple workers are ready for running new configurations (as I proposed, the probability of this should be 0). So, for me, training the surrogate model in a subprocess might not provide much difference compared to our current implementation.

@bbudescu
Copy link
Author

Hi @dengdifan,

Thanks for your response. I'm really sorry, but I still don't understand what you mean, unfortunately. Would you care to, maybe elaborate a bit further, or perhaps try to explain in a different fashion than before? Perhaps, take a look at the code I propose in PR #1178, and show me where this might happen. I mean, the behavior using my code is definitely not what I was after, so by all means, if you can provide some insight, I'm really interested.

Perhaps also take into account the latest measurements and conclusions I posted in this #1178 (comment).

It seemed to me like @eddiebergman's comments in the PR sounded as if the stuff I proposed seemed to him like an idea he'd be on board with. @eddiebergman, could you, maybe, please help us get on the same page?

LE: I'm only guessing here, so I could be dead wrong, so please forgive me if I am. Do you, perhaps, think that I'm using Python's async and await? I meant "asynchronous" not in the strict sense of the python coroutines and asyncio, but in the wider sense of doing things concurrently, in parallel, i.e., moving forward without waiting for an operation to finish after launching it into execution.

@dengdifan
Copy link
Contributor

Hi @bbudescu

let's still use the example of the two workers scenario. Assuming that we have two workers A and B and we already have 10000 finished configurations.

That at time 0:00, A is free while B is still running, So we train RF with A.
While at time 0:01, B finishes its run while A is still training its surrogate model, regardless if it is trained in a separated thread. Since B cannot estimate how long A will finish its training, then either it starts a new configuration randomly, selecting a new configuration in the suggested candidate queues (when retrain_after is larger than 1) or waits until the acquisition function optimizer suggests the next configuration (this is the case for our current implementation if retrain_after =1). So personally, I would prefer to adjust the retrain_after values rather than training surrogate models in a background thread

@bbudescu
Copy link
Author

bbudescu commented Jan 3, 2025

Hi @dengdifan, and Happy New Year!

Thanks for taking the time to answer.

So personally, I would prefer to adjust the retrain_after values rather than training surrogate models in a background thread

The thought has crossed my mind (actually, this was my initial idea as well, see this, and this comments I made back in November when considering implementing ASHA first #1169, before even opening the current issue), but I think it's a bit harder to implement a dynamic scheduler that looks at current developments of the acquisition function's surface in a rigorous manner.

In order to make a decision on whether or not to start training or just keep sampling from the current model, one would have to somehow figure out which decision is more advantageous, and to do that, I'm thinking that one could compute something like the Expected Improvement of doing another training as an estimated benefit in a cost/benefit analysis. To establish whether it's worth it, perhaps a framework like Value of Information might prove useful.

To make an estimate of the cpu load, one would have to know:

  • how long the current trials are going to keep running
  • how long it will take to train another model
  • ideally, for how long trials on configs that are going to be sampled in the future will run

For these, we need additional models:

how long the current trials are going to keep running

I guess it'd make sense to consider training a model (another random forest, or just add another output to the current RF, in addition to the one that does cost estimation) that is able to predict how long we'd have to wait for the current trial to finish, given the sampled params. Optionally, one could take into account previous trial durations as input features (in an autoregressive manner) that could be used to improve the prediction when the previous samples don't cover the sampling space very well.

how long it will take to train another model

Also, it'd make sense to fit another model, perhaps not something as complicated as a random forest, but maybe to just do a regression to find some optimal coefficients for known functions, since we know that the computational complexity of training an RF should look something like O(n · log(n) · f ·k), where n is the number of samples, m is the number of features and k is the number of decision trees.

deally, for how long trials on configs that are going to be sampled in the future will run

Perhaps a simple regression will provide an accurate enough estimate of how long trials of configs sampled in the future will take.

My conclusion

All in all, I think that this looks more like a good subject for research, rather than just optimizing the resource usage a bit, and there has to be some easier way to achieve this.

I think there is more value in estimating the trial time given a certain configuration, because solvers, in general, when making the decisions on what to try next, should factor in trial duration and, perhaps even pecuniary cost to make better use of the given budget, and to increase the chances of finding a lower cost within the constraints.

LE: Perhaps other, simpler heuristics can be used to maintain the CPU load above a certain threshold. E.g., if the minimal load target is, say, 90%, one could start a new training when, since the last training finished, nine times the last training's duration had passed.

Also, as you can see, I sometimes tend overcomplicate things. If I've slid beyond reason, I apologize.

@bbudescu
Copy link
Author

bbudescu commented Jan 3, 2025

Now, back to the problem at hand. From #1170 (comment)

it starts a new configuration randomly

Why does it have to be random? Since "we already have 10000 finished configurations", presumably some model must have already been trained. Why not have the acquisition function optimizer use that older model to suggests the next configuration to run a trial on?

@bbudescu
Copy link
Author

bbudescu commented Jan 3, 2025

Another option would be to just use a fixed retrain_after, but that would mean that for a lot of cases, in the beginning of the optimization session, we'd be using models that are unnecessarily old when the acquisition function optimizer suggests the next configuration. And also, after a bunch of other trials, the fixed value of the retrain_after will prove too low, so we might end up getting the worst of both worlds, so using unnecessarily uninformed guesses, while still ending up keeping cpus idle while waiting for training to finish.

Ideally, the retrain_after would be 1 in the beginning, and it would increase when it notices that the cpu load is starting to decrease, but not before.

Having the user specify the schedule doesn't really work either, because he wouldn't possibly know it beforehand because trial times are dependent on params sampled by the acq fn optimizer, and he wouldn't know the training times either. So he could only figure the schedule out through trial and error, thus defeating the purpose of finding the best combo within a given budget of time and other types of costs (e.g., pecuniary).

@bbudescu
Copy link
Author

bbudescu commented Jan 3, 2025

Also, in #1170 (comment), I had listed another potential improvement, that now I'm thinking isn't all that great:

  • optionally: use an occupancy threshold, e.g., 90%, and allow worker threads to wait for training to finish only if the percentage of workers that are idle waiting for the new RF version is below 10%.

This, I think, will have the undesired effect of having a maximum CPU load of exactly 90% as soon as the models start taking long enough to train.

@bbudescu
Copy link
Author

bbudescu commented Jan 3, 2025

Also, take a look at this: #1178 (comment). It seems that currently the bottleneck is somewhere else altogether, because the cpu load is bad even when no model is being used, both with and without hyperband, so perhaps some thorough profiling is of higher priority right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants