Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

train rf model in background thread #1178

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

bbudescu
Copy link

@bbudescu bbudescu commented Dec 5, 2024

work in progress on #1170

@eddiebergman
Copy link
Contributor

eddiebergman commented Dec 5, 2024

Heyo, just a question why this is done as a thread and not a subprocess. Thread is easier by many stretches but has quite a few downsides compared to a process, at least in Python:

Thread:

  • Will not be parallel, i.e. fights for computation of main process (it's a Python thing). It's more that it will go-back-and-forth between main process computation and this thread. This is due to the GIL.
  • Uses same memory as main process, easier to transfer data for training.
  • Thread's can not be be interrupted and can cause the main process to not close, handling Ctrl+C or other system signals can be difficult.

Process:

  • Will be parallel.
  • Has seperate memory from main process, requires sending data to subprocess. Usually quite cheap and handled trivially using something like the builtin ProcessPoolExecutor.
  • System level signals propogate to subprocesses, no special handling required.

@bbudescu
Copy link
Author

bbudescu commented Dec 5, 2024

Will not be parallel, i.e. fights for computation of main process (it's a Python thing). It's more that it will go-back-and-forth between main process computation and this thread. This is due to the GIL.

Yes, that's what I was on about in #1170 (comment):

Of course, because of the GIL, the main optimization loop gets less cpu time, but although I haven't tested, I expect that the main thread would spend most time in the training section anyway. And I don't think the other parts in the loop are very CPU intensive anyway. Or are they? Do you know if, perhaps the inference or something takes really long?

So, actually, I think with running it in a separate thread, instead of having e.g., 90% of the time of the main process doing training, we now might end up giving it only 50%, and leaving the other 50% for other stuff like reporting and inference etc. if there's a demand for that.

Also, having threads instead of processes (either through multiprocessing or dask), doesn't take away one core from the workers doing cost function evaluation. Perhaps that's not much of an advantage, though.

Thread's can not be be interrupted and can cause the main process to not close, handling Ctrl+C or other system signals can be difficult.

I think this should already be fixed by daemonizing the thread (docs, further explanation).

Has seperate memory from main process, requires sending data to subprocess. Usually quite cheap and handled trivially using something like the builtin ProcessPoolExecutor.

Not sure how random forest serialization is handled. I know that for Queues objects are pickled, but I'm not sure how concurrent.futures pass return values. I know that pyrfr.binary_rss_forest exposes ascii_string_representation(), which can be used for serialization.

LE: Oh, wait... I just noticed that it also implements __getstate__ and __setstate__, which rely on ascii_string_representation internally, so it should be pickleable.

Also, for the training data (X and y and the derived data), as I also wrote in #1170 (comment), one could also use shared memory arrays because they're contiguous and homogeneous numpy arrays, which can be buffered in multiprocessing.Array-like objects.

@bbudescu
Copy link
Author

bbudescu commented Dec 5, 2024

Thread's can not be be interrupted and can cause the main process to not close, handling Ctrl+C or other system signals can be difficult.

Ah, damn, I wasn't aware of python/cpython#80116. (also this)

@eddiebergman
Copy link
Contributor

Heyo, some extra feedback:

  • Daemonizing threads is a dangerous route too. They effectively detach themselves from the main process, and so if there is a bug and the thread runs forever, well, good luck. I've also explored a lot of this in pynisher, which is consequently what SMAC uses to run evaluations in a subprocess with the time (and memory?) limits. I just remember threads being the bane of lots of issues in the tests for that library.
  • Re serialization for processes: You're indeed right that mmep of arrays is efficient and useful. Rather than re-solve the wheel, you could use joblib, which is already a dependancy of scikit-learn. It already wraps a lot of the Python native ways of concurrent running of things, as well as implementing their own wrapper around ProcessPoolExecutor which natively handles mmep on objects that can be mem-mapped. It's a bit less low level, and may prevent certain operations but give the ubiqutous use of scikit-learn and how stable they've made it, it's probably best to rely on their implementations. (They also have threads fyi)

Tldr; concurrent programming in python is a mystical art with so many-caveats, and really experimentation is the only way to identify what works consistently. Try out joblib and if it works, I think that's the simplest solution with minimal chances for surprises later on.

Bogdan Budescu added 6 commits December 10, 2024 17:04
…ry (backed by mmap on posix-compliant OSs) for data, and queues for synchronization
- work in progress: switch from threads to processes
…ed within background thread

- finish first version of switch from threads to processes
@bbudescu
Copy link
Author

bbudescu commented Dec 11, 2024

Hi @eddiebergman,

I couldn't figure out a way to use joblib to streamline the implementation of the behavior I wanted, namely, to have a loop that continuously trains on new data as it arrives, and that always makes the current model immediately available whenever _predict is called (i.e., even if another model is currently training on more recent data), so I ended up keeping the old code structure, but switched from threading to the builtin multiprocessing.

Now, the code I pushed isn't yet tested, neither manually/visually, nor by unit tests. Also, I'd like to add a flag that would allow the user to switch to the old, synchronous behavior, which I think would help test things, and also allow users to fall back on the old code until the new code is reliable enough. However, all the logic should be there (unless I made some gross mistakes), so that should give us at least a basis for discussion.

In this stage, I wanted to ask you about how the AbstractModel._rng member (a np.random.RandomState that is overwritten by a pyrfr.regression.default_random_engine within RandomForest.__init__) is supposed to be used outside of the training logic (I imagine it might be serialized when saving the experiment state?), and how to handle it best. Currently, I've basically made the main thread agnostic of the rng used for training (except for the seed that is fed to it on initialization). Should I pass it back to the main process through the Queue I'm using for models? In this case, do you have any idea on how to handle serialization / [un-]pickling?

Also, given the non-deterministic nature of OS scheduling, running the same optimization session twice, even with the same rng seed might result in different results, especially if running with a different number of workers, but I understand that's already the case , as per the warning in the Parallelism section of the Advanced Usage/Parallelism pagein the docs:

Warning: When using multiple workers, SMAC is not reproducible anymore.

Other than that, if you want to throw an eye on what I've implemented so far and, if you, perhaps, find any flagrant mistakes, any feedback and support would be greatly appreciated.

Thanks,

Bogdan

Bogdan Budescu added 10 commits December 11, 2024 17:58
… wait until training is done before being able to query the model to suggest a new config to try
- better synchronization / signalling between optimization loop and training loop
- refactor:
  - improve shared array semantics
  - encapsulation: reuse more allocation / cleanup code
  - defensive: extra checks
- other minor fixes / improvements
… multiple sessions in parallel, e.g., when running tests
…ests to terminate the training loop process gracefully (to as high as an extent as possible)

- add (and then disable) some code that prints to console to help debug inter-process synchronization
- refactor: renames for improved legibility
- refactor: encapsulate and reuse
- add option to run testing code without pytest for debug
- modify some testing code to avoid deprecation warnings
- refactor:
  - renames: improved legibility
  - easier debug printing for sync between opt and train loop processes
@bbudescu
Copy link
Author

bbudescu commented Dec 13, 2024

Hi @eddiebergman,

I managed to bring the code to a point where it passes all the tests without hanging. Indeed, it wasn't a pretty job :).

Could you now, please, advise on what's needed for this PR to be merged?

Here are some potential improvements I could think of:

1. UPDATE: DONE: Option to Disable Concurrency

Add a flag to revert to old implementation, and not just behavior. Note, I've already added the option to wait for training to be completed after every submission of fresh training data, but I was thinking of disabling multiprocessing altogether)

2. Unit Tests

Would you like me to add some unit tests? How do you think they should look like?

3. Training Data IPC

In order to decrease risks, I could eliminate my own homebrew mmap-backed implementation of a dynamic array and use some other way to pass the training data between the optimization loop and the training loop process.

You suggested joblib above, but I don't think the futures-like interface can really suit my need to implement a loop discarding all but the latest data (or, at least, I wasn't able to figure out how to use it to attain my goals). However, upon further investigation, I found out that, under the hood, joblib uses cloudpickle (pypi, src) for more complex objects (like numpy.arrays) that can't be handled by the builtin pickle. An alternative to cloudpickle would be dill (pypi, docs, src), which has been around for quite some time.

I know you would prefer not to add further dependencies to the project, but I thought it doesn't hurt to ask :). I found the multiprocess package (pypi, docs, src) that acts a drop-in replacement for the multiprocessing builtin module that uses dill for serialization. Both the project's PyPI release history and github commit history show that it still has regular maintenance updates, and it has had activity since 2015, being part of another project, with some ngo backing and some academic connection (at least in the beginning of the project, the project author and maintainer was affiliated with CalTech and other institutes).

This would also simplify the inter-process signalling required to release and unlink the shared memory, and would avoid potential leaks (see 5., below).

4. RNG

Still haven't heard back from you about the RandomForest._rng member that overrides AbstractModel._rng

5. Leaks, hangs

As per the code here, in some complex dependency graphs, the garbage collector doesn't get to call the __del__'s of various objects, so execution just hangs unless the user explicitly calls close() on the model or objects containing it.

This shouldn't normally be an issue neither in the common use cases, nor when running tests with pytest (I needed to add code to explicitly kill the background process in the fixture teardown), but still, I think if this should be fixed if possible.

@bbudescu
Copy link
Author

bbudescu commented Dec 13, 2024

Hi again,

I just added an option to disable multiprocessing altogether and switching back to the old behavior. Would this, perhaps, help allow the merging of the pull request, and somehow expose this to the user as an experimental feature, at his own risk?

Thanks!

@bbudescu
Copy link
Author

bbudescu commented Dec 18, 2024

I ran a session similar to the one the data from which the cpu load plot in #1170 (comment) was generated, and also the performance reported in the motivation section of #1169 (comment):

I was checking out an optimization session running on 64 cores and noticed the cpu load drop below 50% after 12 hours. Not sure yet exactly how many trials it had gotten through at that point (after 22 hours it got through 12600 trials, and cpu load is under 40%).

Currently, it doesn't seem that the code I wrote has managed to fix it.

So here are some plots:

1. The original 46h session

Screenshot 2024-11-23 at 11-59-52 Instances EC2 eu-west-1

2. Another 48h session (smooth)

Screenshot 2024-12-16 at 14-49-13 Instances EC2 eu-west-1

LE: 20.4k trials on 17k distinct configs in 48h

3.a. UPDATED: Current 24h Session

Screenshot 2024-12-19 at 10-12-43 Instances EC2 eu-west-1

3.b. UPDATED: Current 24h Session (smooth)

Screenshot 2024-12-19 at 10-14-07 Instances EC2 eu-west-1

Observation

After 24h the session got through 14450 trials on 64 cores, and cpu load is about 42%.

LE: 22k trials on 18.4k distinct configs in 48h

Conclusion

Either the code is broken, or the bottleneck happens for a different reason, e.g., because of Hyperband.

LE: There's a non-insignificant gain of about 10% in the number of trials/configs when enabling the code in this PR (i.e., between 2 and 3 above). So, in light of these new observations, I think it's that a new hypothesis, that this code fixes a bottleneck but there might be another one somewhere else, seems to become more likely.

Next steps:

  • reproduce behavior quicker, i.e., without requiring a 64 core machine and wait for at least 12 hours to get a definite observation of a drop in cpu load
  • do some profiling to measure on which lock most workers spend most time waiting on (LE: [Feature Request] Profiling #1181); EVEN LATER EDIT: actually, given the execution model, I think it's enough to profile the main process running the optimization loop
  • LE: a quick way to eliminate the hyperband intensifier would be to run the same session without it

…more easily identifiable in monitoring tools
@bbudescu
Copy link
Author

bbudescu commented Dec 21, 2024

Additional CPU load graphs

Hi guys. I ran another bunch of optimization sessions turning off various components of SMAC to try and identify where the bottleneck lies. Here are the results:

1. Hyperband disabled

1.1. With Background Training

Screenshot 2024-12-21 at 11-17-23 Instances EC2 eu-west-1
6.3k trials in 48 hours

1.2. Without Background Training (my smac3 clone)

Screenshot 2024-12-21 at 11-26-41 Instances EC2 eu-west-1
6.16k trials in 48 hours

1.4. Without Background Training (original smac3 repo)

This was just to test if my code introduces any bottlenecks.
Screenshot 2024-12-21 at 11-31-31 Instances EC2 eu-west-1
6.24k trials in 48 hours

2. Random Sampling

  • no Random Forest Surrogate Model

2.1. Hyperband Disabled

Screenshot 2025-01-03 at 18-36-00 Instances EC2 eu-west-1

  • ~ 5.5k-6k trials in 46 hours
  • TODO: trial count after 48 hours

2.2. With Hyperband Enabled

Screenshot 2025-01-03 at 18-37-07 Instances EC2 eu-west-1

  • TODO: trial count

Conclusions

It seems now that the problem lies neither with training the RF model (when doing random sampling, there are fewer trials done than when using RF; mind, however, that trial time, although deterministic, varies with the chosen params), neither with Hyperband. I guess that some profiling is highly necessary to identifiy the performance bottlenecks in a parallel setting.

Later, I will update the above graphs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants