Too slow. GPU support please #973

Open
jarlva opened this issue Oct 20, 2022 · 26 comments

@jarlva

jarlva commented Oct 20, 2022

Hi, first, I apologize in advance for using a bug report instead of Discussions and feature requests. I posted a request there a while back, but it got no activity.

Please add support for cuDF (CUDA-accelerated pandas) to speed things up. I have an NVIDIA GPU that sits idle while I wait 3 hours each time I change something. I believe lots of folks here are in the same situation.
Please make it happen.

@jarlva jarlva added the bug label Oct 20, 2022
@adhoc-research

+1. Have a 3 million row dataframe, a 16 hour wait time, and 4 V100s sitting idle :)

@arturdaraujo

@EQU1 GPU support would be great, but I recently tested tsfresh on Linux and it was 35 times faster. I can't say why, because most of my code runs only 1.2 to 1.3 times faster on Linux on average. I used Ubuntu on WSL.

@kempa-liehr
Collaborator

Thanks for the suggestion of using cudf. I will have a look into this package.

@arturdaraujo

Please take a look at #972

@nils-braun
Collaborator

nils-braun commented Feb 19, 2023

Thank you all for your input @jarlva, @rushatrai and @arturdaraujo!
Sorry for the delayed (or even missing) responses to your requests lately.

I personally do not have the bandwidth to implement this feature myself, but we welcome any kind of contribution (this is how open source works; please note that we do not need to be the only ones doing the implementation ;-)). If one of you has a bit of experience with cudf (or any other package in this context) and would like to contribute parts or a full implementation, we would be very happy to hear about it and collaborate! A GPU implementation would definitely be very nice to have.

@arturdaraujo

Ideally, I think the first step would be to use Numba or Cython for a speedup.

@beckernick

Would someone be able to share a reproducible example of the code they're running that they'd like to be able to run faster (with GPUs or otherwise)? My recollection is that a few of the operations took up most of the time when I've used tsfresh in the past, but I don't know if my experience was representative.

It would be great to document examples that illustrate the bottlenecks.

@nils-braun
Collaborator

@beckernick - in our experience, basically everything with the marker high_comp_cost in https://github.com/blue-yonder/tsfresh/blob/main/tsfresh/feature_extraction/feature_calculators.py performs poorly as the time series grow longer.
If users want to run only the feature calculators with a faster runtime, we recommend using EfficientFCParameters, which removes the expensive ones.
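
The idea of filtering calculators by a cost marker can be sketched roughly as follows. This is a simplified, standard-library illustration and not tsfresh's actual implementation; `mark_high_comp_cost`, `cheap_calculators`, and the two toy calculators are hypothetical names. In tsfresh itself, the filtered set is selected by passing `default_fc_parameters=EfficientFCParameters()` to `extract_features`.

```python
def mark_high_comp_cost(func):
    """Tag a feature calculator as expensive (hypothetical helper)."""
    func.high_comp_cost = True
    return func


def mean(x):
    # Cheap O(n) toy calculator.
    return sum(x) / len(x)


@mark_high_comp_cost
def pairwise_abs_diff_sum(x):
    # O(n^2) toy calculator standing in for an expensive feature.
    return sum(abs(a - b) for a in x for b in x)


def cheap_calculators(calculators):
    """Keep only the calculators that are not tagged as expensive."""
    return [f for f in calculators if not getattr(f, "high_comp_cost", False)]


all_calculators = [mean, pairwise_abs_diff_sum]
efficient = cheap_calculators(all_calculators)
print([f.__name__ for f in efficient])
```

The quadratic toy calculator shows why such features dominate the runtime once the series get long: doubling the series length quadruples its work, while the cheap one only doubles.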

@aurora5161

Hi, sorry to disturb. I recently used tsfresh on time series data. Because my data is large (more than 10 million rows), it cannot run on my computer. I tried to modify the code to run on the GPU, but it failed with errors such as "cudf.core.series can not be used in numpy of fft function". So I have a question: since tsfresh can run on Spark, can we use spark-rapids, which leverages GPUs, to accelerate tsfresh without modifying the code?

@beckernick

The challenge with using GPUs here is that much of the work is happening in the user-defined functions (UDFs) mentioned above that are applied on the DataFrame Groupby objects. And these specific UDFs happen to be ones that can't be translated to run on the GPU "as is". Using Spark RAPIDS or cuDF would allow you to accelerate the dataframe operations, but even if you could smoothly pass the GPU dataframes around inside tsfresh you'd still be bottlenecked on the UDFs running on your CPU(s).

It may be possible to write the computationally expensive UDFs to use the GPU and get a speedup, but it would likely require a rewrite of the functions from first principles.

@arturdaraujo

GPU support would be complicated to implement, guys. The next step here would be to use Numba. A 10x to 20x speedup is a significant change...

I already implemented my own version of tsfresh using Numba for the minimal functions. Numba loves loops, so I imagine the speedup could even exceed 20x.

@dom-white
Contributor

It may be possible to get a decent speed up without GPU support.

As tsfresh uses parallelization by default, this can cause performance issues when it calls underlying Python modules like SciPy and scikit-learn, which also (by default) attempt to distribute load across all processor cores when they drop down into C libraries.

This can lead to severe over-provisioning, where processors spend most of their time context switching rather than doing useful work.
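
A back-of-the-envelope sketch of that over-provisioning, assuming one worker process per core and native libraries that each open one thread per core (both are assumptions for illustration; `total_threads` and `oversubscription_factor` are hypothetical helper names):

```python
import os


def total_threads(n_worker_processes, threads_per_worker):
    """Number of runnable threads competing for the CPU."""
    return n_worker_processes * threads_per_worker


def oversubscription_factor(n_worker_processes, threads_per_worker, n_cores):
    """How many runnable threads there are per physical core."""
    return total_threads(n_worker_processes, threads_per_worker) / n_cores


cores = os.cpu_count() or 1

# Assumed default: one worker per core, each library thread pool sized to all cores.
print("default:", oversubscription_factor(cores, cores, cores))

# With the native libraries forced to a single thread per worker.
print("single-threaded libraries:", oversubscription_factor(cores, 1, cores))
```

Under these assumptions the number of threads per core grows linearly with the core count, which would be consistent with the effect being most dramatic on machines with many cores.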

I was recently looking into performance issues with our own Python notebook that we had implemented multiprocessing on, and noticed that forcing the underlying libraries to remain single-threaded gave a massive speed increase when using the multiprocessing module https://docs.python.org/3/library/multiprocessing.html

I then applied the same changes to my tsfresh notebook, and feature extraction for each of my time series data files (using efficient parameters) went from around 7 minutes to just 16 seconds!
Admittedly, this may be an extreme example, as I was running this on a server with ~100 cores, so your mileage may vary.

To get the underlying libraries to stay single core you need to do the following exports BEFORE starting the python environment you are using, otherwise there will be no difference:

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1

Here are a couple of links where I found some of this useful info:
https://thomasjpfan.github.io/parallelism-python-libraries-design/
https://docs.dask.org/en/stable/array-best-practices.html?highlight=OMP_NUM_THREADS#avoid-oversubscribing-threads

If this does help people, then it may be worth updating some part of the documentation to reflect this configuration adjustment.

@arturdaraujo

Can you show more code of how that works? Like a full script on applying this

@dom-white
Contributor

I think it should just be a case of running those three lines in your shell before invoking whatever Python environment you are running tsfresh in.
I am using JupyterLab inside a Docker environment, so I have added the extra environment variables to the Docker Compose YAML file that controls it.
I think you could also use a package like python-dotenv to set environment variables for your Python environment.
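
One way to get the same effect as the shell exports from a script is to launch the extraction in a child Python process whose environment already contains the three variables. The following is a minimal sketch using only the standard library; `single_threaded_env` and `run_single_threaded` are hypothetical helper names, and the demo child only echoes one variable instead of running tsfresh.

```python
import os
import subprocess
import sys

THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def single_threaded_env():
    """Copy of the current environment with the thread limits set to 1."""
    env = dict(os.environ)
    env.update({name: "1" for name in THREAD_VARS})
    return env


def run_single_threaded(args):
    """Run `python <args>` in a child process that starts with the limits applied."""
    return subprocess.run(
        [sys.executable, *args],
        env=single_threaded_env(),
        capture_output=True,
        text=True,
        check=True,
    )


# Demo child: prints the value it sees. In practice the arguments would point
# at your own extraction script (any such file name is an assumption here).
result = run_single_threaded(["-c", "import os; print(os.environ['OMP_NUM_THREADS'])"])
print(result.stdout.strip())
```

Because the variables are in place before the child interpreter starts, they are set before any numerical library is imported there, and the parent process keeps its original settings.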

@dom-white
Contributor

Just to clarify: you don't need an uber server to take advantage of this. Even on my laptop, which runs tsfresh inside a Linux virtual machine with access to only half my CPU cores (I have a Core i7 machine), stopping the over-provisioning still gave a 6.5x improvement:

original:

Feature Extraction: 100%|██████████| 40/40 [08:40<00:00, 13.02s/it]

with forcing libraries to single core:

Feature Extraction: 100%|██████████| 40/40 [01:20<00:00,  2.02s/it]
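
The 6.5x figure follows directly from the elapsed times in the two progress bars (08:40 versus 01:20). A small sketch of the arithmetic, where `to_seconds` is a hypothetical helper:

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' elapsed time, as printed in the progress bar, to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


before = to_seconds("08:40")  # default settings
after = to_seconds("01:20")   # libraries forced to a single thread
speedup = before / after

print(f"{before}s -> {after}s, speedup {speedup}x")
```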

@nils-braun
Collaborator

Hi @dom-white - very good! What you describe makes a lot of sense 👍. It might be even worse in tsfresh compared to other use-cases because we call so many different C functions (because there are many feature extractors) and therefore have the context switching even more often (?).
Would you like to add this to the documentation? I definitely think this is worth mentioning. Or do you think it might even make sense to set this by default (only for multiprocessing, because I assume this makes it slower in single processing)?

@dom-white
Contributor

Hi @nils-braun, yes, I think it would be helpful to add some information on this to the documentation.
It is a bit tricky, as people run tsfresh under different environments and OSes, and I have only got this to work by setting these environment variables before launching the Python environment. So before adding it, I could look into the simplest, most universal way of setting these environment variables.

@dom-white
Contributor

Hi @nils-braun, I managed to get this going on my 2nd attempt directly within JupyterLab, so I have had a go at updating the documentation and have created a pull request for it.

@arturdaraujo

If we can make this a default feature it will be a major upgrade for the package!! Thanks man

@dom-white
Contributor

Unfortunately, I think the only way of enabling this is for the user to add the environment variables as shown in the documentation pull request I added. If the environment variables were set within tsfresh, tsfresh itself would have to do this before any other module was imported, so the documentation would still need to tell users to import tsfresh before anything else. I think the way it is documented now may be the best solution.
It may be worth adding a link to the new section from one of the pages most likely to be read, like the FAQ page, to make sure it gets seen, e.g. FAQ: Is there anything I can do to speed up processing?
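
The import-order constraint described here can be illustrated with a small guard that sets the variables and reports which numerical modules were already imported at that point. This is only a sketch: `limit_native_threads` is a hypothetical helper, and the module list is an assumption about which libraries are likely to start native thread pools.

```python
import os
import sys

THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")
# Assumption: modules that may already have started native thread pools.
NUMERIC_MODULES = ("numpy", "scipy", "sklearn")


def limit_native_threads(n_threads=1):
    """Set the thread-limit variables and return modules that were imported too early."""
    already_loaded = [name for name in NUMERIC_MODULES if name in sys.modules]
    for name in THREAD_VARS:
        os.environ[name] = str(n_threads)
    return already_loaded


too_late = limit_native_threads()
if too_late:
    print("Already imported, so the limits may not apply to:", too_late)

# Only after this point would the numerical stack (and tsfresh) be imported.
```

A check like this only detects the problem; it cannot undo it, which is why setting the variables from inside a library is awkward.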

@SoCool1345

SoCool1345 commented Feb 25, 2024

import os

# Put this at the very top, before importing numpy, scipy or tsfresh.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

Add the code above and it also works on Windows 10.

@YamaByte
Contributor

I tried @dom-white's method and it did indeed speed tsfresh up quite drastically (21M rows, 3 features)!

However, do take note for those who wish to use this workaround: remember to revert the environment variables (those 3 that were listed) once you're done extracting if you intend to do some machine learning.

Immediately after my tsfresh feature extraction step (within the same kernel session), I was grid searching through XGBoost classifier hyper-parameters on my GPU with the extracted features. However, the training run time was significantly slower than I expected. Upon inspecting my GPU utilization, I noticed that it was oscillating between 0 and 100% at ~30s intervals when it should have stayed at 100% until the grid search finished.

It turned out that the CPU was the bottleneck, as seen from an extremely low CPU utilization (<10%). I do know that many GPU-enabled machine learning algorithms fall back on the CPU for certain intermediary computations (e.g. loss calculation for Keras neural networks), and XGBoost is likely doing the same somewhere along its pipeline. As such, the proposed thread limitations significantly affected these operations.

After saving my tsfresh features locally to drive, rebooting my python environment without those limitations, and restarting the machine learning pipeline, my training times were as fast as before: from ~36s per model -> 4s (significant if you are brute-force building and grid searching through 1350 models).

Hope this helps anyone out there who wants to speed up both feature extraction and model training!
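
One way to keep the limits scoped is a context manager that sets the three variables and restores the previous values afterwards. This is a sketch with a hypothetical helper name, `temporary_thread_limits`, and it comes with an important caveat: restoring the variables only affects processes started (or libraries first loaded) later. Libraries already loaded in the current session may keep their thread settings, which matches the need to restart the Python environment described above.

```python
import os
from contextlib import contextmanager

THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


@contextmanager
def temporary_thread_limits(n_threads=1):
    """Set the thread-limit variables inside the block, then restore the old values."""
    saved = {name: os.environ.get(name) for name in THREAD_VARS}
    os.environ.update({name: str(n_threads) for name in THREAD_VARS})
    try:
        yield
    finally:
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)  # variable was not set before
            else:
                os.environ[name] = value


with temporary_thread_limits():
    # A child process for feature extraction would be launched here,
    # so that it inherits the limited environment.
    inside = os.environ["OMP_NUM_THREADS"]

print("inside the block:", inside)
```

Running the extraction in a separate process and the model training in another keeps the two sets of thread settings from interfering with each other.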

@dom-white
Contributor

That's a good point. I had a notebook purely for tsfresh feature extraction, so I did not encounter this issue. Thanks for highlighting its possible negative effect.

@nils-braun
Collaborator

Thanks for sharing with the community @YamaByte! Would you think it makes sense to add one sentence about this issue into the respective docs that @dom-white added? Happy to review your PR :)

@beyondguo

beyondguo commented Nov 4, 2024

It may be possible to get a decent speed up without GPU support.

To get the underlying libraries to stay single core you need to do the following exports BEFORE starting the python environment you are using, otherwise there will be no difference:

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1

Hi @dom-white, thanks for sharing this trick.

I used this setting and set n_jobs=64 in the roll_time_series and extract_features functions. However, I noticed that 69 processes were running but CPU utilization was only 500%. What could be the problem? I have a lot of cores on my machine, but it seems that my resources are not fully utilized. @nils-braun

@nils-braun
Collaborator

🤔 The feature calculators only do number crunching and no IO, so I don't see why there wouldn't be 100% usage per core. Could the data you are using be close to your memory limit, so that your OS needs to swap all the time?
