Too slow. GPU support please #973
+1. Have a 3-million-row dataframe, a 16-hour wait time, and 4 V100s sitting idle :)
@EQU1 A GPU would be great, but I recently tested tsfresh on Linux and it was 35x faster. I can't say why this happened, because most of my code runs only 1.2 to 1.3x faster on Linux on average. I used Ubuntu on WSL.
Thanks for the suggestion of using cudf. I will look into this package.
Please take a look at #972
Thank you all for your input @jarlva, @rushatrai and @arturdaraujo! I personally do not have the bandwidth to implement this feature myself, but we welcome any kind of contribution (this is how open source works; please note that we do not need to be the only ones doing the implementations ;-)). If one of you has a bit of experience with cudf (or any other package in this context) and would like to contribute parts of or a full implementation, we would be very happy to hear about it and collaborate! A GPU implementation would definitely be very nice to have.
Ideally, I think the first step would be to implement numba or Cython for a speed-up.
Would someone be able to share a reproducible example of the code they're running that they'd like to run faster (with GPUs or otherwise)? My recollection is that a few of the operations took up most of the time when I've used tsfresh in the past, but I don't know if my experience was representative. It would be great to document examples that illustrate the bottlenecks.
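For a concrete starting point, a minimal benchmark along those lines might look like this (a sketch with synthetic data; the sizes and shapes are placeholders, not anyone's actual workload):

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features

# Synthetic long-format data: many short time series, as tsfresh expects.
n_series, series_len = 5_000, 100
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_series), series_len),
    "time": np.tile(np.arange(series_len), n_series),
    "value": np.random.randn(n_series * series_len),
})

# With the default (comprehensive) feature set this is the slow path people
# report; pass default_fc_parameters=EfficientFCParameters() (from
# tsfresh.feature_extraction) to compare against the cheaper preset.
features = extract_features(df, column_id="id", column_sort="time")
print(features.shape)
```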
@beckernick - from our experience, basically everything with the marker
Hi, sorry to disturb. I recently used tsfresh on time series data. Because my data is large (more than 10 million rows), it cannot run on my computer. I tried to modify the code to run on the GPU, but it failed with errors such as "cudf.core.series can not be used in numpy of fft function". So I have a question: can we use spark-rapids, which leverages GPUs, to accelerate tsfresh without modifying the code, since tsfresh can run on Spark?
The challenge with using GPUs here is that much of the work happens in the user-defined functions (UDFs) mentioned above, which are applied to the DataFrame groupby objects. And these specific UDFs happen to be ones that can't be translated to run on the GPU as-is. Using Spark RAPIDS or cuDF would let you accelerate the dataframe operations, but even if you could smoothly pass the GPU dataframes around inside tsfresh, you would still be bottlenecked on the UDFs running on your CPU(s). It may be possible to rewrite the computationally expensive UDFs to use the GPU and get a speedup, but it would likely require rewriting the functions from first principles.
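A simplified sketch of the pattern being described (not tsfresh's actual internals): the per-group feature calculators are arbitrary NumPy-based Python functions, which is exactly the kind of UDF a GPU groupby generally cannot execute as-is:

```python
import numpy as np
import pandas as pd

def spectral_peak(values: pd.Series) -> float:
    """A feature-calculator-style UDF: plain NumPy on one series.
    GPU dataframe groupby UDFs can't call into arbitrary NumPy
    routines like np.fft, so this part stays on the CPU."""
    spectrum = np.abs(np.fft.rfft(values.to_numpy()))
    return float(spectrum.argmax())

df = pd.DataFrame({
    "id": np.repeat([0, 1], 100),
    "value": np.random.randn(200),
})

# The groupby scaffolding could move to the GPU, but every call into
# the UDF below is CPU work - the bottleneck described above.
peaks = df.groupby("id")["value"].apply(spectral_peak)
print(peaks)
```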
A GPU would be complicated to implement, GPU guys. The next step here would be to implement numba; a 10x to 20x speed-up is a significant change... I already implemented my own version of tsfresh using numba for a minimal set of functions. Numba loves loops, so I imagine it could even be above a 20x speed-up.
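As an illustration of the kind of rewrite meant here (a hedged sketch, not the actual numba port mentioned above): a feature like tsfresh's mean_abs_change written as a plain loop and compiled with Numba:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def mean_abs_change(x):
    # Plain loop instead of np.mean(np.abs(np.diff(x))) - exactly the
    # style of code Numba compiles well.
    total = 0.0
    for i in range(len(x) - 1):
        total += abs(x[i + 1] - x[i])
    return total / (len(x) - 1)

x = np.random.randn(1_000_000)
mean_abs_change(x)  # first call includes JIT compilation
print(mean_abs_change(x))
```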
It may be possible to get a decent speed-up without GPU support. tsfresh uses parallelization by default, and this can cause performance issues when underlying Python modules like SciPy and scikit-learn also (by default) attempt to distribute load across all processor cores once they drop down into C libraries. This can lead to severe over-provisioning, where processors spend most of their time context switching rather than doing useful work.

I was recently looking into performance issues with our own Python notebook, which we had implemented multiprocessing on, and noticed that by forcing the underlying libraries to remain single-threaded I saw a massive speed increase when using the multiprocessing module (https://docs.python.org/3/library/multiprocessing.html). I then enforced the same changes for my tsfresh notebook, and it went from taking around 7 minutes to feature-extract each of my time series data files (using the efficient parameters) to just 16 seconds!

To get the underlying libraries to stay single-core you need to do the following exports BEFORE starting the Python environment you are using, otherwise there will be no difference:
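For reference, the standard single-threading trio for OpenMP, OpenBLAS and MKL (a reconstruction based on the OMP_NUM_THREADS line quoted later in this thread; the exact set of variables used may have differed):

```bash
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
```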
Here are a couple of links where I found some of this useful info: If this does help people, then it may be worth updating some part of the documentation to reflect this configuration adjustment.
Can you show more code of how that works? Like a full script applying this.
I think it should just be a case of running those three lines in your shell before invoking whatever Python environment you are running tsfresh in.
Just to clarify that you don't need an uber server to take advantage of this. Even on my laptop, which runs tsfresh inside a Linux virtual machine with access to only half my CPU cores (I have a Core i7 machine), stopping the over-provisioning still gave me a 6.5x improvement over the original run once I forced the libraries to a single core.
Hi @dom-white - very good! What you describe makes a lot of sense 👍. It might be even worse in tsfresh compared to other use cases, because we call so many different C functions (there are many feature extractors) and therefore hit the context switching even more often(?)
Hi @nils-braun, yes, I think it would be helpful to add some information on this to the documentation.
Hi @nils-braun, I managed to get this going on my second attempt directly within JupyterLab, so I have had a go at updating the documentation and have created a pull request for it.
If we can make this a default feature it will be a major upgrade for the package!! Thanks man
Unfortunately, I think the only way of enabling this is for the user to add the environment variables as shown in the documentation pull request I created. If the environment variables were set within tsfresh, tsfresh itself would have to do this before any other module was imported, so it would still require documenting to the user that they need to import tsfresh before anything else. I think how it is documented may be the best solution.
os.environ["OMP_NUM_THREADS"] = "1"
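Spelled out as a pattern (a sketch of the approach discussed above; the crucial point is that the variables are set before anything that loads the OpenMP/BLAS runtimes gets imported):

```python
import os

# These must be set before numpy/scipy/tsfresh are imported, otherwise
# the thread pools are already initialised and the variables have no effect.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import tsfresh  # noqa: E402  (deliberately imported after the env setup)
```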
I tried @dom-white's method and it did indeed speed tsfresh up quite drastically (21M rows, 3 features)! However, a note for those who wish to use this workaround: remember to revert the environment variables (the three that were listed) once you're done extracting, if you intend to do some machine learning afterwards.

Immediately after my tsfresh feature extraction step (within the same kernel session), I was grid searching through XGBoost classifier hyperparameters on my GPU with the extracted features. However, the training run time was significantly slower than I expected. Upon inspecting my GPU utilization, I noticed it was oscillating between 0 and 100% at ~30s intervals, when it should have stayed constantly at 100% until the grid search ended. It turns out the CPU was causing the bottleneck, as seen from an extremely low CPU utilization (<10%). Many GPU-enabled machine learning algorithms fall back on the CPU for certain intermediary computations (e.g. loss calculation for Keras neural networks), and XGBoost is likely doing the same somewhere along its pipeline, so the proposed thread limitations significantly affected those operations.

After saving my tsfresh features to disk, rebooting my Python environment without those limitations, and restarting the machine learning pipeline, my training times were as fast as before: from ~36s per model down to 4s (significant if you are brute-force building and grid searching through 1350 models). Hope this helps anyone out there who wants to speed up both feature extraction and model training!
That's a good point. I had a notebook purely for tsfresh feature extraction, so I did not encounter this issue. Thanks for highlighting this possible negative effect.
Thanks for sharing with the community @YamaByte! Do you think it makes sense to add a sentence about this issue to the respective docs that @dom-white added? Happy to review your PR :)
Hi @dom-white, thanks for sharing this trick. I used this setting and set
🤔 The feature calculators are only doing number crunching and no IO, so I don't see why there wouldn't be 100% usage per core. Is the data you are using maybe close to your memory limit, so your OS needs to swap all the time?
Hi, first of all, apologies in advance for using a bug report instead of Discussions and feature requests. I posted a request there a while back with no activity.
Please add support for CUDA pandas (cuDF) to accelerate things. I have an NVIDIA GPU that sits idle while I wait 3 hours each time I change something. I believe lots of folks here are in the same situation.
Please, make it happen.