Too slow. GPU support please #973
+1. Have a 3-million-row dataframe, a 16-hour wait time, and 4 V100s sitting idle :)
@EQU1 A GPU would be great, but I recently tested tsfresh on Linux and it was 35x faster. I can't say why this happened, because most of my code runs only 1.2 to 1.3x faster on Linux on average. I used Ubuntu on WSL.
Thanks for the suggestion of using cudf. I will look into this package.
Please take a look at #972
Thank you all for your input @jarlva, @rushatrai and @arturdaraujo! I personally do not have the bandwidth to implement this feature myself, but we welcome any kind of contribution (this is how open source works; please note that we do not need to be the only ones doing the implementations ;-)). If one of you has a bit of experience with cudf (or any other package in this context) and would like to contribute parts of or a full implementation, we would be very happy to hear about it and collaborate! A GPU implementation would definitely be very nice to have.
Ideally, I think the first step would be to implement numba or Cython for a speed-up.
Would someone be able to share a reproducible example of the code they're running that they'd like to run faster (with GPUs or otherwise)? My recollection is that a few of the operations took up most of the time when I've used tsfresh in the past, but I don't know if my experience was representative. It would be great to document examples that illustrate the bottlenecks.
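For a concrete starting point, a minimal benchmark along those lines might look like this (a sketch with synthetic data; the sizes and shapes are placeholders, not anyone's actual workload):

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features

# Synthetic long-format data: many short time series, as tsfresh expects.
n_series, series_len = 5_000, 100
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_series), series_len),
    "time": np.tile(np.arange(series_len), n_series),
    "value": np.random.randn(n_series * series_len),
})

# With the default (comprehensive) feature set this is the slow path people
# report; pass default_fc_parameters=EfficientFCParameters() (from
# tsfresh.feature_extraction) to compare against the cheaper preset.
features = extract_features(df, column_id="id", column_sort="time")
print(features.shape)
```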
@beckernick - from our experience, basically everything with the marker
Hi, sorry to disturb. I recently used tsfresh on time series data. Because my data is large (more than 10 million rows), it cannot run on my computer. I tried to modify the code to run on the GPU, but it failed with errors such as "cudf.core.series can not be used in numpy of fft function". So I have a question: can we use spark-rapids, which leverages GPUs, to accelerate tsfresh without modifying the code, since tsfresh can run on Spark?
The challenge with using GPUs here is that much of the work happens in the user-defined functions (UDFs) mentioned above, which are applied to the DataFrame groupby objects. And these specific UDFs happen to be ones that can't be translated to run on the GPU as-is. Using Spark RAPIDS or cuDF would let you accelerate the dataframe operations, but even if you could smoothly pass the GPU dataframes around inside tsfresh, you would still be bottlenecked on the UDFs running on your CPU(s). It may be possible to rewrite the computationally expensive UDFs to use the GPU and get a speedup, but it would likely require rewriting the functions from first principles.
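A simplified sketch of the pattern being described (not tsfresh's actual internals): the per-group feature calculators are arbitrary NumPy-based Python functions, which is exactly the kind of UDF a GPU groupby generally cannot execute as-is:

```python
import numpy as np
import pandas as pd

def spectral_peak(values: pd.Series) -> float:
    """A feature-calculator-style UDF: plain NumPy on one series.
    GPU dataframe groupby UDFs can't call into arbitrary NumPy
    routines like np.fft, so this part stays on the CPU."""
    spectrum = np.abs(np.fft.rfft(values.to_numpy()))
    return float(spectrum.argmax())

df = pd.DataFrame({
    "id": np.repeat([0, 1], 100),
    "value": np.random.randn(200),
})

# The groupby scaffolding could move to the GPU, but every call into
# the UDF below is CPU work - the bottleneck described above.
peaks = df.groupby("id")["value"].apply(spectral_peak)
print(peaks)
```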
A GPU would be complicated to implement, GPU guys. The next step here would be to implement numba; a 10x to 20x speed-up is a significant change... I already implemented my own version of tsfresh using numba for a minimal set of functions. Numba loves loops, so I imagine it could even be above a 20x speed-up.
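As an illustration of the kind of rewrite meant here (a hedged sketch, not the actual numba port mentioned above): a feature like tsfresh's mean_abs_change written as a plain loop and compiled with Numba:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def mean_abs_change(x):
    # Plain loop instead of np.mean(np.abs(np.diff(x))) - exactly the
    # style of code Numba compiles well.
    total = 0.0
    for i in range(len(x) - 1):
        total += abs(x[i + 1] - x[i])
    return total / (len(x) - 1)

x = np.random.randn(1_000_000)
mean_abs_change(x)  # first call includes JIT compilation
print(mean_abs_change(x))
```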
It may be possible to get a decent speed-up without GPU support. tsfresh uses parallelization by default, and this can cause performance issues when underlying Python modules like SciPy and scikit-learn also (by default) attempt to distribute load across all processor cores once they drop down into C libraries. This can lead to severe over-provisioning, where processors spend most of their time context switching rather than doing useful work.

I was recently looking into performance issues with our own Python notebook, which we had implemented multiprocessing on, and noticed that by forcing the underlying libraries to remain single-threaded I saw a massive speed increase when using the multiprocessing module (https://docs.python.org/3/library/multiprocessing.html). I then enforced the same changes for my tsfresh notebook, and it went from taking around 7 minutes to feature-extract each of my time series data files (using the efficient parameters) to just 16 seconds!

To get the underlying libraries to stay single-core you need to do the following exports BEFORE starting the Python environment you are using, otherwise there will be no difference:
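For reference, the standard single-threading trio for OpenMP, OpenBLAS and MKL (a reconstruction based on the OMP_NUM_THREADS line quoted later in this thread; the exact set of variables used may have differed):

```bash
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
```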
Here are a couple of links where I found some of this useful info: If this does help people, then it may be worth updating some part of the documentation to reflect this configuration adjustment.
Can you show more code of how that works? Like a full script applying this.
I think it should just be a case of running those three lines in your shell before invoking whatever Python environment you are running tsfresh in.
Just to clarify that you don't need an uber server to take advantage of this. Even on my laptop, which runs tsfresh inside a Linux virtual machine with access to only half my CPU cores (I have a Core i7 machine), stopping the over-provisioning still gave me a 6.5x improvement over the original run once I forced the libraries to a single core.
Hi @dom-white - very good! What you describe makes a lot of sense 👍. It might be even worse in tsfresh compared to other use cases, because we call so many different C functions (there are many feature extractors) and therefore hit the context switching even more often(?)
Hi @nils-braun, yes, I think it would be helpful to add some information on this to the documentation.
Hi @nils-braun, I managed to get this going on my second attempt directly within JupyterLab, so I have had a go at updating the documentation and have created a pull request for it.
If we can make this a default feature it will be a major upgrade for the package!! Thanks man
Unfortunately, I think the only way of enabling this is for the user to add the environment variables as shown in the documentation pull request I created. If the environment variables were set within tsfresh, tsfresh itself would have to do this before any other module was imported, so it would still require documenting to the user that they need to import tsfresh before anything else. I think how it is documented may be the best solution.
os.environ["OMP_NUM_THREADS"] = "1"
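Spelled out as a pattern (a sketch of the approach discussed above; the crucial point is that the variables are set before anything that loads the OpenMP/BLAS runtimes gets imported):

```python
import os

# These must be set before numpy/scipy/tsfresh are imported, otherwise
# the thread pools are already initialised and the variables have no effect.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import tsfresh  # noqa: E402  (deliberately imported after the env setup)
```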
I tried @dom-white's method and it did indeed speed tsfresh up quite drastically (21M rows, 3 features)! However, a note for those who wish to use this workaround: remember to revert the environment variables (the three that were listed) once you're done extracting, if you intend to do some machine learning afterwards.

Immediately after my tsfresh feature extraction step (within the same kernel session), I was grid searching through XGBoost classifier hyperparameters on my GPU with the extracted features. However, the training run time was significantly slower than I expected. Upon inspecting my GPU utilization, I noticed it was oscillating between 0 and 100% at ~30s intervals, when it should have stayed constantly at 100% until the grid search ended. It turns out the CPU was causing the bottleneck, as seen from an extremely low CPU utilization (<10%). Many GPU-enabled machine learning algorithms fall back on the CPU for certain intermediary computations (e.g. loss calculation for Keras neural networks), and XGBoost is likely doing the same somewhere along its pipeline, so the proposed thread limitations significantly affected those operations.

After saving my tsfresh features to disk, rebooting my Python environment without those limitations, and restarting the machine learning pipeline, my training times were as fast as before: from ~36s per model down to 4s (significant if you are brute-force building and grid searching through 1350 models). Hope this helps anyone out there who wants to speed up both feature extraction and model training!
That's a good point. I had a notebook purely for tsfresh feature extraction, so I did not encounter this issue. Thanks for highlighting this possible negative effect.
Thanks for sharing with the community @YamaByte! Do you think it makes sense to add a sentence about this issue to the respective docs that @dom-white added? Happy to review your PR :)
Hi @dom-white, thanks for sharing this trick. I used this setting and set
🤔 The feature calculators are only doing number crunching and no IO, so I don't see why there wouldn't be 100% usage per core. Is the data you are using maybe close to your memory limit, so your OS needs to swap all the time?
Hi, first of all, apologies in advance for using a bug report instead of Discussions and feature requests. I posted a request there a while back with no activity.
Please add support for CUDA pandas (cuDF) to accelerate things. I have an NVIDIA GPU that sits idle while I wait 3 hours each time I change something. I believe lots of folks here are in the same situation.
Please, make it happen.