Integration with deep learning frameworks #268
Thanks @lesteve! I'm relatively unfamiliar with these libraries, especially with their distributed runtimes, where Dask may be most useful. I'd be curious to hear from people who have experience here.
I'm also unfamiliar with using any modern deep learning library. I'd love to hear from people who do use them, want to use them in a distributed way, and have experienced some pain while trying to do so. (also cc'ing @bnaul @seibert) My current understanding is that people do the following:
Personally, I'm curious to see what workflows that might use Dask together with a deep learning framework would look like. That is something productive that people can do now that might help to focus the discussion.
I guess this is not really integration per se, so I opened a different issue, #281, about the use case that some people around me are trying to tackle by combining Dask and deep learning frameworks.
Questions like that are welcome. It's nice to identify issues that come up in practice, even if they are less research-y.
Currently, this is not compatible with the modern approach of using Dask pools, or even multiprocess pools. I don't really see a simple way around that.
My professional experience is a case in which I believe dask-ml would be great. I work with remote sensing data. If you are thinking of Google Maps, that is only part of the story. Getting to the point: we have images of several gigabytes (tens of thousands of pixels, tens of channels) acquired tens of times each year. Thanks to NASA and ESA (the European Space Agency), we have four great pools of images available for free: MODIS, LANDSAT-8, Sentinel-1, and Sentinel-2. Now I am facing the issue of training some deep network on these data. They are all available on AWS (https://registry.opendata.aws/?search=satellite%20imagery), so it makes perfect sense to avoid the downloading and train the model directly in the cloud. As features we use bands from one (or more!) data source, but time must also be taken into account in some way (this is not something Dask can help with, but it could somehow affect the design). The training is usually supervised, the target being classification (water, urban, grassland, cropland, ...). We are also interested in three main kinds of classification:
Please take into account that we have a lot of data, and that data storage usually costs more than processing. So, in some cases it may be preferable to perform data preprocessing/normalization at each batch; in other cases we may prefer to cache the results (instance based). For the same reason, along and across epochs we would like to minimize data movement, yet mix batches coming from different images in as many ways as possible. Sorry for the long post; I hope I have been clear enough about my use case.
I also use primarily remote sensing data, but my use case is more on the model inference stage. Suppose you have already trained a model to your satisfaction. This is typically done with smaller samples, which absolutely could be supported by Dask using a windowed approach. Once the model is trained, I would like to apply it to a remote sensing dataset, which is typically quite large: think tens of thousands of rows and columns. Loading that full dataset into memory is often problematic, so I think Dask could help here as well.
Thanks Joseph. Based on your description, dask/dask-examples#35 sounds quite similar to your workflow. It'd be nice if we could develop that into a fully-formed example.
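For concreteness, a minimal sketch of that kind of blockwise inference with dask.array.map_blocks might look like the following; `image` (a band-first dask array chunked only along the spatial axes) and the fitted `model` are assumptions, not taken from the comments above:

```python
import dask.array as da

# image: dask array of shape (bands, height, width), chunked spatially so
# every chunk still carries all bands.
def predict_block(block):
    bands, h, w = block.shape
    flat = block.reshape(bands, -1).T            # (pixels, bands)
    labels = model.predict(flat)                 # hypothetical fitted classifier
    return labels.reshape(h, w).astype("uint8")

classes = da.map_blocks(predict_block, image, drop_axis=0, dtype="uint8")
classes.to_zarr("classes.zarr")  # stream the result to disk instead of memory
```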
dask/distributed#2581 is on training a PyTorch model with Dask.
I've made progress using a Dask DataFrame with the Keras .fit_generator() method to mimic Incremental for Scikit-Learn: https://anaconda.org/defusco/keras-dask/notebook. I'm close to getting dask_ml.wrappers.ParallelPostfit and keras.wrappers.scikit_learn.KerasClassifier() working with sklearn pipelines.
Thanks Albert, looks interesting. I think providing a DaskGenerator, or at least documentation on how to write one, would be very useful.
A couple of questions:
1. With your DaskGenerator, if I do `gen = DaskGenerator(X, y); gen[0]; gen[1]`, then I *think* you'll end up redoing a bunch of computation. Does that sound correct? In this example I think you would end up refitting the entire StandardScaler, splitting, etc. The solution is to persist the transformed data, either in memory on the cluster (perhaps as part of DaskGenerator.__init__) or on disk, and then the data passed to DaskGenerator would be loaded from disk. (Let me know if I'm not making sense.)
2. Are you using distributed at all? It doesn't look like it. I ask because I've never gotten Keras / TensorFlow to work properly in multiple processes.
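For reference, a minimal sketch of what such a DaskGenerator could look like (a guess at the pattern, not the notebook's actual implementation); it assumes `X` and `y` are dask arrays chunked along the first axis and persists them up front so the preprocessing graph is not recomputed for every batch:

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class DaskGenerator(Sequence):
    def __init__(self, X, y):
        # Persist once so each batch only materializes one chunk instead of
        # re-running the whole preprocessing graph (scaling, splitting, ...).
        self.X_blocks = X.persist().to_delayed().ravel()
        self.y_blocks = y.persist().to_delayed().ravel()

    def __len__(self):
        return len(self.X_blocks)

    def __getitem__(self, i):
        # One dask chunk becomes one Keras batch.
        return (np.asarray(self.X_blocks[i].compute()),
                np.asarray(self.y_blocks[i].compute()))

model.fit_generator(DaskGenerator(X, y), epochs=5)  # model: hypothetical Keras model
```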
I was thinking of using Dask-ML to replace the multi-GPU model that exists in Keras, so that we have a Dask worker per GPU instead of per core, as is usually the case.
Not sure what you are trying to do exactly, and I am not an expert on Keras, so I just want to comment on this part:
There is a way to have a dask-worker-per-GPU setup.
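For example, assuming dask-cuda is available in the environment, a one-worker-per-GPU cluster can be started roughly like this:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()  # launches one Dask worker per visible GPU
client = Client(cluster)
```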
Hmm, it's a bit more complicated than that. We would also need a parameter server in CPU land so that it accumulates weight updates from all the models. But that would make
@corrado9999, have you developed a good ML data preparation workflow since your question? I have a similar project I'm working on. The problem I see is that batches are generated randomly, which requires random selections from the dataset. As long as the dataset is not in memory, selections will be slow. But I think https://www.tensorflow.org/guide/data_performance provides ways around that with prefetching.
@skeller88, no, unfortunately I have not worked on this (yet).
Sure, we need a pipeline in order to perform such slow actions while the model is training. The TF API you pointed out is actually very interesting; I'm not sure how it could be integrated into this framework, though.
Gotcha. I ended up using the TensorFlow datasets API directly. It has batch prefetching, which makes performance a bit better. See my code here.
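Not the linked code, but the prefetching pattern in question is roughly the following sketch; `file_paths` and `load_chip` are placeholders for however the samples are actually stored and read:

```python
import tensorflow as tf

ds = (tf.data.Dataset.from_tensor_slices(file_paths)  # paths to image chips
      .shuffle(buffer_size=1024)                      # randomize sample order
      .map(load_chip, num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .batch(32)
      .prefetch(tf.data.experimental.AUTOTUNE))       # overlap I/O with training

model.fit(ds, epochs=10)
```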
One straightforward integration would be hooking in TensorBoard's HParams dashboard to visualize how well different parameters performed. Here's a description of basic usage: https://www.tensorflow.org/tensorboard/hyperparameter_tuning_with_hparams
I think this would amount to writing logs in a specific format. Here's how to get some example logs and run TensorBoard on those logs (commands pulled from the post above):
$ wget -q 'https://storage.googleapis.com/download.tensorflow.org/tensorboard/hparams_demo_logs.zip'
$ unzip -q hparams_demo_logs.zip -d logs/hparam_demo
$ tensorboard --logdir logs/hparam_demo
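As a sketch of what "writing logs in a specific format" amounts to with the hparams plugin (following the tutorial above; `train_test_model` is a placeholder for the actual training routine):

```python
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

hparams = {"learning_rate": 1e-3, "batch_size": 64}
with tf.summary.create_file_writer("logs/hparam_demo/run-0").as_default():
    hp.hparams(hparams)                        # record this trial's parameters
    accuracy = train_test_model(hparams)       # placeholder training routine
    tf.summary.scalar("accuracy", accuracy, step=1)
```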
Dask clusters can now be used with PyTorch's distributed framework, thanks to the work of Saturn Cloud at https://github.com/saturncloud/dask-pytorch-ddp (see "Getting Started with Distributed Data Parallel" for an example). It's similar to the (now archived) dask-tensorflow. From their README,
Collecting related thoughts from #210.
@stsievert in #210 (comment)
@stsievert in #210 (comment)
@TomAugspurger in #210 (comment)