
Integration with deep learning frameworks #268

Open
lesteve opened this issue Jul 3, 2018 · 21 comments
Labels
Roadmap Larger, self-contained pieces of work.

Comments

@lesteve
Member

lesteve commented Jul 3, 2018

Collecting related thoughts from #210.

@stsievert in #210 (comment)

Especially with PyTorch. It certainly feels like there should be an integration with PyTorch because it has torch.distributed and torch.multiprocessing

@stsievert in #210 (comment)

Optimization:

  • I see these algs (e.g., Hogwild) as exploiting dask's distributed architecture
  • These will require a parameter server. Can we make this general and integrate with (for example) CuPy/Chainer and PyTorch?

@TomAugspurger in #210 (comment)

Dask-Tensorflow

  • Review new datasets API, anything we should do there?
@TomAugspurger
Member

Thanks @lesteve!

I'm relatively unfamiliar with these libraries, especially with their distributed runtimes, where Dask may be most useful.

I'd be curious to hear from people who have experience here.

@mrocklin
Member

mrocklin commented Jul 3, 2018

I'm also unfamiliar with using any modern deep learning library. I'd love to hear from people who do use them, want to use them in a distributed way, and have experienced some pain while trying to do so. (also cc'ing @bnaul @seibert)

My current understanding is that people do the following:

  1. Build up a model. For this they use TensorFlow, Keras, PyTorch, etc. I don't think Dask has any role to play here

  2. Choose an optimizer, like AdaGrad. Dask might play a role here, or it might modify this optimizer, similar to the approach taken in Horovod

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # Build model...
    loss = ...
    opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
    
    # Add Horovod Distributed Optimizer
    opt = hvd.DistributedOptimizer(opt)
  3. Load and preprocess data. NumPy/Pandas is probably active here, maybe Dask as well

  4. Hand that data to the model/optimizer for training (a rough sketch of steps 3 and 4 follows below)
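
As an illustration of steps 3 and 4, here is a minimal, hedged sketch using toy data: prepare a Dask DataFrame, turn it into arrays, and hand concrete in-memory arrays to whatever framework owns the model (the column names and the commented-out fit call are illustrative, not an existing API):

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Toy stand-in for steps 3 and 4; columns and sizes are illustrative.
    pdf = pd.DataFrame({"x0": range(100), "x1": range(100), "label": [0, 1] * 50})
    ddf = dd.from_pandas(pdf, npartitions=4).dropna()

    X = ddf[["x0", "x1"]].to_dask_array(lengths=True)
    y = ddf["label"].to_dask_array(lengths=True)

    # Most frameworks expect in-memory arrays, so the simplest hand-off is to compute:
    X_local, y_local = dask.compute(X, y)
    # model.fit(X_local, y_local)   # Keras / PyTorch / etc. training happens here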

@mrocklin
Member

mrocklin commented Jul 3, 2018

Personally, I'm curious to see what workflows that combine Dask and a deep learning framework would look like. Sharing those is something productive that people can do now and might help to focus the discussion.

@lesteve
Member Author

lesteve commented Jul 4, 2018

I guess this is not really integration per se, so I opened a separate issue, #281, about the use case that some people around me are trying to tackle by combining dask and deep learning frameworks.

@mrocklin
Member

mrocklin commented Jul 4, 2018 via email

@TomAugspurger TomAugspurger added the Roadmap Larger, self-contained pieces of work. label Aug 16, 2018
@arthurmensch

Currently, PyTorch's only documented way of doing Hogwild training is through explicit Process() calls, with child processes sharing the inherited model. See for instance https://github.com/pytorch/examples/blob/master/mnist_hogwild/main.py.

This is not compatible with the modern approach of using dask pools, or even multiprocessing pools. I don't really see a simple way around that.
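
For reference, a minimal sketch of that Process()-based Hogwild pattern (toy model, data, and loss; this just mirrors the structure of the linked example):

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torch.multiprocessing as mp

    def train(model):
        # Each worker optimizes the shared model in place (Hogwild-style, no locks).
        opt = optim.SGD(model.parameters(), lr=0.01)
        for _ in range(100):
            x = torch.randn(32, 10)           # placeholder batch
            loss = model(x).pow(2).mean()     # placeholder loss
            opt.zero_grad()
            loss.backward()
            opt.step()

    if __name__ == "__main__":
        model = nn.Linear(10, 1)
        model.share_memory()                  # parameters live in shared memory
        workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
        for p in workers:
            p.start()
        for p in workers:
            p.join()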

@corrado9999

Here is a use case from my professional experience where I believe dask-ml would be great.

I work with remote sensing data. If you are thinking of Google Maps, that is only part of the story. Getting to the point: we have images of several gigabytes (tens of thousands of pixels, tens of channels) acquired tens of times each year. Thanks to NASA and ESA (the European Space Agency), we have 4 great pools of images available for free: MODIS, LANDSAT-8, Sentinel-1 and Sentinel-2.

Now, I am facing the problem of training some deep network on these data. They are all available on AWS (https://registry.opendata.aws/?search=satellite%20imagery), so it makes perfect sense to avoid downloading them and to train the model directly on the cloud. As features we use bands from one (or more!) data source, but time must also be taken into account in some way (this is not something dask can help with, but it could somehow affect the design).

The training is usually supervised, the target being classification (water, urban, grassland, cropland, ...). We are interested in 3 main kinds of classification:

  1. Pure pixel based: each pixel is assigned to one class, without knowing where it is located
  2. Pixel based with neighborhood: this is where CNNs come into play, because we want to exploit information coming from nearby pixels in order to estimate some spatial features
  3. Object based: we aggregate pixels based on some vector data (known a priori)

Please take into account that we have a lot of data, and that data storage usually costs more than processing. So, in some cases it may be preferable to perform data preprocessing/normalization at each batch, while in other cases we may prefer to cache the results (instance based). For the same reason, we would like to minimize data movement within and across epochs, yet mix batches coming from different images in as many ways as possible.

Sorry for the long post, I hope I have been clear enough about my use case.

@joemcglinchy

I also primarily use remote sensing data, but my use case is more at the model inference stage. Suppose you have already trained a model to your satisfaction. This is typically done with smaller samples, which absolutely could be supported by dask using a windowed approach. Once the model is trained, I would like to apply it to a remote sensing dataset, which is typically quite large, think tens of thousands of rows and columns. Loading that full dataset into memory is often problematic, so I think dask could help here as well.
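
A minimal sketch of that windowed-inference idea with dask.array.map_blocks (the model, raster shape, and chunking below are stand-ins, not anything from an existing pipeline):

    import numpy as np
    import dask.array as da
    from sklearn.linear_model import LogisticRegression

    # Stand-in for an already-trained model; any estimator with a
    # NumPy-friendly predict() would do.
    model = LogisticRegression().fit(
        np.random.random((100, 8)), np.random.randint(0, 3, 100)
    )

    # Illustrative large multi-band raster, processed window by window.
    raster = da.random.random((20_000, 20_000, 8), chunks=(2_000, 2_000, 8))

    def predict_block(block):
        pixels = block.reshape(-1, block.shape[-1])   # (n_pixels, n_bands)
        labels = model.predict(pixels)                # per-pixel class labels
        return labels.reshape(block.shape[:-1])

    classified = raster.map_blocks(predict_block, drop_axis=2, dtype=np.int64)
    classified[:2_000, :2_000].compute()              # evaluate a single window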

@TomAugspurger
Member

TomAugspurger commented Jan 17, 2019 via email

@stsievert
Member

dask/distributed#2581 is on training a PyTorch model with Dask.

@AlbertDeFusco

AlbertDeFusco commented Jun 12, 2019

I've made progress using a Dask DataFrame with the Keras .fit_generator() method to mimic dask_ml.wrappers.Incremental for Scikit-Learn.

https://anaconda.org/defusco/keras-dask/notebook

I'm close to getting dask_ml.wrappers.ParallelPostFit and keras.wrappers.scikit_learn.KerasClassifier working with scikit-learn pipelines.
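
For context, here is a minimal sketch of the partition-per-batch idea with toy data (columns, model, and sizes are illustrative; newer Keras accepts generators directly in model.fit, which stands in for fit_generator here):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    from tensorflow import keras

    # Toy data and model; columns, shapes, and sizes are illustrative.
    pdf = pd.DataFrame({
        "x0": np.random.random(1000),
        "x1": np.random.random(1000),
        "y": np.random.randint(0, 2, 1000),
    })
    ddf = dd.from_pandas(pdf, npartitions=10)

    model = keras.Sequential([
        keras.Input(shape=(2,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    def partition_batches(ddf, feature_cols, label_col):
        # Yield one (X, y) batch per Dask DataFrame partition, looping forever
        # as Keras expects from a generator used with steps_per_epoch.
        while True:
            for part in ddf.to_delayed():
                pdf = part.compute()
                yield pdf[feature_cols].to_numpy(), pdf[label_col].to_numpy()

    model.fit(
        partition_batches(ddf, ["x0", "x1"], "y"),
        steps_per_epoch=ddf.npartitions,
        epochs=2,
    )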

@TomAugspurger
Member

TomAugspurger commented Jun 12, 2019 via email

@AlbertDeFusco

@TomAugspurger ,

  1. Yes, it would be good to persist (if possible) any data given to DaskGenerator.

  2. No, I have not attempted this with Distributed.

@CMCDragonkai

I was thinking of using Dask-ML to replace the multi-GPU model that exists in Keras, so that we would have a Dask worker per GPU instead of per core as is usual.

@lesteve
Member Author

lesteve commented Oct 15, 2019

Not sure what you are trying to do exactly, and I am not an expert on Keras, so I just want to comment on this part:

So that we would have a Dask worker per GPU instead of per core as is usual.

There is a way to have one dask worker per GPU by setting the CUDA_VISIBLE_DEVICES environment variable in the worker processes, and people have been doing it for some time (for example, there is a SciPy 2016 video about combining dask and numba). A more recent approach is https://github.com/rapidsai/dask-cuda.
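
For example, a one-worker-per-GPU local cluster with dask-cuda looks roughly like this:

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # One Dask worker per visible GPU; dask-cuda sets CUDA_VISIBLE_DEVICES
    # for each worker so frameworks see exactly one device.
    cluster = LocalCUDACluster()
    client = Client(cluster)
    print(client)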

@CMCDragonkai

Hmm, it's a bit more complicated than that. We would also need a parameter server on the CPU side to accumulate weight updates from all the models. But that would make dask-ml a full ML framework akin to Keras, and instead of going dask -> keras -> tensorflow, it would be dask -> tensorflow.
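
To make the idea concrete, here is a toy sketch of a parameter server built on Dask actors (this is not dask-ml or Keras API; the gradient and sizes are placeholders):

    import numpy as np
    from dask.distributed import Client

    class ParameterServer:
        # Toy parameter server held on one worker as a Dask actor;
        # training tasks push gradients and pull the current weights.
        def __init__(self, n_params):
            self.weights = np.zeros(n_params)

        def push(self, grad, lr=0.01):
            self.weights -= lr * grad

        def pull(self):
            return self.weights

    def worker_step(ps):
        w = ps.pull().result()             # actor method calls return futures
        grad = np.random.random(len(w))    # placeholder for a real gradient
        ps.push(grad).result()
        return w

    if __name__ == "__main__":
        client = Client()                  # local cluster for illustration
        ps = client.submit(ParameterServer, 10, actor=True).result()
        futures = [client.submit(worker_step, ps, pure=False) for _ in range(4)]
        client.gather(futures)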

@skeller88

[quoting @corrado9999's remote sensing use case from earlier in this thread]

@corrado9999, have you developed a good ML data preparation workflow since your post? I have a similar project I'm working on.

The problem I see is that batches are generated randomly, and that requires random selections from the dataset. As long as the dataset is not in memory, those selections will be slow. But I think https://www.tensorflow.org/guide/data_performance provides ways around that with prefetching.

@corrado9999

@skeller88, no, unfortunately I have not worked on this (yet).

The problem I see is that batches are generated randomly, and that requires random selections from the dataset. As long as the dataset is not in memory, those selections will be slow.

Sure, we need a pipeline in order to perform such slow operations while the model is training. The TF API you pointed out is actually very interesting; I'm not sure how it could be integrated into this framework, though.

@skeller88

Gotcha. I ended up using the tensorflow datasets API directly. It has batch prefetching, which makes performance a bit better. See my code here.
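
For anyone landing here, the prefetching pattern referred to above looks roughly like this with tf.data (the data below is random, just to keep the sketch self-contained):

    import numpy as np
    import tensorflow as tf

    # Toy input pipeline with shuffling, batching, and prefetching.
    features = np.random.random((1024, 8)).astype("float32")
    labels = np.random.randint(0, 3, 1024)

    dataset = (
        tf.data.Dataset.from_tensor_slices((features, labels))
        .shuffle(buffer_size=1024)
        .batch(32)
        .prefetch(tf.data.AUTOTUNE)   # overlap input preparation with training
    )
    # model.fit(dataset, epochs=5)    # hand the dataset straight to Keras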

@stsievert
Member

One straightforward integration would be hooking into TensorBoard's HParams dashboard to visualize how well different parameters performed. Here's a description of basic usage: https://www.tensorflow.org/tensorboard/hyperparameter_tuning_with_hparams

I think this would amount to writing logs in a specific format. Here's how to get some example logs and run TensorBoard on them (commands pulled from the post above):

$ wget -q 'https://storage.googleapis.com/download.tensorflow.org/tensorboard/hparams_demo_logs.zip'
$ unzip -q hparams_demo_logs.zip -d logs/hparam_demo
$ tensorboard --logdir logs/hparam_demo
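
Writing such logs from our own code would look roughly like this with the hparams plugin (the hyperparameter values and metric below are placeholders):

    import tensorflow as tf
    from tensorboard.plugins.hparams import api as hp

    # Log one hyperparameter trial in the format the HParams dashboard reads.
    hparams = {"learning_rate": 0.01, "batch_size": 32}
    with tf.summary.create_file_writer("logs/hparam_demo/run-0").as_default():
        hp.hparams(hparams)                          # record hyperparameter values
        tf.summary.scalar("accuracy", 0.91, step=1)  # record the resulting metric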

@stsievert
Member

Dask clusters can now be used with PyTorch's distributed framework, thanks to the work of Saturn Cloud on https://github.com/saturncloud/dask-pytorch-ddp (see "Getting Started with Distributed Data Parallel" for an example). It's similar to the (now archived) dask-tensorflow. From their README:

dask-pytorch-ddp is a Python package that makes it easy to train PyTorch models on Dask clusters using distributed data parallel. The intended scope of the project is

  • bootstrapping PyTorch workers on top of a Dask cluster
  • using distributed data stores (e.g., S3) as normal PyTorch datasets
  • mechanisms for tracking and logging intermediate results, training statistics, and checkpoints.

cc @hhuuggoo and @skirmer.
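
For orientation, a rough sketch of the pattern (hedged: dispatch.run and the exact bootstrap details should be checked against the project's README; the scheduler address, model, and loop are placeholders):

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from dask.distributed import Client
    from dask_pytorch_ddp import dispatch

    def train():
        # dask-pytorch-ddp bootstraps the torch.distributed process group on
        # each Dask worker (per its stated scope); model and loop are toys.
        model = DDP(nn.Linear(10, 1))
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(100):
            x = torch.randn(32, 10)
            loss = model(x).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

    client = Client("tcp://scheduler:8786")   # placeholder scheduler address
    futures = dispatch.run(client, train)     # run train() on every worker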
