Understanding dask and batching (async discussion) #13
Comments
Questions top of mind:
Couple of comments:
This also means you are not using any parallelism (try recording CPU usage while loading; I bet it never exceeds 100%). Finally, there might be some caching going on here, which could explain the fluctuations in load time, though those might also be random. Bottom line: you should use bigger batches! Insert jaws meme
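A rough sketch of the parallelism point (not from this thread; the array here is a toy stand-in, not the actual dataset): a batch that spans several chunks gives dask several independent load tasks to run in parallel, while a batch that fits inside one chunk is a single task on a single thread.

```python
# Hypothetical sketch: why bigger batches can enable parallelism in dask.
import dask.array as da

# Toy stand-in for a large on-disk array: 1M rows, chunked in blocks of 100k.
arr = da.random.random((1_000_000, 8), chunks=(100_000, 8))

small_batch = arr[:1_000]     # touches 1 chunk -> 1 load task
large_batch = arr[:400_000]   # touches 4 chunks -> 4 tasks can run in parallel

print(small_batch.npartitions)  # 1
print(large_batch.npartitions)  # 4
```

With the threaded or distributed scheduler, those four tasks can load concurrently, which is where the >100% CPU usage would come from.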
I think as long as you overwrite the object you are good and the old data will be garbage collected.
What matters most (I think) is how many chunks you have to load initially. If you cross chunk boundaries, you will load into memory every chunk that you touch.
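To illustrate the boundary point (a sketch with a toy array, not the thread's actual data): a slice that crosses a chunk boundary pulls in every chunk it touches, even if it only keeps a fraction of each, so aligning batches to chunk edges avoids redundant reads.

```python
# Hypothetical sketch: slices that cross chunk boundaries touch more chunks.
import dask.array as da

arr = da.random.random((1_000_000,), chunks=(100_000,))

# Batch fully inside one chunk: only that chunk is read.
aligned = arr[0:100_000]
# Same batch size, but crossing a boundary: both touched chunks are read,
# even though only half of each is kept.
crossing = arr[50_000:150_000]

print(aligned.npartitions)   # 1
print(crossing.npartitions)  # 2
```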
This might be a good read.
Wanted to open a thread to inquire about best practices regarding dask chunking.
OK, imagine you have ingested a dataset that is over 100 GB, so it definitely does not fit into memory. You want to train an ML model on this dataset.
Are there any dask optimizations for this process?
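One common pattern (sketched here as an assumption, not something established in this thread) is to stream chunk-sized batches: iterate over the array block by block, materializing only one chunk in memory at a time. The array and `train_step` below are hypothetical stand-ins.

```python
# Hypothetical sketch: stream row-blocks of a larger-than-memory dask array
# into a training step, one chunk at a time.
import dask.array as da

# Toy stand-in for the 100GB+ dataset; row-wise chunks act as batches.
X = da.random.random((400_000, 16), chunks=(100_000, 16))

def train_step(batch):
    # Placeholder for a real model update on an in-memory numpy batch.
    return float(batch.mean())

results = []
for i in range(X.numblocks[0]):      # one row-block at a time
    batch = X.blocks[i].compute()    # materializes only this chunk
    results.append(train_step(batch))
```

Peak memory here is one chunk plus the model, regardless of total dataset size; chunk size then doubles as your effective batch size.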
Ran a simple test:
I was surprised that batch size seemingly had no effect on load time.
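The test code itself wasn't captured above; this is a hypothetical sketch of such a benchmark, using an in-memory toy array rather than the real on-disk dataset:

```python
# Hypothetical sketch: time batch loads at several batch sizes.
import time
import dask.array as da

arr = da.random.random((2_000_000, 8), chunks=(500_000, 8))

timings = {}
for batch_size in (1_000, 10_000, 100_000):
    start = time.perf_counter()
    batch = arr[:batch_size].compute()
    timings[batch_size] = time.perf_counter() - start

# Note: all three slices fall inside the same first chunk, so each load
# touches roughly the same data -- one plausible reason batch size would
# show no effect in a test shaped like this.
```

If every batch size tested is smaller than one chunk, each load reads the same single chunk, which would explain flat timings.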