
Understanding dask and batching (async discussion) #13

Open
SammyAgrawal opened this issue Aug 9, 2024 · 2 comments

Comments

@SammyAgrawal

Wanted to open a thread to inquire about best practices regarding dask chunking.

[Screenshot, 2024-08-09 3:58 PM]

OK, imagine you have ingested some dataset that is over 100 GB, so it definitely won't fit in memory. You want to train an ML model on this dataset.

Are there any dask optimizations for this process?

Ran a simple test:

import random
import time

# ds_out is the dask-backed dataset ingested above
sizes = [1, 2, 4, 8, 16, 32, 64, 128]
times = []
batches = []
for sz in sizes:
    start = random.randint(0, 5000)                     # random offset along time
    now = time.time()
    batch = ds_out.isel(time=slice(start, start + sz))  # lazy selection
    batch.load()                                        # force it into memory
    print(batch.dims)
    times.append(time.time() - now)
    batches.append(batch)

I was surprised that batch size seemingly had no effect on load time.

[Screenshot, 2024-08-09 4:00 PM]
@SammyAgrawal (Author)

Questions top of mind:

  • It seems that if you iterated over the dataset and did this, eventually you would have loaded everything and the kernel would crash. Can you "unload" data so that once a batch is processed, it gets garbage collected? If you overwrite the "batch" variable, will the old data be garbage collected automatically and the memory freed?
  • Does loading in line with the existing chunk dimensions matter? I.e., does the "start" affect load times if you try to load across chunk boundaries?
  • If you use multiprocessing and spawn multiple processes, how does Dask handle loading across processes? How is data balanced across N processes?

@jbusecke (Contributor)

Couple of comments:

  • I think you are always only loading the first chunk, which would explain why the load time does not change significantly. The chunk size in time is 1500:
[Screenshot showing the dataset's chunk structure, with a time chunk size of 1500]

This also means you are not using any parallelism (try to record the CPU usage while loading, I bet it never exceeds 100%).
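
To check that, here is a minimal sketch using dask's local diagnostics (this assumes ds_out is computed with the default local scheduler; ResourceProfiler does not work with a distributed cluster):

from dask.diagnostics import ResourceProfiler

# Poll CPU and memory every 0.25 s while the load runs
with ResourceProfiler(dt=0.25) as rprof:
    ds_out.isel(time=slice(0, 64)).load()

# Each result is (time, mem, cpu); cpu is a percentage, so >100 means more than one core was busy
print("peak CPU usage:", max((r.cpu for r in rprof.results), default=0.0), "%")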

Finally, there might be some caching going on here, which could explain the fluctuations in load time, though those might also just be noise. Bottom line: you should use bigger batches! Insert Jaws meme.
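
For example (a sketch, assuming ds_out is dask-backed and the time chunk size really is 1500 as in the screenshot above), sizing the batch to a full chunk gives every read a whole chunk's worth of work:

# ds_out.chunks maps each dimension to its tuple of chunk sizes
time_chunk = ds_out.chunks["time"][0]                  # e.g. 1500
batch = ds_out.isel(time=slice(0, time_chunk)).load()  # one full chunk per read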

It seems that if you iterated over the dataset and did this, eventually you would have loaded everything and the kernel would crash. Can you "unload" data so that once a batch is processed, it gets garbage collected? If you overwrite the "batch" variable, will the old data be garbage collected automatically and the memory freed?

I think as long as you overwrite the object you are fine, and the old data will be garbage collected.
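
Something like this sketch should keep memory roughly constant (the del/gc.collect() calls are usually unnecessary, but make the intent explicit):

import gc

batch_size = 1500  # match the chunking along time
for start in range(0, ds_out.sizes["time"], batch_size):
    batch = ds_out.isel(time=slice(start, start + batch_size)).load()
    # ... train on batch ...
    del batch     # or simply let the next iteration rebind the name
    gc.collect()  # optional; forces the freed chunks to be reclaimed right away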

Does loading in line with the existing chunk dimensions matter? I.e., does the "start" affect load times if you try to load across chunk boundaries?

What matters most (I think) is how many chunks you have to load for a given slice. If you cross chunk boundaries, you will load every chunk you touch into memory.
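
A rough way to see that (hypothetical helper, assuming ds_out is dask-backed so ds_out.chunks is populated):

import numpy as np

def chunks_touched(ds, dim, start, size):
    # cumulative chunk boundaries along `dim`, e.g. [0, 1500, 3000, ...]
    edges = np.cumsum((0,) + tuple(ds.chunks[dim]))
    stop = start + size
    # a chunk is touched if the slice overlaps its [edges[i], edges[i+1]) interval
    return int(np.sum((edges[:-1] < stop) & (edges[1:] > start)))

print(chunks_touched(ds_out, "time", start=1490, size=32))  # crosses a boundary -> 2 chunks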

If you use multiprocessing and spawn multiple processes, how does Dask handle loading across processes? How is data balanced across N processes?

This might be a good read.
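
As a minimal sketch (not from the linked post), starting a local dask.distributed cluster with worker processes makes subsequent loads run across those processes:

from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1)  # 4 local worker processes
# With a Client active, .load()/.compute() on ds_out is scheduled across the workers,
# and dask balances the chunk reads between them.
batch = ds_out.isel(time=slice(0, 1500)).compute()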
