Initial results #26
-
Thanks so much for sharing! Some quick thoughts:
hehe, yeah, I know your pain! If it's any help: we decided to pre-prepare a set of training batches ahead of time, so that during training literally all the code has to do is load pre-prepared batches off disk. This is nice because hopefully you can fit your pre-prepared batches onto an SSD. Here's our code for pre-preparing batches, although I'd guess you'd be better off writing your own code from scratch, because our code is quite specialised to nowcasting PV power generation.
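If it's useful, here's a rough sketch of the pattern (not our actual code; the .npz layout, the `prepare_batch` helper, and all names are made up for illustration):

```python
import numpy as np
import torch
from pathlib import Path
from torch.utils.data import Dataset

# Offline step (run once): save each prepared batch to disk, ideally on an SSD.
# `prepare_batch` stands in for whatever expensive preprocessing you currently do per batch.
def save_prepared_batches(prepare_batch, n_batches, out_dir="prepared_batches"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(n_batches):
        x, y = prepare_batch(i)                              # numpy arrays, e.g. (B, T, C, H, W)
        np.savez(str(out / f"batch_{i:06d}.npz"), x=x, y=y)

# Training step: the only work left at train time is reading files back off disk.
class PreparedBatchDataset(Dataset):
    def __init__(self, batch_dir="prepared_batches"):
        self.files = sorted(Path(batch_dir).glob("batch_*.npz"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        data = np.load(self.files[idx])
        return torch.from_numpy(data["x"]), torch.from_numpy(data["y"])
```

The point is just that the DataLoader workers end up I/O-bound on a fast SSD rather than CPU-bound on preprocessing.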
Yes, I definitely think it would be beneficial to implement a random lead time on a per-sample basis! For example, in Aribandi et al. 2021, the authors show that training an NLP model on multiple tasks works best if each batch contains a random sample of the tasks 🙂
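Something like this could work when assembling a batch (just a sketch; the tensor layout, `max_lead`, and how the lead time conditions the model are all assumptions):

```python
import torch

def pick_random_leadtimes(x, y_all, max_lead):
    """x: (B, T_in, C, H, W) input frames; y_all: (B, max_lead, C, H, W) all future frames.
    Picks one lead time per sample, so a single batch mixes short and long forecasts."""
    batch_size = y_all.shape[0]
    leads = torch.randint(0, max_lead, (batch_size,))    # one random lead time per sample
    y = y_all[torch.arange(batch_size), leads]           # (B, C, H, W) target at that lead time
    return x, y, leads                                    # `leads` can then condition the model, e.g. via ConditionTime
```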
If you're using a relatively small number of bins, it might be worth spacing the bins so that, on average, there's a uniform probability of landing in any given bin.
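For example (a sketch, assuming the training rain rates are available as a flat numpy array; names are hypothetical):

```python
import numpy as np

def uniform_probability_bin_edges(rain_rates, n_bins):
    """Place bin edges at empirical quantiles of the training data, so that on
    average each bin receives the same fraction of pixels."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(rain_rates, quantiles)
    return np.unique(edges)  # duplicate edges collapse where many pixels share a value (e.g. zero rain)

# usage (hypothetical names):
# edges = uniform_probability_bin_edges(train_pixels.ravel(), n_bins=64)
# classes = np.digitize(pixels, edges[1:-1])
```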
-
I have now trained on 8 GPUs for 48 hours. The training is still running, but check out some of the results and my questions: w&b report
Questions:
-
I don't know if this is anything I should worry about, but I think it's worth pointing out. If we look at the fourth plot in the w&b report, it shows the number of pixels in each respective class. We can see that the decimal precision of the dBZ data is not fine enough to fill every bin; that is, some bins are completely empty because of the way dBZ is transformed into mm/h (proportional to 10**dBZ). Should I worry about this? My reasoning is that even if it's a dumb way to partition the data, it shouldn't affect the final result, since all of those classes will just be mapped to 0 (see the small binning sketch after the list of changes below).
I have read the DGMR report; I assume you meant "Skilful precipitation nowcasting using deep generative models of radar". The random sampling technique is a bit complicated, so I will start by trying class-balancing weights first. I will run the model again with the following changes:
Instead, I now sort with respect to the number of pixels in Y that fall in a rainfall (non-zero) bin. This way the model will not be biased to guess that rainfall decreases with time, since it counts pixels across all 60 lead times.
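To illustrate the empty-bin effect mentioned above, here is a small sketch (the 0.5 dBZ step and the Marshall-Palmer relation Z = 200 * R**1.6 are just illustrative assumptions, not necessarily what my data uses):

```python
import numpy as np

# dBZ is stored with a finite step (assume 0.5 dBZ here), so only discrete values occur.
dbz_values = np.arange(0.0, 60.0, 0.5)

# Reflectivity to rain rate via Marshall-Palmer: Z = 200 * R**1.6, with Z = 10**(dBZ / 10).
rain_mm_h = (10.0 ** (dbz_values / 10.0) / 200.0) ** (1.0 / 1.6)

# Partition the mm/h range into 128 fixed-width classes, as a naive uniform binning would.
edges = np.linspace(0.0, rain_mm_h.max(), 129)
classes = np.digitize(rain_mm_h, edges[1:-1])

# Classes that fall between consecutive discrete dBZ steps can never be hit.
print(f"{np.unique(classes).size} of 128 classes are reachable")
```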
-
Hello,
I have deployed the model on a larger chunk of my dataset now. Here is my model:
| Name | Type | Params
0 | image_encoder | TimeDistributed | 1.7 M
1 | ct | ConditionTime | 0
2 | temporal_enc | TemporalEncoder | 3.5 M
3 | position_embedding | AxialPositionalEmbedding | 14.3 K
4 | temporal_agg | Sequential | 4.2 M
5 | head | Conv2d | 7.7 K
9.4 M Trainable params
0 Non-trainable params
9.4 M Total params
37.676 Total estimated model params size (MB)
I am using the default image encoder, a ConvGRU with a hidden size of 256, and 8 attention layers with 16 attention heads each.
I wanted to discuss a little bit about what to expect. First, let me show you some overfitting experiments I did with only two training samples:
With only 1 lead time, the result is easily overfitted after 50 epochs:
With 5 lead times it's harder to overfit, but it gets something done (300 epochs):
Now let's look at a run with the full network; you can find the validation and training loss at w&b. This is a run of 400 epochs, over 4 hours on 8 GPUs in parallel. As you can see, the network has not yet overfit, since the validation loss is not increasing. This run was done with only 280 training samples, but I have a lot more data available; I am struggling to implement an efficient way to load all the data since it's so big (work in progress).
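One possible approach for the loading problem (just a sketch, assuming one .npy file per sample; names and the frame split are hypothetical) is to memory-map the arrays so only the frames a batch actually needs are read into RAM:

```python
import numpy as np
import torch
from pathlib import Path
from torch.utils.data import Dataset, DataLoader

class LazyRadarDataset(Dataset):
    """Reads samples lazily from per-sample .npy files; nothing is held in memory
    until __getitem__ touches it, so dataset size is bounded by disk, not RAM."""
    def __init__(self, sample_dir):
        self.files = sorted(Path(sample_dir).glob("*.npy"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        arr = np.load(self.files[idx], mmap_mode="r")  # memory-mapped; only the slices below are read
        x = np.array(arr[:4])    # e.g. first 4 frames as the input sequence (copied into RAM)
        y = np.array(arr[4:])    # remaining frames as the target sequence
        return torch.from_numpy(x), torch.from_numpy(y)

# loader = DataLoader(LazyRadarDataset("samples/"), batch_size=8, num_workers=8, pin_memory=True)
```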
I wanted to discuss the following:
Here are some results of the network:
input:
y:
y_hat: