Add support for training and evaluation on WeatherBench 2 #86
Comments
Hi @jacobbieker, is it possible to work on this one? I would like to try it.
Hi, I was browsing through the repository and WeatherBench 2 as well. It would be amazing if you could share how you want to use these functions and what you want included!
@0xFrama Yes it is! This one could also be split into adding support for different datasets in WeatherBench 2; it might be good to break it into smaller issues, such as supporting the ERA5 dataset for training and HRES for testing, or the like. @aavashsubedi I'd like the ability to load the WeatherBench 2 datasets in a training/eval script for the models in this repo. The functionality would be to load and use a specific dataset from WeatherBench. As they are stored as Zarrs in Google Cloud, the change should be fairly small: xarray can open the Zarrs directly, and they can then be streamed in for evaluation. Including the mean/stddev, or min/max, for normalization would also be ideal. If you both want to work on this, it might work best to split which datasets you each take on. I would propose two different PRs, each adding support for one training/evaluation dataset combination.
That should allow us to use Graph Weather and compare it against other models in WeatherBench 2 more easily.
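For concreteness, here is a minimal sketch of the loading step described above, using the public ERA5 Zarr path that appears later in this thread. It assumes `gcsfs` is installed so that xarray can read `gs://` URLs:

```python
import xarray as xr

# Public WeatherBench 2 ERA5 store. Opening is lazy: only metadata is
# fetched here, and timesteps stream in on demand.
ERA5_PATH = (
    "gs://weatherbench2/datasets/era5/"
    "1959-2022-6h-64x32_equiangular_conservative.zarr"
)
ds = xr.open_zarr(ERA5_PATH)
print(ds.sizes)  # time / level / latitude / longitude dimensions
one_step = ds["temperature"].isel(time=0).load()  # pull a single timestep
```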
While working on PR #87 I did some tests with WeatherBench2 and the ERA5 dataset, editing the run_fulll.py code. I don't know if this is what @jacobbieker had in mind, but maybe this snippet can be helpful:

```python
import apache_beam  # Needs to be imported separately to avoid TypingError
import weatherbench2
import numpy as np
import xarray as xr
from torch.utils.data import Dataset
from torchvision import transforms

# Assumed location of the `const` module: it must provide the ERA5_MEANS /
# ERA5_STD normalization tables, which still need to be computed (see below).
from graph_weather.data import const

# Run the code below to access cloud data on Colab!
from google.colab import auth

auth.authenticate_user()


class XrDataset(Dataset):
    def __init__(self):
        super().__init__()
        obs_path = "gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_conservative.zarr"
        self.data = xr.open_zarr(obs_path)
        # TODO: Check training period
        self.data = self.data.where(self.data["time.year"] < 2018, drop=True)
        self.num_levels = len(self.data["level"])
        self.num_vars = len(self.data.keys())
        # Pressure-level (NWP) variables, normalized per level
        self.NWP_features = [
            "geopotential",
            "specific_humidity",
            "temperature",
            "u_component_of_wind",
            "v_component_of_wind",
            "vertical_velocity",
        ]
        # Surface / auxiliary variables
        self.aux_features = [
            "geopotential_at_surface",
            "land_sea_mask",
            "toa_incident_solar_radiation",
            "toa_incident_solar_radiation_12hr",
        ]

    def __len__(self):
        # __getitem__ pairs timestep `item` with `item + 1`, so the final
        # timestep has no target and is excluded.
        return len(self.data["time"]) - 1

    def __getitem__(self, item):
        start = self.data.isel(time=item)
        end = self.data.isel(time=item + 1)
        # Stack the NWP features for input
        input_data = np.stack(
            [
                (start[f"{var}"].values - np.array(const.ERA5_MEANS[f"{var}"])[:, None, None])
                / (np.array(const.ERA5_STD[f"{var}"])[:, None, None] + 0.0001)
                for var in self.NWP_features
            ],
            axis=-1,
        ).astype(np.float32)
        num_layers, num_lat, num_lon, num_vars = input_data.shape
        input_data = input_data.reshape(num_lat, num_lon, num_vars * num_layers)
        input_data = np.nan_to_num(input_data)
        assert not np.isnan(input_data).any()
        # Stack the non-NWP features for input
        aux_data = np.stack(
            [
                (start[f"{var}"].values - const.ERA5_MEANS[f"{var}"])
                / (const.ERA5_STD[f"{var}"] + 0.0001)
                for var in self.aux_features
            ],
            axis=-1,
        ).astype(np.float32)
        aux_data = np.nan_to_num(aux_data)
        assert not np.isnan(aux_data).any()
        assert input_data.shape[:2] == aux_data.shape[:2]
        # Stack space-time features for input; coordinates are in degrees,
        # so convert to radians before the cyclic encoding
        lat_lons = np.array(np.meshgrid(start.latitude.values, start.longitude.values))
        lat_rad = np.deg2rad(lat_lons[0, :, :])
        lon_rad = np.deg2rad(lat_lons[1, :, :])
        sin_lats = np.sin(lat_rad)
        cos_lats = np.cos(lat_rad)
        sin_lons = np.sin(lon_rad)
        cos_lons = np.cos(lon_rad)
        # Cyclic day-of-year encoding, scaled to radians so it wraps over the year
        day_of_year = 2 * np.pi * start.time.dt.dayofyear.values / 365.0
        sin_day_of_year = np.sin(day_of_year) * np.ones_like(lat_rad)
        cos_day_of_year = np.cos(day_of_year) * np.ones_like(lat_rad)
        space_time_data = np.stack(
            [sin_lats, cos_lats, sin_lons, cos_lons, sin_day_of_year, cos_day_of_year],
            axis=-1,
        )
        space_time_data = np.nan_to_num(space_time_data).astype(np.float32)
        assert not np.isnan(space_time_data).any()
        assert input_data.shape[:2] == space_time_data.shape[:2]
        # Stack NWP features for output
        output_data = np.stack(
            [
                (end[f"{var}"].values - np.array(const.ERA5_MEANS[f"{var}"])[:, None, None])
                / (np.array(const.ERA5_STD[f"{var}"])[:, None, None] + 0.0001)
                for var in self.NWP_features
            ],
            axis=-1,
        ).astype(np.float32)
        num_layers, num_lat, num_lon, num_vars = output_data.shape
        output_data = output_data.reshape(num_lat, num_lon, num_vars * num_layers)
        output_data = np.nan_to_num(output_data)
        assert not np.isnan(output_data).any()
        assert input_data.shape == output_data.shape
        input_data = np.concatenate([input_data, aux_data, space_time_data], axis=-1)
        transform = transforms.Compose([transforms.ToTensor()])
        # Flatten the grid into (nodes, features) tensors
        return (
            transform(input_data).transpose(0, 1).reshape(-1, input_data.shape[-1]),
            transform(output_data).transpose(0, 1).reshape(-1, output_data.shape[-1]),
        )
```

I still need to compute good estimates of the mean and std of the dataset. Moreover, the features and the training period are arbitrarily selected. In the coming days I will try to compute the variance of the 3/6-hour changes as in #87.
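A hypothetical usage sketch of the class above, to show the shapes it produces; the batch size and worker count are placeholders:

```python
from torch.utils.data import DataLoader

# Stream (input, target) node tensors from the dataset above.
dataset = XrDataset()
loader = DataLoader(dataset, batch_size=1, num_workers=2)

inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)  # (batch, nodes, features) each
```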
@jacobbieker Thanks, that helps!
@gbruno16 It looks like you have already taken care of most of the ERA5 stuff! Nice!
Looks like there's already plenty of people working on this one! Happy to either take the IFS dataset or leave this one to @0xFrama and work on something else : )
@aavashsubedi up to you! There is also #90, which I just added, that might interest you? Probably a bit more involved than this one, but in a similar vein.
Hi @aavashsubedi, I've already started working on this issue, but let me know if you want to work on it as well. I can look at something else, no worries :)
Heyy no worries! I haven't started working on this, so it's all yours ; )
Hi @jacobbieker, I'm having some trouble calculating the mean and std for each of the variables in the dataset. My computer can't process that much data in a reasonable amount of time. Do you have any suggestions on how to proceed?
Not Jacob, but I seem to have notifications enabled. I have been playing around with graph_weather on Kaggle; maybe you can try that, since there is plenty of RAM and CPU there. This may not be the most surefire solution, but for the mean/std you can try a batched Monte Carlo estimate. What I mean is randomly loading batches/subsets of the entire dataset, computing the mean/std of each batch, and averaging those. This should give you an unbiased estimate of the true mean (the std estimate is slightly biased, but close in practice). It should be good enough (Jacob can chime in) until someone computes the population values, if computing them is a massive headache.
Hi, yeah, the Monte Carlo estimate seems like a good way to do that. Otherwise, just taking a few hundred random timesteps can still give a good approximation of the values. Besides Kaggle, another option could be Planetary Computer, as you can access a Dask cluster with 2 TB of RAM for free there.
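A sketch of the random-subsample estimate suggested above, reusing the ERA5 Zarr path from earlier in the thread; the variable list and sample size here are illustrative:

```python
import numpy as np
import xarray as xr

ERA5_PATH = (
    "gs://weatherbench2/datasets/era5/"
    "1959-2022-6h-64x32_equiangular_conservative.zarr"
)
ds = xr.open_zarr(ERA5_PATH)

# Sample a few hundred random timesteps instead of the full 60+ years.
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(ds.sizes["time"], size=200, replace=False))
sample = ds[["temperature", "specific_humidity"]].isel(time=idx)

# Per-level statistics: reduce over time and the horizontal grid only.
means = sample.mean(dim=["time", "latitude", "longitude"]).compute()
stds = sample.std(dim=["time", "latitude", "longitude"]).compute()
print(float(means["temperature"].isel(level=0)))
```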
We want to be able to train and evaluate on WeatherBench 2 to enable easy comparison with other models on the same data.
Detailed Description
https://weatherbench2.readthedocs.io/en/latest/data-guide.html
Context
The data is already in Zarr, easy to pull and use, and freely available, which is ideal. It does tend to be 6-hourly, so it is not as relevant for OCF, but it is a good way to start and to check against as a benchmark.
Possible Implementation
Adding a data module that is extensible
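A rough, hypothetical sketch of what such an extensible module could look like; the registry, class, and method names are illustrative, not an existing API in this repo:

```python
import xarray as xr
from torch.utils.data import Dataset

# Hypothetical registry: supporting a new WeatherBench 2 dataset is one entry.
WB2_PATHS = {
    "era5": (
        "gs://weatherbench2/datasets/era5/"
        "1959-2022-6h-64x32_equiangular_conservative.zarr"
    ),
}


class WeatherBench2Dataset(Dataset):
    """Serve consecutive-timestep pairs from a named WeatherBench 2 Zarr."""

    def __init__(self, name: str = "era5"):
        self.data = xr.open_zarr(WB2_PATHS[name])

    def __len__(self):
        # The last timestep has no successor to use as a target.
        return self.data.sizes["time"] - 1

    def __getitem__(self, item):
        # Feature stacking, normalization, and conversion to tensors are
        # left to subclasses or calling code.
        return self.data.isel(time=item), self.data.isel(time=item + 1)
```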