
Demo version #1

Merged: 65 commits into main from development, Aug 13, 2024

Conversation

@dfulu (Member) commented Aug 1, 2024

This pull request provides the minimal demo version of this library. There is still lots to do.

TODO:

  • CI testing fixes
  • Fill in basic project metadata, i.e. setup.py and README
  • TODO list #7

@dfulu dfulu requested review from peterdudfield and Sukh-P and removed request for peterdudfield and Sukh-P August 1, 2024 15:34
):
    """Select a time slice from a Dataset or DataArray."""
    used_duration = history_duration is not None and forecast_duration is not None
    used_intervals = interval_start is not None and interval_end is not None
Contributor commented:

Honestly just out of curiosity: when is interval slicing used? I've never seen it, so I'm assuming GSPs?

@dfulu (Member, Author) replied Aug 13, 2024:

We do use it in datapipes. In fact, I want to propose that we get rid of [history, forecast]_duration in favour of interval_[start, end]. There are some cases where interval makes more sense, and it dates back to this issue in datapipes. If we did this we could also get rid of live_delay_minutes in the satellite config, which is slightly redundant.
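
To illustrate the proposal (the values below are made up):

import pandas as pd

# Any (history, forecast) pair maps onto an interval about t0
history_duration = pd.Timedelta("2h")
forecast_duration = pd.Timedelta("8h")

interval_start = -history_duration  # slice starts at t0 - 2h
interval_end = forecast_duration    # slice ends at t0 + 8h

# ...but intervals also express cases durations can't, e.g. a satellite input
# delayed by 30 minutes just becomes interval_end = -pd.Timedelta("30min"),
# with no separate live_delay_minutes config field needed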

Contributor replied:

Cool, thanks for the pointer!

if "target_time_utc" in da.coords:
example[NWPBatchKey.nwp_target_time_utc] = da.target_time_utc.values.astype(float)

# TODO: Do we need this at all? Especially since it is only present in UKV data
Contributor commented:

I feel like this either needs to include lat/lon or not exist; otherwise it's kind of strange that only UKV gets positional features? Unless I'm misunderstanding the purpose of this.
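
Roughly what I mean (the lat/lon batch keys here are hypothetical, not from this PR):

# Hypothetical sketch: give every provider positional features, not just UKV
if "latitude" in da.coords and "longitude" in da.coords:
    example[NWPBatchKey.nwp_latitude] = da.latitude.values
    example[NWPBatchKey.nwp_longitude] = da.longitude.values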

Comment on lines +355 to +359
if "nwp" in dataset_dict:

nwp_numpy_modalities = dict()

for nwp_key, da_nwp in dataset_dict["nwp"].items():
Contributor commented:

This is so vague I hesitate to even call it a suggestion, but this unpacking happens over and over in every other function, which makes me wonder whether it would be, well, maybe not easier, but clearer to just store the NWP sources unpacked and prefixed with "nwp" or something. It would mess with checking key by key, so we would have to revert to a for loop iterating over the keys, and the construction of all the parts the NWP functions need would have to be moved out somewhere, but I feel like it would be so much neater if we could get it down to something (VERY vaguely) like:

def is_nwp(key):
    # e.g. keys like "nwp-ukv"
    return key.split("-")[0] == "nwp"


def return_spatial_slice(da, config, key):
    if is_nwp(key):
        return spatial_slice_nwp(da, config, **get_additional_nwp_kwargs(da, config))
    else:
        return spatial_slice(da, config)


slices = []
for key, dataset in dataset_dict.items():
    key_config = getattr(config, key)
    slices.append(return_spatial_slice(dataset, key_config, key))

It may well be that this is a me-problem (I am known to have a massive sore spot over the separate processing of NWPs and non-NWPs), but thought I'd throw it in.

@devsjc commented Aug 13, 2024:

Could even have the dictionary contain more strongly typed elements, e.g.

from typing import NamedTuple

class DAContainer(NamedTuple):
    key: str
    """??? Something different to da.name?"""
    da: xr.DataArray
    """The actual xarray DataArray containing the data"""

# Note the stronger typing in the dictionary parameter, which I guess is something like
# {
#     "nwp": [DAContainer("ECMWF_INDIA", xr.DataArray(...)), DAContainer(...)],
#     "sat": [DAContainer("", xr.DataArray(...)), ...],
# }
def process_and_combine_datasets(dataset_dict: dict[str, list[DAContainer]], config: Configuration) -> NumpyBatch:
    if "nwp" in dataset_dict:
        nwp_numpy_modalities = dict()
        for dac in dataset_dict["nwp"]:
            provider = config.input_data.nwp[dac.key].nwp_provider
            da_nwp = (dac.da - NWP_MEANS[provider]) / NWP_STDS[provider]
    ...

I haven't looked at this carefully enough, but it might just be the case that you can use the DataArray "name" attribute to avoid this altogether; I don't know what the "nwp_key" is meant to represent other than this. Then you could have:

def process_and_combine_datasets(dataset_dict: dict[str, list[xr.DataArray]], config: Configuration) -> NumpyBatch:
    if "nwp" in dataset_dict:
        nwp_numpy_modalities = dict()
        for da in dataset_dict["nwp"]:
            provider = config.input_data.nwp[da.name].nwp_provider
            da_nwp = (da - NWP_MEANS[provider]) / NWP_STDS[provider]
        ...
    ...

for nwp_key, da_nwp in dataset_dict["nwp"].items():
    # Standardise
    provider = config.input_data.nwp[nwp_key].nwp_provider
    da_nwp = (da_nwp - NWP_MEANS[provider]) / NWP_STDS[provider]
Contributor commented:

This probably requires a whole other conversation about whether we're keeping the statistics in consts or moving them into the data attributes, but it could be worth moving this into a separate normalisation function. That would also give space for different handling of variables such as prate that don't mix well with standard normalisation.
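
A minimal sketch of what that might look like (the function name and the prate handling are hypothetical, and a "channel" coordinate is assumed; NWP_MEANS/NWP_STDS are the existing consts):

import numpy as np
import xarray as xr


def normalise_nwp(da: xr.DataArray, provider: str) -> xr.DataArray:
    # Hypothetical: non-negative, heavy-tailed channels like prate may behave
    # better under a log transform than plain z-scoring (the prate statistics
    # would then need recomputing on the transformed scale)
    da = xr.where(da.channel == "prate", np.log1p(da), da)
    # Standardise everything with the per-provider statistics
    return (da - NWP_MEANS[provider]) / NWP_STDS[provider]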


def fill_time_periods(time_periods: pd.DataFrame, freq: pd.Timedelta):
    datetimes = []
    for _, row in time_periods.iterrows():
Member commented:

Feels like this could be vectorised instead of looping through all the rows in the df, which would be more efficient. Happy to try that if you agree.
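
A rough sketch of one approach (the "start_dt"/"end_dt" column names are a guess at the schema used by the row-wise loop above):

import numpy as np
import pandas as pd


def fill_time_periods(time_periods: pd.DataFrame, freq: pd.Timedelta) -> pd.DatetimeIndex:
    # Build one date_range per contiguous period, then concatenate once,
    # instead of accumulating datetimes row by row in Python
    ranges = [
        pd.date_range(start.ceil(freq), end, freq=freq)
        for start, end in zip(time_periods["start_dt"], time_periods["end_dt"])
    ]
    if not ranges:
        return pd.DatetimeIndex([])
    return pd.DatetimeIndex(np.concatenate([r.values for r in ranges]))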

end_buffer = minutes(0)

# This is the max staleness we can use considering the max step of the input data
max_possible_staleness = (
Member commented:

More concretely, on the find_valid_t0_times function perhaps being too long: I think this part about the staleness could be moved into a get_max_staleness function. Happy to do that if you think it's a good idea.
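
Something like the following, perhaps (the signature and body are guesses from the snippet above, which only shows the comment and the opening of the expression):

def get_max_staleness(da, forecast_duration, end_buffer=minutes(0)):
    # Hypothetical extraction: the max staleness we can use given the max
    # step of the input data, the forecast length, and the end buffer
    return pd.Timedelta(da["step"].max().item()) - forecast_duration - end_buffer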

Comment on lines +43 to +44
if modality_name not in {"gsp"}:
    raise NotImplementedError(f"Modality {modality_name} not supported")
Contributor commented:

This is very probably just a for-now thing, but in case it isn't: it should probably support "pv" as well for pvsite. But again, we'll probably add it when we add the pvsite pipeline.
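
That would be a one-line change once the pvsite pipeline lands, e.g.:

if modality_name not in {"gsp", "pv"}:
    raise NotImplementedError(f"Modality {modality_name} not supported")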

return contiguous_t0_periods


def _find_contiguous_t0_periods_nwp(
Contributor commented:

I might be missing something, but is this ever used? It looks like an attempt to hide stuff like the max_staleness calculation under the hood; is that what it is?

@dfulu dfulu merged commit 654449c into main Aug 13, 2024
3 checks passed
@peterdudfield peterdudfield deleted the development branch September 18, 2024 14:18
@peterdudfield (Contributor) commented:

@all-contributors
please add @dfulu for code.
please add @AUdaltsova for code.
please add @Sukh-P for code.
please add @peterdudfield for code.

Contributor replied:

@peterdudfield

I've put up a pull request to add @dfulu! 🎉

@peterdudfield (Contributor) commented:

@all-contributors
please add @AUdaltsova for code.

Contributor replied:

@peterdudfield

I've put up a pull request to add @AUdaltsova! 🎉

@peterdudfield (Contributor) commented:

@all-contributors please add @Sukh-P for code.

Contributor replied:

@peterdudfield

I've put up a pull request to add @Sukh-P! 🎉

@peterdudfield (Contributor) commented:

@all-contributors please add @peterdudfield for code.

Contributor replied:

@peterdudfield

I've put up a pull request to add @peterdudfield! 🎉
