
Demo version #1

Merged: 65 commits into main from development, Aug 13, 2024

Conversation

@dfulu (Member) commented Aug 1, 2024

This pull request provides the minimal demo version of this library. There is still lots to do.

TODO:

  • CI testing fixes
  • Fill in basic project metadata, i.e. setup.py and README
  • TODO list #7

@dfulu dfulu requested review from peterdudfield and Sukh-P and removed request for peterdudfield and Sukh-P August 1, 2024 15:34
):
    """Select a time slice from a Dataset or DataArray."""
    used_duration = history_duration is not None and forecast_duration is not None
    used_intervals = interval_start is not None and interval_end is not None
Contributor commented:

Honestly just out of curiosity: when is interval slicing used? I've never seen it, so I'm assuming GSPs?

@dfulu (Member, Author) replied Aug 13, 2024:

We do use it in datapipes. In fact, I want to propose that we get rid of [history, forecast]_duration in favour of interval_[start, end]. There are some cases where interval makes more sense, and it dates back to this issue in datapipes. If we did this we could also get rid of live_delay_minutes in the satellite config, which is slightly redundant.
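
To illustrate the proposal (the values below are made up):

import pandas as pd

# Any (history, forecast) pair maps onto an interval about t0
history_duration = pd.Timedelta("2h")
forecast_duration = pd.Timedelta("8h")

interval_start = -history_duration  # slice starts at t0 - 2h
interval_end = forecast_duration    # slice ends at t0 + 8h

# ...but intervals also express cases durations can't, e.g. a satellite input
# delayed by 30 minutes just becomes interval_end = -pd.Timedelta("30min"),
# with no separate live_delay_minutes config field needed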

Contributor replied:

Cool, thanks for the pointer!

if "target_time_utc" in da.coords:
example[NWPBatchKey.nwp_target_time_utc] = da.target_time_utc.values.astype(float)

# TODO: Do we need this at all? Especially since it is only present in UKV data
Contributor commented:

I feel like this either needs to include lat/lon or not exist; otherwise it's kind of strange that only UKV gets positional features? Unless I'm misunderstanding the purpose of this.
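
Roughly what I mean (the lat/lon batch keys here are hypothetical, not from this PR):

# Hypothetical sketch: give every provider positional features, not just UKV
if "latitude" in da.coords and "longitude" in da.coords:
    example[NWPBatchKey.nwp_latitude] = da.latitude.values
    example[NWPBatchKey.nwp_longitude] = da.longitude.values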

Comment on lines +355 to +359
if "nwp" in dataset_dict:

nwp_numpy_modalities = dict()

for nwp_key, da_nwp in dataset_dict["nwp"].items():
Contributor commented:

This is so vague I hesitate to even call it a suggestion, but this unpacking happens over and over in every other function, which makes me wonder whether it would be, well, maybe not easier, but clearer to just store the NWP sources unpacked and prefixed with "nwp" or something. It would mess with checking key by key, so we would have to revert to a for loop iterating over the keys, and the construction of all the parts the NWP functions need would have to be moved out somewhere, but I feel like it would be so much neater if we could get it down to something (VERY vaguely) like:

def is_nwp(key):
    # e.g. keys like "nwp-ukv"
    return key.split("-")[0] == "nwp"


def return_spatial_slice(da, config, key):
    if is_nwp(key):
        return spatial_slice_nwp(da, config, **get_additional_nwp_kwargs(da, config))
    else:
        return spatial_slice(da, config)


slices = []
for key, dataset in dataset_dict.items():
    key_config = getattr(config, key)
    slices.append(return_spatial_slice(dataset, key_config, key))

It may well be that this is a me-problem (I am known to have a massive sore spot over the separate processing of NWPs and non-NWPs), but thought I'd throw it in.

@devsjc commented Aug 13, 2024:

Could even have the dictionary contain more strongly typed elements, e.g.

from typing import NamedTuple

class DAContainer(NamedTuple):
    key: str
    """??? Something different to da.name?"""
    da: xr.DataArray
    """The actual xarray DataArray containing the data"""

# Note the stronger typing in the dictionary parameter, which I guess is something like
# {
#     "nwp": [DAContainer("ECMWF_INDIA", xr.DataArray(...)), DAContainer(...)],
#     "sat": [DAContainer("", xr.DataArray(...)), ...],
# }
def process_and_combine_datasets(dataset_dict: dict[str, list[DAContainer]], config: Configuration) -> NumpyBatch:
    if "nwp" in dataset_dict:
        nwp_numpy_modalities = dict()
        for dac in dataset_dict["nwp"]:
            provider = config.input_data.nwp[dac.key].nwp_provider
            da_nwp = (dac.da - NWP_MEANS[provider]) / NWP_STDS[provider]
    ...

I haven't looked at this carefully enough, but it might just be the case that you can use the DataArray "name" attribute to avoid this altogether; I don't know what the "nwp_key" is meant to represent other than this. Then you could have:

def process_and_combine_datasets(dataset_dict: dict[str, list[xr.DataArray]], config: Configuration) -> NumpyBatch:
    if "nwp" in dataset_dict:
        nwp_numpy_modalities = dict()
        for da in dataset_dict["nwp"]:
            provider = config.input_data.nwp[da.name].nwp_provider
            da_nwp = (da - NWP_MEANS[provider]) / NWP_STDS[provider]
        ...
    ...

for nwp_key, da_nwp in dataset_dict["nwp"].items():
    # Standardise
    provider = config.input_data.nwp[nwp_key].nwp_provider
    da_nwp = (da_nwp - NWP_MEANS[provider]) / NWP_STDS[provider]
Contributor commented:

This probably requires a whole other conversation about whether we're keeping the statistics in consts or moving them into the data attributes, but it could be worth moving this into a separate normalisation function. That would also give space for different handling of variables such as prate that don't mix well with standard normalisation.
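
A minimal sketch of what that might look like (the function name and the prate handling are hypothetical, and a "channel" coordinate is assumed; NWP_MEANS/NWP_STDS are the existing consts):

import numpy as np
import xarray as xr


def normalise_nwp(da: xr.DataArray, provider: str) -> xr.DataArray:
    # Hypothetical: non-negative, heavy-tailed channels like prate may behave
    # better under a log transform than plain z-scoring (the prate statistics
    # would then need recomputing on the transformed scale)
    da = xr.where(da.channel == "prate", np.log1p(da), da)
    # Standardise everything with the per-provider statistics
    return (da - NWP_MEANS[provider]) / NWP_STDS[provider]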


def fill_time_periods(time_periods: pd.DataFrame, freq: pd.Timedelta):
    datetimes = []
    for _, row in time_periods.iterrows():
Member commented:

Feels like this could be vectorised instead of looping through all the rows in the df, which would be more efficient. Happy to try that if you agree.
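
A rough sketch of one approach (the "start_dt"/"end_dt" column names are a guess at the schema used by the row-wise loop above):

import numpy as np
import pandas as pd


def fill_time_periods(time_periods: pd.DataFrame, freq: pd.Timedelta) -> pd.DatetimeIndex:
    # Build one date_range per contiguous period, then concatenate once,
    # instead of accumulating datetimes row by row in Python
    ranges = [
        pd.date_range(start.ceil(freq), end, freq=freq)
        for start, end in zip(time_periods["start_dt"], time_periods["end_dt"])
    ]
    if not ranges:
        return pd.DatetimeIndex([])
    return pd.DatetimeIndex(np.concatenate([r.values for r in ranges]))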

end_buffer = minutes(0)

# This is the max staleness we can use considering the max step of the input data
max_possible_staleness = (
Member commented:

More concretely, on the find_valid_t0_times function perhaps being too long: I think this part about the staleness could be moved into a get_max_staleness function. Happy to do that if you think it's a good idea.
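
Something like the following, perhaps (the signature and body are guesses from the snippet above, which only shows the comment and the opening of the expression):

def get_max_staleness(da, forecast_duration, end_buffer=minutes(0)):
    # Hypothetical extraction: the max staleness we can use given the max
    # step of the input data, the forecast length, and the end buffer
    return pd.Timedelta(da["step"].max().item()) - forecast_duration - end_buffer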

Comment on lines +43 to +44
if modality_name not in {"gsp"}:
    raise NotImplementedError(f"Modality {modality_name} not supported")
Contributor commented:

This is very probably just a for-now thing, but in case it isn't: it should probably support "pv" as well for pvsite. But again, we'll probably add it when we add the pvsite pipeline.
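
That would be a one-line change once the pvsite pipeline lands, e.g.:

if modality_name not in {"gsp", "pv"}:
    raise NotImplementedError(f"Modality {modality_name} not supported")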

return contiguous_t0_periods


def _find_contiguous_t0_periods_nwp(
Contributor commented:

I might be missing something, but is this ever used? It looks like an attempt to hide stuff like the max_staleness calculation under the hood; is that what it is?

@dfulu dfulu merged commit 654449c into main Aug 13, 2024
3 checks passed
@peterdudfield peterdudfield deleted the development branch September 18, 2024 14:18
@peterdudfield (Contributor) commented:

@all-contributors
please add @dfulu for code.
please add @AUdaltsova for code.
please add @Sukh-P for code.
please add @peterdudfield for code.

Contributor replied:

@peterdudfield

I've put up a pull request to add @dfulu! 🎉

@peterdudfield (Contributor) commented:

@all-contributors
please add @AUdaltsova for code.

Contributor replied:

@peterdudfield

I've put up a pull request to add @AUdaltsova! 🎉

@peterdudfield (Contributor) commented:

@all-contributors please add @Sukh-P for code.

Contributor replied:

@peterdudfield

I've put up a pull request to add @Sukh-P! 🎉

@peterdudfield (Contributor) commented:

@all-contributors please add @peterdudfield for code.

Contributor replied:

@peterdudfield

I've put up a pull request to add @peterdudfield! 🎉
