
Site torch dataset update #82

Draft · wants to merge 3 commits into main
Conversation

@Sukh-P (Member) commented Nov 19, 2024

Pull Request

Description

WIP PR to update the Site Torch Dataset to return samples as xarray Datasets, for easier conversion into netcdf files, which is currently the preferred format for saving samples.

This PR includes:

  • Adding a new function to process the sample dict (a dict of xarray DataArrays) into one Dataset (see the sketch after this list)
  • Reordering when .compute() is called: now that we combine multiple DataArrays into a Dataset, compute can be called once after the combination
  • Removing unused site-specific parts from the original process-and-combine function
  • Updating unit tests now that the data type of the sample has changed in the Torch Dataset
  • Updating some time-interval syntax to stop a deprecation warning (unrelated to the above changes)
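For illustration, here is a minimal sketch of the merge-then-compute flow described above. The function name, the file path, and the assumption that the DataArrays share compatible dimensions are mine, not the implementation in this PR:

```python
import xarray as xr

def sample_dict_to_dataset(sample: dict[str, xr.DataArray]) -> xr.Dataset:
    """Combine a sample dict of DataArrays into a single Dataset.

    Each DataArray becomes a data variable, so the whole sample can be
    written out as one netcdf file.
    """
    dataset = xr.Dataset({key: da for key, da in sample.items()})
    # compute() is called once on the combined Dataset rather than per DataArray
    return dataset.compute()

# Saving a sample in the preferred netcdf format (path is illustrative):
# sample_dict_to_dataset(sample).to_netcdf("sample_0000.nc")
```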

TODO

  • Removed saving solar coordinates data in samples for now; the current idea is to use the numpy batch functions in here within PVNet to create this data when converting to a numpy batch (if this seems messy, some logic may be added here to add the solar position coordinates to the Dataset)
  • Add new functions to go from a Dataset to a NumpyBatch/TensorBatch (see the sketch after this list)
  • Check this works by creating some samples, adding logic into PVNet to read the netcdfs and convert them to NumpyBatch/TensorBatches, and then training a model; will link the PR here once that is done
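As a rough sketch of what the Dataset → NumpyBatch/TensorBatch conversion could look like (function names and key handling are hypothetical, pending the actual implementation):

```python
import numpy as np
import torch
import xarray as xr

def dataset_to_numpy_sample(ds: xr.Dataset) -> dict[str, np.ndarray]:
    """Turn each data variable of the Dataset into a numpy array keyed by name."""
    return {name: ds[name].values for name in ds.data_vars}

def numpy_sample_to_tensors(sample: dict[str, np.ndarray]) -> dict[str, torch.Tensor]:
    """Convert a numpy sample into torch tensors ready for the model."""
    return {
        key: torch.from_numpy(np.ascontiguousarray(arr))
        for key, arr in sample.items()
    }
```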

@peterdudfield (Contributor) commented:
Thanks @Sukh-P, great to push this forward.

A few quick thoughts, and sorry if these seem obvious or have already been answered:

  1. Is the ideal still to convert the site batch dataset to a dict of tensors ready for the model (PVNet)? If so, will this code sit in here or in PVNet?
  2. For the different torch dataloaders, do we have an idea of how to handle the three different processes: 1. making batches, 2. loading batches and training a model, 3. running inference? It would be a shame to have separate torch datasets for each loader, but perhaps there is a simple way to do this. This is very much in your TODO section, so perhaps you have already thought about this / are going to next.
  3. This might be related to 2., but do you know where the combining-samples-into-a-batch process fits in?

@Sukh-P (Member, Author) commented Nov 20, 2024

@peterdudfield thanks, I have tried to answer these:

  1. Is the ideal still to convert the site batch dataset to a dict of tensors ready for the model (PVNet)? If so, will this code sit in here or in PVNet?

Yes, that's still the plan, perhaps still making a NumpyBatch if we want a more generic intermediary format. And yes, the code will probably be added here but called from PVNet; I can make that clearer in the TODO list above.

  2. For the different torch dataloaders, do we have an idea of how to handle the three different processes: 1. making batches, 2. loading batches and training a model, 3. running inference? It would be a shame to have separate torch datasets for each loader, but perhaps there is a simple way to do this. This is very much in your TODO section, so perhaps you have already thought about this / are going to next.

I think the longer-term plan is to move towards one batch format (netcdf) and a common interface to batches through a batch object. That way things will be more generalised and we will have fewer different ways of doing the same thing. I imagine this will need a bit more thought and can be improved after a working pipeline for sites has been added; I can create an issue/discussion around this once we have that.

  3. This might be related to 2., but do you know where the combining-samples-into-a-batch process fits in?

So I think this is handled by a function that stacks samples into a batch, like here, together with the Torch DataLoader, where you specify how many samples go into a batch.
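To make that concrete, here is a minimal sketch of a stacking/collate function handed to the Torch DataLoader. The function name and per-key stacking are assumptions; the linked code above is the actual reference:

```python
import torch
from torch.utils.data import DataLoader

def stack_samples_into_batch(samples: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Stack a list of per-sample tensor dicts along a new leading batch dimension."""
    keys = samples[0].keys()
    return {key: torch.stack([s[key] for s in samples], dim=0) for key in keys}

# The DataLoader decides how many samples make up a batch and calls the
# stacking function for us (dataset and batch_size are illustrative):
# dataloader = DataLoader(dataset, batch_size=8, collate_fn=stack_samples_into_batch)
```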
