about get_datasets Function: Train Dataset Length Issue and Code Structure Review #196

kujhin opened this issue Nov 12, 2024 · 1 comment


kujhin commented Nov 12, 2024

Could you show me how the get_datasets function is structured? When I tried to use it with my data, the length of train_dataset came out as 1, so I'd like to check how this function works.

wgifford (Collaborator) commented
Hi @kujhin, thank you for your interest in Granite-TSFM. I will give an overview of the get_datasets function below, but if you have a specific example and question, please ask.

get_datasets is meant to simplify the process of creating appropriate torch datasets for use in training or inference with our time series models. It combines a few different functions (a sketch of a typical call follows the list):

  1. Splitting the data into train, validation, test
  2. Training and applying the preprocessor for a) scaling the data, and b) encoding any categorical data if present
  3. Optionally further reducing the selected training dataframe based on the fewshot parameter
  4. Creating the torch datasets by using ForecastDFDataset on the appropriate preprocessed data split
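
As a rough sketch of how these pieces fit together, a typical call looks something like the following. The imports, column names, lengths, and split fractions are illustrative assumptions on my part (based on the tsfm_public package layout), not library defaults:

    import numpy as np
    import pandas as pd

    from tsfm_public import TimeSeriesPreprocessor, get_datasets

    # Toy single-series dataframe; substitute your own data.
    df = pd.DataFrame(
        {
            "date": pd.date_range("2024-01-01", periods=1000, freq="h"),
            "value": np.random.randn(1000),
        }
    )

    tsp = TimeSeriesPreprocessor(
        timestamp_column="date",
        id_columns=[],              # set this if you have multiple series
        target_columns=["value"],
        context_length=512,
        prediction_length=96,
        scaling=True,
    )

    train_dataset, valid_dataset, test_dataset = get_datasets(
        tsp,
        df,
        split_config={"train": 0.7, "test": 0.2},
    )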

Splitting the data

The data can be split in a few different ways: one can provide absolute indices or relative fractions. When creating the splits, a context_length window is prepended to the validation and test datasets to ensure that there are sufficient samples for these splits. If relative fractions are specified for each split, the indices are calculated as:

    index_start_i = floor(length_i * start_fraction) - start_offset
    index_end_i = floor(length_i * end_fraction)
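
For example, a valid split specified with start_fraction = 0.7 and end_fraction = 0.8 on 1000 rows, with context_length = 10 (arbitrary numbers for illustration), works out to:

    import math

    length = 1000
    context_length = 10   # start_offset, since this is not the train split

    index_start = math.floor(length * 0.7) - context_length   # 690
    index_end = math.floor(length * 0.8)                      # 800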

start_offset is 0 for the train split; otherwise it is context_length. If only train and test fractions are specified, the following logic is used:

    l = len(df)
    train_size = int(l * train)              # rows assigned to the train split
    test_size = int(l * test)                # rows assigned to the test split
    valid_size = l - train_size - test_size  # the remainder goes to validation

Then the following start and end indices will be used for each split:
    train: [0, train_size)
    valid: [train_size - start_offset, train_size + valid_size)
    test:  [train_size + valid_size - start_offset, l)

where start_offset = context_length.
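
Putting illustrative numbers to this case: with 1000 rows, train = 0.7, test = 0.2, and context_length = 10, the splits come out as follows (note that the valid range matches the per-split-fraction example above):

    l = 1000
    train_size = int(l * 0.7)                 # 700
    test_size = int(l * 0.2)                  # 200
    valid_size = l - train_size - test_size   # 100

    # train: [0, 700)
    # valid: [690, 800)   -- 10 rows of context prepended
    # test:  [790, 1000)  -- likewise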

Training and applying preprocessor

After splitting the data, the TimeSeriesPreprocessor instance passed to the get_datasets function (tsp) is trained on the train data (i.e., tsp.train(train_df)), so that scaling parameters and categorical encodings are learned from the training split only. Then all three dataframes are appropriately processed using tsp.preprocess().
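
Conceptually, this stage reduces to something like the following sketch (a simplification of the internal logic, not a verbatim excerpt):

    # Fit scalers and categorical encoders on the training split only,
    # then apply the fitted preprocessor to all three splits.
    tsp.train(train_df)
    train_df = tsp.preprocess(train_df)
    valid_df = tsp.preprocess(valid_df)
    test_df = tsp.preprocess(test_df)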

Optional fewshot training

If fewshot training is enabled (by passing fewshot_fraction), the training dataframe is further reduced by selecting only that fraction of the data. The selection can be done in several ways: by taking rows directly from the beginning or the end of the training dataframe, or by uniformly sampling context windows. The first two options are applied to the dataframe before the torch dataset is created; uniform sampling requires the context windows to exist, so it is applied directly to the train torch dataset after it is built. A sketch of the beginning/end case follows.
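
For the beginning/end cases, the reduction is essentially a dataframe slice, roughly as sketched below (a simplification; the actual implementation also handles grouping by id_columns):

    fewshot_fraction = 0.2   # illustrative value
    n = max(int(len(train_df) * fewshot_fraction), 1)

    train_df = train_df.iloc[-n:]        # keep the last fraction of the data
    # or: train_df = train_df.iloc[:n]   # keep the first fraction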

Creating the torch datasets

Once the pandas dataframes are selected and preprocessed, they are passed to ForecastDFDataset to produce a torch dataset. The dataset is responsible for creating the context windows and the prediction windows, as well as incorporating additional information in the dataset (id columns, timestamps, etc.).
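
This last step is also where a surprising dataset length usually originates: with a stride-1 sliding window, a single preprocessed series of n rows yields roughly n - context_length - prediction_length + 1 samples, so a series only slightly longer than context_length + prediction_length gives len(train_dataset) == 1. A sketch of the construction, with illustrative parameter values (check the ForecastDFDataset signature in your installed version):

    from tsfm_public.toolkit.dataset import ForecastDFDataset

    train_dataset = ForecastDFDataset(
        train_df,
        timestamp_column="date",
        id_columns=[],
        target_columns=["value"],
        context_length=512,
        prediction_length=96,
    )

    # Per series (per id), the number of samples is approximately:
    #   len(series) - context_length - prediction_length + 1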
