about get_datasets Function: Train Dataset Length Issue and Code Structure Review #196

kujhin opened this issue Nov 12, 2024 · 1 comment


kujhin commented Nov 12, 2024

Could you show me how the get_datasets function is structured? When I tried to use it with my data, the length of train_dataset came out as 1, so I'd like to check how this function works.

wgifford (Collaborator) commented
Hi @kujhin, thank you for your interest in Granite-TSFM. I will give an overview of the get_datasets function below, but if you have a specific example and question, please ask.

get_datasets is meant to simplify the process of creating appropriate torch datasets for use in training or inference with our time series models. It combines a few different functions (a sketch of a typical call follows the list):

  1. Splitting the data into train, validation, test
  2. Training and applying the preprocessor for a) scaling the data, and b) encoding any categorical data if present
  3. Optionally further reducing the selected training dataframe based on the fewshot parameter
  4. Creating the torch datasets by using ForecastDFDataset on the appropriate preprocessed data split
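
As a rough sketch of how these pieces fit together, a typical call looks something like the following. The imports, column names, lengths, and split fractions are illustrative assumptions on my part (based on the tsfm_public package layout), not library defaults:

    import numpy as np
    import pandas as pd

    from tsfm_public import TimeSeriesPreprocessor, get_datasets

    # Toy single-series dataframe; substitute your own data.
    df = pd.DataFrame(
        {
            "date": pd.date_range("2024-01-01", periods=1000, freq="h"),
            "value": np.random.randn(1000),
        }
    )

    tsp = TimeSeriesPreprocessor(
        timestamp_column="date",
        id_columns=[],              # set this if you have multiple series
        target_columns=["value"],
        context_length=512,
        prediction_length=96,
        scaling=True,
    )

    train_dataset, valid_dataset, test_dataset = get_datasets(
        tsp,
        df,
        split_config={"train": 0.7, "test": 0.2},
    )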

Splitting the data

The data can be split in a few different ways: one can provide absolute indices or relative fractions. When creating the splits, a context_length window is prepended to the validation and test datasets to ensure that there are sufficient samples for these splits. If relative fractions are specified for each split, the indices are calculated as:

    index_start_i = floor(length_i * start_fraction) - start_offset
    index_end_i = floor(length_i * end_fraction)
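
For example, a valid split specified with start_fraction = 0.7 and end_fraction = 0.8 on 1000 rows, with context_length = 10 (arbitrary numbers for illustration), works out to:

    import math

    length = 1000
    context_length = 10   # start_offset, since this is not the train split

    index_start = math.floor(length * 0.7) - context_length   # 690
    index_end = math.floor(length * 0.8)                      # 800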

start_offset is 0 for the train split; otherwise it is context_length. If only train and test fractions are specified, the following logic is used:

    l = len(df)
    train_size = int(l * train)              # rows assigned to the train split
    test_size = int(l * test)                # rows assigned to the test split
    valid_size = l - train_size - test_size  # the remainder goes to validation

Then the following start and end indices will be used for each split:
    train: [0, train_size)
    valid: [train_size - start_offset, train_size + valid_size)
    test:  [train_size + valid_size - start_offset, l)

where start_offset = context_length.
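
Putting illustrative numbers to this case: with 1000 rows, train = 0.7, test = 0.2, and context_length = 10, the splits come out as follows (note that the valid range matches the per-split-fraction example above):

    l = 1000
    train_size = int(l * 0.7)                 # 700
    test_size = int(l * 0.2)                  # 200
    valid_size = l - train_size - test_size   # 100

    # train: [0, 700)
    # valid: [690, 800)   -- 10 rows of context prepended
    # test:  [790, 1000)  -- likewise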

Training and applying preprocessor

After splitting the data, the TimeSeriesPreprocessor instance passed to the get_datasets function (tsp) is trained on the train data (i.e., tsp.train(train_df)), so that scaling parameters and categorical encodings are learned from the training split only. Then all three dataframes are appropriately processed using tsp.preprocess().
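
Conceptually, this stage reduces to something like the following sketch (a simplification of the internal logic, not a verbatim excerpt):

    # Fit scalers and categorical encoders on the training split only,
    # then apply the fitted preprocessor to all three splits.
    tsp.train(train_df)
    train_df = tsp.preprocess(train_df)
    valid_df = tsp.preprocess(valid_df)
    test_df = tsp.preprocess(test_df)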

Optional fewshot training

If fewshot training is enabled (by passing fewshot_fraction), the training dataframe is further reduced by selecting only that fraction of the data. The selection can be done in several ways: by taking rows directly from the beginning or the end of the training dataframe, or by uniformly sampling context windows. The first two options are applied to the dataframe before the torch dataset is created; uniform sampling requires the context windows to exist, so it is applied directly to the train torch dataset after it is built. A sketch of the beginning/end case follows.
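
For the beginning/end cases, the reduction is essentially a dataframe slice, roughly as sketched below (a simplification; the actual implementation also handles grouping by id_columns):

    fewshot_fraction = 0.2   # illustrative value
    n = max(int(len(train_df) * fewshot_fraction), 1)

    train_df = train_df.iloc[-n:]        # keep the last fraction of the data
    # or: train_df = train_df.iloc[:n]   # keep the first fraction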

Creating the torch datasets

Once the pandas dataframes are selected and preprocessed, they are passed to ForecastDFDataset to produce a torch dataset. The dataset is responsible for creating the context windows and the prediction windows, as well as incorporating additional information in the dataset (id columns, timestamps, etc.).
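
This last step is also where a surprising dataset length usually originates: with a stride-1 sliding window, a single preprocessed series of n rows yields roughly n - context_length - prediction_length + 1 samples, so a series only slightly longer than context_length + prediction_length gives len(train_dataset) == 1. A sketch of the construction, with illustrative parameter values (check the ForecastDFDataset signature in your installed version):

    from tsfm_public.toolkit.dataset import ForecastDFDataset

    train_dataset = ForecastDFDataset(
        train_df,
        timestamp_column="date",
        id_columns=[],
        target_columns=["value"],
        context_length=512,
        prediction_length=96,
    )

    # Per series (per id), the number of samples is approximately:
    #   len(series) - context_length - prediction_length + 1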
