
Weeks 07 and 08


TL;DR

During weeks 7 and 8, I implemented the pre-processing and splitting methods of the framework. In summary, I implemented:

  • Pre-processing methods: rating binarization and filtering by rating or by k-core.
  • Splitting methods: Random by ratio, Timestamp by ratio, Fixed timestamp, and K-fold, all of them supporting global or user-level splitting.

Pre-processing

During a Recommender System experiment pipeline, the pre-processing step is a model-independent hyper-factor that can lead to different recommender performance. The goal of this framework is to implement the most commonly used pre-processing methods, as identified in the Elliot and DaisyRec literature reviews.

For now, the binarization and main filtering methods are available in the framework.

In the .yaml file, the directive preprocess is used to define the list of pre-processing methods to be performed during the experiment pipeline. The pre-processing step can be configured as:

experiment:
  preprocess:
    - method: method1  
      parameters:
        parameter_1: 3  
        parameter_2: val 
    - method: method2
      parameters: 
        parameter_1: 4

Where,

  • preprocess: specifies a list of pre-processing methods. (optional)
    • method: method name (mandatory)
    • parameters: method parameters in the format parameter_name: parameter_value

The pre-processing methods are implemented in the framework/dataloader/preprocess/ subpackage.

Binarization

This pre-processing step is a data transformation that turns explicit feedback, such as ratings or counts, into implicit feedback (positive or negative).

It requires a threshold parameter that indicates whether a rating should be considered positive or negative. For each rating $r$, the binarized rating $r_b$ will be positive ($r_b = 1$) if $r \geq threshold$, and negative ($r_b = 0$) otherwise.

In the .yaml file, the method name is binarize and the only parameter is the numeric threshold. Example:

experiment:
  preprocess:
    - method: binarize
      parameters: 
        threshold: 4
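
For reference, the transformation above can be sketched with pandas as follows (a minimal sketch; the column names used here are assumptions, not the framework's actual schema):

import pandas as pd

# Toy interaction table; the "rating" column name is an assumption.
ratings = pd.DataFrame({"user": [1, 1, 2], "item": [10, 20, 10],
                        "rating": [5.0, 2.0, 4.0]})

threshold = 4
# r_b = 1 if r >= threshold, else 0
ratings["rating"] = (ratings["rating"] >= threshold).astype(int)
print(ratings)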

Filtering

In general, the original dataset is sparse: some items are rarely interacted with, and some users interact with only a few items. To address this, filtering strategies can be used to remove inactive users or items.

As indicated by the literature, the most commonly used methods are Filter by Rating and K-core.

K-Core

Filters out users and items with fewer than k interactions. This method can be applied iteratively, either until the condition is met (all users/items have at least k interactions) or for a fixed number of iterations.

In the .yaml file, the method name is filter_kcore, and the parameters are k, the number of iterations, and the target node type (user or item). Example:

experiment:
  # ...
  preprocess:
    - method: filter_kcore
      parameters:
        k: 20
        iterations: 3
        target: user # user or item
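
To make the iteration explicit, here is a rough pandas sketch of the idea (independent of the framework's actual implementation; the column names and the helper name are assumptions):

import pandas as pd

def filter_kcore(df, k=20, iterations=3, target="user"):
    # Repeatedly drop target nodes with fewer than k interactions,
    # for at most `iterations` passes or until nothing changes.
    for _ in range(iterations):
        counts = df.groupby(target)[target].transform("size")
        kept = df[counts >= k]
        if len(kept) == len(df):
            break  # every remaining node already has at least k interactions
        df = kept
    return df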

Filter by rating

Filter by rating is a special case of k-core filtering with only one iteration.

It can be specified simply by setting iterations: 1.

Dataset splitting

This step of the experiment pipeline splits the data into training, validation, and test sets. According to the literature, the most commonly used methods are Random by Ratio, Timestamp by Ratio, Fixed Timestamp, and K-fold.

The framework implements these splitting methods and uses an Edge Splitter to split the graph into training, test, and validation sets.

In the .yaml file, the directive split is used to define the split method. For example:

experiment:
  # ...
  split:
    seed: 42
    test:
      method: method1_name 
      parameter1_name: 0.2
      parameter2_name: value
    validation:
      method: method2_name 
      parameter1_name: value_2
      parameter2_name: 100

Where,

  • split: specifies the splitting configuration. (mandatory)
    • seed: random seed value for reproducibility
    • test: directive for the test split
    • validation: directive for the validation split
    • method: splitting method name, given inside test or validation (mandatory)
    • Method parameters as key-value pairs, where the key is the parameter name and the value is the corresponding parameter value. Example: parameter1: value1

The splitting methods are implemented in the framework/dataloader/edge_splitter/ subpackage. The main class is EdgeSplitter(), whose split() method returns:

  • G_train: the training graph, obtained after removing the ratings held out for the test set;
  • ratings_test: a dictionary where the key is an ItemNode and the value is a list of tuples [(ItemNode, rating: float)]

All the split data is stored in a Dataset instance.

Random By Ratio

Shuffles the ratings and extracts a proportion $p$ of them as the test set; the remaining $1-p$ is used as the training set.

This can be done at the global or user level. The former splits over all ratings at once; the latter takes, for each user, a proportion $p$ of that user's ratings as the test set.

In the .yaml file, the method name is random_by_ratio, and the parameters are:

  • p: test set proportion (mandatory)
  • level: global or user level (mandatory).

Example:

experiment:
  # ...
  split:
    seed: 42
    test:
      method: random_by_ratio 
      level: global # or user
      p: 0.2
    validation:
      method: random_by_ratio 
      level: global # or user
      p: 0.2
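
For intuition, a minimal sketch of the global and user-level variants on a pandas interaction table (not the framework's code; the column names and the helper name are hypothetical):

import pandas as pd

def random_by_ratio(df, p=0.2, level="global", seed=42):
    # Sample a fraction p of the ratings as the test set.
    if level == "global":
        test = df.sample(frac=p, random_state=seed)
    else:  # "user": sample a fraction p of each user's ratings
        test = df.groupby("user").sample(frac=p, random_state=seed)
    train = df.drop(test.index)
    return train, test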

Timestamp by ratio

After ordering the ratings by timestamp, a proportion $p$ of the most recent ratings is extracted as test data.

In the .yaml file, the method name is timestamp_by_ratio, and the parameters are the same as Random by Ratio. Example:

experiment:
  # ...
  split:
    test:
      method: timestamp_by_ratio 
      level: user # or global
      p: 0.1
    validation:
      method: timestamp_by_ratio 
      level: user # or global
      p: 0.2
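
The temporal variant can be sketched the same way: sort by timestamp and hold out the most recent fraction $p$ (again, the column names and the helper name are assumptions):

import pandas as pd

def timestamp_by_ratio(df, p=0.1, level="user"):
    def most_recent(frame):
        # Keep the most recent fraction p of the rows as test data.
        frame = frame.sort_values("timestamp")
        n_test = int(len(frame) * p)
        return frame.tail(n_test)

    if level == "global":
        test = most_recent(df)
    else:  # per-user: the most recent fraction p of each user's ratings
        test = df.groupby("user", group_keys=False).apply(most_recent)
    train = df.drop(test.index)
    return train, test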

Fixed Timestamp

Splits the data into training and test sets at a fixed timestamp. All ratings before the timestamp are used as training data and the rest as test data.

In the .yaml file, the method name is fixed_timestamp. The only parameter is the numeric timestamp. Example:

experiment:
  # ...
  split:
    test:
      method: fixed_timestamp 
      timestamp: 890000000
    validation:
      method: fixed_timestamp 
      timestamp: 880000000
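
Conceptually this is just a cutoff on the timestamp column, as in this small sketch (the column names are assumptions):

import pandas as pd

df = pd.DataFrame({"user": [1, 2], "item": [10, 20],
                   "timestamp": [870000000, 900000000]})

cutoff = 890000000
train = df[df["timestamp"] < cutoff]   # ratings before the cutoff
test = df[df["timestamp"] >= cutoff]   # remaining ratings form the test set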

K-Fold

In K-fold cross-validation, the dataset is divided into K equally sized subsets or folds. The model is trained and evaluated K times, each time using a different fold as the test set while using the remaining folds for training. The results are then averaged to obtain an overall performance measure.

In the .yaml file, the method name is k_fold. The parameters are:

  • k: number of folds (mandatory)
  • level: global or user level (mandatory).

Note: This method does not support validation splitting.

Example:

experiment:
  # ...
  split:
    test:
      method: k_fold
      k: 3 
      level: 'user'
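
For reference, a small global-level sketch using scikit-learn's KFold (the framework's own implementation, including the user-level variant, may differ):

import numpy as np
from sklearn.model_selection import KFold

rating_ids = np.arange(12)  # stand-in for the interaction indices
kf = KFold(n_splits=3, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(rating_ids)):
    # Each fold uses a different third of the ratings as the test set.
    print(fold, train_idx, test_idx)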

Next Steps

Implement a Recommender System model and add it to the experiment pipeline.

References

Vito Walter Anelli et al., "Elliot: A Comprehensive and Rigorous Framework for Reproducible Recommender Systems Evaluation"

Zhu Sun et al., "DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation"
