Week 07 and 08
During weeks 7 and 8, I implemented the pre-processing and splitting methods of the framework. In summary, I implemented:
- Pre-processing methods: rating binarization and filtering by rating or by k-core.
- Splitting methods: random by ratio, timestamp by ratio, fixed timestamp, and k-fold, all of them with global or user-level splitting.
In a recommender system experiment pipeline, the pre-processing step is a model-independent hyper-factor that can lead to different recommender performances. The goal of this framework is to implement the most commonly used pre-processing methods, as identified by the Elliot and DaisyRec literature reviews.
For now, the binarization and main filtering methods are available in the framework.
In the .yaml file, the directive preprocess is used to define the list of pre-processing methods to be performed during the experiment pipeline. The pre-processing step can be configured as:
experiment:
  preprocess:
    - method: method1
      parameters:
        parameter_1: 3
        parameter_2: val
    - method: method2
      parameters:
        parameter_1: 4
Where:
- preprocess: specifies a list of pre-processing methods (optional)
- method: method name (mandatory)
- parameters: method parameters in the format parameter_name: parameter_value
The pre-processing methods are implemented in the framework/dataloader/preprocess/ subpackage.
Binarization is a data transformation step that converts explicit feedback, such as ratings or counts, into implicit feedback (positive or negative). It requires a threshold parameter that indicates, for each rating, whether it should be considered positive or negative.
In the .yaml file, the method name is binarize, and the only parameter is the threshold number. Example:
experiment:
  preprocess:
    - method: binarize
      parameters:
        threshold: 4
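Conceptually, binarization is a single comparison against the threshold. Below is a minimal pandas sketch of the idea; the DataFrame layout, the column names, and the choice to map ratings at or above the threshold to 1 (rather than dropping negatives) are assumptions for illustration, not the framework's actual code.

```python
import pandas as pd

def binarize(ratings: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Turn explicit ratings into implicit positive/negative feedback.

    Assumption: ratings >= threshold become positive (1), the rest
    negative (0); the framework may handle negatives differently.
    """
    out = ratings.copy()
    out["rating"] = (out["rating"] >= threshold).astype(int)
    return out

# Toy usage: with threshold=4 the 5.0 and 4.0 ratings become 1, the 2.0 becomes 0.
toy = pd.DataFrame({"user": [1, 1, 2], "item": [10, 20, 10], "rating": [5.0, 2.0, 4.0]})
print(binarize(toy, threshold=4))
```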
In general, the original dataset is sparse: some items are rarely interacted with, and some users interact with only a few items. To address this, filtering strategies can be used to remove inactive users or items. As indicated by the literature, the most commonly used methods are filter by rating and k-core.
The k-core filter removes users and items with fewer than k interactions. It can be applied iteratively, until the condition is met (all users/items have at least k interactions) or until a specified number of iterations is reached.
In the .yaml file, the method name is filter_kcore, and the parameters are k, the number of iterations, and the target type of node (user or item). Example:
experiment:
  # ...
  preprocess:
    - method: filter_kcore
      parameters:
        k: 20
        iterations: 3
        target: user # user or item
Filter by rating is a specific case of k-core with only one iteration; it can be selected simply by setting iterations: 1. The sketch below illustrates the iterative filtering idea.
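The following is a short pandas sketch of iterative k-core filtering. The column names are assumptions, and for simplicity it filters both users and items in every pass instead of restricting itself to a single target node type as the framework's filter_kcore does.

```python
import pandas as pd

def kcore_filter(ratings: pd.DataFrame, k: int, iterations: int) -> pd.DataFrame:
    """Repeatedly drop users and items with fewer than k interactions.

    Removing sparse users can push items below k interactions (and vice
    versa), so the filter is re-applied until nothing changes or the
    iteration budget is exhausted.
    """
    for _ in range(iterations):
        before = len(ratings)
        user_counts = ratings["user"].value_counts()
        ratings = ratings[ratings["user"].isin(user_counts[user_counts >= k].index)]
        item_counts = ratings["item"].value_counts()
        ratings = ratings[ratings["item"].isin(item_counts[item_counts >= k].index)]
        if len(ratings) == before:  # condition met: every user/item has >= k interactions
            break
    return ratings
```

Setting iterations to 1 turns this into the plain filter-by-rating behaviour described above.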
This step of the experiment pipeline splits the data into training, validation, and test sets. According to the literature, the most commonly used methods are Random by Ratio, Timestamp by Ratio, Fixed Timestamp, and K-fold. The framework implements these splitting methods and uses an Edge Splitter to split the graph into training, test, and validation sets.
In the .yaml file, the directive split is used to define the split method. For example:
experiment:
  # ...
  split:
    seed: 42
    test:
      method: method1_name
      parameter1_name: 0.2
      parameter2_name: value
    validation:
      method: method2_name
      parameter1_name: value_2
      parameter2_name: 100
Where:
- split: specifies the splitting method used (mandatory)
- seed: random seed value for reproducibility
- test: directive for the test split
- validation: directive for the validation split
- method: splitting method name (mandatory)
- parameters given as a dictionary where the key is the splitting method parameter name and the value is the corresponding value of that parameter, e.g. parameter1: value1
The split method is implemented in the framework/dataloader/edge_splitter/ subpackage. The main class is EdgeSplitter(), whose split() method returns:
- G_train: the training graph, obtained after removing the held-out ratings;
- ratings_test: a dictionary where the key is an ItemNode and the value is a list of tuples [(ItemNode, rating: float)].
All the split data is stored in a Dataset instance.
Random by ratio shuffles the ratings and extracts a proportion p of them as the test set. This can be done at the global or user level: the former means the split is performed over all ratings at once; the latter means that, for each user, a proportion p of that user's ratings is extracted.
In the .yaml file, the method name is random_by_ratio, and the parameters are:
- p: test set proportion (mandatory)
- level: global or user level (mandatory)
Example:
experiment:
  # ...
  split:
    seed: 42
    test:
      method: random_by_ratio
      level: global # or user
      p: 0.2
    validation:
      method: random_by_ratio
      level: global # or user
      p: 0.2
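The difference between the two levels is easiest to see in code. The sketch below only illustrates the behaviour under assumed column names; it is not the framework's EdgeSplitter implementation.

```python
import pandas as pd

def random_by_ratio(ratings: pd.DataFrame, p: float, level: str, seed: int = 42):
    """Randomly hold out a proportion p of the ratings as the test set."""
    if level == "global":
        # sample p of all ratings, regardless of which user they belong to
        test = ratings.sample(frac=p, random_state=seed)
    else:
        # sample p of each user's ratings, so every user appears in the test set
        test = ratings.groupby("user").sample(frac=p, random_state=seed)
    train = ratings.drop(test.index)
    return train, test
```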
Timestamp by ratio orders the ratings by timestamp and extracts a proportion p of the most recent ones as the test set, again at the global or user level. In the .yaml file, the method name is timestamp_by_ratio, and the parameters are the same as for Random by Ratio. Example:
experiment:
  # ...
  split:
    test:
      method: timestamp_by_ratio
      level: user # or global
      p: 0.1
    validation:
      method: timestamp_by_ratio
      level: user # or global
      p: 0.2
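A sketch of the temporal variant under the same assumed column names as before; keeping at least one test rating per user at the user level is an extra assumption made here for robustness, not necessarily what the framework does.

```python
import pandas as pd

def timestamp_by_ratio(ratings: pd.DataFrame, p: float, level: str):
    """Hold out the most recent proportion p of ratings as the test set."""
    ratings = ratings.sort_values("timestamp")
    if level == "global":
        cut = len(ratings) - int(len(ratings) * p)  # rows after `cut` are the newest
        return ratings.iloc[:cut], ratings.iloc[cut:]
    # user level: the last p of each user's chronologically ordered ratings
    test = ratings.groupby("user", group_keys=False).apply(
        lambda g: g.tail(max(1, int(len(g) * p)))
    )
    return ratings.drop(test.index), test
```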
Fixed timestamp splits the data between train and test sets at a fixed timestamp: all ratings before the timestamp are used as training data and the rest as test data.
In the .yaml file, the method name is fixed_timestamp. The only parameter is the timestamp number. Example:
experiment:
  # ...
  split:
    test:
      method: fixed_timestamp
      timestamp: 890000000
    validation:
      method: fixed_timestamp
      timestamp: 880000000
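This split reduces to a single boolean filter. In the sketch below, which way the boundary value itself falls (train or test) is an assumption.

```python
import pandas as pd

def fixed_timestamp_split(ratings: pd.DataFrame, timestamp: int):
    """Ratings strictly before `timestamp` are training data, the rest are test data."""
    train = ratings[ratings["timestamp"] < timestamp]
    test = ratings[ratings["timestamp"] >= timestamp]
    return train, test
```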
In K-fold cross-validation, the dataset is divided into K
equally sized subsets or folds. The model is trained and evaluated K
times, each time using a different fold as the test set while using the remaining folds for training. The results are then averaged to obtain an overall performance measure.
In the .yaml file, the method name is k_fold. The parameters are:
- k: number of folds (mandatory)
- level: global or user level (mandatory)
Note: This method does not support validation splitting.
Example:
experiment:
  # ...
  split:
    test:
      method: k_fold
      k: 3
      level: 'user'
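The sketch below illustrates the two levels: global shuffles all rows into k folds, while user level spreads each user's ratings across the folds so every fold contains a slice of every user. The column names and the exact fold-assignment scheme are assumptions, not the framework's code.

```python
import numpy as np
import pandas as pd

def k_fold(ratings: pd.DataFrame, k: int, level: str, seed: int = 42):
    """Yield (train, test) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    if level == "global":
        fold_of_row = rng.permutation(len(ratings)) % k
    else:
        # assign fold ids user by user, so each user's ratings are balanced across folds
        fold_of_row = np.empty(len(ratings), dtype=int)
        for _, positions in ratings.groupby("user").indices.items():
            fold_of_row[positions] = rng.permutation(len(positions)) % k
    for fold in range(k):
        mask = fold_of_row == fold
        yield ratings[~mask], ratings[mask]
```

Note that this yields k train/test pairs and nothing else, consistent with the method not supporting a separate validation split.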
Next step: implement a Recommender System model and add it to the experiment pipeline.
Zhu Sun et al., "DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation"