
Week 01

AlvaroJoseLopes edited this page Jul 18, 2023 · 6 revisions

TL;DR

This week I devoted my time to the Entity Linking step of Data Integration. The goal of this step is to match each item of an RS dataset with its corresponding DBpedia resource. I was able to code:

  • Bash scripts to download the supported datasets (inside datasets/ folder).
  • A data integration CLI script (data_integration.py) that converts item, user, and rating data into standardized .csv files. This script can also map each item of the chosen dataset to its corresponding DBpedia URI through SPARQL queries.
  • The Data Integration module (data_integration/), which contains the Dataset base class from which each RS dataset class is derived. Through the methods this class provides, the data integration script can convert an RS dataset into a standardized .csv file and match each item with a DBpedia resource.
  • The MovieLens class, derived from Dataset, which loads the dataset and provides the main methods for handling its specificities when converting MovieLens data to a standardized .csv and matching items with DBpedia. MovieLens-100k and MovieLens-1M are supported by the classes MovieLens100k and MovieLens1M, respectively.
  • Initial results:
    • MovieLens-100k: 1462 matches out of a total of 1681 movies (86.9%).
    • MovieLens-1M: 3356 matches out of a total of 3883 movies (86.5%).

What was done

As proposed in the project timeline, this first week was devoted to the Entity Linking step of the Data Integration module. The main contributions were:

  • Implemented Bash scripts to easily download the supported datasets;
  • Implemented the Entity Linking functionality in the Data Integration script (mapping with DBpedia);
  • Started implementing Entity-Linking-related methods for datasets inside the data_integration module;
  • Mapped MovieLens-100k and MovieLens-1M items with DBpedia.

Downloading supported RS datasets

For each supported dataset, a Bash script is provided to easily download the full dataset. The script uses wget to fetch the dataset from its official source and verifies a checksum to confirm the download completed correctly.

The implemented scripts live in the datasets/ folder. For now, only MovieLens-100k and MovieLens-1M are supported. Each full dataset is placed into a new folder inside datasets/ with the same name as its script file.

Usage example:

cd datasets
bash ml-100k.sh # Downloaded at `datasets/ml-100k` folder
bash ml-1m.sh   # Downloaded at `datasets/ml-1m` folder

Data Integration Script

The goal of this script is to provide a simple CLI for Data Integration between RS datasets and DBpedia. This script will allow:

  1. Converting item, user and rating data to a standardized .csv file.
  2. Entity Linking each item of the dataset with its corresponding DBpedia resource, i.e., mapping each item id to its DBpedia URI.
  3. Enriching item information with useful resources from DBpedia.

During this week, points 1 and 2 were addressed.

How to use

python3 data_integration.py [-h] -d DATASET -i INPUT_PATH -o OUTPUT_PATH [-ci] [-cu] [-cr] [-map]

Arguments:

  • -h: Shows the help message.
  • -d: Name of a supported dataset; it is the same as the name of the folder created by the dataset's bash script. For now, check data_integration/dataset2class.py to see the supported ones.
  • -i: Input path where the full dataset is placed.
  • -o: Output path where the integrated dataset will be placed.
  • -ci: Use this flag if you want to convert item data.
  • -cu: Use this flag if you want to convert user data.
  • -cr: Use this flag if you want to convert rating data.
  • -map: Use this flag if you want to map dataset items to DBpedia. The item data must already have been converted.

Usage Example:

python3 data_integration.py -d 'ml-100k' -i 'datasets/ml-100k' -o 'datasets/ml-100k/processed' -ci -cu -cr -map

Check Makefile for more examples.

Data Integration Module

This module (data_integration) includes the main classes and methods provided to the Data Integration script. For now, it contains only the methods needed for entity linking and for converting dataset info (user, item, and rating).

The file dataset.py contains the Dataset base class from which each RS dataset class is derived. Each subclass of Dataset must override the following methods, accounting for the dataset's specificities:

  • load_item_data(): should return a pd.DataFrame containing each item's info.
  • load_user_data(): should return a pd.DataFrame containing each user's info.
  • load_rating_data(): should return a pd.DataFrame containing each rating's info.
  • entity_linking(df_item): takes as argument a pd.DataFrame corresponding to the item data and returns a pd.DataFrame containing each item_id and its mapped URI.
  • get_query_params(): should return a Python dictionary used to substitute the query_template placeholders (string.Template).

Some other attributes of Dataset must also be provided, such as which features will be extracted from the dataset; these are indicated in data_integration/dataset.py.

The fields item_id, user_id, rating and URI are reserved and should be used as standard for all datasets.
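
A minimal sketch of how a Dataset subclass fits together, using the reserved field names above. The ToyDataset class and its hard-coded items are hypothetical, and plain lists of dicts stand in for the pandas DataFrames the real methods return, to keep the sketch dependency-free:

```python
from abc import ABC, abstractmethod

class Dataset(ABC):
    """Simplified stand-in for the base class in data_integration/dataset.py."""

    @abstractmethod
    def load_item_data(self):
        """Return one record per item."""

    @abstractmethod
    def entity_linking(self, items):
        """Map each item_id to a DBpedia URI."""

class ToyDataset(Dataset):
    """Hypothetical dataset with two hard-coded movies."""

    def load_item_data(self):
        # 'item_id' is one of the reserved standard field names.
        return [{'item_id': 1, 'title': 'Toy Story (1995)'},
                {'item_id': 2, 'title': 'GoldenEye (1995)'}]

    def entity_linking(self, items):
        # A real subclass issues SPARQL queries; the mapping is faked here.
        fake = {1: 'http://dbpedia.org/resource/Toy_Story',
                2: 'http://dbpedia.org/resource/GoldenEye_(film)'}
        # 'URI' is the reserved field name for the mapped resource.
        return [{'item_id': it['item_id'], 'URI': fake[it['item_id']]}
                for it in items]

ds = ToyDataset()
mapped = ds.entity_linking(ds.load_item_data())
```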

The file dataset2class.py contains a dictionary mapping each dataset name with their corresponding submodule and class. Example:

dataset2class = {
    # Dataset name
    'ml-100k': { 
        'submodule': 'movielens', # the class lives in movielens.py
        'class': 'MovieLens100k' # class name
    },
    ...
}
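
The script can turn such an entry into an actual class object with importlib. A minimal sketch (the resolve() helper is hypothetical, and the runnable demonstration below resolves a standard-library class instead of the real data_integration package):

```python
import importlib

# Mirror of the dict in dataset2class.py (only the ml-100k entry shown).
dataset2class = {
    'ml-100k': {
        'submodule': 'movielens',   # the class lives in movielens.py
        'class': 'MovieLens100k',   # class name
    },
}

def resolve(entry, package=None):
    """Turn a {'submodule', 'class'} entry into the class object itself."""
    name = entry['submodule'] if package is None else f"{package}.{entry['submodule']}"
    module = importlib.import_module(name)
    return getattr(module, entry['class'])

# Demonstration against the standard library, so the sketch is runnable:
demo_entry = {'submodule': 'collections', 'class': 'Counter'}
Counter = resolve(demo_entry)
```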

Finally, movielens.py contains the MovieLens class derived from Dataset, which overrides the required methods according to this dataset's specificities.

MovieLens Entity Linking baseline

When mapping an RS dataset to DBpedia resources, it is important to analyse the fields the dataset provides and find the ones that could be useful for matching.

In the case of MovieLens, the most important field is the movie title. The dataset documentation indicates that titles are identical to those provided by IMDb (including the year of release), following the pattern:

  • "$MOVIE_TITLE ($YEAR_OF_RELEASE)"
  • Examples: "Toy Story (1995)", "GoldenEye (1995)", "Grumpier Old Men (1995)".

The baseline for MovieLens consists of:

  1. Extracting the movie title and release year. Check the _extract_title() and _extract_year() methods for more details.
  2. Matching the extracted title against the rdfs:label of DBpedia resources, using a regex. See the SPARQL query below for more details.
  3. Using the dbo:wikiPageRedirects property to reach other labels that also refer to the same resource.
  4. When the query returns more than one URI, choosing the most similar one by Levenshtein distance.
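
Steps 1 and 4 can be sketched as follows. The function names mirror the ones above but the bodies are simplified, and difflib's similarity ratio is used as a standard-library stand-in for Levenshtein distance:

```python
import re
from difflib import SequenceMatcher

def extract_title_and_year(raw):
    """Step 1: rough equivalent of _extract_title()/_extract_year()."""
    m = re.match(r'^(?P<title>.*)\s+\((?P<year>\d{4})\)$', raw)
    return m.group('title'), m.group('year')

def best_uri(title, candidate_uris):
    """Step 4: when several URIs match, pick the one whose last path
    segment is most similar to the extracted title."""
    def similarity(uri):
        label = uri.rsplit('/', 1)[-1].replace('_', ' ')
        return SequenceMatcher(None, title.lower(), label.lower()).ratio()
    return max(candidate_uris, key=similarity)

title, year = extract_title_and_year('Toy Story (1995)')
chosen = best_uri(title, ['http://dbpedia.org/resource/Toy_Soldiers',
                          'http://dbpedia.org/resource/Toy_Story'])
```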

To reduce the search space, the SPARQL query uses:

  • the type dbo:Film, to search only among film resources;
  • the DBpedia category dbc:XYZT_films, to search only among films released in the year XYZT.

The query template is:

PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?film WHERE {
    {
        ?film rdf:type dbo:Film .
        ?film dct:subject $year_category .
        ?film rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
    UNION
    {
        ?film rdf:type dbo:Film .
        ?film dct:subject $year_category .
        ?tmp dbo:wikiPageRedirects ?film .
        ?tmp rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i") .
    }
}

The placeholder $name_regex should be replaced with a regex for the movie title to match the label, and $year_category with the corresponding DBpedia category dbc:XYZT_films.

Considering the title "Toy Story (1995)", the get_map_query() return should be:

params = {
    'name_regex': '^Toy.*Story',
    'year_category': 'dbr:Category:1995_films'
}
return self.map_query_template.substitute(**params)
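
Under the hood this is plain string.Template substitution. A minimal runnable sketch, with the full SPARQL template above abbreviated to the two lines that contain placeholders:

```python
from string import Template

# Abbreviated stand-in for the full map_query_template shown above.
map_query_template = Template(
    '?film dct:subject $year_category . '
    'FILTER regex(?label, "$name_regex", "i")'
)

params = {
    'name_regex': '^Toy.*Story',
    'year_category': 'dbr:Category:1995_films',
}
query = map_query_template.substitute(**params)
```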

Some other MovieLens details

I noticed an inconsistency between the year extracted from the movie title and the release date provided by MovieLens-100k. For example, "Braveheart (1995)" is listed as released on 16-Feb-1996.

Some movies have their names in other languages in parentheses, like "Blue Angel, The (Blaue Engel, Der) (1930)".

Other movies follow the pattern "Lion King, The (1994)" instead of "The Lion King (1994)", which is closer to the DBpedia resource The_Lion_King.
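
One way to handle both quirks (a hypothetical helper, not the project's code) is to strip parenthesized alternate titles and move a trailing article back to the front before building the query regex:

```python
import re

# Articles that MovieLens moves to the end of the title (assumed list).
ARTICLES = ('The', 'A', 'An', 'Der', 'Die', 'Das', 'Le', 'La', 'Les')

def normalize_title(raw):
    """'Lion King, The (1994)' -> 'The Lion King (1994)'."""
    m = re.match(r'^(?P<title>.*)\s+\((?P<year>\d{4})\)$', raw)
    title, year = m.group('title'), m.group('year')
    # Drop alternate-language titles such as '(Blaue Engel, Der)'.
    title = re.sub(r'\s*\([^)]*\)', '', title).strip()
    # Move a trailing article ('Lion King, The') to the front.
    for art in ARTICLES:
        if title.endswith(f', {art}'):
            title = f'{art} {title[:-len(art) - 2]}'
            break
    return f'{title} ({year})'
```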

Initial results

These are the initial matching results:

  • MovieLens-100k: 1462 matches out of a total of 1681 movies (86.9%);
  • MovieLens-1M: 3356 matches out of a total of 3883 movies (86.5%).

Future work

The next step is to entity-link the other datasets.

The queries are currently run sequentially; one possible future work is to parallelize them.
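
Since each SPARQL round-trip is I/O-bound, a thread pool would be a simple way to do this. A sketch under assumed names (link_item() is a placeholder for the real query call, which would POST the substituted query to the DBpedia endpoint):

```python
from concurrent.futures import ThreadPoolExecutor

def link_item(item):
    """Placeholder for one SPARQL round-trip; the mapping is faked here."""
    uri = 'http://dbpedia.org/resource/' + item['title'].replace(' ', '_')
    return item['item_id'], uri

items = [{'item_id': 1, 'title': 'Toy Story'},
         {'item_id': 2, 'title': 'GoldenEye'}]

# Keep the pool small to stay within the public endpoint's rate limits.
with ThreadPoolExecutor(max_workers=4) as pool:
    mapping = dict(pool.map(link_item, items))
```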