Week 01
This week I devoted my time to the Entity Linking step of Data Integration. The goal of this step is to match each item of an RS dataset with its corresponding DBpedia resource. I was able to code:
- Bash scripts to download the supported datasets (inside the `datasets/` folder).
- A data integration CLI script (`data_integration.py`) that can convert item, user, and rating data into a standardized .csv file. This script can also map each item of the chosen dataset to its corresponding DBpedia URI through SPARQL queries.
- The Data Integration module (`data_integration/`), which has the `Dataset` base class from which each RS dataset class will be derived. Through the methods provided by this class, the data integration script is able to convert an RS dataset into a standardized .csv file and match each item with DBpedia resources.
- The `MovieLens` class, derived from `Dataset`, which loads the dataset and provides the main methods responsible for the specificities of converting the MovieLens dataset to a standardized .csv and matching its items with DBpedia. MovieLens-100k and MovieLens-1M are supported by the classes `MovieLens100k` and `MovieLens1M`, respectively.
- Initial results:
  - MovieLens-100k: 1462 matches out of a total of 1681 movies (86.9%).
  - MovieLens-1M: 3356 matches out of a total of 3883 movies (86.5%).
As proposed in the project timeline, this first week was devoted to the Entity Linking step of the Data Integration module. The main contributions were:
- Implemented Bash scripts to easily download the supported datasets;
- Implemented the Entity Linking functionality in the Data Integration script (mapping with DBpedia);
- Started implementing the Entity-Linking-related methods for datasets inside the `data_integration` module;
- Mapped MovieLens-100k and MovieLens-1M items with DBpedia.
For each supported dataset, a Bash script is provided to easily download the full dataset. In this script, `wget` is used to download the dataset from its official source, and a checksum is verified to check that the dataset was downloaded correctly.
The implemented scripts are placed in the `datasets/` folder. For now, only MovieLens-100k and MovieLens-1M are supported. The full dataset will be placed into a new folder inside `datasets/` with the same name as the script file.
Usage example:
```bash
cd datasets
bash ml-100k.sh  # Downloaded at `datasets/ml-100k` folder
bash ml-1m.sh    # Downloaded at `datasets/ml-1m` folder
```
The goal of this script is to provide a simple CLI for Data Integration between RS datasets and DBpedia. This script will allow:
1. Converting item, user, and rating data to a standardized .csv file.
2. Entity Linking each item of the dataset with its corresponding DBpedia resource, i.e., mapping each item id to its DBpedia URI.
3. Enriching item information with useful resources from DBpedia.
During this week, points 1 and 2 were addressed.
```bash
python3 data_integration.py [-h] -d DATASET -i INPUT_PATH -o OUTPUT_PATH [-ci] [-cu] [-cr] [-map]
```
Arguments:
- `-h`: Shows the help message.
- `-d`: Name of a supported dataset. It will be the same name as the folder created by the Bash script provided for the dataset. For now, check `data_integration/dataset2class.py` to see the supported ones.
- `-i`: Input path where the full dataset is placed.
- `-o`: Output path where the integrated dataset will be placed.
- `-ci`: Use this flag if you want to convert item data.
- `-cu`: Use this flag if you want to convert user data.
- `-cr`: Use this flag if you want to convert rating data.
- `-map`: Use this flag if you want to map dataset items with DBpedia. At least the item data should already be converted.
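For reference, a minimal sketch of how this interface could be declared with `argparse` (the actual script may be organized differently; only the flag names above are taken from the real CLI):
```python
import argparse

# Sketch of the CLI declaration; flag names follow the argument list above.
parser = argparse.ArgumentParser(
    description='Data Integration between RS datasets and DBpedia')
parser.add_argument('-d', dest='dataset', required=True,
                    help='Name of a supported dataset (e.g. ml-100k)')
parser.add_argument('-i', dest='input_path', required=True,
                    help='Input path where the full dataset is placed')
parser.add_argument('-o', dest='output_path', required=True,
                    help='Output path for the integrated dataset')
parser.add_argument('-ci', action='store_true', help='Convert item data')
parser.add_argument('-cu', action='store_true', help='Convert user data')
parser.add_argument('-cr', action='store_true', help='Convert rating data')
parser.add_argument('-map', action='store_true',
                    help='Map dataset items with DBpedia')
args = parser.parse_args()
```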
Usage example:
```bash
python3 data_integration.py -d 'ml-100k' -i 'datasets/ml-100k' -o 'datasets/ml-100k/processed' -ci -cu -cr -map
```
Check the Makefile for more examples.
This module (`data_integration/`) includes the main classes and methods provided to the Data Integration script. For now, it contains only the methods necessary for entity linking and for converting dataset info (user, item, and rating).
The file `dataset.py` contains the `Dataset` base class from which each RS dataset class will be derived. Each extension of the `Dataset` class will need to override the following methods, considering each dataset's specificities:
- `load_item_data()`: should return a `pd.DataFrame` containing each item's info.
- `load_user_data()`: should return a `pd.DataFrame` containing each user's info.
- `load_rating_data()`: should return a `pd.DataFrame` containing each rating's info.
- `entity_linking(df_item)`: takes as argument a `pd.DataFrame` corresponding to the item data and returns a `pd.DataFrame` containing each item_id and its mapped URI.
- `get_query_params()`: should return a Python dictionary used to substitute the query template placeholders (`string.Template`).
Also, some other attributes of `Dataset` should be provided, such as which features will be extracted from the dataset, as indicated in `data_integration/dataset.py`.
The fields item_id, user_id, rating, and URI are reserved and should be used as the standard for all datasets.
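As an illustration, a skeleton of what a new dataset class could look like (a minimal sketch: the import path is inferred from the module layout above, and `self.input_path` and the file names are hypothetical; the real contract is defined in `data_integration/dataset.py`):
```python
import pandas as pd

from data_integration.dataset import Dataset  # path assumed from the layout above


class MyDataset(Dataset):
    """Hypothetical RS dataset, shown only to illustrate the contract."""

    def load_item_data(self) -> pd.DataFrame:
        # Must expose the reserved column 'item_id'.
        return pd.read_csv(f'{self.input_path}/items.csv')   # hypothetical layout

    def load_user_data(self) -> pd.DataFrame:
        # Must expose the reserved column 'user_id'.
        return pd.read_csv(f'{self.input_path}/users.csv')

    def load_rating_data(self) -> pd.DataFrame:
        # Reserved columns: 'user_id', 'item_id', 'rating'.
        return pd.read_csv(f'{self.input_path}/ratings.csv')

    def entity_linking(self, df_item: pd.DataFrame) -> pd.DataFrame:
        # Should return a DataFrame with the columns 'item_id' and 'URI'.
        ...

    def get_query_params(self) -> dict:
        # Placeholders consumed by the string.Template query template.
        return {'name_regex': '...', 'year_category': '...'}
```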
The file `dataset2class.py` contains a dictionary mapping each dataset name to its corresponding submodule and class. Example:
```python
dataset2class = {
    # Dataset name
    'ml-100k': {
        'submodule': 'movielens',  # the class should be in the file movielens.py
        'class': 'MovieLens100k'   # class name
    },
    ...
}
```
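With this dictionary, the script can resolve a dataset name to its class dynamically. A possible sketch using `importlib` (the actual resolution code in the script may differ):
```python
import importlib

from data_integration.dataset2class import dataset2class  # import path assumed


def get_dataset_class(dataset_name: str):
    """Resolve a dataset name (e.g. 'ml-100k') to its class via dataset2class."""
    entry = dataset2class[dataset_name]
    # e.g. imports data_integration.movielens and fetches MovieLens100k from it
    submodule = importlib.import_module(f"data_integration.{entry['submodule']}")
    return getattr(submodule, entry['class'])
```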
Finally, `movielens.py` contains the `MovieLens` class, derived from `Dataset`, which overrides the required methods considering this dataset's specificities.
When mapping an RS dataset to DBpedia resources, it's important to analyse the fields provided by the dataset and find the ones that could be useful for the mapping.
In the case of MovieLens, the most important field is the movie title. The dataset documentation indicates that the titles are identical to those provided by IMDb (including year of release), which follow the pattern:
- "$MOVIE_TITLE ($YEAR_OF_RELEASE)"
- Examples: "Toy Story (1995)", "GoldenEye (1995)", "Grumpier Old Men (1995)".
The baseline for MovieLens consists of:
1. Extracting the title and year of the film. Check the `_extract_title()` and `_extract_year()` methods for more details.
2. Matching the extracted title against the rdfs:label of DBpedia's URI, using regex. See the SPARQL query below for more details.
3. Using the dbo:wikiPageRedirects property to reach other labels that also refer to the same resource.
4. When the query returns more than one URI, using the Levenshtein distance to choose the most similar one (see the sketch after this list).
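A rough sketch of steps 1 and 4 (the real `_extract_title()`, `_extract_year()`, and disambiguation code live in `movielens.py` and may differ in detail; the helper names here are illustrative):
```python
import re


def extract_title_and_year(raw_title: str):
    """Split a MovieLens title like 'Toy Story (1995)' into ('Toy Story', 1995)."""
    match = re.match(r'^(?P<title>.*)\s+\((?P<year>\d{4})\)\s*$', raw_title)
    if match is None:
        return raw_title, None
    return match.group('title'), int(match.group('year'))


def title_to_regex(title: str) -> str:
    """Build a loose label regex, e.g. 'Toy Story' -> '^Toy.*Story'."""
    words = [re.escape(word) for word in title.split()]
    return '^' + '.*'.join(words)


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def pick_best_uri(title: str, candidate_uris: list) -> str:
    """When the query returns several URIs, keep the label closest to the title."""
    def uri_label(uri: str) -> str:
        # e.g. 'http://dbpedia.org/resource/Toy_Story' -> 'Toy Story'
        return uri.rsplit('/', 1)[-1].replace('_', ' ')
    return min(candidate_uris, key=lambda uri: levenshtein(title, uri_label(uri)))
```
For instance, `extract_title_and_year("Toy Story (1995)")` yields `('Toy Story', 1995)`, and `title_to_regex('Toy Story')` yields `'^Toy.*Story'`, matching the query parameters shown below.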
To reduce the search space, the SPARQL query uses:
- the type dbo:Film to search only over film resources;
- the DBpedia category dbc:XYZT_films to search only for films released in the year XYZT.
The query template is:
```sparql
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?film WHERE {
    {
        ?film rdf:type dbo:Film .
        ?film dct:subject $year_category .
        ?film rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
    UNION
    {
        ?film rdf:type dbo:Film .
        ?film dct:subject $year_category .
        ?tmp dbo:wikiPageRedirects ?film .
        ?tmp rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i") .
    }
}
```
The placeholder $name_regex should be replaced with the movie-title regex to match against the label, and $year_category should be replaced with the corresponding DBpedia category dbc:XYZT_films.
Considering the title "Toy Story (1995)", the `get_map_query()` return should be:
```python
params = {
    'name_regex': '^Toy.*Story',
    'year_category': 'dbr:Category:1995_films'
}
return self.map_query_template.substitute(**params)
```
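For completeness, the substituted query could then be sent to the public DBpedia endpoint, for example with SPARQLWrapper (a sketch only; whether the project uses this library and endpoint is an assumption):
```python
from SPARQLWrapper import SPARQLWrapper, JSON

# `query` stands for the string returned by map_query_template.substitute(**params).
query = '...'

sparql = SPARQLWrapper('https://dbpedia.org/sparql')
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Collect the candidate film URIs returned by SELECT DISTINCT ?film.
candidate_uris = [binding['film']['value']
                  for binding in results['results']['bindings']]
```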
I noticed that there is an inconsistency between the year extracted from the movie title and the release date provided by MovieLens-100k. For example, "Braveheart (1995)" is indicated as released on 16-Feb-1996.
Some movies have their names in other languages in parentheses, like "Blue Angel, The (Blaue Engel, Der) (1930)".
Other movies follow a pattern like "Lion King, The (1994)" instead of "The Lion King (1994)", which is closer to its DBpedia resource The_Lion_King.
These are the initial results when matching:
- MovieLens-100k: 1462 matches out of a total of 1681 movies (86.9%);
- MovieLens-1M: 3356 matches out of a total of 3883 movies (86.5%).
The next step is to entity link the other datasets.
The queries are run sequentially; one possible piece of future work is to parallelize them, as sketched below.
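As a rough illustration of that idea, the per-item queries could be fanned out with a thread pool (a sketch only: `run_query` is a hypothetical callable performing one SPARQL lookup, and rate limits on the public endpoint would need care):
```python
from concurrent.futures import ThreadPoolExecutor


def link_all(items, run_query, max_workers=4):
    """Run run_query(item) for each item concurrently.

    run_query is hypothetical: it performs the SPARQL lookup for one item
    and returns (item_id, uri). A small pool keeps endpoint load modest.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_query, items))
```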