Week 02
This week I also devoted my time to the Entity linking step of Data Integration. I was able to:
- Implement the `LastFM` class and its entity-linking methods to match each LastFM item with DBpedia;
- Implement the `Dataset.parallel_queries()` method to run web requests to the SPARQL endpoint in parallel, using Python threads. Previously the queries were executed sequentially, so this significantly reduced the total time needed to entity link all datasets.
LastFM matching result:
- 11815 matches out of a total of 17632 items (67%)
The dataset used can be found in LastFM-HetRec2011.
The `LastFM` class, derived from `Dataset`, implements the methods needed to convert item data to a standardized .csv file (to be done: converting user and rating data) and to match each artist/band with its corresponding DBpedia URI.
In the case of LastFM, the most important field is the artist/band name. The baseline for this dataset is similar to the MovieLens one, consisting of:
- Match the extracted artist/band name against the rdfs:label of DBpedia resources, using a regex. See the SPARQL query below for more details.
- Use the types dbo:MusicalArtist and dbo:Band to restrict the search to resources related to musical artists and bands.
- When the query returns more than one URI, use the Levenshtein distance to choose the most similar one, as sketched after this list.
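To illustrate that last step, here is a minimal sketch of choosing among candidate URIs by Levenshtein distance. The helper names (`levenshtein`, `best_match`) and the example candidates are assumptions for illustration, not the project's actual code:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def best_match(name: str, candidates: dict) -> str:
    # candidates maps URI -> rdfs:label; pick the URI whose label is
    # closest to the extracted name.
    return min(candidates,
               key=lambda uri: levenshtein(name.lower(), candidates[uri].lower()))

# Example: two candidate URIs returned for "Daft Punk"
uris = {
    'http://dbpedia.org/resource/Daft_Punk': 'Daft Punk',
    'http://dbpedia.org/resource/Daft_Punk_discography': 'Daft Punk discography',
}
print(best_match('Daft Punk', uris))  # -> http://dbpedia.org/resource/Daft_Punk
```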
The query template is:

```sparql
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?artist WHERE {
    {
        ?artist rdf:type dbo:MusicalArtist .
        ?artist rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
    UNION
    {
        ?artist rdf:type dbo:Band .
        ?artist rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
}
```
The placeholder $name_regex should be replaced with the artist/band name regex to match the label.
Considering the band "Daft Punk", the `get_map_query()` implementation builds and returns the query like this:

```python
def get_map_query(self) -> str:
    # Substitute the item's name regex into the SPARQL template above
    # (map_query_template is presumably a string.Template, given .substitute()).
    params = {
        'name_regex': '^Daft.*Punk',
    }
    return self.map_query_template.substitute(**params)
```
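One plausible way to build `name_regex` from an item name (the `build_name_regex` helper below is hypothetical, not part of the project) is to escape each word and join the words with `.*`:

```python
import re

def build_name_regex(name: str) -> str:
    # Hypothetical helper: escape each word and allow anything between
    # words, so "Daft Punk" becomes "^Daft.*Punk".
    words = [re.escape(word) for word in name.split()]
    return '^' + '.*'.join(words)

assert build_name_regex('Daft Punk') == '^Daft.*Punk'
```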
Previously the queries were executed sequentially, which performed poorly on datasets with a large number of items, like LastFM (17632 items).
Python provides two built-in ways to parallelize code: threading and multiprocessing. Each has its advantages and disadvantages, which makes them suitable for different types of applications: threading suits IO-bound tasks well, while multiprocessing better suits CPU-bound tasks.
The SPARQL queries in this project are IO-bound. Therefore, the chosen solution was to use threads to implement parallel web requests to the SPARQL endpoint.
The `Worker` class is used to instantiate each thread that runs inside the `Dataset.parallel_queries()` method.
Each worker stores a reference to the same queue, which contains the queries to be executed. More precisely, the queue stores tuples with the item_id and the SPARQL query string to be performed.
All workers consume that queue, send requests to the SPARQL endpoint, and store the item_id and query result in a list of tuples. See `Worker.run()` for more details.
After all requests are done, the query results are combined into a single list of tuples, which is returned by `Dataset.parallel_queries(queue)`. Finally, the query results can be processed to determine the best URI for each match.
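A minimal sketch of this worker/queue scheme, assuming a user-supplied `execute_query` callable and a default of 8 workers (both are assumptions; the actual `Worker` and `Dataset.parallel_queries()` implementations may differ):

```python
import threading
from queue import Queue, Empty

class Worker(threading.Thread):
    """Consumes (item_id, query) tuples from a shared queue of pending queries."""

    def __init__(self, work_queue: Queue, results: list, execute_query):
        super().__init__()
        self.work_queue = work_queue
        self.results = results              # shared list; list.append is atomic in CPython
        self.execute_query = execute_query  # callable that sends one query to the endpoint

    def run(self):
        while True:
            try:
                item_id, query = self.work_queue.get_nowait()
            except Empty:
                break  # queue drained: this worker is done
            self.results.append((item_id, self.execute_query(query)))

def parallel_queries(work_queue: Queue, execute_query, n_workers: int = 8) -> list:
    # Start the workers, wait for all of them to finish, and return the
    # combined (item_id, result) tuples.
    results = []
    workers = [Worker(work_queue, results, execute_query) for _ in range(n_workers)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

Threads work well here because each worker spends most of its time blocked on network IO, during which the GIL is released, so many requests can be in flight at once.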
The use of parallel requests significantly reduced the total time needed to entity link all datasets.
The next steps are:
- Implement the methods for converting user and rating data from LastFM.
- Entity Link other datasets from other domains, like Book-Crossing.