Week 02
This week I again devoted my time to the entity linking step of Data Integration. I was able to:
- Implement the class LastFM and its entity-linking methods, which match each LastFM item with DBpedia;
- Implement the Dataset.parallel_queries() method, which sends web requests to the SPARQL endpoint in parallel using Python threads. Previously the queries were run sequentially.
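The body of Dataset.parallel_queries() is not shown here, so the following is only a minimal sketch of the idea, assuming a thread pool from Python's concurrent.futures. The free-function signature, the run_query callable, and fake_run_query are hypothetical stand-ins rather than the project's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parallel_queries(
    queries: List[str],
    run_query: Callable[[str], list],
    max_workers: int = 8,
) -> List[list]:
    """Run every query on a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the order of `queries`, even though the calls overlap in time.
        return list(pool.map(run_query, queries))


def fake_run_query(query: str) -> list:
    # Hypothetical stand-in for an HTTP request to the SPARQL endpoint.
    return [f"result for {query}"]


results = parallel_queries(["q1", "q2", "q3"], fake_run_query, max_workers=2)
```

Threads suit this workload because each request spends most of its time waiting on the network, so the waits can overlap.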
LastFM matching result:
- 11815 matches in a total of 17632 (67%)
The dataset used is [LastFM-HetRec2011](https://grouplens.org/datasets/hetrec-2011/).
The class LastFM, derived from Dataset, implements the methods needed to convert item data to a standardized .csv file (to be done: converting user and rating data) and to match each item with its corresponding DBpedia URI.
In the case of LastFM, the most important field is the artist/band name. The baseline for this dataset is similar to the one used for MovieLens and consists of:
- Matching the extracted artist/band name against the rdfs:label of DBpedia resources, using a regex. See the SPARQL query below for more details.
- Restricting the search to the types dbo:MusicalArtist and dbo:Band, so that only resources for musical artists and bands are considered.
- When the query returns more than one URI, using Levenshtein distance to choose the most similar one.
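The disambiguation step in the last bullet can be sketched as follows. This is an illustrative sketch, not the project's code: the helper names are hypothetical, and it assumes a label is available for each candidate, whereas the query as written returns only URIs.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def pick_closest(name: str, labels: List[str]) -> str:
    """Return the candidate label with the smallest edit distance to name."""
    return min(labels, key=lambda label: levenshtein(name.lower(), label.lower()))


candidates = ["Daft Punk discography", "Daft Punk", "Daft Punk (band)"]
best = pick_closest("Daft Punk", candidates)
```

Lowercasing both strings mirrors the case-insensitive "i" flag used in the regex filter.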
The query template is:
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artist WHERE {
    {
        ?artist rdf:type dbo:MusicalArtist .
        ?artist rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
    UNION
    {
        ?artist rdf:type dbo:Band .
        ?artist rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
}
The placeholder $name_regex is replaced with a regex built from the artist/band name, which is then matched against the label.
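The $name_regex syntax is the one used by Python's string.Template, although the report does not say which substitution mechanism the project relies on. Assuming string.Template, filling the placeholder could look like this (the query is shortened to a single type for brevity, and QUERY_TEMPLATE is a hypothetical name):

```python
from string import Template

# Shortened version of the query, kept only to show the substitution.
QUERY_TEMPLATE = Template("""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artist WHERE {
    ?artist a dbo:Band .
    ?artist rdfs:label ?label .
    FILTER regex(?label, "$name_regex", "i")
}
""")

params = {"name_regex": "^Daft.*Punk"}
# substitute() raises KeyError if a placeholder has no matching parameter.
query = QUERY_TEMPLATE.substitute(params)
```

Only the dollar sign is special to string.Template, so SPARQL variables such as ?artist and the braces pass through unchanged.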
For the band "Daft Punk", get_query_params() should return:
{
    'name_regex': '^Daft.*Punk',
}
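One way to produce that value is sketched below. It is an assumption about how get_query_params() might work, written as a free function rather than a method of LastFM; the escaping of regex metacharacters is an addition that the report does not mention.

```python
import re


def get_query_params(name: str) -> dict:
    """Build the template parameters for one artist/band name."""
    # Escape regex metacharacters in each word, then allow anything between words.
    tokens = [re.escape(token) for token in name.split()]
    return {"name_regex": "^" + ".*".join(tokens)}


params = get_query_params("Daft Punk")
```

If a name contains characters that re.escape prefixes with a backslash, those backslashes would likely need to be doubled before the regex is embedded in a SPARQL string literal.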