Week 02

AlvaroJoseLopes edited this page Jun 12, 2023 · 3 revisions

TL;DR

This week I again focused on the entity linking step of Data Integration. I was able to:

  • Implement the LastFM class and its entity linking methods to match each LastFM item with DBpedia;
  • Implement the Dataset.parallel_queries() method, which sends web requests to the SPARQL endpoint in parallel using Python threads. Previously, the queries ran sequentially.

LastFM matching result:

  • 11815 matches out of a total of 17632 items (67%)

Entity Linking LastFM

The dataset used can be found at [LastFM-HetRec2011](https://grouplens.org/datasets/hetrec-2011/).

The LastFM class, derived from Dataset, implements the methods needed to convert item data to a standardized .csv file (to be done: converting user and rating data) and to match each item with its corresponding DBpedia URI.

For LastFM, the most important field is the artist/band name. The baseline for this dataset is similar to the one used for MovieLens and consists of:

  1. Matching the extracted artist/band name against the rdfs:label of each DBpedia URI, using a regex. See the SPARQL query below for more details.
  2. Restricting the search to resources of type dbo:MusicalArtist or dbo:Band, so that only musical artists and bands are considered.
  3. When the query returns more than one URI, using the Levenshtein distance to choose the most similar one.
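
Step 3 can be illustrated with a minimal sketch. It assumes each candidate comes with its label as a (uri, label) pair; the query template selects only ?artist, so in practice the label would have to be selected as well or derived from the URI. The helper names levenshtein() and pick_best_uri() are hypothetical and not part of the project's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def pick_best_uri(name, candidates):
    """Return the URI whose label is closest to `name`.

    `candidates` is a list of (uri, label) pairs (an assumed shape).
    Returns None when there are no candidates.
    """
    if not candidates:
        return None
    best_uri, _ = min(
        candidates,
        key=lambda pair: levenshtein(name.lower(), pair[1].lower()),
    )
    return best_uri


candidates = [
    ("http://dbpedia.org/resource/Daft_Punk_discography", "Daft Punk discography"),
    ("http://dbpedia.org/resource/Daft_Punk", "Daft Punk"),
]
best = pick_best_uri("Daft Punk", candidates)
```

Lowercasing both strings before comparing mirrors the case-insensitive "i" flag used in the regex filter; whether the actual implementation does this is an assumption.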

The query template is:

PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artist WHERE {
    {
        ?artist rdf:type dbo:MusicalArtist .
        ?artist rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
    UNION
    {
        ?artist rdf:type dbo:Band .
        ?artist rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
}

The placeholder $name_regex should be replaced with a regex, built from the artist/band name, that is matched against the label.

For the band "Daft Punk", get_query_params() should return:

{
    'name_regex': '^Daft.*Punk',
}
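
As a rough sketch of how such parameters could be produced and substituted into the template, the example below uses Python's string.Template, whose $name placeholders match the $name_regex syntax above. The functions build_query_params() and escape_for_sparql_literal() are hypothetical stand-ins, not the project's get_query_params(), and the regex rule (words joined by ".*" and anchored at the start) is inferred only from the "Daft Punk" example.

```python
import re
from string import Template

# Same query as the template above, trimmed to the prefixes it actually uses.
QUERY_TEMPLATE = Template("""
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?artist WHERE {
    {
        ?artist rdf:type dbo:MusicalArtist .
        ?artist rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
    UNION
    {
        ?artist rdf:type dbo:Band .
        ?artist rdfs:label ?label .
        FILTER regex(?label, "$name_regex", "i")
    }
}
""")


def build_query_params(name: str) -> dict:
    """Build the template parameters for an artist/band name (hypothetical helper)."""
    # Escape regex metacharacters in each word, then allow anything between words.
    words = [re.escape(word) for word in name.split()]
    return {"name_regex": "^" + ".*".join(words)}


def escape_for_sparql_literal(text: str) -> str:
    """Escape backslashes and double quotes so the text fits in a "..." literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


params = build_query_params("Daft Punk")
query = QUERY_TEMPLATE.substitute(
    name_regex=escape_for_sparql_literal(params["name_regex"])
)
```

The extra escaping step matters only for names containing backslashes or double quotes; for "Daft Punk" the regex is inserted unchanged.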

Parallel SPARQL queries
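
As noted in the TL;DR, Dataset.parallel_queries() sends requests to the SPARQL endpoint in parallel using Python threads instead of one after another. The sketch below shows one way this could look, using ThreadPoolExecutor from the standard library. It is not the project's implementation: the function names, the max_workers default, the use of the public DBpedia endpoint URL, and the format request parameter are all assumptions.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed endpoint; the project may configure a different one.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def run_query(query, endpoint=DBPEDIA_ENDPOINT, timeout=30):
    """Send one SPARQL query over HTTP GET and return the ?artist URIs."""
    params = urllib.parse.urlencode({
        "query": query,
        # Assumption: the endpoint accepts a `format` parameter for JSON results.
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(f"{endpoint}?{params}", timeout=timeout) as response:
        payload = json.load(response)
    return [row["artist"]["value"] for row in payload["results"]["bindings"]]


def parallel_queries(queries, fetch=run_query, max_workers=8):
    """Run `fetch` on every query using a pool of threads.

    Results come back in the same order as `queries`, because
    Executor.map preserves input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, queries))


# Offline usage example with a stand-in fetch function (no network access).
demo = parallel_queries(["q1", "q2", "q3"], fetch=lambda q: q.upper())
```

Threads suit this workload because each request spends most of its time waiting on the network. A real run would also need to handle request failures and respect the endpoint's rate limits, which this sketch leaves out.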
