Week 05 and 06
During weeks 5 and 6, I wrapped up the data enriching step for LastFM and started coding the framework. I was able to:
- Implement the enriching step of LastFM
- Implement the necessary methods to convert the standard .csv files to a heterogeneous network, using NetworkX.
This time I used a different approach to find the most useful properties for LastFM: finding the most common DBpedia properties among the item resources (musical artists and bands).
After choosing the most important properties, a SPARQL query was used to retrieve those properties for each resource in the dataset.
To find the most common properties among the artists/bands, a query that counts property occurrences was built. It returns each property, the type of the resource it belongs to, and the number of occurrences of that property within the specified resources. The resources are limited to those classified as "MusicalArtist" or "Band" in the DBpedia ontology, and the type is kept in the result because the same resource can be classified under more than one type.
The query, limited to two example resources, is given below:
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    SELECT ?property ?type (COUNT(?resource) AS ?count)
    WHERE {
      # Only resources typed as musical artists or bands are counted.
      VALUES ?type { dbo:MusicalArtist dbo:Band }
      {
        <http://dbpedia.org/resource/The_Jam> ?property ?value .
        <http://dbpedia.org/resource/The_Jam> rdf:type ?type .
        BIND(<http://dbpedia.org/resource/The_Jam> AS ?resource)
      }
      UNION
      {
        <http://dbpedia.org/resource/Edgar_Froese> ?property ?value .
        <http://dbpedia.org/resource/Edgar_Froese> rdf:type ?type .
        BIND(<http://dbpedia.org/resource/Edgar_Froese> AS ?resource)
      }
      UNION
      {
        # same pattern for another resource...
      }
      UNION
      ...
    }
    GROUP BY ?type ?property
The complete script to retrieve this information can be found in this gist.
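Because the count query repeats the same three-line pattern for every resource, it lends itself to being generated. The following is a minimal sketch of that idea, not the script from the gist: the function names `resource_block` and `build_count_query` are hypothetical, and the class names `dbo:MusicalArtist` and `dbo:Band` are assumed to be the intended types.

```python
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
)


def resource_block(uri: str) -> str:
    """Return one UNION branch: all properties of `uri`, bound to ?resource."""
    return (
        "  {\n"
        f"    <{uri}> ?property ?value .\n"
        f"    <{uri}> rdf:type ?type .\n"
        f"    BIND(<{uri}> AS ?resource)\n"
        "  }"
    )


def build_count_query(uris: list[str]) -> str:
    """Assemble the property-count query for a batch of resource URIs."""
    if not uris:
        raise ValueError("at least one resource URI is required")
    # Join the per-resource branches with UNION, as in the query above.
    branches = "\n  UNION\n".join(resource_block(uri) for uri in uris)
    return (
        PREFIXES
        + "SELECT ?property ?type (COUNT(?resource) AS ?count)\n"
        + "WHERE {\n"
        + "  VALUES ?type { dbo:MusicalArtist dbo:Band }\n"
        + branches
        + "\n}\n"
        + "GROUP BY ?type ?property\n"
    )


query = build_count_query(
    [
        "http://dbpedia.org/resource/The_Jam",
        "http://dbpedia.org/resource/Edgar_Froese",
    ]
)
print(query)
```

Sending the resources in batches rather than all at once keeps each generated query at a manageable size; the batch size is left to the caller.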
In the case of the LastFM dataset, the most common properties are related to music genre, record label, awards, associated artists/bands and more. The template query is:
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbp: <http://dbpedia.org/property/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # $URI is a placeholder that is replaced with the URI of each resource.
    SELECT DISTINCT
      ?abstract
      (GROUP_CONCAT(DISTINCT ?bandMember; SEPARATOR="::") AS ?bandMember)
      (GROUP_CONCAT(DISTINCT ?genre; SEPARATOR="::") AS ?genre)
      (GROUP_CONCAT(DISTINCT ?associatedMusicalArtist; SEPARATOR="::") AS ?associatedMusicalArtist)
      (GROUP_CONCAT(DISTINCT ?awards; SEPARATOR="::") AS ?awards)
      (GROUP_CONCAT(DISTINCT ?recordLabel; SEPARATOR="::") AS ?recordLabel)
      (GROUP_CONCAT(DISTINCT ?associatedBand; SEPARATOR="::") AS ?associatedBand)
      (GROUP_CONCAT(DISTINCT ?origin; SEPARATOR="::") AS ?origin)
    WHERE {
      {
        # Properties attached directly to the resource.
        OPTIONAL { <$URI> dbo:genre ?genre } .
        OPTIONAL { <$URI> dbo:abstract ?abstract } .
        OPTIONAL { <$URI> dbp:origin ?origin } .
        OPTIONAL { <$URI> dbo:recordLabel ?recordLabel } .
        OPTIONAL { <$URI> dbo:bandMember ?bandMember } .
        OPTIONAL { <$URI> dbo:associatedMusicalArtist ?associatedMusicalArtist } .
        OPTIONAL { <$URI> dbo:associatedBand ?associatedBand } .
        OPTIONAL { <$URI> dbp:awards ?awards } .
        FILTER(LANG(?abstract) = 'en')
      }
      UNION
      {
        # Properties of the resource that $URI redirects to.
        <$URI> dbo:wikiPageRedirects ?uri .
        OPTIONAL { ?uri dbo:genre ?genre } .
        OPTIONAL { ?uri dbo:abstract ?abstract } .
        OPTIONAL { ?uri dbp:origin ?origin } .
        OPTIONAL { ?uri dbo:recordLabel ?recordLabel } .
        OPTIONAL { ?uri dbo:bandMember ?bandMember } .
        OPTIONAL { ?uri dbo:associatedMusicalArtist ?associatedMusicalArtist } .
        OPTIONAL { ?uri dbo:associatedBand ?associatedBand } .
        OPTIONAL { ?uri dbp:awards ?awards } .
        FILTER(LANG(?abstract) = 'en')
      }
    }
Following redirects is crucial: some mapped URIs are redirect pages that only point to the canonical resource through dbo:wikiPageRedirects, so their properties would otherwise be inaccessible. By following these redirects, we ensure that the intended properties are gathered.
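The `$URI` placeholder matches the syntax of Python's `string.Template`, so filling the template and unpacking the `::`-separated results could look like the sketch below. This is an assumption about how the template might be used, not the project's actual code: `QUERY_TEMPLATE` is a shortened stand-in for the full query, and `fill_query` and `split_grouped` are hypothetical helper names.

```python
from string import Template

# Shortened stand-in for the full template query (genre only).
QUERY_TEMPLATE = Template(
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
    'SELECT (GROUP_CONCAT(DISTINCT ?g; SEPARATOR="::") AS ?genre)\n'
    "WHERE {\n"
    "  { <$URI> dbo:genre ?g }\n"
    "  UNION\n"
    "  { <$URI> dbo:wikiPageRedirects ?uri . ?uri dbo:genre ?g }\n"
    "}\n"
)


def fill_query(uri: str) -> str:
    """Replace every $URI placeholder with the given resource URI."""
    return QUERY_TEMPLATE.substitute(URI=uri)


def split_grouped(value: str, sep: str = "::") -> list[str]:
    """Split a GROUP_CONCAT result back into individual values."""
    # An empty string means the resource had no value for the property.
    return [part for part in value.split(sep) if part]


query = fill_query("http://dbpedia.org/resource/The_Jam")
genres = split_grouped(
    "http://dbpedia.org/resource/Punk_rock::http://dbpedia.org/resource/Mod_revival"
)
print(query)
print(genres)
```

The sample genre string is illustrative only; real values come from the endpoint's response.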
The resulting enriched dataset has the following statistics:
- number of entities with the property item_id: 11783 (100.00%)
- number of entities with the property abstract: 11007 (93.41%)
- number of entities with the property bandMember: 2444 (20.74%)
- number of entities with the property genre: 8718 (73.99%)
- number of entities with the property associatedMusicalArtist: 3919 (33.26%)
- number of entities with the property awards: 146 (1.24%)
- number of entities with the property recordLabel: 7238 (61.43%)
- number of entities with the property associatedBand: 3919 (33.26%)
- number of entities with the property origin: 7069 (59.99%)
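Coverage figures of this kind can be computed by counting, for each column, the rows with a non-empty value. The sketch below uses only the standard library and a tiny made-up sample (not the real LastFM data); `property_coverage` is a hypothetical helper, and the column layout is assumed to follow the enriched .csv described in this report.

```python
import csv
import io


def property_coverage(rows, columns):
    """Return {column: (count, percentage)} of rows with a non-empty value."""
    total = len(rows)
    stats = {}
    for column in columns:
        count = sum(1 for row in rows if (row.get(column) or "").strip())
        stats[column] = (count, 100.0 * count / total if total else 0.0)
    return stats


# Made-up sample standing in for the enriched file.
sample = io.StringIO(
    "item_id,abstract,genre\n"
    "1,Some abstract,dbr:Rock::dbr:Punk_rock\n"
    "2,,dbr:Electronic_music\n"
    "3,Another abstract,\n"
    "4,,\n"
)
rows = list(csv.DictReader(sample))
stats = property_coverage(rows, ["item_id", "abstract", "genre"])

for name, (count, pct) in stats.items():
    print(f"- number of entities with the property {name}: {count} ({pct:.2f}%)")
```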
The objective of the framework is to enable users to easily configure an entire experiment pipeline. For example, with the .yaml file below, the user could load an enriched MovieLens dataset:
    experiment:
      dataset:
        name: ml-100k
        item:
          path: datasets/ml-100k/processed/item.csv
          extra_features: [movie_year, movie_title]
        user:
          path: datasets/ml-100k/processed/user.csv
          extra_features: [gender, occupation]
        ratings:
          path: datasets/ml-100k/processed/rating.csv
          timestamp: True
        enrich:
          map_path: datasets/ml-100k/processed/map.csv
          enrich_path: datasets/ml-100k/processed/enriched.csv
          remove_unmatched: True
          properties:
            - type: subject
              grouped: True
              sep: "::"
            - type: director
              grouped: True
              sep: "::"
Let's break down the main directives for the dataset:
- item: specifies the item info to be added to the network. (mandatory)
  - path: filepath of the standardized item.csv. (mandatory)
  - extra_features: by default, the only column added is item_id. With a list of column names, the user can specify additional features to be added as property nodes. (optional)
- user: specifies the user info. (mandatory)
  - path: filepath of the standardized user.csv. (mandatory)
  - extra_features: by default, the only column added is user_id. With a list of column names, the user can specify additional features to be added as property nodes. (optional)
- ratings: specifies the ratings info. (mandatory)
  - path: filepath of the standardized ratings.csv. (mandatory)
  - timestamp: boolean that indicates whether the timestamp column is present.
- enrich: specifies the enriched info. (mandatory)
  - map_path: filepath of the standardized map.csv. (mandatory)
  - enrich_path: filepath of the standardized enriched.csv. (mandatory)
  - remove_unmatched: boolean that specifies whether nodes unmatched with DBpedia should be removed. (mandatory)
  - properties: list of properties used to enrich the dataset. (mandatory)
    - type: column name (type) of the property. (mandatory)
    - grouped: boolean that indicates whether the property values were grouped and concatenated into a single string. Used when a resource has multiple values for the same property type. (mandatory)
    - sep: separator used to concatenate the list of property values. (optional)
The PyYAML library was used to deserialize the .yaml file into a Python dictionary. With the configuration provided by the user, the data can be loaded and converted into a NetworkX graph as specified.
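Once the configuration is a plain dictionary, a quick structural check can report missing mandatory directives before any file is read. The following sketch assumes the mandatory/optional markers listed above; `REQUIRED` and `validate_dataset` are hypothetical names, and the dictionary literal stands in for what PyYAML's `yaml.safe_load` would return for the example file.

```python
# Stand-in for the dictionary produced by yaml.safe_load on the example file.
config = {
    "experiment": {
        "dataset": {
            "name": "ml-100k",
            "item": {
                "path": "datasets/ml-100k/processed/item.csv",
                "extra_features": ["movie_year", "movie_title"],
            },
            "user": {
                "path": "datasets/ml-100k/processed/user.csv",
                "extra_features": ["gender", "occupation"],
            },
            "ratings": {
                "path": "datasets/ml-100k/processed/rating.csv",
                "timestamp": True,
            },
            "enrich": {
                "map_path": "datasets/ml-100k/processed/map.csv",
                "enrich_path": "datasets/ml-100k/processed/enriched.csv",
                "remove_unmatched": True,
                "properties": [
                    {"type": "subject", "grouped": True, "sep": "::"},
                    {"type": "director", "grouped": True, "sep": "::"},
                ],
            },
        }
    }
}

# Mandatory keys per section, following the directive list above.
REQUIRED = {
    "item": ["path"],
    "user": ["path"],
    "ratings": ["path"],
    "enrich": ["map_path", "enrich_path", "remove_unmatched", "properties"],
}


def validate_dataset(dataset: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config is valid."""
    errors = []
    for section, keys in REQUIRED.items():
        if section not in dataset:
            errors.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in dataset[section]:
                errors.append(f"missing key: {section}.{key}")
    # Each enrichment property needs at least a type and a grouped flag.
    for i, prop in enumerate(dataset.get("enrich", {}).get("properties", [])):
        for key in ("type", "grouped"):
            if key not in prop:
                errors.append(f"missing key: enrich.properties[{i}].{key}")
    return errors


errors = validate_dataset(config["experiment"]["dataset"])
print(errors or "configuration looks complete")
```

Collecting all problems in a list, rather than raising on the first one, lets the user fix a configuration file in a single pass.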
All Recommender System datasets will be modeled as a heterogeneous network, with nodes of type UserNode, ItemNode, and PropertyNode. A rating from a user to an item is represented as an edge between those nodes, and the timestamp of the rating is a property of that edge. An edge between an item and a property indicates that the item has that property. Finally, an edge between two different users indicates a social link between them.
The class Graph is a wrapper on top of nx.Graph() that receives the dataset configuration and converts the data as specified.
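To make the modeling concrete, the sketch below builds a small network of this shape directly with NetworkX. It is an illustration under assumptions, not the framework's Graph class: `build_graph`, the tuple-based node identifiers, the `node_type` attribute name, and the sample ratings and property values are all hypothetical.

```python
import networkx as nx


def build_graph(ratings, item_properties, sep="::"):
    """Build a heterogeneous user/item/property network.

    ratings: iterable of (user_id, item_id, timestamp)
    item_properties: iterable of (item_id, property_type, grouped_value)
    """
    graph = nx.Graph()
    for user_id, item_id, timestamp in ratings:
        user, item = ("user", user_id), ("item", item_id)
        graph.add_node(user, node_type="UserNode")
        graph.add_node(item, node_type="ItemNode")
        # The rating is an edge; its timestamp is stored as an edge attribute.
        graph.add_edge(user, item, timestamp=timestamp)
    for item_id, prop_type, grouped_value in item_properties:
        item = ("item", item_id)
        graph.add_node(item, node_type="ItemNode")
        # A grouped property holds several values joined by the separator.
        for value in grouped_value.split(sep):
            if not value:
                continue
            prop = ("property", prop_type, value)
            graph.add_node(prop, node_type="PropertyNode")
            graph.add_edge(item, prop)
    return graph


# Made-up sample data.
ratings = [(1, 10, 881250949), (2, 10, 891717742)]
item_properties = [(10, "director", "dbr:Director_A::dbr:Director_B")]

graph = build_graph(ratings, item_properties)
print(graph.number_of_nodes(), graph.number_of_edges())
```

Prefixing each identifier with its kind keeps a user and an item that share the same numeric id from collapsing into a single node.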
The following image is a sample of the network specified above:
Next step: analyze the most used pre-processing, filtering, and splitting methods, then implement them.