5.2. Expert Finding
The first step, which needs no specific configuration, is author consolidation. The pipeline identifies similar names, taking into account that:
- given names may be abbreviated to initials
- authors may or may not use middle names
- authors may have multiple surnames and sometimes use only one (especially in Spanish- and Portuguese-speaking countries)
- the surname may precede the first name (especially in East Asian countries)
- accents may or may not be used.
This system is supported by a large database of known given names and surnames, and it outputs names that may refer to the same person. It assumes that each name identifies a single person, as it has no way to distinguish different authors who share the same name.
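The matching criteria listed above can be illustrated with a minimal Python sketch. This is not the pipeline's actual implementation: the function names (`strip_accents`, `name_tokens`, `token_matches`, `may_be_same_person`) are hypothetical, the matching is a simple greedy heuristic, and the database of known given names and surnames is not modelled at all.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove diacritics, so that 'José' and 'Jose' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_tokens(name: str) -> list[str]:
    """Lower-cased, accent-free name parts with dots and commas removed."""
    cleaned = strip_accents(name).lower().replace(".", " ").replace(",", " ")
    return cleaned.split()


def token_matches(a: str, b: str) -> bool:
    """Two parts match if they are equal or one is the initial of the other."""
    if a == b:
        return True
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    return len(short) == 1 and long_.startswith(short)


def may_be_same_person(name_a: str, name_b: str) -> bool:
    """Hypothetical heuristic: every part of the shorter name must match a
    distinct part of the longer name, in any order. This tolerates dropped
    middle names, dropped second surnames and swapped name order."""
    shorter, longer = sorted((name_tokens(name_a), name_tokens(name_b)), key=len)
    if len(shorter) < 2:
        return False  # a single name part is too ambiguous to match on
    remaining = list(longer)
    # Match full parts before initials so an initial cannot "steal" a part.
    for tok in sorted(shorter, key=len, reverse=True):
        match = next((c for c in remaining if token_matches(tok, c)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True
```

Under these assumptions, `may_be_same_person("José García López", "Jose Garcia")` and `may_be_same_person("Wang Wei", "Wei Wang")` both hold, while `"John Smith"` and `"Jane Smith"` are kept apart. The heuristic is deliberately permissive, in line with the goal of surfacing names that may refer to the same person.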
The expert finding step itself runs only if authors are present in the metadata and is skipped otherwise. It uses algorithms that extend existing relevance-based measures of expertise, using topical hierarchies to introduce a semantically motivated measure of expertise. We follow the expert finding approach introduced by Bordea (2010), which measures the relevance of a term (i.e. an expertise topic) for an author (i.e. an expert).
The main approach is as follows:
In this method, authors are ranked according to how frequently they mention each extracted term, denoted $t$. The relevance of an author $a$ to a topic $t$ is calculated by means of Tf-Irf (Term Frequency-Inverse Researcher Frequency).
Here $D_a$ denotes the set of documents authored by author $a$, and tf-idf is calculated to assess the importance of the term in the corpus.
The tf-idf measures the relevance of a given expertise topic for a person. Each individual $i$ is represented by a virtual document $d_i$, constructed by concatenating all the documents authored by that person. This allows us to estimate the relevance of a term for an individual by computing the relevance of the term for the virtual document $d_i$.
Documents are aggregated for each individual $i$, resulting in a set of aggregated documents $D$ defined as follows:
$$D = \{d_{i_1}, d_{i_2}, \ldots, d_{i_n}\}$$
where $n$ is the total number of individuals.
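The aggregation step can be sketched in a few lines of Python. This is an illustration only: the input layout (a list of dictionaries with hypothetical `authors` and `text` keys) and the function name `aggregate_by_author` are assumptions, not the pipeline's data model.

```python
from collections import defaultdict


def aggregate_by_author(documents: list[dict]) -> dict[str, str]:
    """Build one virtual document d_i per individual by concatenating the
    text of every document that lists that individual as an author."""
    parts: dict[str, list[str]] = defaultdict(list)
    for doc in documents:
        for author in doc["authors"]:  # assumed key, for illustration
            parts[author].append(doc["text"])  # assumed key, for illustration
    return {author: " ".join(texts) for author, texts in parts.items()}


corpus = [
    {"authors": ["alice", "bob"], "text": "term extraction for expert finding"},
    {"authors": ["alice"], "text": "topical hierarchies"},
]
aggregated = aggregate_by_author(corpus)
```

A co-authored document contributes its text to the virtual document of each of its authors, so here `aggregated` has $n = 2$ entries and the text of the first document appears in both.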
The tf score is defined below as the frequency $f(t, d_i)$ of a term $t$ in the aggregated document $d_i$ authored by an individual $i$, normalised by the frequency $f(t, D)$ of the term in the whole set of aggregated documents $D$.
$$\mathrm{tf}(t, i) = \frac{f(t, d_i)}{f(t, D)}$$
The idf score for a term $t$ and a set of aggregated documents $D$ is defined below as the total number of individuals $n$ divided by the number of individuals that mention the term.
$$\mathrm{idf}(t, D) = \frac{n}{|\{d_i \in D : t \in d_i\}|}$$
In this way, the tf-idf scoring function combines tf and idf as follows:
$$\text{tf-idf}(t, i) = \mathrm{tf}(t, i) \cdot \mathrm{idf}(t, D)$$
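The three definitions above can be checked on a small example with the following Python sketch. It assumes that term extraction has already been done, so each aggregated document is given as a list of term occurrences. The function name `tf_idf_scores` and the input layout are hypothetical.

```python
from collections import Counter


def tf_idf_scores(term_mentions: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Score every (term, individual) pair.

    term_mentions maps each individual i to the term occurrences found in
    the aggregated document d_i (an assumed input format)."""
    counts = {i: Counter(terms) for i, terms in term_mentions.items()}
    n = len(counts)                  # total number of individuals
    total: Counter = Counter()       # f(t, D): frequency over all of D
    mentioning: Counter = Counter()  # number of d_i that contain t
    for per_individual in counts.values():
        total.update(per_individual)
        mentioning.update(per_individual.keys())

    scores = {}
    for i, per_individual in counts.items():
        for t, freq in per_individual.items():
            tf = freq / total[t]     # tf(t, i) = f(t, d_i) / f(t, D)
            idf = n / mentioning[t]  # idf(t, D) = n / |{d_i in D : t in d_i}|
            scores[(t, i)] = tf * idf
    return scores


mentions = {
    "alice": ["nlp", "nlp", "nlp", "graphs"],
    "bob": ["nlp"],
    "carol": ["graphs"],
}
scores = tf_idf_scores(mentions)
```

With these inputs, "nlp" occurs 4 times in total and is mentioned by 2 of the 3 individuals, so alice scores $(3/4) \cdot (3/2) = 1.125$ and bob scores $(1/4) \cdot (3/2) = 0.375$ for that term. Note that the idf defined here is a plain ratio, without the logarithm found in many other tf-idf formulations.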
This resource has been funded by Science Foundation Ireland under Grant SFI/12/RC/2289_P2 for the Insight SFI Research Centre for Data Analytics. © 2020 Data Science Institute - National University of Ireland Galway