5.1. Term Extraction
The goal of applying Term Extraction is to automatically identify key terms in a domain, with the objective of creating a taxonomy that represents this domain. The content of the corpus given as input for analysis represents the domain; in other words, the domain is embodied by the corpus. A key term is a single-word or multi-word expression characteristic of the domain. For a term to be classified as a key term it should: (i) appear frequently in the dataset, and (ii) be of relevance to the domain.
Figure: The Term Extraction pipeline
The term extraction framework follows a sequential pipeline (see Figure above) with the following steps:
- The identification of candidate terms extracts from the textual documents all noun phrases with the following characteristics:
  - Containing a minimum and maximum number of words defined by the ngram_min and ngram_max parameters, respectively.
  - Fulfilling a set of Part-of-Speech constraints to ensure the phrase constitutes a noun phrase, e.g. the head element of the term cannot be an adjective.
  - Occurring within the dataset at least a given number of times, specified by the min_term_freq parameter.
- The scoring step assigns numeric scores to each extracted term, representing the relevance of the term for the domain in the textual dataset. As the notion of relevance changes from one application to another, the scoring step can make use of one or more scoring functions most suitable for the underlying domain and task.
  Consider KB a textual dataset and t ∈ KB a candidate term extracted in the candidate identification step. Let F = {f1(x), f2(x), ..., fn(x)} be a set of functions that indicate the relevance of t in KB for a task T. The score given to t is an n-tuple given by:
  score(t) = { f(t) | f ∈ F }
- The ranking step sorts all candidate terms from the most relevant (i.e. highest score value) to the least relevant (i.e. lowest score value). If only one scoring function was used to score a term, ranking is a simple sorting mechanism. However, if more than one scoring function was used, the ranking step applies an aggregation function to determine the position of each term in the ranking.
- The filtering step prunes the list of candidate terms, keeping only the top n terms after the ranking step (where n is a parameter provided to the algorithm). A minimal sketch of the whole pipeline follows this list.
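To make the four stages concrete, here is a minimal end-to-end sketch in Python. All names (candidate_ngrams, extract_candidates, and so on) are illustrative stand-ins rather than Saffron's actual API, the scoring step uses a single toy frequency function, and the Part-of-Speech constraints are omitted:

```python
from collections import Counter

def candidate_ngrams(tokens, ngram_min=1, ngram_max=3):
    """Enumerate every n-gram of the allowed lengths."""
    for n in range(ngram_min, ngram_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def extract_candidates(docs, ngram_min=1, ngram_max=3, min_term_freq=2):
    """Step 1: candidate identification (POS constraints omitted here)."""
    counts = Counter()
    for doc in docs:
        counts.update(candidate_ngrams(doc.lower().split(), ngram_min, ngram_max))
    return {t: c for t, c in counts.items() if c >= min_term_freq}

def score(candidates):
    """Step 2: scoring -- a single toy function here, raw frequency."""
    return {t: float(freq) for t, freq in candidates.items()}

def rank_and_filter(scores, top_n=10):
    """Steps 3 and 4: sort by descending score and keep the top n."""
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

docs = ["term extraction finds key terms",
        "key terms represent the domain",
        "the domain is embodied by the corpus"]
print(rank_and_filter(score(extract_candidates(docs))))
```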
Saffron provides a set of scoring functions that highlight different properties of the extracted candidate terms. Details on how each scoring function works can be found in (Astrakhantsev, 2016). The scoring functions available are divided into four main categories:
- Frequency of occurrences: considers only the frequencies of candidate terms and/or the frequencies of words occurring within candidate terms (functions available: Total TF-IDF, Residual IDF, C-Value, Basic, ComboBasic). Frequency-based measures favour single-word phrases, as these have a higher probability of occurrence. It is therefore advisable either to combine such a measure with one based on another principle to reduce the single-word bias, or to set the minimum term length to two words.
- Total TF-IDF: A common measure used in information retrieval. Its value rises when the candidate term occurs frequently in few documents:
  TotalTFIDF(t) = TF(t) × log(D / DTF(t))
  where TF(t) is the frequency of t across the whole collection, D is the total number of documents in the collection, and DTF(t) is the number of documents containing term candidate t.
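  A hedged sketch of this computation in Python, assuming the standard formulation reconstructed above:
  ```python
  import math

  def total_tf_idf(tf, dtf, d):
      """Total TF-IDF: collection-wide frequency of t times log(D / DTF(t)).
      tf  -- occurrences of t across the whole collection
      dtf -- number of documents containing t
      d   -- total number of documents"""
      return tf * math.log(d / dtf)

  # A term concentrated in 2 of 100 documents outscores one with the
  # same total frequency spread over 50 documents.
  print(total_tf_idf(tf=30, dtf=2, d=100))   # ~117.4
  print(total_tf_idf(tf=30, dtf=50, d=100))  # ~20.8
  ```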
- Residual IDF: This method is "based on the assumption that the deviation of observed IDF from the IDF modeled by Poisson distribution is higher for keywords [terms] than for ordinary words" (Astrakhantsev, 2016):
  ResidualIDF(t) = log2(D / DTF(t)) + log2(1 − e^(−TF(t)/D))
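  A small illustration, assuming the Church and Gale style formulation reconstructed above:
  ```python
  import math

  def residual_idf(tf, dtf, d):
      """Residual IDF: observed IDF minus the IDF expected if the
      occurrences of t followed a Poisson distribution."""
      observed = -math.log2(dtf / d)
      expected = -math.log2(1 - math.exp(-tf / d))  # P(doc contains t) under Poisson
      return observed - expected

  # A keyword clumps into few documents, so its observed IDF exceeds
  # the Poisson expectation; an evenly spread word does not.
  print(residual_idf(tf=30, dtf=3, d=100))   # clearly positive
  print(residual_idf(tf=30, dtf=26, d=100))  # close to zero
  ```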
- C-Value: This is one of the most popular methods used in term extraction. It promotes term candidates that occur frequently without being part of other term candidates. The original method was meant to work with multi-word term candidates only, but this version supports one-word terms too (following the modification from Ventura et al. (2013)):
  CValue(t) = log2(|t| + 0.1) × (TF(t) − (1/|St|) Σ{t′ ∈ St} TF(t′))
  where |t| is the length of term candidate t (number of words) and St is the set of term candidates containing t, i.e. candidates for which t is a substring.
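  A sketch of the computation; the substring test below is a naive stand-in for the nesting relation, and the 0.1 constant follows the ATR4S formulation rather than anything documented on this page:
  ```python
  import math

  def c_value(term, freqs):
      """C-Value with one-word support: frequency is discounted by the
      average frequency of the candidates that contain the term."""
      nested_in = [f for cand, f in freqs.items()
                   if cand != term and term in cand]  # naive substring nesting
      discounted = freqs[term]
      if nested_in:
          discounted -= sum(nested_in) / len(nested_in)
      return math.log2(len(term.split()) + 0.1) * discounted

  freqs = {"machine": 40, "machine translation": 25,
           "statistical machine translation": 10}
  for t in freqs:
      print(t, round(c_value(t, freqs), 2))
  ```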
- Basic: Basic is a modification of C-Value that focuses on extracting terms of an intermediate level of specificity. Like the original C-Value, it extracts multi-word terms only. However, as opposed to C-Value, Basic promotes term candidates that are part of other term candidates (embeddedness), based on the principle that such terms usually serve as the basis for the creation of more specific terms:
  Basic(t) = |t| × log TF(t) + α × et
  where et is the number of term candidates containing t and α is a weighting parameter.
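  A sketch assuming the formula above; the value of α is not documented on this page, so the ATR4S default of 3.5 is used here purely as a placeholder:
  ```python
  import math

  def basic(term, freqs, alpha=3.5):
      """Basic: |t| * log TF(t) + alpha * e_t, where e_t counts the
      candidates that embed t. alpha = 3.5 is an assumed default."""
      e_t = sum(1 for cand in freqs if cand != term and term in cand)
      return len(term.split()) * math.log(freqs[term]) + alpha * e_t

  freqs = {"neural network": 20, "neural network model": 8,
           "deep neural network": 6}
  print(basic("neural network", freqs))        # boosted by two embeddings
  print(basic("neural network model", freqs))  # embedded in nothing
  ```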
- Future Basic: Based on the same principle as the Basic function, this function extracts terms in a "predictive" way, rather than capturing the current terms of the domain. From a time-stamped corpus, it extracts terms and predicts how likely they are to be important for the domain of the corpus in the future. For this task, Saffron implements a ranking algorithm that takes the temporal distribution of terms into consideration and predicts their future distribution.
- ComboBasic: ComboBasic is based on Basic, but modifies it further so that the level of term specificity can be customized by changing the parameters of the method:
  ComboBasic(t) = |t| × log TF(t) + α × et + β × e′t
  where e′t is the number of term candidates that are contained in t. Therefore, by increasing β, one can extract more specific terms, and vice versa. In Saffron, α is set to 0.75 and β to 0.1.
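  A sketch using the α and β values stated above; as in the C-Value sketch, nesting is approximated with a substring test:
  ```python
  import math

  def combo_basic(term, freqs, alpha=0.75, beta=0.1):
      """ComboBasic: |t| * log TF(t) + alpha * e_t + beta * e'_t, with
      e_t = candidates containing t, e'_t = candidates contained in t."""
      e_t = sum(1 for c in freqs if c != term and term in c)
      e_prime_t = sum(1 for c in freqs if c != term and c in term)
      return (len(term.split()) * math.log(freqs[term])
              + alpha * e_t + beta * e_prime_t)

  freqs = {"neural network": 20, "neural network model": 8,
           "deep neural network model": 4}
  # Raising beta shifts the balance toward longer, more specific terms.
  for t in freqs:
      print(t, round(combo_basic(t, freqs), 2))
  ```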
- Future ComboBasic: The counterpart of Future Basic for the ComboBasic function: from a time-stamped corpus, it extracts terms and predicts how likely they are to be important for the domain of the corpus in the future, using the same temporal ranking algorithm.
- Context of occurrences: follows the distributional hypothesis (Harris, 1954) to distinguish terms from non-terms by considering the distribution of words in their contexts.
- PostRankDC: The PostRank Domain Coherence function works in three steps. First, it extracts the 200 best term candidates using the Basic method. Then, the words from the contexts (a window of five words) of these 200 terms are filtered: only nouns, adjectives, verbs and adverbs that occur in at least 1/4 of all documents and are similar to the 200 term candidates are kept, i.e. those ranked in the top 50 by averaged Normalized PMI (NPMI):
  NPMI(t, w) = ln( P(t, w) / (P(t) × P(w)) ) / −ln P(t, w)
  "where w is a context word; T is a set of 200 best term candidates extracted by Basic; P(t, w) is a probability of occurrence of word w in the context of t; P(t) and P(w) are probabilities of occurrences of term t and word w, correspondingly. These probabilities are estimated on the basis of occurrence frequencies in the input collection; context is considered to be a five words window. Finally, as a weight of a term candidate, DomainCoherence takes the average of the same NPMI measures computed with each of 50 context words extracted at the previous step" (Astrakhantsev, 2016).
- Reference corpora: based on the assumption that terms can be distinguished from other words and collocations by comparing occurrence statistics in the dataset against statistics from a reference corpus, usually of general, non-domain-specific language (functions available: Weirdness, Relevance).
- Weirdness: An implementation of this idea, normalized by the sizes (in number of words) of the document collections:
  Weirdness(t) = NTFtarget(t) / NTFreference(t)
  where NTFtarget(t) and NTFreference(t) are the frequencies of t normalized by the sizes of the target and reference collections, respectively.
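  A direct translation of the formula; in practice the reference frequency is smoothed so that terms absent from the reference corpus do not divide by zero:
  ```python
  def weirdness(tf_target, size_target, tf_reference, size_reference):
      """Ratio of the size-normalized frequencies of t in the domain
      corpus and in the general-language reference corpus."""
      ntf_target = tf_target / size_target
      ntf_reference = tf_reference / size_reference
      return ntf_target / ntf_reference

  # "ontology" in a 1M-word domain corpus vs a 1B-word reference corpus.
  print(weirdness(tf_target=500, size_target=1_000_000,
                  tf_reference=2_000, size_reference=1_000_000_000))  # 250.0
  ```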
- Relevance: Based on Weirdness, but it also takes into account the fraction of documents in which the term candidate occurs.
Saffron uses the Wikidata Database as a reference collection.
- Topic modeling: based on the idea that topic modeling uncovers semantic information useful for term recognition; in particular, that the distribution of words over the topics found by topic modeling is a less noisy signal than the simple frequency of occurrences (function available: NovelTopicModel).
- NovelTopicModel: For this function, probability distributions of words over the following topics are obtained: φ_t, the general topics (1 ≤ t ≤ 20); φ_B, the background topic; and φ_D, the document-specific topic. It then extracts the 200 most probable words for each topic: V_t, V_B and V_D, correspondingly. Finally, for each term candidate c_i, its weight is computed as the sum of the maximal probabilities for each of its L_i words (w_i1, w_i2, ..., w_iL_i).
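  A sketch of the weighting step only, with tiny hard-coded stand-ins for the fitted topic distributions (the real model uses 20 general topics plus the background and document-specific topics, and restricts each topic to its 200 most probable words):
  ```python
  # Stand-in distributions of words over topics.
  phi = {
      "topic1":     {"neural": 0.40, "network": 0.30, "model": 0.20, "the": 0.10},
      "topic2":     {"neural": 0.10, "network": 0.20, "model": 0.50, "the": 0.20},
      "background": {"neural": 0.05, "network": 0.05, "model": 0.10, "the": 0.80},
      "document":   {"neural": 0.30, "network": 0.40, "model": 0.20, "the": 0.10},
  }

  def ntm_weight(candidate):
      """Sum, over the candidate's words, of each word's maximal
      probability across all topic distributions."""
      return sum(max(dist[w] for dist in phi.values())
                 for w in candidate.split())

  print(ntm_weight("neural network"))  # 0.4 + 0.4 = 0.8
  print(ntm_weight("the model"))       # 0.8 + 0.5 = 1.3
  ```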
In addition, Saffron has two ranking procedures:
- Single score: Only one scoring function is used to rank all terms. In this case, all terms are sorted in descending order by their associated score.
- Voting: Combines the results of several scoring functions. Based on the voting mechanism from (Zhang et al., 2008), this ranking happens in two steps. In the first step, the single score procedure is applied to each scoring function used, resulting in a set of ranked lists R, one list per scoring function. Next, the final ranking position for a candidate term t is given by:
  R(t) = Σ{i=1..n} 1 / Ri(t)
  where n is the number of scoring functions used and Ri(t) is the rank position of t as provided by scoring function i.
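  A minimal implementation of the voting aggregation, assuming the 1/rank formulation reconstructed above:
  ```python
  def voting_rank(ranked_lists):
      """Each term's final score is the sum over scoring functions of
      1 / its rank position; ties would need an explicit tie rule."""
      scores = {}
      for ranking in ranked_lists:              # one ranked list per function
          for position, term in enumerate(ranking, start=1):
              scores[term] = scores.get(term, 0.0) + 1.0 / position
      return sorted(scores, key=scores.get, reverse=True)

  tfidf_rank = ["neural network", "model", "data"]
  cvalue_rank = ["model", "neural network", "data"]
  print(voting_rank([tfidf_rank, cvalue_rank]))
  # ['neural network', 'model', 'data'] -- the top two tie at 1.5
  ```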
For each corpus, it is necessary to test and analyse the performance of different parameter settings and scoring function choices with regard to the frequency and relevance of the terms they yield for the specific domain.
Astrakhantsev, N.: ATR4S: Toolkit with state-of-the-art automatic terms recognition methods in Scala. CoRR abs/1611.07804 (2016)
Ventura, J.A.L., Jonquet, C., Roche, M., Teisseire, M.: Combining C-value and keyword extraction methods for biomedical terms extraction. In: LBM'2013: International Symposium on Languages in Biology and Medicine, pp. 45–49 (2013)
Harris, Z.S.: Distributional structure. Word 10(2-3), 146–162 (1954)
Zhang, Z., Iria, J., Brewster, C., Ciravegna, F.: A comparative evaluation of term recognition algorithms. In: Proceedings of LREC 2008 (2008)
This resource has been funded by Science Foundation Ireland under Grant SFI/12/RC/2289_P2 for the Insight SFI Research Centre for Data Analytics. © 2020 Data Science Institute - National University of Ireland Galway