Automatic keyword extraction is an important research topic. Keywords serve as a dense summary of a document, can improve information retrieval, and can act as an entry point into a document collection. The aim of this project is to find a small set of terms that describes a specific document, independently of the domain it belongs to.
We examine five different models: n-grams (unigram, bigram, trigram), POSTagger-based n-grams (terms matching any of a set of part-of-speech (POS) tag sequences), and RAKE (Rapid Automatic Keyword Extraction).
For the n-gram-based models, we score candidate terms with three features: term frequency, collection frequency, and the relative position of the term's first occurrence, and select the top-scoring terms as keywords.
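As a rough illustration, these three features could be computed for unigram candidates as sketched below (class and method names here are illustrative, not the repository's actual API):

```java
import java.util.*;

// Minimal sketch: term frequency, collection frequency, and relative
// first-occurrence position for unigram candidates of one document.
// FeatureSketch / unigramFeatures are hypothetical names, not the repo's API.
public class FeatureSketch {

    public static Map<String, double[]> unigramFeatures(
            List<String> docTokens, Map<String, Integer> collectionCounts) {
        Map<String, double[]> features = new LinkedHashMap<>();
        int n = docTokens.size();
        for (int i = 0; i < n; i++) {
            String term = docTokens.get(i).toLowerCase();
            double[] f = features.get(term);
            if (f == null) {
                // [term frequency, collection frequency, relative first position]
                f = new double[] {
                    0,
                    collectionCounts.getOrDefault(term, 0),
                    (double) i / n   // earlier first occurrence -> smaller value
                };
                features.put(term, f);
            }
            f[0] += 1;               // term frequency within this document
        }
        return features;
    }
}
```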
To run the models, download or clone this repo to your local machine, then compile and execute KeyPhraseExtraction.java in your IDE or with the command-line tools (javac, java). Unzip TRAINING.zip to obtain the training data.
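For example, with the JDK command-line tools (assuming KeyPhraseExtraction.java sits at the repository root in the default package; adjust paths and package as needed):

```
unzip TRAINING.zip
javac KeyPhraseExtraction.java
java KeyPhraseExtraction
```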
Files ending with .abstr are used to train our models. The corresponding .contr files contain the controlled manually assigned keywords, and the .uncontr files contain the uncontrolled manually assigned keywords; in both cases the keywords are separated by semicolons.
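For illustration, the gold keywords for one document could be loaded and split like this (the file path is a placeholder; the repository's own loading code may differ):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Minimal sketch: load the semicolon-separated gold keywords from a .uncontr file.
public class GoldKeywordsSketch {
    public static Set<String> load(Path uncontrFile) throws IOException {
        String content = new String(Files.readAllBytes(uncontrFile)).trim();
        Set<String> keywords = new LinkedHashSet<>();
        for (String kw : content.split(";")) {
            if (!kw.trim().isEmpty()) {
                keywords.add(kw.trim().toLowerCase());
            }
        }
        return keywords;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; replace with an actual .uncontr file from TRAINING.zip.
        System.out.println(load(Paths.get("TRAINING/1.uncontr")));
    }
}
```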
For the POSTagger model, please download the basic English Stanford POS Tagger and add its .jar files to your project's build path or classpath.
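Once the tagger jars are on the classpath, tagging text looks roughly like the sketch below; the model path assumes the layout of the standard English distribution and may differ in your download:

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

// Minimal sketch of POS tagging with the Stanford tagger.
// Adjust the model path to wherever the .tagger file lives in your download.
public class TaggerSketch {
    public static void main(String[] args) {
        MaxentTagger tagger =
            new MaxentTagger("models/english-left3words-distsim.tagger");
        // Returns tokens joined with their POS tags, e.g. "keyword_NN extraction_NN".
        String tagged = tagger.tagString("Automatic keyword extraction is an important research topic.");
        System.out.println(tagged);
    }
}
```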
To evaluate performance, we compute the precision of the top 5 keywords generated for each document in the training dataset and compare it across the five models. The results are presented in the report.txt file. The reference used to compute the precision for each .abstr document is the set of manually assigned keywords stored in the corresponding .uncontr file.
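Precision at 5 is simply the fraction of the five predicted keywords that also appear in the gold set for that document. A sketch (names are illustrative):

```java
import java.util.*;

// Minimal sketch: precision of the top-5 predicted keywords against the gold set.
public class PrecisionSketch {
    public static double precisionAt5(List<String> top5, Set<String> gold) {
        int hits = 0;
        for (String kw : top5) {
            if (gold.contains(kw.toLowerCase())) {
                hits++;
            }
        }
        return hits / 5.0;
    }
}
```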