Skip to content

Latest commit

 

History

History
20 lines (12 loc) · 1.79 KB

README.md

File metadata and controls

20 lines (12 loc) · 1.79 KB

Key Phrase Extraction

Automatic keyword extraction is an important research topic. Keywords serve as a dense summary for a document, lead to improved information retrieval, or be the entrance to a document collection. The aim of this project is to find a small set of terms that describes a specific document, independently of the domain it belongs to.

Models

We examine five different models: n-grams (unigram, bigram, trigram), POSTagger-based-ngrams (terms matching any of a set of part-of-speech (POS) tag sequences), and RAKE (Rapid Automatic Keyword Extraction).

For the models using ngrams, we use three different features: term frequency, collection frequency and relative position of the first occurrence to find top keywords.

Installation & Usage

To run our model, download/clone this repo into your local device, compile & execute KeyPhraseExtraction.java in your IDE or through command-line tools (javac, java). Unzip the TRAINING.zip to get training data.

Files ending with .abstr are used to train our models, while .contr documents contains the controlled manually assigned keywords, separated with semicolon and .uncontr contains the uncontrolled manually assigned keywords, separated with semicolon.

For POSTagger model, please download the basic English Stanford Tagger and add .jar files to your project build path or class path.

Testing & Evaluation

To evaluate the performance, we compute the precision score of the top 5 keywords generated for each document in the Training Dataset and compare among 5 models.

The result is presented in the report.txt file. The standard used to calculate precision score of each .abstr document is the manually-picked keywords stored in the corresponding .uncontr files.