Automatic keyword extraction is an important research topic. Keywords serve as a dense summary of a document, can improve information retrieval, and can act as an entry point into a document collection. The aim of this project is to find a small set of terms that describes a specific document, independently of the domain it belongs to.
We examine five different models: n-grams (unigram, bigram, trigram), POSTagger-based n-grams (terms matching any of a set of part-of-speech (POS) tag sequences), and RAKE (Rapid Automatic Keyword Extraction).
For the n-gram-based models, we score candidate terms with three features: term frequency, collection frequency, and the relative position of the term's first occurrence, and select the top-scoring terms as keywords.
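As a rough illustration, these three features could be computed for unigram candidates as sketched below (class and method names here are illustrative, not the repository's actual API):

```java
import java.util.*;

// Minimal sketch: term frequency, collection frequency, and relative
// first-occurrence position for unigram candidates of one document.
// FeatureSketch / unigramFeatures are hypothetical names, not the repo's API.
public class FeatureSketch {

    public static Map<String, double[]> unigramFeatures(
            List<String> docTokens, Map<String, Integer> collectionCounts) {
        Map<String, double[]> features = new LinkedHashMap<>();
        int n = docTokens.size();
        for (int i = 0; i < n; i++) {
            String term = docTokens.get(i).toLowerCase();
            double[] f = features.get(term);
            if (f == null) {
                // [term frequency, collection frequency, relative first position]
                f = new double[] {
                    0,
                    collectionCounts.getOrDefault(term, 0),
                    (double) i / n   // earlier first occurrence -> smaller value
                };
                features.put(term, f);
            }
            f[0] += 1;               // term frequency within this document
        }
        return features;
    }
}
```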
To run the models, download or clone this repo to your local machine, then compile and execute KeyPhraseExtraction.java in your IDE or with the command-line tools (javac, java). Unzip TRAINING.zip to obtain the training data.
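For example, with the JDK command-line tools (assuming KeyPhraseExtraction.java sits at the repository root in the default package; adjust paths and package as needed):

```
unzip TRAINING.zip
javac KeyPhraseExtraction.java
java KeyPhraseExtraction
```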
Files ending with .abstr are used to train our models. The corresponding .contr files contain the controlled manually assigned keywords, and the .uncontr files contain the uncontrolled manually assigned keywords; in both cases the keywords are separated by semicolons.
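For illustration, the gold keywords for one document could be loaded and split like this (the file path is a placeholder; the repository's own loading code may differ):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Minimal sketch: load the semicolon-separated gold keywords from a .uncontr file.
public class GoldKeywordsSketch {
    public static Set<String> load(Path uncontrFile) throws IOException {
        String content = new String(Files.readAllBytes(uncontrFile)).trim();
        Set<String> keywords = new LinkedHashSet<>();
        for (String kw : content.split(";")) {
            if (!kw.trim().isEmpty()) {
                keywords.add(kw.trim().toLowerCase());
            }
        }
        return keywords;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; replace with an actual .uncontr file from TRAINING.zip.
        System.out.println(load(Paths.get("TRAINING/1.uncontr")));
    }
}
```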
For the POSTagger model, please download the basic English Stanford POS Tagger and add its .jar files to your project's build path or classpath.
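Once the tagger jars are on the classpath, tagging text looks roughly like the sketch below; the model path assumes the layout of the standard English distribution and may differ in your download:

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

// Minimal sketch of POS tagging with the Stanford tagger.
// Adjust the model path to wherever the .tagger file lives in your download.
public class TaggerSketch {
    public static void main(String[] args) {
        MaxentTagger tagger =
            new MaxentTagger("models/english-left3words-distsim.tagger");
        // Returns tokens joined with their POS tags, e.g. "keyword_NN extraction_NN".
        String tagged = tagger.tagString("Automatic keyword extraction is an important research topic.");
        System.out.println(tagged);
    }
}
```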
To evaluate performance, we compute the precision of the top 5 keywords generated for each document in the training dataset and compare it across the five models. The results are presented in the report.txt file. The reference used to compute the precision for each .abstr document is the set of manually assigned keywords stored in the corresponding .uncontr file.
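Precision at 5 is simply the fraction of the five predicted keywords that also appear in the gold set for that document. A sketch (names are illustrative):

```java
import java.util.*;

// Minimal sketch: precision of the top-5 predicted keywords against the gold set.
public class PrecisionSketch {
    public static double precisionAt5(List<String> top5, Set<String> gold) {
        int hits = 0;
        for (String kw : top5) {
            if (gold.contains(kw.toLowerCase())) {
                hits++;
            }
        }
        return hits / 5.0;
    }
}
```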