This is the README file for Project 2 of Dr. Michael Bloodgood's Natural Language Processing (CSC 427-01) class at The College of New Jersey, completed by Robert Helck, Michael Giordano, Geethika Manojkumar, and Poean Lu. This project asked our team to develop unigram and bigram language models implementing the Maximum Likelihood Estimation (MLE) method as well as the Add-1 smoothing technique, as discussed in class. Our team was also asked to perform a variety of experiments using these models and provide a written analysis of the results. Additional requirements included (but were not limited to) analyzing the perplexity scores of our models and finding and normalizing appropriate training and test data.
This README provides both an overview of the contents of this package and detailed command-line instructions on the proper installation and use of the program.
Our package includes the source code of the program (see main.py), as well as training and test corpora. Corpora came from the following sources (a retrieval sketch follows the list):
- http://www.nltk.org/nltk_data/ - Australian Broadcasting Commission 2006; file taken from: \abc\rural.txt
- http://www.nltk.org/nltk_data/ - Project Gutenberg Selections; file taken from: \gutenberg\austen-sense.txt
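For reference, here is a minimal sketch of how the raw corpus files can be fetched with NLTK. It assumes the nltk package is installed; the corpus names and file IDs below follow the NLTK data layout.

```python
# Sketch: fetching the raw corpus files with NLTK (assumes nltk is installed).
import nltk

nltk.download('abc')        # Australian Broadcasting Commission 2006
nltk.download('gutenberg')  # Project Gutenberg Selections

from nltk.corpus import abc, gutenberg

rural_text = abc.raw('rural.txt')                # source of australia_rural.txt
austen_text = gutenberg.raw('austen-sense.txt')  # source of austin.txt
```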
Source files in folder:
- austin.txt - text file of 'Sense and Sensibility' by Jane Austen
- australia_rural.txt - text file containing news reports from the Australian Broadcasting Commission
- D2.txt - write-up of statistics required for the D2 deliverable
- D3.txt - write-up of feedback required for the D3 deliverable
- main.py - main Python code that can be executed by the user to run the program
- README.md - this file that explains the contents to the user
Operating System: Ubuntu
Language: Python 3.7.5
On the ELSA command line, run: $ module add python/3.7.5
This program is designed to be used from the terminal. Once the user has entered the directory where they unzipped the tar.gz file, the user enters "python3 main.py file.txt", where file.txt is the corpus the user would like to use. For convenience, we have included two corpora in our package; however, the user is free to enter the full path to another .txt file instead. If the program cannot locate the file, the user is prompted for a new file until a usable file path is entered. After this, the user is prompted to enter either "MLE" or "Add-1", choosing between Maximum Likelihood Estimation and Add-1 smoothing (NOTE: only Add-1 is available; there is no support for Add-k smoothing where k != 1). If the user does not enter "MLE" or "Add-1", the user will continue to be prompted to do so. A sketch of this input loop appears below.
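The input loop described above can be pictured roughly as follows. This is an illustrative sketch, not the code from main.py; the prompt strings are hypothetical.

```python
# Illustrative sketch of the input loop (not the actual main.py code).
import os
import sys

path = sys.argv[1] if len(sys.argv) > 1 else input("Enter a corpus file: ")
while not os.path.isfile(path):
    path = input("File not found. Enter a new file path: ")

method = input('Enter "MLE" or "Add-1": ')
while method not in ("MLE", "Add-1"):
    method = input('Please enter "MLE" or "Add-1": ')
```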
This program reports the perplexity scores for the unigram and bigram models (measured on the held-out data from the corpus specified by the user) on the command line, as well as the 5 most common sentences generated from the unigram and bigram models and the 10 most common unigrams and bigrams for the respective models.
Normalization:
- All punctuation is removed and replaced with spaces.
- All text is changed to lower case.
- Each line of the raw .txt file starts with "<s>" and ends with "</s>".
- We split the given corpus such that 75% of our data was used for training and the remaining 25% was held out for testing. A sketch of these steps follows this list.
Outputs:
- Sentences stop when the "</s>" token is generated or 20 tokens are in the sentence.
- For the chosen task, output includes the following (sketches of sentence generation and the perplexity computation appear after this section):
- Unigram Perplexity
- Bigram Perplexity
- Unigram MLE sentences
- Bigram MLE sentences
- Top 10 MLE Unigrams
- Top 10 MLE Bigrams
- For sentence output, 5 sentences are given according to the PDF instructions.
- As the purpose of this assignment is a learning experience, the "</s>" token was left in the generated sentences. In an actual application of this software, "</s>" would potentially be removed.
-Addiionally "
" and "" were used in the sentences. In an actual implementation, these may be removed. -For Top 10 outputs, format is "[uni/bi]gram : # of occurrences of [uni/bi]gram : [MLE/Add-1] probability"