Natural Language Processing (NLP) is a field of AI and data science concerned with analyzing text and language data. This assignment builds a simple language model and explores how probability applies to text data. The main objective is to load and tokenize the Berkeley Restaurant dataset introduced in the book “Speech and Language Processing” by Jurafsky and Martin. The main steps of the assignment are:
- Load and preprocess the dataset provided. Tokenize the text, keeping only actual words and removing disfluencies such as “uh” and “uhm”. Add special tokens to mark sentence boundaries (e.g., `<s>` at the beginning and `</s>` at the end of each sentence); a preprocessing sketch is given after this list.
- Count the words and report the size of the vocabulary, as well as the number of sentences in the dataset.
- Read the chapter on N-grams and generate Figures 4.1 and 4.2 (the bigram count and bigram probability tables) for the dataset; the figures do not have to be exact. See the table-building sketch after this list.
- Calculate the joint probability of at least five sentences, built from words in the dataset vocabulary, using bigrams (see the scoring sketch after this list).
- Repeat the joint-probability calculation using trigrams and observe whether the estimates change.
- Submit the code, or a PDF of the program and its output. The program should run.
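
A minimal preprocessing sketch for the first two steps, assuming the dataset is a plain-text file with one sentence per line. The filename `berkeley_restaurant.txt` and the disfluency list are placeholders rather than part of the assignment, and nltk's `word_tokenize` is just one reasonable tokenizer choice:

```python
from collections import Counter

import nltk
nltk.download("punkt", quiet=True)  # tokenizer data required by word_tokenize
from nltk.tokenize import word_tokenize

# Disfluencies to drop -- an assumed list; extend it as the data requires.
DISFLUENCIES = {"uh", "uhm", "um"}

def preprocess(path="berkeley_restaurant.txt"):
    """Return a list of tokenized sentences wrapped in <s> ... </s> markers."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # keep alphabetic tokens only (drops punctuation and numbers)
            tokens = [t.lower() for t in word_tokenize(line)
                      if t.isalpha() and t.lower() not in DISFLUENCIES]
            if tokens:
                sentences.append(["<s>"] + tokens + ["</s>"])
    return sentences

if __name__ == "__main__":
    sents = preprocess()
    # note: <s> and </s> are counted as vocabulary items here
    counts = Counter(w for s in sents for w in s)
    print("number of sentences:", len(sents))
    print("vocabulary size:", len(counts))
```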
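One possible way to rebuild the two bigram tables (raw counts and MLE probabilities) over the eight words used in the book's figures. Here `pandas` is assumed to be available purely for tabular display, and `sents` refers to the output of the preprocessing sketch above:

```python
from collections import Counter

import pandas as pd  # assumed available; used only for tabular display

def bigram_tables(sentences, words=("i", "want", "to", "eat",
                                    "chinese", "food", "lunch", "spend")):
    """Return bigram-count and bigram-probability tables for the given words."""
    unigram = Counter(w for s in sentences for w in s)
    bigram = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

    # raw bigram counts (rows = first word, columns = second word), cf. Figure 4.1
    counts = pd.DataFrame([[bigram[(w1, w2)] for w2 in words] for w1 in words],
                          index=words, columns=words)

    # MLE bigram probabilities P(w2 | w1) = C(w1 w2) / C(w1), cf. Figure 4.2
    probs = counts.div([unigram[w] for w in words], axis=0).round(4)
    return counts, probs

# counts, probs = bigram_tables(sents)   # `sents` from the preprocessing sketch
# print(counts, probs, sep="\n\n")
```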
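A sketch of the joint-probability calculation that covers both bigrams (n=2) and trigrams (n=3), using unsmoothed maximum-likelihood estimates. Padding with n-1 start symbols is an assumption made so the trigram case has well-defined contexts, and the sentence in the usage comment is a placeholder; use sentences whose words all occur in the dataset vocabulary:

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"

def pad(tokens, n):
    """Strip any existing boundary markers, then re-pad with n-1 start symbols."""
    core = [t for t in tokens if t not in (BOS, EOS)]
    return [BOS] * (n - 1) + core + [EOS]

def build_counts(sentences, n):
    """Count n-grams and their (n-1)-word contexts over the corpus."""
    ngrams, contexts = Counter(), Counter()
    for s in sentences:
        p = pad(s, n)
        for i in range(n - 1, len(p)):
            context, word = tuple(p[i - n + 1:i]), p[i]
            ngrams[context + (word,)] += 1
            contexts[context] += 1
    return ngrams, contexts

def sentence_probability(tokens, sentences, n=2):
    """Joint (chain-rule) probability of a sentence under an unsmoothed MLE n-gram model."""
    ngrams, contexts = build_counts(sentences, n)
    p = pad(tokens, n)
    prob = 1.0
    for i in range(n - 1, len(p)):
        context, word = tuple(p[i - n + 1:i]), p[i]
        denom = contexts[context]
        prob *= ngrams[context + (word,)] / denom if denom else 0.0  # unseen -> 0
    return prob

# example usage (placeholder sentence):
# tokens = "i want chinese food".split()
# print(sentence_probability(tokens, sents, n=2))   # bigram estimate
# print(sentence_probability(tokens, sents, n=3))   # trigram estimate
```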
Python packages such as nltk will be used to accomplish this task (see the preprocessing sketch above). The N-gram model will be used to build the language model, and the joint probability will be computed from the n-gram counts, as summarized below.
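For reference, the standard unsmoothed maximum-likelihood estimate for a bigram and the chain-rule approximation of the sentence probability, where C(·) is a count over the corpus; the trigram case conditions on the two preceding words instead of one:

```latex
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})},
\qquad
P(w_1 \dots w_N) \approx \prod_{n=1}^{N} P(w_n \mid w_{n-1})
```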
The final submission should include the code, or a PDF of the program and its output. The program must run without errors; a submission that does not run will receive no credit.