# CMPS 6730 Philosophy Natural Language Processing Project

## Overview

This repository contains Bobby Becker's project for CMPS 4730/6730 at Tulane University, which applies NLP techniques to analyze and interpret major philosophical texts. The project uses Latent Dirichlet Allocation (LDA) and Word2Vec models to explore thematic connections within and between 43 works of philosophy, including works by philosophers such as Plato, Aristotle, Marx, Nietzsche, Kant, and others.

## Project Artifacts

### 1. `final_project.ipynb`

This Jupyter notebook showcases the project's data analysis. Here is what it contains:

#### Latent Dirichlet Allocation (LDA)
- **Text Preparation**: Each of the 43 philosophical works is preprocessed to remove stopwords and other non-informative text elements, and the cleaned text is then tokenized.
- **LDA Processing**: The LDA model is applied to the tokenized text to extract key themes, each represented by four words. This thematic extraction helps in understanding the central topics discussed in each work (see the sketch below).

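
The exact notebook code is not reproduced here, but a minimal sketch of this step might look as follows, assuming Gensim's `LdaModel` and NLTK's tokenizer and stopword list; the function names and the single-topic setup are illustrative, not the project's actual implementation:

```python
# Illustrative sketch only: preprocess one work and pull four LDA theme words from it.
# Requires: nltk.download("punkt"), nltk.download("stopwords")
from gensim import corpora
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, tokenize, and drop stopwords and non-alphabetic tokens.
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

def theme_words(text, n_words=4):
    # Fit a single-topic LDA on one work and return its top n_words terms.
    tokens = preprocess(text)
    dictionary = corpora.Dictionary([tokens])
    bow_corpus = [dictionary.doc2bow(tokens)]
    lda = LdaModel(bow_corpus, num_topics=1, id2word=dictionary, passes=10)
    return [word for word, _ in lda.show_topic(0, topn=n_words)]
```
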
#### Word2Vec
- **Vector Training**: After the LDA step, a Word2Vec model is trained on the corpus to generate word vectors for the identified thematic words.
- **Vector Averaging**: For each text, the vectors of its four theme words are averaged to create a single vector that represents the overall thematic essence of the text (see the sketch below).

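
A rough sketch of how these two steps could look with Gensim follows; `tokenized_works`, the function names, and the hyperparameter values are assumptions for illustration, not the project's actual settings:

```python
# Illustrative sketch only: train Word2Vec on the corpus and average theme-word vectors.
import numpy as np
from gensim.models import Word2Vec

def train_word2vec(tokenized_works):
    # `tokenized_works`: a list of token lists, one per philosophical work.
    return Word2Vec(sentences=tokenized_works, vector_size=100, window=5, min_count=2, workers=4)

def text_vector(theme_words, model):
    # Average the vectors of the theme words found in the vocabulary,
    # giving one vector that stands in for the whole text.
    vectors = [model.wv[w] for w in theme_words if w in model.wv]
    return np.mean(vectors, axis=0)
```
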
#### Principal Component Analysis (PCA)
- **Dimensionality Reduction**: The high-dimensional vectors are reduced to 2D and 3D using PCA, which creates a visual representation of the vector similarities between the philosophical works. The works are plotted against each other, against the word vectors representing each philosopher, and against the word vectors representing the philosophical themes (see the sketch below).

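
A minimal sketch of the 2D projection, assuming scikit-learn's `PCA` and matplotlib (the function and argument names are illustrative):

```python
# Illustrative sketch only: project the averaged text vectors to 2D and label the points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_text_vectors_2d(text_vectors, labels):
    # `text_vectors`: array-like of shape (n_texts, vector_size); `labels`: work titles.
    coords = PCA(n_components=2).fit_transform(np.asarray(text_vectors))
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), label in zip(coords, labels):
        plt.annotate(label, (x, y))
    plt.show()
```
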
### 2. `flask.py` and `plato_matcher_online.py`

The web component uses Flask to provide an interactive interface that lets users query Plato's texts. It can be viewed as a potential application of the research portion of this project.

#### Workflow
1. **Text Segmentation**: First, the works of Plato are loaded, tokenized, and segmented into 500 parts.
2. **Latent Dirichlet Allocation (LDA)**: Using LDA, we analyze each portion of the text and generate 6 words to represent that passage.
3. **Vector Representation**: We then load a Word2Vec model trained on all 43 philosophical works used in the Jupyter notebook. The 6 words generated for each passage are averaged together to create a unique vector representing that passage.
4. **User Interaction**: Users submit text through the web interface, which goes through the same process as the passages of Plato: 6 words are generated by LDA to represent the user input, and those 6 words are vectorized and averaged into a vector representing the user's input.
5. **Similarity Calculation**: We then calculate the cosine similarity between the user's vector and each passage's vector to find the best match (see the sketch below).
6. **Text Refinement and Citation**: Once the most relevant passage is identified, a GPT-3.5 model is used to identify and rewrite the most important portion of the passage and provide a citation to the user.

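
A minimal sketch of the matching step (steps 4 and 5), assuming the passage vectors have already been computed and stored as rows of a matrix; the function and variable names are illustrative, not the application's actual code:

```python
# Illustrative sketch only: find the Plato passage whose vector is closest to the user's.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def best_passage(user_vector, passage_vectors, passages):
    # `user_vector`: averaged theme vector for the user's input.
    # `passage_vectors`: (n_passages, vector_size) matrix of precomputed passage vectors.
    # `passages`: the corresponding passage texts.
    sims = cosine_similarity(user_vector.reshape(1, -1), np.asarray(passage_vectors))[0]
    best = int(np.argmax(sims))
    return passages[best], float(sims[best])
```
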
## Getting Started

### Prerequisites

- Python 3.8+
- Flask
- Gensim
- NLTK
- scikit-learn
- OpenAI API key

### Installation

```bash
git clone https://github.com/yourusername/cmps-6730-nlp-project.git
cd cmps-6730-nlp-project
pip install -r requirements.txt
```

Also, add your OpenAI API key to the `plato_online_matcher.py` file.

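
If you would rather not hard-code the key, one possible alternative is to read it from an environment variable; this assumes the script uses the pre-1.0 `openai` Python client, which is an assumption rather than a documented detail of this project:

```python
# Possible alternative (assumes the pre-1.0 `openai` client): read the key from the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
```
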
### Running the Application

```bash
python flask.py
```

Then navigate to http://127.0.0.1:5000/ in your web browser.

### Example Usage

Question:
<img width="1249" alt="Question_Friendship" src="https://github.com/tulane-cmps6730/project-philosophy/assets/86581611/dda477e9-7bb4-40fe-9d4d-743ca5f0b75e">

Answer:
<img width="1216" alt="Answer_Friendship" src="https://github.com/tulane-cmps6730/project-philosophy/assets/86581611/5f0af556-7a8d-4e20-826d-3045484efec4">

Question:
<img width="1233" alt="Question_Politics" src="https://github.com/tulane-cmps6730/project-philosophy/assets/86581611/535ed02b-1305-4307-97f6-f54b00254f66">

Answer:
<img width="1216" alt="Answer_Politics" src="https://github.com/tulane-cmps6730/project-philosophy/assets/86581611/f15d67f8-622f-411f-914b-d872e4686203">