This repository contains Bobby Becker's project for CMPS 4730/6730 at Tulane University, which applies NLP techniques to analyze and interpret major philosophical texts. The project uses Latent Dirichlet Allocation (LDA) and Word2Vec models to explore thematic connections within and between 43 works of philosophy, including works by Plato, Aristotle, Marx, Nietzsche, Kant, and others.
This Jupyter notebook showcases the data analysis of the project. Here's what it contains:
- Text Preparation: Each of the 43 philosophical works is preprocessed to remove stopwords and other non-informative text elements. This clean text is then tokenized.
- LDA Processing: The LDA model is applied to the tokenized text to extract key themes, each represented by four words. This thematic extraction helps in understanding the central topics discussed in each work.
- Vector Training: Post-LDA, a Word2Vec model is trained on the corpus to generate word vectors for the identified thematic words.
- Vector Averaging: For each text, the vectors of its four theme words are averaged to create a single vector that represents the overall thematic essence of the text.
- Dimensionality Reduction: The high-dimensional vectors are reduced to 2D and 3D using PCA, which gives a visual representation of the vector similarities between the philosophical works. The works are plotted against each other, against the word vectors representing each philosopher, and against the word vectors representing the philosophical themes.
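The vector averaging step above can be illustrated with a minimal sketch. It uses only the standard library, and the names `average_theme_vector` and `toy_vectors` are hypothetical: the small hand-written dictionary stands in for the vectors a trained Word2Vec model would supply, and the four theme words are made up for illustration rather than taken from the notebook's LDA output.

```python
from typing import Dict, List, Sequence


def average_theme_vector(
    theme_words: Sequence[str],
    word_vectors: Dict[str, List[float]],
) -> List[float]:
    """Average the vectors of the theme words that have a vector."""
    # Skip words the (hypothetical) vector lookup does not know about.
    known = [word_vectors[w] for w in theme_words if w in word_vectors]
    if not known:
        raise ValueError("none of the theme words have a vector")
    dim = len(known[0])
    # Component-wise mean across all known theme-word vectors.
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 3-dimensional vectors; a real Word2Vec model would typically
# produce vectors with many more dimensions.
toy_vectors = {
    "justice": [1.0, 0.0, 2.0],
    "soul": [0.0, 1.0, 2.0],
    "city": [1.0, 1.0, 0.0],
    "virtue": [2.0, 2.0, 0.0],
}

republic_vector = average_theme_vector(
    ["justice", "soul", "city", "virtue"], toy_vectors
)
print(republic_vector)
```

In the notebook itself, the lookup would presumably come from the trained Word2Vec model rather than a hand-written dictionary, and the resulting per-text vectors are what PCA then projects to 2D and 3D.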
The web component uses Flask to provide an interactive interface that lets the user query Plato's texts. It can be viewed as a potential application of the research portion of this project.
- Text Segmentation: First, the works of Plato are loaded, tokenized, and segmented into 500 parts.
- Latent Dirichlet Allocation (LDA): Then, using LDA, we analyze each portion of the text and generate 6 words to represent that passage.
- Vector Representation: We then load in a Word2Vec model, trained on all 43 philosophical works used in the Jupyter notebook. The 6 words generated by each passage are averaged together to create a unique vector to represent each passage.
- User Interaction: Users submit text through the web interface, which goes through the same process as the passages of Plato: 6 words are generated by LDA to represent the user input, and those 6 words are vectorized and averaged to create a vector representing the user's input.
- Similarity Calculation: We then calculate cosine similarity between the user's vector and each passage's vector to find the best match.
- Text Refinement and Citation: Once the most relevant passage is identified, a GPT-3.5 model is used to identify and rewrite the most important portion of the passage and provide a citation to the user.
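The similarity calculation step can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: `cosine_similarity`, `best_match`, and the passage identifiers are hypothetical names, and the 2-dimensional vectors are toy values standing in for the averaged Word2Vec vectors of the user input and of each passage.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as matching nothing.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    query: List[float],
    passages: Dict[str, List[float]],
) -> Tuple[str, float]:
    """Return the (passage id, score) pair with the highest similarity."""
    scored = ((pid, cosine_similarity(query, vec)) for pid, vec in passages.items())
    return max(scored, key=lambda pair: pair[1])


# Toy passage vectors keyed by hypothetical passage ids.
passage_vectors = {
    "passage_0": [1.0, 0.0],
    "passage_1": [0.0, 1.0],
    "passage_2": [1.0, 1.0],
}

match_id, score = best_match([2.0, 2.0], passage_vectors)
print(match_id, round(score, 6))
```

Cosine similarity compares direction rather than magnitude, so the query `[2.0, 2.0]` matches `passage_2` even though the two vectors differ in length.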
- Python 3.8+
- Flask
- Gensim
- NLTK
- sklearn
- OpenAI API key
    git clone https://github.com/yourusername/cmps-6730-nlp-project.git
    cd cmps-6730-nlp-project
    pip install -r requirements.txt
Set your OpenAI API key at the top of the `plato_online_matcher.py` file.
Run `python flask.py`, then navigate to http://127.0.0.1:5000/ in your web browser.