Complete mini project 3 #3

Open · wants to merge 5 commits into master
20 changes: 19 additions & 1 deletion README.md
@@ -1,3 +1,21 @@
# TextMining

This is the base repo for the text mining and analysis project for Software Design at Olin College.
## Description
This project analyzes philosophical texts for linguistic similarity and visualizes their relationship spatially using Metric Multidimensional Scaling.
It also includes a Markov text synthesizer to generate a philosophical "maxim" across all included schools of thought.
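The similarity pipeline described above can be sketched in a few lines: word-count vectors, pairwise cosine similarity, then MDS on the resulting dissimilarities. The toy texts and counts here are illustrative stand-ins, not the project's corpus:

```python
# Sketch: word-count vectors -> cosine similarity -> MDS embedding.
import numpy as np
from sklearn.manifold import MDS

texts = ["the way that can be told", "the state and the law", "the will to power"]
vocab = sorted({w for t in texts for w in t.split()})
vecs = np.array([[t.split().count(w) for w in vocab] for t in texts], dtype=float)

norms = np.linalg.norm(vecs, axis=1)
sim = (vecs @ vecs.T) / np.outer(norms, norms)  # pairwise cosine similarity
dissim = 1 - sim                                # MDS wants dissimilarities

coords = MDS(dissimilarity='precomputed').fit_transform(dissim)
print(coords.shape)  # -> (3, 2): one 2-D point per text
```

Each text lands at a 2-D point whose distances approximate the 1 − cosine dissimilarities, which is what the project's scatter plot visualizes.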

## Getting Started

### Required Packages:
```
pip install nltk requests vaderSentiment
pip install matplotlib scikit-learn scipy
```
Review comment:
This is really minor, but these two lines came out as on the same line which made it initially difficult to understand that there were two commands. Make sure that your markdown styling is what you want it to be in the final product.


### Usage:
To run the text analysis, use:
```
python text_mining.py
```
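The Markov "maxim" synthesizer itself is not included in this diff; as a minimal bigram sketch of the idea (corpus and helper names here are illustrative, not the project's):

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def synthesize(chain, start, length=8, seed=0):
    """Random-walk the chain from a start word to generate a short 'maxim'."""
    random.seed(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return ' '.join(out)

corpus = "the good is the aim of the wise and the wise seek the good".split()
print(synthesize(build_chain(corpus), 'the'))
```

Because follower words are sampled in proportion to how often they occur, the output mimics local word order across the combined corpus while producing new sentences.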

### Existing Files:
`philtexts3.pickle` was generated using `python pulltexts.py`
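The pickle is assumed to hold a plain list of raw text strings, which is what `load_texts` in `pulltexts.py` expects. A quick round-trip sketch (the demo filename and sample strings are illustrative):

```python
import pickle

# Round-trip demo: a list of text strings pickles and reloads unchanged.
sample = ["First toy text.", "Second toy text."]
with open('philtexts_demo.pickle', 'wb') as f:
    pickle.dump(sample, f)

with open('philtexts_demo.pickle', 'rb') as f:
    texts = pickle.load(f)

print(texts == sample)  # -> True
```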

## Links
[Project Reflection](Reflection.pdf)
Binary file added: Reflection.pdf
Binary file added: TextCluster.png
Binary file added: philtexts3.pickle
143 changes: 143 additions & 0 deletions pulltexts.py
@@ -0,0 +1,143 @@
"""
File: text_similarity.py
Name: Ava Lakmazaheri
Date: 10/11/17
Desc: Load, pickle texts from Project Gutenberg
"""
import pickle
import numpy as np
import math
from sklearn.manifold import MDS
import matplotlib.pyplot as plt
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

all_names = ['tao', 'analects', 'plato', 'aristotle', 'machiavelli', 'spinoza',
             'locke', 'hume', 'kant', 'marx', 'mill', 'cousin', 'nietzsche']

num = len(all_names)

all_texts = [' '] * num

def clean(text):
    """
    Removes header and footer text from Gutenberg document
    Input: string
    Output: string
    """
    startidx = text.find(" ***")
    endidx = text.rfind("*** ")
    return text[startidx:endidx]

def load_texts(filename):
    """
    Loads in all books from a .pickle file and stores each as a string element in a list
    Input: string (.pickle file name)
    Output: none (fills the module-level all_texts list)
    """
    with open(filename, 'rb') as input_file:
        reloaded_copy_of_texts = pickle.load(input_file)

    for i in range(num):
        all_texts[i] = clean(reloaded_copy_of_texts[i])

def histogram(text):
    """
    Counts occurrences of each word in text
    Input: string
    Output: dict
    """
    d = dict()

    # break giant string of text into list of words
    words = text.split()
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d

def all_unique_words(all_texts):
    """
    Accounts for all unique words in all texts provided, to assist with similarity analysis
    Input: list of strings
    Output: list of strings
    """
    allwords = []

    for text in all_texts:
        wordlist = text.split()
        for word in wordlist:
            if(word not in allwords):
Review comment:
No parentheses needed

                allwords.append(word)
    return allwords

def gen_vector(text, wordbank):
    """
    Generate an n-dimensional vector for word count (where n is the total number of unique words)
    Inputs: string, list of strings
    Output: list of values (in this case, floats >= 0)
    """
    v = []
    h = histogram(text)

    for word in wordbank:
        v.append(h.get(word, 0))

    return v

def comp_cos(vec1, vec2):
    """
    Compute the cosine similarity between two vectors
    Inputs: two lists of floats
    Output: float
    """
    dot_product = np.dot(vec1, vec2)
    norm_1 = np.linalg.norm(vec1)
    norm_2 = np.linalg.norm(vec2)
    cos_val = dot_product / (norm_1 * norm_2)
    if math.isnan(cos_val):
        cos_val = 0
    return cos_val

def similarity():
    """
    Run linguistic similarity analysis on philosophy texts. Print the raw
    similarity comparisons and plot their relationships spatially.
    """
    wordbank = all_unique_words(all_texts)

    vecs = [[]] * num
    for i in range(num):
        vecs[i] = gen_vector(all_texts[i], wordbank)

    sim = np.zeros((num, num))
    for i in range(num):
        for j in range(num):
            sim[i][j] = comp_cos(vecs[i], vecs[j])
            #print(sim[i][j])

    dissimilarities = 1 - sim
    coord = MDS(dissimilarity='precomputed').fit_transform(dissimilarities)

    plt.scatter(coord[:, 0], coord[:, 1])

    # Label the points
    for i in range(coord.shape[0]):
        plt.annotate(str(i), coord[i, :])

    plt.show()

def sentiment(text):
    """
    Run valence sentiment analysis on text
    Input: string
    Output: dict
    """
    analyzer = SentimentIntensityAnalyzer()
    f = analyzer.polarity_scores(text)
    return f

if __name__ == "__main__":
load_texts('philtexts2.pickle')
similarity()
# for i in range(num-1):
# print(all_names[i])
# print(sentiment(all_texts[i]))