COMM 493 Sentiment Analysis App

Using customer review data, we built an MVP full-stack dashboard for Urban Closet, a fictitious fast-fashion retailer.

Objectives

Completed Objectives

Developed a Sentiment Analysis model with Python.
- See Model Training below for how we trained the model.
Constructed a simple Flask server.
- GET "/" for Vue.js app
- GET "/api/${product}" for JSON formatted review data.
- POST {"text": YOUR_STRING} to "/api/classify" for a sentiment result.
Implemented a Vue.js frontend application.
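
As an illustration of the API listed above, here is a minimal client-side sketch that builds the POST request for "/api/classify". The base URL (http://localhost:5000, Flask's default development address), the JSON content type, and the helper name build_classify_request are assumptions made for illustration. The shape of the response is not specified in this README, so the sketch stops at constructing the request.

```python
import json
import urllib.request

# Assumption: the Flask server runs locally on its default development port.
BASE_URL = "http://localhost:5000"

def build_classify_request(text, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /api/classify route."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/api/classify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_classify_request("Love this dress, fits perfectly!")
# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         result = json.load(response)
```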

Technology Stack

Server: Flask

Client: Vue.js

AI Model: SKLearn & NLTK

Model Training

Step 1: Read in the reviews from the CSV files.

import os
import pandas as pd

dir_path = os.path.dirname(os.path.realpath(__file__))
df_kaggle = pd.read_csv(dir_path + '/comments-kaggle.csv')
df_case = pd.read_csv(dir_path + '/comments.csv')

Step 2: Cleaning up the text.

Lowercase the text and remove any line breaks.
Tokenize the text and drop stopwords (the, a, an, etc.).
Stem each remaining word (reduce it to its root form).

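# Requires the NLTK stopwords corpus: nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
en_stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

def getStemmedReview(review):
    # Lowercase and replace HTML line breaks with a space
    review = review.lower()
    review = review.replace("<br /><br />", " ")
    # Tokenize, drop stopwords, then stem each remaining token
    tokens = tokenizer.tokenize(review)
    new_tokens = [token for token in tokens if token not in en_stopwords]
    stemmed_tokens = [ps.stem(token) for token in new_tokens]
    clean_review = ' '.join(stemmed_tokens)
    return clean_review

# apply() returns a new Series, so assign the cleaned text back to the column
df_kaggle['Comment'] = df_kaggle['Comment'].apply(getStemmedReview)
df_case['Comment'] = df_case['Comment'].apply(getStemmedReview)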

Step 3: Split into test and training sets

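# Reset the index so row labels are unique after concatenation
df = pd.concat([df_kaggle, df_case], ignore_index=True)
split = len(df)*7//10

# Slice by position so the training and test sets do not share a row
x_train = df['Comment'].iloc[:split].values
y_train = df['Sentiment'].iloc[:split].values
x_test = df['Comment'].iloc[split:].values
y_test = df['Sentiment'].iloc[split:].values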

Step 4: Vectorize text

To feed the data to the machine learning model, we must convert the text into a numerical form.
Note that we fit the vectorizer on the training set only. Once it has learned its vocabulary and weights from the training data, the same transformation is applied to the test data.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, encoding='utf-8', decode_error='ignore')
vectorizer.fit(x_train)
x_train=vectorizer.transform(x_train)
x_test=vectorizer.transform(x_test)
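
To illustrate why the vectorizer is fit on the training set only, the following sketch uses a deliberately simplified bag-of-words counter in plain Python, not the TfidfVectorizer used above. The function names and sample reviews are hypothetical. The vocabulary is learned from the training text, and words that appear only in the test text are ignored at transform time.

```python
def fit_vocabulary(documents):
    """Learn a word -> column index mapping from the training documents only."""
    vocabulary = {}
    for document in documents:
        for word in document.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary

def transform(documents, vocabulary):
    """Turn each document into a list of word counts using a fixed vocabulary."""
    rows = []
    for document in documents:
        row = [0] * len(vocabulary)
        for word in document.split():
            index = vocabulary.get(word)
            if index is not None:  # words unseen during fit are dropped
                row[index] += 1
        rows.append(row)
    return rows

train_docs = ["love dress", "poor fit"]  # hypothetical cleaned reviews
test_docs = ["love fit zipper"]          # "zipper" never appeared in training

vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)
```

Because the test rows reuse the training vocabulary, both matrices have the same number of columns, which is what the model expects at prediction time.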

Step 5: Create a model

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

from sklearn.linear_model import LogisticRegression
model=LogisticRegression(solver='liblinear')
model.fit(x_train,y_train)
print('Score on training data is: '+str(model.score(x_train,y_train)))
print('Score on testing data is: '+str(model.score(x_test,y_test)))

Step 6: Model persistence

Save the model, vectorizer, and stopwords as pickle files for use by our web application.
https://docs.python.org/3.7/library/pickle.html

# sklearn.externals.joblib has been removed from recent scikit-learn releases
import joblib
joblib.dump(en_stopwords, dir_path + '/pkl_objects/stopwords.pkl')
joblib.dump(model, dir_path + '/pkl_objects/model.pkl')
joblib.dump(vectorizer, dir_path + '/pkl_objects/vectorizer.pkl')
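
The web application can later reload these objects. The sketch below shows a save-and-reload round trip using the standard-library pickle module and a temporary directory in place of joblib and the project's pkl_objects folder. The helper names and the stand-in stopword set are for illustration only. It also creates the output directory first, since writing into a directory that does not exist raises an error.

```python
import os
import pickle
import tempfile

def save_object(obj, directory, filename):
    """Pickle obj to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path

def load_object(path):
    """Load a previously pickled object from path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Stand-in for en_stopwords; the real set comes from NLTK.
sample_stopwords = {"the", "a", "an"}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_dir = os.path.join(tmp_dir, "pkl_objects")
    saved_path = save_object(sample_stopwords, pkl_dir, "stopwords.pkl")
    restored = load_object(saved_path)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.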

Model 300:

Method for labeling:

We labeled the 300 reviews provided with the case as either positive or negative. When a review expressed both sentiments, we labeled it according to whichever side it leaned toward more heavily. A future consideration is a model that ranks sentiment on a scale (for example, from -1 to 1, where 0 is neutral).

Splitting data:

We trained the model on a 70/30 random split.

Model 8202+300:

Sourcing Data:

The more data a model has to learn from, the more accurate it can be. We sourced additional e-commerce review data from Kaggle.

Source: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-review

Method for labeling:

Each review included a Boolean indicating whether the reviewer recommended the product. Reviewers who recommended the product gave higher ratings and more positive comments, so we used this Boolean to label each review as positive or negative.

Selecting Data:

We first removed all blank reviews, then separated the rest into positive and negative. The data included 4,101 negative reviews and over 18,000 positive reviews, so we selected all 4,101 negative reviews and a random sample of 4,101 positive reviews.
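
The selection procedure can be sketched as follows. This is a minimal plain-Python illustration, not the script we ran: the dictionary keys "text" and "recommended", the function name, and the toy data are all hypothetical.

```python
import random

def balance_reviews(reviews, seed=0):
    """Drop blank reviews, split by the recommend flag, and downsample positives."""
    non_blank = [r for r in reviews if r["text"] and r["text"].strip()]
    positive = [r for r in non_blank if r["recommended"]]
    negative = [r for r in non_blank if not r["recommended"]]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled_positive = rng.sample(positive, min(len(negative), len(positive)))
    return negative + sampled_positive

# Hypothetical toy data: 2 negative, 5 positive, and 1 blank review.
toy_reviews = (
    [{"text": "bad %d" % i, "recommended": False} for i in range(2)]
    + [{"text": "good %d" % i, "recommended": True} for i in range(5)]
    + [{"text": "  ", "recommended": True}]
)
balanced = balance_reviews(toy_reviews)
```

Balancing the classes this way keeps the model from favoring the positive label simply because positive reviews are far more common.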

Splitting data:

We trained the model on a 70/30 random split, with all training data coming from the Kaggle data and the test data consisting of the remaining Kaggle data plus the 300 case reviews.
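
The arithmetic behind this split can be checked with a short sketch. It assumes the Kaggle rows come first in the concatenated data, as in Step 3, and that the split is taken by row position. The constant and variable names are illustrative.

```python
KAGGLE_ROWS = 8202  # 4,101 negative + 4,101 positive Kaggle reviews
CASE_ROWS = 300     # labeled case reviews

total_rows = KAGGLE_ROWS + CASE_ROWS
split = total_rows * 7 // 10  # same expression as in Step 3

train_rows = split
test_rows = total_rows - split
kaggle_rows_in_test = KAGGLE_ROWS - train_rows

print(train_rows, test_rows, kaggle_rows_in_test)  # 5951 2551 2251
```

Since the training size (5,951) is smaller than the number of Kaggle rows (8,202), every training row is a Kaggle review, and the test set holds the remaining 2,251 Kaggle reviews plus the 300 case reviews.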

Author(s)

Andrew Greenan - GitHub - LinkedIn

Group Member(s)

Alex Lorant
Bridget Mulligan
Luke Nailor
Robert Cadman
