Shiyu Qiu @92612ShiyuQiu
Kanglan Tang @kanglant
Shiqi Wu @sophiewu6
Our project implements sentiment analysis on Twitter data, classifying a given tweet as expressing positive or negative sentiment. It uses several machine learning techniques: Naive Bayes, Gradient Boosting, SVM, Logistic Regression, a Neural Network, and an RNN. After training the models and combining selected ones into a hybrid model, we applied it to posts under popular Twitter hashtags to investigate public attitudes toward the controversial political, commercial, and social events behind those hashtags quickly, conveniently, and accurately. For example, during the 2020 presidential election, the model could be applied to evaluate Twitter users’ preferences among presidential candidates.
application:
Tweets_Application.ipynb -> applying the combined model to three topics, listing example tweets, and showing the percentages of positive, negative, and neutral attitudes
tweets.ipynb -> collecting tweets of different topics from Twitter's API
tweets-full-texts.ipynb -> collecting tweets with full text from Twitter's API using tweepy
coronavirus_full_text_preprocess.ipynb -> preprocessing all coronavirus tweets into word2vec format
full_text_tweets_preprocess.ipynb -> preprocessing the other two topics' tweets into word2vec format
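
As an illustration of the percentage summary that Tweets_Application.ipynb reports, here is a minimal sketch in plain Python. It assumes the model returns a probability of positive sentiment per tweet and that "neutral" means a probability close to 0.5; the function names and the `neutral_band` threshold are hypothetical and may differ from what the notebook actually does.

```python
from collections import Counter


def label_from_probability(p_positive, neutral_band=0.1):
    """Map P(positive) to a label.

    Assumption: probabilities within `neutral_band` of 0.5 count as neutral.
    """
    if abs(p_positive - 0.5) <= neutral_band:
        return "neutral"
    return "positive" if p_positive > 0.5 else "negative"


def sentiment_percentages(probabilities, neutral_band=0.1):
    """Return the percentage of tweets falling under each label."""
    labels = [label_from_probability(p, neutral_band) for p in probabilities]
    counts = Counter(labels)
    total = len(labels)
    return {
        name: (100.0 * counts[name] / total if total else 0.0)
        for name in ("positive", "negative", "neutral")
    }


# Example with made-up probabilities for four tweets.
example = sentiment_percentages([0.9, 0.8, 0.55, 0.1])
```

Widening `neutral_band` moves more tweets into the neutral group, so the reported percentages depend directly on that choice.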
application_dataset:
Metoomovement.csv -> tweets related to #MeToo
Trump.csv -> tweets related to #Trump
coronavirus.csv -> tweets related to #coronavirus
Metoomovement1.csv -> full text tweets related to #MeToo
trump1.csv -> full text tweets related to #Trump
coronavirus1.csv -> full text tweets related to #coronavirus
evaluation:
Classification Reports of Classifiers.ipynb -> understanding different models' performance from various perspectives
Dataset_Size_Threshold.ipynb -> choosing the dataset size used when running grid search to tune model parameters
eli5_eval.ipynb -> using the ELI5 library to show feature importances and explain predictions
manual_test.ipynb -> extracting the top positive and top negative tweets to examine prediction accuracy
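
To make the metrics behind a classification report concrete, the following sketch computes per-class precision, recall, and F1 in plain Python. It is a simplified stand-in for what a library report would print, not the notebooks' own code; the function name `classification_summary` and the 0/1 label encoding are assumptions.

```python
def classification_summary(y_true, y_pred, labels=(0, 1)):
    """Compute precision, recall, F1, and support for each label."""
    summary = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        summary[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true examples of this label
        }
    return summary


# Example with made-up labels: 1 = positive, 0 = negative.
report = classification_summary([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Looking at precision and recall per class, rather than overall accuracy alone, shows whether a model favors one sentiment over the other.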
models:
Combining Classifiers.ipynb -> combining selected high performance models
Gradient_Boost.ipynb -> training, optimizing and showing performance of gradient boost classifier
Naive Bayes Classifier.ipynb -> training, optimizing and showing performance of naive bayes classifier
Neural-Network.ipynb -> training, optimizing and showing performance of neural network classifier
RNN.ipynb -> training, optimizing and showing performance of recurrent neural network classifier
SVM.ipynb -> training, optimizing and showing performance of SVM classifier
logistic_bayesian.ipynb -> training, optimizing and showing performance of logistic bayesian classifier
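
One common way to combine several trained classifiers is hard majority voting. The sketch below shows that idea in plain Python; it is an assumption for illustration only, since this listing does not say which combination rule Combining Classifiers.ipynb uses, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine predictions by hard majority vote.

    `predictions_per_model` holds one list of predicted labels per model,
    all of the same length. With an even number of models, ties go to the
    label that appears first among the votes.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        winner, _count = Counter(votes).most_common(1)[0]
        combined.append(winner)
    return combined


# Example with made-up predictions from three models on four tweets.
combined = majority_vote([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
])
```

Using an odd number of models avoids ties for binary labels.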
new_data:
X_coronavirus.csv -> input matrix of 1000 tweets of #coronavirus
X_metoo.csv -> input matrix of 1000 tweets of #MeToo
X_sparse.csv -> input matrix of training tweets in sparse format
X_trump.csv -> input matrix of 1000 tweets of #Trump
Y.csv -> input labels of training tweets
combinedModel.sav -> stored trained combined model
features.csv -> vocabulary features
coronavirus_word2vec.csv -> preprocessed coronavirus tweets in word2vec format
trump_word2vec.csv -> preprocessed trump tweets in word2vec format
Metoomovement_word2vec.csv -> preprocessed metoomovement tweets in word2vec format
preprocess:
BagOfWords_sparse.ipynb -> preprocessing the training dataset into a matrix and storing the larger matrix in sparse format
Word2Vector.ipynb -> using the Word2Vec library to represent words as a dictionary mapping each word to its vector
preprocess_application_data.ipynb -> transforming tweets of the three topics into a matrix of 0s and 1s indicating whether each vocabulary feature appears
word2vec.model -> stored word2vec model
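
The 0/1 matrix described for preprocess_application_data.ipynb can be sketched in plain Python as follows. The tokenizer here (lowercasing plus a simple regular expression) and the function names are assumptions for illustration; the notebooks may clean and tokenize tweets differently.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def binary_feature_matrix(tweets, vocabulary):
    """Build one row per tweet with 1 where a vocabulary word appears, else 0."""
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for tweet in tweets:
        row = [0] * len(vocabulary)
        for token in set(tokenize(tweet)):
            if token in index:
                row[index[token]] = 1
        matrix.append(row)
    return matrix


# Example with a made-up three-word vocabulary.
X = binary_feature_matrix(
    ["Good news, good vaccine!", "So BAD"],
    ["good", "bad", "vaccine"],
)
```

Because each entry records presence rather than count, repeating a word in a tweet does not change its row. Most entries are 0 for a large vocabulary, which is why storing the training matrix in sparse format saves space.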