Sentiment Analysis in Stock Market using Twitter and Stocktwits Data with CNN, LSTM, MLP, NLP and Stacking Ensemble
This repo contains part of the code implementation for the paper accepted at the 2020 International Conference on Data Science and Its Applications (ICoDSA), entitled "SENN: Stock Ensemble-based Neural Network for Stock Market Prediction using Historical Stock Data and Sentiment Analysis", written by Louis Owen and Finny Oktariani.
tensorflow, stanfordnlp, bs4, contractions, inflect, nltk, textblob, string, pandas, numpy, gensim, sklearn, matplotlib, seaborn
The behavior of stock market prices is influenced not only by historical prices but also by news and opinions from the market itself.
Given news or opinion data annotated with sentiment scores from financial experts (scaled from -1 to 1), the task is to predict the sentiment score of newly unseen data.
Neural networks and Natural Language Processing (NLP) are used to preprocess the given data and train models that predict the sentiment of newly unseen data.
Consists of microblog messages focusing on stock market events and assessments from investors and traders, exchanged via the StockTwits microblogging platform.
Some stock market discussion also takes place on the Twitter platform. In order to extend and diversify our data sources, we also extract Twitter posts containing company stock symbols (cashtags).
Data is collected by scraping through the official website API. The scraping syntax is provided by SemEval.
You can see the scraped .json data for Stocktwits and for Twitter.
The data provided by the official website are not yet clean, so we have to do some feature engineering and feature selection before jumping into the modeling process.
The Stocktwits and Twitter data are combined into one dataframe before the data cleaning process.
The tweet cleaning procedure includes the following steps (a minimal sketch follows the list):
- Remove HTML encodings: '&amp;', '&quot;', etc.
- Remove mention tags: '@userid1'
- Remove the Retweet symbol: 'RT'
- Convert all URLs into '_url'
- Convert abbreviations and slang words into their basic forms using a dictionary
- Convert elongated words into their basic forms: 'Winnn' becomes 'Win'
- Convert ordinal words into ordinal numbers: 'first' is converted to '1st'
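A minimal sketch of these cleaning steps in Python, assuming a toy slang dictionary; the repo's actual dictionary and the ordinal-word conversion are omitted here:

```python
import re
import html

# Toy slang dictionary; the repo's actual dictionary is much larger.
SLANG_DICT = {"b4": "before", "imo": "in my opinion"}

def clean_tweet(text):
    text = html.unescape(text)                    # decode HTML encodings: '&amp;' -> '&'
    text = re.sub(r"@\w+", "", text)              # remove mention tags: '@userid1'
    text = re.sub(r"\bRT\b", "", text)            # remove the Retweet symbol
    text = re.sub(r"https?://\S+", "_url", text)  # convert URLs into '_url'
    # expand abbreviations and slang words via the dictionary
    text = " ".join(SLANG_DICT.get(w.lower(), w) for w in text.split())
    # squash elongated words: 'Winnn' -> 'Win'
    text = re.sub(r"(\w)\1{2,}", r"\1", text)
    return " ".join(text.split())

print(clean_tweet("RT @trader1 Winnn b4 the open https://t.co/x &amp; more"))
# -> 'Win before the open _url & more'
```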
After the tweet cleaning is done, a total of 61 variables are created from the cleaned tweet text and the tweet metadata.
You can find the Python script here
For Windows users: run the Python script in the Anaconda prompt
After the feature engineering, we have to do some analysis to select which features will be used in the modeling process. The analyses are as follows (a minimal sketch follows the list):
- Missing Value Analysis
- Constant Variables Analysis
- Duplicated Variables Analysis
- Correlated Variables Analysis
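A minimal sketch of these four analyses with pandas; the threshold and exact drop rules are illustrative assumptions, not necessarily the repo's actual criteria:

```python
import pandas as pd

def select_features(features: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    # 1. Missing value analysis: drop columns that are entirely missing
    features = features.dropna(axis=1, how="all")
    # 2. Constant variables: drop columns with a single unique value
    features = features.loc[:, features.nunique() > 1]
    # 3. Duplicated variables: drop columns identical to an earlier column
    features = features.T.drop_duplicates().T
    # 4. Correlated variables: drop one of each highly correlated pair
    #    (assumes all remaining features are numeric)
    corr = features.corr().abs()
    to_drop = [col for i, col in enumerate(corr.columns)
               if (corr.iloc[:i][col] > corr_threshold).any()]
    return features.drop(columns=to_drop)
```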
So, from the 61 variables created in the feature engineering process, we have to remove 6 variables.
You can find the Notebook here
Now we have 55 variables extracted from the texts, which will be used in the modeling process.
The full data is separated into three parts: train, validation, and test data. Each of the MLP Feature Driven model, the MLP Simple Word Embedding model, the CNN, and the LSTM is trained on the train data using 10-fold cross-validation. These four trained models are then used to predict both the validation and the test data, so there are 4 prediction results for each. Next, two new tables with five columns are created, one for the validation data and one for the test data; the columns consist of the four predicted results from the four trained models along with the actual sentiment score. The table created from the validation data is exploited as the train data for the Stacked Ensemble model, and the table created from the test data is exploited as its validation data.
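A minimal sketch of how these stacking tables could be assembled, assuming the four trained base models and their per-model input representations are available (all names here are illustrative):

```python
import pandas as pd

def build_stack_table(models, inputs, y_true):
    """One prediction column per base model, plus the actual sentiment score.

    `models` maps a name to a trained Keras model; `inputs` maps the same
    name to that model's input representation of the data.
    """
    table = {name: model.predict(inputs[name]).ravel()
             for name, model in models.items()}
    table["sentiment"] = y_true
    return pd.DataFrame(table)

# Table built from validation data -> train data for the Stacked Ensemble;
# table built from test data       -> validation data for the Stacked Ensemble:
#   stack_train = build_stack_table(models, val_inputs, y_val)
#   stack_valid = build_stack_table(models, test_inputs, y_test)
```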
The metric used to assess the models' performance is cosine similarity.
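Concretely, the cosine similarity is computed between the vector of predicted sentiment scores and the vector of actual scores:

```python
import numpy as np

def cosine_similarity(y_true, y_pred):
    """Cosine of the angle between the actual and predicted score vectors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(y_true @ y_pred / (np.linalg.norm(y_true) * np.linalg.norm(y_pred)))

print(cosine_similarity([0.5, -0.2, 0.9], [0.4, -0.1, 0.8]))  # ~0.996
```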
There are 4 base models and 1 ensemble model that combines them:
- Multi Layer Perceptron Feature Driven
- Multi Layer Perceptron Simple Word Embedding
- Convolutional Neural Network
- Long Short-Term Memory Neural Network
- Multi Layer Perceptron Stacking Ensemble
This model aims to extract information from the 55 manually curated variables. The MLP has 3 hidden layers, with the tanh activation function in both the hidden layers and the output layer. The numbers of nodes in the first, second, and third hidden layers are 50, 30, and 15, respectively. Dropout is also used in this model.
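A minimal Keras sketch of this architecture; the dropout rate is an assumption since the text does not specify it:

```python
from tensorflow.keras import Sequential, layers

def build_mlp_feature_driven(n_features=55, dropout=0.2):
    return Sequential([
        layers.Input(shape=(n_features,)),   # the 55 curated variables
        layers.Dense(50, activation="tanh"),
        layers.Dropout(dropout),             # dropout rate is an assumption
        layers.Dense(30, activation="tanh"),
        layers.Dropout(dropout),
        layers.Dense(15, activation="tanh"),
        layers.Dense(1, activation="tanh"),  # sentiment score in [-1, 1]
    ])
```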
You can find the Notebook here
This model aims to extract information directly from the vector representation of the texts. The MLP has 3 hidden layers, with the tanh activation function in the output layer and ReLU in the hidden layers. Each of the three hidden layers has 30 nodes. Dropout is also used in this model.
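A minimal Keras sketch of this architecture; the embedding dimension and dropout rate are assumptions:

```python
from tensorflow.keras import Sequential, layers

def build_mlp_word_embedding(embed_dim=100, dropout=0.2):
    return Sequential([
        layers.Input(shape=(embed_dim,)),    # e.g. an averaged word-embedding vector
        layers.Dense(30, activation="relu"),
        layers.Dropout(dropout),             # dropout rate is an assumption
        layers.Dense(30, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(30, activation="relu"),
        layers.Dense(1, activation="tanh"),  # sentiment score in [-1, 1]
    ])
```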
You can find the Notebook here
The CNN aims to capture the local behaviour within the texts by sliding over 1, 2, 3, and 4 words at a time. Gaussian noise with 0.01 variance is added to the input data. There are 25 filters for each window size, so in total there are 100 filters, each producing a 1-dimensional vector output. Max-pooling is applied to the output of each filter, resulting in 100 scalars in total. All of these scalars are then concatenated into one 100-dimensional vector.
The output of this CNN is then fed into an MLP with 2 hidden layers, with the tanh activation function in both the hidden layers and the output layer. Each of the two hidden layers has 15 nodes. Dropout is also used in this MLP.
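A minimal Keras sketch of this CNN; the sequence length, embedding dimension, convolution activation, and dropout rate are assumptions. Note that Keras' GaussianNoise layer takes a standard deviation, so a variance of 0.01 corresponds to a stddev of 0.1:

```python
from tensorflow.keras import Model, layers

def build_cnn(seq_len=30, embed_dim=100, dropout=0.2):
    inp = layers.Input(shape=(seq_len, embed_dim))
    noisy = layers.GaussianNoise(0.1)(inp)  # stddev 0.1 <=> variance 0.01
    pooled = []
    for window in (1, 2, 3, 4):             # slide over 1-4 words at a time
        conv = layers.Conv1D(25, window, activation="relu")(noisy)  # 25 filters per window
        pooled.append(layers.GlobalMaxPooling1D()(conv))            # one scalar per filter
    x = layers.Concatenate()(pooled)        # 4 x 25 = one 100-dimensional vector
    x = layers.Dense(15, activation="tanh")(x)
    x = layers.Dropout(dropout)(x)          # dropout rate is an assumption
    x = layers.Dense(15, activation="tanh")(x)
    out = layers.Dense(1, activation="tanh")(x)
    return Model(inp, out)
```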
You can find the Notebook here
The LSTM aims to capture the global behaviour within the texts. This model uses 2 LSTM layers integrated with an MLP with 2 hidden layers, with the tanh activation function in the output layer and ReLU in the hidden layers. The numbers of nodes in the first and second hidden layers are 50 and 10, respectively. Dropout is also used in this model.
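A minimal Keras sketch of this model; the sequence length, embedding dimension, LSTM unit counts, and dropout rate are assumptions:

```python
from tensorflow.keras import Sequential, layers

def build_lstm(seq_len=30, embed_dim=100, units=64, dropout=0.2):
    return Sequential([
        layers.Input(shape=(seq_len, embed_dim)),
        layers.LSTM(units, return_sequences=True),  # first LSTM layer feeds the second
        layers.LSTM(units),
        layers.Dense(50, activation="relu"),
        layers.Dropout(dropout),                    # dropout rate is an assumption
        layers.Dense(10, activation="relu"),
        layers.Dense(1, activation="tanh"),         # sentiment score in [-1, 1]
    ])
```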
You can find the Notebook here
The classical ensemble method averages the output of each model created. However, this approach gives the same weight to every model.
The MLP Stacking Ensemble can find the optimum weight for each model created. The MLP has 1 hidden layer with 4 nodes. The activation function in the hidden layer is ReLU, while in the output layer it is tanh.
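A minimal Keras sketch of this ensemble; it takes the four base-model predictions as input and learns how to weight them:

```python
from tensorflow.keras import Sequential, layers

def build_stacking_mlp():
    return Sequential([
        layers.Input(shape=(4,)),            # one input per base-model prediction
        layers.Dense(4, activation="relu"),  # single hidden layer with 4 nodes
        layers.Dense(1, activation="tanh"),  # sentiment score in [-1, 1]
    ])
```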
You can find the Notebook here
The cosine similarity of the final model is 0.662.
You can download the trained models in Keras .h5 format:
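A downloaded model can be loaded back with Keras (the filename here is illustrative):

```python
from tensorflow.keras.models import load_model

model = load_model("SENN_model.h5")  # illustrative filename
model.summary()
```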
Here are the links to download the supporting files used in this project, which are not uploaded to the repository due to their large size.
If you find the paper and the code helpful, please cite us.
@INPROCEEDINGS{Owen2008:SENN,
AUTHOR="Louis Owen and Finny Oktariani",
TITLE="{SENN:} Stock Ensemble-based Neural Network for Stock Market Prediction
using Historical Stock Data and Sentiment Analysis",
BOOKTITLE="2020 International Conference on Data Science and Its Applications (ICoDSA)
(ICoDSA 2020)",
DAYS=4,
MONTH=aug,
YEAR=2020
}
The underlying code of this project is licensed under the MIT license.