Sentiment Analysis in Stock Market using Twitter and Stocktwits Data with CNN, LSTM, MLP, NLP and Stacking Ensemble
This repo contains part of the code implementation for the paper accepted at the 2020 International Conference on Data Science and Its Applications (ICoDSA), entitled "SENN: Stock Ensemble-based Neural Network for Stock Market Prediction using Historical Stock Data and Sentiment Analysis", written by Louis Owen and Finny Oktariani.
tensorflow, stanfordnlp, bs4, contractions, inflect, nltk, textblob, string, pandas, numpy, gensim, sklearn, matplotlib, seaborn
The behavior of stock market prices is influenced not only by historical prices but also by news and opinions from the market itself.
Given news or opinion data annotated with sentiment scores from financial experts (scaled from -1 to 1), the task is to predict the sentiment score of newly unseen data.
Neural networks and Natural Language Processing (NLP) are used to preprocess the given data and train models that predict the sentiment of newly unseen data.
Consists of microblog messages focusing on stock market events and assessments from investors and traders, exchanged via the StockTwits microblogging platform.
Some stock market discussion also takes place on the Twitter platform. In order to extend and diversify our data sources, we also extract Twitter posts containing company stock symbols (cashtags).
Data is collected by scraping through the official website API. The scraping syntax is provided by SemEval.
You can see the scraped .json data for Stocktwits and for Twitter.
The data provided by the official website are not yet clean, so we have to do some feature engineering and feature selection before jumping into the modeling process.
The Stocktwits and Twitter data are combined into one dataframe before the data cleaning process.
The tweet cleaning procedure includes the following steps (a minimal sketch follows the list):
- Remove HTML encodings: '&amp;', '&quot;', etc.
- Remove mention tags: '@userid1'
- Remove the Retweet symbol: 'RT'
- Convert all URLs into '_url'
- Convert abbreviations and slang words into their basic forms using a dictionary
- Convert elongated words into their basic forms: 'Winnn' becomes 'Win'
- Convert ordinal words into ordinal numbers: 'first' is converted to '1st'
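A minimal sketch of these cleaning steps in Python, assuming a toy slang dictionary; the repo's actual dictionary and the ordinal-word conversion are omitted here:

```python
import re
import html

# Toy slang dictionary; the repo's actual dictionary is much larger.
SLANG_DICT = {"b4": "before", "imo": "in my opinion"}

def clean_tweet(text):
    text = html.unescape(text)                    # decode HTML encodings: '&amp;' -> '&'
    text = re.sub(r"@\w+", "", text)              # remove mention tags: '@userid1'
    text = re.sub(r"\bRT\b", "", text)            # remove the Retweet symbol
    text = re.sub(r"https?://\S+", "_url", text)  # convert URLs into '_url'
    # expand abbreviations and slang words via the dictionary
    text = " ".join(SLANG_DICT.get(w.lower(), w) for w in text.split())
    # squash elongated words: 'Winnn' -> 'Win'
    text = re.sub(r"(\w)\1{2,}", r"\1", text)
    return " ".join(text.split())

print(clean_tweet("RT @trader1 Winnn b4 the open https://t.co/x &amp; more"))
# -> 'Win before the open _url & more'
```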
After the tweet cleaning is done, a total of 61 variables are created from the cleaned tweet text and the tweet metadata.
You can find the Python script here
For Windows users: run the Python script in the Anaconda prompt
After the feature engineering, we have to do some analysis to select which features will be used in the modeling process. The analyses are as follows (a minimal sketch follows the list):
- Missing Value Analysis
- Constant Variables Analysis
- Duplicated Variables Analysis
- Correlated Variables Analysis
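A minimal sketch of these four analyses with pandas; the threshold and exact drop rules are illustrative assumptions, not necessarily the repo's actual criteria:

```python
import pandas as pd

def select_features(features: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    # 1. Missing value analysis: drop columns that are entirely missing
    features = features.dropna(axis=1, how="all")
    # 2. Constant variables: drop columns with a single unique value
    features = features.loc[:, features.nunique() > 1]
    # 3. Duplicated variables: drop columns identical to an earlier column
    features = features.T.drop_duplicates().T
    # 4. Correlated variables: drop one of each highly correlated pair
    #    (assumes all remaining features are numeric)
    corr = features.corr().abs()
    to_drop = [col for i, col in enumerate(corr.columns)
               if (corr.iloc[:i][col] > corr_threshold).any()]
    return features.drop(columns=to_drop)
```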
So, from the 61 variables created in the feature engineering process, we have to remove 6 variables.
You can find the Notebook here
Now we have 55 variables extracted from the texts, which will be used in the modeling process.
The full data is separated into three parts: train, validation, and test data. Each of the MLP Feature Driven model, the MLP Simple Word Embedding model, the CNN, and the LSTM is trained on the train data using 10-fold cross-validation. These four trained models are then used to predict both the validation and the test data, so there are 4 prediction results for each. Next, two new tables with five columns are created, one for the validation data and one for the test data; the columns consist of the four predicted results from the four trained models along with the actual sentiment score. The table created from the validation data is exploited as the train data for the Stacked Ensemble model, and the table created from the test data is exploited as its validation data.
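A minimal sketch of how these stacking tables could be assembled, assuming the four trained base models and their per-model input representations are available (all names here are illustrative):

```python
import pandas as pd

def build_stack_table(models, inputs, y_true):
    """One prediction column per base model, plus the actual sentiment score.

    `models` maps a name to a trained Keras model; `inputs` maps the same
    name to that model's input representation of the data.
    """
    table = {name: model.predict(inputs[name]).ravel()
             for name, model in models.items()}
    table["sentiment"] = y_true
    return pd.DataFrame(table)

# Table built from validation data -> train data for the Stacked Ensemble;
# table built from test data       -> validation data for the Stacked Ensemble:
#   stack_train = build_stack_table(models, val_inputs, y_val)
#   stack_valid = build_stack_table(models, test_inputs, y_test)
```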
The metric used to assess the models' performance is cosine similarity.
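Concretely, the cosine similarity is computed between the vector of predicted sentiment scores and the vector of actual scores:

```python
import numpy as np

def cosine_similarity(y_true, y_pred):
    """Cosine of the angle between the actual and predicted score vectors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(y_true @ y_pred / (np.linalg.norm(y_true) * np.linalg.norm(y_pred)))

print(cosine_similarity([0.5, -0.2, 0.9], [0.4, -0.1, 0.8]))  # ~0.996
```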
There are 4 base models and 1 ensemble model that combines them:
- Multi Layer Perceptron Feature Driven
- Multi Layer Perceptron Simple Word Embedding
- Convolutional Neural Network
- Long Short-Term Memory Neural Network
- Multi Layer Perceptron Stacking Ensemble
This model aims to extract information from the 55 manually curated variables. The MLP has 3 hidden layers, with the tanh activation function in both the hidden layers and the output layer. The numbers of nodes in the first, second, and third hidden layers are 50, 30, and 15, respectively. Dropout is also used in this model.
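A minimal Keras sketch of this architecture; the dropout rate is an assumption since the text does not specify it:

```python
from tensorflow.keras import Sequential, layers

def build_mlp_feature_driven(n_features=55, dropout=0.2):
    return Sequential([
        layers.Input(shape=(n_features,)),   # the 55 curated variables
        layers.Dense(50, activation="tanh"),
        layers.Dropout(dropout),             # dropout rate is an assumption
        layers.Dense(30, activation="tanh"),
        layers.Dropout(dropout),
        layers.Dense(15, activation="tanh"),
        layers.Dense(1, activation="tanh"),  # sentiment score in [-1, 1]
    ])
```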
You can find the Notebook here
This model aims to extract information directly from the vector representation of the texts. The MLP has 3 hidden layers, with the tanh activation function in the output layer and ReLU in the hidden layers. Each of the three hidden layers has 30 nodes. Dropout is also used in this model.
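A minimal Keras sketch of this architecture; the embedding dimension and dropout rate are assumptions:

```python
from tensorflow.keras import Sequential, layers

def build_mlp_word_embedding(embed_dim=100, dropout=0.2):
    return Sequential([
        layers.Input(shape=(embed_dim,)),    # e.g. an averaged word-embedding vector
        layers.Dense(30, activation="relu"),
        layers.Dropout(dropout),             # dropout rate is an assumption
        layers.Dense(30, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(30, activation="relu"),
        layers.Dense(1, activation="tanh"),  # sentiment score in [-1, 1]
    ])
```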
You can find the Notebook here
The CNN aims to capture the local behaviour within the texts by sliding over 1, 2, 3, and 4 words at a time. Gaussian noise with 0.01 variance is added to the input data. There are 25 filters for each window size, so in total there are 100 filters, each producing a 1-dimensional vector output. Max-pooling is applied to the output of each filter, resulting in 100 scalars in total. All of these scalars are then concatenated into one 100-dimensional vector.
The output of this CNN is then fed into an MLP with 2 hidden layers, with the tanh activation function in both the hidden layers and the output layer. Each of the two hidden layers has 15 nodes. Dropout is also used in this MLP.
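A minimal Keras sketch of this CNN; the sequence length, embedding dimension, convolution activation, and dropout rate are assumptions. Note that Keras' GaussianNoise layer takes a standard deviation, so a variance of 0.01 corresponds to a stddev of 0.1:

```python
from tensorflow.keras import Model, layers

def build_cnn(seq_len=30, embed_dim=100, dropout=0.2):
    inp = layers.Input(shape=(seq_len, embed_dim))
    noisy = layers.GaussianNoise(0.1)(inp)  # stddev 0.1 <=> variance 0.01
    pooled = []
    for window in (1, 2, 3, 4):             # slide over 1-4 words at a time
        conv = layers.Conv1D(25, window, activation="relu")(noisy)  # 25 filters per window
        pooled.append(layers.GlobalMaxPooling1D()(conv))            # one scalar per filter
    x = layers.Concatenate()(pooled)        # 4 x 25 = one 100-dimensional vector
    x = layers.Dense(15, activation="tanh")(x)
    x = layers.Dropout(dropout)(x)          # dropout rate is an assumption
    x = layers.Dense(15, activation="tanh")(x)
    out = layers.Dense(1, activation="tanh")(x)
    return Model(inp, out)
```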
You can find the Notebook here
The LSTM aims to capture the global behaviour within the texts. This model uses 2 LSTM layers integrated with an MLP with 2 hidden layers, with the tanh activation function in the output layer and ReLU in the hidden layers. The numbers of nodes in the first and second hidden layers are 50 and 10, respectively. Dropout is also used in this model.
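A minimal Keras sketch of this model; the sequence length, embedding dimension, LSTM unit counts, and dropout rate are assumptions:

```python
from tensorflow.keras import Sequential, layers

def build_lstm(seq_len=30, embed_dim=100, units=64, dropout=0.2):
    return Sequential([
        layers.Input(shape=(seq_len, embed_dim)),
        layers.LSTM(units, return_sequences=True),  # first LSTM layer feeds the second
        layers.LSTM(units),
        layers.Dense(50, activation="relu"),
        layers.Dropout(dropout),                    # dropout rate is an assumption
        layers.Dense(10, activation="relu"),
        layers.Dense(1, activation="tanh"),         # sentiment score in [-1, 1]
    ])
```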
You can find the Notebook here
The classical ensemble method averages the output of each model created. However, this approach gives the same weight to every model.
The MLP Stacking Ensemble can find the optimum weight for each model created. The MLP has 1 hidden layer with 4 nodes. The activation function in the hidden layer is ReLU, while in the output layer it is tanh.
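A minimal Keras sketch of this ensemble; it takes the four base-model predictions as input and learns how to weight them:

```python
from tensorflow.keras import Sequential, layers

def build_stacking_mlp():
    return Sequential([
        layers.Input(shape=(4,)),            # one input per base-model prediction
        layers.Dense(4, activation="relu"),  # single hidden layer with 4 nodes
        layers.Dense(1, activation="tanh"),  # sentiment score in [-1, 1]
    ])
```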
You can find the Notebook here
The cosine similarity of the final model is 0.662.
You can download the trained models in Keras .h5 format:
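A downloaded model can be loaded back with Keras (the filename here is illustrative):

```python
from tensorflow.keras.models import load_model

model = load_model("SENN_model.h5")  # illustrative filename
model.summary()
```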
Here are the links to download the supporting files used in this project, which are not uploaded to the repository due to their large size.
If you find the paper and the code helpful, please cite us.
@INPROCEEDINGS{Owen2008:SENN,
AUTHOR="Louis Owen and Finny Oktariani",
TITLE="{SENN:} Stock Ensemble-based Neural Network for Stock Market Prediction
using Historical Stock Data and Sentiment Analysis",
BOOKTITLE="2020 International Conference on Data Science and Its Applications (ICoDSA)
(ICoDSA 2020)",
DAYS=4,
MONTH=aug,
YEAR=2020
}
The underlying code of this project is licensed under the MIT license.