Use Twitter to make stock predictions.
Numerous companies are interested in using microblogging as a way to predict in real time how stock prices are going to move. At no other time in human history have investors had access to the real-time thoughts and voices of the masses. The theory is that if investors had a good sense of how customers felt about a particular company at any given moment, they would gain insight into how the "market" values that company, and thus into the price and volume of its stock.
Beyond analyzing tweets specifically, the broader idea of acting on real-time information is the future. The value of data diminishes as time passes; insights are perishable. But if you have the ability to combine old data with recent data, the result is even more valuable.
The goal of this project is twofold:
- Set up a production-ready analytics system on AWS.
- See if there is a relationship between stock sentiment and movement in the stock's price. This is the first part of the problem; once that relationship is established, the next step is to figure out how to make money from it.
The production system has the following features:
- Data sources turn on and off with the start and end of the market day.
- The Twitter data source filters on ticker symbols.
- Feature flags make it easy to toggle writing to Kinesis streams and writing to MongoDB (see the sketch after this list).
- Automatic MongoDB backups.
- Event and error logging.
- Tweet and stock-price count logging.
- A dedicated container for data analysis.
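As one way the feature flags could be wired up, here is a minimal sketch of flag-gated writes. The flag names, the use of environment variables, and the `persist` helper are assumptions for illustration, not the project's actual code.

```python
import os

# Illustrative only: environment variables are one common way to implement
# feature flags; the real flag names and mechanism are not shown in this write-up.
WRITE_TO_KINESIS = os.getenv("WRITE_TO_KINESIS", "true").lower() == "true"
WRITE_TO_MONGO = os.getenv("WRITE_TO_MONGO", "true").lower() == "true"

def persist(record, kinesis_writer, mongo_collection):
    """Send a record only to the sinks that are currently enabled."""
    if WRITE_TO_KINESIS:
        kinesis_writer(record)               # e.g. a thin wrapper around Boto3
    if WRITE_TO_MONGO:
        mongo_collection.insert_one(record)  # pymongo collection
```

Flipping either flag off lets the services keep running while skipping that sink, which is useful for controlling Kinesis transfer costs during development.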
The overall architecture is shown in this image:
EC2 hosts the dockerized production system, which pushes data to Kinesis via the Boto3 library and to a MongoDB database.
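A hedged sketch of that Kinesis write path is below. The stream name "market-data", the region, and the record shape are placeholders, not the project's actual configuration.

```python
import json

import boto3

# Assumed stream name and region, for illustration only.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def push_to_kinesis(record: dict) -> None:
    """Publish one record to Kinesis, partitioned by ticker symbol."""
    kinesis.put_record(
        StreamName="market-data",
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["symbol"],
    )

# Example usage with a quote-like record.
push_to_kinesis({"symbol": "AAPL", "price": 187.42, "source": "iex"})
```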
The EC2 services are shown in this image:
There are several services running:
- Stocks: This service is responsible for fetching live stock data from IEX.
- Twitter: This service is responsible for fetching tweets from the Twitter API.
- Manager: This service controls when fetching is turned on, based on whether the market is open.
- Redis: The mechanism the Stock and Twitter services use to pass information to and from the Manager service (see the sketch after this list).
- Log: This service logs events from the other services.
- Data Store: This service acts as a backup to sending data to Kinesis. It's a way to gather data quickly without paying the transfer costs of Kinesis.
- Analysis: This service analyzes the data from the Data Store. It is used to validate different machine learning models before deploying them to production.
- Backup: This service performs daily backups of the MongoDB database and uploads them to S3.
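To make the Manager/Redis hand-off concrete, here is a small sketch of one way the services could coordinate through Redis. The `market_open` key, the polling interval, and the function names are assumptions, not the project's actual interface.

```python
import time

import redis

# The fetcher containers reach Redis by its service hostname (assumed "redis").
r = redis.Redis(host="redis", port=6379, decode_responses=True)

# Manager side: called at market open and close.
def set_market_state(is_open: bool) -> None:
    r.set("market_open", "1" if is_open else "0")

# Stock / Twitter side: fetch only while the flag is set.
def fetch_loop(fetch_once):
    while True:
        if r.get("market_open") == "1":
            fetch_once()      # pull the next batch of quotes or tweets
        time.sleep(5)         # polling interval (arbitrary)
```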
The full system diagram is shown in this image:
Data is sent to a Spark instance via Spark Streaming, where aggregations are performed on it. A separate instance is used for developing the scikit-learn model; the model is retrained, serialized, and uploaded to S3. The deployed instance pulls the latest model from S3, makes predictions every 20 minutes, and logs those predictions to a database for later analysis.
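A rough sketch of that deployment loop follows. The bucket name, object key, pickle serialization, and helper functions are placeholders; only the overall shape (pull the latest model from S3, predict every 20 minutes, log the prediction) comes from the description above.

```python
import pickle
import time

import boto3

s3 = boto3.client("s3")

def load_latest_model(bucket="models-bucket", key="sentiment/latest.pkl"):
    """Download and deserialize the most recently uploaded model."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pickle.loads(obj["Body"].read())

def prediction_loop(build_features, log_prediction):
    while True:
        model = load_latest_model()          # always use the newest model
        features = build_features()          # aggregate recent tweet/price data
        prediction = model.predict([features])[0]
        log_prediction(prediction)           # persist for later analysis
        time.sleep(20 * 60)                  # predict every 20 minutes
```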