A Scala-based program that reads both Twitter batch data and streaming data and runs sentiment analysis on them.
Almost all businesses today run social media accounts, hoping to win love and popularity among users. Yet an interesting question has been posed: do businesses actually benefit from running social media accounts, or do they do nothing but damage themselves? This program looks into businesses' Twitter accounts and runs sentiment analysis on each tweet and response generated by the businesses.
- sbt
- Apache Spark
- Spark SQL
- Spark Streaming
- Docker
- Twitter API v2
- Apache Parquet
- Subjectivity Lexicon (Link)
- Read Twitter batch data of a selected business account for the past 7 days.
- Load the batch data using Apache Spark and convert the data into Spark DataFrame.
- Transform the converted DataFrame so that only the tweet texts are fed to the sentiment analysis program.
- Run sentiment analysis on each tweet and response generated by businesses and return one of the following results: Positive, Negative, or Mixed.
- Read and process live Twitter stream data using Spark Structured Streaming in order to find the most popular topics of discussion on Twitter at a given moment.
- Read Twitter streaming data of a selected business account in real time and save every 10 new lines as a CSV file in Datalake1.
- Read newly generated CSV files in Datalake1 in real time using Spark Streaming, convert them into a DataFrame, extract only the tweet texts, and save them as a Parquet file in Datalake2.
- Load the Parquet files using Apache Spark and run sentiment analysis on each tweet and response generated in real time, returning one of the following results: Positive, Negative, or Mixed.
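The CSV-to-Parquet step above can be sketched with Spark Structured Streaming roughly as follows. The paths (`/tmp/datalake1`, `/tmp/datalake2`), the checkpoint location, and the CSV schema (in particular the `text` column) are assumptions for illustration, not the project's actual layout:

```scala
import org.apache.spark.sql.SparkSession

object TweetTextExtraction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TweetTextExtraction")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    // Watch Datalake1 for new CSV files produced by the stream reader.
    // Column names here are assumed; adjust to the files actually written.
    val raw = spark.readStream
      .option("header", "true")
      .schema("created_at STRING, id STRING, text STRING")
      .csv("/tmp/datalake1")

    // Keep only the tweet text and persist it as Parquet in Datalake2.
    val query = raw.select($"text")
      .writeStream
      .format("parquet")
      .option("path", "/tmp/datalake2")
      .option("checkpointLocation", "/tmp/checkpoints/tweet_text")
      .start()

    query.awaitTermination()
  }
}
```

Structured Streaming's file source picks up each newly arrived CSV file automatically, so this job can run continuously alongside the stream reader.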
In order to run this program properly, you will need to complete the following prerequisites:
- Be sure to create a Twitter API v2 key.
- Be sure to download Subjectivity Lexicon from the link above and upload it to the cluster where you will run your jar files.
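The Subjectivity Lexicon is a plain-text file of `key=value` fields per line. A minimal sketch of loading it into a word-to-polarity map is shown below; it assumes the MPQA clue format (`word1=...` and `priorpolarity=...` fields), which may differ from the exact file you download:

```scala
object LexiconLoader {
  // Parse one lexicon line into (word, polarity), e.g.
  //   type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
  // yields Some(("abandon", "negative")). Malformed lines yield None.
  def parseLine(line: String): Option[(String, String)] = {
    val fields = line.split("\\s+").flatMap { token =>
      token.split("=", 2) match {
        case Array(k, v) => Some(k -> v)
        case _           => None
      }
    }.toMap
    for {
      word     <- fields.get("word1")
      polarity <- fields.get("priorpolarity")
    } yield word -> polarity
  }

  // Load the whole lexicon file into a word -> polarity map.
  def load(path: String): Map[String, String] =
    scala.io.Source.fromFile(path).getLines().flatMap(parseLine).toMap
}
```

Broadcasting the resulting map to Spark executors keeps the per-tweet lookup cheap.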
If all of the prerequisites above are met, go ahead and clone this repo by using the command below:
git clone https://github.com/spark131008/Twitter_Account_Sentiment_Analysis_Program.git
In order to create a jar file of each program, use the command below:
sbt package
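For `sbt package` to produce Spark-compatible jars, the build must target Scala 2.12 (matching the `/target/scala-2.12` path below) and declare Spark as a provided dependency. A minimal `build.sbt` sketch is shown here; the exact Spark and Scala patch versions are assumptions:

```scala
// build.sbt -- minimal sketch; version numbers are assumptions
ThisBuild / scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  // "provided" keeps Spark out of the packaged jar, since the
  // cluster's spark-submit supplies it at runtime.
  "org.apache.spark" %% "spark-sql"       % "3.2.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "3.2.1" % "provided"
)
```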
Once all jar files are created, copy the files located within /target/scala-2.12 directory and paste them to JVM or a local cluster. If you are running your cluster in a Docker container, use the command below:
docker cp ./target/scala-2.12/<Name of the jar file>.jar spark-master:/<Name of the jar file>.jar
In order to run a jar file using Apache Spark in a Docker container, use the command below:
docker exec spark-master bash -c "./spark/bin/spark-submit --class <Name of the class> --master local[4] /<Name of the jar file>.jar"
If you want to run sentiment analysis on filtered Twitter streaming data, please submit the jar files with spark-submit in the order below.
1. docker exec spark-master bash -c "./spark/bin/spark-submit --class TwitterStreamingDataProcessing --master local[4] /filtered_twitter_stream.jar"
2. docker exec spark-master bash -c "./spark/bin/spark-submit --class SparkStreaming --master local[4] /spark_streaming.jar"
3. docker exec spark-master bash -c "./spark/bin/spark-submit --class SentimentAnalysis --master local[4] /sentiment_analysis.jar"
Sundoo, Chase, Trenton, Josh
https://docs.google.com/presentation/d/1vG7IgBXfc0gUOD-RylH3TJpXcLts6diCY8UIR92Tm0M/edit?usp=sharing