- Apache Kafka as the message queue
- Apache Spark as the backend data-analysis tool
- Redis as an in-memory cache to improve real-time performance
- Flask as the web-application framework
Our App requires proper installation of Kafka, Spark, and Redis.
To install Kafka, first check your Scala version. Run:
$ spark-shell
Then open localhost:4040 to check the Scala version. In our setup, the Scala version is 2.11.
Installation:
- Download kafka_2.11-0.10.2.0
- Download spark-streaming-kafka-0-8-assembly_2.11-2.1.0.jar and put it into Spark's jars directory.
- Note that the version of the .jar file must be consistent with the Scala version you use in Spark.
To install Redis, the easy way is:
$ brew install redis
You can also do a manual installation:
- Download Redis 3.2
- Make, test, and install:
$ cd redis-3.2.8
$ sudo make
$ sudo make test
$ sudo make install
- Revise the conf file:
$ cd redis-3.2.8
Find redis.conf, open it with vim, and revise the dir line to read:
dir /opt/redis/
- Make the directory in /opt/, move redis.conf to /etc, and you are done:
$ mkdir /opt/redis
$ mv redis.conf /etc
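After installation, you can verify that the server is reachable with a quick check. This is a minimal sketch, assuming the redis-py client (pip install redis) and the default port:

```python
import redis

# Quick connectivity check for the local Redis server.
r = redis.StrictRedis(host="localhost", port=6379)
print(r.ping())  # prints True if the server is running
```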
The app includes three real-time analyses:
- Real-time job plotting on Google Map
  - spark-job: JobPlotting.py
  - WebApp: WebApp_JobPlotting/
- Real-time job clustering by US state
  - spark-job: JobCluster.py
  - WebApp: WebApp_JobClusterAndTrend/
  - An improved version, JobClusterWithMerge.py, also merges new feature words from the stream data (see the toy sketch after this list). It uses the same WebApp files as the previous one.
- Real-time job trending in the US
  - spark-job: JobTrend.py
  - WebApp: WebApp_JobClusterAndTrend/
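To give a rough sense of the merging idea in JobClusterWithMerge.py, here is a toy sketch; the seed words, the counting scheme, and the threshold are assumptions for illustration, not the project's actual logic:

```python
from collections import Counter

# Toy sketch: promote words that recur in the stream into the feature-word
# set used for clustering. Not the actual JobClusterWithMerge.py algorithm.
feature_words = {"python", "java", "engineer"}  # illustrative seed features
candidate_counts = Counter()

def merge_features(tokens, min_count=5):
    """Add a token to feature_words once it has been seen min_count times."""
    for tok in tokens:
        if tok not in feature_words:
            candidate_counts[tok] += 1
            if candidate_counts[tok] >= min_count:
                feature_words.add(tok)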
To run the app:
- Start Kafka
- Start Redis
- Submit the Spark job
- Start streaming
- Start the WebApp
- Go to localhost and observe the app
Go to the directory where you installed the Kafka package and start ZooKeeper and the Kafka server. Here are the commands. Open your terminal,
$ bin/zookeeper-server-start.sh config/zookeeper.properties
Open another terminal,
$ bin/kafka-server-start.sh config/server.properties
Now you have started the Kafka server. Next you need to create a topic, such as "tweets", for queuing. Open a new terminal and type in,
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic tweets
Right now you have created a topic named tweets, and you can pull tweets into it:
$ python Streaming/tweetFetcher.py
To check the data you just pulled in, open a new terminal and type in,
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets
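For reference, here is a minimal sketch of what a fetcher like Streaming/tweetFetcher.py could look like, assuming tweepy for the Twitter stream and kafka-python as the producer; the credential placeholders and filter keywords are illustrative only, not the project's actual code:

```python
import tweepy
from kafka import KafkaProducer

# Fill in your own Twitter API credentials (placeholders).
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

producer = KafkaProducer(bootstrap_servers="localhost:9092")

class TweetListener(tweepy.StreamListener):
    def on_data(self, data):
        # Forward each raw tweet JSON string into the "tweets" Kafka topic.
        producer.send("tweets", data.encode("utf-8"))
        return True

    def on_error(self, status):
        # Returning False on HTTP 420 stops the stream when rate-limited.
        return status != 420

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
tweepy.Stream(auth, TweetListener()).filter(track=["hiring", "job opening"])
```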
To start Redis:
$ /usr/local/bin/redis-server /etc/redis.conf
To submit the Spark job (JobPlotting.py, JobCluster.py, or JobTrend.py):
$ spark-submit JobPlotting.py
The submitted Spark job now sends the processed tweet data to a Redis channel.
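To show the overall structure, here is a minimal sketch of such a streaming job, assuming the spark-streaming-kafka-0-8 receiver API and the redis-py client; the channel name "jobs" and the field selection are illustrative assumptions, not the actual JobPlotting.py:

```python
import json
import redis
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="JobPlottingSketch")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Receive the "tweets" topic through ZooKeeper (spark-streaming-kafka-0-8 API).
stream = KafkaUtils.createStream(ssc, "localhost:2181", "job-plotting", {"tweets": 1})

def publish(rdd):
    # Push each processed record to a Redis channel for the WebApp to consume.
    r = redis.StrictRedis(host="localhost", port=6379)
    for record in rdd.collect():
        r.publish("jobs", json.dumps(record))  # channel name is an assumption

(stream.map(lambda kv: json.loads(kv[1]))        # the Kafka value is the tweet JSON
       .filter(lambda t: t.get("coordinates"))   # keep geo-tagged tweets only
       .map(lambda t: {"text": t.get("text"),
                       "coordinates": t["coordinates"]})
       .foreachRDD(publish))

ssc.start()
ssc.awaitTermination()
```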
To start streaming:
$ python tweetFetcher.py
To start the WebApp:
$ python application.py (in WebApp_JobPlotting/ or WebApp_JobClusterAndTrend/)
Start the WebApp, and job posts on Twitter are plotted on Google Map in real time. Click on a marker to toggle the job information.
We visualize job clustering on an SVG map. Start the WebApp, and the jobs clustered within each state will be shown on the map. Hover over a state to toggle its details.
We visualize job trending on an SVG map. Start the WebApp, and the US map is shaded with different colors according to the job trend in each state.
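For orientation, here is a minimal sketch of how a WebApp like application.py could relay the Redis channel to the browser; Flask with server-sent events and the channel name "jobs" are assumptions for illustration, not the project's actual implementation:

```python
import redis
from flask import Flask, Response

app = Flask(__name__)
r = redis.StrictRedis(host="localhost", port=6379)

@app.route("/stream")
def stream():
    def events():
        pubsub = r.pubsub()
        pubsub.subscribe("jobs")  # channel name is an assumption
        for message in pubsub.listen():
            if message["type"] == "message":
                # Server-sent event; the page's JavaScript plots each job.
                yield "data: %s\n\n" % message["data"].decode("utf-8")
    return Response(events(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run()  # then visit localhost:5000
```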
Now we have finished the whole set-up work, and in the next few days we can divide into two parts. I hope @Yulong can polish the WebApp part, and we need to discuss more about what kinds of data we should present on the web. @Kaili and I can focus on the analysis. The short-term goal for this week's presentation is to display the live job postings on the geomap. Besides, to simplify the set-up work (i.e., starting Redis or Kafka) for Mac users, it is recommended to write a shell script to make things easy.