- Apache Kafka as the message queue
- Apache Spark as the backend data-analysis tool
- Redis as an in-memory cache to improve real-time performance
- Flask as the web-application framework
Our App requires proper installation of Kafka, Spark, and Redis.
To install Kafka, first check your Scala version. Run:
$ spark-shell
Then open localhost:4040 to check the Scala version. In our setup, the Scala version is 2.11.
Installation:
- Download kafka_2.11-0.10.2.0
- Download spark-streaming-kafka-0-8-assembly_2.11-2.1.0.jar and put it into Spark's jars directory.
- Note that the version of the .jar file must be consistent with the Scala version you use in Spark.
To install Redis, the easy way is:
$ brew install redis
You can also do a manual installation:
- Download Redis 3.2
- Make, test, and install:
$ cd redis-3.2.8
$ sudo make
$ sudo make test
$ sudo make install
- Revise the conf file:
$ cd redis-3.2.8
Find redis.conf, open it with vim, and revise the dir line to read:
dir /opt/redis/
- Make the directory in /opt/, move redis.conf to /etc, and you are done:
$ mkdir /opt/redis
$ mv redis.conf /etc
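After installation, you can verify that the server is reachable with a quick check. This is a minimal sketch, assuming the redis-py client (pip install redis) and the default port:

```python
import redis

# Quick connectivity check for the local Redis server.
r = redis.StrictRedis(host="localhost", port=6379)
print(r.ping())  # prints True if the server is running
```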
The app includes three real-time analyses:
- Real-time job plotting on Google Map
  - spark-job: JobPlotting.py
  - WebApp: WebApp_JobPlotting/
- Real-time job clustering by US state
  - spark-job: JobCluster.py
  - WebApp: WebApp_JobClusterAndTrend/
  - An improved version, JobClusterWithMerge.py, also merges new feature words from the stream data (see the toy sketch after this list). It uses the same WebApp files as the previous one.
- Real-time job trending in the US
  - spark-job: JobTrend.py
  - WebApp: WebApp_JobClusterAndTrend/
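To give a rough sense of the merging idea in JobClusterWithMerge.py, here is a toy sketch; the seed words, the counting scheme, and the threshold are assumptions for illustration, not the project's actual logic:

```python
from collections import Counter

# Toy sketch: promote words that recur in the stream into the feature-word
# set used for clustering. Not the actual JobClusterWithMerge.py algorithm.
feature_words = {"python", "java", "engineer"}  # illustrative seed features
candidate_counts = Counter()

def merge_features(tokens, min_count=5):
    """Add a token to feature_words once it has been seen min_count times."""
    for tok in tokens:
        if tok not in feature_words:
            candidate_counts[tok] += 1
            if candidate_counts[tok] >= min_count:
                feature_words.add(tok)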
To run the app:
- Start Kafka
- Start Redis
- Submit the Spark job
- Start streaming
- Start the WebApp
- Go to localhost and observe the app
Go to the directory where you installed the Kafka package and start ZooKeeper and the Kafka server. Here are the commands. Open your terminal,
$ bin/zookeeper-server-start.sh config/zookeeper.properties
Open another terminal,
$ bin/kafka-server-start.sh config/server.properties
Now you have started the Kafka server. Next you need to create a topic, such as "tweets", for queuing. Open a new terminal and type in,
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic tweets
Right now you have created a topic named tweets, and you can pull tweets into it:
$ python Streaming/tweetFetcher.py
To check the data you just pulled in, open a new terminal and type in,
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets
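For reference, here is a minimal sketch of what a fetcher like Streaming/tweetFetcher.py could look like, assuming tweepy for the Twitter stream and kafka-python as the producer; the credential placeholders and filter keywords are illustrative only, not the project's actual code:

```python
import tweepy
from kafka import KafkaProducer

# Fill in your own Twitter API credentials (placeholders).
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

producer = KafkaProducer(bootstrap_servers="localhost:9092")

class TweetListener(tweepy.StreamListener):
    def on_data(self, data):
        # Forward each raw tweet JSON string into the "tweets" Kafka topic.
        producer.send("tweets", data.encode("utf-8"))
        return True

    def on_error(self, status):
        # Returning False on HTTP 420 stops the stream when rate-limited.
        return status != 420

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
tweepy.Stream(auth, TweetListener()).filter(track=["hiring", "job opening"])
```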
To start Redis:
$ /usr/local/bin/redis-server /etc/redis.conf
To submit the Spark job (JobPlotting.py, JobCluster.py, or JobTrend.py):
$ spark-submit JobPlotting.py
The submitted Spark job now sends the processed tweet data to a Redis channel.
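To show the overall structure, here is a minimal sketch of such a streaming job, assuming the spark-streaming-kafka-0-8 receiver API and the redis-py client; the channel name "jobs" and the field selection are illustrative assumptions, not the actual JobPlotting.py:

```python
import json
import redis
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="JobPlottingSketch")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Receive the "tweets" topic through ZooKeeper (spark-streaming-kafka-0-8 API).
stream = KafkaUtils.createStream(ssc, "localhost:2181", "job-plotting", {"tweets": 1})

def publish(rdd):
    # Push each processed record to a Redis channel for the WebApp to consume.
    r = redis.StrictRedis(host="localhost", port=6379)
    for record in rdd.collect():
        r.publish("jobs", json.dumps(record))  # channel name is an assumption

(stream.map(lambda kv: json.loads(kv[1]))        # the Kafka value is the tweet JSON
       .filter(lambda t: t.get("coordinates"))   # keep geo-tagged tweets only
       .map(lambda t: {"text": t.get("text"),
                       "coordinates": t["coordinates"]})
       .foreachRDD(publish))

ssc.start()
ssc.awaitTermination()
```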
To start streaming:
$ python tweetFetcher.py
To start the WebApp:
$ python application.py (in WebApp_JobPlotting/ or WebApp_JobClusterAndTrend/)
Start the WebApp, and job posts on Twitter are plotted on Google Map in real time. Click on a marker to toggle the job information.
We visualize job clustering on an SVG map. Start the WebApp, and the jobs clustered within each state will be shown on the map. Hover over a state to toggle its details.
We visualize job trending on an SVG map. Start the WebApp, and the US map is shaded with different colors according to the job trend in each state.
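For orientation, here is a minimal sketch of how a WebApp like application.py could relay the Redis channel to the browser; Flask with server-sent events and the channel name "jobs" are assumptions for illustration, not the project's actual implementation:

```python
import redis
from flask import Flask, Response

app = Flask(__name__)
r = redis.StrictRedis(host="localhost", port=6379)

@app.route("/stream")
def stream():
    def events():
        pubsub = r.pubsub()
        pubsub.subscribe("jobs")  # channel name is an assumption
        for message in pubsub.listen():
            if message["type"] == "message":
                # Server-sent event; the page's JavaScript plots each job.
                yield "data: %s\n\n" % message["data"].decode("utf-8")
    return Response(events(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run()  # then visit localhost:5000
```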
Now we have finished the whole set-up work, and in the next few days we can divide into two parts. I hope @Yulong can polish the WebApp part, and we need to discuss more about what kinds of data we should present on the web. @Kaili and I can focus on the analysis. The short-term goal for this week's presentation is to display the live job postings on the geomap. Besides, to simplify the set-up work (i.e., starting Redis or Kafka) for Mac users, it is recommended to write a shell script to make things easy.