# Installation instructions

1. Download Spark, prebuilt for Hadoop 2.6.

2. Try to run an interactive shell.

   Scala API:

   ```
   $YOUR_SPARK_PATH/bin/spark-shell
   ```

   Python API:

   ```
   $YOUR_SPARK_PATH/bin/pyspark
   ```

   You may need to install Java or Python first.

3. Once it has loaded, you will see the Scala console:

   ```
   scala>
   ```

   Type `sc` and press Enter; you should see:

   ```
   scala> sc
   res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1f60824e
   ```
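
As a quick sanity check that the context actually works, you can run a small computation. The sketch below is for the pyspark shell (started with `bin/pyspark` above), where the context is likewise available as `sc`; the numbers are only an illustration.

```python
# In the pyspark shell, `sc` is already defined.
# Distribute the numbers 1..100 and sum them on the executors.
data = sc.parallelize(range(1, 101))
print(data.sum())  # expected output: 5050
```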

# Start Spark locally

1. Start Spark locally.

   1.1. Start the master:

   ```
   $YOUR_SPARK_PATH/sbin/start-master.sh
   ```

   1.2. After a couple of minutes, http://localhost:8080 will be available. This is the Spark web UI.

   1.3. Start workers (as many as you like, e.g. 2 or 3):

   ```
   $YOUR_SPARK_PATH/bin/spark-class org.apache.spark.deploy.worker.Worker $SPARK_MASTER_URL &
   ```

   where `$SPARK_MASTER_URL` is shown in the [web UI](http://localhost:8080) (typically of the form `spark://<hostname>:7077`).

2. Submit a test job:

   ```
   $YOUR_SPARK_PATH/bin/spark-submit --master $SPARK_MASTER_URL job.py $SPARK_MASTER_URL
   ```
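
`job.py` itself is not included in this section; below is a minimal sketch of what such a test job might look like, assuming it takes the master URL as its first command-line argument (matching the `spark-submit` line above). The app name and the sum-of-squares computation are just placeholders.

```python
import sys

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # The master URL is passed as the first argument by the spark-submit
    # line above, e.g. spark://<hostname>:7077 as shown in the web UI.
    master_url = sys.argv[1]

    conf = SparkConf().setAppName("test-job").setMaster(master_url)
    sc = SparkContext(conf=conf)

    # A trivial distributed computation: sum of squares of 1..1000.
    result = sc.parallelize(range(1, 1001)).map(lambda x: x * x).sum()
    print("sum of squares of 1..1000 =", result)

    sc.stop()
```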

# Exercises

#1 Learn about RDD

Let's go through the official tutorial
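
As a taste of what the tutorial covers, here is a small sketch of the core RDD operations in Python: transformations such as `flatMap` and `filter` are lazy, while actions such as `collect` and `count` trigger the actual computation. The sample lines are made up for illustration.

```python
# Runs in the pyspark shell, where `sc` is already defined.
lines = sc.parallelize([
    "spark makes distributed computation simple",
    "an rdd is a resilient distributed dataset",
])

# Transformations are lazy: nothing is computed yet.
words = lines.flatMap(lambda line: line.split())
long_words = words.filter(lambda w: len(w) > 4)

# Actions trigger execution on the workers.
print(long_words.collect())  # e.g. ['spark', 'makes', 'distributed', ...]
print(words.count())
```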

#2 More challenges

Here

#3 Set up a cluster on EC2

Here

# References

- Designing Data-Intensive Applications (Martin Kleppmann)
- Slides