
Running BlinkDB on a Cluster


This wiki closely mirrors the Shark Wiki and describes how to get BlinkDB up and running on a cluster. If you are interested in using BlinkDB on Amazon EC2, see the page Running BlinkDB on EC2, which uses a set of EC2 scripts to launch a pre-configured cluster in a few minutes.

Dependencies

Running BlinkDB on a cluster requires the following external components:

  • Scala 2.9.3
  • Spark 0.7.2
  • A compatible Java runtime: OpenJDK 7, Oracle HotSpot JDK 7, or Oracle HotSpot JDK 6u23+ (a quick version check follows this list)
  • The BlinkDB-specific Hive installation (based on Hive 0.9), linked to the BlinkDB repository as a submodule
  • An HDFS cluster (setting one up is not covered in this guide)
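A quick way to confirm that a compatible JVM is installed (run this on every node):

$ java -version    # should report 1.6.0_23 or later, or any 1.7.x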

Scala

If you don't have Scala 2.9.3 installed on your system, you can download and unpack it as follows:

$ wget http://www.scala-lang.org/downloads/distrib/files/scala-2.9.3.tgz
$ tar xvfz scala-2.9.3.tgz
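After unpacking, you may also want to set SCALA_HOME and put Scala's bin directory on your PATH (the path below assumes you unpacked into the current directory):

$ export SCALA_HOME=$PWD/scala-2.9.3
$ export PATH=$SCALA_HOME/bin:$PATH
$ scala -version    # should report version 2.9.3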

Spark

We use Spark's standalone deploy mode to run BlinkDB on a cluster. See the Spark standalone mode documentation for more information.

Download Spark:

$ wget http://spark-project.org/files/spark-0.7.2-prebuilt-hadoop1.tgz          # Hadoop 1/CDH3 - or -
$ wget http://spark-project.org/files/spark-0.7.2-prebuilt-cdh4.tgz             # Hadoop 2/CDH4
$ tar xvfz spark-0.7.2-prebuilt*.tgz

Edit spark-0.7.2/conf/slaves to add the hostname of each slave, one per line.
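For example, a conf/slaves file for a two-worker cluster might look like this (the hostnames are placeholders):

slave1.example.com
slave2.example.com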

Edit spark-0.7.2/conf/spark-env.sh to set SCALA_HOME and SPARK_WORKER_MEMORY:

export SCALA_HOME=/path/to/scala-2.9.3
export SPARK_WORKER_MEMORY=16g

SPARK_WORKER_MEMORY is the maximum amount of memory that Spark can use on each node. Increasing this allows more data to be cached, but be sure to leave memory (e.g. 1 GB) for the OS and any other services that the node may be running.
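For instance, on a hypothetical worker with 16 GB of RAM you might reserve about 1 GB for the OS and give the rest to Spark (the numbers are illustrative):

export SCALA_HOME=/path/to/scala-2.9.3
export SPARK_WORKER_MEMORY=15g    # 16 GB total minus ~1 GB for the OS and other services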

BlinkDB

Get the latest version of BlinkDB.

$ git clone -b alpha-0.1.0 https://github.com/sameeragarwal/blinkdb.git

BlinkDB requires the patched development package of BlinkDB Hive, which is included in the BlinkDB repository as a submodule. Check out the submodule and package it:

$ cd blinkdb
$ git submodule init
$ git submodule update
$ cd hive_blinkdb
$ ant package
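If the build succeeds, the packaged Hive distribution should land under build/dist (the usual output location for Hive 0.9's ant build; verify against your tree):

$ ls build/dist/lib | grep hive    # the patched Hive jars should be listed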

Now edit blinkdb/conf/blinkdb-env.sh (based on blinkdb-env.sh.template) to set the HADOOP_HOME, HIVE_HOME, MASTER, SPARK_HOME, and SPARK_MEM environment variables:

export HADOOP_HOME=/path/to/hadoop
export HIVE_HOME=/path/to/hive_blinkdb
export MASTER=spark://<MASTER_IP>:7077
export SPARK_HOME=/path/to/spark
export SPARK_MEM=16g

source $SPARK_HOME/conf/spark-env.sh

The last line avoids setting SCALA_HOME in two places. Make sure SPARK_MEM is not larger than the SPARK_WORKER_MEMORY value set in the previous section.
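If conf/blinkdb-env.sh does not exist yet, create it from the bundled template before editing:

$ cd /path/to/blinkdb
$ cp conf/blinkdb-env.sh.template conf/blinkdb-env.sh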

Copy the Spark and BlinkDB directories to the slaves. We assume that the user on the master can SSH to the slaves. For example:

$ while read slave_host; do
>   rsync -Pav spark-0.7.2 blinkdb $slave_host:
> done < /path/to/spark/conf/slaves

(The trailing colon tells rsync to copy to the slave's home directory over SSH.)
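To spot-check that the copy worked, you can list the directories on the first slave (assuming the same passwordless SSH as above):

$ ssh $(head -n 1 /path/to/spark/conf/slaves) 'ls -d spark-0.7.2 blinkdb'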

Launch the cluster by running the Spark cluster launch scripts:

$ cd spark-0.7.2
$ ./bin/start-all.sh
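Once the scripts finish, the standalone master serves a web status page on port 8080 (the default in Spark 0.7.x) where each registered worker should be listed. A rough command-line check, assuming curl is installed and <MASTER_IP> is your master's address:

$ curl -s http://<MASTER_IP>:8080 | grep -ic worker    # non-zero once workers have registered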

Configuring with Hadoop2/CDH4

Newer versions of Hadoop require additional configuration options. You may need to set the following values in Hive's configuration file (hive-site.xml); a sketch of the corresponding XML follows the list:

  • fs.default.name: Should point to the URI of your HDFS namenode, e.g. hdfs://myNameNode:8020/
  • fs.defaultFS: Should be set to the same value as fs.default.name
  • mapred.job.tracker: Should list the host:port of your JobTracker, or be set to "NONE" if you are only using Spark. Note that this must be set explicitly even if you are not running a JobTracker.
  • mapreduce.framework.name: Should be set to a non-empty string, e.g. "NONE".
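A minimal sketch of the corresponding hive-site.xml entries, assuming an HDFS namenode at myNameNode:8020 and a Spark-only (no JobTracker) setup; hostnames and ports are placeholders:

<!-- inside the <configuration> element of hive-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://myNameNode:8020/</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://myNameNode:8020/</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>NONE</value> <!-- Spark-only; no JobTracker -->
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>NONE</value>
</property>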

Testing

You can now launch BlinkDB with the command

$ ./bin/blinkdb-withinfo

More detailed information on Spark's standalone-mode scripts and options is available in the Spark standalone documentation.

To verify that BlinkDB is running, you can try the following example, which loads some sample data into a table and then creates an in-memory copy of it (by convention, tables whose names end in _cached are cached in memory):

CREATE TABLE src(key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM src;
SELECT COUNT(1) FROM src_cached;