SampleClean Installation Instructions

Sanjay Krishnan edited this page May 11, 2014 · 8 revisions

Requirements:

  • Scala 2.10.4
  • Spark 0.9.1

SampleClean requires Scala 2.10.4, so first download and unpack it:

$ wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
$ tar xvzf scala-2.10.4.tgz

Then, download and unpack Spark 0.9.1:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1.tgz
$ tar xvzf spark-0.9.1.tgz

Get the patched BlinkDB interface to SampleClean:

$ git clone https://github.com/sjyk/blinkdb.git

Once the repository is cloned, run:

$ cd blinkdb
$ git submodule init
$ git submodule update
$ cd hive_blinkdb
$ git pull https://github.com/sjyk/hive_blinkdb.git
$ ant package

ant package builds all Hive jars and puts them into the build/dist directory. If the build fails on your local machine, you probably need a newer version of ant; ant >= 1.8.2 should work. You can download ant binaries at http://ant.apache.org/bindownload.cgi. You might also be able to upgrade ant using a package manager; however, on older versions of CentOS (e.g. 6.4), yum can't install ant 1.8 out of the box, so installing ant from the downloaded binary package is recommended.
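Before running the build, it can help to confirm that the installed ant meets the 1.8.2 minimum. The following is a minimal sketch, not part of SampleClean: `version_ge` and `extract_version` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Succeeds when version $1 is at least version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Extracts the first dotted version number from text such as
# "Apache Ant(TM) version 1.9.4 compiled on April 29 2014".
extract_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

if command -v ant >/dev/null 2>&1; then
    ant_version="$(extract_version "$(ant -version 2>&1)")"
    if version_ge "$ant_version" 1.8.2; then
        echo "ant $ant_version is recent enough"
    else
        echo "ant $ant_version is older than 1.8.2; consider upgrading"
    fi
else
    echo "ant was not found on PATH"
fi
```

If the check reports an older version, install the binary package from the download page above and put its bin/ directory ahead of the system ant on your PATH.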

The BlinkDB/SampleClean code is in the blinkdb/ directory. To set up your environment to run BlinkDB/SampleClean locally, set the HIVE_HOME and SCALA_HOME environment variables in the file blinkdb/conf/blinkdb-env.sh so that they point to the folders you just downloaded. BlinkDB comes with a template file, blinkdb-env.sh.template, that you can copy and modify to get started:

$ cd blinkdb/conf
$ cp blinkdb-env.sh.template blinkdb-env.sh

Edit blinkdb/conf/blinkdb-env.sh and set the following to run in local mode:

#!/usr/bin/env bash

export SHARK_MASTER_MEM=1g

export HIVE_DEV_HOME="/path/to/hive"
export HIVE_HOME="$HIVE_DEV_HOME/build/dist"

SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS

export SCALA_VERSION=2.10.4
export SCALA_HOME="/path/to/scala-2.10.4"
export SPARK_HOME="/path/to/spark"
export HADOOP_HOME="/path/to/hadoop-1.2.0"
export JAVA_HOME="/path/to/java-home-1.7_21-or-newer"
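A wrong path in this file tends to surface only later as a confusing build or startup error, so a quick check of the exported variables can save time. The sketch below is an illustration rather than part of BlinkDB; `check_dirs` is a hypothetical helper that reports every named variable that is unset or does not point to an existing directory.

```shell
# Reports each named environment variable that is unset or does not point
# to an existing directory. Returns non-zero if any check fails.
# Arguments must be valid shell variable names.
check_dirs() {
    failed=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            failed=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            failed=1
        fi
    done
    return "$failed"
}

# Assumed usage, after editing the file above:
#   . blinkdb/conf/blinkdb-env.sh
#   check_dirs HIVE_HOME SCALA_HOME SPARK_HOME HADOOP_HOME JAVA_HOME
```

The function prints nothing and returns success when every variable resolves to a directory.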

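Next, package and publish Spark and BlinkDB/SampleClean (here $SPARK_HOME is the unpacked Spark directory and $BLINKDB_HOME is the cloned blinkdb/ directory):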

$ cd $SPARK_HOME
$ sbt/sbt publish-local
$ cd $BLINKDB_HOME
$ sbt/sbt package

Next, create the default Hive warehouse directory. This is where Hive will store table data for native tables.

$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod 0777 /user/hive/warehouse  # Or make your username the owner

You can now start the BlinkDB/SampleClean CLI:

$ ./bin/blinkdb

Once started, BlinkDB/SampleClean presents a SampleClean prompt:

sampleclean>

We have provided an example dirty dataset of world cities with various text formatting issues and semantic errors. Create a table and load the data into it:

CREATE TABLE cities (city string, country string, population string, area string, density string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH 'data/files/world_population.csv' OVERWRITE INTO TABLE cities;

You can enable the SampleClean SQL extensions with:

sampleclean> set sampleclean.enabled = true;

To initialize a 10% "dirty sample", run:

sampleclean> SCINITIALIZE cities_sample (city, country, population, area, density) FROM cities SAMPLEWITH 0.1;

Then, set the following variables:

sampleclean> set sampleclean.sample.size=;
sampleclean> set sampleclean.dataset.size=52;
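If you load a different file, the dataset size has to change with it. The sketch below assumes that sampleclean.dataset.size is meant to be the number of rows in the full table, which this page does not state explicitly; `count_rows` is a hypothetical helper that counts the non-empty lines of a delimited text file.

```shell
# Counts the non-empty lines of a text file (one row per line).
count_rows() {
    grep -c -v '^[[:space:]]*$' "$1"
}

# Only runs when the example file is present in the current directory tree.
if [ -f data/files/world_population.csv ]; then
    count_rows data/files/world_population.csv
fi
```

The count may be off by one if the file starts with a header row, so compare it against the table contents before relying on it.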

To see what the data looks like, run:

sampleclean> scshow cities_sample;

To run a first query, try:

sampleclean> selectrawsc sum(population) from cities_sample;

This query will return NULL because there are string formatting issues in the population field.

You can fix these problems (for any attribute) with:

sampleclean> scformat cities_sample population number;

To review your changes, you can run:

sampleclean> scshow cities_sample;
sampleclean> selectrawsc sum(population) from cities_sample;

You will now get results with confidence intervals. There are many other data cleaning primitives to try out. For example, this removes all of the "cities" with "/" in their names:

sampleclean> scfilter cities_sample city not like '%/%';

After experimenting with the cleaning, you can reset the sample to its initial state with:

sampleclean> screset cities_sample;
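For intuition about the confidence intervals that selectrawsc reports, here is a sketch of a textbook large-sample estimate of a sum from a uniform random sample. It is an assumption for illustration and is not taken from SampleClean's source: the estimate scales the sample mean by the dataset size N, and the 95% half-width is 1.96 * N * s / sqrt(n), where s is the sample standard deviation and n the sample size. This is presumably why both the sample size and the dataset size are needed. `estimate_sum` is a hypothetical helper.

```shell
# Reads one numeric sample value per line on stdin and prints
# "<estimated sum> <95% half-width>" for a dataset of $1 rows.
# Uses the normal approximation; no finite-population correction.
estimate_sum() {
    awk -v N="$1" '
        NF { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n < 2) { print "need at least two sample values"; exit 1 }
            mean = sum / n
            var = (sumsq - n * mean * mean) / (n - 1)   # sample variance
            if (var < 0) var = 0                        # guard against rounding
            printf "%.2f %.2f\n", N * mean, 1.96 * N * sqrt(var / n)
        }'
}

# Three sampled values from a dataset of 30 rows.
printf '10\n20\n30\n' | estimate_sum 30
```

For this input the function prints 600.00 339.48, that is, an estimated sum of 600 plus or minus about 339. The wide interval reflects the very small sample.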