Sameer Agarwal edited this page Aug 18, 2013 · 4 revisions

BlinkDB supports a subset of SQL nearly identical to that implemented by Hive/Shark. This guide assumes some familiarity with Hive and Shark and focuses on the extra functionality that BlinkDB adds. Readers who need a refresher can refer to the Hive Documentation and the Shark Documentation.

BlinkDB Query Language

BlinkDB is backwards compatible with Shark and runs Shark queries unmodified. In addition, BlinkDB introduces a SAMPLEWITH operator that takes a sampling ratio as its argument and returns a random sample of the original dataset. Samples can be created offline with the CTAS (CREATE TABLE AS SELECT) operator. For instance, assume the warehouse contains a table called logs. A 1% random sample of logs can be created as:

$ CREATE TABLE logs_sample AS SELECT * FROM logs SAMPLEWITH 0.01   -- logs_sample is an illustrative name for the new table

NOTE: Samples can also be created on any materialized view using the SAMPLEWITH operator.
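The sampling step can also be scripted. The following is a minimal sketch, not part of BlinkDB itself: the helper name sample_query and the "<table>_sample" naming scheme are assumptions made for illustration. It only builds the CTAS statement and hands it to the CLI's -e flag when the executable is present in the current directory.

```shell
#!/bin/sh
# Build a CTAS statement that samples a table with SAMPLEWITH.
# sample_query and the "<table>_sample" naming scheme are hypothetical,
# chosen only for this example.
sample_query() {
    table="$1"   # source table name
    ratio="$2"   # sampling ratio, e.g. 0.01 for 1%
    printf 'CREATE TABLE %s_sample AS SELECT * FROM %s SAMPLEWITH %s' \
        "$table" "$table" "$ratio"
}

query="$(sample_query logs 0.01)"
echo "$query"

# Run the statement only if the BlinkDB CLI is available here.
if [ -x ./bin/blinkdb ]; then
    ./bin/blinkdb -e "$query"
fi
```

Keeping the statement in a variable makes it easy to inspect before it is executed against the warehouse.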

BlinkDB CLI

The easiest way to run BlinkDB is to start a BlinkDB Command Line Client (CLI) and begin executing queries. The BlinkDB CLI connects directly to the Hive Metastore, so it is compatible with existing Hive deployments. BlinkDB executables are available in the bin/ directory. To start the BlinkDB CLI, simply run:

$ ./bin/blinkdb                            # Start CLI for interactive session
$ ./bin/blinkdb -e "SELECT * FROM foo"     # Run a specific query and exit
$ ./bin/blinkdb -i queries.hql             # Run queries from a file
$ ./bin/blinkdb -H                         # Start CLI and print help

You can enter queries into the CLI directly, or use a flag to pass it a file. The BlinkDB CLI only works correctly if the HIVE_HOME environment variable is set (see Configuration). Two alternative versions of the CLI print more information: bin/blinkdb-withinfo and bin/blinkdb-withdebug.
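Because the CLI depends on HIVE_HOME, a wrapper script can check the variable before launching. This is a minimal sketch under stated assumptions: the function name check_hive_home is hypothetical, and the check only confirms that the variable is set and points to a directory, not that the directory contains the patched Hive jars.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed only when HIVE_HOME is set
# and refers to an existing directory.
check_hive_home() {
    if [ -z "${HIVE_HOME:-}" ]; then
        echo "HIVE_HOME is not set" >&2
        return 1
    fi
    if [ ! -d "$HIVE_HOME" ]; then
        echo "HIVE_HOME is not a directory: $HIVE_HOME" >&2
        return 1
    fi
    return 0
}

# Launch the CLI only when the check passes and the executable exists.
if check_hive_home && [ -x ./bin/blinkdb ]; then
    ./bin/blinkdb -e "SELECT * FROM foo"
fi
```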

Configuration Options

Configuration variables are environment variables that must be set for the BlinkDB driver and slaves to run correctly. They are specified in conf/blinkdb-env.sh. A few of the more important ones are described here:

HIVE_HOME     # Path to directory containing patched Hive jars
HIVE_CONF_DIR # Optional, a different path containing Hive configuration files
SPARK_MEM     # How much memory to allocate for slaves (e.g., '1500m', '5g')
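As an illustration, a conf/blinkdb-env.sh that sets these three variables might look like the sketch below. The paths and the memory size are placeholders, not defaults shipped with BlinkDB, and should be replaced with values that match the local installation.

```shell
# conf/blinkdb-env.sh (sketch; every value below is a placeholder)

# Directory containing the patched Hive jars (hypothetical path).
export HIVE_HOME="/opt/hive"

# Optional: separate directory holding Hive configuration files
# (hypothetical path).
export HIVE_CONF_DIR="/etc/hive/conf"

# Memory to allocate for each slave, using the size format shown
# above (e.g., '1500m', '5g').
export SPARK_MEM="5g"
```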