BlinkDB User Guide
BlinkDB supports a subset of SQL nearly identical to that implemented by Hive/Shark. This guide assumes some familiarity with Hive and Shark and focuses on the extra functionality included in BlinkDB. Those who need a refresher can refer to the Hive Documentation and Shark Documentation.
BlinkDB is backwards compatible with Shark and supports all unmodified Shark queries. In addition, BlinkDB introduces a SAMPLEWITH operator that takes a sampling ratio as an argument and returns a statistical random sample of the original dataset. These samples can be created offline using the CTAS (CREATE TABLE AS SELECT) operator. For instance, suppose we have a table called logs in our warehouse. A 1% random sample of logs can be created as:
$ CREATE TABLE logs_sample AS SELECT * FROM logs SAMPLEWITH 0.01
NOTE: Samples can also be created on any materialized view using the SAMPLEWITH operator.
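Conceptually, a sampling ratio of 0.01 means each row of the source table is retained with probability 1%. One common way to realize such a statistical random sample is a per-row Bernoulli draw, sketched below in Python (the in-memory `logs` list and the `sample_with` helper are illustrative stand-ins, not BlinkDB API):

```python
import random

def sample_with(rows, ratio, seed=None):
    """Keep each row independently with probability `ratio`
    (a per-row Bernoulli sample, one way to realize SAMPLEWITH)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < ratio]

logs = list(range(100_000))           # stand-in for the `logs` table
sample = sample_with(logs, 0.01, seed=42)
print(len(sample))                    # roughly 1% of 100,000 rows
```

Because the draw is per-row, the sample size is only approximately 1% of the table; BlinkDB materializes the result as a regular table so the sampling cost is paid once, offline.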
The easiest way to run BlinkDB is to start a BlinkDB Command Line Client (CLI) and begin executing queries. The BlinkDB CLI connects directly to the Hive Metastore, so it is compatible with existing Hive deployments. BlinkDB executables are available in the bin/ directory. To start the BlinkDB CLI, simply run:
$ ./bin/blinkdb # Start CLI for interactive session
$ ./bin/blinkdb -e "SELECT * FROM foo" # Run a specific query and exit
$ ./bin/blinkdb -i queries.hql # Run queries from a file
$ ./bin/blinkdb -H # Start CLI and print help
You can enter queries into the CLI directly, or use a flag to pass it a file. The BlinkDB CLI will only work correctly if the HIVE_HOME environment variable is set (see Configuration). Alternative versions of the CLI exist which print out more information: bin/blinkdb-withinfo and bin/blinkdb-withdebug.
Configuration variables are environment variables that must be set for the BlinkDB driver and slaves to run correctly. These are specified in conf/blinkdb-env.sh. A few of the more important ones are described here:
HIVE_HOME # Path to directory containing patched Hive jars
HIVE_CONF_DIR # Optional, a different path containing Hive configuration files
SPARK_MEM # How much memory to allocate to slaves (e.g. '1500m', '5g')
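Putting these together, a conf/blinkdb-env.sh might contain entries like the following (the paths and memory size are illustrative; substitute values for your own deployment):

```shell
# conf/blinkdb-env.sh -- illustrative values only
export HIVE_HOME=/opt/hive            # directory containing the patched Hive jars
export HIVE_CONF_DIR=/etc/hive/conf   # optional: separate Hive configuration dir
export SPARK_MEM=5g                   # memory to allocate per slave
```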