
1 Installing the StorageLoader

Yali Sassoon edited this page Aug 15, 2013 · 21 revisions

HOME > SNOWPLOW SETUP GUIDE > Step 4: setting up alternative data stores > Installing the StorageLoader

  1. Assumptions
  2. Dependencies
  3. Installation
  4. Configuration
  5. Next steps
## 1. Assumptions

This guide assumes that you have administrator access to a Unix-based server (e.g. Ubuntu, OS X, Fedora) which you can install StorageLoader on.

## 2. Dependencies

### 2.1 Software

To install StorageLoader, first make sure that your server has all of the following installed:

  1. Git - see the [Git Installation Guide](git-install)
  2. Ruby and RVM* - see our Ruby and RVM setup guide

* If you prefer, an alternative Ruby manager such as chruby or rbenv should work fine too.
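
As a quick sanity check, you can confirm these tools are on your PATH before proceeding (a minimal sketch; adjust the list if you use rbenv or chruby instead of RVM):

```shell
# Check that each required tool is available on the PATH.
# "bundle" comes with a working Ruby install (via the bundler gem).
for tool in git ruby bundle; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING - install it before continuing"
  fi
done
```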

Also note that if you are loading Snowplow events into a PostgreSQL database, StorageLoader must be run on the same server as PostgreSQL. This is because StorageLoader downloads the event files locally, and Postgres needs to ingest the data from the local file system.

### 2.2 S3 buckets

StorageLoader moves the Snowplow event files through two distinct S3 buckets during the load process. These buckets are as follows:

  1. In Bucket - contains the Snowplow event files to process
  2. Archive Bucket - where StorageLoader moves the Snowplow event files after successful loading

The In Bucket for StorageLoader is the same as the Out Bucket for the EmrEtlRunner - i.e. you will already have set up this bucket.

We recommend creating a new folder for the Archive Bucket - i.e. do not re-use EmrEtlRunner's own Archive Bucket. Create the required Archive Bucket in the same AWS region as your In Bucket.

Important: if you are using the StorageLoader to load your data into Redshift, you need to make sure your In Bucket is in the same region as your Redshift cluster. The reason is that Redshift currently only supports the bulk loading of data from S3 in the same region.
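
You can verify the regions match with a short shell sketch before running StorageLoader (the region values below are hypothetical - look up the real ones, e.g. with `aws s3api get-bucket-location --bucket my-snowplow-data` and the cluster details page in the Redshift console):

```shell
# Hypothetical values - replace with your real bucket and cluster regions
bucket_region="us-east-1"
cluster_region="us-east-1"

if [ "$bucket_region" = "$cluster_region" ]; then
  echo "OK: bucket and cluster are in the same region"
else
  echo "ERROR: Redshift can only bulk load from S3 in its own region" >&2
  exit 1
fi
```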

Right, now we can install StorageLoader.

## 3. Installation

First, check out the Snowplow repository and navigate to the StorageLoader root:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow/4-storage/storage-loader

StorageLoader depends on some Postgres libraries being installed to talk to Redshift. You will need to install these first. On Debian-based systems (e.g. Ubuntu):

$ sudo aptitude install libpq-dev

On Red Hat-based systems such as Fedora, the equivalent package is postgresql-devel.

Now you are ready to install the application on your system:

$ bundle install --deployment

Check it worked okay:

$ bundle exec bin/snowplow-storage-loader --version
snowplow-storage-loader 0.0.4
## 4. Configuration

StorageLoader requires a YAML format configuration file to run. We provide two configuration file templates in the Snowplow GitHub repository:

  1. [/4-storage/storage-loader/config/redshift.yml.sample](redshift-config-yml)
  2. [/4-storage/storage-loader/config/postgres.yml.sample](postgres-config-yml)

### Redshift sample configuration

The Redshift configuration template looks like this:

:aws:
  :access_key_id: ADD HERE
  :secret_access_key: ADD HERE
:s3:
  :region: ADD HERE # S3 bucket region must be the same as Redshift cluster region
  :buckets:
    :in: ADD HERE # Must be s3:// not s3n:// for Redshift
    :archive: ADD HERE
:download:
  :folder: # Not required for Redshift
:targets:
  - :name: "My Redshift database"
    :type: redshift
    :host: ADD HERE # The endpoint as shown in the Redshift console
    :database: ADD HERE # Name of database 
    :port: 5439 # Default Redshift port
    :table: events
    :username: ADD HERE 
    :password: ADD HERE 
    :maxerror: 1 # Stop loading on first error, or increase to permit more load errors

### Postgres sample configuration

The Postgres configuration template looks like this:

:aws:
  :access_key_id: ADD HERE
  :secret_access_key: ADD HERE
:s3:
  :region: ADD HERE # S3 bucket region
  :buckets:
    :in: ADD HERE
    :archive: ADD HERE
:download:
  :folder: ADD HERE # Postgres-only config option. Where to store the downloaded files
:targets:
  - :name: "My PostgreSQL database"
    :type: postgres
    :host: ADD HERE # Hostname of database server
    :database: ADD HERE # Name of database 
    :port: 5432 # Default Postgres port
    :table: atomic.events
    :username: ADD HERE 
    :password: ADD HERE 
    :maxerror: # Not required for Postgres

### Populating the configuration

To take each section from the configuration templates in turn:

#### aws

The aws variables should be self-explanatory - enter your AWS access key and secret here.

#### s3

The region variable should hold the AWS region in which your two data buckets (In Bucket and Archive Bucket) are located, e.g. "us-east-1" or "eu-west-1".

Important: please note that currently Redshift can only load from S3 buckets in the same region as the cluster - for example, if your cluster is in "us-east-1", both of your buckets must be in "us-east-1" too.

Within the s3 section, the buckets variables are as follows:

  • in is where you specify your In Bucket
  • archive is where you specify your Archive Bucket

Each of the bucket variables must start with an S3 protocol - we recommend using s3://, as the s3n:// protocol does not currently work with Redshift. Each variable can include a sub-folder within the bucket as required, and a trailing slash is optional.

Important: do not put your Archive Bucket location inside your In Bucket, or you will create circular references which StorageLoader cannot resolve when moving files.

The following are examples of valid bucket settings:

:buckets:
  :in: s3://my-snowplow-data/events/
  :archive: s3://my-snowplow-archive/events/
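
The protocol and nesting rules above can be sanity-checked with a short shell sketch (the bucket values here are hypothetical):

```shell
in_bucket="s3://my-snowplow-data/events/"
archive_bucket="s3://my-snowplow-archive/events/"

# Both buckets must use the s3:// protocol (s3n:// will not work with Redshift)
for b in "$in_bucket" "$archive_bucket"; do
  case "$b" in
    s3://*) ;;  # ok
    *) echo "ERROR: $b must start with s3://" >&2; exit 1 ;;
  esac
done

# The Archive Bucket must not sit inside the In Bucket
case "$archive_bucket" in
  "$in_bucket"*) echo "ERROR: archive bucket is inside the in bucket" >&2; exit 1 ;;
  *) echo "OK: bucket settings look valid" ;;
esac
```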

Please note that all buckets must exist prior to running StorageLoader.

#### download

This is where we configure the StorageLoader download operation, which downloads the Snowplow event files from Amazon S3 to your local server, ready for loading into your database.

This setting is needed for Postgres, but not for Redshift - if you are only loading into Redshift, you can safely leave it blank.

You will need to set the folder variable to a local directory path - please make sure that:

  • the path exists
  • it is writable by StorageLoader
  • it is empty
  • the postgres user can read the folder and every directory above it. This is necessary so that PostgreSQL can read the data in the folder when it comes to ingest it
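
For example, to prepare a download folder (the path below is hypothetical - use whatever you set as the folder variable):

```shell
# Create the download folder and make it traversable by other users,
# so the postgres user can read the files StorageLoader puts there.
folder="/tmp/snowplow-downloads"   # hypothetical path
mkdir -p "$folder"
chmod 755 "$folder"

# Confirm it is empty before a run
[ -z "$(ls -A "$folder")" ] && echo "download folder is empty and ready"
```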

#### targets

In this section we configure exactly what database(s) StorageLoader should load our Snowplow events into. At the moment, StorageLoader supports only two types of load target, Redshift and Postgres, which require slightly different configurations.

To take each variable in turn:

  1. name, enter a descriptive name for this Snowplow storage target
  2. type, the type of database we are loading into. Currently the only supported values are "postgres" and "redshift"
  3. host, the host (endpoint in Redshift parlance) of the database to load
  4. database, the name of the database to load
  5. port, the port of the database to load. 5439 is the default Redshift port; 5432 is the default Postgres port
  6. table, the name of the database table which will store your Snowplow events. Must have been set up previously
  7. username, the database user to load your Snowplow events with. You can leave this blank to default to the user running the script
  8. password, the password for the database user. Leave blank if there is no password
  9. maxerror, a Redshift-specific setting governing how many load errors should be permitted before failing the overall load. See the [Redshift COPY documentation](redshift-copy) for more details

Note that when loading into Postgres, StorageLoader must be run on the same server as the database (see the Dependencies section above), so host will typically be the server's own hostname or localhost.
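
To make the shape concrete, here is a hypothetical filled-in Redshift target - every value below (endpoint, database name, credentials) is invented and must be replaced with your own:

```yaml
:targets:
  - :name: "My Redshift database"
    :type: redshift
    :host: mycluster.abc123xyz789.us-east-1.redshift.amazonaws.com # hypothetical endpoint
    :database: snowplow      # hypothetical database name
    :port: 5439
    :table: events
    :username: storageloader # hypothetical user
    :password: XXXXXXXX      # replace with the real password
    :maxerror: 1
```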

### Loading multiple databases

It is possible to load Snowplow events into multiple storage targets using StorageLoader.

Simply add additional entries under the :targets: section, like so:

:targets:
  - :name: "My test PostgreSQL database"
    :type: postgres
    ...
  - :name: "My production PostgreSQL database"
    :type: postgres
    ...
  - :name: "My test Redshift database"
    :type: redshift
    ...
  - :name: "My production Redshift database"
    :type: redshift
    ...
## 5. Next steps

All done? You have the StorageLoader installed! Now find out how to use it.
