Setting up Snowplow
Setting up Snowplow is a five-step process:
- Set up a Snowplow collector
- Set up a Snowplow tracker
- Set up EmrEtlRunner
- Set up alternative data stores (e.g. Redshift, PostgreSQL)
- Analyze your data!
## Step 1: Set up a Snowplow Collector
The Snowplow collector receives data from Snowplow trackers and logs that data to S3 for storage and further processing. Setting up a collector is the first step in the Snowplow setup process.
Set up a Snowplow collector now!
Collector set up? Then proceed to step 2: set up a tracker.
## Step 2: Set up a Snowplow Tracker
Snowplow trackers generate event data and send that data to Snowplow collectors to be captured. The most commonly used Snowplow tracker is the JavaScript tracker, which is integrated into websites (either directly or via a tag management solution) in the same way as any other web analytics tracker (e.g. Google Analytics or Omniture tags).
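To make the tracker-to-collector hand-off concrete, here is a minimal sketch of how a page view could be encoded as a GET request to a collector. It is not the JavaScript tracker's code: the host `collector.example.com`, the `/i` path and the short parameter names (`e`, `url`, `page`, `aid`, `p`) are assumptions used for illustration and should be checked against the tracker documentation.

```python
from urllib.parse import urlencode


def build_page_view_url(collector_host, page_url, page_title, app_id):
    """Return the URL a tracker might request to log one page view.

    The path and parameter names are illustrative assumptions,
    not a definitive statement of the Snowplow tracker protocol.
    """
    params = {
        "e": "pv",           # event type: page view (assumed code)
        "url": page_url,     # address of the page being viewed
        "page": page_title,  # human-readable page title
        "aid": app_id,       # identifier of the site or application
        "p": "web",          # platform the event came from
    }
    # urlencode percent-escapes the values so they are safe in a query string
    return "http://{}/i?{}".format(collector_host, urlencode(params))


if __name__ == "__main__":
    print(build_page_view_url(
        "collector.example.com", "http://example.com/", "Home", "shop"))
```

The collector only has to log each incoming request; everything needed to reconstruct the event travels in the query string.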
Note: once you have set up a collector and tracker, you can pause and perform the remaining setup steps later, because your data is already being generated and logged. When you eventually proceed to step 3: set up EmrEtlRunner, you will be able to process all the data you have logged since setup.
Tracker set up? Now proceed to step 3: set up EmrEtlRunner.
## Step 3: Set up EmrEtlRunner
The EmrEtlRunner application regularly takes the raw log files generated by the Snowplow collector and:
- Cleans up the data into a format that is easier to parse / analyse
- Enriches the data (e.g. infers the visitor's location from their IP address and the search engine keywords from the referrer query string)
- Stores that cleaned, enriched data in S3
Once you have set up EmrEtlRunner, the process of taking the raw data generated by the collector, then cleaning and enriching it, will be automated.
EmrEtlRunner set up? Proceed to step 4: set up the StorageLoader.
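As an illustration of the enrichment step described above, the following sketch infers search keywords from a referrer URL's query string. It is a simplified stand-in, not EmrEtlRunner's actual logic: the field names `referrer` and `search_keywords` and the list of query parameters checked are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# Query parameters that commonly carry search terms (an assumed, partial list)
SEARCH_PARAMS = ("q", "query", "p")


def search_keywords(referrer_url):
    """Return the search terms found in a referrer URL, or None."""
    query = parse_qs(urlparse(referrer_url).query)
    for key in SEARCH_PARAMS:
        if key in query:
            # parse_qs returns a list of values per key; take the first
            return query[key][0]
    return None


def enrich(event):
    """Return a copy of the event dict with a search_keywords field added."""
    enriched = dict(event)
    enriched["search_keywords"] = search_keywords(event.get("referrer", ""))
    return enriched


if __name__ == "__main__":
    raw = {"referrer": "http://www.google.com/search?q=snowplow+analytics"}
    print(enrich(raw))
```

Location lookup from an IP address follows the same pattern: a pure function from a raw field to a derived field, applied to every event.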
## Step 4: Set up the alternative data stores (e.g. Redshift, PostgreSQL)
Most Snowplow users store their web event data in at least two places: S3 for processing in Hadoop (e.g. to enable machine learning via Mahout) and a database (e.g. Redshift or PostgreSQL) for more traditional OLAP analysis.
The StorageLoader is an application that regularly transfers data from S3 into other databases, e.g. Redshift. If you only wish to process your data using Hadoop on EMR, you do not need to set up the StorageLoader. However, if you would find it convenient to have your data in another data store (e.g. Redshift), you can set this up at this stage.
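For intuition about what loading from S3 into Redshift involves, the sketch below builds a Redshift `COPY` statement for tab-separated files. It is not the StorageLoader's implementation; the table name, bucket path and the assumption that the enriched files are tab-delimited are illustrative only.

```python
def build_copy_statement(table, s3_path, access_key, secret_key):
    """Return a Redshift COPY statement that loads tab-separated files from S3.

    Table name, S3 path and file format are hypothetical examples.
    """
    credentials = "aws_access_key_id={};aws_secret_access_key={}".format(
        access_key, secret_key)
    return (
        "COPY {table} FROM '{path}' "
        "CREDENTIALS '{creds}' "
        "DELIMITER '\\t'"
    ).format(table=table, path=s3_path, creds=credentials)


if __name__ == "__main__":
    # Placeholder credentials; never hard-code real keys
    print(build_copy_statement(
        "events", "s3://my-bucket/enriched/", "AKIAEXAMPLE", "SECRETEXAMPLE"))
```

The resulting statement would be run against the database on a schedule, so that newly enriched files in S3 become queryable shortly after each EmrEtlRunner run.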
Set up alternative data stores.
Alternative data stores set up? Then proceed to step 5: analyse your data.
## Step 5: Analyse your data!
Once your data is stored in S3 and Redshift, setup is complete and you are in a position to start analysing it. As part of the setup guide, we run through the steps necessary to perform some initial analysis and plug in a couple of analytics tools to get you started.
Get started analysing Snowplow data
![architecture] conceptual-architecture
You now have all five Snowplow subsystems working!
Copyright © 2012-2013 Snowplow Analytics Ltd
- [Step 1: Setup a Collector](setting-up-a-collector)
- [Step 2: Setup a Tracker](setting-up-a-tracker)
- [Step 3: Setup EmrEtlRunner](setting-up-EmrEtlRunner)
- [Step 4: Setup alternative data stores](setting-up-alternative-data-stores)
- [Step 5: Analyze your data!](getting-started-analyzing-Snowplow-data)
Useful resources