
HOME > [SNOWPLOW TECHNICAL DOCUMENTATION](Snowplow technical documentation) > Enrichment > EmrEtlRunner

An overview of how EmrEtlRunner orchestrates the enrichment process

Raw collector logs that need to be processed are identified in the in-bucket. (This is the bucket that the collector log files are generated in: its location is specified in the [EmrEtlRunner config file](config-file).)
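As a rough illustration, the bucket locations are all defined in that config file. The snippet below is a minimal sketch only: the key names and bucket URLs are placeholders, and the config file documentation should be treated as authoritative.

```yaml
# Illustrative sketch of the bucket settings in the EmrEtlRunner config file.
# Key names and bucket URLs are placeholders, not the authoritative schema.
:s3:
  :buckets:
    :in: s3n://my-collector-logs        # raw collector logs land here
    :processing: s3n://my-processing    # logs currently being enriched
    :out: s3n://my-enriched-events      # enriched output is written here
    :archive: s3n://my-archived-logs    # processed logs are moved here
```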

EmrEtlRunner then triggers the Enrichment process to run. It spins up an EMR cluster (the size of which is determined by the config file), uploads the JAR containing the Scalding Enrichment process, and instructs EMR to:

  1. Use S3DistCp to aggregate the collector log files and write them to HDFS
  2. Run the Enrichment process on those aggregated files in HDFS
  3. Write the output of the Enrichment process to the out-bucket in S3 (as specified in the config file)
  4. Once the job has completed, EmrEtlRunner moves the processed collector log files from the in-bucket to the archive bucket (again, as specified in the config file)
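The EMR cluster size referenced above is also set in the config file. Again, this is only a sketch: the key names, instance types and counts are placeholders, and the config file documentation should be consulted for the real schema.

```yaml
# Illustrative sketch of the EMR cluster sizing section of the config file.
# Key names, instance types and counts are placeholders only.
:emr:
  :jobflow:
    :master_instance_type: m1.small
    :core_instance_count: 2
    :core_instance_type: m1.small
```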

By setting up a cron job to run EmrEtlRunner regularly, Snowplow users can ensure that event data flows regularly through the Snowplow data pipeline, from the collector to storage.
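For example, a crontab entry along the following lines would run EmrEtlRunner once a day. The install path and command-line invocation shown are assumptions; check your own EmrEtlRunner installation for the correct command.

```
# Hypothetical crontab entry: run EmrEtlRunner at 04:00 every day.
# Path and --config flag are illustrative; adjust to your installation.
0 4 * * * cd /opt/snowplow/emr-etl-runner && bundle exec bin/snowplow-emr-etl-runner --config config/config.yml
```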

Note: many references are made to the 'Hadoop ETL' and the 'Hive ETL' in the documentation and the config file. 'Hadoop ETL' refers to the current Scalding-based Enrichment process; 'Hive ETL' refers to the legacy Hive-based ETL process. EmrEtlRunner can be set up to run either, but we recommend that all Snowplow users use the Scalding-based 'Hadoop ETL', as it is much more robust as well as cheaper to run.
