EmrEtlRunner
EmrEtlRunner identifies the raw collector logs that need to be processed by looking in the in-bucket. (This is the bucket that the collector log files are generated in: the bucket's location is specified in the EmrEtlRunner config file.)
EmrEtlRunner then triggers the Enrichment process. It spins up an EMR cluster (whose size is set in the config file), uploads the JAR containing the Scalding Enrichment process, and instructs EMR to:
- Use S3DistCp to aggregate the collector log files and write them to HDFS
- Run the Enrichment process on those aggregated files in HDFS
- Write the output of that Enrichment to the out-bucket in S3 (as specified in the config file)

When the job has completed, EmrEtlRunner moves the processed collector log files from the in-bucket to the archive bucket (again, as specified in the config file).
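
The run order described above can be sketched as a small planning function. This is only an illustration of the sequence, not EmrEtlRunner's own code: the config keys, the bucket URLs, the HDFS path and the `plan_run` name are all hypothetical, and the real config file defines its own settings.

```python
from typing import Dict, List, Tuple

# Hypothetical config values for illustration only; the real EmrEtlRunner
# config file uses its own key names and structure.
EXAMPLE_CONFIG: Dict[str, str] = {
    "in_bucket": "s3://example-collector-logs/",
    "out_bucket": "s3://example-enriched-events/",
    "archive_bucket": "s3://example-log-archive/",
    "hdfs_staging": "hdfs:///example/raw-events/",
}

# Each step is (description, source, destination).
Step = Tuple[str, str, str]


def plan_run(config: Dict[str, str]) -> List[Step]:
    """Return the ordered steps of one run, as described in the text."""
    return [
        # EMR aggregates the collector logs from the in-bucket into HDFS.
        ("aggregate collector logs with S3DistCp",
         config["in_bucket"], config["hdfs_staging"]),
        # Enrichment reads the aggregated files and writes to the out-bucket.
        ("run Enrichment and write its output",
         config["hdfs_staging"], config["out_bucket"]),
        # After the job completes, the processed logs are archived.
        ("archive processed collector logs",
         config["in_bucket"], config["archive_bucket"]),
    ]


if __name__ == "__main__":
    steps = plan_run(EXAMPLE_CONFIG)
    for number, (description, source, destination) in enumerate(steps, start=1):
        print(f"{number}. {description}: {source} -> {destination}")
```

Note that the in-bucket appears twice as a source: once when the logs are copied into HDFS for processing, and again when the same files are moved to the archive bucket after the job has finished.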
By setting up a cron job to run EmrEtlRunner regularly, Snowplow users can ensure that event data flows regularly through the Snowplow data pipeline, from the collector to storage.
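
As a rough sketch of what such a schedule might look like, the helper below formats a standard five-field crontab entry for a once-a-day run. The daily frequency, the `daily_crontab_line` name and the wrapper script path are assumptions made for illustration; substitute whatever command actually launches EmrEtlRunner in your setup.

```python
def daily_crontab_line(hour: int, minute: int, command: str) -> str:
    """Format a crontab entry that runs `command` once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Crontab field order: minute, hour, day of month, month, day of week.
    return f"{minute} {hour} * * * {command}"


if __name__ == "__main__":
    # Hypothetical wrapper script path; it is not part of Snowplow.
    print(daily_crontab_line(4, 0, "/path/to/run-emretlrunner.sh"))
```

The printed line can be added to a user's crontab with `crontab -e`. Pick an interval longer than a typical run takes, so that one run has finished before the next one starts.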
Note: the documentation and the config file make many references to 'Hadoop ETL' and 'Hive ETL'. 'Hadoop ETL' refers to the current Scalding-based Enrichment process. 'Hive ETL' refers to the legacy Hive-based ETL process. EmrEtlRunner can be set up to run either. However, we recommend that all Snowplow users use the Scalding-based 'Hadoop ETL', as it is much more robust and cheaper to run.
Copyright © 2012-2013 Snowplow Analytics Ltd