-
Notifications
You must be signed in to change notification settings - Fork 0
Canonical event model
HOME > [SNOWPLOW TECHNICAL DOCUMENTATION](SnowPlow technical documentation) > [Storage](storage documentation)
# Canonical data structure- Overview
- The SnowPlow canonical data structure: understanding the individual fields
- A note about data storage formats
In order to analyse SnowPlow data, it is important to understand how it is structured. We have tried to make the structure of SnowPlow data as simple, logical, and easy-to-query as possible.
- Single table SnowPlow data is stored in a single, "fat" (many columns) table. We call this the SnowPlow events table
- Each line represents one event. Each line in the table represents a single event, be that a page view, add to basket, play video, like etc.
- Immutable log. The SnowPlow data table is designed to be immutable: the data in each line should not change over time. Data points that we would expect to change over time (e.g. what cohort a particular user belongs to, how we classify a particular visitor) can be derived from SnowPlow data. However, our recommendation is that these derived fields should be defined and calculated at analysis time, stored in a separate table and joined to the SnowPlow events table when performing any analysis
- Structured events. SnowPlow explicitly recognises particular events that are common in web analytics (e.g. page views, transactions, ad impressions) as 'first class citizerns'. We have a model for the types of fields that may be captured when those events occur, and specific columns in the database that correspond to those fields
- Unstructured events. We intend to build out support for unstructured events. (So that SnowPlow users will be able to construct their own arbritary JSONs of fields for their own event types.) When supported, these JSONs will be stored as-is e.g. in a single 'unstructured event' column in Hive
- Whilst some fields are event specific (e.g.
tr_city
representing the city in a delivery address for an online transaction), others are platform specific (e.g.page_url
for events that occur on the web) and some are relevant across platforms, devices and events (e.g.dt
andtm
, the date and time an event occurs, orapp_id
, the application ID). - Evolving over time. We are building out the canonical data structure to make its understanding of individual event-types richer (more events, more fields per events) and to support more platforms. This needs to be done in a collaborative way with the SnowPlow community, so that the events and fields that you need to track can easily be accommodated in this data structure. Please [get in touch] (Talk-to-us) to discuss your ideas and requirements
2.1 Common fields (platform and event independent) 2.1.1 Application fields 2.1.2 Date / time fields 2.1.3 Event / transaction fields 2.1.4 SnowPlow version fields 2.1.5 User-related fields 2.1.6 Device-related fields 2.2 Platform-specific fields 2.2.1 Web-specific fields 2.3 Event-specific fields 2.3.1 Page views 2.3.2 Page pings 2.3.3 Link clicks 2.3.4 Ad impressions 2.3.5 Ecommerce transations 2.3.6 Social events 2.3.7 Item views 2.3.8 Error tracking 2.3.9 Custom structured events 2.3.10 Custom unstructured events
### 2.1 Common fields (platform and event independent)Back to top.
#### 2.1.1 Application fields #### 2.1.2 Date / time fields #### 2.1.3 Event / transaction fields #### 2.1.4 SnowPlow version fields #### 2.1.5 User-related fields #### 2.1.6 Device-related fields ### 2.2 Platform-specific fields #### 2.2.1 Web-specific fields ### 2.3 Event-specific fields #### 2.3.1 Page views #### 2.3.2 Page pings #### 2.3.3 Link clicks #### 2.3.4 Ad impressions #### 2.3.5 Ecommerce transactions #### 2.3.6 Social events #### 2.3.7 Item views #### 2.3.8 Error tracking #### 2.3.9 Custom structured events #### 2.3.10 Custom unstructured eventsThe fields recorded in the SnowPlow events table today:
FIELD | DATATYPE | CAN BE EMPTY? | DESCRIPTION |
---|---|---|---|
App | Data related to the specific website / application SnowPlow is tracking behaviour on | ||
app_id |
STRING | Yes | Unique identifier for the website or app (useful when SnowPlow is run across several websites / applications) |
platform |
STRING | Yes | Can be used to differentiate between data collected from e.g. iOS app vs web app vs XBox app etc. |
Date/time | The date and time of this page view or event | ||
dt |
STRING | No | Date |
tm |
STRING | No | Time |
Event | Identifying this logging event | ||
event |
STRING | No | The event type e.g. page_view , ad_impression , transaction , custom
|
event_id |
STRING | No | A unique event ID |
txn_id |
STRING | No | Generated by the Javascript tracker to enable deduplication where 1 event turns up twice by the Cloudfront collector |
SnowPlow version | Identifying what version of SnowPlow generated this line of data. (To allow for schema changes over time.) | ||
v_tracker |
STRING | No | Tracker type and version |
v_collector |
STRING | No | Collector type and version |
v_etl |
STRING | No | ETL type and version |
User and visit | Identifying the web user and this specific visit | ||
user_id |
STRING | No | A unique ID assigned to each browser and stored on the SnowPlow cookie. |
user_ipaddress |
STRING | No | Visitor IP Address |
user_fingerprint |
STRING | Yes | A user fingerprint generated by the Javascript tracker. Used to enable tracking of users across domains when using the Javascript (i.e. client-side) tracker. Note this feature is experimental. |
visit_id |
INT | No | A counter that indicates what visit this is for this particular user_id i.e. 1 if this is a user's first visit, 2 if it is his / her 2nd visit |
Page | Identifying the web page being visited | ||
page_url |
STRING | No | The web page URL |
page_title |
STRING | Yes | The web page title |
page_referrer |
STRING | No | The referrer URL. If this is the first page view of a session, it points at the referrering website / search engine if applicable |
Marketing | Marketing campaign attribution | ||
mkt_medium |
STRING | Yes | The type of ad used e.g. cpc, banner, email, affiliate... |
mkt_source |
STRING | Yes | The source of the ad: used e.g. Google, MSN, Facebook, TradeDoubler |
mkt_term |
STRING | Yes | Any keywords associated with the ad. This is relevant for search ads |
mkt_content |
STRING | Yes | The content of the ad, or a reference to the creative ID. Used e.g. to compare the results within a campaign between different creatives. |
mkt_campaign |
STRING | Yes | The campaign name. A single campaign may involve ads on multiple sources / mediums, so mkt_campaign is often a way of grouping them together into a single marketing initiative |
Custom Event | If this is an event being logged, its details | ||
ev_category |
STRING | Yes | The category of event e.g. 'ecomm', 'media' |
ev_action |
STRING | Yes | The action performed e.g. 'play-video', 'add-to-basket' |
ev_label |
STRING | Yes | A label associated with the event / action. This is often set to the object and action is performed on e.g. the product_id of the item added-to-basket, or the ID of the video played |
ev_property |
STRING | Yes | A property associated with the event / action. This might be the number of seconds into a video play starts, or the quantity of an item added to basket |
ev_value |
STRING | Yes | A value associated with with the action e.g. the value of the items added to basket |
Ecommerce | Ecomm transaction tracking | ||
tr_orderid |
STRING | Yes | Unique ID for the transaction |
tr_affiliation |
STRING | Yes | |
tr_total |
DECIMAL | Yes | Total transaction value |
tr_tax |
DECIMAL | Yes | Total sales tax |
tr_shipping |
DECIMAL | Yes | Total shipping charged |
tr_city |
STRING | Yes | Buyer city location |
tr_state |
STRING | Yes | Buyer state location |
tr_country |
STRING | Yes | Buyer country location |
ti_orderid |
STRING | Yes | Unique ID for the transaction, same as tr_orderid
|
ti_sku |
STRING | Yes | Item SKU |
ti_name |
STRING | Yes | Item name |
ti_category |
STRING | Yes | Category of item |
ti_price |
DECIMAL | Yes | Item price |
ti_quantity |
INT | Yes | Quantity of item purchased |
User Agent | |||
useragent |
STRING | Yes | Raw useragent string |
Browser | Information about the web browser | ||
br_name |
STRING | Yes | Browser name e.g. Internet Explorer |
br_family |
STRING | Yes | Browser family e.g. Chrome |
br_version |
STRING | Yes | Browser version |
br_type |
STRING | Yes | Type of client e.g. Browser, robot |
br_renderengine |
STRING | Yes | Browser rendering engine e.g. GECKO, WEBKIT |
br_lang |
STRING | Yes | Language that the browser is set to |
br_features |
ARRAY[STRING] | Yes | Contains a set of all features supported by this browser, e.g. fla for Flash, pdf for PDF support |
br_cookies |
BOOLEAN | Yes | Flag set to 'true' if browser permits cookies |
br_colordepth |
STRING | Yes | Browser color depth |
Operating System | Information about the host operating system | ||
os_name |
STRING | Yes | Operating system name e.g. Windows |
os_family |
STRING | Yes | Operating system family e.g. Android |
os_manufacturer |
STRING | Yes | Operating system manufacturer e.g. Apple Inc., Google Inc. |
os_timezone |
STRING | Yes | Timezone (as recorded by the operating system) |
Device/Hardware | Information about the host device/hardware | ||
dvce_type |
STRING | Yes | Device type e.g. computer, mobile... |
dvce_ismobile |
BOOLEAN | Yes | Flag set if user is browsing on a mobile device |
dvce_screenwidth |
INT | Yes | Screenwidth in pixels |
dvce_screenheight |
INT | Yes | Screenheight in pixels |
We are building out the SnowPlow events table to incorporate additional fields in the near future. A list of all the fields we currently plan to implement is given below: (This does not include those already implemented)
FIELD | DATATYPE | CAN BE EMPTY? | Implemented? | DESCRIPTION |
---|---|---|---|---|
User and visit | Yes | Identifying the web user and this specific visit | ||
user_domain |
STRING | Yes | No | User domain (based on IP address) |
connection_type |
STRING | Yes | No | Type of connection |
Marketing | Marketing campaign attribution | |||
mkt_referrerurl |
STRING | Yes | No | URL of referrer. Same as page_referrer but only set where page_referrer is not the same domain as the website |
Social event | Fields for social event tracking | |||
social_network |
STRING | Yes | No | Social network that action relates to e.g. Facebook , Twitter
|
social_action |
STRING | Yes | No | Action that user performed e.g. like
|
social_target |
STRING | Yes | No | The social object that the action was performed on e.g. the video that was 'liked' |
social_pagepath |
STRING | Yes | No | The page URL the action was committed on |
Browser | Information about the web browser | |||
br_jsversion |
STRING | Yes | No | Version of Javascript that browser is running |
br_windowheight |
INT | Yes | No | Height of browser window in pixels |
br_windowwidth |
INT | Yes | No | Width of browser window in pixels |
Geo / location | Information on the location of the visitor, derived from IP address | |||
geo_country |
STRING | Yes | No | Country visitor is located in |
geo_region |
STRING | Yes | No | Region visitor is located in, within country |
geo_city |
STRING | Yes | No | City visitor is located in |
geo_postcode |
STRING | Yes | No | Visitor postcode |
geo_latitude |
STRING | Yes | No | Visitor location latitude |
geo_longitude |
STRING | Yes | No | Visitor location longitude |
Custom variables | Customer variables | |||
cv_persist1 |
STRING | Yes | No | Custom variable which will persist across sessions (e.g. relates to the app install or the browser user) |
cv_persist2 |
STRING | Yes | No | As above |
cv_persist3 |
STRING | Yes | No | As above |
cv_persist4 |
STRING | Yes | No | As above |
cv_persist5 |
STRING | Yes | No | As above |
cv_persist6 |
STRING | Yes | No | As above |
cv_persist7 |
STRING | Yes | No | As above |
cv_persist8 |
STRING | Yes | No | As above |
cv_persist9 |
STRING | Yes | No | As above |
cv_persist10 |
STRING | Yes | No | As above |
cv_session1 |
STRING | Yes | No | Custom variable with scope set at the session (e.g. visit_id ) level. (Ie applies to this particular user engaged in this particular workflow.) |
cv_session2 |
STRING | Yes | No | As above |
cv_session3 |
STRING | Yes | No | As above |
cv_session4 |
STRING | Yes | No | As above |
cv_session5 |
STRING | Yes | No | As above |
cv_session6 |
STRING | Yes | No | As above |
cv_session7 |
STRING | Yes | No | As above |
cv_session8 |
STRING | Yes | No | As above |
cv_session9 |
STRING | Yes | No | As above |
cv_session10 |
STRING | Yes | No | As above |
cv_context1 |
STRING | Yes | No | Custom variable with scope set to this particular context e.g. page, screen, widget... |
cv_context2 |
STRING | Yes | No | As above |
cv_context3 |
STRING | Yes | No | As above |
cv_context4 |
STRING | Yes | No | As above |
cv_context5 |
STRING | Yes | No | As above |
cv_context6 |
STRING | Yes | No | As above |
cv_context7 |
STRING | Yes | No | As above |
cv_context8 |
STRING | Yes | No | As above |
cv_context9 |
STRING | Yes | No | As above |
cv_context10 |
STRING | Yes | No | As above |
cv_event1 |
STRING | Yes | No | Custom variable with scope set to the event level |
cv_event2 |
STRING | Yes | No | As above |
cv_event3 |
STRING | Yes | No | As above |
cv_event4 |
STRING | Yes | No | As above |
cv_event5 |
STRING | Yes | No | As above |
cv_event6 |
STRING | Yes | No | As above |
cv_event7 |
STRING | Yes | No | As above |
cv_event8 |
STRING | Yes | No | As above |
cv_event9 |
STRING | Yes | No | As above |
cv_event10 |
STRING | Yes | No | As above |
cv_json |
STRING | Yes | No | Field that can be used to stuff any sort of custom event JSON if desired |
- Currently, SnowPlow data is stored in S3 (for processing in Apache Hive, Pig, and / or Mahout) and Infobright (for processing by any SQL-compatible analytics package).
- There are minor differences between the structure of data in both formats. These relate to data structures that Hive supports (e.g. maps, JSONs) that Infobright does not
- Nevertheless, the structure of both is similar: representing a fat table
- Going forwards our intention is to move the storage format for data on S3 from the current flatfile structure to Avro. This will become the 'canonical SnowPlow data structure'. Other formats (e.g. Infobright, Redshift etc.) will simply be 'flattened' versions of the same data
Home | About | Project | Setup Guide | Technical Docs | Copyright © 2012-2013 Snowplow Analytics Ltd
HOME > [TECHNICAL DOCUMENTATION](Snowplow technical documentation)
1. Trackers
Overview
Javascript Tracker
No-JS Tracker
Lua Tracker
Arduino Tracker
2. Collectors
Overview
Cloudfront collector
Clojure collector (Elastic Beanstalk)
SnowCannon (node.js)
3. Enrich
Overview
EmrEtlRunner
Scalding-based Enrichment Process
C. Canonical Snowplow event model
4. Storage
Overview
[Storage in S3](S3 storage)
Storage in Redshift
Storage in PostgreSQL
Storage in Infobright (deprecated)
The StorageLoader
D. Snowplow storage formats (to write)
5. Analytics
Analytics documentation
Common
Artifact repositories