snowplow/data-models

MAINTENANCE-ONLY MODE

⚠️ For any new developments we highly recommend using our dbt packages instead of these data models. SQL-runner and the associated data models are no longer under active development and will only receive bug fixes going forward. Our dbt packages offer all the features of these SQL-runner data models, plus many more, as well as additional packages, and they support Redshift, BigQuery, Snowflake, Databricks, and Postgres.

Snowplow is a scalable open-source platform for rich, high-quality, low-latency data collection. It is designed to collect complete behavioral data for enterprise businesses.

Snowplow Pipeline Overview

The Snowplow trackers enable highly customizable collection of raw, unopinionated event data. The pipeline validates these events against a JSON Schema, to guarantee a high-quality dataset, and adds information via both standard and custom enrichments.

This data is then made available in-stream for real-time processing, and can also be loaded to blob storage and a data warehouse for analysis.

The Snowplow atomic data acts as an immutable log of all the actions that occurred across your digital products. The data model takes that data and transforms it into a set of derived tables optimized for analysis. Visit our documentation site for further explanation of the data modeling process.

Repo Contents

Documentation for the data models can be found on our documentation site.

Prerequisites

These models are written in a format that is runnable via SQL-runner, which is available for download as a zip file from GitHub Releases. The BigQuery model requires sql-runner >= v0.9.2, and the Snowflake model requires >= v0.9.3.
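For example, a release can be fetched and unpacked as follows; the version tag and asset name here are assumptions for illustration, so check the sql-runner Releases page for the exact artifact for your platform:

# download and unpack sql-runner (version tag and asset name are assumptions)
wget https://github.com/snowplow/sql-runner/releases/download/0.9.3/sql_runner_0.9.3_linux_amd64.zip
unzip sql_runner_0.9.3_linux_amd64.zip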

Those who don't wish to use SQL-runner to run models can instead use the -t and -o flags of the run_config.sh script to output the pure SQL for a model, according to how it has been configured for SQL-runner.

Each model also requires a dataset of Snowplow events, generated by one of the tracking SDKs, passed through the validation and enrichment steps of the pipeline, and loaded into a database.

The testing framework requires Python 3. Install its requirements with:

cd .tests
pip3 install -r requirements.txt

Quick start

To run a model and tests end to end, run the .scripts/e2e.sh bash script.
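A minimal invocation from the repository root is shown below; for any options the script accepts, see the README in .scripts/:

bash .scripts/e2e.sh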

For a quick start guide to each individual model, and specific details on each module, see the README in the model's database-specific folder (e.g. web/v1/redshift).

For detail on the structure of a model, see the README in the model's main folder (e.g. web/v1).

For detail on using the helper scripts, see the README in .scripts/

Running models in production

Using SQL-runner

Snowplow BDP

Snowplow BDP customers can configure jobs for SQL-runner in production via configuration files; see our docs site for details on doing so. The configs/datamodeling.json file in each model is an example configuration for the standard model, and configs/example_with_custom.json is an example configuration that includes a customization.

Open Source

For open-source users, the JSON files in the configs folders can't be used directly, but they represent the dependencies between the playbooks to run. Open-source users of SQL-runner should instrument their jobs to run the playbooks individually, respecting the dependencies specified, as sketched below.
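As an illustration, such a job might invoke sql-runner once per playbook, ordering the calls according to the config's dependencies; the playbook path and -var values below are placeholders:

# run a single playbook with sql-runner (path and variables are placeholders)
./sql-runner -playbook web/v1/redshift/sql-runner/playbooks/standard/01-base.yml.tmpl -var host=localhost,username=loader,password=secret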

For local use, the .scripts/run_config.sh script can be used to run a config; note that it does not resolve dependencies, but runs the playbooks in order of appearance.
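For instance, to run the standard Redshift web model's configuration locally (the paths are illustrative and mirror the example further below):

bash .scripts/run_config.sh -b ~/pathTo/sql-runner -c web/v1/redshift/sql-runner/configs/datamodeling.json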

dbt

For dbt users we have the snowplow-web and snowplow-mobile dbt packages, which allow you to run the web and/or mobile models via dbt. These packages support Redshift, BigQuery, Snowflake, Databricks, and Postgres.
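Adopting one of these packages follows the standard dbt workflow. As a minimal sketch, assuming snowplow/snowplow_web has been added to your project's packages.yml with an appropriate version range:

# install the package, then run the web model (selector name per the package docs)
dbt deps
dbt run --selector snowplow_web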

Using other tools

For those who wish to use other tools, one may configure playbooks and config JSON files for the desired model, then use the .scripts/run_config.sh script's -p and -o flags to fill templates and output pure SQL to file:

bash .scripts/run_config.sh -b ~/pathTo/sql-runner -c web/v1/bigquery/sql-runner/configs/example_with_custom.json -p -o tmp/sql;

This SQL can then be used directly or amended to suit the relevant tool.
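That could mean, for example, executing a generated file with the warehouse's own client; the connection string and filename below are illustrative:

# run one of the generated SQL files against Postgres (details are illustrative)
psql "postgresql://user:password@host:5432/dbname" -f tmp/sql/01-main.sql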

Copyright and license

The Snowplow Data Models project is Copyright 2020-2021 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.