User Export #317

Open
wants to merge 13 commits into
base: main
from
138 changes: 97 additions & 41 deletions README.md
@@ -163,47 +163,6 @@ vars:
value_type: "int_value"
```

### User Properties

User properties are provided by GA4 in the `user_properties` repeated field. The most recent user property for each user will be extracted and included in the `dim_ga4__users` model by configuring the `user_properties` variable in your project as follows:

```
vars:
ga4:
user_properties:
- user_property_name: "membership_level"
value_type: "int_value"
- user_property_name: "account_status"
value_type: "string_value"
```

### Derived User Properties

Derived user properties are different from "User Properties" in that they are derived from event parameters. This provides additional flexibility in allowing users to turn any event parameter into a user property.

Derived User Properties are included in the `dim_ga4__users` model and contain the latest event parameter value per user.

```
derived_user_properties:
- event_parameter: "[your event parameter]"
user_property_name: "[a unique name for the derived user property]"
value_type: "[string_value|int_value|float_value|double_value]"
```

For example:

```
vars:
ga4:
derived_user_properties:
- event_parameter: "page_location"
user_property_name: "most_recent_page_location"
value_type: "string_value"
- event_parameter: "another_event_param"
user_property_name: "most_recent_param"
value_type: "string_value"
```

### Derived Session Properties

Derived session properties are similar to derived user properties, but on a per-session basis, for properties that change slowly over time. This provides additional flexibility in allowing users to turn any event parameter into a session property.
@@ -280,6 +239,103 @@ vars:
- name: "some_other_parameter"
value_type: "string_value"
```
# User Tables

This package contains two sets of user tables: the original set, which has been part of the package since its inception, and a newer set built on the GA4 BigQuery user export tables, which were released after the package first launched.

The original user tables are one-row-per-user tables that include data like first and last device, first and last geo, user properties, and derived user properties. Building them requires processing all-time data, so large sites may want to disable these tables to save costs.

The newer user tables leverage the GA4 user export setting. They are partitioned, which makes them more appropriate for high-traffic sites. They drop the first and last columns and derived user properties, but include user properties, audiences, user LTV, and predictive data.

## Settings Common to Both Sets of User Tables

### User Properties

User properties are provided by GA4 in the `user_properties` repeated field. The most recent user property for each user will be extracted and included in the `dim_ga4__users` model by configuring the `user_properties` variable in your project as follows:

```
vars:
ga4:
user_properties:
- user_property_name: "membership_level"
value_type: "int_value"
- user_property_name: "account_status"
value_type: "string_value"
```

## dbt-GA4 Original User Table Settings

### Derived User Properties

Derived user properties are different from "User Properties" in that they are derived from event parameters. This provides additional flexibility in allowing users to turn any event parameter into a user property.

Derived User Properties are included in the `dim_ga4__users` model and contain the latest event parameter value per user.

```
derived_user_properties:
- event_parameter: "[your event parameter]"
user_property_name: "[a unique name for the derived user property]"
value_type: "[string_value|int_value|float_value|double_value]"
```

For example:

```
vars:
ga4:
derived_user_properties:
- event_parameter: "page_location"
user_property_name: "most_recent_page_location"
value_type: "string_value"
- event_parameter: "another_event_param"
user_property_name: "most_recent_param"
value_type: "string_value"
```
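
To make "latest event parameter value per user" concrete, here is a small Python sketch of that selection logic on in-memory rows. The row shape (`user_key`, `event_timestamp`, `params`), the helper name, and the choice to skip events that lack the parameter are all assumptions for illustration; the package itself computes this in SQL in the warehouse.

```python
def latest_param_per_user(events, event_parameter):
    """Map each user to the most recent non-null value of one event parameter."""
    latest = {}  # user_key -> (event_timestamp, value)
    for event in events:
        value = event["params"].get(event_parameter)
        if value is None:
            continue  # assumption: events without the parameter are ignored
        user = event["user_key"]
        if user not in latest or event["event_timestamp"] > latest[user][0]:
            latest[user] = (event["event_timestamp"], value)
    return {user: value for user, (_, value) in latest.items()}


# Hypothetical sample rows, not real GA4 data.
events = [
    {"user_key": "u1", "event_timestamp": 100, "params": {"page_location": "/home"}},
    {"user_key": "u1", "event_timestamp": 200, "params": {"page_location": "/cart"}},
    {"user_key": "u2", "event_timestamp": 150, "params": {"page_location": "/blog"}},
    {"user_key": "u2", "event_timestamp": 300, "params": {}},
]

most_recent_page_location = latest_param_per_user(events, "page_location")
print(most_recent_page_location)
```

In this sketch, `most_recent_page_location` plays the role of the derived user property with the same name in the example config above.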

## GA4 User Export Settings

The GA4 user export models are disabled by default.

Enable them by adding the following model configs:

```
models:
ga4:
staging:
base:
base_ga4__pseudonymous_users:
+enabled: true
base_ga4__users:
+enabled: true
stg_ga4__client_keys:
+enabled: true
stg_ga4__users:
+enabled: true
```

### Audiences

The GA4 User Export includes an `audiences` repeated record that stores audience membership details. Audiences are enabled by adding a list of audience names that match values in the `audiences.name` fields of your `pseudonymous_users_` and `users_` tables, as shown below.

```
vars:
ga4:
audiences: ['Purchases', 'All Users']
```

This example will add the following columns to the relevant dbt-GA4 models:

- purchases_id
- purchases_name
- purchases_membership_start_timestamp_micros
- purchases_membership_expiry_timestamp_micros
- purchases_npa
- all_users_id
- all_users_name
- all_users_membership_start_timestamp_micros
- all_users_membership_expiry_timestamp_micros
- all_users_npa
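
The column names above appear to follow a simple pattern: the audience name lowercased with spaces replaced by underscores, followed by each field of the audience record. The Python sketch below reproduces the list under that assumption. The helper `audience_columns` and the exact normalization rule are hypothetical, inferred only from this example rather than taken from the package's macros.

```python
# Field suffixes taken from the column list in the README example.
AUDIENCE_FIELDS = [
    "id",
    "name",
    "membership_start_timestamp_micros",
    "membership_expiry_timestamp_micros",
    "npa",
]


def audience_columns(audiences):
    """Return the expected column names for a list of audience names.

    Assumption: names are lowercased and spaces become underscores.
    """
    columns = []
    for audience in audiences:
        prefix = audience.lower().replace(" ", "_")
        columns.extend(f"{prefix}_{field}" for field in AUDIENCE_FIELDS)
    return columns


if __name__ == "__main__":
    for column in audience_columns(["Purchases", "All Users"]):
        print(column)
```

Audience names containing other punctuation may be normalized differently by the package, so it is worth checking the compiled SQL before relying on a column name.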

# Connecting to BigQuery

This package assumes that BigQuery is the source of your GA4 data. Full instructions for connecting dbt to BigQuery are here: https://docs.getdbt.com/reference/warehouse-profiles/bigquery-profile
37 changes: 36 additions & 1 deletion macros/base_select.sql
@@ -163,4 +163,39 @@
WHEN event_name = 'purchase' THEN 1
ELSE 0
END AS is_purchase
{% endmacro %}
{% endmacro %}

{% macro base_select_usr_source() %}
{{ return(adapter.dispatch('base_select_usr_source', 'ga4')()) }}
{% endmacro %}

{% macro default__base_select_usr_source() %}
, user_info.last_active_timestamp_micros as user_info_last_active_timestamp_micros
, user_info.user_first_touch_timestamp_micros as user_info_user_first_touch_timestamp_micros
, user_info.first_purchase_date as user_info_first_purchase_date
, device.operating_system as device_operating_system
, device.category as device_category
, device.mobile_brand_name as device_mobile_brand_name
, device.mobile_model_name as device_mobile_model_name
, device.unified_screen_name as device_unified_screen_name
, geo.city as geo_city
, geo.country as geo_country
, geo.continent as geo_continent
, geo.region as geo_region
, user_ltv.revenue_in_usd as user_ltv_revenue_in_usd
, user_ltv.sessions as user_ltv_sessions
, user_ltv.engagement_time_millis as user_ltv_engagement_time_millis
, user_ltv.purchases as user_ltv_purchases
, user_ltv.engaged_sessions as user_ltv_engaged_sessions
, user_ltv.session_duration_micros as user_ltv_session_duration_micros
, predictions.in_app_purchase_score_7d as predictions_in_app_purchase_score_7d
, predictions.purchase_score_7d as predictions_purchase_score_7d
, predictions.churn_score_7d as predictions_churn_score_7d
, predictions.revenue_28d_in_usd as predictions_revenue_28d_in_usd
, privacy_info.is_limited_ad_tracking as privacy_info_is_limited_ad_tracking
, privacy_info.is_ads_personalization_allowed as privacy_info_is_ads_personalization_allowed
, parse_date('%Y%m%d' , occurrence_date) as occurrence_date
, parse_date('%Y%m%d' , last_updated_date) as last_updated_date
, user_properties
, audiences
{% endmacro %}
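
Review note: the macro flattens nested records into `<record>_<field>` columns and parses the two `YYYYMMDD` strings into dates. The Python sketch below mirrors that shape on a plain dictionary. It covers only a subset of the fields for brevity, and the helper names and sample values are hypothetical.

```python
from datetime import datetime

# Subset of the nested fields selected by the macro, kept short for brevity.
NESTED_FIELDS = {
    "device": ["operating_system", "category"],
    "geo": ["city", "country"],
}


def parse_yyyymmdd(value):
    """Python counterpart of parse_date('%Y%m%d', value)."""
    return datetime.strptime(value, "%Y%m%d").date()


def flatten_user_row(record):
    """Flatten nested records into '<record>_<field>' keys and parse the dates."""
    row = {}
    for parent, children in NESTED_FIELDS.items():
        nested = record.get(parent) or {}  # a missing record yields None values
        for child in children:
            row[f"{parent}_{child}"] = nested.get(child)
    row["occurrence_date"] = parse_yyyymmdd(record["occurrence_date"])
    row["last_updated_date"] = parse_yyyymmdd(record["last_updated_date"])
    return row


# Hypothetical sample record, not real export data.
example = {
    "device": {"operating_system": "Android", "category": "mobile"},
    "geo": {"city": "Toronto", "country": "Canada"},
    "occurrence_date": "20230710",
    "last_updated_date": "20230709",
}
flat = flatten_user_row(example)
print(flat)
```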
70 changes: 47 additions & 23 deletions macros/combine_property_data.sql
@@ -10,37 +10,61 @@
{# Otherwise use 'start_date' variable #}
{%- set earliest_shard_to_retrieve = var('start_date')|int -%}
{% endif %}

{% for property_id in var('property_ids') %}
{%- set schema_name = "analytics_" + property_id|string -%}

{% set modifications = [] %}
{%- set combine_specified_property_data_query -%}
create schema if not exists `{{target.project}}.{{var('combined_dataset')}}`;
{% if this.name == 'base_ga4__events' %}
{# Copy intraday tables #}
{%- set relations = dbt_utils.get_relations_by_pattern(schema_pattern=schema_name, table_pattern='events_intraday_%', database=var('source_project')) -%}
{% for relation in relations %}
{%- set relation_suffix = relation.identifier|replace('events_intraday_', '') -%}
{%- if relation_suffix|int >= earliest_shard_to_retrieve|int -%}
create or replace table `{{target.project}}.{{var('combined_dataset')}}.events_intraday_{{relation_suffix}}{{property_id}}` clone `{{var('source_project')}}.analytics_{{property_id}}.events_intraday_{{relation_suffix}}`;
{% do modifications.append( {'source_partition': 'events_intraday_' + relation_suffix , 'destination_partition': 'events_intraday_' + relation_suffix + property_id|string } ) %}
{%- endif -%}
{% endfor %}

{# Copy intraday tables #}
{%- set relations = dbt_utils.get_relations_by_pattern(schema_pattern=schema_name, table_pattern='events_intraday_%', database=var('source_project')) -%}
{% for relation in relations %}
{%- set relation_suffix = relation.identifier|replace('events_intraday_', '') -%}
{%- if relation_suffix|int >= earliest_shard_to_retrieve|int -%}
create or replace table `{{target.project}}.{{var('combined_dataset')}}.events_intraday_{{relation_suffix}}{{property_id}}` clone `{{var('source_project')}}.analytics_{{property_id}}.events_intraday_{{relation_suffix}}`;
{%- endif -%}
{% endfor %}

{# Copy daily tables and drop old intraday table #}
{%- set relations = dbt_utils.get_relations_by_pattern(schema_pattern=schema_name, table_pattern='events_%', exclude='events_intraday_%', database=var('source_project')) -%}
{% for relation in relations %}
{%- set relation_suffix = relation.identifier|replace('events_', '') -%}
{%- if relation_suffix|int >= earliest_shard_to_retrieve|int -%}
create or replace table `{{target.project}}.{{var('combined_dataset')}}.events_{{relation_suffix}}{{property_id}}` clone `{{var('source_project')}}.analytics_{{property_id}}.events_{{relation_suffix}}`;
drop table if exists `{{target.project}}.{{var('combined_dataset')}}.events_intraday_{{relation_suffix}}{{property_id}}`;
{%- endif -%}
{% endfor %}
{# Copy daily tables and drop old intraday table #}
{%- set relations = dbt_utils.get_relations_by_pattern(schema_pattern=schema_name, table_pattern='events_%', exclude='events_intraday_%', database=var('source_project')) -%}
{% for relation in relations %}
{%- set relation_suffix = relation.identifier|replace('events_', '') -%}
{%- if relation_suffix|int >= earliest_shard_to_retrieve|int -%}
create or replace table `{{target.project}}.{{var('combined_dataset')}}.events_{{relation_suffix}}{{property_id}}` clone `{{var('source_project')}}.analytics_{{property_id}}.events_{{relation_suffix}}`;
drop table if exists `{{target.project}}.{{var('combined_dataset')}}.events_intraday_{{relation_suffix}}{{property_id}}`;
{% do modifications.append( {'source_partition': 'events_' + relation_suffix , 'destination_partition': 'events_' + relation_suffix + property_id|string } ) %}
{%- endif -%}
{% endfor %}
{% elif this.name == 'base_ga4__pseudonymous_users' %}
{# Copy pseudonymous_users tables #}
{%- set relations = dbt_utils.get_relations_by_pattern(schema_pattern=schema_name, table_pattern='pseudonymous_users_%', database=var('source_project')) -%}
{{ log("Relations: " ~ relations ) }}
{% for relation in relations %}
{%- set relation_suffix = relation.identifier|replace('pseudonymous_users_', '') -%}
{%- if relation_suffix|int >= earliest_shard_to_retrieve|int -%}
create or replace table `{{target.project}}.{{var('combined_dataset')}}.pseudonymous_users_{{relation_suffix}}{{property_id}}` clone `{{var('source_project')}}.analytics_{{property_id}}.pseudonymous_users_{{relation_suffix}}`;
{% do modifications.append( {'source_partition': 'pseudonymous_users_' + relation_suffix , 'destination_partition': 'pseudonymous_users_' + relation_suffix + property_id|string } ) %}
{%- endif -%}
{% endfor %}
{% elif this.name == 'base_ga4__users' %}
{# Copy users tables #}
{%- set relations = dbt_utils.get_relations_by_pattern(schema_pattern=schema_name, table_pattern='users_%', database=var('source_project')) -%}
{% for relation in relations %}
{%- set relation_suffix = relation.identifier|replace('users_', '') -%}
{%- if relation_suffix|int >= earliest_shard_to_retrieve|int -%}
create or replace table `{{target.project}}.{{var('combined_dataset')}}.users_{{relation_suffix}}{{property_id}}` clone `{{var('source_project')}}.analytics_{{property_id}}.users_{{relation_suffix}}`;
{% do modifications.append( {'source_partition': 'users_' + relation_suffix , 'destination_partition': 'users_' + relation_suffix + property_id|string } ) %}
{%- endif -%}
{% endfor %}
{% endif %}
{%- endset -%}

{% do run_query(combine_specified_property_data_query) %}

{% if execute %}
{{ log("Cloned from `" ~ var('source_project') ~ ".analytics_" ~ property_id ~ ".events_*` to `" ~ target.project ~ "." ~ var('combined_dataset') ~ ".events_YYYYMMDD" ~ property_id ~ "`.", True) }}
{% for modification in modifications%}
{{ log("Cloned from `" ~ var('source_project') ~ ".analytics_" ~ property_id|string ~ "." ~ modification.source_partition ~"` to `" ~ target.project ~ "." ~ var('combined_dataset') ~ "." ~ modification.destination_partition ~"`", True) }}
{% endfor %}
{% endif %}
{% endfor %}
{% endmacro %}
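
Review note: each branch of the macro applies the same naming rule. It strips the table prefix, keeps shards whose numeric suffix is at or after the earliest shard, and appends the property ID to the destination name. The Python sketch below restates that rule for a list of table names. The function `plan_clones` and the sample values are hypothetical. The `isdigit` check is a simplification standing in for the macro's `exclude` pattern and its `|int` cast.

```python
def plan_clones(table_names, prefix, property_id, earliest_shard):
    """Return (source, destination) table-name pairs for one property."""
    plan = []
    for name in table_names:
        if not name.startswith(prefix):
            continue
        suffix = name[len(prefix):]
        if not suffix.isdigit():
            # e.g. 'events_intraday_20230106' when the prefix is 'events_'
            continue
        if int(suffix) >= earliest_shard:
            plan.append((name, f"{prefix}{suffix}{property_id}"))
    return plan


# Hypothetical table names and property ID.
tables = ["events_20230101", "events_20230105", "events_intraday_20230106"]
plan = plan_clones(tables, "events_", 123456, 20230103)
print(plan)
```

Because the property ID is appended after the date, the combined table suffix is longer than eight characters, which is why the base model reads only the first eight characters of `_table_suffix` when parsing the date.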

30 changes: 30 additions & 0 deletions models/staging/base/base_ga4__pseudonymous_users.sql
@@ -0,0 +1,30 @@
{% set partitions_to_replace = ['current_date'] %}
{% for i in range(var('static_incremental_days')) %}
{# append() mutates the list in place and returns None, so use `do` instead of re-assigning with `set` #}
{% do partitions_to_replace.append('date_sub(current_date, interval ' + (i+1)|string + ' day)') %}
{% endfor %}
{{
config(
pre_hook="{{ ga4.combine_property_data() }}" if var('combined_dataset', false) else "",
materialized = 'incremental',
incremental_strategy = 'insert_overwrite',
enabled=false,
partition_by={
"field": "occurrence_date",
"data_type": "date",
},
partitions = partitions_to_replace,
)
}}

with source as (
select
pseudo_user_id
, stream_id
{{ ga4.base_select_usr_source() }}
from {{ source('ga4', 'pseudonymous_users') }}
{% if is_incremental() %}
where parse_date('%Y%m%d', left(_table_suffix, 8)) in ({{ partitions_to_replace | join(',') }})
{% endif %}
)

select * from source
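
Review note: on incremental runs the model replaces today's partition plus the preceding `static_incremental_days` partitions, and it reads only shards whose first eight suffix characters parse to one of those dates. The Python sketch below walks through that selection. The function names and sample suffixes are hypothetical, and `today` is passed in explicitly rather than read from the clock.

```python
from datetime import date, datetime, timedelta


def partitions_to_replace(today, static_incremental_days):
    """Today plus the preceding `static_incremental_days` dates."""
    return [today - timedelta(days=i) for i in range(static_incremental_days + 1)]


def shard_date(table_suffix):
    """Date encoded in the first 8 characters of a shard suffix."""
    return datetime.strptime(table_suffix[:8], "%Y%m%d").date()


def shards_to_read(table_suffixes, today, static_incremental_days):
    """Keep only the shards whose date falls in the replaced partitions."""
    wanted = set(partitions_to_replace(today, static_incremental_days))
    return [s for s in table_suffixes if shard_date(s) in wanted]


today = date(2023, 7, 10)
# The second suffix carries a hypothetical property ID after the date.
suffixes = ["20230706", "20230708123456", "20230710"]
selected = shards_to_read(suffixes, today, 3)
print(selected)
```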