Merge branch 'current' into nfiann-prerelease
nataliefiann authored Dec 30, 2024
2 parents 8142d5f + 65a46f1 commit ea46b86
Showing 46 changed files with 927 additions and 210 deletions.
92 changes: 92 additions & 0 deletions website/docs/docs/build/dimensions.md
Expand Up @@ -22,6 +22,7 @@ All dimensions require a `name`, `type`, and can optionally include an `expr` pa
| `description` | A clear description of the dimension. | Optional | String |
| `expr` | Defines the underlying column or SQL query for a dimension. If no `expr` is specified, MetricFlow uses the column with the same name as the dimension. You can supply either a column name or a SQL expression. | Optional | String |
| `label` | Defines the display value in downstream tools. Accepts plain text, spaces, and quotes (such as `orders_total` or `"orders_total"`). | Optional | String |
| [`meta`](/reference/resource-configs/meta) | Set metadata for a resource and organize resources. Accepts plain text, spaces, and quotes. | Optional | Dictionary |

Refer to the following for the complete specification for dimensions:

Expand All @@ -37,6 +38,8 @@ dimensions:
Refer to the following example to see how dimensions are used in a semantic model:
<VersionBlock firstVersion="1.9">
```yaml
semantic_models:
  - name: transactions
Expand All @@ -59,13 +62,50 @@ semantic_models:
        type_params:
          time_granularity: day
        label: "Date of transaction" # Recommend adding a label to provide more context to users consuming the data
        config:
          meta:
            data_owner: "Finance team"
        expr: ts
      - name: is_bulk
        type: categorical
        expr: case when quantity > 10 then true else false end
      - name: type
        type: categorical
```
</VersionBlock>
<VersionBlock lastVersion="1.8">
```yaml
semantic_models:
  - name: transactions
    description: A record for every transaction that takes place. Carts are considered multiple transactions for each SKU.
    model: {{ ref('fact_transactions') }}
    defaults:
      agg_time_dimension: order_date
    # --- entities ---
    entities:
      - name: transaction
        type: primary
        ...
    # --- measures ---
    measures:
      ...
    # --- dimensions ---
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
        label: "Date of transaction" # Recommend adding a label to provide more context to users consuming the data
        expr: ts
      - name: is_bulk
        type: categorical
        expr: case when quantity > 10 then true else false end
      - name: type
        type: categorical
```
</VersionBlock>
Dimensions are bound to the primary entity of the semantic model in which they are defined. For example, the dimension `type` is defined in a model that has `transaction` as its primary entity. `type` is therefore scoped to the `transaction` entity, and to reference this dimension you use the fully qualified dimension name, that is, `transaction__type`.
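
As a rough sketch of how a fully qualified dimension name is used, the metric below filters on the `is_bulk` dimension from the example above through its qualified name, `transaction__is_bulk`; `transaction__type` would be referenced the same way. The metric name `bulk_transactions` and the measure `transaction_count` are hypothetical, because the measures in the example above are elided.

```yaml
metrics:
  - name: bulk_transactions            # hypothetical metric name
    label: "Bulk transactions"
    type: simple
    type_params:
      measure: transaction_count       # hypothetical measure, not defined in the example above
    # The dimension is referenced as <primary entity>__<dimension name>
    filter: |
      {{ Dimension('transaction__is_bulk') }} = true
```

Qualifying the name with the entity keeps the reference unambiguous when several semantic models define a dimension with the same short name.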

Expand Down Expand Up @@ -101,12 +141,28 @@ This section further explains the dimension definitions, along with examples. Di

Categorical dimensions are used to group metrics by different attributes, features, or characteristics such as product type. They can refer to existing columns in your dbt model or be calculated using a SQL expression with the `expr` parameter. An example of a categorical dimension is `is_bulk_transaction`, which is a group created by applying a case statement to the underlying column `quantity`. This allows users to group or filter the data based on bulk transactions.

<VersionBlock firstVersion="1.9">

```yaml
dimensions:
  - name: is_bulk_transaction
    type: categorical
    expr: case when quantity > 10 then true else false end
    config:
      meta:
        usage: "Filter to identify bulk transactions, like where quantity > 10."
```
</VersionBlock>

<VersionBlock lastVersion="1.8">

```yaml
dimensions:
  - name: is_bulk_transaction
    type: categorical
    expr: case when quantity > 10 then true else false end
```
</VersionBlock>

## Time

Expand All @@ -130,12 +186,17 @@ You can set `is_partition` for time to define specific time spans. Additionally,

Use `is_partition: True` to show that a dimension exists over a specific time window, for example in a date-partitioned dimensional table. When you query metrics from different tables, the dbt Semantic Layer uses this parameter to ensure that the correct dimensional values are joined to measures.

<VersionBlock firstVersion="1.9">

```yaml
dimensions:
  - name: created_at
    type: time
    label: "Date of creation"
    expr: ts_created # ts_created is the underlying column name from the table
    config:
      meta:
        notes: "Only valid for orders from 2022 onward"
    is_partition: True
    type_params:
      time_granularity: day
Expand All @@ -156,6 +217,37 @@ measures:
    expr: 1
    agg: sum
```
</VersionBlock>

<VersionBlock lastVersion="1.8">

```yaml
dimensions:
  - name: created_at
    type: time
    label: "Date of creation"
    expr: ts_created # ts_created is the underlying column name from the table
    is_partition: True
    type_params:
      time_granularity: day
  - name: deleted_at
    type: time
    label: "Date of deletion"
    expr: ts_deleted # ts_deleted is the underlying column name from the table
    is_partition: True
    type_params:
      time_granularity: day
measures:
  - name: users_deleted
    expr: 1
    agg: sum
    agg_time_dimension: deleted_at
  - name: users_created
    expr: 1
    agg: sum
```
</VersionBlock>

</TabItem>

Expand Down
69 changes: 63 additions & 6 deletions website/docs/docs/build/entities.md
Expand Up @@ -95,17 +95,67 @@ Natural keys are columns or combinations of columns in a table that uniquely ide

The following is the complete spec for entities:

<VersionBlock firstVersion="1.9">

```yaml
semantic_models:
  - name: semantic_model_name
    ..rest of the semantic model config
    entities:
      - name: entity_name ## Required
        type: Primary, natural, foreign, or unique ## Required
        description: A description of the field or role the entity takes in this table ## Optional
        expr: The field that denotes that entity (transaction_id). ## Optional
              Defaults to name if unspecified.
        [config](/reference/resource-properties/config): Specify configurations for entity. ## Optional
          [meta](/reference/resource-configs/meta): {<dictionary>} Set metadata for a resource and organize resources. Accepts plain text, spaces, and quotes. ## Optional
```
</VersionBlock>
<VersionBlock lastVersion="1.8">
```yaml
semantic_models:
  - name: semantic_model_name
    ..rest of the semantic model config
    entities:
      - name: entity_name ## Required
        type: Primary, or natural, or foreign, or unique ## Required
        description: A description of the field or role the entity takes in this table ## Optional
        expr: The field that denotes that entity (transaction_id). ## Optional
              Defaults to name if unspecified.
```
</VersionBlock>
Here's an example of how to define entities in a semantic model:
<VersionBlock firstVersion="1.9">
```yaml
entities:
- name: transaction ## Required
type: Primary or natural or foreign or unique ## Required
- name: transaction
type: primary
expr: id_transaction
- name: order
type: foreign
expr: id_order
- name: user
type: foreign
expr: substring(id_order from 2)
entities:
- name: transaction
type:
description: A description of the field or role the entity takes in this table ## Optional
expr: The field that denotes that entity (transaction_id). ## Optional
expr: The field that denotes that entity (transaction_id).
Defaults to name if unspecified.
[config](/reference/resource-properties/config):
[meta](/reference/resource-configs/meta):
data_owner: "Finance team"
```
</VersionBlock>
<VersionBlock lastVersion="1.8">
Here's an example of how to define entities in a semantic model:
```yaml
entities:
- name: transaction
Expand All @@ -117,11 +167,18 @@ entities:
- name: user
type: foreign
expr: substring(id_order from 2)
entities:
- name: transaction
type:
description: A description of the field or role the entity takes in this table ## Optional
expr: The field that denotes that entity (transaction_id).
Defaults to name if unspecified.
```
</VersionBlock>
## Combine columns with a key
If a table doesn't have any key (like a primary key), use _surrogate combination_ to form a key that will help you identify a record by combining two columns. This applies to any [entity type](/docs/build/entities#entity-types). For example, you can combine `date_key` and `brand_code` from the `raw_brand_target_weekly` table to form a _surrogate key_. The following example creates a surrogate key by joining `date_key` and `brand_code` using a pipe (`|`) as a separator.

```yaml
Expand Down
1 change: 0 additions & 1 deletion website/docs/docs/build/environment-variables.md
Expand Up @@ -105,7 +105,6 @@ dbt Cloud has a number of pre-defined variables built in. Variables are set auto
The following environment variable is set automatically for the dbt Cloud IDE:

- `DBT_CLOUD_GIT_BRANCH` &mdash; Provides the development Git branch name in the [dbt Cloud IDE](/docs/cloud/dbt-cloud-ide/develop-in-the-cloud).
- Available in dbt v1.6 and later.
- The variable changes when the branch is changed.
- Doesn't require restarting the IDE after a branch change.
- Currently not available in the [dbt Cloud CLI](/docs/cloud/cloud-cli-installation).
Expand Down
2 changes: 1 addition & 1 deletion website/docs/docs/build/incremental-microbatch.md
Expand Up @@ -232,7 +232,7 @@ from {{ source('sales', 'transactions') }}

### Full refresh

As a best practice, we recommend configuring `full_refresh: False` on microbatch models so that they ignore invocations with the `--full-refresh` flag. If you need to reprocess historical data, do so with a targeted backfill that specifies explicit start and end dates.
As a best practice, we recommend [configuring `full_refresh: false`](/reference/resource-configs/full_refresh) on microbatch models so that they ignore invocations with the `--full-refresh` flag. If you need to reprocess historical data, do so with a targeted backfill that specifies explicit start and end dates.
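
A minimal sketch of what this recommendation could look like in a model's YAML properties is shown below. The model name `sessions`, the `event_time` column `session_start`, and the `begin` date are illustrative assumptions, not values taken from this page.

```yaml
models:
  - name: sessions                       # hypothetical model name
    config:
      materialized: incremental
      incremental_strategy: microbatch
      event_time: session_start          # hypothetical timestamp column
      begin: "2020-01-01"                # hypothetical start of the data
      batch_size: day
      full_refresh: false                # ignore the --full-refresh flag for this model
```

With this in place, reprocessing history would go through a targeted backfill, for example `dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"` (the dates here are placeholders).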

## Usage

Expand Down
82 changes: 70 additions & 12 deletions website/docs/docs/build/join-logic.md
Expand Up @@ -10,24 +10,28 @@ Joins are a powerful part of MetricFlow and simplify the process of making all v

Joins use `entities` defined in your semantic model configs as the join keys between tables. Assuming entities are defined in the semantic model, MetricFlow creates a graph using the semantic models as nodes and the join paths as edges to perform joins automatically. MetricFlow chooses the appropriate join type and avoids fan-out or chasm joins with other tables based on the entity types.

<details>
<summary>What are fan-out or chasm joins?</summary>
<div>
<div>&mdash; Fan-out joins are when one row in a table is joined to multiple rows in another table, resulting in more output rows than input rows.<br /><br />
&mdash; Chasm joins are when two tables have a many-to-many relationship through an intermediate table, and the join results in duplicate or missing data. </div>
</div>
</details>

<Expandable alt_header="What are fan-out or chasm joins?" >
- Fan-out joins are when one row in a table is joined to multiple rows in another table, resulting in more output rows than input rows.
- Chasm joins are when two tables have a many-to-many relationship through an intermediate table, and the join results in duplicate or missing data.
</Expandable>

## Types of joins

:::tip Joins are auto-generated
MetricFlow automatically generates the necessary joins to the defined semantic objects, eliminating the need for you to create new semantic models or configuration files.

This document explains the different types of joins that can be used with entities and how to query them using the CLI.
This section explains the different types of joins that can be used with entities and how to query them.
:::

MetricFlow primarily uses left joins for joins, and restricts the use of fan-out and chasm joins. Refer to the table below to identify which joins are or aren't allowed based on specific entity types to prevent the creation of risky joins.
MetricFlow uses these specific join strategies:

- Left joins are the default when joining `fct` and `dim` models. A left join retains all rows from the "base" table and includes matching rows from the joined table.
- Full outer joins are used for queries that involve multiple `fct` models, so that all data points are captured even when rows in one model have no matching rows in another.
- Fan-out and chasm joins are restricted.

Refer to [SQL examples](#sql-examples) for more information on how MetricFlow handles joins in practice.

The following table identifies which joins are allowed based on specific entity types to prevent the creation of risky joins. This table primarily represents left joins unless otherwise specified. For scenarios involving multiple `fct` models, MetricFlow uses full outer joins.

| entity type - Table A | entity type - Table B | Join type |
|---------------------------|---------------------------|----------------------|
Expand All @@ -39,9 +43,19 @@ MetricFlow primarily uses left joins for joins, and restricts the use of fan-out
| Unique | Foreign | ❌ Fan-out (Not allowed) |
| Foreign | Primary | ✅ Left |
| Foreign | Unique | ✅ Left |
| Foreign | Foreign | ❌ Fan-out (Not allowed) |

### Semantic validation

### Example
MetricFlow performs semantic validation by running `explain` queries in the data platform to confirm that the generated SQL executes without errors. This validation includes:

- Verifying that all referenced tables and columns exist.
- Ensuring the data platform supports the SQL functions used, such as `date_diff(x, y)`.
- Checking for ambiguous joins or paths in multi-hop joins.

If validation fails, MetricFlow surfaces errors for users to address before executing the query.

## Example

The following example uses two semantic models with a common entity and shows a MetricFlow query that requires a join between the two semantic models. The two semantic models are:
- `transactions`
Expand Down Expand Up @@ -83,6 +97,50 @@ dbt sl query --metrics average_purchase_price --group-by metric_time,user_id__ty
mf query --metrics average_purchase_price --group-by metric_time,user_id__type # In dbt Core
```

#### SQL examples

These SQL examples show how MetricFlow handles both left join and full outer join scenarios in practice:

<Tabs>
<TabItem value="SQL example for left join">

Using the previous example for `transactions` and `user_signup` semantic models, this shows a left join between those two semantic models.

```sql
select
    transactions.user_id,
    -- aggregate the price so the select list is valid alongside the group by
    avg(transactions.purchase_price) as average_purchase_price,
    user_signup.type
from transactions
left outer join user_signup
    on transactions.user_id = user_signup.user_id
where transactions.purchase_price is not null
group by
    transactions.user_id,
    user_signup.type;
```
</TabItem>

<TabItem value="SQL example for outer joins">

If you have multiple `fct` models, let's say `sales` and `returns`, MetricFlow uses full outer joins to ensure all data points are captured.

This example shows a full outer join between the `sales` and `returns` semantic models.

```sql
select
    sales.user_id,
    sales.total_sales,
    returns.total_returns
from sales
full outer join returns
    on sales.user_id = returns.user_id
where sales.user_id is not null or returns.user_id is not null;
```

</TabItem>
</Tabs>

## Multi-hop joins

MetricFlow allows users to join measures and dimensions across a graph of entities by moving from one table to another within a graph. This is referred to as "multi-hop join".
Expand Down