"Microbatch" incremental strategy #6194

Merged: 32 commits from `jerco/microbatch-docs` into `current`, Oct 3, 2024. The diff below reflects changes from 2 of the 32 commits.

Commits:
- `67cdc83` Initialize microbatch docs (jtcohen6, Oct 1, 2024)
- `446e6e4` Merge branch 'current' into jerco/microbatch-docs (mirnawong1, Oct 1, 2024)
- `a9018b2` Update website/docs/docs/build/incremental-strategy.md (mirnawong1, Oct 1, 2024)
- `80dbb03` Update website/docs/docs/build/incremental-microbatch.md (mirnawong1, Oct 1, 2024)
- `6e306c3` Update website/docs/docs/build/incremental-microbatch.md (mirnawong1, Oct 1, 2024)
- `9de0ddd` Update website/docs/docs/build/incremental-microbatch.md (mirnawong1, Oct 1, 2024)
- `f92cd11` Merge branch 'current' into jerco/microbatch-docs (mirnawong1, Oct 1, 2024)
- `239f5bf` fold some of grace's feedback for jerco (mirnawong1, Oct 1, 2024)
- `651106a` Merge branch 'jerco/microbatch-docs' of github.com:dbt-labs/docs.getd… (mirnawong1, Oct 1, 2024)
- `53c65ac` Update release-notes.md (mirnawong1, Oct 1, 2024)
- `96dbe1a` Update website/docs/docs/dbt-versions/release-notes.md (mirnawong1, Oct 1, 2024)
- `a98b449` Update website/docs/docs/build/incremental-microbatch.md (mirnawong1, Oct 1, 2024)
- `b463f39` Update release-notes.md (mirnawong1, Oct 1, 2024)
- `ff05b8c` Merge branch 'current' into jerco/microbatch-docs (mirnawong1, Oct 2, 2024)
- `4d814fa` upload imgs white background (mirnawong1, Oct 2, 2024)
- `440b4e0` PR feedback (jtcohen6, Oct 2, 2024)
- `6cc4bcb` Merge branch 'current' into jerco/microbatch-docs (jtcohen6, Oct 2, 2024)
- `a3bfafb` Self review (jtcohen6, Oct 2, 2024)
- `5bb463d` Redshift not yet (jtcohen6, Oct 2, 2024)
- `e0b1efb` Merge remote-tracking branch 'origin/jerco/microbatch-docs' into jerc… (jtcohen6, Oct 2, 2024)
- `b94c74f` Update website/docs/docs/dbt-versions/release-notes.md (jtcohen6, Oct 3, 2024)
- `371d92d` Merge branch 'current' into jerco/microbatch-docs (mirnawong1, Oct 3, 2024)
- `03e6768` update table and tweaks (mirnawong1, Oct 3, 2024)
- `e959430` Merge branch 'current' into jerco/microbatch-docs (mirnawong1, Oct 3, 2024)
- `0070297` add microbatch (mirnawong1, Oct 3, 2024)
- `68b5734` Merge branch 'current' into jerco/microbatch-docs (runleonarun, Oct 3, 2024)
- `cefaba3` Update dbt-versions.js (runleonarun, Oct 3, 2024)
- `e2373e4` Merge branch 'current' into jerco/microbatch-docs (runleonarun, Oct 3, 2024)
- `2151ecf` Merge branch 'current' into jerco/microbatch-docs (dbeatty10, Oct 3, 2024)
- `9bfde1c` Restore deleted release note for inferring `primary_key` (dbeatty10, Oct 3, 2024)
- `5ed2149` Merge branch 'current' into jerco/microbatch-docs (mirnawong1, Oct 3, 2024)
- `e1694ab` Merge branch 'current' into jerco/microbatch-docs (mirnawong1, Oct 3, 2024)
4 changes: 4 additions & 0 deletions website/dbt-versions.js
@@ -42,6 +42,10 @@ exports.versions = [
 * @property {string} lastVersion The last version the page is visible in the sidebar
 */
 exports.versionedPages = [
+  {
+    page: "docs/build/incremental-microbatch",
+    lastVersion: "1.9",
+  },
   {
     page: "reference/resource-configs/target_database",
     lastVersion: "1.8",
145 changes: 145 additions & 0 deletions website/docs/docs/build/incremental-microbatch.md
@@ -0,0 +1,145 @@
---
title: "About microbatch incremental models"
description: "Learn about the 'microbatch' strategy for incremental models."
id: "incremental-microbatch"
---

:::info Microbatch <Lifecycle status="beta" />

The `microbatch` strategy is available in [dbt Cloud Versionless](/docs/dbt-versions/upgrade-dbt-version-in-cloud#versionless) and dbt Core v1.9.

Read and participate in the discussion: [dbt-core#10672](https://github.com/dbt-labs/dbt-core/discussions/10672)

:::

## What is microbatch?

Incremental models in dbt are a [materialization](https://docs.getdbt.com/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models append, update, or replace rows in the existing table with the new data just processed. This can significantly reduce the time and resources required for your data transformations.

Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure.
> **Review comment (contributor), suggested change:** extend the paragraph with "…instead of processing all of your data at once. Because each batch covers a time period, such as a single day, updating large datasets becomes much faster and more efficient, especially for data that changes over time (like new records being added daily)."
>
> **Author:** Thanks for the suggestion! I will add more here.
>
> **Review comment (collaborator), suggested change:** base the sentence on "the `event_time` column and `batch_size` you configure."

Where other incremental strategies operate only on "old" and "new" data, microbatch models treat each "batch" of data as a unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />. This is a powerful abstraction that makes it possible for dbt to run batches separately — in the future, concurrently — and to retry them independently.
> **Review comment (collaborator):** I know we have an example in the "How does `microbatch` compare to other incremental strategies?" section, but maybe we should have one here as well?
>
> **Author:** Good call! Adding an example.

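A minimal sketch of such a model (names are illustrative, mirroring the fuller example later on this page):

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_occured_at',
        batch_size='day',
        begin='2020-01-01'
    )
}}

-- dbt splits each run into one query per daily batch,
-- auto-filtering this ref on the upstream model's event_time
select * from {{ ref('stg_events') }}
```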

### Available configs

- `event_time` - The column indicating "at what time did the row occur" (for both your microbatch model and its direct parents)
> **Review comment (collaborator):** Maybe add a callout that `event_time` and `begin` need to be in UTC? (Also true of the CLI args `--event-time-start`/`--event-time-end`.)
>
> **Review comment (collaborator):** Or maybe having the "Timezones" section is better!

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>

- `batch_size` (string, optional) - The granularity of your batches. The default is `day`, and currently that is the only granularity supported.
> **Review comment (collaborator):** Thoughts on making this a table instead of a bulleted list?
>
> **Contributor:** Adding table.

- `lookback` (integer, optional) - Process X batches prior to the latest bookmark to capture late-arriving records. The default value is `0`. For example, with `batch_size: day` and `lookback: 2`, a run on `2024-10-03` also reprocesses the batches for `2024-10-01` and `2024-10-02`.
- `begin` (date, optional) - The "beginning of time" for your data. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches. (It's a leap year!)
> **Review comment (collaborator):** I don't think this is optional.

As a best practice, we recommend configuring `full_refresh: False` on microbatch models so that they ignore invocations with the `--full-refresh` flag. If you need to reprocess historical data, do so with a targeted backfill.

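A sketch of that recommendation as YAML (the model name is a placeholder):

```yaml
models:
  - name: my_microbatch_model
    config:
      full_refresh: false
```
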
### Usage

**You write your model query to process (read and return) one day of data**. You don’t need to think about `is_incremental` filtering or DML (upserting/merging/replacing) - we take care of that for you.

dbt will then evaluate which batches need to be loaded, break them up into a SQL query per batch, and load each one independently.

dbt will automatically filter upstream inputs (`source` or `ref`) that define `event_time`, based on the `lookback` and `batch_size` configs for this model.

During standard incremental runs, dbt will process new batches and any batches within the configured `lookback` window, relative to the current timestamp (with one query per batch).

> **Review comment (collaborator), suggested change:** clarify that batches are selected according to the current timestamp as well as the configured `lookback`.

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_lookback.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/>
> **Review comment (contributor):** Will fix img.


If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt-out of auto-filtering.
> **Review comment (collaborator):** Thoughts on nesting this under the sentence "dbt will automatically filter upstream inputs (`source` or `ref`) that define `event_time`, based on the `lookback` and `batch_size` configs for this model."?
>
> **Author:** I tried both above / below. I'd rather stick this below the image, because it feels like an exception rather than the rule - but I could be wrong! If we expect people to need to do this very often, that's one more vote in favor of a more intuitive (less-ugly) syntax.

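For illustration, the opt-out might look like this (a sketch; `upstream_model` is a placeholder name):

```sql
with unfiltered_upstream as (
    -- .render() opts this reference out of microbatch auto-filtering
    select * from {{ ref('upstream_model').render() }}
)

select * from unfiltered_upstream
```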

dbt will evaluate which batches need to be loaded **by processing the current batch (current_timestamp) + any batches in your configured lookback**, break them up into a SQL query per batch, and load them all independently.
> **Author (self-review):** Redundant.


### Backfills

Whether you're fixing erroneous source data or retroactively applying a change in business logic, you may need to reprocess a large amount of historical data.

Backfilling a microbatch model is as simple as selecting it to run or build, and specifying a "start" and "end" for `event_time`. As always, dbt will process the batches between the start and end as independent queries.

```bash
dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"
```

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_backfill.png" title="Backfill batches by specifying an event-time start and end"/>
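
You can also scope a backfill to particular models with standard node selection (a sketch; the model name `sessions` is illustrative):

```bash
dbt run --select sessions --event-time-start "2024-09-01" --event-time-end "2024-09-04"
```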

### Retry

If one or more of your batches fail, you can use `dbt retry` to reprocess _only_ the failed batches.

![Partial retry](https://github.com/user-attachments/assets/f94c4797-dcc7-4875-9623-639f70c97b8f)
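
A sketch of that flow (the selector is illustrative; `dbt retry` picks up from the previous invocation's results):

```bash
# Suppose some of this run's batches fail
dbt run --select sessions

# Reprocess only the failed batches from the previous invocation
dbt retry
```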

### Timezones

For now, dbt assumes that all values supplied are in UTC:

- `event_time`
- `begin`
- `--event-time-start` and `--event-time-end`

While we may consider adding support for custom timezones in the future, we also believe that defining these values in UTC makes everyone's lives easier.

## How does `microbatch` compare to other incremental strategies?

Most incremental models rely on the end user (you) to explicitly tell dbt what "new" means, in the context of each model, by writing a filter in an `is_incremental()` block. You are responsible for crafting this SQL in a way that queries `this` to check when the most recent record was last loaded, with an optional look-back window for late-arriving records. Other incremental strategies will control _how_ the data is being added into the table — whether append-only `insert`, `delete` + `insert`, `merge`, `insert overwrite`, etc — but they all have this in common.
> **Review comment (collaborator):** "Other incremental strategies will control how the data is being added into the table" - maybe add a callout of how microbatch chooses which of these "how"s is best?

As an example:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='date_day'
    )
}}

select * from {{ ref('stg_events') }}

{% if is_incremental() %}
-- this filter will only be applied on an incremental run
-- add a lookback window of 3 days to account for late-arriving records
where date_day >= (select {{ dbt.dateadd("day", -3, "max(date_day)") }} from {{ this }})
{% endif %}

```

For this incremental model:

- "New" records are those with a `date_day` greater than the maximum `date_day` that has previously been loaded
- The lookback window is 3 days
- When there are new records for a given `date_day`, the existing data for `date_day` is deleted and the new data is inserted

Let’s take our same example from before, and instead use the new `microbatch` incremental strategy:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_occured_at',
        batch_size='day',
        lookback=3,
        begin='2020-01-01',
        full_refresh=false
    )
}}

select * from {{ ref('stg_events') }} -- this ref will be auto-filtered
```

Where you’ve also set an `event_time` for the model’s direct parents - in this case `stg_events`:

```yaml
models:
  - name: stg_events
    config:
      event_time: my_time_field
```

And that’s it! When you run the model, each batch templates a separate query. The batch for `2024-10-01` would template:

```sql
select * from (
    select * from {{ ref('stg_events') }}
    where my_time_field >= '2024-10-01 00:00:00'
    and my_time_field < '2024-10-02 00:00:00'
)
```

> **Review comment (collaborator):** Should we show the compiled SQL for the ref statement here?
>
> **Author:** Yes... obviously.
>
> **Review comment (collaborator), suggested change:** remove the trailing `# this ref will be auto-filtered` comment.
1 change: 1 addition & 0 deletions website/docs/docs/build/incremental-models-overview.md
@@ -42,4 +42,5 @@ Transaction management, a process used in certain data platforms, ensures that a
## Related docs
- [Incremental models](/docs/build/incremental-models) to learn how to configure incremental models in dbt.
- [Incremental strategies](/docs/build/incremental-strategy) to understand how dbt implements incremental models on different databases.
+- [Microbatch](/docs/build/incremental-microbatch) <Lifecycle status="beta" /> to understand a new incremental strategy intended for efficient and resilient processing of very large time-series datasets.
- [Materializations best practices](/best-practices/materializations/1-guide-overview) to learn about the best practices for using materializations in dbt.
37 changes: 19 additions & 18 deletions website/docs/docs/build/incremental-strategy.md
@@ -10,32 +10,33 @@ There are various strategies to implement the concept of incremental materializa…

 * The reliability of your `unique_key`.
 * The support of certain features in your data platform.

-An optional `incremental_strategy` config is provided in some adapters that controls the code that dbt uses
-to build incremental models.
+An optional `incremental_strategy` config is provided in some adapters that controls the code that dbt uses to build incremental models.

-### Supported incremental strategies by adapter
-
-Click the name of the adapter in the below table for more information about supported incremental strategies.
-
-The `merge` strategy is available in dbt-postgres and dbt-redshift beginning in dbt v1.6.
-
-| data platform adapter | `append` | `merge` | `delete+insert` | `insert_overwrite` |
-|-----------------------------------------------------------------------------------------------------|:--------:|:-------:|:---------------:|:------------------:|
-| [dbt-postgres](/reference/resource-configs/postgres-configs#incremental-materialization-strategies) | ✅ | ✅ | ✅ | |
-| [dbt-redshift](/reference/resource-configs/redshift-configs#incremental-materialization-strategies) | ✅ | ✅ | ✅ | |
-| [dbt-bigquery](/reference/resource-configs/bigquery-configs#merge-behavior-incremental-models) | | ✅ | | ✅ |
-| [dbt-spark](/reference/resource-configs/spark-configs#incremental-models) | ✅ | ✅ | | ✅ |
-| [dbt-databricks](/reference/resource-configs/databricks-configs#incremental-models) | ✅ | ✅ | | ✅ |
-| [dbt-snowflake](/reference/resource-configs/snowflake-configs#merge-behavior-incremental-models) | ✅ | ✅ | ✅ | |
-| [dbt-trino](/reference/resource-configs/trino-configs#incremental) | ✅ | ✅ | ✅ | |
-| [dbt-fabric](/reference/resource-configs/fabric-configs#incremental) | ✅ | | ✅ | |
-
-:::note Snowflake Configurations
-
-dbt has changed the default materialization for incremental table merges from `temporary table` to `view`. For more information about this change and instructions for setting the configuration to a temp table, please read about [Snowflake temporary tables](/reference/resource-configs/snowflake-configs#temporary-tables).
-
-:::
+:::callout Microbatch <Lifecycle status="beta" />
+
+The [`microbatch` incremental strategy](/docs/build/incremental-microbatch) is intended for large time-series datasets. dbt will process the incremental model in multiple queries (or "batches") based on a configured `event_time` column. Depending on the volume and nature of your data, this can be more efficient and resilient than using a single query for adding new data.
+
+:::
+
+### Supported incremental strategies by adapter
+
+This table represents the availability of each incremental strategy.
+
+Click the name of the adapter in the below table for more information about supported incremental strategies.
+
+The `merge` strategy is available in dbt-postgres and dbt-redshift beginning in dbt v1.6.
+
+| data platform adapter | `append` | `merge` | `delete+insert` | `insert_overwrite` | `microbatch` |
+|-----------------------------------------------------------------------------------------------------|:--------:|:-------:|:---------------:|:------------------:|:------------:|
+| [dbt-postgres](/reference/resource-configs/postgres-configs#incremental-materialization-strategies) | ✅ | ✅ | ✅ | | ✅ |
+| [dbt-redshift](/reference/resource-configs/redshift-configs#incremental-materialization-strategies) | ✅ | ✅ | ✅ | | ✅ |
+| [dbt-bigquery](/reference/resource-configs/bigquery-configs#merge-behavior-incremental-models) | | ✅ | | ✅ | ✅ |
+| [dbt-spark](/reference/resource-configs/spark-configs#incremental-models) | ✅ | ✅ | | ✅ | ✅ |
+| [dbt-databricks](/reference/resource-configs/databricks-configs#incremental-models) | ✅ | ✅ | | ✅ | |
+| [dbt-snowflake](/reference/resource-configs/snowflake-configs#merge-behavior-incremental-models) | ✅ | ✅ | ✅ | | ✅ |
+| [dbt-trino](/reference/resource-configs/trino-configs#incremental) | ✅ | ✅ | ✅ | | |
+| [dbt-fabric](/reference/resource-configs/fabric-configs#incremental) | ✅ | | ✅ | | |
+| [dbt-athena](/reference/resource-configs/athena-configs#incremental-models) | ✅ | ✅ | | ✅ | |

> **Review comment (collaborator),** on the `merge` availability note: Do we need this? Or can we add it to the table somehow?
>
> **Author:** I'm just going to cut it because older versions are EOL anyway.

### Configuring incremental strategy

1 change: 1 addition & 0 deletions website/sidebars.js
@@ -421,6 +421,7 @@ const sidebarSettings = {
"docs/build/incremental-models-overview",
"docs/build/incremental-models",
"docs/build/incremental-strategy",
"docs/build/incremental-microbatch",
],
},
],
(Binary image files, the microbatch diagrams referenced above: `event_time.png`, `microbatch_lookback.png`, and `microbatch_backfill.png`, are not displayed.)