
"Microbatch" incremental strategy #6194

Merged: 32 commits merged into current from jerco/microbatch-docs on Oct 3, 2024
Conversation

@jtcohen6 jtcohen6 (Collaborator) commented Oct 1, 2024

Resolves #6136

What are you changing in this pull request and why?

  • Add docs on microbatch incremental strategy (v1.9+)
  • Add microbatch to table of incremental strategies

Checklist

  • I have reviewed the Content style guide so my content adheres to these guidelines.
  • The topic I'm writing about is for specific dbt version(s) and I have versioned it according to the "version a whole page" and/or "version a block of content" guidelines.
  • I have added checklist item(s) to this list for anything that needs to happen before this PR is merged, such as "needs technical review" or "change base branch."
  • Add/remove page in website/sidebars.js
  • Provide a unique filename for new pages

@github-actions github-actions bot added the `content` (Improvements or additions to content) and `size: medium` (This change will take up to a week to address) labels on Oct 1, 2024
website/dbt-versions.js (outdated review thread, resolved)

Incremental models in dbt are a [materialization](/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models append, update, or replace rows in the existing table with the new data just processed. This can significantly reduce the time and resources required for your data transformations.

Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure.
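For illustration only, here is a minimal sketch of what a microbatch model could look like (the model, column name, and dates are hypothetical; the config keys are the ones described in this PR):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_occurred_at',   -- hypothetical column recording when each row happened
    begin='2023-10-01',               -- earliest batch to build on initial or full-refresh runs
    batch_size='day'                  -- each batch covers one day
) }}

select * from {{ ref('stg_events') }}
```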
Contributor
Suggested change
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure.
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure, instead of processing all of your data at once. Since each "batch" covers a time period, like a single day, updating large datasets becomes much faster and more efficient, especially when you're working with data that changes over time (like new records being added daily).

Collaborator Author
thanks for the suggestion! I will add more here


During standard incremental runs, dbt will process new batches and any earlier batches within the configured `lookback` window (with one query per batch).

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_lookback.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/>
Contributor
will fix img

@graciegoheen graciegoheen (Collaborator) left a comment
Thank you so much - these are looking really solid! Just a few suggestions


Incremental models in dbt are a [materialization](/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models append, update, or replace rows in the existing table with the new data just processed. This can significantly reduce the time and resources required for your data transformations.

Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure.
Collaborator
Suggested change
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure.
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` columns and `batch_size` you configure.


<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>

- `batch_size` (string, optional) - The granularity of your batches. The default is `day`, and currently that is the only granularity supported.
Collaborator
Thoughts on making this a table instead of a bulleted list?

Contributor
adding table


- `batch_size` (string, optional) - The granularity of your batches. The default is `day`, and currently that is the only granularity supported.
- `lookback` (integer, optional) - Process X batches prior to the latest bookmark, in order to capture late-arriving records. The default value is `0`.
- `begin` (date, optional) - The "beginning of time" for your data. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches. (It's a leap year!)
Collaborator
I don't think this is optional


<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_lookback.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/>

If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt-out of auto-filtering.
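For illustration, a minimal sketch of that opt-out (the upstream model name is hypothetical):

```sql
with unfiltered_upstream as (
    -- .render() keeps this reference unfiltered, even though the upstream model sets event_time
    select * from {{ ref('upstream_model').render() }}
)

select * from unfiltered_upstream
```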
Collaborator
Thoughts on nesting this under the sentence "dbt will automatically filter upstream inputs (source or ref) that define event_time, based on the lookback and batch_size configs for this model."

Collaborator Author
I tried both above / below. I'd rather stick this below the image, because it feels like an exception rather than the rule — but I could be wrong! If we expect people to need to do this very often, that's one more vote in favor of a more intuitive (less-ugly) syntax


### Available configs

- `event_time` - The column indicating "at what time did the row occur" (for both your microbatch model and its direct parents)
Collaborator
maybe a callout that `event_time` and `begin` need to be in UTC? (also true of the CLI args `--event-time-start`/`--event-time-end`)

Collaborator
or maybe having the "timezones" section is better!

select * from {{ ref('stg_events') }}
where my_time_field >= '2024-10-01 00:00:00'
and my_time_field < '2024-10-02 00:00:00'
) # this ref will be auto-filtered
Collaborator
Suggested change
) # this ref will be auto-filtered
)


Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure.

Where other incremental strategies operate only on "old" and "new" data, microbatch models treat each "batch" of data as a unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />. This is a powerful abstraction that makes it possible for dbt to run batches separately — in the future, concurrently — and to retry them independently.
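As a hypothetical illustration of rebuilding a slice of batches on its own, using the CLI args mentioned earlier in this thread (the model name and dates are made up):

```shell
# backfill only the daily batches in this UTC window, one query per batch
dbt run --select my_microbatch_model --event-time-start "2024-09-01" --event-time-end "2024-09-04"
```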
Collaborator
I know we have an example in the "## How does microbatch compare to other incremental strategies?" section - but maybe we should have one here as well?

Collaborator Author
good call! adding an example


dbt has changed the default materialization for incremental table merges from `temporary table` to `view`. For more information about this change and instructions for setting the configuration to a temp table, please read about [Snowflake temporary tables](/reference/resource-configs/snowflake-configs#temporary-tables).
The `merge` strategy is available in dbt-postgres and dbt-redshift beginning in dbt v1.6.
Collaborator
Do we need this? Or can we add it to the table somehow?

Collaborator Author
I'm just going to cut it because older versions are EOL anyway


## How does `microbatch` compare to other incremental strategies?

Most incremental models rely on the end user (you) to explicitly tell dbt what "new" means, in the context of each model, by writing a filter in an `is_incremental()` block. You are responsible for crafting this SQL in a way that queries `this` to check when the most recent record was last loaded, with an optional look-back window for late-arriving records. Other incremental strategies will control _how_ the data is being added into the table — whether append-only `insert`, `delete` + `insert`, `merge`, `insert overwrite`, etc. — but they all have this in common.
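For contrast, a minimal sketch of that hand-written pattern (the column name and look-back interval are hypothetical, and date arithmetic syntax varies by warehouse):

```sql
select * from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- only load rows newer than what's already in the table, minus a manual look-back window
  where event_occurred_at >= (
    select dateadd('day', -3, max(event_occurred_at)) from {{ this }}
  )
{% endif %}
```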
Collaborator
"Other incremental strategies will control how the data is being added into the table"

maybe add a callout of how microbatch chooses which of these "how"s is best?

@graciegoheen graciegoheen (Collaborator) commented Oct 1, 2024

Currently microbatch is supported on:

  • postgres
  • snowflake
  • bigquery
  • spark

with more adapters to come!

@graciegoheen graciegoheen (Collaborator) commented:
Currently, as microbatch is still in "beta", this functionality is gated behind an env var (it will be swapped to a behavior change flag ahead of the final 1.9 release), so you need to set `DBT_EXPERIMENTAL_MICROBATCH` to `true` in your project.
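For example, a hypothetical invocation with that opt-in set (the model name is made up):

```shell
# opt in to the beta microbatch behavior for this run
DBT_EXPERIMENTAL_MICROBATCH=true dbt run --select my_microbatch_model
```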

@jtcohen6 jtcohen6 mentioned this pull request Oct 2, 2024
| Config | Type | Description | Default |
|----------|------|---------------|---------|
| `event_time` | Column | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A |
| `begin` | Date | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A |
@mirnawong1 mirnawong1 (Contributor) commented Oct 3, 2024
assuming `begin` is required, right?

@mirnawong1 mirnawong1 (Contributor) left a comment
this looks great, i'm so excited by this feature! i've fixed up the table as it was rendering weird and made some other small tweaks.
(screenshot: the fixed table rendering)

i have two questions:

@@ -28,7 +29,7 @@ Release notes are grouped by month for both multi-tenant and virtual private clo
- **Enhancement**: In dbt Cloud Versionless, snapshots defined in SQL files can now use `config` defined in `schema.yml` YAML files. This update resolves the previous limitation that required snapshot properties to be defined exclusively in `dbt_project.yml` and/or a `config()` block within the SQL file. This will also be released in dbt Core 1.9.
- **Enhancement**: In dbt Cloud versionless, dbt infers a model's `primary_key` based on configured data tests and/or constraints within `manifest.json`. The inferred `primary_key` is visible in dbt Explorer and utilized by the dbt Cloud [compare changes](/docs/deploy/run-visibility#compare-tab) feature. This will also be released in dbt Core 1.9.
- **New**: In dbt Cloud Versionless, the `snapshot_meta_column_names` config allows for customizing the snapshot metadata columns. This feature allows an organization to align these automatically-generated column names with their conventions, and will be included in the upcoming dbt Core 1.9 release.
- **Enhancement**: dbt Cloud versionless began inferring a model's `primary_key` based on configured data tests and/or constraints within `manifest.json`. The inferred `primary_key` is visible in dbt Explorer and utilized by the dbt Cloud [compare changes](/docs/deploy/run-visibility#compare-tab) feature. This will also be released in dbt Core 1.9.

Contributor
Fixing this in a separate PR (#6236) because I think we actually want to remove the other one. This is because this line is followed by content that will be orphaned otherwise:

Read about the order in which dbt infers columns that can be used as the primary key of a model.

@dbeatty10 dbeatty10 merged commit 671be01 into current Oct 3, 2024
6 checks passed
@dbeatty10 dbeatty10 deleted the jerco/microbatch-docs branch October 3, 2024 19:53
Labels: `content` (Improvements or additions to content), `size: large` (This change will take more than a week to address and might require more than one person)

Successfully merging this pull request may close these issues.

[Core] Docs for microbatch incremental strategy
5 participants