Created new section called Parallel batch execution #6589

Merged Dec 7, 2024 (83 commits)
Changes from 66 commits
9db07cc
Created new section called Parallel batch execution
nataliefiann Dec 4, 2024
2f5ec6e
Merge branch 'current' into nfiann-rbip
mirnawong1 Dec 4, 2024
ca9e4e9
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
8fd0faa
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
ca3e04c
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
0cb99ba
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
099a2bc
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
40c57e8
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
2491176
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
e9a42d7
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
efbeae5
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
ed3132e
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
3e207c0
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
cbed4d6
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
6d9e7c1
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
f349a16
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
a8e56df
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
7c89a59
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
c17a31a
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 4, 2024
5e24a5a
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
7ae1121
Merge branch 'current' into nfiann-rbip
nataliefiann Dec 5, 2024
4da02b1
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
bc2adaf
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
b30107c
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
1460526
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
ed0fa04
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
dadbe4a
Merge branch 'current' into nfiann-rbip
nataliefiann Dec 5, 2024
6a7af2c
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
3c9539b
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
9a4b3c0
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
4cfa151
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
f989037
Update incremental-microbatch.md
mirnawong1 Dec 5, 2024
28e11af
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 5, 2024
e4c9bf2
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 5, 2024
be2d332
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 5, 2024
7384dd9
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
979f689
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
690286f
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
dc700ab
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
5f41e4c
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
e54777a
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
08c4b5c
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
fd5351f
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
ab12f7b
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
32144f5
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
4b54892
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
8f38762
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
864f52d
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
e060242
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
504fb91
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
113b356
Merge branch 'current' into nfiann-rbip
mirnawong1 Dec 5, 2024
dfc6555
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
33f0166
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
97d96c0
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
ff16511
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 5, 2024
b06f3ea
Merge branch 'current' into nfiann-rbip
mirnawong1 Dec 6, 2024
7586373
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 6, 2024
c9c1012
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 6, 2024
d1f94cc
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
8504487
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
14ccc71
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
c7f6642
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
349f912
Merge branch 'current' into nfiann-rbip
nataliefiann Dec 6, 2024
2227b72
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 6, 2024
f4b3b15
Merge branch 'current' into nfiann-rbip
mirnawong1 Dec 6, 2024
292aef4
Update website/docs/docs/build/incremental-microbatch.md
mirnawong1 Dec 6, 2024
722851f
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
7967a82
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
78d0dd6
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
37a8ba8
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
722778b
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
685cbd1
Added nested bullet
nataliefiann Dec 6, 2024
ed8b85b
Merge branch 'nfiann-rbip' of https://github.com/dbt-labs/docs.getdbt…
nataliefiann Dec 6, 2024
19b4095
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
f102e15
Update website/docs/docs/build/incremental-microbatch.md
nataliefiann Dec 6, 2024
d873999
Merge branch 'current' into nfiann-rbip
nataliefiann Dec 6, 2024
7f824a4
Updated upgrading to v1.9 guide to included parallel batch execution …
nataliefiann Dec 6, 2024
261f781
Created concurrent batches page (#6601)
nataliefiann Dec 6, 2024
6f84d2b
Apply suggestions from code review
runleonarun Dec 6, 2024
ef3ccc3
Update incremental-microbatch.md
runleonarun Dec 6, 2024
3926908
Update incremental-microbatch.md
runleonarun Dec 6, 2024
6c66d84
Apply suggestions from code review
runleonarun Dec 6, 2024
258a66d
Update incremental-microbatch.md
runleonarun Dec 6, 2024
134 changes: 128 additions & 6 deletions website/docs/docs/build/incremental-microbatch.md
@@ -179,12 +179,15 @@ It does not matter whether the table already contains data for that day. Given t

Several configurations are relevant to microbatch models, and some are required:

| Config | Type | Description | Default |
|----------|------|---------------|---------|
| [`event_time`](/reference/resource-configs/event-time) | Column (required) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A |
| [`begin`](/reference/resource-configs/begin) | Date (required) | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A |
| [`batch_size`](/reference/resource-configs/batch-size) | String (required) | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year` | N/A |
| [`lookback`](/reference/resource-configs/lookback) | Integer (optional) | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` |

| Config | Description | Default | Type | Required |
|----------|---------------|---------|------|---------|
| [`event_time`](/reference/resource-configs/event-time) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | Column | Required |
| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
| `concurrent_batches` | An override for whether batches run concurrently (at the same time) or sequentially (one after the other). | `None` | Boolean | Optional |


<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>
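
The batch count in the `begin` example can be checked with a few lines of Python. This is only an arithmetic sketch for the daily grain, not how dbt computes batches internally; the helper name `daily_batch_starts` is made up for illustration.

```python
from datetime import date, timedelta


def daily_batch_starts(begin: date, run_date: date) -> list[date]:
    """Return the start date of every daily batch from `begin` through `run_date`, inclusive.

    Hypothetical helper for illustration; dbt's own batch calculation is not shown here.
    """
    n_days = (run_date - begin).days
    return [begin + timedelta(days=i) for i in range(n_days + 1)]


batches = daily_batch_starts(date(2023, 10, 1), date(2024, 10, 1))
print(len(batches))  # 367: 366 full days (2024 is a leap year) plus the batch for "today"
```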

@@ -280,6 +283,125 @@ For now, dbt assumes that all values supplied are in UTC:

While we may consider adding support for custom time zones in the future, we also believe that defining these values in UTC makes everyone's lives easier.

## Parallel batch execution

The microbatch strategy offers the benefit of updating a model in smaller, more manageable batches.

Parallel batch execution means that multiple batches are processed at the same time, instead of one after the other (sequentially), which makes your microbatch models process faster.

dbt automatically detects whether a batch can be run in parallel in most cases, which means you don’t need to configure this setting. However, the `concurrent_batches` config is available as an override (not a gate), allowing you to specify whether batches should or shouldn’t be run in parallel in specific cases.

For example, if you have a microbatch model with 12 batches, those batches can run in parallel. The number of batches running at the same time is limited by the number of [available threads](/docs/running-a-dbt-project/using-threads).
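
To picture how a thread limit caps concurrency, here is a minimal Python sketch that pushes 12 pretend batches through a pool of 4 workers. It is an analogy only: `THREADS`, `BATCHES`, and `process_batch` are hypothetical names, the value 4 is an assumed setting, and the sketch ignores the other conditions dbt applies before running a batch in parallel.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

THREADS = 4  # assumed stand-in for the project's `threads` setting
BATCHES = [f"batch_{i:02d}" for i in range(1, 13)]  # 12 hypothetical batches

_lock = threading.Lock()
_running = 0
peak_running = 0


def process_batch(name: str) -> str:
    """Pretend to build one batch while tracking how many run at once."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    time.sleep(0.01)  # placeholder for the real work
    with _lock:
        _running -= 1
    return name


# The pool never runs more than THREADS batches at the same time.
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    finished = list(pool.map(process_batch, BATCHES))

print(len(finished), "batches finished; peak concurrency:", peak_running)
```

All 12 batches complete, but the observed peak concurrency never exceeds the worker limit.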

### Prerequisites

To enable parallel execution, you must meet the following conditions:
- You use one of the following supported adapters:
  - Snowflake
  - Databricks
  - More adapters coming soon!
- You meet the [additional conditions](#how-parallel-batch-execution-works) described in the next section.

### How parallel batch execution works

A batch can only run in parallel if:

| Step | Condition | Parallel execution | Sequential execution|
| ---- | ---------------| :------------------: | :----------: |
| 1. | **Not** the first batch | ✅ | - |
| 2. | **Not** the last batch | ✅ | - |
| 3. | [Adapter supports](#prerequisites) parallel batches | ✅ | - |


After checking conditions 1, 2, and 3 in the previous table, and if the `concurrent_batches` value isn't set, dbt auto-detects whether the model invokes the [`{{ this }}`](/reference/dbt-jinja-functions/this) Jinja function. If the model references `{{ this }}`, the batches run sequentially, since `{{ this }}` represents the database relation of the current model and a batch that reads from it may depend on data written by earlier batches.

Otherwise, if the `concurrent_batches` value isn't set _and_ `{{ this }}` isn't detected (and other conditions are met), the batches will run in parallel.
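
The order of checks described above can be summarized as a small decision function. This is a sketch of the documented behavior, not dbt-core's actual code; the function and parameter names are invented for illustration.

```python
from typing import Optional


def should_run_in_parallel(
    *,
    is_first_batch: bool,
    is_last_batch: bool,
    adapter_supports_parallel: bool,
    concurrent_batches: Optional[bool] = None,
    model_uses_this: bool = False,
) -> bool:
    """Hypothetical summary of the documented checks (not dbt-core's implementation)."""
    # Conditions 1-3: first batch, last batch, or unsupported adapter means sequential.
    if is_first_batch or is_last_batch or not adapter_supports_parallel:
        return False
    # An explicit `concurrent_batches` value overrides auto-detection.
    if concurrent_batches is not None:
        return concurrent_batches
    # Auto-detection: referencing `{{ this }}` forces sequential execution.
    return not model_uses_this


middle = dict(is_first_batch=False, is_last_batch=False, adapter_supports_parallel=True)
print(should_run_in_parallel(**middle))                           # True
print(should_run_in_parallel(**middle, model_uses_this=True))     # False
print(should_run_in_parallel(**middle, model_uses_this=True,
                             concurrent_batches=True))            # True
```

Note that the override only applies once conditions 1 to 3 have passed, which is why `concurrent_batches` acts as an override rather than a gate.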
Contributor Author

@nataliefiann nataliefiann Dec 5, 2024

Hiya @QMalcolm

I also wanted to ask about line 317. In the git issue, it advises:

"If neither (1) nor (2) was hit, then we check if a config concurrent_batches is set for the model. If the value for that config is True then we run the batches in parallel, if False we run the batches sequentially.

If however concurrent_batches is None (i.e. not set), then we check if the model jinja contains a reference to this. If it references this then we run the batches sequentially. Otherwise, we run them in parallel."

Whereas, from your notes, it advises:

"After [1], [2], and [3] we check if the this jinja function is invoked in the model. If this is used, then the batch will be run sequentially, as it may be that your batch depends on the existence of prior batches. If this isn't used, the batch will be run in parallel.

You can override the check for this by setting concurrent_batches to either True or False. If set to False, the batch will be run sequentially. If set to True the batch will be run in parallel (assuming [1], [2], and [3])..."

I just wanted to double check if they meant the same thing as I was a little confused.

Kind Regards
Natalie

Contributor

@QMalcolm QMalcolm Dec 5, 2024

Great question! Happy to clarify ❤️ Also sorry for the confusion 😅

Those two things are saying the same thing. The first one walks through the actual logical order of operations that happens in core. The second is framed from how the end user should think about it. Basically, the user in ~98% of cases shouldn't think about or set concurrent_batches. Core does its best to automatically detect whether a batch can be run in parallel or not. The config concurrent_batches is an "escape hatch" to allow the user to say "I actually think these batches should/shouldn't be run in parallel". concurrent_batches isn't a gate but instead an override.

### Parallel or sequential execution

Choosing between parallel batch execution and sequential processing depends on the specific requirements of your use case.

- Parallel batch execution is faster but requires logic that's independent of batch execution order. For example, if you're developing a data pipeline for a system that processes user transactions in batches, each batch is executed in parallel for better performance. However, the logic used to process each transaction shouldn't depend on the order of how batches are executed or completed.
- Sequential processing is slower but essential for calculations like [cumulative metrics](/docs/build/cumulative) in microbatch models. It processes data in the correct order, allowing each step to build on the previous one.
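
The difference between the two cases can be seen in a toy Python example: per-batch totals come out the same in any processing order, while running (cumulative) totals do not. The data and function names are made up for illustration and are unrelated to dbt's internals.

```python
from itertools import accumulate

# Hypothetical rows per daily batch.
batches = {"2024-01-01": [10, 5], "2024-01-02": [7], "2024-01-03": [3, 4]}


def batch_totals(order):
    """Per-batch sums: each result depends only on that batch's rows."""
    return {day: sum(batches[day]) for day in order}


def running_totals(order):
    """Cumulative sums: each result depends on every batch processed before it."""
    sums = [sum(batches[day]) for day in order]
    return dict(zip(order, accumulate(sums)))


in_order = sorted(batches)
out_of_order = list(reversed(in_order))

print(batch_totals(in_order) == batch_totals(out_of_order))      # True
print(running_totals(in_order) == running_totals(out_of_order))  # False
```

Logic like `batch_totals` is safe to run in parallel; logic like `running_totals` needs sequential execution.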

<!-- You can override the check for `this` by setting `concurrent_batches` to either `True` or `False`. If set to `False`, the batch will be run sequentially. If set to `True` the batch will be run in parallel (assuming [1], [2], and [3])
Contributor

@nataliefiann checking w quigley

Collaborator

I think this can be removed

To override the `this` check, use the `concurrent_batches` configuration:

<File name='dbt_project.yml'>

```yaml
models:
  +concurrent_batches: True
```

</File>

or:

<File name='models/my_model.sql'>

```sql
{{
config(
materialized='incremental',
concurrent_batches=True,
incremental_strategy='microbatch'

...
)
}}

select ...
```

</File>
-->

### Configure `concurrent_batches`

By default, dbt auto-detects whether batches can run in parallel for microbatch models, and this works correctly in most cases. However, you can override dbt's detection by setting the `concurrent_batches` config in your `dbt_project.yml` or model `.sql` file to specify parallel or sequential execution, provided you meet all the [conditions](#prerequisites):

<Tabs>
<TabItem value="yaml" label="dbt_project.yml">

<File name='dbt_project.yml'>

```yaml
models:
  +concurrent_batches: True # value set to True to run batches in parallel
```

</File>
</TabItem>

<TabItem value="sql" label="my_model.sql">

<File name='models/my_model.sql'>

```sql
{# concurrent_batches: value set to True to run batches in parallel #}
{{
  config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day',
    concurrent_batches=True,
    ...
  )
}}

select ...
```
</File>
</TabItem>
</Tabs>

Depending on your use case, configuring your microbatch models to run in parallel offers faster processing than running batches sequentially.

## How `microbatch` compares to other incremental strategies

Most incremental models rely on the end user (you) to explicitly tell dbt what "new" means, in the context of each model, by writing a filter in an `{% if is_incremental() %}` conditional block. You are responsible for crafting this SQL in a way that queries [`{{ this }}`](/reference/dbt-jinja-functions/this) to check when the most recent record was last loaded, with an optional look-back window for late-arriving records.