diff --git a/website/docs/docs/build/incremental-microbatch.md b/website/docs/docs/build/incremental-microbatch.md index 55c7dc92367..fad3ac17a66 100644 --- a/website/docs/docs/build/incremental-microbatch.md +++ b/website/docs/docs/build/incremental-microbatch.md @@ -25,7 +25,8 @@ Incremental models in dbt are a [materialization](/docs/build/materializations) Microbatch is an incremental strategy designed for large time-series datasets: - It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to define time-based ranges for filtering. Set the `event_time` column for your microbatch model and its direct parents (upstream models). Note, this is different to `partition_by`, which groups rows into partitions. - It complements, rather than replaces, existing incremental strategies by focusing on efficiency and simplicity in batch processing. -- Unlike traditional incremental strategies, microbatch doesn't require implementing complex conditional logic for [backfilling](#backfills). +- Unlike traditional incremental strategies, microbatch enables you to [reprocess failed batches](/docs/build/incremental-microbatch#retry), auto-detect [parallel batch execution](#parallel-batch-execution), and eliminate the need to implement complex conditional logic for [backfilling](#backfills). + - Note, microbatch might not be the best strategy for all use cases. Consider other strategies for use cases such as not having a reliable `event_time` column or if you want more control over the incremental logic. Read more in [How `microbatch` compares to other incremental strategies](#how-microbatch-compares-to-other-incremental-strategies). ### How microbatch works @@ -179,12 +180,15 @@ It does not matter whether the table already contains data for that day. Given t Several configurations are relevant to microbatch models, and some are required: -| Config | Type | Description | Default | -|----------|------|---------------|---------| -| [`event_time`](/reference/resource-configs/event-time) | Column (required) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | -| [`begin`](/reference/resource-configs/begin) | Date (required) | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | -| [`batch_size`](/reference/resource-configs/batch-size) | String (required) | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year` | N/A | -| [`lookback`](/reference/resource-configs/lookback) | Integer (optional) | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | + +| Config | Description | Default | Type | Required | +|----------|---------------|---------|------|---------| +| [`event_time`](/reference/resource-configs/event-time) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | Column | Required | +| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required | +| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year` | N/A | String | Required | +| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional | +| `concurrent_batches` | An override for whether batches run concurrently (at the same time) or sequentially (one after the other). | `None` | Boolean | Optional | + @@ -280,7 +284,127 @@ For now, dbt assumes that all values supplied are in UTC: While we may consider adding support for custom time zones in the future, we also believe that defining these values in UTC makes everyone's lives easier. -## How `microbatch` compares to other incremental strategies? +## Parallel batch execution + +The microbatch strategy offers the benefit of updating a model in smaller, more manageable batches. Depending on your use case, configuring your microbatch models to run in parallel offers faster processing, in comparison to running batches sequentially. + +Parallel batch execution means that multiple batches are processed at the same time, instead of one after the other (sequentially) for faster processing of your microbatch models. + +dbt automatically detects whether a batch can be run in parallel in most cases, which means you don’t need to configure this setting. However, the `concurrent_batches` config is available as an override (not a gate), allowing you to specify whether batches should or shouldn’t be run in parallel in specific cases. + +For example, if you have a microbatch model with 12 batches, you can execute those batches to run in parallel. Specifically they'll run in parallel limited by the number of [available threads](/docs/running-a-dbt-project/using-threads). + +### Prerequisites + +To enable parallel execution, you must: + +- Use a supported adapter: + - Snowflake + - Databricks + - More adapters coming soon! + - We'll be continuing to test and add concurrency support for adapters. This means that some adapters might get concurrency support _after_ the 1.9 initial release. + +- Meet [additional conditions](#how-parallel-batch-execution-works) described in the following section. + +### How parallel batch execution works + +A batch can only run in parallel if all of these conditions are met: + +| Condition | Parallel execution | Sequential execution| +| ---------------| :------------------: | :----------: | +| **Not** the first batch | ✅ | - | +| **Not** the last batch | ✅ | - | +| [Adapter supports](#prerequisites) parallel batches | ✅ | - | + + +After checking for the conditions in the previous table — and if `concurrent_batches` value isn't set, dbt will intelligently auto-detect if the model invokes the [`{{ this }}`](/reference/dbt-jinja-functions/this) Jinja function. If it references `{{ this }}`, the batches will run sequentially since `{{ this }}` represents the database of the current model and referencing the same relation causes conflict. + +Otherwise, if `{{ this }}` isn't detected (and other conditions are met), the batches will run in parallel, which can be overriden when you [set a value for `concurrent_batches`](/reference/resource-properties/concurrent_batches). + +### Parallel or sequential execution + +Choosing between parallel batch execution and sequential processing depends on the specific requirements of your use case. + +- Parallel batch execution is faster but requires logic independent of batch execution order. For example, if you're developing a data pipeline for a system that processes user transactions in batches, each batch is executed in parallel for better performance. However, the logic used to process each transaction shouldn't depend on the order of how batches are executed or completed. +- Sequential processing is slower but essential for calculations like [cumulative metrics](/docs/build/cumulative) in microbatch models. It processes data in the correct order, allowing each step to build on the previous one. + + + +### Configure `concurrent_batches` + +By default, dbt auto-detects whether batches can run in parallel for microbatch models, and this works correctly in most cases. However, you can override dbt's detection by setting the [`concurrent_batches` config](/reference/resource-properties/concurrent_batches) in your `dbt_project.yml` or model `.sql` file to specify parallel or sequential execution, given you meet all the [conditions](#prerequisites): + + + + + + +```yaml +models: + +concurrent_batches: true # value set to true to run batches in parallel +``` + + + + + + + + +```sql +{{ + config( + materialized='incremental', + incremental_strategy='microbatch', + event_time='session_start', + begin='2020-01-01', + batch_size='day + concurrent_batches=true, # value set to true to run batches in parallel + ... + ) +}} + +select ... +``` + + + + +## How microbatch compares to other incremental strategies + +As data warehouses roll out new operations for concurrently replacing/upserting data partitions, we may find that the new operation for the data warehouse is more efficient than what the adapter uses for microbatch. In such instances, we reserve the right the update the default operation for microbatch, so long as it works as intended/documented for models that fit the microbatch paradigm. Most incremental models rely on the end user (you) to explicitly tell dbt what "new" means, in the context of each model, by writing a filter in an `{% if is_incremental() %}` conditional block. You are responsible for crafting this SQL in a way that queries [`{{ this }}`](/reference/dbt-jinja-functions/this) to check when the most recent record was last loaded, with an optional look-back window for late-arriving records. diff --git a/website/docs/docs/dbt-versions/core-upgrade/06-upgrading-to-v1.9.md b/website/docs/docs/dbt-versions/core-upgrade/06-upgrading-to-v1.9.md index 6ade3d5013f..a7d8be0e8a1 100644 --- a/website/docs/docs/dbt-versions/core-upgrade/06-upgrading-to-v1.9.md +++ b/website/docs/docs/dbt-versions/core-upgrade/06-upgrading-to-v1.9.md @@ -49,6 +49,8 @@ Starting in Core 1.9, you can use the new [microbatch strategy](/docs/build/incr - Simplified query design: Write your model query for a single batch of data. dbt will use your `event_time`, `lookback`, and `batch_size` configurations to automatically generate the necessary filters for you, making the process more streamlined and reducing the need for you to manage these details. - Independent batch processing: dbt automatically breaks down the data to load into smaller batches based on the specified `batch_size` and processes each batch independently, improving efficiency and reducing the risk of query timeouts. If some of your batches fail, you can use `dbt retry` to load only the failed batches. - Targeted reprocessing: To load a *specific* batch or batches, you can use the CLI arguments `--event-time-start` and `--event-time-end`. +- [Automatic parallel batch execution](/docs/build/incremental-microbatch#parallel-batch-execution): Process multiple batches at the same time, instead of one after the other (sequentially) for faster processing of your microbatch models. dbt intelligently auto-detects if your batches can run in parallel, while also allowing you to manually override parallel execution with the `concurrent_batches` config. + Currently microbatch is supported on these adapters with more to come: * postgres diff --git a/website/docs/reference/resource-properties/concurrent_batches.md b/website/docs/reference/resource-properties/concurrent_batches.md new file mode 100644 index 00000000000..4d6b2ea0af4 --- /dev/null +++ b/website/docs/reference/resource-properties/concurrent_batches.md @@ -0,0 +1,90 @@ +--- +title: "concurrent_batches" +resource_types: [models] +datatype: model_name +description: "Learn about concurrent_batches in dbt." +--- + +:::note + +Available in dbt Core v1.9+ or the [dbt Cloud "Latest" release tracks](/docs/dbt-versions/cloud-release-tracks). + +::: + + + + + + + +```yaml +models: + +concurrent_batches: true +``` + + + + + + + + + + +```sql +{{ + config( + materialized='incremental', + concurrent_batches=true, + incremental_strategy='microbatch' + ... + ) +}} +select ... +``` + + + + + + +## Definition + +`concurrent_batches` is an override which allows you to decide whether or not you want to run batches in parallel or sequentially (one at a time). + +For more information, refer to [how batch execution works](/docs/build/incremental-microbatch#how-parallel-batch-execution-works). +## Example + +By default, dbt auto-detects whether batches can run in parallel for microbatch models. However, you can override dbt's detection by setting the `concurrent_batches` config to `false` in your `dbt_project.yml` or model `.sql` file to specify parallel or sequential execution, given you meet these conditions: +* You've configured a microbatch incremental strategy. +* You're working with cumulative metrics or any logic that depends on batch order. + +Set `concurrent_batches` config to `false` to ensure batches are processed sequentially. For example: + + + +```yaml +models: + my_project: + cumulative_metrics_model: + +concurrent_batches: false +``` + + + + + +```sql +{{ + config( + materialized='incremental', + incremental_strategy='microbatch' + concurrent_batches=false + ) +}} +select ... + +``` + + + diff --git a/website/sidebars.js b/website/sidebars.js index 5d6e0582765..08494e4c713 100644 --- a/website/sidebars.js +++ b/website/sidebars.js @@ -956,6 +956,7 @@ const sidebarSettings = { "reference/resource-configs/materialized", "reference/resource-configs/on_configuration_change", "reference/resource-configs/sql_header", + "reference/resource-properties/concurrent_batches", ], }, {