From 0a203cac2fff91f9cfd1fad659c79ba68a5c7ec8 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 10 Aug 2023 15:42:35 -0400 Subject: [PATCH 1/3] Deprecate v1.0 from docs --- contributing/single-sourcing-content.md | 2 +- website/dbt-versions.js | 4 -- website/docs/docs/build/incremental-models.md | 6 --- .../building-models/python-models.md | 2 +- .../connect-data-platform/bigquery-setup.md | 50 ------------------- .../core/connect-data-platform/spark-setup.md | 4 -- .../faqs/Core/install-python-compatibility.md | 6 --- website/docs/guides/legacy/best-practices.md | 6 --- .../docs/reference/node-selection/methods.md | 5 -- .../docs/reference/node-selection/syntax.md | 11 ---- .../resource-configs/persist_docs.md | 2 +- .../reference/resource-properties/config.md | 6 --- website/docs/reference/source-configs.md | 8 --- 13 files changed, 3 insertions(+), 109 deletions(-) diff --git a/contributing/single-sourcing-content.md b/contributing/single-sourcing-content.md index ca27372e5bc..fe64ce6521a 100644 --- a/contributing/single-sourcing-content.md +++ b/contributing/single-sourcing-content.md @@ -90,7 +90,7 @@ This component can be added directly to a markdown file in a similar way as othe Both properties can be used together to set a range where the content should show. 
In the example below, this content will only show if the selected version is between **0.21** and **1.0**: ```markdown - + Versioned content here diff --git a/website/dbt-versions.js b/website/dbt-versions.js index a59822101e9..655d4f02b7b 100644 --- a/website/dbt-versions.js +++ b/website/dbt-versions.js @@ -23,10 +23,6 @@ exports.versions = [ version: "1.1", EOLDate: "2023-04-28", }, - { - version: "1.0", - EOLDate: "2022-12-03" - }, ] exports.versionedPages = [ diff --git a/website/docs/docs/build/incremental-models.md b/website/docs/docs/build/incremental-models.md index 89115652a9c..d3c3f25890b 100644 --- a/website/docs/docs/build/incremental-models.md +++ b/website/docs/docs/build/incremental-models.md @@ -79,12 +79,6 @@ A `unique_key` enables updating existing rows instead of just appending new rows Not specifying a `unique_key` will result in append-only behavior, which means dbt inserts all rows returned by the model's SQL into the preexisting target table without regard for whether the rows represent duplicates. - - -The optional `unique_key` parameter specifies a field that can uniquely identify each row within your model. You can define `unique_key` in a configuration block at the top of your model. If your model doesn't contain a single field that is unique, but rather a combination of columns, we recommend that you create a single column that can serve as a unique identifier (by concatenating and hashing those columns), and pass it into your model's configuration. - - - The optional `unique_key` parameter specifies a field (or combination of fields) that define the grain of your model. That is, the field(s) identify a single unique row. You can define `unique_key` in a configuration block at the top of your model, and it can be a single column name or a list of column names. 
diff --git a/website/docs/docs/building-a-dbt-project/building-models/python-models.md b/website/docs/docs/building-a-dbt-project/building-models/python-models.md index 1aab8ac7a92..9c1127bb9f2 100644 --- a/website/docs/docs/building-a-dbt-project/building-models/python-models.md +++ b/website/docs/docs/building-a-dbt-project/building-models/python-models.md @@ -19,7 +19,7 @@ Below, you'll see sections entitled "❓ **Our questions**." We are excited to h dbt Python ("dbt-py") models will help you solve use cases that can't be solved with SQL. You can perform analyses using tools available in the open source Python ecosystem, including state-of-the-art packages for data science and statistics. Before, you would have needed separate infrastructure and orchestration to run Python transformations in production. By defining your Python transformations in dbt, they're just models in your project, with all the same capabilities around testing, documentation, and lineage. - + Python models are supported in dbt Core 1.3 and above. Learn more about [upgrading your version in dbt Cloud](https://docs.getdbt.com/docs/dbt-cloud/cloud-configuring-dbt-cloud/cloud-upgrading-dbt-versions) and [upgrading dbt Core versions](https://docs.getdbt.com/docs/core-versions#upgrading-to-new-patch-versions). diff --git a/website/docs/docs/core/connect-data-platform/bigquery-setup.md b/website/docs/docs/core/connect-data-platform/bigquery-setup.md index b0fc9fa7cf0..a34a4a0def2 100644 --- a/website/docs/docs/core/connect-data-platform/bigquery-setup.md +++ b/website/docs/docs/core/connect-data-platform/bigquery-setup.md @@ -317,56 +317,6 @@ my-profile: - - -BigQuery supports query timeouts. By default, the timeout is set to 300 seconds. If a dbt model takes longer than this timeout to complete, then BigQuery may cancel the query and issue the following error: - -``` - Operation did not complete within the designated timeout. 
-``` - -To change this timeout, use the `timeout_seconds` configuration: - - - -```yaml -my-profile: - target: dev - outputs: - dev: - type: bigquery - method: oauth - project: abc-123 - dataset: my_dataset - timeout_seconds: 600 # 10 minutes -``` - - - -The `retries` profile configuration designates the number of times dbt should retry queries that result in unhandled server errors. This configuration is only specified for BigQuery targets. Example: - - - -```yaml -# This example target will retry BigQuery queries 5 -# times with a delay. If the query does not succeed -# after the fifth attempt, then dbt will raise an error - -my-profile: - target: dev - outputs: - dev: - type: bigquery - method: oauth - project: abc-123 - dataset: my_dataset - retries: 5 -``` - - - - - ### Dataset locations The location of BigQuery datasets can be configured using the `location` configuration in a BigQuery profile. diff --git a/website/docs/docs/core/connect-data-platform/spark-setup.md b/website/docs/docs/core/connect-data-platform/spark-setup.md index 2e3b5a66de8..c3886f37e9e 100644 --- a/website/docs/docs/core/connect-data-platform/spark-setup.md +++ b/website/docs/docs/core/connect-data-platform/spark-setup.md @@ -207,8 +207,6 @@ your_profile_name: - - ## Optional configurations ### Retries @@ -227,8 +225,6 @@ connect_retries: 3 - - ## Caveats ### Usage with EMR diff --git a/website/docs/faqs/Core/install-python-compatibility.md b/website/docs/faqs/Core/install-python-compatibility.md index d24466f4990..4d6066d931b 100644 --- a/website/docs/faqs/Core/install-python-compatibility.md +++ b/website/docs/faqs/Core/install-python-compatibility.md @@ -23,12 +23,6 @@ The latest version of `dbt-core` is compatible with Python versions 3.7, 3.8, 3. - - -As of v1.0, `dbt-core` is compatible with Python versions 3.7, 3.8, and 3.9. - - - Adapter plugins and their dependencies are not always compatible with the latest version of Python. 
For example, dbt-snowflake v0.19 is not compatible with Python 3.9, but dbt-snowflake versions 0.20+ are. New dbt minor versions will add support for new Python3 minor versions as soon as all dependencies can support it. In turn, dbt minor versions will drop support for old Python3 minor versions right before they reach [end of life](https://endoflife.date/python). diff --git a/website/docs/guides/legacy/best-practices.md b/website/docs/guides/legacy/best-practices.md index 018d48ba181..10e02271518 100644 --- a/website/docs/guides/legacy/best-practices.md +++ b/website/docs/guides/legacy/best-practices.md @@ -159,12 +159,6 @@ dbt test --select result:fail --exclude --defer --state path/to/p > Note: If you're using the `--state target/` flag, `result:error` and `result:fail` flags can only be selected concurrently(in the same command) if using the `dbt build` command. `dbt test` will overwrite the `run_results.json` from `dbt run` in a previous command invocation. - - -Only supported by v1.1 or newer. - - - Only supported by v1.1 or newer. diff --git a/website/docs/reference/node-selection/methods.md b/website/docs/reference/node-selection/methods.md index ff86d60c06a..ca66b00044f 100644 --- a/website/docs/reference/node-selection/methods.md +++ b/website/docs/reference/node-selection/methods.md @@ -252,11 +252,6 @@ $ dbt seed --select result:error --state path/to/artifacts # run all seeds that ``` ### The "source_status" method - - -Supported in v1.1 or newer. - - diff --git a/website/docs/reference/node-selection/syntax.md b/website/docs/reference/node-selection/syntax.md index 1a43a32e2bc..a60d23cd16f 100644 --- a/website/docs/reference/node-selection/syntax.md +++ b/website/docs/reference/node-selection/syntax.md @@ -174,12 +174,6 @@ $ dbt run --select result:+ state:modified+ --defer --state ./ - -Only supported by v1.1 or newer. - - - Only supported by v1.1 or newer. 
@@ -199,11 +193,6 @@ dbt build --select source_status:fresher+ For more example commands, refer to [Pro-tips for workflows](/guides/legacy/best-practices.md#pro-tips-for-workflows). ### The "source_status" status - - -Only supported by v1.1 or newer. - - diff --git a/website/docs/reference/resource-configs/persist_docs.md b/website/docs/reference/resource-configs/persist_docs.md index 6facf3945cb..7134972d2ca 100644 --- a/website/docs/reference/resource-configs/persist_docs.md +++ b/website/docs/reference/resource-configs/persist_docs.md @@ -151,7 +151,7 @@ Some known issues and limitations: - + - Column names that must be quoted, such as column names containing special characters, will cause runtime errors if column-level `persist_docs` is enabled. This is fixed in v1.2. diff --git a/website/docs/reference/resource-properties/config.md b/website/docs/reference/resource-properties/config.md index 32143c1da07..1d3a2de6592 100644 --- a/website/docs/reference/resource-properties/config.md +++ b/website/docs/reference/resource-properties/config.md @@ -108,12 +108,6 @@ version: 2 - - -We have added support for the `config` property on sources in dbt Core v1.1 - - - diff --git a/website/docs/reference/source-configs.md b/website/docs/reference/source-configs.md index ef428f5934c..49390c299c8 100644 --- a/website/docs/reference/source-configs.md +++ b/website/docs/reference/source-configs.md @@ -71,14 +71,6 @@ Sources can be configured via a `config:` block within their `.yml` definitions, - - -Sources can be configured from the `dbt_project.yml` file under the `sources:` key. This configuration is most useful for configuring sources imported from [a package](package-management). You can disable sources imported from a package to prevent them from rendering in the documentation, or to prevent [source freshness checks](/docs/build/sources#snapshotting-source-data-freshness) from running on source tables imported from packages. 
- -Unlike other resource types, sources do not yet support a `config` property. It is not possible to (re)define source configs hierarchically across multiple YAML files. - - - ### Examples #### Disable all sources imported from a package To apply a configuration to all sources included from a [package](/docs/build/packages), From a048718c042e69f0f6a9650772fcac975d20e7fb Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 31 Aug 2023 17:08:33 -0400 Subject: [PATCH 2/3] Fixing correct page --- website/docs/docs/build/python-models.md | 6 + .../building-models/python-models.md | 719 ------------------ 2 files changed, 6 insertions(+), 719 deletions(-) delete mode 100644 website/docs/docs/building-a-dbt-project/building-models/python-models.md diff --git a/website/docs/docs/build/python-models.md b/website/docs/docs/build/python-models.md index 12825648501..bff65362d06 100644 --- a/website/docs/docs/build/python-models.md +++ b/website/docs/docs/build/python-models.md @@ -16,11 +16,15 @@ We encourage you to: dbt Python (`dbt-py`) models can help you solve use cases that can't be solved with SQL. You can perform analyses using tools available in the open-source Python ecosystem, including state-of-the-art packages for data science and statistics. Before, you would have needed separate infrastructure and orchestration to run Python transformations in production. Python transformations defined in dbt are models in your project with all the same capabilities around testing, documentation, and lineage. + Python models are supported in dbt Core 1.3 and higher. Learn more about [upgrading your version in dbt Cloud](https://docs.getdbt.com/docs/dbt-cloud/cloud-configuring-dbt-cloud/cloud-upgrading-dbt-versions) and [upgrading dbt Core versions](https://docs.getdbt.com/docs/core-versions#upgrading-to-new-patch-versions). 
To read more about Python models, change the [docs version to 1.3](/docs/build/python-models?version=1.3) (or higher) in the menu bar. + + + @@ -711,3 +715,5 @@ You can also install packages at cluster creation time by [defining cluster prop + + \ No newline at end of file diff --git a/website/docs/docs/building-a-dbt-project/building-models/python-models.md b/website/docs/docs/building-a-dbt-project/building-models/python-models.md deleted file mode 100644 index 9c1127bb9f2..00000000000 --- a/website/docs/docs/building-a-dbt-project/building-models/python-models.md +++ /dev/null @@ -1,719 +0,0 @@ ---- -title: "Python models" ---- - -:::info Brand new! - -dbt Core v1.3 included first-ever support for Python models. Note that only [specific data platforms](#specific-data-platforms) support dbt-py models. - -We encourage you to: -- Read [the original discussion](https://github.com/dbt-labs/dbt-core/discussions/5261) that proposed this feature. -- Contribute to [best practices for developing Python models in dbt](https://discourse.getdbt.com/t/dbt-python-model-dbt-py-best-practices/5204 ). -- Weigh in on [next steps for Python models, beyond v1.3](https://github.com/dbt-labs/dbt-core/discussions/5742). -- Join the **#dbt-core-python-models** channel in the [dbt Community Slack](https://www.getdbt.com/community/join-the-community/). - -Below, you'll see sections entitled "❓ **Our questions**." We are excited to have released a first narrow set of functionality in v1.3, which will solve real use cases. We also know this is a first step into a much wider field of possibility. We don't pretend to have all the answers. We're excited to keep developing our opinionated recommendations and next steps for product development—and we want your help. Comment in the GitHub discussions; leave thoughts in Slack; bring up dbt + Python in casual conversation with colleagues and friends. 
-::: - -## About Python models in dbt - -dbt Python ("dbt-py") models will help you solve use cases that can't be solved with SQL. You can perform analyses using tools available in the open source Python ecosystem, including state-of-the-art packages for data science and statistics. Before, you would have needed separate infrastructure and orchestration to run Python transformations in production. By defining your Python transformations in dbt, they're just models in your project, with all the same capabilities around testing, documentation, and lineage. - - - -Python models are supported in dbt Core 1.3 and above. Learn more about [upgrading your version in dbt Cloud](https://docs.getdbt.com/docs/dbt-cloud/cloud-configuring-dbt-cloud/cloud-upgrading-dbt-versions) and [upgrading dbt Core versions](https://docs.getdbt.com/docs/core-versions#upgrading-to-new-patch-versions). - -To read more about Python models, change the docs version to 1.3 or higher in the menu above. - - - - - - - - -```python -import ... - -def model(dbt, session): - - my_sql_model_df = dbt.ref("my_sql_model") - - final_df = ... # stuff you can't write in SQL! - - return final_df -``` - - - - - -```yml -version: 2 - -models: - - name: my_python_model - - # Document within the same codebase - description: My transformation written in Python - - # Configure in ways that feel intuitive and familiar - config: - materialized: table - tags: ['python'] - - # Test the results of my Python transformation - columns: - - name: id - # Standard validation for 'grain' of Python results - tests: - - unique - - not_null - tests: - # Write your own validation logic (in SQL) for Python results - - [custom_generic_test](writing-custom-generic-tests) -``` - - - - - - -The prerequisites for dbt Python models include using an adapter for a data platform that supports a fully featured Python runtime. In a dbt Python model, all Python code is executed remotely on the platform. None of it is run by dbt locally. 
We believe in clearly separating _model definition_ from _model execution_. In this and many other ways, you'll find that dbt's approach to Python models mirrors its longstanding approach to modeling data in SQL. - -We've written this guide assuming that you have some familiarity with dbt. If you've never before written a dbt model, we encourage you to start by first reading [dbt Models](/docs/build/models). Throughout, we'll be drawing connections between Python models and SQL models, as well as making clear their differences. - -### What is a Python model? - -A dbt Python model is a function that reads in dbt sources or other models, applies a series of transformations, and returns a transformed dataset. DataFrame operations define the starting points, the end state, and each step along the way. - -This is similar to the role of CTEs in dbt SQL models. We use CTEs to pull in upstream datasets, define (and name) a series of meaningful transformations, and end with a final `select` statement. You can run the compiled version of a dbt SQL model to see the data included in the resulting view or table. When you `dbt run`, dbt wraps that query in `create view`, `create table`, or more complex DDL to save its results in the database. - -Instead of a final `select` statement, each Python model returns a final DataFrame. Each DataFrame operation is "lazily evaluated." In development, you can preview its data, using methods like `.show()` or `.head()`. When you run a Python model, the full result of the final DataFrame will be saved as a table in your data warehouse. - -dbt Python models have access to almost all of the same configuration options as SQL models. You can test them, document them, add `tags` and `meta` properties to them, grant access to their results to other users, and so on. 
You can select them by their name, their file path, their configurations, whether they are upstream or downstream of another model, or whether they have been modified compared to a previous project state. - -### Defining a Python model - -Each Python model lives in a `.py` file in your `models/` folder. It defines a function named **`model()`**, which takes two parameters: -- **`dbt`**: A class compiled by dbt Core, unique to each model, enables you to run your Python code in the context of your dbt project and DAG. -- **`session`**: A class representing your data platform’s connection to the Python backend. The session is needed to read in tables as DataFrames, and to write DataFrames back to tables. In PySpark, by convention, the `SparkSession` is named `spark`, and available globally. For consistency across platforms, we always pass it into the `model` function as an explicit argument called `session`. - -The `model()` function must return a single DataFrame. On Snowpark (Snowflake), this can be a Snowpark or pandas DataFrame. Via PySpark (Databricks + BigQuery), this can be a Spark, pandas, or pandas-on-Spark DataFrame. For more about choosing between pandas and native DataFrames, see [DataFrame API + syntax](#dataframe-api--syntax). - -When you `dbt run --select python_model`, dbt will prepare and pass in both arguments (`dbt` and `session`). All you have to do is define the function. This is how every single Python model should look: - - - -```python -def model(dbt, session): - - ... - - return final_df -``` - - - - -### Referencing other models - -Python models participate fully in dbt's directed acyclic graph (DAG) of transformations. Use the `dbt.ref()` method within a Python model to read in data from other models (SQL or Python). If you want to read directly from a raw source table, use `dbt.source()`. These methods return DataFrames pointing to the upstream source, model, seed, or snapshot. 
- - - -```python -def model(dbt, session): - - # DataFrame representing an upstream model - upstream_model = dbt.ref("upstream_model_name") - - # DataFrame representing an upstream source - upstream_source = dbt.source("upstream_source_name", "table_name") - - ... -``` - - - -Of course, you can `ref()` your Python model in downstream SQL models, too: - - - -```sql -with upstream_python_model as ( - - select * from {{ ref('my_python_model') }} - -), - -... -``` - - - -### Configuring Python models - -Just like SQL models, there are three ways to configure Python models: -1. In `dbt_project.yml`, where you can configure many models at once -2. In a dedicated `.yml` file, within the `models/` directory -3. Within the model's `.py` file, using the `dbt.config()` method - -Calling the `dbt.config()` method will set configurations for your model right within your `.py` file, similar to the `{{ config() }}` macro in `.sql` model files: - - - -```python -def model(dbt, session): - - # setting configuration - dbt.config(materialized="table") -``` - - - -There's a limit to how fancy you can get with the `dbt.config()` method. It accepts _only_ literal values (strings, booleans, and numeric types). Passing another function or a more complex data structure is not possible. The reason is that dbt statically analyzes the arguments to `config()` while parsing your model without executing your Python code. If you need to set a more complex configuration, we recommend you define it using the [`config` property](resource-properties/config) in a YAML file. - -#### Accessing project context - -dbt Python models don't use Jinja to render compiled code. Python models have limited access to global project contexts compared to SQL models. That context is made available from the `dbt` class, passed in as an argument to the `model()` function. 
- -Out of the box, the `dbt` class supports: -- Returning DataFrames referencing the locations of other resources: `dbt.ref()` + `dbt.source()` -- Accessing the database location of the current model: `dbt.this()` (also: `dbt.this.database`, `.schema`, `.identifier`) -- Determining if the current model's run is incremental: `dbt.is_incremental` - -It is possible to extend this context by "getting" them via `dbt.config.get()` after they are configured in the [model's config](/reference/model-configs). This includes inputs such as `var`, `env_var`, and `target`. If you want to use those values to power conditional logic in your model, we require setting them through a dedicated `.yml` file config: - - - -```yml -version: 2 - -models: - - name: my_python_model - config: - materialized: table - target_name: "{{ target.name }}" - specific_var: "{{ var('SPECIFIC_VAR') }}" - specific_env_var: "{{ env_var('SPECIFIC_ENV_VAR') }}" -``` - - - -Then, within the model's Python code, use the `dbt.config.get()` function to _access_ values of configurations that have been set: - - - -```python -def model(dbt, session): - target_name = dbt.config.get("target_name") - specific_var = dbt.config.get("specific_var") - specific_env_var = dbt.config.get("specific_env_var") - - orders_df = dbt.ref("fct_orders") - - # limit data in dev - if target_name == "dev": - orders_df = orders_df.limit(500) -``` - - - -### Materializations - -Python models support two materializations: -- `table` -- `incremental` - -Incremental Python models support all the same [incremental strategies](/docs/build/incremental-models#about-incremental_strategy) as their SQL counterparts. The specific strategies supported depend on your adapter. - -Python models can't be materialized as `view` or `ephemeral`. Python isn't supported for non-model resource types (like tests and snapshots). - -For incremental models, like SQL models, you will need to filter incoming tables to only new rows of data: - - - -
- - - -```python -import snowflake.snowpark.functions as F - -def model(dbt, session): - dbt.config( - materialized = "incremental", - unique_key = "id", - ) - df = dbt.ref("upstream_table") - - if dbt.is_incremental: - - # only new rows compared to max in current table - max_from_this = f"select max(updated_at) from {dbt.this}" - df = df.filter(df.updated_at > session.sql(max_from_this).collect()[0][0]) - - # or only rows from the past 3 days - df = df.filter(df.updated_at >= F.dateadd("day", F.lit(-3), F.current_timestamp())) - - ... - - return df -``` - - - -
- -
- - - -```python -import pyspark.sql.functions as F - -def model(dbt, session): - dbt.config( - materialized = "incremental", - unique_key = "id", - ) - df = dbt.ref("upstream_table") - - if dbt.is_incremental: - - # only new rows compared to max in current table - max_from_this = f"select max(updated_at) from {dbt.this}" - df = df.filter(df.updated_at > session.sql(max_from_this).collect()[0][0]) - - # or only rows from the past 3 days - df = df.filter(df.updated_at >= F.date_add(F.current_timestamp(), F.lit(-3))) - - ... - - return df -``` - - - -
- -
- -**Note:** Incremental models are supported on BigQuery/Dataproc for the `merge` incremental strategy. The `insert_overwrite` strategy is not yet supported. - -## Python-specific functionality - -### Defining functions - -In addition to defining a `model` function, the Python model can import other functions or define its own. Here's an example, on Snowpark, defining a custom `add_one` function: - - - -```python -def add_one(x): - return x + 1 - -def model(dbt, session): - dbt.config(materialized="table") - temps_df = dbt.ref("temperatures") - - # warm things up just a little - df = temps_df.withColumn("degree_plus_one", add_one(temps_df["degree"])) - return df -``` - - - -At present, Python functions defined in one dbt model can't be imported and reused in other models. See the ["Code reuse"](#code-reuse) section for the potential patterns we're considering. - -### Using PyPI packages - -You can also define functions that depend on third-party packages, so long as those packages are installed and available to the Python runtime on your data platform. See notes on "Installing Packages" for [specific data warehouses](#specific-data-warehouses). - -In this example, we use the `holidays` package to determine if a given date is a holiday in France. For simplicity and consistency across platforms, the code below uses the pandas API. The exact syntax, and the need to refactor for multi-node processing, still varies. - - - -
- - - -```python -import holidays - -def is_holiday(date_col): - # Chez Jaffle - french_holidays = holidays.France() - is_holiday = (date_col in french_holidays) - return is_holiday - -def model(dbt, session): - dbt.config( - materialized = "table", - packages = ["holidays"] - ) - - orders_df = dbt.ref("stg_orders") - - df = orders_df.to_pandas() - - # apply our function - # (columns need to be in uppercase on Snowpark) - df["IS_HOLIDAY"] = df["ORDER_DATE"].apply(is_holiday) - - # return final dataset (Pandas DataFrame) - return df -``` - - - -
- -
- - - -```python -import holidays - -def is_holiday(date_col): - # Chez Jaffle - french_holidays = holidays.France() - is_holiday = (date_col in french_holidays) - return is_holiday - -def model(dbt, session): - dbt.config( - materialized = "table", - packages = ["holidays"] - ) - - orders_df = dbt.ref("stg_orders") - - df = orders_df.to_pandas_on_spark() # Spark 3.2+ - # df = orders_df.toPandas() in earlier versions - - # apply our function - df["is_holiday"] = df["order_date"].apply(is_holiday) - - # convert back to PySpark - df = df.to_spark() # Spark 3.2+ - # df = session.createDataFrame(df) in earlier versions - - # return final dataset (PySpark DataFrame) - return df -``` - - - -
- -
- -#### Configuring packages - -We encourage you to explicitly configure required packages and versions so dbt can track them in project metadata. This configuration is required for the implementation on some platforms. If you need specific versions of packages, specify them. - - - -```python -def model(dbt, session): - dbt.config( - packages = ["numpy==1.23.1", "scikit-learn"] - ) -``` - - - - - -```yml -version: 2 - -models: - - name: my_python_model - config: - packages: - - "numpy==1.23.1" - - scikit-learn -``` - - - -#### UDFs - -You can use the `@udf` decorator or `udf` function to define an "anonymous" function and call it within your `model` function's DataFrame transformation. This is a typical pattern for applying more complex functions as DataFrame operations, especially if those functions require inputs from third-party packages. -- [Snowpark Python: Creating UDFs](https://docs.snowflake.com/en/developer-guide/snowpark/python/creating-udfs.html) -- [PySpark functions: udf](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html) - - - -
- - - -```python -import snowflake.snowpark.types as T -import snowflake.snowpark.functions as F -import numpy - -def register_udf_add_random(): - add_random = F.udf( - # use 'lambda' syntax, for simple functional behavior - lambda x: x + numpy.random.normal(), - return_type=T.FloatType(), - input_types=[T.FloatType()] - ) - return add_random - -def model(dbt, session): - - dbt.config( - materialized = "table", - packages = ["numpy"] - ) - - temps_df = dbt.ref("temperatures") - - add_random = register_udf_add_random() - - # warm things up, who knows by how much - df = temps_df.withColumn("degree_plus_random", add_random("degree")) - return df -``` - - - -**Note:** Due to a Snowpark limitation, it is not currently possible to register complex named UDFs within stored procedures, and therefore dbt Python models. We are looking to add native support for Python UDFs as a project/DAG resource type in a future release. For the time being, if you want to create a "vectorized" Python UDF via the Batch API, we recommend either: -- Writing [`create function`](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch.html) inside a SQL macro, to run as a hook or run-operation -- [Registering from a staged file](https://docs.snowflake.com/ko/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.udf.html#snowflake.snowpark.udf.UDFRegistration.register_from_file) within your Python model code - -
- -
 - - - -```python -import pyspark.sql.types as T -import pyspark.sql.functions as F -import numpy - -# use a 'decorator' for more readable code -@F.udf(returnType=T.DoubleType()) -def add_random(x): - random_number = numpy.random.normal() - return x + random_number - -def model(dbt, session): - dbt.config( - materialized = "table", - packages = ["numpy"] - ) - - temps_df = dbt.ref("temperatures") - - # warm things up, who knows by how much - df = temps_df.withColumn("degree_plus_random", add_random("degree")) - return df -``` - - - -
- -
-
-#### Code reuse
-
-Currently, you cannot import or reuse Python functions defined in one dbt model in other models. This is something we'd like dbt to support. There are two patterns we're considering:
-1. Creating and registering **"named" UDFs**. This process is different across data platforms and has some performance limitations. (Snowpark does support ["vectorized" UDFs](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch.html): pandas-like functions that you can execute in parallel.)
-2. Using **private Python packages**. In addition to importing reusable functions from public PyPI packages, many data platforms support uploading custom Python assets and registering them as packages. The upload process looks different across platforms, but your code’s actual `import` looks the same.
-
-:::note ❓ Our questions
-
-- Should dbt have a role in abstracting over UDFs? Should dbt support a new type of DAG node, `function`? Would the primary use case be code reuse across Python models or defining Python-language functions that can be called from SQL models?
-- How can dbt help users when uploading or initializing private Python assets? Is this a new form of `dbt deps`?
-- How can dbt support users who want to test custom functions? If defined as UDFs: "unit testing" in the database? If "pure" functions in packages: encourage adoption of `pytest`?
-
-💬 Discussion: ["Python models: package, artifact/object storage, and UDF management in dbt"](https://github.com/dbt-labs/dbt-core/discussions/5741)
-:::
-
-### DataFrame API and syntax
-
-Over the past decade, most people writing data transformations in Python have adopted DataFrame as their common abstraction. dbt follows this convention by returning `ref()` and `source()` as DataFrames, and it expects all Python models to return a DataFrame.
-
-A DataFrame is a two-dimensional data structure (rows and columns). It supports convenient methods for transforming that data, creating new columns from calculations performed on existing columns. It also offers convenient ways for previewing data while developing locally or in a notebook.
-
-That's about where the agreement ends. There are numerous frameworks with their own syntaxes and APIs for DataFrames. The [pandas](https://pandas.pydata.org/docs/) library offered one of the original DataFrame APIs, and its syntax is the one new data professionals most commonly learn. Most newer DataFrame APIs are compatible with pandas-style syntax, though few can offer perfect interoperability. This is true for Snowpark and PySpark, which have their own DataFrame APIs.
-
-When developing a Python model, you will find yourself asking these questions:
-
-**Why pandas?** It's the most common API for DataFrames. It makes it easy to explore sampled data and develop transformations locally. You can “promote” your code as-is into dbt models and run it in production for small datasets.
-
-**Why _not_ pandas?** Performance. pandas runs "single-node" transformations, which cannot benefit from the parallelism and distributed computing offered by modern data warehouses. This quickly becomes a problem as you operate on larger datasets. Some data platforms support optimizations for code written using pandas' DataFrame API, avoiding the need for major refactors. For example, ["pandas on PySpark"](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html) offers support for 95% of pandas functionality, using the same API while still leveraging parallel processing.
-
-:::note ❓ Our questions
-- When developing a new dbt Python model, should we recommend pandas-style syntax for rapid iteration and then refactor?
-- Which open source libraries provide compelling abstractions across different data engines and vendor-specific APIs?
-- Should dbt attempt to play a longer-term role in standardizing across them?
-
-💬 Discussion: ["Python models: the pandas problem (and a possible solution)"](https://github.com/dbt-labs/dbt-core/discussions/5738)
-:::
-
-### Limitations
-
-Python models have capabilities that SQL models do not. They also have some drawbacks compared to SQL models:
-
-- **Time and cost.** Python models are slower to run than SQL models, and the cloud resources that run them can be more expensive. Running Python requires more general-purpose compute. That compute might sometimes live on a separate service or architecture from your SQL models. **However:** We believe that deploying Python models via dbt (with unified lineage, testing, and documentation) is, from a human standpoint, **dramatically** faster and cheaper. By comparison, spinning up separate infrastructure to orchestrate Python transformations in production and different tooling to integrate with dbt is much more time-consuming and expensive.
-- **Syntax differences** are even more pronounced. Over the years, dbt has done a lot, via dispatch patterns and packages such as `dbt_utils`, to abstract over differences in SQL dialects across popular data warehouses. Python offers a **much** wider field of play. If there are five ways to do something in SQL, there are 500 ways to write it in Python, all with varying performance and adherence to standards. Those options can be overwhelming. As the maintainers of dbt, we will be learning from state-of-the-art projects tackling this problem and sharing guidance as we develop it.
-- **These capabilities are very new.** As data warehouses develop new features, we expect them to offer cheaper, faster, and more intuitive mechanisms for deploying Python transformations. **We reserve the right to change the underlying implementation for executing Python models in future releases.** Our commitment to you is around the code in your model `.py` files, following the documented capabilities and guidance we're providing here.
-
-As a general rule, if there's a transformation you could write equally well in SQL or Python, we believe that well-written SQL is preferable: it's more accessible to a greater number of colleagues, and it's easier to write code that's performant at scale. If there's a transformation you _can't_ write in SQL, or where ten lines of elegant and well-annotated Python could save you 1000 lines of hard-to-read Jinja-SQL, Python is the way to go.
-
-## Specific data platforms
-
-In their initial launch, Python models are supported on three of the most popular data platforms: Snowflake, Databricks, and BigQuery/GCP (via Dataproc). Both Databricks and GCP's Dataproc use PySpark as the processing framework. Snowflake uses its own framework, Snowpark, which has many similarities to PySpark.
-
-
-
-
-
-**Additional setup:** You will need to [acknowledge and accept Snowflake Third Party Terms](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-packages.html#getting-started) to use Anaconda packages.
-
-**Installing packages:** Snowpark supports several popular packages via Anaconda. The complete list is at https://repo.anaconda.com/pkgs/snowflake/. Packages are installed when your model runs. Different models can have different package dependencies. If you are using third-party packages, Snowflake recommends using a dedicated virtual warehouse for best performance rather than one with many concurrent users.
-
-**About "sprocs":** dbt submits Python models to run as "stored procedures," which some people call "sprocs" for short. By default, dbt will create a named sproc containing your model's compiled Python code, and then "call" it to execute. Snowpark has a Private Preview feature for "temporary" or "anonymous" stored procedures ([docs](https://docs.snowflake.com/en/LIMITEDACCESS/call-with.html)), which are faster and leave a cleaner query history. If this feature is enabled for your account, you can switch it on for your models by configuring `use_anonymous_sproc: True`. We plan to switch this on for all dbt + Snowpark Python models in a future release.
-
-
-
-```yml
-# I asked Snowflake Support to enable this Private Preview feature,
-# and now my dbt-py models run even faster!
-models:
-  use_anonymous_sproc: True
-```
-
-
-
-**Docs:** ["Developer Guide: Snowpark Python"](https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html)
-
-
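To make the Snowflake settings above concrete, a Python model could declare its package dependencies and opt in to anonymous stored procedures from inside the model itself. This is only a minimal sketch: it assumes `use_anonymous_sproc` is accepted as a keyword by `dbt.config()` in the same way it is accepted in YAML, and the `temperatures` model is a hypothetical upstream table.

```python
def model(dbt, session):
    dbt.config(
        materialized="table",
        # Resolved from Snowflake's Anaconda channel when the model runs.
        packages=["numpy"],
        # Assumption: the Private Preview flag can be set per model like this.
        use_anonymous_sproc=True,
    )

    # Hypothetical upstream model; dbt.ref() hands back a DataFrame.
    temps_df = dbt.ref("temperatures")
    return temps_df
```

Setting the flag per model, rather than project-wide, may be a reasonable way to try the feature on one model before enabling it everywhere.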
-
-
-
-**Submission methods:** Databricks supports a few different mechanisms to submit PySpark code, each with relative advantages. Some are better for supporting iterative development, while others are better for supporting lower-cost production deployments. The options are:
-- `all_purpose_cluster` (default): dbt will run your Python model using the cluster ID configured as `cluster` in your connection profile or for this specific model. These clusters are more expensive but also much more responsive. We recommend using an interactive all-purpose cluster for quicker iteration in development.
-  - `create_notebook: True`: dbt will upload your model's compiled PySpark code to a notebook in the namespace `/Shared/dbt_python_model/{schema}`, where `{schema}` is the configured schema for the model, and execute that notebook on the all-purpose cluster. The appeal of this approach is that you can easily open the notebook in the Databricks UI for debugging or fine-tuning right after running your model. Remember to copy any changes into your dbt `.py` model code before re-running.
-  - `create_notebook: False` (default): dbt will use the [Command API](https://docs.databricks.com/dev-tools/api/1.2/index.html#run-a-command), which is slightly faster.
-- `job_cluster`: dbt will upload your model's compiled PySpark code to a notebook in the namespace `/Shared/dbt_python_model/{schema}`, where `{schema}` is the configured schema for the model, and execute that notebook on a short-lived jobs cluster. For each Python model, Databricks will need to spin up the cluster, execute the model's PySpark transformation, and then spin down the cluster. As such, job clusters take longer before and after model execution, but they're also less expensive, so we recommend these for longer-running Python models in production. To use the `job_cluster` submission method, your model must be configured with `job_cluster_config`, which defines key-value properties for `new_cluster`, as defined in the [JobRunsSubmit API](https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsSubmit).
-
-You can configure each model's `submission_method` in all the standard ways you supply configuration:
-
-```python
-def model(dbt, session):
-    dbt.config(
-        submission_method="all_purpose_cluster",
-        create_notebook=True,
-        cluster_id="abcd-1234-wxyz"
-    )
-    ...
-```
-```yml
-version: 2
-models:
-  - name: my_python_model
-    config:
-      submission_method: job_cluster
-      job_cluster_config:
-        spark_version: ...
-        node_type_id: ...
-```
-```yml
-# dbt_project.yml
-models:
-  project_name:
-    subfolder:
-      # set defaults for all .py models defined in this subfolder
-      +submission_method: all_purpose_cluster
-      +create_notebook: False
-      +cluster_id: abcd-1234-wxyz
-```
-
-If not configured, `dbt-spark` will use the built-in defaults: the all-purpose cluster (based on `cluster` in your connection profile) without creating a notebook. The `dbt-databricks` adapter will default to the cluster configured in `http_path`. We encourage explicitly configuring the clusters for Python models in Databricks projects.
-
-**Installing packages:** When using all-purpose clusters, we recommend installing the packages that your Python models will use.
-
-**Docs:**
-- [PySpark DataFrame syntax](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html)
-- [Databricks: Introduction to DataFrames - Python](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html)
-
-
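The Databricks options described above amount to a small decision table, which can be summarized in code. The function below is purely illustrative and is not how dbt implements submission; the function name and the returned strings are invented for this sketch, and only the `submission_method` and `create_notebook` settings come from the text.

```python
def describe_submission(submission_method="all_purpose_cluster", create_notebook=False):
    """Summarize how a Python model would be submitted for a given config.

    Illustrative only: not dbt's implementation; the returned strings
    are made up for this example.
    """
    if submission_method == "job_cluster":
        # Always runs an uploaded notebook on a short-lived jobs cluster.
        return "notebook on a short-lived jobs cluster"
    if submission_method == "all_purpose_cluster":
        if create_notebook:
            return "notebook on the all-purpose cluster"
        return "Command API on the all-purpose cluster"
    raise ValueError(f"unknown submission_method: {submission_method!r}")


print(describe_submission())
print(describe_submission(create_notebook=True))
print(describe_submission("job_cluster"))
```

The defaults in the signature mirror the defaults named in the list: an all-purpose cluster without a notebook.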
-
-
-
-The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python models as PySpark jobs. That Python/PySpark code will read from your tables and views in BigQuery, perform all computation in Dataproc, and write the final result back to BigQuery.
-
-**Submission methods.** Dataproc supports two submission methods: `serverless` and `cluster`. Dataproc Serverless does not require a ready cluster, which saves on hassle and cost, but it is slower to start up and much more limited in terms of available configuration. For example, Dataproc Serverless supports only a small set of Python packages, though it does include `pandas`, `numpy`, and `scikit-learn`. (See the full list [here](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers#example_custom_container_image_build), under "The following packages are installed in the default image".) By contrast, creating a Dataproc Cluster in advance lets you fine-tune the cluster's configuration, install any PyPI packages you want, and benefit from faster, more responsive runtimes.
-
-Use the `cluster` submission method with dedicated Dataproc clusters you or your organization manage. Use the `serverless` submission method to avoid managing a Spark cluster. The latter may be quicker for getting started, but both are valid for production.
-
-**Additional setup:**
-- Create or use an existing [Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets)
-- Enable Dataproc APIs for your project + region
-- If using the `cluster` submission method: Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). (Google recommends copying the action into your own Cloud Storage bucket, rather than using the example version shown in the screenshot below.)
-
-
-
-The following configurations are needed to run Python models on Dataproc. You can add these to your [BigQuery profile](/reference/warehouse-setups/bigquery-setup#running-python-models-on-dataproc), or configure them on specific Python models:
-- `gcs_bucket`: Storage bucket to which dbt will upload your model's compiled PySpark code.
-- `dataproc_region`: GCP region in which you have enabled Dataproc (for example `us-central1`).
-- `dataproc_cluster_name`: Name of the Dataproc cluster to use for running the Python model (executing the PySpark job). Only required if `submission_method: cluster`.
-
-```python
-def model(dbt, session):
-    dbt.config(
-        submission_method="cluster",
-        dataproc_cluster_name="my-favorite-cluster"
-    )
-    ...
-```
-```yml
-version: 2
-models:
-  - name: my_python_model
-    config:
-      submission_method: serverless
-```
-
-Any user or service account that runs dbt Python models will need the following permissions, in addition to permissions needed for BigQuery ([docs](https://cloud.google.com/dataproc/docs/concepts/iam/iam)):
-```
-dataproc.clusters.use
-dataproc.jobs.create
-dataproc.jobs.get
-dataproc.operations.get
-storage.buckets.get
-storage.objects.create
-storage.objects.delete
-```
-
-**Installing packages:** If you are using a Dataproc Cluster (as opposed to Dataproc Serverless), you can add third-party packages while creating the cluster.
-
-Google recommends installing Python packages on Dataproc clusters via initialization actions:
-- [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
-- [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)
-
-You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.
-
-
-
-**Docs:**
-- [Dataproc overview](https://cloud.google.com/dataproc/docs/concepts/overview)
-- [PySpark DataFrame syntax](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html)
-
-
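The Dataproc settings listed above depend on the submission method, so a quick pre-flight check can help catch a missing value before a run. The helper below is a hypothetical sketch, not part of dbt or `dbt-bigquery`: the function name is invented, it works on a plain dictionary, and the bucket name `my-bucket` is a placeholder. It only encodes the rule stated in the text, that `dataproc_cluster_name` is required when `submission_method` is `cluster`.

```python
def missing_dataproc_settings(config):
    """Return the required Dataproc settings that are absent from `config`.

    Hypothetical helper for illustration; `config` is a plain dict of
    the settings described in the text.
    """
    required = ["gcs_bucket", "dataproc_region"]
    # The cluster name is only needed for the `cluster` submission method.
    if config.get("submission_method") == "cluster":
        required.append("dataproc_cluster_name")
    return [name for name in required if not config.get(name)]


serverless = {
    "submission_method": "serverless",
    "gcs_bucket": "my-bucket",  # placeholder bucket name
    "dataproc_region": "us-central1",
}
cluster = dict(serverless, submission_method="cluster")

print(missing_dataproc_settings(serverless))
print(missing_dataproc_settings(cluster))
```

The first call reports nothing missing, while the second flags `dataproc_cluster_name`, which matches the "only required if" note in the list of settings.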
-
-
-
-
From fc1e487621c329ea14d47a0a8733332feb84b0cf Mon Sep 17 00:00:00 2001
From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com>
Date: Fri, 8 Sep 2023 15:31:02 -0400
Subject: [PATCH 3/3] Update website/docs/docs/core/connect-data-platform/spark-setup.md

---
 website/docs/docs/core/connect-data-platform/spark-setup.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/website/docs/docs/core/connect-data-platform/spark-setup.md b/website/docs/docs/core/connect-data-platform/spark-setup.md
index 5d74a932c45..b22416fd3a5 100644
--- a/website/docs/docs/core/connect-data-platform/spark-setup.md
+++ b/website/docs/docs/core/connect-data-platform/spark-setup.md
@@ -230,7 +230,6 @@ connect_retries: 3
- -
### Server side configuration