incorporating feedback from dbt to style and clean the page
wazi55 committed Oct 12, 2023
1 parent ab391be commit 2079654
Showing 4 changed files with 573 additions and 1,047 deletions.
22 changes: 5 additions & 17 deletions website/docs/docs/build/python-models.md
@@ -651,25 +651,13 @@ If not configured, `dbt-spark` will use the built-in defaults: the all-purpose c

**Submission methods:** The `dbt-bigquery` adapter uses [Dataproc](https://cloud.google.com/dataproc) to submit your Python models as PySpark jobs. Dataproc supports two submission methods: `cluster` and `serverless`.

> | | `Cluster` | `Serverless` |
> | -------------------------- | -------------------------- | -------------------------- |
> | Submission Method | Create or use an existing Dataproc cluster. [Submit](/reference/resource-configs/bigquery-configs.md#submitting-a-python-model) a Python model in `dbt_project.yml` or a `.yml` file within the `models/` directory. | Dataproc Serverless does not require a ready cluster, but jobs can be slower to start. [Submit](/reference/resource-configs/bigquery-configs.md#submitting-a-python-model) to a serverless cluster in the `.py` file. |
> | Additional Packages | Add third-party packages while creating the cluster with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). | Build your own [custom container image](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers#python_packages) with the packages you need. |
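
The linked examples cover the full configuration, but as a minimal sketch, choosing a submission method from inside a model's `.py` file goes through `dbt.config()`. The model and cluster names below are placeholders, and `dataproc_cluster_name` can equally be supplied in the profile or a `.yml` config:

```python
# models/my_python_model.py -- illustrative only; names are placeholders.
def model(dbt, session):
    dbt.config(
        materialized="table",
        submission_method="cluster",         # or "serverless"
        dataproc_cluster_name="my-cluster",  # only needed for the cluster method
    )

    # Reference an upstream dbt model and return a DataFrame for dbt to materialize.
    upstream_df = dbt.ref("my_upstream_model")
    return upstream_df
```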

**Additional setup**: The user or role needs adequate [IAM permissions](/reference/resource-configs/bigquery-configs.md#submitting-a-python-model) to trigger a job through a Dataproc cluster or Dataproc Serverless.

**Docs:**
- [Dataproc overview](https://cloud.google.com/dataproc/docs/concepts/overview)
45 changes: 37 additions & 8 deletions website/docs/reference/resource-configs/bigquery-configs.md
@@ -726,17 +726,48 @@ Just like SQL models, there are three ways to configure Python models:
2. In a dedicated `.yml` file, within the `models/` directory
3. Within the model's `.py` file, using the `dbt.config()` method

Any user or service account that runs dbt Python models will need the following permissions (in addition to the required BigQuery permissions) ([docs](https://cloud.google.com/dataproc/docs/concepts/iam/iam)):
```
dataproc.batches.create
dataproc.clusters.use
dataproc.jobs.create
dataproc.jobs.get
dataproc.operations.get
dataproc.operations.list
storage.buckets.get
storage.objects.create
storage.objects.delete
```
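
If you want to sanity-check these grants before running dbt, something like the following works against the Cloud Resource Manager API. This is a rough sketch, not part of dbt or the official docs: it assumes the `google-cloud-resource-manager` package and application-default credentials, and permissions granted only on the bucket (rather than the project) will show up as missing in this project-level check.

```python
# Illustrative sketch only -- reports which of the listed permissions the current
# credentials hold on a project. Assumes `pip install google-cloud-resource-manager`
# and application-default credentials.
from google.cloud import resourcemanager_v3

REQUIRED_PERMISSIONS = [
    "dataproc.batches.create",
    "dataproc.clusters.use",
    "dataproc.jobs.create",
    "dataproc.jobs.get",
    "dataproc.operations.get",
    "dataproc.operations.list",
    "storage.buckets.get",
    "storage.objects.create",
    "storage.objects.delete",
]

def missing_permissions(project_id: str) -> list[str]:
    """Return the required permissions the caller does NOT hold on the project."""
    client = resourcemanager_v3.ProjectsClient()
    response = client.test_iam_permissions(
        request={
            "resource": f"projects/{project_id}",
            "permissions": REQUIRED_PERMISSIONS,
        }
    )
    granted = set(response.permissions)
    # Storage permissions granted only at the bucket level will be reported
    # as missing here, since this checks the project resource.
    return [p for p in REQUIRED_PERMISSIONS if p not in granted]

if __name__ == "__main__":
    print(missing_permissions("my-gcp-project"))  # placeholder project ID
```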

Set up the profile to include the parameters required for Python models, `gcs_bucket` and `dataproc_region`:

<File name='profiles.yml'>

```yml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: <your_project>
      dataset: <your_dataset>
      gcs_bucket: <your_bucket> # required for python models
      dataproc_region: <your_region> # required for python models
      threads: 4
```
</File>

Then, based on the submission method, configure the model in `dbt_project.yml`, in a dedicated `.yml` file within the `models/` directory, or within the model's `.py` file.

<File name='models.yml'>

```yml
# models.yml with a Python model submitting jobs to a Dataproc cluster
models:
  - name: my_python_model
    config:
      submission_method: cluster
      dataproc_cluster_name: my-favorite-cluster # dataproc_cluster_name must be supplied in the profile or config to use the cluster submission method
```

</File>
@@ -747,9 +778,7 @@
def model(dbt, session):
    dbt.config(
        submission_method="serverless"
    )
    ...
