Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persist source table and column descriptions #248

Closed
katieclaiborne opened this issue Jan 16, 2024 · 7 comments
Closed

Persist source table and column descriptions #248

katieclaiborne opened this issue Jan 16, 2024 · 7 comments
Labels
enhancement New feature or request triage

Comments

@katieclaiborne
Copy link

Describe the feature

If source table and column descriptions are configured in dbt, I'd like for them to be persisted in the database when the stage_external_sources operation is run.

Describe alternatives you've considered

Configuring source table and column descriptions does make them visible in dbt Docs and dbt Explorer, which is helpful.

Additional context

We're working in BigQuery, but I could imagine similar use cases in other databases.

I'm not sure if this is a known limitation, given that the persist_docs config is not implemented for sources. It could also be that we've just missed a step in configuring and staging our external sources!

Who will this benefit?

This feature will be helpful for users discovering sources through the database, rather than dbt Docs or dbt Explorer.

@katieclaiborne katieclaiborne added enhancement New feature or request triage labels Jan 16, 2024
@LoHertel
Copy link

LoHertel commented Feb 7, 2024

I was looking for a way in dbt to add table and column descriptions to external tables in BigQuery as well.
The current implementation of this package does not consider descriptions at all (not only for BigQuery).

The BigQuery DDL supports descriptions for external tables:

CREATE EXTERNAL TABLE `project`.`dataset`.`table_name` (
  order_id INTEGER OPTIONS(description = "Unique identifier of the order (primary key).")
)
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://gcs-name/orders.parquet'],
  description = "Contains all order line items."
);

In a naive approach, the stage_external_sources operation could always add descriptions to external tables, when they are set in the source properties.

The following mock code changes might illustrate this approach:

Add column descriptions

create or replace external table {{source(source_node.source_name, source_node.name)}}
{%- if columns -%}(
{% for column in columns %}
{%- set column_quoted = adapter.quote(column.name) if column.quote else column.name %}
{{column_quoted}} {{column.data_type}} {{- ',' if not loop.last -}}
{%- endfor -%}
)

         create or replace external table {{source(source_node.source_name, source_node.name)}} 
                 {%- if columns -%}(
                 {% for column in columns %}
                     {%- set column_quoted = adapter.quote(column.name) if column.quote else column.name %}
+                    {%- set column_options = 'options(description = "' + column.description + '")' if 'description' in column else '' %}
-                    {{column_quoted}} {{column.data_type}} {{- ',' if not loop.last -}}
+                    {{column_quoted}} {{column.data_type}} {{column_options}} {{- ',' if not loop.last -}}
                 {%- endfor -%}
                 )

Add table description

options (
uris = [{%- for uri in uris -%} '{{uri}}' {{- "," if not loop.last}} {%- endfor -%}]
{%- if options is mapping -%}
{%- for key, value in options.items() if key != 'uris' %}
{%- if value is string -%}
, {{key}} = '{{value}}'
{%- else -%}
, {{key}} = {{value}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
)

         options ( 
             uris = [{%- for uri in uris -%} '{{uri}}' {{- "," if not loop.last}} {%- endfor -%}] 
+            {%- if 'description' in external %} 
+            , description = "{{external.description}}"
+            {%- endif -%} 
             {%- if options is mapping -%} 

Another approach would be to add the persist_docs option for sources in dbt-core first and then build the feature on top of it in this package. I'm just not quite sure, whether it makes sense to add this option to dbt-core, because dbt-core won't make use of it at all. The sole purpose would be to enable persistence for descriptions in the dbt-external-tables dbt package.

Do you have any thoughts on it? Just add descriptions always, or consider (currently not existing) config options?

@thomas-vl
Copy link
Contributor

I've started on a fix setting descriptions on columns in this PR:
#252

@katieclaiborne
Copy link
Author

Another approach would be to add the persist_docs option for sources in dbt-core first and then build the feature on top of it in this package. I'm just not quite sure, whether it makes sense to add this option to dbt-core, because dbt-core won't make use of it at all. The sole purpose would be to enable persistence for descriptions in the dbt-external-tables dbt package.

Do you have any thoughts on it? Just add descriptions always, or consider (currently not existing) config options?

@LoHertel, I agree with your instinct that persisting source table descriptions is specific to the dbt-external-tables package, and would lean toward building the capability here rather than in dbt-core. External source tables seem like an exception to the rule, in that they can be configured within and produced by dbt. All other source tables are simply taken as inputs.

@thomas-vl
Copy link
Contributor

My PR fixes it in the same way as it works on the Provider without additional configuration.

@dataders
Copy link
Collaborator

resolved by: #252

@katieclaiborne do you want to get the new beta release a try?

@katieclaiborne7
Copy link

@dataders, new handle, same me!

Apologies for the delay, but I've confirmed that external table column descriptions are persisted in BigQuery on version 0.9.0. Thanks again for this work!

@LoHertel
Copy link

LoHertel commented May 22, 2024

@dataders, awesome, that column descriptions for BigQuery are available now 🚀

This issue addresses table descriptions as well, which have not been introduced by #252.
Therefore, I would like to ask to reopen this issue, because it is not fully resolved.

By table description I mean the string, which describes a source table in a yaml property file:

version: 2

sources:
  - name: olist
    tables:
      - name: orders
        description: |
          Contains the order head information.
          Each order could contain multiple line items, which are stored in the order_items table.

        external:
          location: 'gs://gcs-name/orders.parquet/*'
          options:
            format: 'PARQUET'
            hive_partition_uri_prefix: 'gs://gcs-name/orders.parquet'
            require_hive_partition_filter: false
          partitions:
            - name: year
              data_type: INTEGER

        columns:
          - name: order_id
            description: Unique identifier of the order (primary key).
            data_type: STRING

It would be great, if the table description is added to the OPTIONS block at the end of the CREATE EXTERNAL TABLE statement for the BigQuery dialect:

CREATE EXTERNAL TABLE `project`.`dataset`.`orders` (
  order_id STRING OPTIONS(description = "Unique identifier of the order (primary key).")
)
WITH PARTITION COLUMNS (
  year INTEGER
)
OPTIONS (
  uris = ['gs://gcs-name/orders.parquet/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://gcs-name/orders.parquet',
  require_hive_partition_filter = false,
  description = 'Contains the order head information.\nEach order could contain multiple line items, which are stored in a seperate table.'
);

The stage_external_sources operation would need to check for the existence of a table description in the dbt source properties and append it to the OPTIONS SQL block.

Would this be a sensible way of implementing it?
I could contribute here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

5 participants