Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-764] Persist Column level comments when creating views #372

Closed
shuaahvee opened this issue Jun 21, 2022 · 11 comments · Fixed by #893 or #931
Closed

[CT-764] Persist Column level comments when creating views #372

shuaahvee opened this issue Jun 21, 2022 · 11 comments · Fixed by #893 or #931
Assignees
Labels
enhancement New feature or request help_wanted Extra attention is needed

Comments

@shuaahvee
Copy link

Describe the feature

The persist_docs config works for tables by materializing a model as a table and then running ALTER TABLE statements to update the comment on each column. This method does not work for views because ALTER VIEW doesn't support adding comments. Instead, comments need to be defined when running the CREATE OR REPLACE VIEW statement. There should be an option to handle column level comments for models materialized as views.

I originally opened this issue in dbt-databricks but they told me to open it here instead

Describe alternatives you've considered

I tried creating a macro that adds the necessary comment clause and put it at the top of my model but the SQL gets inserted after the AS in CREATE OR REPLACE VIEW AS

Who will this benefit?

This will benefit anyone using column level comments on views in databricks

Are you interested in contributing this feature?

Yes! I've created a macro to do this. Attaching below:

{# This macro replaces the regular create view macro databricks uses to create views #}
{# The goal of this macro is to persist column level comments #}
{# In general, dbt will only add or update comments on columns for tables. It does this by creating the table and then running #}
{# alter table <tbl> change column <col> comment <comment> #}
{# But there's no corresponding option to add/update comments on columns for views. So column level comments don't get applied #}
{# However, it is possible to assign comments on view columns when creating the view by running #}
{# 'create view <view> (col_1 comment <comment>, col_2 comment <comment> ...)'. Call this extra sql the 'column_comment_clause' #}
{# This macro does just that- it edits the DDL run by dbt to include the column_comment_clause #}
{# The simple approach would be to edit the ddl used when creating views by just inserting the column_comment_clause, but this runs into some issues #}

{# Macro Breakdown: #}
{# 1. aggregate_cols, a macro that returns a list of columns and descriptions to reference when generating the column_comment_clause #}
{# 2. column_comment_clause, a macro that checks if the model has the persist_docs config set to True for columns and generates the column_comment_clause_sql #}
{# 3. comment_clause, a macro that is unchanged from the original create view macro. This generates the view level comment #}
{# 4. databricks__create_view_as_with_comments, the main macro we're 'overwriting'. This generates the 'create view ...' sql #}
{#    To help apply this macro more easily in dbt_project.yml, this macro first checks if the model is being materialized as a view #}
{#    Then, it inserts the column_comment_clause into the view ddl #}


{# This macro works by: #}
{# 1. Getting a list of columns in the model in the same order as the select clause #}
{#    The column_comment_clause must list the columns in the same order as the select clause, otherwise comments will be applied to the wrong column. #}
{#    In order to get the columns in the correct order, we run a 'DESCRIBE TABLE <tbl> and store the results in a variable, ddl_column_names.' #}
{# 2. Iterating over the column names in ddl_column_names adding them to a dict called ddl_column_dict_list that we'll reference in the column_comment_clause #}
{# 3. Getting a list of columns and their descriptions from a model's associated yml file using 'model.columns.values()' #}
{# 4. Searching for column name matches between the ddl_column_dict_list and model.columns.values() #}
{#    When we find a match, we update ddl_column_dict_list with the column description.  #}
{#    Doing this ensures that our columns and descriptions retain the correct ordering #}
{# 5. Iterating over ddl_column_dict_list and adding a generic 'missing description' message to any columns missing comments in the yml #}
{#    If we didn't do this then the ddl wouldn't compile since all columns need to have descriptions (descriptions can be blank) #}
{# 6. Checking that the model has the persist_docs config set to True and escaping single quotes in column desriptions #}
{# 7. Generating the create view ddl and adding the column_comment_clause #}
{# 8. Checking if the model is being materialized as a view. If it is then the ddl gets run, otherwise the macro exits #}
{#    We perform this check to make it easier to apply this macro in dbt_project.yml #}
{#    That way, even if some models in a path are being materialized as tables and some as views, this will only run when appropriate #}

{# Important Notes #}
{# This macro must run as a post hook. This is necessary to ensure any new or removed columns get included or excluded form the column_comment_clause#}
{# If the macro only ran as usual then the 'DESCRIBE TABLE...' query would only include the columns that were ALREADY in the model #}
{# It wouldn't account for columns you've added or removed, the resulting column_comment_clause would have the wrong number of columns, generating an error  #}
{# dbt has some weird syntax when calling macros that use {{this}} in post hooks. See https://github.com/dbt-labs/dbt-core/issues/2370 #}
{# As a result, the correct syntax to use is (quotes and whitespace included): " {{ databricks__create_view_as_with_comments(this, sql) }} "  #}

{# How to use this macro #}
{# 1. Add this sql script to your macros folder #}
{# 2. Add the post hook in the usage notes to either your model config or dbt_project.yml #}



{% macro aggregate_cols(sql) %}
    {% set ddl_column_dict_list = [] %}
    {# Step 1. generate DESCRIBE TABLE statement to get columns in same order as select clause  #}
    {% set existing_cols_query %}
      describe table {{this}}
    {% endset %}
    {% set col_query_results = run_query(existing_cols_query) %}
    {% if execute %}
      {# Return the view's column names and assign to variable  #}
      {% set ddl_column_names = col_query_results.columns[0].values() %}
    {% endif %}

    {# Step 2 #}
    {% for ddl_col in ddl_column_names %}
      {% do ddl_column_dict_list.append({"name":ddl_col, "description":""}) %}
    {% endfor %}
    {% for ddl_dict in ddl_column_dict_list %}
      {# Step 3 #}
      {% for yml_col in model.columns.values() %}
        {# Step 4 #}
        {% if ddl_dict["name"] == yml_col["name"] %}
          {% do ddl_dict.update({"description":yml_col["description"] }) %}
        {% endif %}
      {% endfor %}
      {# Step 5 #}
      {% if ddl_dict["description"] == "" %}
        {% do ddl_dict.update({"description":"No Description in yml" }) %}
      {% endif %}
    {% endfor %}
    {{return(ddl_column_dict_list)}}
{% endmacro %}



{% macro column_comment_clause() %}
  {# Step 6 #}
  {%- set raw_persist_docs = config.get('persist_docs', {}) -%}

  {%- if raw_persist_docs is mapping -%}
    {%- set raw_relation_column_comments = raw_persist_docs.get('columns', false) -%}
      {%- if raw_relation_column_comments -%}
      (
        {% for col in aggregate_cols(sql) %}
        {{col["name"] + ' comment ' + "'" + col["description"] | replace("'", "\\'") + "'" }}
        {% if not loop.last %},{% endif %}
        {% endfor %}
        ) 
        {% endif %}
    {%- elif raw_persist_docs -%}
      {{ exceptions.raise_compiler_error("Invalid value provided for 'persist_docs'. Expected dict but got value: " ~ raw_persist_docs) }}
    {% endif %}
{% endmacro %}


{% macro comment_clause() %}
  {%- set raw_persist_docs = config.get('persist_docs', {}) -%}
  {%- if raw_persist_docs is mapping -%}
    {%- set raw_relation = raw_persist_docs.get('relation', false) -%}
      {%- if raw_relation -%}
      comment '{{ model.description | replace("'", "\\'") }}'
      {% endif %}
  {%- elif raw_persist_docs -%}
    {{ exceptions.raise_compiler_error("Invalid value provided for 'persist_docs'. Expected dict but got value: " ~ raw_persist_docs) }}
  {% endif %}
{%- endmacro -%}



{% macro databricks__create_view_as_with_comments(relation, sql) -%}
  {# Step 7 #}
  {%set model_materialization = config.get('materialized') %}
  {%set sql %}
    create or replace view {{ relation }}
    {{ column_comment_clause() }}
    {{ comment_clause() }}
    as 
    {{sql}}
  {% endset %}
  {# Step 8 #}
  {% if model_materialization == 'view' %}
    {% do run_query(sql) %}
  {% else %}
  {%endif%}
{% endmacro %}
@shuaahvee shuaahvee added enhancement New feature or request triage labels Jun 21, 2022
@github-actions github-actions bot changed the title Persist Column level comments when creating views [CT-764] Persist Column level comments when creating views Jun 21, 2022
@ValdarT
Copy link

ValdarT commented Aug 17, 2022

Perhaps it should be classified as a "bug" rather than an "enhancement" because there is currently no word in the docs about this limitation and I'd argue that it is quite unexpected for the user (it took my quite a while to figure out that it's not me doing something wrong and arrive here).

@lostmygithubaccount lostmygithubaccount removed their assignment Aug 23, 2022
@ValdarT
Copy link

ValdarT commented Oct 19, 2022

Any plans or timelines that could be shared, @jtcohen6?

@jtcohen6
Copy link
Contributor

Sorry for the delay in following up here!

I think we'd need to implement this similar to how we implemented persist_docs for views on Snowflake in dbt-snowflake==1.0. On that platform, we also don't have the ability to run alter / comment statements after a view is created. Instead, we need to run the model SQL first, with where false limit 0, just to introspect the column schema of the query result, and then provide that full column schema plus user-provided descriptions to the actual create view as statement.

Relevant code is here. The specific entry point within the create_view_as macro looks like:

    {% set query_columns = get_columns_in_query(sql) %}
    {{ get_persist_docs_column_list(model_columns, query_columns) }}

I'm going to mark this as help_wanted, since I think this could be contributed by a community member, following that same basic implementation. In the meantime, I have updated the docs to reflect that this is not supported (dbt-labs/docs.getdbt.com#1666).

@jtcohen6 jtcohen6 added help_wanted Extra attention is needed and removed triage labels Oct 19, 2022
@jtcohen6 jtcohen6 removed their assignment Oct 19, 2022
@ValdarT
Copy link

ValdarT commented Oct 19, 2022

Wouldn't it be better to run a DESCRIBE SELECT ... for the introspection purposes?

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 18, 2023
@ValdarT
Copy link

ValdarT commented Apr 21, 2023

Commenting so the issue wouldn't be closed. Btw, on Databricks I see some field descriptions synced while others not now 🤷.

@github-actions github-actions bot removed the Stale label Apr 22, 2023
@CICDamen
Copy link

CICDamen commented May 26, 2023

We have several column descriptions that are persisted from other table models upstream in the pipeline. This is nice since we don't need to rewrite the same definitions in multiple yml files. When using the macro suggested in this issue, I don't think these descriptions persist. Also something to take into account in the final solution I guess.

@jurasan
Copy link
Contributor

jurasan commented Aug 10, 2023

created PR(#866) according to @jtcohen6 suggestion

@mikealfare
Copy link
Contributor

Re-opening until we confirm this is released in all places.

@mikealfare mikealfare reopened this Nov 30, 2023
@mikealfare
Copy link
Contributor

Closing; this has been backported to v1.5 and released to PyPI.

@tim-steinkuhler
Copy link
Contributor

@mikealfare, just a quick note the documentation website still says views are not supported: https://docs.getdbt.com/reference/resource-configs/persist_docs

image

colin-rogers-dbt added a commit to dbt-labs/docs.getdbt.com that referenced this issue Feb 28, 2024
Update persist_docs.md to reflect fixed databricks persist column comments

dbt-labs/dbt-spark#372 (comment)
mirnawong1 added a commit to dbt-labs/docs.getdbt.com that referenced this issue Mar 6, 2024
Update persist_docs.md to reflect fixed databricks persist column
comments

dbt-labs/dbt-spark#372 (comment)

## What are you changing in this pull request and why?
<!---
Describe your changes and why you're making them. If related to an open 
issue or a pull request on dbt Core, then link to them here! 

To learn more about the writing conventions used in the dbt Labs docs,
see the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md).
-->

## Checklist
<!--
Uncomment when publishing docs for a prerelease version of dbt:
- [ ] Add versioning components, as described in [Versioning
Docs](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-entire-pages)
- [ ] Add a note to the prerelease version [Migration
Guide](https://github.com/dbt-labs/docs.getdbt.com/tree/current/website/docs/docs/dbt-versions/core-upgrade)
-->
- [ ] Review the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
so my content adheres to these guidelines.
- [ ] For [docs
versioning](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#about-versioning),
review how to [version a whole
page](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version)
and [version a block of
content](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-blocks-of-content).
- [ ] Add a checklist item for anything that needs to happen before this
PR is merged, such as "needs technical review" or "change base branch."

Adding or removing pages (delete if not applicable):
- [ ] Add/remove page in `website/sidebars.js`
- [ ] Provide a unique filename for new pages
- [ ] Add an entry for deleted pages in `website/vercel.json`
- [ ] Run link testing locally with `npm run build` to update the links
that point to deleted pages
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help_wanted Extra attention is needed
Projects
None yet
8 participants