Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema to the default location root #239

Closed
JCZuurmond opened this issue Oct 22, 2021 · 9 comments
Closed

Add schema to the default location root #239

JCZuurmond opened this issue Oct 22, 2021 · 9 comments
Labels
enhancement New feature or request good_first_issue Good for newcomers Stale

Comments

@JCZuurmond
Copy link
Collaborator

JCZuurmond commented Oct 22, 2021

Describe the feature

Add the schema to the location root by default.

As the documentation says:

Note that this logic [the generate_schema_name macro] is designed so that two dbt users won't accidentally overwrite each other's work by writing to the same schema.

I like this behavior. In the case of Spark, if you decide to materialize your tables as files (which you commonly do) then it might be that two users do not overwrite eachothers table definitions as these have separate schemas (or databases in spark). However, the users do overwrite the underlying data of eachothers tables.

I think this is unexpected behavior, which we can avoid by adding the schema to the location root - by default.

Describe alternatives you've considered

Overwriting the macro myself.

Additional context

The location-root macro.

Who will this benefit?

Teams where people develop on dbt-spark concurrently. Unexpected behavior will happen when you unknowingly overwrite the data of each others tables.

Are you interested in contributing this feature?

This is the macro I use:

{% macro location_clause() %}
  {%- set location_root = config.get('location_root', validator=validation.any[basestring]) -%}
  {%- set schema = model['schema'] -%}
  {%- set identifier = model['alias'] -%}
  {%- set file_format = config.get('file_format', validator=validation.any[basestring]) -%}
  {%- if location_root is not none %}
  location '{{ location_root }}/schema={{ schema }}/model={{ identifier }}/file_format={{ file_format }}/'
  {%- endif %}
{%- endmacro -%}

We could choose something like:

{% macro location_clause() %}
  {%- set location_root = config.get('location_root', validator=validation.any[basestring]) -%}
  {%- set schema = model['schema'] -%}
  {%- set identifier = model['alias'] -%}
  {%- if location_root is not none %}
  location '{{ location_root }}/{{ schema }}/{{ identifier }}
  {%- endif %}
{%- endmacro -%}
@JCZuurmond JCZuurmond added enhancement New feature or request triage labels Oct 22, 2021
@jtcohen6 jtcohen6 removed the triage label Oct 22, 2021
@jtcohen6
Copy link
Contributor

@JCZuurmond I'm super supportive of this change. It might be breaking for some users, so we'll just want to call it out as such in the changelog / release notes, and wait for the next minor version to release — which will be v1.0 :)

I don't have a strong preference between the two options you've proposed:

'{{ location_root }}/schema={{ schema }}/model={{ identifier }}/
'{{ location_root }}/{{ schema }}/{{ identifier }}

The former is closer to standard Hive partitioning layout, right? If we're already changing the default, we may as well pick an approach that will serve us well for some time to come.

@jtcohen6 jtcohen6 added the good_first_issue Good for newcomers label Oct 22, 2021
@JCZuurmond
Copy link
Collaborator Author

JCZuurmond commented Oct 27, 2021

The <name>=<value> syntax is used for column partitions. As these are not partitioned columns it might be confusing to some users. I have that syntax because I like the explicitness of it, no need for guessing what the schema or model is.

@guillesd what is your take on this issue?

@JCZuurmond
Copy link
Collaborator Author

forgot to mention, I would choose: '{{ location_root }}/{{ schema }}/{{ identifier }}. I think this is what most users would expect.

@guillesd
Copy link

guillesd commented Oct 28, 2021

Hey chicos! Nice indeed to get this going. I'm not sure which of the two options I'd go for. If we choose your preferred option @JCZuurmond, then it'd look something like this for partitioned tables: '{{ location_root }}/{{ schema }}/{{ identifier }}/partition_column=partition_value'. Is this ok or confusing for the user that there is this change of syntax? No strong opinion though!

@JCZuurmond
Copy link
Collaborator Author

I think the other option {{ location_root }}/schema={{ schema }}/identifier={{ identifier }}/partition_column=partition_value is more confusing, as it suggests schema and identifier are columns.

@jtcohen6
Copy link
Contributor

@JCZuurmond Ah, that's a really good point. I think you've clinched it for me: '{{ location_root }}/{{ schema }}/{{ identifier }} feels right.

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 27, 2022
@jtcohen6 jtcohen6 removed the Stale label May 3, 2022
@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@JCZuurmond
Copy link
Collaborator Author

PR #339 is still open to solve this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good_first_issue Good for newcomers Stale
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants