add schema to the default location root #339

Closed · wants to merge 8 commits
7 changes: 7 additions & 0 deletions .changes/unreleased/Breaking Changes-20221031-201109.yaml
@@ -0,0 +1,7 @@
kind: Breaking Changes
body: Add schema to the default location root
time: 2022-10-31T20:11:09.291461+01:00
custom:
Author: dan1elt0m JCZuurmond
Issue: "239"
PR: "339"
3 changes: 2 additions & 1 deletion dbt/include/spark/macros/adapters.sql
@@ -16,9 +16,10 @@

{% macro spark__location_clause() %}
{%- set location_root = config.get('location_root', validator=validation.any[basestring]) -%}
{%- set schema = model['schema'] -%}
{%- set identifier = model['alias'] -%}
{%- if location_root is not none %}
location '{{ location_root }}/{{ identifier }}'
location '{{ location_root }}/{{ schema }}/{{ identifier }}'
Contributor:

An interesting way to make this non-breaking would be to make this configurable:

Suggested change
location '{{ location_root }}/{{ schema }}/{{ identifier }}'
{%- set use_schema_in_location = config.get('use_schema_in_location_root') -%}
{%- if location_root is not none %}
{%- if use_schema_in_location is not none %}
location '{{ location_root }}/{{ schema }}/{{ identifier }}'
{% else %}
location '{{ location_root }}/{{ identifier }}'
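Illustratively, the difference between the two variants can be sketched in plain Python (the helper below is hypothetical, not part of dbt; it only mirrors how the macro would assemble the path):

```python
def build_location(location_root, schema, identifier, use_schema=True):
    """Sketch of how the location clause assembles a table path.

    With use_schema=True (the proposed default), the schema becomes a
    path segment between the root and the table identifier; with
    use_schema=False the old, schema-less layout is kept.
    """
    if use_schema:
        return f"{location_root}/{schema}/{identifier}"
    return f"{location_root}/{identifier}"

# Proposed behavior: schema included in the path
proposed = build_location("/mnt/root", "my_schema", "my_table")

# Opt-out keeps the pre-change layout
legacy = build_location("/mnt/root", "my_schema", "my_table", use_schema=False)
```

Here `proposed` is `/mnt/root/my_schema/my_table` and `legacy` is `/mnt/root/my_table`, matching the two branches of the suggested macro.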

Contributor:

Couldn't we also update config.get() to stuff the schema into the location root if it's present?

Contributor:

@JCZuurmond do you have an opinion on that?

JCZuurmond (Collaborator) · Dec 16, 2022:

Yes, I would like to have the location with the schema enabled by default.

Without the schema, users may overwrite each other's tables even when they have them in separate schemas, because they overwrite the underlying data via the location. This happened to colleagues of mine, and it was a hard bug to track down.

I discussed the non-breaking/breaking change with @jtcohen6. I recall we were ok with it, but I do not see that conversation back in the issue; maybe we had it over Slack.

In any case, the change is kind of a breaking change. The location is added when a user creates a new table: either because the data model is new (in the target schema), because the materialization is table, or when using --full-refresh. Existing incremental tables are not affected, and for the other options the tables are recreated anyway. Technically speaking, you could argue this is not (really) a breaking change.

My preference would be to enable the location with schema by default AND communicate clearly what has changed and why. It will be confusing if users do not know about or understand this change and see that some of their (older) tables have a location without the schema while others have a location with it. However, it will be even more confusing when multiple developers work together within the same warehouse and unknowingly interfere with each other's tables.

{%- endif %}
{%- endmacro -%}
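The collision JCZuurmond describes can be made concrete with a minimal sketch (plain Python, not dbt code; the schema and alias values are made up): two developers with separate schemas but the same `location_root` and table name resolve to the same storage path unless the schema is part of it.

```python
# Two models in different schemas, but sharing an alias and a root.
root = "/mnt/root"
dev_a = {"schema": "alice", "alias": "orders"}
dev_b = {"schema": "bob", "alias": "orders"}

# Old layout: the schema is ignored, so both models resolve to one
# path and silently overwrite each other's underlying data.
old_a = f"{root}/{dev_a['alias']}"
old_b = f"{root}/{dev_b['alias']}"

# New layout: the schema segment keeps the locations distinct.
new_a = f"{root}/{dev_a['schema']}/{dev_a['alias']}"
new_b = f"{root}/{dev_b['schema']}/{dev_b['alias']}"
```

With the old layout `old_a` and `old_b` are both `/mnt/root/orders` (the collision); with the new layout the paths differ, which is the motivation for making the schema segment the default.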

8 changes: 5 additions & 3 deletions tests/unit/test_macros.py
@@ -121,8 +121,9 @@ def test_macros_create_table_as_location(self):
template = self.__get_template('adapters.sql')

self.config['location_root'] = '/mnt/root'
self.default_context['model'].schema = 'my_schema'
sql = self.__run_macro(template, 'spark__create_table_as', False, 'my_table', 'select 1').strip()
self.assertEqual(sql, "create table my_table location '/mnt/root/my_table' as select 1")
self.assertEqual(sql, "create table my_table location '/mnt/root/my_schema/my_table' as select 1")

def test_macros_create_table_as_comment(self):
template = self.__get_template('adapters.sql')
@@ -142,16 +143,17 @@ def test_macros_create_table_as_all(self):
self.config['buckets'] = '1'
self.config['persist_docs'] = {'relation': True}
self.default_context['model'].description = 'Description Test'
self.default_context['model'].schema = 'my_schema'

sql = self.__run_macro(template, 'spark__create_table_as', False, 'my_table', 'select 1').strip()
self.assertEqual(
sql,
"create or replace table my_table using delta partitioned by (partition_1,partition_2) clustered by (cluster_1,cluster_2) into 1 buckets location '/mnt/root/my_table' comment 'Description Test' as select 1"
"create or replace table my_table using delta partitioned by (partition_1,partition_2) clustered by (cluster_1,cluster_2) into 1 buckets location '/mnt/root/my_schema/my_table' comment 'Description Test' as select 1"
)

self.config['file_format'] = 'hudi'
sql = self.__run_macro(template, 'spark__create_table_as', False, 'my_table', 'select 1').strip()
self.assertEqual(
sql,
"create table my_table using hudi partitioned by (partition_1,partition_2) clustered by (cluster_1,cluster_2) into 1 buckets location '/mnt/root/my_table' comment 'Description Test' as select 1"
"create table my_table using hudi partitioned by (partition_1,partition_2) clustered by (cluster_1,cluster_2) into 1 buckets location '/mnt/root/my_schema/my_table' comment 'Description Test' as select 1"
)