Add support to infer schemas on BigQuery #249
Comments
Willing to create a PR for this, assuming the fix is as straightforward as it seems.
I second this feature!
For me this feels very conflicting: you want to infer the schema automatically, but you also want to manually add the column names for documentation. I see two problems with this setup: the inferred schema in BigQuery might be different from the columns you put in manually, so the documentation no longer reflects reality.
@jbergeskans I agree that manually specifying column names and data types can be a pain! I looked at #259, but the implementation is different from Snowflake's. For dbt-snowflake's version, if
I agree with @thomas-vl and am conflicted about your implementation. However, I'm not in love with the precedent that is Snowflake's implementation either. @jbergeskans, am I right in guessing that what you are not looking for is something that would allow for schema evolution of an external table, as in: if the actual external files change, then dbt shouldn't flinch and should just make the table accordingly?
My gut tells me that we DON'T want auto-schema evolution -- data changing silently should not be a feature, right? I've also opened dbt-labs/dbt-codegen#173; this feels like the "right solution"™️ to me.
@dataders fully agree that silently changing the schema is not something that should ever be allowed. And yes, I would be happy enough if this were generated for me so that I could easily go in and add tests, apply policy tags, and add constraints where needed. In this case, what we had was a spreadsheet with column names and their descriptions, so I wrote a script to generate the YAML file based on the spreadsheet, and we ran into this issue. But if I could have gotten the field names and data types generated for me, I'd 100% take that.
Describe the feature
Allow the BigQuery plugin to infer schemas in the case that the user has chosen to document the external table for dbt docs (descriptions, constraints, etc.) but does not wish to also add data types for all columns.
Describe alternatives you've considered
Additional context
Currently, if a user adds column descriptions to a sources file, the package will generate a table creation query using the column information available in the sources file. If no data_type is set, the creation of the table will fail, because the generated query contains column names with no data types after them.
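As a rough illustration (the project, dataset, table, column, and bucket names below are invented for the example), the generated DDL ends up looking something like this, which BigQuery rejects as invalid syntax:

```sql
-- hypothetical rendering when columns are documented but have no data_type:
create or replace external table `my-project`.`my_dataset`.`orders` (
    order_id ,      -- no type follows the column name
    customer_name   -- no type follows the column name
)
options (
    format = 'CSV',
    uris = ['gs://my-bucket/orders/*.csv']
);
```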
This happens because of the code block in the BigQuery plugin that renders the column list.
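That logic works roughly like the sketch below (a paraphrase, not a verbatim copy of the BigQuery create-external-table macro; variable names may differ from the real code). Each documented column is rendered as its name followed by its data_type, so an undefined data_type leaves a hole in the DDL:

```sql
-- Rough paraphrase of the column-rendering logic in the BigQuery plugin:
create or replace external table {{ source(source_node.source_name, source_node.name) }}
{%- if columns %} (
    {%- for column in columns %}
    {{ column.name }} {{ column.data_type }}{%- if not loop.last %},{% endif %}
    {%- endfor %}
)
{%- endif %}
-- ... options (uris, format, etc.) clause follows here
```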
This solution is already implemented in the Snowflake plugin, and I don't see that the BigQuery fix needs to be more complicated than allowing an infer_schema property to be passed in and checking its truthiness, as sketched below.
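Purely as an illustration of that idea (the external.infer_schema lookup is an assumption here, mirroring the Snowflake plugin, not an existing option in the BigQuery plugin), the column-rendering block above could be guarded roughly like this so that documented columns are skipped when the user asks for inference:

```sql
-- Hypothetical sketch: omit the explicit column list when infer_schema is truthy,
-- letting BigQuery detect the schema from the external files instead.
{%- if columns and not external.infer_schema %} (
    {%- for column in columns %}
    {{ column.name }} {{ column.data_type }}{%- if not loop.last %},{% endif %}
    {%- endfor %}
)
{%- endif %}
```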
Who will this benefit?
This will allow users to infer the schema while still keeping descriptions and constraints in the dbt docs.