Add support for bigquery schema inference #259

jbergeskans

Description & motivation

Fixes #249 to support schema inference on BigQuery.

Currently, if the column names are defined in the sources file but the data types are not, the following SQL will be generated:

create or replace external table `my_project`.`my_dataset`.`some_table_name`(
        
        pk_col None,
        data_col None,
        loaded_date None)


with partition columns (
        loaded_date date) 

options (
        uris = ['gs://bucket/folder/*'], format = 'parquet', hive_partition_uri_prefix = 'gs://bucket/folder/')

For BigQuery to infer the schema, the column names and data types need to be omitted:

create or replace external table `my_project`.`my_dataset`.`some_table_name`

with partition columns (
        loaded_date date) 

options (
        uris = ['gs://bucket/folder/*'], format = 'parquet', hive_partition_uri_prefix = 'gs://bucket/folder/')

This is achieved by introducing the variable infer_schema. When it is set to true, the list of columns is not iterated over, so no column definitions are emitted.
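To illustrate the intended behaviour outside of Jinja, here is a minimal Python sketch. It is not the macro from this PR: the function name `render_external_table_ddl` and its parameters are hypothetical, and option values are assumed to be passed in already formatted as SQL literals. It only shows the branching described above, where the column list is skipped when `infer_schema` is true.

```python
def render_external_table_ddl(relation, columns, partitions, options, infer_schema=False):
    """Build a BigQuery `create or replace external table` statement (illustrative only).

    All names here are hypothetical; this mirrors the behaviour described in the
    PR text, not the actual dbt macro.
    """
    parts = [f"create or replace external table {relation}"]

    # Emit column definitions only when schema inference is off.
    # A column without a data_type renders as "None", which is the
    # invalid output this PR is meant to avoid.
    if columns and not infer_schema:
        col_defs = ",\n    ".join(f"{c['name']} {c.get('data_type')}" for c in columns)
        parts.append(f"(\n    {col_defs}\n)")

    # Partition columns are always declared explicitly.
    if partitions:
        part_defs = ",\n    ".join(f"{p['name']} {p['data_type']}" for p in partitions)
        parts.append(f"with partition columns (\n    {part_defs}\n)")

    # Option values are assumed to be pre-formatted SQL literals.
    opt_defs = ", ".join(f"{key} = {value}" for key, value in options.items())
    parts.append(f"options (\n    {opt_defs}\n)")
    return "\n".join(parts)


relation = "`my_project`.`my_dataset`.`some_table_name`"
columns = [{"name": "pk_col"}, {"name": "data_col"}]  # no data types given
partitions = [{"name": "loaded_date", "data_type": "date"}]
options = {"uris": "['gs://bucket/folder/*']", "format": "'parquet'"}

ddl_inferred = render_external_table_ddl(relation, columns, partitions, options, infer_schema=True)
ddl_explicit = render_external_table_ddl(relation, columns, partitions, options)
print(ddl_inferred)
```

With `infer_schema=True` the statement contains no column list, while the default path reproduces the `pk_col None` problem shown earlier.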
Example source file

sources:
  - name: my_source
    project: my_project
    dataset: my_dataset
    loader: dbt_external_tables
    loaded_at_field: loaded_date
    tables:
      - name: infered_table
        external:
          location: "gs://bucket/folder/*"
          infer_schema: true
          options:
            format: parquet
            hive_partition_uri_prefix: "gs://bucket/folder/"
          partitions:
            - name: loaded_date
              data_type: date
        columns:
          - name: id
            description: my id
          - name: value
            description: my value

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added an integration test for my fix/feature (if applicable)

@jbergeskans jbergeskans requested a review from jeremyyeo as a code owner March 7, 2024 10:26
@thomas-vl
Contributor

@jbergeskans why would you list the columns if you do not want to explicitly set them?
You can leave the columns array blank and achieve the same result?

@jbergeskans
Author

jbergeskans commented Mar 11, 2024

> @jbergeskans why would you list the columns if you do not want to explicitly set them? You can leave the columns array blank and achieve the same result?

We want to use other documentation features such as description, tests, and constraints. Basically, this allows us to omit the data type field which, when you're using parquet files, isn't needed anyway.

@thomas-vl
Contributor

thomas-vl commented Mar 11, 2024

> We want to use other documentation features such as description, tests, and constraints. Basically, this allows us to omit the data type field which, when you're using parquet files, isn't needed anyway.

To me this feels conflicting: you want to infer the schema automatically, but you also want to manually add the column names for documentation.

I see two problems with this setup:

  • The inferred schema in BigQuery might differ from the columns you put in manually, so the documentation no longer reflects reality.
  • What will happen when you apply a data test on a column that is not inferred in BigQuery because it has been removed from the parquet file?
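
Both concerns come down to drift between the documented column names and whatever BigQuery infers. As a rough sketch of how that drift could be detected, the hypothetical helper below compares two lists of column names. Neither the function nor its name is part of this PR; the inferred names are assumed to have been fetched separately (for example from the dataset's INFORMATION_SCHEMA.COLUMNS view), and the case-insensitive comparison is an assumption.

```python
def compare_documented_to_inferred(documented, inferred):
    """Report drift between documented and inferred column names.

    Hypothetical helper, not part of the package. Names are compared
    case-insensitively, which is an assumption of this sketch.
    """
    documented_set = {name.lower() for name in documented}
    inferred_set = {name.lower() for name in inferred}
    return {
        # Documented (and possibly tested) columns that the table no longer has.
        "missing_in_table": sorted(documented_set - inferred_set),
        # Columns BigQuery inferred that the sources file does not describe.
        "undocumented": sorted(inferred_set - documented_set),
    }


# Example: "value" was dropped from the parquet files, "loaded_date" is undocumented.
drift = compare_documented_to_inferred(["id", "value"], ["id", "loaded_date"])
print(drift)
```

A non-empty `missing_in_table` list is the case from the second bullet: a data test on such a column would reference a column that no longer exists in the table.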

@dataders dataders modified the milestone: 1.0.0 Apr 4, 2024
Development

Successfully merging this pull request may close these issues.

Add support to infer schemas on BigQuery
3 participants