Add support for bigquery schema inference #259

jbergeskans · 2024-03-07T10:26:11Z

Description & motivation

Fixes #249 to support schema inference on BigQuery.

Currently, if the column names are defined in the sources file but the data types are not, the following SQL will be generated:

create or replace external table `my_project`.`my_dataset`.`some_table_name`(
        
        pk_col None,
        data_col None,
        loaded_date None)


with partition columns (
        loaded_date date) 

options (
        uris = ['gs://bucket/folder/*'], format = 'parquet', hive_partition_uri_prefix = 'gs://bucket/folder/')

For BigQuery to infer the schema, the column names and data types need to be omitted:

create or replace external table `my_project`.`my_dataset`.`some_table_name`

with partition columns (
        loaded_date date) 

options (
        uris = ['gs://bucket/folder/*'], format = 'parquet', hive_partition_uri_prefix = 'gs://bucket/folder/')

This is achieved by introducing the variable infer_schema. When it is set to true, the list of columns is not iterated over when the SQL is generated, so no column names or data types are emitted.

Example source file

sources:
  - name: my_source
    project: my_project
    dataset: my_dataset
    loader: dbt_external_tables
    loaded_at_field: loaded_date
    tables:
      - name: infered_table
        external:
          location: "gs://bucket/folder/*"
          infer_schema: true
          options:
            format: parquet
            hive_partition_uri_prefix: "gs://bucket/folder/"
          partitions:
            - name: loaded_date
              data_type: date
        columns:
          - name: id
            description: my id
          - name: value
            description: my value
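
The effect of the flag can be sketched in Python. This is only an illustration of the described behaviour, not the package's actual Jinja macro: the function name `render_external_table_ddl` and its parameters are hypothetical, and the exact whitespace of the real output may differ.

```python
def render_external_table_ddl(
    relation: str,
    columns: list[dict],
    partitions: list[dict],
    options: dict[str, str],
    uris: list[str],
    infer_schema: bool = False,
) -> str:
    """Build a `create or replace external table` statement (hypothetical helper)."""
    parts = [f"create or replace external table {relation}"]

    # With infer_schema enabled the column list is skipped entirely, so
    # BigQuery reads column names and types from the files themselves.
    if not infer_schema and columns:
        # A missing data_type renders as the literal text "None",
        # which mirrors the invalid SQL shown in the description.
        col_lines = ",\n".join(
            f"    {col['name']} {col.get('data_type')}" for col in columns
        )
        parts[0] += f"(\n{col_lines})"

    if partitions:
        part_lines = ",\n".join(
            f"    {p['name']} {p['data_type']}" for p in partitions
        )
        parts.append(f"with partition columns (\n{part_lines})")

    option_items = [f"uris = [{', '.join(repr(u) for u in uris)}]"]
    option_items += [f"{key} = {value!r}" for key, value in options.items()]
    parts.append(f"options (\n    {', '.join(option_items)})")

    return "\n\n".join(parts)


# Values taken from the example source file above.
columns = [
    {"name": "id", "description": "my id"},
    {"name": "value", "description": "my value"},
]
partitions = [{"name": "loaded_date", "data_type": "date"}]
options = {
    "format": "parquet",
    "hive_partition_uri_prefix": "gs://bucket/folder/",
}

ddl = render_external_table_ddl(
    "`my_project`.`my_dataset`.`infered_table`",
    columns,
    partitions,
    options,
    uris=["gs://bucket/folder/*"],
    infer_schema=True,
)
print(ddl)
```

Calling the same helper with `infer_schema=False` emits `id None` and `value None` in the column list, which is the failure mode this change avoids.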

Checklist

I have verified that these changes work locally
I have updated the README.md (if applicable)
I have added an integration test for my fix/feature (if applicable)

thomas-vl · 2024-03-08T19:50:43Z

@jbergeskans why would you list the columns if you do not want to explicitly set them?
You can leave the columns array blank and achieve the same result?

jbergeskans · 2024-03-11T13:00:25Z

> @jbergeskans why would you list the columns if you do not want to explicitly set them? You can leave the columns array blank and achieve the same result?

We want to use other documentation features such as description, tests, and constraints. Basically, this allows us to omit the data type field which, when you're using parquet files, isn't needed anyway.

thomas-vl · 2024-03-11T14:05:57Z

> We want to use other documentation features such as description, tests, and constraints. Basically, this allows us to omit the data type field which, when you're using parquet files, isn't needed anyway.

To me this feels contradictory: you want to infer the schema automatically, but you also want to add the column names manually for documentation.

I see two problems with this setup:

The inferred schema in BigQuery might differ from the columns you put in manually, so the documentation no longer reflects reality.
What will happen when you apply a data test on a column that is not inferred in BigQuery because it has been removed from the parquet file?
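
Both problems come down to the documented column names and the inferred column names drifting apart. A minimal Python sketch of that comparison follows; the function name and the two sets of column names are made up for illustration and do not come from a real table.

```python
def documentation_drift(documented: set[str], inferred: set[str]) -> dict[str, set[str]]:
    """Compare documented column names with the names BigQuery inferred."""
    return {
        # Documented (and possibly tested) columns that no longer exist.
        "documented_but_missing": documented - inferred,
        # Columns present in the files but absent from the documentation.
        "inferred_but_undocumented": inferred - documented,
    }


documented = {"id", "value"}      # names listed in the sources file
inferred = {"id", "loaded_date"}  # hypothetical result after "value" was dropped

drift = documentation_drift(documented, inferred)
print(drift)
```

A data test attached to a name in `documented_but_missing` would reference a column the external table no longer exposes.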

Jesper Bergeskans added 2 commits January 17, 2024 17:05

Adding a var to allow for schema inference and checking it when generating SQL

d2ac49c

Merge branch 'main' into feature/allow-bigquery-schema-inference

2a744db

jbergeskans requested a review from jeremyyeo as a code owner March 7, 2024 10:26

dataders modified the milestone: 1.0.0 Apr 4, 2024

Merge branch 'main' into feature/allow-bigquery-schema-inference

61ea7d1

dataders had a problem deploying to ci_testing April 4, 2024 21:08 — with GitHub Actions Error

dataders had a problem deploying to ci_testing April 4, 2024 21:08 — with GitHub Actions Failure

dataders had a problem deploying to ci_testing April 4, 2024 21:08 — with GitHub Actions Error

dataders had a problem deploying to ci_testing April 4, 2024 22:35 — with GitHub Actions Error

dataders had a problem deploying to ci_testing April 4, 2024 22:35 — with GitHub Actions Failure

Merge branch 'main' into feature/allow-bigquery-schema-inference

13e70c1

dataders temporarily deployed to ci_testing April 5, 2024 01:37 — with GitHub Actions Inactive

dataders mentioned this pull request Apr 10, 2024

Add support to infer schemas on BigQuery #249

Closed

jbergeskans closed this Apr 11, 2024
