Allow Bigquery Emulator settings to be set #1017

Open
wants to merge 26 commits into
base: main

@OTooleMichael OTooleMichael commented Nov 10, 2023

resolves #358
docs dbt-labs/docs.getdbt.com/#

This adds an optional api_endpoint setting. It is the supported BigQuery way of overriding the HTTP endpoint, and it is needed to connect to an emulator or proxy, much as is already possible with Snowflake. Issue #358 references this.

Problem

  • Using a BigQuery emulator is useful in local development, but it cannot currently be configured through the existing settings
  • Setting the api_endpoint is also useful for security and proxying

Solution

This adds one key to the config and, when that key is set, passes it through as a connection option.
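
As a rough illustration of that idea (not the actual diff in this PR), the sketch below turns an optional `api_endpoint` field into client keyword arguments only when it is set. `EmulatorAwareCredentials` and `client_kwargs` are hypothetical names, and the port `9050` is an assumed local emulator port. With the real library the result would typically be passed on as `google.cloud.bigquery.Client(**kwargs)`; that call is left out so the snippet runs without any Google dependency.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class EmulatorAwareCredentials:
    """Hypothetical stand-in for the adapter's credentials/profile object."""

    project: str
    # Optional override, e.g. a local emulator or an in-house proxy (assumed URL).
    api_endpoint: Optional[str] = None


def client_kwargs(creds: EmulatorAwareCredentials) -> Dict[str, Any]:
    """Build keyword arguments for a BigQuery client.

    The endpoint override is only included when the profile sets it, so
    profiles without the key produce the same arguments as before.
    """
    kwargs: Dict[str, Any] = {"project": creds.project}
    if creds.api_endpoint:
        kwargs["client_options"] = {"api_endpoint": creds.api_endpoint}
    return kwargs


default = client_kwargs(EmulatorAwareCredentials(project="my-project"))
local = client_kwargs(
    EmulatorAwareCredentials(project="my-project", api_endpoint="http://localhost:9050")
)
print(default)
print(local)
```

Keeping the key optional means a profile that never mentions `api_endpoint` is unaffected, which is the backwards-compatibility property the description relies on.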

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@OTooleMichael OTooleMichael requested a review from a team as a code owner November 10, 2023 15:57

cla-bot bot commented Nov 10, 2023

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR.

CLA has not been signed by users: @OTooleMichael

@OTooleMichael
Author

Does anything need to be done to retrigger now that the CLA is signed?

@cla-bot cla-bot bot added the cla:yes label Nov 11, 2023
@orlevii

orlevii commented Nov 14, 2023

I need the same functionality and had started to implement the same feature. I would consider adding support for AnonymousCredentials as well, since that is what bigquery-emulator suggests.

See my draft:
https://github.com/dbt-labs/dbt-bigquery/pull/1027/files#diff-e4d9ba3b4b6c6c5709431db14344ec0e23226f700e9250f819247ffbb6b112acR360-R361

@OTooleMichael
Author

> I need the same functionality, I started to implement the same feature, I would consider adding support for AnonymousCredentials as well (As it's the way big-query emulator suggests)
>
> See my draft: https://github.com/dbt-labs/dbt-bigquery/pull/1027/files#diff-e4d9ba3b4b6c6c5709431db14344ec0e23226f700e9250f819247ffbb6b112acR360-R361

Hey @orlevii - I saw that.
Supporting AnonymousCredentials seems like a larger API change for dbt (although in the end I'd imagine both are pretty small).

Furthermore, the BigQuery emulator works happily with any credentials; at the end of the day it just ignores them. I think its demo code suggests AnonymousCredentials because, in a vacuum where one must pick a credential type, that makes the most sense (dbt, though, already has the other auth methods implemented).

Beyond that, the emulator you mentioned wouldn't be the only target (and is not my only one), so if that is needed it could be a follow-up PR.

Hopefully the team reviews the PR, and I can add it or not according to their preferences, whichever gets it merged quickest. :)

@CyberHippo

Hi, I would love to see this merged!

@mesmacosta

Hi, I've got a really similar use case and am looking forward to getting this merged!

@jtcohen6
Contributor

jtcohen6 commented Mar 22, 2024

@OTooleMichael Thanks for the PR! @MichelleArk and I tried taking this for a spin alongside goccy/bigquery-emulator.

We found a few issues while using the two together:

  1. Creating schemas: bigquery-emulator does not support creating schemas via StandardSQL (see goccy/bigquery-emulator#167, "Failed to execute 'CreateSchema' statements"), only via the Python client method (create_dataset). A few years ago we switched dbt-bigquery to StandardSQL for schema creation instead of the client method (#183, "Try using SQL for create_schema").
  2. Uploading seeds: It looks like dbt tries creating a table with UNKNOWN type. I suspect one of the client methods required for BQ seed uploading doesn't work as expected. (Maybe it would work to provide data types for the seeds explicitly; we didn't get a chance to try this.)
  3. Getting post-query metadata: After dbt builds a table or runs a select statement, it asks BigQuery for the number of rows produced via client.get_table. This doesn't seem to be supported by the emulator:
  File "/Users/michelleark/.asdf/installs/python/3.11.0/lib/python3.11/site-packages/dbt/adapters/base/impl.py", line 347, in execute
    return self.connections.execute(sql=sql, auto_begin=auto_begin, fetch=fetch, limit=limit)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/michelleark/src/dbt-bigquery/dbt/adapters/bigquery/connections.py", line 549, in execute
    query_table = client.get_table(query_job.destination)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/michelleark/.asdf/installs/python/3.11.0/lib/python3.11/site-packages/google/cloud/bigquery/client.py", line 1077, in get_table
    path = table_ref.path
           ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'path'

I don't think (1) + (2) are hard blockers — we could manually create the schema/dataset, and we just avoided using seeds — but I do think (3) makes it impossible to use dbt with the emulator we tried.
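
The traceback in (3) suggests that `query_job.destination` is `None` when running against the emulator, and that `client.get_table` then dereferences it. A minimal sketch of a defensive guard follows. `FakeTable`, `FakeJob`, and `FakeClient` are stand-ins invented so the snippet runs without BigQuery, `safe_num_rows` is a hypothetical helper, and none of this reflects how dbt-bigquery currently behaves.

```python
from typing import Optional


class FakeTable:
    """Stand-in for a BigQuery table object exposing a row count."""

    def __init__(self, num_rows: int) -> None:
        self.num_rows = num_rows


class FakeJob:
    """Stand-in for a finished query job; the destination may be missing."""

    def __init__(self, destination: Optional[str]) -> None:
        self.destination = destination


class FakeClient:
    """Stand-in client that returns a fixed table for any reference."""

    def get_table(self, table_ref: str) -> FakeTable:
        return FakeTable(num_rows=3)


def safe_num_rows(client: FakeClient, job: FakeJob) -> Optional[int]:
    """Return the row count, or None when the backend reports no destination."""
    if job.destination is None:
        # An emulator or proxy may not populate this field; skip the lookup
        # instead of passing None into get_table.
        return None
    return client.get_table(job.destination).num_rows


client = FakeClient()
with_destination = safe_num_rows(client, FakeJob("project.dataset.table"))
without_destination = safe_num_rows(client, FakeJob(None))
print(with_destination, without_destination)
```

Whether a guard like this belongs in the adapter or the gap should be closed in the emulator is exactly the open question raised below.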

Questions:

  • Are you using goccy/bigquery-emulator, or a different emulator? We haven't done extensive research, but that one seems to be the most feature rich, well maintained, and widely used.
  • Does it make sense to merge this PR if there aren't any known emulators that we can document supporting? My current inclination is no, though I'm open to hearing disagreement. If there are a handful of backwards-compatible changes we could make to the dbt-bigquery adapter that would fix the issues outlined above (1+2, maybe 3) to work around the limitations in the existing emulators, we could be open to that, but not if the change-set risks introducing regressions to the standard functionality with real Google BigQuery. If there are a set of changes to make within the emulator so that they more closely mirror the real BQ APIs, that would feel even better.

@OTooleMichael
Author

@jtcohen6 and @MichelleArk, thank you both sincerely for taking the time to review the PR, and I apologise for the lingering issue with one of the unit tests. Rest assured, I'll address it promptly.

In response to your queries:

Overall, I believe the PR aligns with our objectives and should be merged. There may be a slight misunderstanding about its purpose. :) This PR enables users to point dbt at an emulator; how well that emulator works isn't within dbt's purview. By analogy, if BigQuery itself were to malfunction, say by computing SUM() incorrectly, that would not be for dbt to fix. As another example, already possible today, a dbt user on Postgres can opt for an in-memory PG emulator by changing the host, regardless of how complete its functionality is (it is often limited). If an emulator fails to replicate BigQuery's behaviour, that is an issue for the emulator's developers to address.

Moreover, there are additional use cases to consider. My primary motivation was facilitating an in-house emulator and proxy setup. Both of these are made achievable with minimal effort through this PR. In my view, the internals of non-DBT server elements don't fall under the direct responsibility of the DBT team.

I've developed a parser that validates SQL after Jinja processing, identifying references and syntax errors without direct access to BigQuery or any data movement. This significantly speeds up CI/CD and often removes the need for a live connection. At the moment, the Snowflake dbt connector's endpoint override serves as a workaround: the CI profile is set to the Snowflake dialect and the endpoint is pointed at the emulator server, which then does the extra work of translating dbt's queries back from the Snowflake dialect to BigQuery before starting its real validation work.

In a previous project involving Snowflake, I employed similar techniques using in-house SQLFluff rules for security and design linting post-Jinja processing.

Additionally, various proxying use cases emerge, where a server intermediates requests to and from BigQuery, implementing checks for deprecation warnings, security, permissions, and monitoring. For instance:

  • Dynamically migrating column references in-flight based on business rules.
  • Enforcing security and permissions beyond BigQuery's capabilities.
  • Implementing in-flight query approvals or just-in-time decryption designs.
  • Enforcing query pattern checks for cost or security reasons.

In essence, there are numerous reasons to redirect queries to different endpoints, applicable across CI/CD, development, and production environments. Some are directly related to DBT, while others are broader system requirements. Simplifying wider CI setups by patching a single ENV variable (e.g., BQ_URL) for the entire system, including DBT, Airflow, etc., underscores the versatility and value of this PR.
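
The single-environment-variable idea could look roughly like the sketch below. `BQ_URL` is the example name used above; `resolve_bq_endpoint` is a hypothetical helper, and falling back to `https://bigquery.googleapis.com` (the public BigQuery API host) when the variable is unset is an assumption about the desired default.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Public BigQuery API host, used when no override is configured (assumed default).
DEFAULT_BQ_ENDPOINT = "https://bigquery.googleapis.com"


def resolve_bq_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the endpoint every component (dbt, Airflow, ...) should talk to."""
    env = os.environ if env is None else env
    url = env.get("BQ_URL", "").strip()
    if not url:
        return DEFAULT_BQ_ENDPOINT
    parsed = urlparse(url)
    # Reject values such as "localhost" that lack a scheme or host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"BQ_URL is not a valid http(s) URL: {url!r}")
    return url


print(resolve_bq_endpoint({}))
print(resolve_bq_endpoint({"BQ_URL": "http://localhost:9050"}))
```

Passing the environment in as a mapping keeps the helper easy to test; production code would call it with no argument and read `os.environ`.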

I'm happy to hop on a call / go through more examples / code if needed

@MartinSahlen

@jtcohen6 @MichelleArk I'm disappointed that we have not seen this merged, or at least a proper response to @OTooleMichael's well-written reply. There seems to be enough desire from the community to get this one through, and as already pointed out there are numerous reasons why a proxy would make sense. For connectors that need the hostname specified we can already do this, so it's hard to appreciate the arguments against it.

@kyungsoochoi984

Hello, our team needs this feature to run dbt unit tests locally against a BigQuery emulator. We hope this feature will be released soon!

@nrushforth

Hi, Any update on this? When will it be available? We have a use case for using the emulator and are keen to get the updated adapter to allow this to happen.

@MartinSahlen

> Hi, Any update on this? When will it be available? We have a use case for using the emulator and are keen to get the updated adapter to allow this to happen.

Judging by the tumbleweeds on the dbt team's side, I guess we can only hope. But it is getting a bit weird at this stage, I have to say.

@jtcohen6
Contributor

jtcohen6 commented Dec 3, 2024

Hey all - I just responded to @OTooleMichael in the dbt Community Slack yesterday:

It's been on my (post-Coalesce) list to get back to you on this one. I haven't forgotten, but I do appreciate the extra ping :)

My previous response to you was: "We tried to get this working with the BQ emulator we know about, and we weren't able to in the way that we expected!" I tried taking it for another spin by myself last Thursday (while the Americans were offline 🦃), but I got stuck along the way. Michelle & I had also identified some work we'd need to do on our end, before we can merge your PR, to make sure this is copacetic from a dbt Cloud security standpoint — but it just hasn't been a priority without an actual full-fledged use case.

To be clear, I'm not at all opposed in principle to adding this functionality (local emulator support) to dbt. But in practice, I don't like situations where we wind up saying: "We'll merge this, and we'll commit to maintaining the functionality going forward... but we won't actually document that this functionality exists, because we can't actually show you how to get it working out-of-the-box in the way that someone using dbt would expect, without a lot of extra wraparound work."
It sounds like you (+ others) have managed to get this working, though. Is that with a lot of wraparound code? And is it for dbt unit tests in particular (although I'd think that runs into the same limitation (3) that we called out here)?

Again, if the scope of this PR is not actually "support running against local emulator" and really just "support connecting to proxy URL [whether that's forwarded to real BQ behind proxy, or your home-built emulator, or something else entirely, dbt has no awareness or opinion]" — okay! I'm not opposed in principle, but that's a really important clarification. I'm not sure if all the folks commenting on the PR are aware that, while this would make a lot of 'advanced' use cases possible, it doesn't do any of them out-of-the-box.

I can comment on the PR tomorrow to poll the crowd.

@MartinSahlen @kyungsoochoi984 @nrushforth Have you been able to test this branch locally, in your own environments, along with a BigQuery emulator? Or do you have other concrete use cases for a BQ proxy today?

Successfully merging this pull request may close these issues.

[CT-1391] [Feature] Add support for BigQuery emulator
10 participants