Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbplyr adds schema to columns copared in joins, that breaks BigQuery #236

Open
javier-gracia-tabuenca-tuni opened this issue May 11, 2023 · 5 comments

Comments

@javier-gracia-tabuenca-tuni

Sorry, this is hard to make a reproducible example bcs is in bigquery

person <- dplyr::tbl(connection, DatabaseConnector::inDatabaseSchema(database_settings$schemas$CDM, "person"))
observation_period <- dplyr::tbl(connection, DatabaseConnector::inDatabaseSchema(database_settings$schemas$CDM, "observation_period"))


person |> dplyr::left_join(
    observation_period |> dplyr::select(person_id, observation_period_start_date, observation_period_end_date),
    by = "person_id"
)

with options("DEBUG_DATABASECONNECTOR_DBPLYR" = TRUE)

SQL in: SELECT TOP 11 *
FROM (
  SELECT
    atlas-development-270609.finngen_omop_r11.person.*,
    observation_period_start_date,
    observation_period_end_date
  FROM atlas-development-270609.finngen_omop_r11.person
  LEFT JOIN atlas-development-270609.finngen_omop_r11.observation_period
    ON (atlas-development-270609.finngen_omop_r11.person.person_id = atlas-development-270609.finngen_omop_r11.observation_period.person_id)
) q01
SQL out: SELECT  *
from (
  select
    atlas-development-270609.finngen_omop_r11.person.*,
    observation_period_start_date,
    observation_period_end_date
  from atlas-development-270609.finngen_omop_r11.person
  left join atlas-development-270609.finngen_omop_r11.observation_period
    on (atlas-development-270609.finngen_omop_r11.person.person_id = atlas-development-270609.finngen_omop_r11.observation_period.person_id)
) q01 LIMIT 11;
Error in collect(x, n = n) : Failed to collect lazy table.
Caused by error in `value[[3L]]()`:
! java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: Syntax error: Expected ")" but got "." at [4:46]

image

image

@javier-gracia-tabuenca-tuni
Copy link
Author

Problems seems to come i need to add the "project" to the schema

NOT WORKING
person_table <- dplyr::tbl(connection, DatabaseConnector::inDatabaseSchema("atlas-development-270609.finngen_omop_r11","person"))

WORKING
person_table <- dplyr::tbl(connection, DatabaseConnector::inDatabaseSchema("finngen_omop_r11","person"))

the first make a CATALOG the second a SCHEMA

image

@javier-gracia-tabuenca-tuni
Copy link
Author

this should be easy to fix

inDatabaseSchema <- function(databaseSchema, table) {

but not sure what you think will be the elegant way

may be adding a parameter dbms to inDatabaseSchema , so that inDatabaseSchema behaves differently for bq and potentially others ??

@javier-gracia-tabuenca-tuni
Copy link
Author

i solved it locally as this

tmp_inDatabaseSchema <- function(databaseSchema, table, dbms="") {
  if(dbms=="bigquery"){
    return(dbplyr::in_schema(databaseSchema, table))
  }
  databaseSchema <- strsplit(databaseSchema, "\\.")[[1]]
  if (length(databaseSchema) == 1) {
    return(dbplyr::in_schema(databaseSchema[1], table))
  } else {
    return(dbplyr::in_catalog(databaseSchema[1], databaseSchema[2], table))
  }
}

Im not making a PR bcs Im not sure this is the way you would like to solve it

@javier-gracia-tabuenca-tuni
Copy link
Author

javier-gracia-tabuenca-tuni commented May 22, 2023

this is important bcs in bigquery the path to a table is in the form
<project>.<dataset>.<table>

Some times (data project != billing project ) the needs to be specified in order to work

@schuemie
Copy link
Member

Yeah, I don't like adding the dbms parameter, because then any code written using in_database_schema() would need to provide that.

It seems to me the problem isn't so much that the project get's mistaken for a catalog, but rather that BigQuery doesn't allow the project name to appear in a join ON statement? So in your first example

SELECT  *
from (
  select
    atlas-development-270609.finngen_omop_r11.person.*,
    observation_period_start_date,
    observation_period_end_date
  from atlas-development-270609.finngen_omop_r11.person
  left join atlas-development-270609.finngen_omop_r11.observation_period
    on (atlas-development-270609.finngen_omop_r11.person.person_id = atlas-development-270609.finngen_omop_r11.observation_period.person_id)
) q01 LIMIT 11;

the SQL seems fine, but BigQuery doesn't approve of having the full reference to the tables in the ON statement. Maybe taht is what we should address?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants