Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

can I implement bulkLoad for BQ ? #288

Open
javier-gracia-tabuenca-tuni opened this issue Aug 21, 2024 · 12 comments
Open

can I implement bulkLoad for BQ ? #288

javier-gracia-tabuenca-tuni opened this issue Aug 21, 2024 · 12 comments

Comments

@javier-gracia-tabuenca-tuni

Loading a table in BigQuery with DatabaseConnector::insertTable is quite slow

I usually have to use package bigrquery

quick dirty demo
image

If you want I can make insertTable with bulkLoad=TRUE to work also BQ
using bigrquery, or do you prefer other method such command line, may be more complex but nor need an extra package ??

@ablack3
Copy link
Collaborator

ablack3 commented Aug 23, 2024

We could have a DatabaseConnectorDbiConnection that wraps bigrquery connections similar to how DatabaseConnector wraps other DBI drivers. I think this would make sense.

@javier-gracia-tabuenca-tuni
Copy link
Author

Thanks @ablack3,
Now that you mention this, I remember than 3 years ago I made hack to use DatabaseConnector to work with bigquery using DBI, bcs JDBI was not allowed in our research enviroment.

https://github.com/FINNGEN/DatabaseConnector

It worked very well, but the code is very dirty,

I could work on making this the right way if you think that PR will be accepted ???

@javier-gracia-tabuenca-tuni
Copy link
Author

that hack was very simple, just used bigrquery to make the DBI connection following how DatabaseConnector uses SQLite.

https://github.com/FINNGEN/DatabaseConnector/blob/a4125c6643ca3933ecb6ff8d5aa62e9b21fff1d4/R/Connect.R#L578

The dirty part comes, where I had to pass some extra parameters for bigrquery, I just added these paramenters to all the necessary functions, this should be different.

@schuemie
Copy link
Member

@ablack3: If bigrquery has superior performance to the JDBC driver, maybe we should swap them out in DatabaseConnector? (after we've completed DatabaseConnector 7.0.0, when this should be simpler)

@ablack3
Copy link
Collaborator

ablack3 commented Aug 28, 2024

The dirty part comes, where I had to pass some extra parameters for bigrquery, I just added these paramenters to all the necessary functions, this should be different.

@javier-gracia-tabuenca-tuni I think we could follow the pattern used for odbc that uses createDbiConnectionDetails and connectUsingDbi

Otherwise your changes look good to me.

connectSparkUsingOdbc <- function(connectionDetails) {
  inform("Connecting using Spark ODBC driver")
  ensure_installed("odbc")
  dbiConnectionDetails <- createDbiConnectionDetails(
    dbms = connectionDetails$dbms,
    drv = odbc::odbc(),
    Driver = "Simba Spark ODBC Driver",
    Host = connectionDetails$server(),
    uid = connectionDetails$user(),
    pwd = connectionDetails$password(),
    Port = connectionDetails$port()
    # UseNativeQuery=1
  )
  if (!is.null(connectionDetails$extraSettings)) {
    dbiConnectionDetails <- append(
      dbiConnectionDetails,
      connectionDetails$extraSettings
    )
  }
  connection <- connectUsingDbi(dbiConnectionDetails)
  return(connection)
}

@javier-gracia-tabuenca-tuni
Copy link
Author

thanks @ablack3, that is a good guidance, (that was not there when i made my hack)

I will make a fork, add the changes, work with these on my tools, and if don't find major issues,
we can merge these changes after v7.0.0 as suggested by @schuemie

does it sounds like a plan ?

@ablack3
Copy link
Collaborator

ablack3 commented Aug 28, 2024

Hi @javier-gracia-tabuenca-tuni, after talking with @schuemie we were thinking that after v7 you should be able to use DatabaseConnector with bigrquery by using createDbiConnectionDetails.

Ideally DatabaseConnector only uses one driver for each database. Currently that is jdbc for BigQuery but it could be bigrquery in the future if all the tools work well with it. I would say we could give bigrquery a try and compare to the jdbc driver. If bigrquery seems to work better then we can switch to it in a future release. Feel free to make a branch with this change and we can test it out after v7.

@javier-gracia-tabuenca-tuni
Copy link
Author

ok, sounds good

one more question related to BQ.

There is this forced delay for only bigquery
https://github.com/OHDSI/DatabaseConnector/blob/main/R/Sql.R#L307

I coudnt find what is the reason for this.
I'd like to know if this is necessary always for BQ, or ir has to do with the driver, and changing the driver will fix this also.

@schuemie
Copy link
Member

schuemie commented Sep 4, 2024

BigQuery has a rate limit, especially for inserts. If you exceed it, you will get an error. To my knowledge this limit still applies.

@konstjar
Copy link

@javier-gracia-tabuenca-tuni
Here is a reference to documentation of the rate limits:
https://cloud.google.com/bigquery/quotas#standard_tables

@javier-gracia-tabuenca-tuni
Copy link
Author

In the end what I did is to connect to BQ ussing the
createDbiConnectionDetails function

createDbiConnectionDetails <- function(dbms, drv, ...) {

Then I only had to make few changes in the
insertTable and ListTables

https://github.com/javier-gracia-tabuenca-tuni/DatabaseConnector/tree/BQ-DBI

I havent make unit test as I dont have access to your testing enviroment, but we are using it in production of our app (https://github.com/finngen/CohortOperations2) with no problmes.

please fill free to use or reach me for more

@javier-gracia-tabuenca-tuni
Copy link
Author

forgot something important
when crating the IBD connection details, one needs the parameter bigint = "integer64" so that BIGINT are properly fetch.
https://bigrquery.r-dbi.org/reference/bigquery.html

eg

cd <- DatabaseConnector::createDbiConnectionDetails(
  dbms = "bigquery",
  drv = bigrquery::bigquery(),
  project = "<project>",
  billing = "<billing_project>",
  bigint = "integer64"
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants