
[Databricks Connection] Support connection endpoint to databricks environment #11

Open
ssamus opened this issue Oct 20, 2022 · 9 comments
Assignees: ssamus
Labels: enhancement (New feature or request)
Project status: In Progress

Comments

ssamus (Contributor) commented Oct 20, 2022

Add support in Perseus for Databricks Spark SQL

ssamus self-assigned this Oct 20, 2022
natb1 (Collaborator) commented Oct 26, 2022

Are there any details on what the challenges are for working with Databricks? I have a good deal of experience with Databricks, but I'm still familiarizing myself with Perseus (and ETL-CDMBuilder, which I suspect may be the relevant technology). I'd be happy to lend a hand.

Zwky26 commented Oct 27, 2022

Thank you Nathan! The main long-term goals are to have 1) CDMBuilder and 2) the actual ETL run migrated to Databricks itself, instead of being run locally through Perseus's embedded code. As a first step, an export feature that generates a DBC file to be sent to Databricks would be valuable.
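For illustration, a minimal sketch of that export step, assuming the generated archive is pushed into a workspace via the Databricks Workspace API 2.0 (the file name and target path are hypothetical, not something Perseus produces today):

```python
import base64
import os

import requests

# Sketch only: upload a Perseus-generated notebook archive (DBC) into a
# Databricks workspace. Host, token, file name, and target path are assumptions.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

with open("perseus_etl_export.dbc", "rb") as f:  # hypothetical file produced by Perseus
    content = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Shared/perseus/etl_export",  # hypothetical target folder
        "format": "DBC",
        "content": content,
    },
)
resp.raise_for_status()
```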

natb1 (Collaborator) commented Oct 27, 2022

Oh, interesting. Can I help document this? Is there any architecture documentation started? (or, is there a more helpful place to start?)

ssamus added the enhancement label Oct 31, 2022
natb1 (Collaborator) commented Oct 31, 2022

To be more specific, I can volunteer to put together, for example, something like this (an SPIP-style proposal doc). That could reasonably include before-and-after reference architecture diagrams, and with that I could feasibly prototype a POC. If I volunteer to do so, is this the right audience to provide input on those docs? @Zwky26 @ssamus would you like to discuss, and I can document any outputs here?

Zwky26 commented Nov 1, 2022

Putting together an SPIP doc would be great! The process for feature requests and documentation has been very ad hoc up to this point, so this would be a good opportunity to add structure.
As for the audience, I'm not sure what the best choice would be. @paulnagy has been the de facto person to talk to for higher-level decisions, so we could potentially have talks with him to discuss outputs.

natb1 (Collaborator) commented Nov 1, 2022

(Draft)

Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

- We are enabling Perseus to scan Delta data sources by running jobs in a Databricks workspace.

  • A solution is required to scan Delta data sources; for Spark, this would seem to be blocked on OHDSI/WhiteRabbit#334 (Add Spark support).
  • We are using the mappings created by the Perseus UI to create ETL jobs in a Databricks workspace using Delta (sketched below).
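For illustration, a minimal sketch of what one generated ETL step on Delta could look like; all table and column names are invented, and in practice the Perseus mapping would drive the generated SQL:

```python
# Sketch only: one ETL step derived from a Perseus mapping, executed with
# Spark SQL on Delta inside a Databricks job. Table/column names are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

person = spark.sql("""
    SELECT
        p.patient_id         AS person_id,
        g.concept_id         AS gender_concept_id,
        year(p.birth_date)   AS year_of_birth
    FROM source.patients p
    LEFT JOIN cdm.gender_lookup g
        ON p.gender = g.source_value
""")

# Append the mapped rows to the (hypothetical) CDM table stored as Delta.
person.write.format("delta").mode("append").saveAsTable("cdm.person")
```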

Q2. What problem is this proposal NOT designed to solve?

This is not a solution for migrating from one data store to another (e.g., Postgres to Delta). For that, a dedicated “lift and shift” data migration solution would be recommended, which would better address concerns like incremental data load with CDC, etc.

Q3. How is it done today, and what are the limits of current practice?

(As far as I can tell; input welcome.)

  • The scan is either imported from White Rabbit, performed by the “source_schema_service” that is part of the “perseus-api”, or generated by the UI calling the White Rabbit API to produce a scan report.
  • The mapping is encoded as JSON and sent to ETL-CDMBuilder to execute against the underlying data sources.

Q4. What is new in your approach and why do you think it will be successful?

  • A Databricks job will be written to produce a scan report. The job can be triggered by the “perseus-api” just as it is today (see the sketch below).
  • A Databricks job will be written that can take the mapping and execute it on Delta using Spark.
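A minimal sketch of the scan job in PySpark; the profile fields and output path are assumptions and do not match the White Rabbit report format exactly:

```python
# Sketch only: profile a Delta table the way a scan report would, using PySpark.
# Table name, profile fields, and output path are assumptions.
from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

def scan_table(table_name: str):
    """Collect per-column row, null, and distinct counts for one table."""
    df = spark.table(table_name)
    n_rows = df.count()
    rows = []
    for col in df.columns:
        rows.append(Row(
            table=table_name,
            field=col,
            n_rows=n_rows,
            n_nulls=df.filter(F.col(col).isNull()).count(),
            n_distinct=df.select(col).distinct().count(),
        ))
    return spark.createDataFrame(rows)

# Persist the profile where the Perseus backend could pick it up (hypothetical path).
scan_table("source_db.patients").write.format("delta") \
    .mode("overwrite").save("/mnt/perseus/scan_reports/patients")
```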

Q5. Who cares? If you are successful, what difference will it make?

This will make the CDMs created by Perseus more scalable and give access to the analytic toolkit available with Spark and Delta.

Q6. What are the risks?

  • The existing solution combines integration and ETL functionality, which will not scale very well. The recommendation would be to isolate the integration functionality, “lift and shift” it to the preferred platform, then perform the ETL on that platform. This may cause some confusion with the ETL-CDMBuilder processes.
  • Typically, the ETL job would be triggered from a back-end service, not the UI. No analogous endpoint exists in the Perseus API today (a rough sketch of one follows below).
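To make the second risk concrete, a rough sketch of what such an endpoint could look like, assuming a Flask-style backend; the route, notebook path, and environment variables are hypothetical, while the Jobs API 2.1 runs/submit call itself is a real Databricks endpoint:

```python
# Sketch only: a backend endpoint that triggers the ETL as a Databricks run.
# Flask is assumed for illustration; route, notebook path, and env vars are hypothetical.
import json
import os

import requests
from flask import Blueprint, jsonify, request

bp = Blueprint("databricks_etl", __name__)

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]
DATABRICKS_CLUSTER_ID = os.environ["DATABRICKS_CLUSTER_ID"]

@bp.route("/api/etl/databricks/run", methods=["POST"])
def run_databricks_etl():
    mapping = request.get_json()  # the Perseus mapping JSON from the UI
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/runs/submit",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "run_name": "perseus-etl",
            "tasks": [{
                "task_key": "etl",
                "existing_cluster_id": DATABRICKS_CLUSTER_ID,
                "notebook_task": {
                    "notebook_path": "/Shared/perseus/etl",  # hypothetical notebook
                    "base_parameters": {"mapping": json.dumps(mapping)},
                },
            }],
        },
    )
    resp.raise_for_status()
    return jsonify(resp.json())  # includes the run_id to poll for completion
```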

Q7. How long will it take?

1 - 3 person months

  • 1 - 3 person weeks for the scan job
  • 1 - 3 person weeks for the ETL job
  • 1 - 3 person weeks for integration with Perseus UI & back end.

Q8. What are the mid-term and final “exams” to check for success?

  • Scan job in isolation
  • Scan job via Perseus
  • ETL job in isolation
  • ETL job via Perseus

natb1 (Collaborator) commented Nov 2, 2022

After parsing through the code and understanding the system design a bit better, this does seem to be blocked on OHDSI/WhiteRabbit#334. I will look into that issue more closely (barring any input on how this could be useful without the ability to scan Delta tables).

natb1 (Collaborator) commented Nov 12, 2022

@ssamus @Zwky26 I've been hacking on this and have a question about the Perseus backend. The Databricks integration requires some support on the backend, but the version of pandas there is a bit outdated and not supported by the Databricks libraries. Are there thoughts on whether it makes sense to stand up a separate service vs. updating pandas?
