
[Databricks Connection] Support connection endpoint to databricks environment #11

Open
ssamus opened this issue Oct 20, 2022 · 9 comments
Assignees: ssamus
Labels: enhancement (New feature or request)
Project status: In Progress

Comments

ssamus (Contributor) commented Oct 20, 2022

Add support in Perseus for Databricks Spark SQL

ssamus self-assigned this Oct 20, 2022
natb1 (Collaborator) commented Oct 26, 2022

Are there any details on what the challenges are for working with Databricks? I have a good deal of experience with Databricks, but I'm still familiarizing myself with Perseus (and ETL-CDMBuilder, which I suspect may be the relevant technology). I'd be happy to lend a hand.

Zwky26 commented Oct 27, 2022

Thank you Nathan! The main long-term goals are to have 1) CDMBuilder and 2) the actual ETL run migrated to Databricks itself, instead of being run locally through Perseus's embedded code. As a first step, an export feature that generates a DBC file to be sent to Databricks would be valuable.
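For illustration, a minimal sketch of that export step, assuming the generated archive is pushed into a workspace via the Databricks Workspace API 2.0 (the file name and target path are hypothetical, not something Perseus produces today):

```python
import base64
import os

import requests

# Sketch only: upload a Perseus-generated notebook archive (DBC) into a
# Databricks workspace. Host, token, file name, and target path are assumptions.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

with open("perseus_etl_export.dbc", "rb") as f:  # hypothetical file produced by Perseus
    content = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Shared/perseus/etl_export",  # hypothetical target folder
        "format": "DBC",
        "content": content,
    },
)
resp.raise_for_status()
```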

natb1 (Collaborator) commented Oct 27, 2022

Oh, interesting. Can I help document this? Is there any architecture documentation started? (or, is there a more helpful place to start?)

ssamus added the enhancement label Oct 31, 2022
natb1 (Collaborator) commented Oct 31, 2022

To be more specific, I can volunteer to put together, for example, something like this (an SPIP-style proposal doc). That could reasonably include before-and-after reference architecture diagrams, and with that I could feasibly prototype a POC. If I volunteer to do so, is this the right audience to provide input on those docs? @Zwky26 @ssamus would you like to discuss, and I can document any outputs here?

Zwky26 commented Nov 1, 2022

Putting together an SPIP doc would be great! The process for feature requests and documentation has been very ad hoc up to this point, so this would be a good opportunity to add structure.
As for the audience, I'm not sure what the best choice would be. @paulnagy has been the de facto person to talk to for higher-level decisions, so we could potentially have talks with him to discuss outputs.

natb1 (Collaborator) commented Nov 1, 2022

(Draft)

Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

- We are enabling Perseus to scan Delta data sources by running jobs in a Databricks workspace.

  • A solution is required to scan Delta data sources; for Spark, this would seem to be blocked on OHDSI/WhiteRabbit#334 (Add Spark support).
  • We are using the mappings created by the Perseus UI to create ETL jobs in a Databricks workspace using Delta (sketched below).
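For illustration, a minimal sketch of what one generated ETL step on Delta could look like; all table and column names are invented, and in practice the Perseus mapping would drive the generated SQL:

```python
# Sketch only: one ETL step derived from a Perseus mapping, executed with
# Spark SQL on Delta inside a Databricks job. Table/column names are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

person = spark.sql("""
    SELECT
        p.patient_id         AS person_id,
        g.concept_id         AS gender_concept_id,
        year(p.birth_date)   AS year_of_birth
    FROM source.patients p
    LEFT JOIN cdm.gender_lookup g
        ON p.gender = g.source_value
""")

# Append the mapped rows to the (hypothetical) CDM table stored as Delta.
person.write.format("delta").mode("append").saveAsTable("cdm.person")
```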

Q2. What problem is this proposal NOT designed to solve?

This is not a solution for migrating from one data store to another (e.g., Postgres to Delta). For that, a dedicated “lift and shift” data migration solution would be recommended, which would better address concerns like incremental data load with CDC, etc.

Q3. How is it done today, and what are the limits of current practice?

(As far as I can tell; input welcome.)

  • The scan is either imported from White Rabbit, performed by the “source_schema_service” that is part of the “perseus-api”, or generated by the UI calling the White Rabbit API to produce a scan report.
  • The mapping is encoded as JSON and sent to ETL-CDMBuilder to execute against the underlying data sources.

Q4. What is new in your approach and why do you think it will be successful?

  • A Databricks job will be written to produce a scan report. The job can be triggered by the “perseus-api” just as it is today (see the sketch below).
  • A Databricks job will be written that can take the mapping and execute it on Delta using Spark.
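A minimal sketch of the scan job in PySpark; the profile fields and output path are assumptions and do not match the White Rabbit report format exactly:

```python
# Sketch only: profile a Delta table the way a scan report would, using PySpark.
# Table name, profile fields, and output path are assumptions.
from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

def scan_table(table_name: str):
    """Collect per-column row, null, and distinct counts for one table."""
    df = spark.table(table_name)
    n_rows = df.count()
    rows = []
    for col in df.columns:
        rows.append(Row(
            table=table_name,
            field=col,
            n_rows=n_rows,
            n_nulls=df.filter(F.col(col).isNull()).count(),
            n_distinct=df.select(col).distinct().count(),
        ))
    return spark.createDataFrame(rows)

# Persist the profile where the Perseus backend could pick it up (hypothetical path).
scan_table("source_db.patients").write.format("delta") \
    .mode("overwrite").save("/mnt/perseus/scan_reports/patients")
```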

Q5. Who cares? If you are successful, what difference will it make?

This will make the CDMs created by Perseus more scalable and give access to the analytic toolkit available with Spark and Delta.

Q6. What are the risks?

  • The existing solution combines integration and ETL functionality, which will not scale very well. The recommendation would be to isolate the integration functionality, “lift and shift” it to the preferred platform, then perform the ETL on that platform. This may cause some confusion with the ETL-CDMBuilder processes.
  • Typically, the ETL job would be triggered from a back-end service, not the UI. No analogous endpoint exists in the Perseus API today (a rough sketch of one follows below).
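To make the second risk concrete, a rough sketch of what such an endpoint could look like, assuming a Flask-style backend; the route, notebook path, and environment variables are hypothetical, while the Jobs API 2.1 runs/submit call itself is a real Databricks endpoint:

```python
# Sketch only: a backend endpoint that triggers the ETL as a Databricks run.
# Flask is assumed for illustration; route, notebook path, and env vars are hypothetical.
import json
import os

import requests
from flask import Blueprint, jsonify, request

bp = Blueprint("databricks_etl", __name__)

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]
DATABRICKS_CLUSTER_ID = os.environ["DATABRICKS_CLUSTER_ID"]

@bp.route("/api/etl/databricks/run", methods=["POST"])
def run_databricks_etl():
    mapping = request.get_json()  # the Perseus mapping JSON from the UI
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/runs/submit",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "run_name": "perseus-etl",
            "tasks": [{
                "task_key": "etl",
                "existing_cluster_id": DATABRICKS_CLUSTER_ID,
                "notebook_task": {
                    "notebook_path": "/Shared/perseus/etl",  # hypothetical notebook
                    "base_parameters": {"mapping": json.dumps(mapping)},
                },
            }],
        },
    )
    resp.raise_for_status()
    return jsonify(resp.json())  # includes the run_id to poll for completion
```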

Q7. How long will it take?

1 - 3 person months

  • 1 - 3 person weeks for the scan job
  • 1 - 3 person weeks for the ETL job
  • 1 - 3 person weeks for integration with Perseus UI & back end.

Q8. What are the mid-term and final “exams” to check for success?

  • Scan job in isolation
  • Scan job via Perseus
  • ETL job in isolation
  • ETL job via Perseus

natb1 (Collaborator) commented Nov 2, 2022

After parsing through the code and understanding the system design a bit better, this does seem to be blocked on OHDSI/WhiteRabbit#334. I will look into that issue more closely (barring any input on how this could be useful without the ability to scan Delta tables).

natb1 (Collaborator) commented Nov 12, 2022

@ssamus @Zwky26 I've been hacking on this and have a question about the Perseus backend. The Databricks integration requires some support on the backend, but the version of pandas there is a bit outdated and not supported by the Databricks libraries. Are there thoughts on whether it makes sense to stand up a separate service vs. updating pandas?
