[Databricks Connection] Support connection endpoint to databricks environment #11
Comments
Are there any details on what the challenges are for working with Databricks? I have a good deal of experience with Databricks; I'm still familiarizing myself with Perseus (and ETL-CDMBuilder, which I suspect may be the relevant technology). I'd be happy to lend a hand.
Thank you Nathan! The main long-term goals are to have 1) CDMBuilder and 2) the actual ETL run migrated to Databricks itself, instead of being run locally through Perseus's embedded code. As a first step, an export feature that generates a DBC file to be sent to Databricks would be a good starting point.
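For illustration only (none of this exists in Perseus today): a first step could be even simpler than a full DBC archive, e.g. pushing the generated ETL SQL to a workspace as a source notebook via the Databricks Workspace API import endpoint. Hostnames, tokens, and workspace paths below are placeholders.

```python
import base64
import requests

# Hypothetical sketch: upload generated ETL SQL to a Databricks workspace as a
# SQL notebook, using the Workspace API 2.0 import endpoint.
# DATABRICKS_HOST, DATABRICKS_TOKEN, and the workspace path are placeholders.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
DATABRICKS_TOKEN = "<personal-access-token>"

def export_notebook(sql_text: str, workspace_path: str) -> None:
    """Upload generated ETL SQL as a source notebook at the given workspace path."""
    payload = {
        "path": workspace_path,      # e.g. "/Shared/perseus/person_etl"
        "format": "SOURCE",          # single source file; the endpoint also accepts DBC archives
        "language": "SQL",
        "overwrite": True,
        "content": base64.b64encode(sql_text.encode("utf-8")).decode("ascii"),
    }
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()

# Usage (illustrative only):
# export_notebook("CREATE TABLE cdm.person AS SELECT ...", "/Shared/perseus/person_etl")
```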
Oh, interesting. Can I help document this? Is there any architecture documentation started? (Or is there a more helpful place to start?)
To be more specific, I can volunteer to put together something like this, for example. That could reasonably include before-and-after reference architecture diagrams, and with that I could feasibly prototype a POC. If I volunteer to do so, is this the right audience to provide input on those docs? @Zwky26 @ssamus, would you like to discuss, and I can document any outputs here?
Putting together an SPIP doc would be great! The process for feature requests and documentation has been very ad hoc up to this point, so this would be a good opportunity to add structure.
(Draft)
Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.
Q2. What problem is this proposal NOT designed to solve? This is not a solution for migrating from one data store to another (e.g. Postgres to Delta). For that, a dedicated “lift and shift” data migration solution would be recommended; it would better address concerns like incremental data load with CDC, etc.
Q3. How is it done today, and what are the limits of current practice? (Input welcome.)
Q4. What is new in your approach and why do you think it will be successful?
Q5. Who cares? If you are successful, what difference will it make? This will make the CDMs created by Perseus more scalable and give access to the analytic toolkit available with Spark and Delta.
Q6. What are the risks?
Q7. How long will it take? 1-3 person-months.
Q8. What are the mid-term and final “exams” to check for success?
After parsing through the code and understanding the system design a bit better, this would seem to be blocked on OHDSI/WhiteRabbit#334. I will look into that issue more closely (barring any input on how this could be useful without the ability to scan Delta tables).
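For context on what that scan dependency entails, here is a hedged sketch (not WhiteRabbit or Perseus code) of the minimal metadata a source scan of Delta tables would need, reachable through plain SQL with the databricks-sql-connector package; connection details are placeholders.

```python
from databricks import sql  # pip install databricks-sql-connector

# Hedged sketch: enumerate tables and column types in a source schema,
# which is roughly the metadata a scan would have to collect.
with sql.connect(
    server_hostname="<your-workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SHOW TABLES IN my_source_schema")
        # SHOW TABLES rows are (database, tableName, isTemporary)
        tables = [row[1] for row in cursor.fetchall()]
        for table in tables:
            cursor.execute(f"DESCRIBE TABLE my_source_schema.{table}")
            # DESCRIBE TABLE rows are (col_name, data_type, comment)
            for col_name, data_type, _comment in cursor.fetchall():
                print(table, col_name, data_type)
```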
@ssamus @Zwky26 I've been hacking on this, and I have a question about the Perseus backend. The Databricks integration requires some support on the backend, but the version of Pandas there is a bit outdated and not supported by the Databricks libraries. Are there thoughts on whether it makes more sense to stand up a separate service or to update Pandas?
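To make the "separate service" option concrete, here is a minimal sketch of what a standalone connection-check microservice could look like, assuming Flask; the endpoint name and payload fields are invented for illustration and are not part of Perseus.

```python
from flask import Flask, jsonify, request
from databricks import sql  # pip install databricks-sql-connector

# Hypothetical standalone microservice: isolates the Databricks driver (and its
# newer pandas/pyarrow requirements) from the existing Perseus backend.
app = Flask(__name__)

@app.route("/databricks/test-connection", methods=["POST"])
def test_connection():
    params = request.get_json(force=True)
    try:
        with sql.connect(
            server_hostname=params["server_hostname"],
            http_path=params["http_path"],
            access_token=params["access_token"],
        ) as conn:
            with conn.cursor() as cursor:
                cursor.execute("SELECT 1")  # cheap round-trip to verify the connection
                cursor.fetchall()
        return jsonify({"canConnect": True})
    except Exception as exc:  # surface the driver error to the caller
        return jsonify({"canConnect": False, "message": str(exc)}), 400

if __name__ == "__main__":
    app.run(port=5004)  # arbitrary port for the sketch
```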
Add support in Perseus for Databricks Spark SQL