Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standard Data Connection Interface & Databricks Catalog Scanning #24

Merged
merged 65 commits into from
Dec 19, 2022
Merged

Standard Data Connection Interface & Databricks Catalog Scanning #24

merged 65 commits into from
Dec 19, 2022

Conversation

natb1
Copy link
Collaborator

@natb1 natb1 commented Nov 30, 2022

This PR implements concepts described in #11 as well as this draft design document. In short:

  • It introduces a standard interface for adding new data sources/destinations to Perseus called a "DataConnection".
  • It demonstrates an implementation of that interface by adding a new Databricks data source. New Databricks datasource is implemented up to loading a scan/profile report for any available Databricks data catalogs. The implementation demonstrates concepts that can be extended to deploying ETL pipelines and the final CDM.

1. Out of scope for this PR

  • This solution is specifically designed to be backwards compatible with all existing database sources/destinations. It makes it easier to add new connections, easier to maintain and document those connections, but it does not refactor any code that is already working.
  • This PR only enables scanning of Databricks data sources. Implementation of ETL pipeline deployment is pending additional design - in particular, design of ETL pipeline serialization. The concepts in this PR should be extensible to that use case which will be implemented in a future PR.

2. Limitations Addressed by this Design

  1. Today, adding or modifying how Perseus interacts with a database involves touching many parts of the code. For example, there are many parts of the code that are switch statements that check the "type" of the database and modify the logic or "strategy" depending on that.
  • Switch statements in the templates to determine which templates and which parts of the templates to render.
  • Switch statements in the controller code to determine what strategy to use for testing or scanning a connection.
  • Lookup tables for default ports for each database.
  • etc.

As a result, adding or modifying database integrations involves deep knowledge of the full code base, and parsing/modifying many parts of the code. This is time consuming and error prone, and could result in workarounds such and relying on documentation for dealing with edge cases vs. improving UX.

  1. Existing infrastructure (esp. Perseus backend, and WhiteRabbit) assumes RDBMS backends with constraints that won't be relevant for other DBMS'.
  • Profiling is parameterized to address scalability limitations in RDBMS' that are not relevant for distributed DBMS' and would actually be detrimental to cost/performance.
  • There are no standardized interfaces for DDL. SQL Render is primarily focused on analytic query parsing and it would likely not be feasible to extend to a generalized DBMS interface. (After more research it appears that SQL Render does, in fact, aspire to generalize DDL. The complexity of extending that to a generalized DBMS interface - that includes, for example, Delta optimization - seems extreme)
  • In general, there is little support for non-relational and/or non-sql (ex. REST) integrations with DBMS'

3. New Design Elements

  1. A new DataConnection typescript interface.

This interface encapsulates all of the logic necessary to implement a new data source. Implementations of this interface are registered with a DataConnection Angular service in order to be "injected" into the Perseus app.

I.e. any developer adding a new data source can do so by 1) implementing this interface and 2) registering the implementation with the service WITHOUT the need to learn or modify any other parts of the code.

  1. An OPTIONAL data-connection Node (typescript backend) service to support integration with non-relational DBMS'. DataConnection implementations can integrate via ANY backend including WhiteRabbit/CDM Builder etc. The data-connection backend service is an option that provides additional flexibility for DBMS' not traditionally handled by the existing services. The service is built using an open source Node framework for data access services called loopback. There is native support for OpenAPI - the Angular frontend integration is generated programmatically from the OpenAPI spec. It also has a built in ORM. In short, it is a lightweight option for integrating with arbitrary data backends in a SOA.

For Databricks this means:

  • Flexibility to integrate with EITHER the SQL Warehouse, or Jobs API depending on the configuration and use case.
  • Ability to write queries optimized for Spark.
  • Support for full Databricks DDL.

4. Summary of Improvements

  • Data sources can be added to Perseus with less effort, easier maintainability and less risk of regressions in existing code.
  • Backend data integrations can be designed with greater flexibility using open standards.
  • Databricks data sources can be scanned to use as a source for OMOP ETL.

dependabot bot and others added 30 commits April 10, 2022 04:21
Bumps [minimist](https://github.com/substack/minimist) from 1.2.5 to 1.2.6.
- [Release notes](https://github.com/substack/minimist/releases)
- [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6)

---
updated-dependencies:
- dependency-name: minimist
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [lodash-es](https://github.com/lodash/lodash) from 4.17.15 to 4.17.21.
- [Release notes](https://github.com/lodash/lodash/releases)
- [Commits](lodash/lodash@4.17.15...4.17.21)

---
updated-dependencies:
- dependency-name: lodash-es
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.3 to 1.5.10.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](unshiftio/url-parse@1.5.3...1.5.10)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [pyjwt](https://github.com/jpadilla/pyjwt) from 2.3.0 to 2.4.0.
- [Release notes](https://github.com/jpadilla/pyjwt/releases)
- [Changelog](https://github.com/jpadilla/pyjwt/blob/master/CHANGELOG.rst)
- [Commits](jpadilla/pyjwt@2.3.0...2.4.0)

---
updated-dependencies:
- dependency-name: pyjwt
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [waitress](https://github.com/Pylons/waitress) from 2.1.1 to 2.1.2.
- [Release notes](https://github.com/Pylons/waitress/releases)
- [Changelog](https://github.com/Pylons/waitress/blob/v2.1.2/CHANGES.txt)
- [Commits](Pylons/waitress@v2.1.1...v2.1.2)

---
updated-dependencies:
- dependency-name: waitress
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [terser](https://github.com/terser/terser) and [@angular-devkit/build-angular](https://github.com/angular/angular-cli). These dependencies needed to be updated together.

Updates `terser` from 5.7.0 to 5.14.2
- [Release notes](https://github.com/terser/terser/releases)
- [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md)
- [Commits](terser/terser@v5.7.0...v5.14.2)

Updates `@angular-devkit/build-angular` from 12.1.0 to 12.2.18
- [Release notes](https://github.com/angular/angular-cli/releases)
- [Changelog](https://github.com/angular/angular-cli/blob/12.2.18/CHANGELOG.md)
- [Commits](angular/angular-cli@v12.1.0...12.2.18)

---
updated-dependencies:
- dependency-name: terser
  dependency-type: indirect
- dependency-name: "@angular-devkit/build-angular"
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [loader-utils](https://github.com/webpack/loader-utils) from 1.4.0 to 1.4.2.
- [Release notes](https://github.com/webpack/loader-utils/releases)
- [Changelog](https://github.com/webpack/loader-utils/blob/v1.4.2/CHANGELOG.md)
- [Commits](webpack/loader-utils@v1.4.0...v1.4.2)

---
updated-dependencies:
- dependency-name: loader-utils
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [engine.io](https://github.com/socketio/engine.io) from 6.2.0 to 6.2.1.
- [Release notes](https://github.com/socketio/engine.io/releases)
- [Changelog](https://github.com/socketio/engine.io/blob/main/CHANGELOG.md)
- [Commits](socketio/engine.io@6.2.0...6.2.1)

---
updated-dependencies:
- dependency-name: engine.io
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
kostiushenko and others added 8 commits November 30, 2022 12:34
…UI/loader-utils-1.4.2

Bump loader-utils from 1.4.0 to 1.4.2 in /UI
…UI/engine.io-6.2.1

Bump engine.io from 6.2.0 to 6.2.1 in /UI
Bumps [node-forge](https://github.com/digitalbazaar/forge) to 1.3.1 and updates ancestor dependency [@angular-devkit/build-angular](https://github.com/angular/angular-cli). These dependencies need to be updated together.


Updates `node-forge` from 0.10.0 to 1.3.1
- [Release notes](https://github.com/digitalbazaar/forge/releases)
- [Changelog](https://github.com/digitalbazaar/forge/blob/main/CHANGELOG.md)
- [Commits](digitalbazaar/forge@0.10.0...v1.3.1)

Updates `@angular-devkit/build-angular` from 12.2.18 to 15.0.1
- [Release notes](https://github.com/angular/angular-cli/releases)
- [Changelog](https://github.com/angular/angular-cli/blob/main/CHANGELOG.md)
- [Commits](angular/angular-cli@12.2.18...15.0.1)

---
updated-dependencies:
- dependency-name: node-forge
  dependency-type: indirect
- dependency-name: "@angular-devkit/build-angular"
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [glob-parent](https://github.com/gulpjs/glob-parent) and [@angular-devkit/build-angular](https://github.com/angular/angular-cli). These dependencies needed to be updated together.

Updates `glob-parent` from 3.1.0 to 5.1.2
- [Release notes](https://github.com/gulpjs/glob-parent/releases)
- [Changelog](https://github.com/gulpjs/glob-parent/blob/main/CHANGELOG.md)
- [Commits](gulpjs/glob-parent@v3.1.0...v5.1.2)

Updates `@angular-devkit/build-angular` from 12.1.0 to 14.2.3
- [Release notes](https://github.com/angular/angular-cli/releases)
- [Changelog](https://github.com/angular/angular-cli/blob/main/CHANGELOG.md)
- [Commits](angular/angular-cli@v12.1.0...14.2.3)

---
updated-dependencies:
- dependency-name: glob-parent
  dependency-type: indirect
- dependency-name: "@angular-devkit/build-angular"
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <[email protected]>
@natb1
Copy link
Collaborator Author

natb1 commented Dec 1, 2022

Updated section 2.2 with additional context on SQLRender

@leeevans leeevans requested review from chMatvey and ssamus December 5, 2022 12:01
natb1 and others added 8 commits December 6, 2022 19:19
…UI/glob-parent-and-angular-devkit/build-angular-5.1.2

Bump glob-parent and @angular-devkit/build-angular in /UI
…UI/node-forge-and-angular-devkit/build-angular-1.3.1

Bump node-forge and @angular-devkit/build-angular in /UI
Bumps [minimatch](https://github.com/isaacs/minimatch) from 3.0.4 to 3.1.2.
- [Release notes](https://github.com/isaacs/minimatch/releases)
- [Changelog](https://github.com/isaacs/minimatch/blob/main/changelog.md)
- [Commits](isaacs/minimatch@v3.0.4...v3.1.2)

---
updated-dependencies:
- dependency-name: minimatch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [json-schema](https://github.com/kriszyp/json-schema) and [jsprim](https://github.com/joyent/node-jsprim). These dependencies needed to be updated together.

Updates `json-schema` from 0.2.3 to 0.4.0
- [Release notes](https://github.com/kriszyp/json-schema/releases)
- [Commits](kriszyp/json-schema@v0.2.3...v0.4.0)

Updates `jsprim` from 1.4.1 to 1.4.2
- [Release notes](https://github.com/joyent/node-jsprim/releases)
- [Changelog](https://github.com/TritonDataCenter/node-jsprim/blob/v1.4.2/CHANGES.md)
- [Commits](TritonDataCenter/node-jsprim@v1.4.1...v1.4.2)

---
updated-dependencies:
- dependency-name: json-schema
  dependency-type: indirect
- dependency-name: jsprim
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
…UI/minimatch-3.1.2

Bump minimatch from 3.0.4 to 3.1.2 in /UI
…UI/json-schema-and-jsprim-0.4.0

Bump json-schema and jsprim in /UI
@ssamus ssamus requested a review from bradanton December 14, 2022 13:18
@ssamus
Copy link
Contributor

ssamus commented Dec 16, 2022

Hi @natb1 @leeevans we have completed the review of this pull request. The feature is not fully implemented and cannot merge into the master branch. The feature must be implemented and tested before it merges with Master. Please re-create the pull request in the Dev branch.

@natb1
Copy link
Collaborator Author

natb1 commented Dec 16, 2022

@ssamus I will resubmit to develop.

Can you please provide more context? Can we have a conversation with your developer? For example:

  • The PR DOES include automated testing. It also includes a framework for QA of the new UI elements.
  • There is apparently no pre-existing framework for testing. Is there any additional context you could provide on the standard I could work towards for the code to be considered "tested".
  • The code DOES implement the functionality described. It does include fully functioning interface that when implemented does extend the UI with additional data sources. It does enable the scanning of a Databricks workspace in production. Is there more information you can provide on the level of functionality you would like to see implemented?

Mostly, is there a way for us to have a technical dialog about contributing to the codebase?

@natb1 natb1 changed the base branch from master to development December 16, 2022 15:48
@bradanton bradanton merged commit 07fcf59 into OHDSI:development Dec 19, 2022
@natb1 natb1 deleted the data-connection-interface branch December 19, 2022 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants