
Synthea to OMOP transformation using dbt #4

Merged: 22 commits, Jun 11, 2024

Conversation

@vvcb (Collaborator) commented Feb 17, 2024:

This PR includes a working implementation of the Synthea to OMOP transformation using dbt, replicating the workflow in https://github.com/OHDSI/ETL-Synthea. (The one exception is drug era, which takes a very long time to build, most likely because of missing indices.)

dbt is a transformation tool and expects that the extract and load parts of the ELT have already been completed. While the repo includes the Synthea1k dataset as seed files, users are strongly discouraged from relying on these; instead, they should download their Synthea dataset of choice and the Athena vocabulary, and load them into their data warehouse.

This implementation assumes that this data is already in the database and provides easily configurable sources.yml files that point dbt at it.
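
For readers unfamiliar with dbt sources, a minimal sketch of what such a sources.yml might look like follows; the schema and table names here are assumptions for illustration, not the project's actual configuration:

```yaml
# models/vocab/sources.yml (illustrative sketch; names are assumptions)
version: 2

sources:
  - name: vocab
    schema: vocabulary  # point this at wherever the Athena vocabulary was loaded
    tables:
      - name: concept
      - name: concept_relationship
      - name: vocabulary
```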

In addition to dbt models, this PR also includes the following additional 'features':

  • pre-commit hooks
  • sqlfluff for linting Jinja-templated SQL (a minimal config sketch follows this list)
  • requirements.txt for installing the necessary dependencies
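
As an illustration of the sqlfluff piece, a `.sqlfluff` configuration for linting dbt-flavoured SQL might look roughly like this; the dialect reflects the SQL Server target of this PR, and the whole snippet is a sketch rather than the repo's actual config:

```ini
; .sqlfluff (illustrative sketch, not the repo's actual config)
[sqlfluff]
templater = dbt
dialect = tsql

[sqlfluff:templater:dbt]
project_dir = .
```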

Development was made infinitely easier by the vscode-dbt-power-user extension, which provides real-time lineage (including column-level lineage) as well as execution of models against the target warehouse.

ToDo:

  • Update the README to help users target different databases: the current docs are for Postgres, while this PR targets SQL Server
  • Replace T-SQL-specific code with equivalent macros from dbt-utils or one of the many excellent dbt packages
  • Add indices as post-hooks
  • Consider removing all intermediary models at the end of a successful run; easy to implement with an on-run-end hook (see the sketch after this list)
  • Alternatively, keep the intermediary models but document them well so users can better understand how the transformation is done
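
To make the last few items concrete, here is a rough sketch of how they might look in dbt configuration. The model key and the `drop_intermediate_models` macro are hypothetical, not things this PR defines:

```yaml
# dbt_project.yml (hypothetical sketch of the ToDo items above)
models:
  synthea_omop_etl:
    drug_era:
      # create an index after the model builds, instead of hand-written T-SQL
      +post-hook:
        - "create index idx_drug_era_person_id on {{ this }} (person_id)"

on-run-end:
  # hypothetical macro that drops the intermediary models after a successful run
  - "{{ drop_intermediate_models() }}"
```

For the dialect-specific SQL, dbt's built-in cross-database macros (for example `{{ dbt.dateadd('day', 30, 'start_date') }}`) or their dbt-utils predecessors can replace T-SQL functions such as DATEADD.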

@katy-sadowski, FYI.

vvcb added 11 commits February 17, 2024 10:52
Prefixing the Synthea seed files (which we should not be using aside from a quick demo) allows us to avoid name conflicts with any other models that we may create within the project.

It also allows us to identify these model references more easily in the code and DAG.
These are the SQL scripts from <https://github.com/OHDSI/ETL-Synthea/tree/main/inst/sql/sql_server/cdm_version/v540>.

This dbt project only targets CDM v5.4 and Synthea V3 datasets.
This is a view layer on top of the existing vocabulary models (referred to in `./models/vocab/sources.yml`) as well as the derived models `source_to_source_vocab_map` and `source_to_standard_vocab_map`.
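
As a hedged illustration (the column list and source name are assumptions, not the actual model), one of these vocabulary views might look like:

```sql
-- models/vocab/concept.sql (illustrative sketch)
{{ config(materialized='view') }}

select
    concept_id,
    concept_name,
    domain_id,
    vocabulary_id,
    standard_concept
from {{ source('vocab', 'concept') }}
```
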
This commit has the bulk of the work, which involved transferring a lot of the code in the original R library into dbt models.
Please refer to the dbt docs and DAG for further information.

This can be refined further, both from a performance and an explainability point of view.

There is also room for making the SQL more dialect-agnostic by using dbt macros.
@katy-sadowski (Collaborator) commented:

@vvcb wow, thank you for the work you put in here!

As I recommended in my other comment, please do email me at [email protected] if you'd like to get involved in our project group. We haven't even had our first meeting yet (but we're working on scheduling time!) so we are in the very early stages here.

I see that you've essentially replicated the ETL-Synthea queries verbatim in dbt models, which is certainly one approach to rebuilding ETL-Synthea in dbt; however, my vision for this project is to take a bit more of a "first principles" approach where we consider the optimal modelling strategy that also leverages the full power of dbt and its breadth of features. All with an eye to building something generalizable across sources and database systems.

A note regarding the seeds: these are absolutely not intended to be part of the final project. The intention is to use these files solely for collaborating on development of this project, so we can all be sure we're using the exact same dataset, and so that the dataset can be version controlled if we need to make changes to improve our development workflows. This approach was inspired by dbt's Jaffle Shop tutorial, which stores its raw data as seeds: https://github.com/dbt-labs/jaffle_shop?tab=readme-ov-file#whats-in-this-repo

A note regarding cross-DB support: We are starting with Postgres because it's easy for all of us to download and work with the same database system as we come up with the initial implementation. Once we've got this up and running we'll have a whole phase of the project dedicated to cross-DB support (leveraging macros, other packages, etc.). We have access to testing databases via OHDSI that'll allow us to cover most if not all OHDSI-supported DBMS.

I really do appreciate what you've done here and would love if you could come share your experience with our group as we get started. I know we are going to learn a ton from you if you have the time and interest to join us 😄

@vvcb (Collaborator, Author) commented Feb 18, 2024:

Thank you @katy-sadowski for taking a look at this. I agree with all your comments, especially about going back to first principles. I took the reverse-engineering route, which I found quicker (lazier), but it also allowed me to quickly build the DAG and understand how it all works before I find another Saturday to start taking things apart and rebuilding: reusing models, parameterising better, etc.

This only took a day. I will drop you an email regarding future meetings.

@katy-sadowski (Collaborator) commented:

I refactored the code to connect to the newly added stg models, renamed some models to "int", updated column references, and added vocabulary seeds for a minimal vocab.

@burrowse (Collaborator) commented Jun 5, 2024:

@katy-sadowski @vvcb This is awesome work! I will try to review this week!

While I had it handy, I wanted to link something Martijn put together for the HADES vocabulary tutorial at the OHDSI EU Symposium, which filters the vocabulary down to the concepts present in a converted Synthea dataset: https://github.com/OHDSI/Tutorial-Hades/blob/main/extras/FilterVocabulary.R. We could potentially refactor it to reference the concepts that are present in the Synthea source data.

@katy-sadowski (Collaborator) commented:

> @katy-sadowski @vvcb This is awesome work! I will try to review this week!

Thank you @burrowse!

> While I had it handy, I wanted to link something Martijn put together for the HADES vocabulary tutorial at the OHDSI EU Symposium, which filters the vocabulary down to the concepts present in a converted Synthea dataset: https://github.com/OHDSI/Tutorial-Hades/blob/main/extras/FilterVocabulary.R. We could potentially refactor it to reference the concepts that are present in the Synthea source data.

This is fantastic; thanks for sharing. I will try regenerating the seeds using this script (against the full copy of the vocabulary I have downloaded). I will also add the setup steps @vvcb has in sqlmesh_synthea for using the full vocab. I think it's useful to provide users with both options.

@katy-sadowski (Collaborator) commented:

I just committed the new vocab shards generated using that script as seeds - thanks so much for sharing that, @burrowse! dbt run now populates the CDM tables as expected 😄 I also included the Python scripts I ran to generate the seeds.

This change also adds support for duckdb - thanks @vvcb for the suggestion, and for the inspiration from your SQLMesh repo. There's now a quickstart mode in duckdb that should make it super easy for people to bring their own Synthea dataset if they choose. I added create-table scripts for Postgres too, but for now have deferred the actual load step to the user.
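
For anyone trying the duckdb quickstart, a profiles.yml entry along these lines should work; the profile name and database path here are assumptions rather than the repo's documented values:

```yaml
# ~/.dbt/profiles.yml (illustrative; profile name and path are assumptions)
synthea_omop_etl:
  target: dev
  outputs:
    dev:
      type: duckdb                # requires the dbt-duckdb adapter
      path: synthea_omop.duckdb   # local database file created on first run
      threads: 4
```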

```diff
@@ -20,7 +20,10 @@ models:
 seeds:
   synthea_omop_etl:
     vocabulary:
+      enabled: true
```
Review comment (Collaborator):
I want to figure out how to disable these dynamically if someone is doing BYO-data mode. Unfortunately, this is not possible using vars: dbt-labs/dbt-core#4873
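
Until that issue is resolved, one possible workaround (an invocation-time convention, not something this change implements) is to exclude the bundled vocabulary seeds via node selection:

```sh
# BYO-data mode: skip the bundled vocabulary seeds
dbt build --exclude path:seeds/vocabulary
```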

@katy-sadowski merged commit f2f2d68 into OHDSI:main on Jun 11, 2024.