Synthea to OMOP transformation using dbt #4
Conversation
Prefixing the Synthea seed files (which we should not be using aside from a quick demo) allows us to avoid name conflicts with any other models that we may create within the project. It also allows us to identify these model references more easily in the code and DAG.
These are the SQL scripts from <https://github.com/OHDSI/ETL-Synthea/tree/main/inst/sql/sql_server/cdm_version/v540>. This dbt project only targets CDM v5.4 and Synthea V3 datasets.
This is a view layer on top of existing vocabulary models (referred to in `./models/vocab/sources.yml`) as well as the derived models `source_to_source_vocab_map` and `source_to_standard_vocab_map`.
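For readers unfamiliar with dbt sources, a minimal sketch of what a vocabulary sources file might look like is shown below. The schema and table names here are illustrative assumptions, not the actual contents of `./models/vocab/sources.yml`:

```yaml
# ./models/vocab/sources.yml -- illustrative sketch only;
# the schema name and table list are assumptions.
version: 2

sources:
  - name: vocab
    schema: vocabulary  # assumed schema holding the Athena vocabulary tables
    tables:
      - name: concept
      - name: concept_relationship
      - name: vocabulary
```

Models can then reference these tables with `{{ source('vocab', 'concept') }}` instead of hard-coded schema-qualified names, which keeps the lineage visible in the DAG.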
This commit has the bulk of the work, which involved transferring a lot of the code in the original R library into dbt models. Please refer to the dbt docs and DAG for further information. This can be refined further, both from a performance and an explainability point of view. There is also room for making the SQL more dialect-agnostic by using dbt macros.
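As one example of the dialect-agnostic direction mentioned above, dbt ships cross-database macros such as `dbt.dateadd` that compile to the target warehouse's date syntax. The model and column names in this fragment are hypothetical, purely to show the pattern:

```sql
-- Hypothetical model fragment: derive an end date portably.
-- dbt.dateadd compiles to DATEADD(...) on SQL Server, interval
-- arithmetic on Postgres, etc., so the model stays dialect-agnostic.
select
    person_id,
    {{ dbt.dateadd('day', 1, 'last_visit_date') }} as observation_period_end_date
from {{ ref('int_visits') }}  -- illustrative model name
```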
@vvcb wow, thank you for the work you put in here! As I recommended in my other comment, please do email me at [email protected] if you'd like to get involved in our project group. We haven't even had our first meeting yet (but we're working on scheduling time!) so we are in the very early stages here.

I see that you've essentially replicated the ETL-Synthea queries verbatim in dbt models, which is certainly one approach to rebuilding ETL-Synthea in dbt; however, my vision for this project is to take a bit more of a "first principles" approach where we consider the optimal modelling strategy that also leverages the full power of dbt and its breadth of features. All with an eye to building something generalizable across sources and database systems.

A note regarding the seeds: These are absolutely not intended to be part of the final project. The intention is to use these files solely for the purpose of collaborating on development of this project, so we can all ensure we're using the exact same dataset, and so that that dataset can be version controlled if we need to make any changes for the purpose of improving our development workflows. This approach was inspired by dbt's Jaffle Shop tutorial, which stores its raw data as seeds: https://github.com/dbt-labs/jaffle_shop?tab=readme-ov-file#whats-in-this-repo

A note regarding cross-DB support: We are starting with Postgres because it's easy for all of us to download and work with the same database system as we come up with the initial implementation. Once we've got this up and running we'll have a whole phase of the project dedicated to cross-DB support (leveraging macros, other packages, etc.). We have access to testing databases via OHDSI that'll allow us to cover most if not all OHDSI-supported DBMS.

I really do appreciate what you've done here and would love if you could come share your experience with our group as we get started. I know we are going to learn a ton from you if you have the time and interest to join us 😄
Thank you @katy-sadowski for taking a look at this. I agree with all your comments, especially going back to first principles. I took the reverse-engineering route, which I found quicker (lazier) but which also allowed me to quickly build the DAG to understand how it all works before I can find another Saturday to start taking things apart and rebuilding: reusing models, parameterising better, etc. This only took a day, but I will drop you an email regarding future meetings.
I refactored the code to connect to the newly added stg models, renamed some models to "int", updated column references, and added vocabulary seeds for a minimal vocab.
@katy-sadowski @vvcb This is awesome work! I will try to review this week! While I had it handy, I wanted to link something Martijn put together for the HADES vocabulary tutorial for the OHDSI EU symposium to filter the vocabulary: https://github.com/OHDSI/Tutorial-Hades/blob/main/extras/FilterVocabulary.R from a converted Synthea dataset. We could potentially refactor it to reference the concepts that are present in the Synthea source data.
Thank you @burrowse!
This is fantastic - thanks for sharing. I will try regenerating the seeds using this (against the full copy of the vocab I have downloaded). I will also add the setup steps @vvcb has in sqlmesh_synthea for using the full vocab. I think it's useful to provide users with both options.
I just committed the new vocab shards generated using that script as seeds - thanks so much for sharing that @burrowse! This change also includes added support for DuckDB - thanks @vvcb for the suggestion, and the inspiration from your SQLMesh repo. There's now a quickstart mode in DuckDB that should make it super easy for people to bring their own Synthea dataset if they choose. I added create-table scripts for Postgres too, but for now deferred the actual load step to the user.
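For readers wanting to try the DuckDB path, a sketch of what a local dbt profile for it could look like follows. The profile and target names, and the database path, are assumptions and not necessarily the repo's actual configuration:

```yaml
# ~/.dbt/profiles.yml -- sketch of a DuckDB target (requires the
# dbt-duckdb adapter; names and path here are illustrative only).
synthea_omop_etl:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: synthea_omop_etl.duckdb  # local database file
```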
@@ -20,7 +20,10 @@ models:

 seeds:
   synthea_omop_etl:
     vocabulary:
+      +enabled: true
I want to figure out how to disable these dynamically if someone is doing BYO-data mode. Unfortunately, this is not possible using vars:
- dbt-labs/dbt-core#4873
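For context, this is the kind of toggle that does not work reliably. The var name `load_vocab` is hypothetical; per the linked dbt-core issue, rendering a var inside seed `enabled` config in `dbt_project.yml` is not properly supported:

```yaml
# dbt_project.yml -- sketch of the attempted dynamic toggle.
# `load_vocab` is a hypothetical var name; this pattern is what
# dbt-core issue #4873 reports as unsupported for seeds.
seeds:
  synthea_omop_etl:
    vocabulary:
      +enabled: "{{ var('load_vocab', true) }}"
```

The practical workaround is a static `+enabled: true` (as in the diff above), with users excluding the seeds at invocation time, e.g. `dbt build --exclude vocabulary`.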
This PR includes a working implementation of the Synthea to OMOP transformation using dbt, replicating the workflow in https://github.com/OHDSI/ETL-Synthea. (The one exception is drug_era, which takes forever to build, likely due to a lack of indices.)

dbt is a transformation tool and expects that the extract and load parts of the ELT have already been completed. While the repo has the Synthea1k dataset as seed files, users are strongly discouraged from using it and should instead download their Synthea dataset of choice and the Athena vocabulary, then load them into their data warehouse. This implementation assumes that this data is already in the database and provides easily configurable sources.yml files to point dbt to these.
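For the bring-your-own-data path described above, the sources file is the single place users would repoint. A sketch follows; the source name, schema, and table list are assumptions chosen to illustrate the idea, not the repo's actual file:

```yaml
# models/staging/sources.yml -- illustrative sketch only; point the
# schema at wherever you loaded your Synthea CSVs.
version: 2

sources:
  - name: synthea
    schema: synthea_raw  # assumed schema for the loaded Synthea data
    tables:
      - name: patients
      - name: encounters
      - name: conditions
      - name: medications
```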
In addition to dbt models, this PR also includes the following additional 'features':
Development was made infinitely easier by the vscode-dbt-power-user extension, which allows real-time lineage (including column-level lineage) as well as execution of models against the target warehouse.
ToDo:
@katy-sadowski , FYI.