
Synthea to OMOP transformation using dbt #4

Merged: 22 commits, Jun 11, 2024

Conversation

@vvcb (Collaborator) commented Feb 17, 2024:

This PR includes a working implementation of the Synthea to OMOP transformation using dbt, replicating the workflow in https://github.com/OHDSI/ETL-Synthea. (The one exception is drug era, which takes a very long time to build, most likely because of missing indices.)

dbt is a transformation tool and expects that the extract and load parts of the ELT have already been completed. While the repo includes the Synthea1k dataset as seed files, users are strongly discouraged from relying on these; instead, they should download their Synthea dataset of choice and the Athena vocabulary, and load them into their data warehouse.

This implementation assumes that this data is already in the database and provides easily configurable sources.yml files that point dbt at it.
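
For readers unfamiliar with dbt sources, a minimal sketch of what such a sources.yml might look like follows; the schema and table names here are assumptions for illustration, not the project's actual configuration:

```yaml
# models/vocab/sources.yml (illustrative sketch; names are assumptions)
version: 2

sources:
  - name: vocab
    schema: vocabulary  # point this at wherever the Athena vocabulary was loaded
    tables:
      - name: concept
      - name: concept_relationship
      - name: vocabulary
```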

In addition to dbt models, this PR also includes the following additional 'features':

  • pre-commit hooks
  • sqlfluff for linting Jinja-templated SQL (a minimal config sketch follows this list)
  • requirements.txt for installing the necessary dependencies
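
As an illustration of the sqlfluff piece, a `.sqlfluff` configuration for linting dbt-flavoured SQL might look roughly like this; the dialect reflects the SQL Server target of this PR, and the whole snippet is a sketch rather than the repo's actual config:

```ini
; .sqlfluff (illustrative sketch, not the repo's actual config)
[sqlfluff]
templater = dbt
dialect = tsql

[sqlfluff:templater:dbt]
project_dir = .
```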

Development was made infinitely easier by the vscode-dbt-power-user extension, which provides real-time lineage (including column-level lineage) as well as execution of models against the target warehouse.

ToDo:

  • Update the README to help users target different databases: the current docs are for Postgres, while this PR targets SQL Server
  • Replace T-SQL-specific code with equivalent macros from dbt-utils or one of the many excellent dbt packages
  • Add indices as post-hooks
  • Consider removing all intermediary models at the end of a successful run; easy to implement with an on-run-end hook (see the sketch after this list)
  • Alternatively, keep the intermediary models but document them well so users can better understand how the transformation is done
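
To make the last few items concrete, here is a rough sketch of how they might look in dbt configuration. The model key and the `drop_intermediate_models` macro are hypothetical, not things this PR defines:

```yaml
# dbt_project.yml (hypothetical sketch of the ToDo items above)
models:
  synthea_omop_etl:
    drug_era:
      # create an index after the model builds, instead of hand-written T-SQL
      +post-hook:
        - "create index idx_drug_era_person_id on {{ this }} (person_id)"

on-run-end:
  # hypothetical macro that drops the intermediary models after a successful run
  - "{{ drop_intermediate_models() }}"
```

For the dialect-specific SQL, dbt's built-in cross-database macros (for example `{{ dbt.dateadd('day', 30, 'start_date') }}`) or their dbt-utils predecessors can replace T-SQL functions such as DATEADD.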

@katy-sadowski, FYI.

vvcb added 11 commits February 17, 2024 10:52
Prefixing the Synthea seed files (which we should not be using aside from a quick demo) allows us to avoid name conflicts with any other models that we may create within the project.

It also allows us to identify these model references more easily in the code and DAG.
These are the SQL scripts from <https://github.com/OHDSI/ETL-Synthea/tree/main/inst/sql/sql_server/cdm_version/v540>.

This dbt project only targets CDM v5.4 and Synthea V3 datasets.
This is a view layer on top of the existing vocabulary models (referred to in `./models/vocab/sources.yml`) as well as the derived models `source_to_source_vocab_map` and `source_to_standard_vocab_map`.
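
As a hedged illustration (the column list and source name are assumptions, not the actual model), one of these vocabulary views might look like:

```sql
-- models/vocab/concept.sql (illustrative sketch)
{{ config(materialized='view') }}

select
    concept_id,
    concept_name,
    domain_id,
    vocabulary_id,
    standard_concept
from {{ source('vocab', 'concept') }}
```
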
This commit has the bulk of the work, which involved transferring a lot of the code in the original R library into dbt models.
Please refer to the dbt docs and DAG for further information.

This can be refined further, both from a performance and an explainability point of view.

There is also room for making the SQL more dialect-agnostic by using dbt macros.
@katy-sadowski (Collaborator) commented:

@vvcb wow, thank you for the work you put in here!

As I recommended in my other comment, please do email me at [email protected] if you'd like to get involved in our project group. We haven't even had our first meeting yet (but we're working on scheduling time!) so we are in the very early stages here.

I see that you've essentially replicated the ETL-Synthea queries verbatim in dbt models, which is certainly one approach to rebuilding ETL-Synthea in dbt; however, my vision for this project is to take a bit more of a "first principles" approach where we consider the optimal modelling strategy that also leverages the full power of dbt and its breadth of features. All with an eye to building something generalizable across sources and database systems.

A note regarding the seeds: these are absolutely not intended to be part of the final project. The intention is to use these files solely for collaborating on development of this project, so we can all be sure we're using the exact same dataset, and so that the dataset can be version controlled if we need to make changes to improve our development workflows. This approach was inspired by dbt's Jaffle Shop tutorial, which stores its raw data as seeds: https://github.com/dbt-labs/jaffle_shop?tab=readme-ov-file#whats-in-this-repo

A note regarding cross-DB support: We are starting with Postgres because it's easy for all of us to download and work with the same database system as we come up with the initial implementation. Once we've got this up and running we'll have a whole phase of the project dedicated to cross-DB support (leveraging macros, other packages, etc.). We have access to testing databases via OHDSI that'll allow us to cover most if not all OHDSI-supported DBMS.

I really do appreciate what you've done here and would love if you could come share your experience with our group as we get started. I know we are going to learn a ton from you if you have the time and interest to join us 😄

@vvcb (Collaborator, Author) commented Feb 18, 2024:

Thank you @katy-sadowski for taking a look at this. I agree with all your comments, especially about going back to first principles. I took the reverse-engineering route, which I found quicker (lazier), but it also allowed me to quickly build the DAG and understand how it all works before I find another Saturday to start taking things apart and rebuilding: reusing models, parameterising better, etc.

This only took a day. I will drop you an email regarding future meetings.

@katy-sadowski (Collaborator) commented:

I refactored the code to connect to the newly added stg models, renamed some models to "int", updated column references, and added vocabulary seeds for a minimal vocab.

@burrowse (Collaborator) commented Jun 5, 2024:

@katy-sadowski @vvcb This is awesome work! I will try to review this week!

While I had it handy, I wanted to link something Martijn put together for the HADES vocabulary tutorial at the OHDSI EU Symposium, which filters the vocabulary down to the concepts present in a converted Synthea dataset: https://github.com/OHDSI/Tutorial-Hades/blob/main/extras/FilterVocabulary.R. We could potentially refactor it to reference the concepts that are present in the Synthea source data.

@katy-sadowski (Collaborator) commented:

> @katy-sadowski @vvcb This is awesome work! I will try to review this week!

Thank you @burrowse!

> While I had it handy, I wanted to link something Martijn put together for the HADES vocabulary tutorial at the OHDSI EU Symposium, which filters the vocabulary down to the concepts present in a converted Synthea dataset: https://github.com/OHDSI/Tutorial-Hades/blob/main/extras/FilterVocabulary.R. We could potentially refactor it to reference the concepts that are present in the Synthea source data.

This is fantastic; thanks for sharing. I will try regenerating the seeds using this script (against the full copy of the vocabulary I have downloaded). I will also add the setup steps @vvcb has in sqlmesh_synthea for using the full vocab. I think it's useful to provide users with both options.

@katy-sadowski (Collaborator) commented:

I just committed the new vocab shards generated using that script as seeds - thanks so much for sharing that, @burrowse! dbt run now populates the CDM tables as expected 😄 I also included the Python scripts I ran to generate the seeds.

This change also adds support for duckdb - thanks @vvcb for the suggestion, and for the inspiration from your SQLMesh repo. There's now a quickstart mode in duckdb that should make it super easy for people to bring their own Synthea dataset if they choose. I added create-table scripts for Postgres too, but for now have deferred the actual load step to the user.
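
For anyone trying the duckdb quickstart, a profiles.yml entry along these lines should work; the profile name and database path here are assumptions rather than the repo's documented values:

```yaml
# ~/.dbt/profiles.yml (illustrative; profile name and path are assumptions)
synthea_omop_etl:
  target: dev
  outputs:
    dev:
      type: duckdb                # requires the dbt-duckdb adapter
      path: synthea_omop.duckdb   # local database file created on first run
      threads: 4
```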

```diff
@@ -20,7 +20,10 @@ models:
 seeds:
   synthea_omop_etl:
     vocabulary:
+      enabled: true
```
Review comment (Collaborator):
I want to figure out how to disable these dynamically if someone is doing BYO-data mode. Unfortunately, this is not possible using vars: dbt-labs/dbt-core#4873
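
Until that issue is resolved, one possible workaround (an invocation-time convention, not something this change implements) is to exclude the bundled vocabulary seeds via node selection:

```sh
# BYO-data mode: skip the bundled vocabulary seeds
dbt build --exclude path:seeds/vocabulary
```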

@katy-sadowski merged commit f2f2d68 into OHDSI:main on Jun 11, 2024.