[DC-2692] synthetic dataset script version 1 #1369

Open
wants to merge 19 commits into base: develop
Conversation

lrwb-aou
Contributor

  • write a script that automates the largest pain points of generating a synthetic dataset from RDR team data

* altering the base class to add another attribute, which defaults to False
* setting that attribute to True on the classes we want to run on a synthetic dataset
* altering the clean engine so that, when synthetic is selected, it only runs the rules whose attribute says they should be executed
* works with listing the queries as well
* need to work on the opentelemetry implementation so it does not throw errors when running something from the command line locally
* fixing two unit tests
* ignoring the redefinition here.
* the redefinition will be removed once all classes inherit from the base class
* alter the import script to warn curation whenever we see a table in the bucket that we do not process
* fixes failing unit tests impacted by the addition of the run_synthetic parameter to infer_rule()
* loads data from a bucket into a raw dataset
* creates a synthetic dataset and its appropriate versions (staging, sandbox, and clean)
* runs synthetic pipeline data stage on the data in the staging dataset
* TODO: add publishing guidelines to the script.
* leverage function in `create_combined_backup_dataset.py` to create rudimentary rdr mapping tables.
* update the synthetic data stage to leverage the Registered Tier dataset cleaning rules
* allow extension table generation and cope survey versioning to run on synthetic data.
* TODO: "publish" data to an internal dataset.
* making sure person table columns are appended
* The txt file was not meant for inclusion.
* some changes to the script while trying to run it initially
* adding vocab_dataset parameter
* changes required when running the synthetic script all the way through
* the script did finish
* more changes are expected
* sets some run_for_synthetic rules to False to avoid dropping too much test data
* changed f-string usage to jinja2 templates
* used pre-defined variable for constant value
* removed redundant code to reuse existing dataset copy utility
* removed conflict code
* uses cleaning rules to clean survey_conduct table data
* removes duplicated code to create cleaned survey_conduct table data
* prepares to potentially run all rules from RDR ingest to RT clean dataset
* still only runs a subset of rules marked as run_for_synthetic
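The rule-selection mechanism described in the bullets above (a new attribute on the rule base class, defaulting to False, checked by the clean engine when synthetic is selected) can be sketched roughly as follows. All class, function, and attribute names here are illustrative, not the curation repo's actual API:

```python
# Illustrative sketch only: names do not match the curation repo's real classes.
class BaseCleaningRule:
    # new base-class attribute; rules opt out of synthetic runs by default
    run_for_synthetic = False


class DropDuplicateRows(BaseCleaningRule):
    # a rule we explicitly want to run on a synthetic dataset
    run_for_synthetic = True


def select_rules(rules, synthetic=False):
    """Mimic the clean engine: when synthetic is selected, keep only the
    rules whose attribute says they should be executed."""
    if synthetic:
        return [rule for rule in rules if rule.run_for_synthetic]
    return list(rules)


rules = [BaseCleaningRule(), DropDuplicateRows()]
print([type(r).__name__ for r in select_rules(rules, synthetic=True)])
# → ['DropDuplicateRows']
```

Keeping the attribute on the class (rather than a separate registry) means each rule declares its own synthetic-run behavior next to its cleaning logic.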
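The import-script change that warns curation whenever the bucket contains a table we do not process might look something like this minimal sketch; the expected-table set and function name are hypothetical:

```python
import logging

# hypothetical subset of tables the pipeline knows how to process
EXPECTED_TABLES = {'person', 'observation', 'survey_conduct'}


def warn_on_unknown_tables(bucket_tables, expected=EXPECTED_TABLES):
    """Warn curation about any table seen in the bucket that we do not process."""
    unknown = sorted(set(bucket_tables) - expected)
    for table in unknown:
        logging.warning("table '%s' found in bucket but is not processed", table)
    return unknown


print(warn_on_unknown_tables(['person', 'mystery_table']))  # → ['mystery_table']
```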
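The f-string-to-jinja2 change mentioned above replaces inline interpolation with rendered templates, which keeps query text inspectable before parameters are filled in. A hedged illustration; the query and identifiers are made up, not taken from the script:

```python
from jinja2 import Template

# before: query = f"SELECT person_id FROM `{project}.{dataset}.person`"
# after: a jinja2 template rendered with explicit parameters
QUERY_TMPL = Template(
    "SELECT person_id FROM `{{ project }}.{{ dataset }}.person`"
)

query = QUERY_TMPL.render(project='my-project', dataset='synthetic_staging')
print(query)  # → SELECT person_id FROM `my-project.synthetic_staging.person`
```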