[DC-2692] synthetic dataset script version 1 #1369

Open
wants to merge 19 commits into base: develop
Conversation

lrwb-aou
Contributor

  • write a script that automates the largest pain points of generating a synthetic dataset from RDR team data

* altering the base class to add another attribute, which defaults to False
* setting that attribute to True on the classes we want to run on a synthetic dataset
* altering the clean engine so that, when synthetic is selected, it only runs the rules whose attribute says they should be executed
* works with listing the queries as well
* need to work on the opentelemetry implementation so it does not throw errors when running something from the command line locally
* fixing two unit tests
* ignoring the redefinition here.
* the redefinition will be removed once all classes inherit from the base class
* alter the import script to warn curation whenever we see a table in the bucket that we do not process
* fixes failing unit tests impacted by the addition of the run_synthetic parameter to infer_rule()
* loads data from a bucket into a raw dataset
* creates a synthetic dataset and its appropriate versions (staging, sandbox, and clean)
* runs synthetic pipeline data stage on the data in the staging dataset
* TODO: add publishing guidelines to the script.
* leverage function in `create_combined_backup_dataset.py` to create rudimentary rdr mapping tables.
* update the synthetic data stage to leverage the Registered Tier dataset cleaning rules
* allow extension table generation and cope survey versioning to run on synthetic data.
* TODO: "publish" data to an internal dataset.
* making sure person table columns are appended
* The txt file was not meant for inclusion.
* some changes to the script while trying to run it initially
* adding vocab_dataset parameter
* changes required when running the synthetic script all the way through
* the script did finish
* more changes are expected
* sets some run_for_synthetic rules to False to avoid dropping too much test data
* changed f-string usage to jinja2 templates
* used pre-defined variable for constant value
* removed redundant code to reuse existing dataset copy utility
* removed conflict code
* uses cleaning rules to clean survey_conduct table data
* removes duplicated code to create cleaned survey_conduct table data
* prepares to potentially run all rules from RDR ingest to RT clean dataset
* still only runs a subset of rules marked as run_for_synthetic
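The rule-selection mechanism described in the bullets above (a new attribute on the rule base class, defaulting to False, checked by the clean engine when synthetic is selected) can be sketched roughly as follows. All class, function, and attribute names here are illustrative, not the curation repo's actual API:

```python
# Illustrative sketch only: names do not match the curation repo's real classes.
class BaseCleaningRule:
    # new base-class attribute; rules opt out of synthetic runs by default
    run_for_synthetic = False


class DropDuplicateRows(BaseCleaningRule):
    # a rule we explicitly want to run on a synthetic dataset
    run_for_synthetic = True


def select_rules(rules, synthetic=False):
    """Mimic the clean engine: when synthetic is selected, keep only the
    rules whose attribute says they should be executed."""
    if synthetic:
        return [rule for rule in rules if rule.run_for_synthetic]
    return list(rules)


rules = [BaseCleaningRule(), DropDuplicateRows()]
print([type(r).__name__ for r in select_rules(rules, synthetic=True)])
# → ['DropDuplicateRows']
```

Keeping the attribute on the class (rather than a separate registry) means each rule declares its own synthetic-run behavior next to its cleaning logic.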
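The import-script change that warns curation whenever the bucket contains a table we do not process might look something like this minimal sketch; the expected-table set and function name are hypothetical:

```python
import logging

# hypothetical subset of tables the pipeline knows how to process
EXPECTED_TABLES = {'person', 'observation', 'survey_conduct'}


def warn_on_unknown_tables(bucket_tables, expected=EXPECTED_TABLES):
    """Warn curation about any table seen in the bucket that we do not process."""
    unknown = sorted(set(bucket_tables) - expected)
    for table in unknown:
        logging.warning("table '%s' found in bucket but is not processed", table)
    return unknown


print(warn_on_unknown_tables(['person', 'mystery_table']))  # → ['mystery_table']
```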
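The f-string-to-jinja2 change mentioned above replaces inline interpolation with rendered templates, which keeps query text inspectable before parameters are filled in. A hedged illustration; the query and identifiers are made up, not taken from the script:

```python
from jinja2 import Template

# before: query = f"SELECT person_id FROM `{project}.{dataset}.person`"
# after: a jinja2 template rendered with explicit parameters
QUERY_TMPL = Template(
    "SELECT person_id FROM `{{ project }}.{{ dataset }}.person`"
)

query = QUERY_TMPL.render(project='my-project', dataset='synthetic_staging')
print(query)  # → SELECT person_id FROM `my-project.synthetic_staging.person`
```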