[DC-2692] synthetic dataset script version 1 #1369
Open
lrwb-aou wants to merge 19 commits into develop from lb/synthetic_on_stable
lrwb-aou (Contributor) commented on Aug 29, 2022
- write a script that automates the largest pain points of generating synthetic data from the RDR team
lrwb-aou force-pushed the lb/synthetic_on_stable branch 2 times, most recently from b7f6657 to 55bd1a6 on September 21, 2022 22:12
lrwb-aou force-pushed the lb/synthetic_on_stable branch from 55bd1a6 to 7843d34 on September 30, 2022 21:51
lrwb-aou force-pushed the lb/synthetic_on_stable branch from edc91ca to b60d1b3 on October 10, 2022 22:22
lrwb-aou force-pushed the lb/synthetic_on_stable branch from b60d1b3 to d878b40 on November 11, 2022 19:23
lrwb-aou force-pushed the lb/synthetic_on_stable branch from 7cdee18 to 408b85f on February 27, 2023 22:53
lrwb-aou force-pushed the lb/synthetic_on_stable branch from 5f1e99b to 7daee0f on March 30, 2023 17:28
lrwb-aou force-pushed the lb/synthetic_on_stable branch from 7daee0f to e023eef on April 24, 2023 17:49
* altering the base class to have another attribute which defaults to false
* adding True attributes to those classes we want to run on a synthetic data set
* altering the clean engine to only run those rules for synthetic datasets when synthetic is selected and the rule says it should be executed
* works with listing the queries as well
* need to work on the OpenTelemetry implementation so it does not throw errors when running something from the command line locally
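The flag-and-filter approach described in this commit can be sketched roughly as follows. This is an illustration only, not the code in this PR: `BaseRule`, `DropDuplicates`, `SuppressFreeText`, and `select_rules` are hypothetical names, and only the `run_for_synthetic` attribute name is taken from the later commit messages in this PR.

```python
class BaseRule:
    """Hypothetical stand-in for the cleaning-rule base class."""

    # Assumption: the new attribute defaults to False, so a rule is skipped
    # on synthetic datasets unless it explicitly opts in.
    run_for_synthetic = False


class DropDuplicates(BaseRule):
    """Hypothetical rule that opts in to synthetic runs."""

    run_for_synthetic = True


class SuppressFreeText(BaseRule):
    """Hypothetical rule that keeps the default and is skipped."""


def select_rules(rules, synthetic=False):
    """Return the rule classes the engine should run.

    In a normal run every rule is kept. In a synthetic run only the rules
    whose run_for_synthetic attribute is True are kept.
    """
    if not synthetic:
        return list(rules)
    return [rule for rule in rules if rule.run_for_synthetic]


if __name__ == "__main__":
    all_rules = [DropDuplicates, SuppressFreeText]
    for rule in select_rules(all_rules, synthetic=True):
        print(rule.__name__)
```

Defaulting the attribute to False means existing rules keep their current behavior and each rule has to be reviewed before it is allowed to touch synthetic data.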
* fixing two unit tests
* ignoring the redefinition here.
* redefinition will be removed once all classes are base classed
* alter the import script to warn curation whenever we see a table in the bucket that we do not process
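A minimal sketch of the warning behavior described in this commit, under stated assumptions: the set of known tables, the `.csv` file layout, and the function name `find_unprocessed` are all hypothetical and are not taken from the import script itself.

```python
import logging

LOGGER = logging.getLogger(__name__)

# Assumption: a hypothetical set of tables the import step knows how to load.
KNOWN_TABLES = {"person", "observation", "survey_conduct"}


def find_unprocessed(file_names, known_tables=KNOWN_TABLES):
    """Warn about, and return, bucket tables that the import does not process.

    Each file name is reduced to a table name by dropping any folder prefix
    and everything from the first dot onward.
    """
    unprocessed = []
    for file_name in file_names:
        table = file_name.rsplit("/", 1)[-1].split(".", 1)[0]
        if table not in known_tables:
            LOGGER.warning(
                "Table '%s' was found in the bucket but is not processed", table)
            unprocessed.append(table)
    return unprocessed


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    find_unprocessed(["export/person.csv", "export/pid_rid_mapping.csv"])
```

Logging a warning instead of raising keeps the import running while still surfacing new tables that may need handling.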
* fixes failing unit tests impacted by the addition of the run_synthetic parameter to infer_rule()
* loads data from a bucket into a raw dataset
* creates a synthetic dataset and its appropriate versions (staging, sandbox, and clean)
* runs synthetic pipeline data stage on the data in the staging dataset
* TODO: add publishing guidelines to script.
* leverage function in `create_combined_backup_dataset.py` to create rudimentary rdr mapping tables.
* update the synthetic data stage to leverage the Registered Tier dataset cleaning rules
* allow extension table generation and cope survey versioning to run on synthetic data.
* TODO: "publish" data to an internal dataset.
* making sure person table columns are appended
* The txt file was not meant for inclusion.
* changes required when running the synthetic script all the way through
* the script did finish
* more changes are expected
* sets some run_for_synthetic rules to False to avoid dropping too much test data
* changed f-string usage to jinja2 templates
* used pre-defined variable for constant value
* removed redundant code to reuse existing dataset copy utility
* removed conflict code
* uses cleaning rules to clean survey_conduct table data
* removes duplicated code to create cleaned survey_conduct table data
* prepares to potentially run all rules from RDR ingest to RT clean dataset
* still only runs a subset of rules marked as run_for_synthetic
lrwb-aou force-pushed the lb/synthetic_on_stable branch from 8ed9dcd to 70ed906 on August 28, 2023 21:29