-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[IBCDPE-688] Great Expectations Implementation for Metabolomics Data #96
Conversation
Your expectations line up really well with what I came up with. The additional regexes are a nice bonus too! Here's a useful multifield unique for metabolomics:
General thoughts/feedback (can be addressed later, as appropriate):
|
@JessterB Thanks for the feedback! I will add that multicolumn unique expectation. Edit: The As part of this work, I can add an expectation that checks the values within the lists. I can also add an expectation that checks for a particular data type to exist within the lists (for metabolomics this would probably be string, int, or list as the case may be). For nested fields and cross-file validation, I think those should be separate tasks to be tackled later because of their potential complexity. For the |
This looks amazing! To echo Jess, adding the regexes is a great idea. However I see your note above that the check for I agree that being able to check if values inside lists are numeric, strings, etc as expected would be really useful. |
great_expectations/gx/plugins/custom_data_docs/styles/data_docs_custom_styles.css
Outdated
Show resolved
Hide resolved
great_expectations/gx/plugins/expectations/expect_column_values_to_have_list_members.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Amazing amount of work, and thank you for adding documentation and thorough testing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 LGTM! This is wonderful
…uired [IBCDPE-527] Makes `syn` always required
Request for Initial Feedback (EDIT: This description is now out-of-date):
I started off by looking through the data validation spreadsheet and the example
metabolomics.json
data from Synapse and applying the expectations that I felt made sense for each column. I have also set up ourgreat_expectations
environment in this repo and created the custom expectationexpect_column_values_to_have_list_length
to evaluate the length of lists in columns containing `list' objects.I have created an example script,
gx_metabolomics.py
, which includes all of the expectations I have tested against the dataset and the example data itself so that others can run the script to test the expectations.The final version of this work will look very different. Expectations running will be implemented seamlessly with the data processing. The example dataset will not be present, with the output of data processing being run through the Great Expectations tests immediately following processing.
With all that being said, I would appreciate feedback on the following:
particular. For example, are there any sets of two columns whose values should be unique from
other rows when taken in pairs?
In the meantime, I will continue to work on the overall implementation.
Note: If you want to run the example script yourself, you will need to checkout this branch of the repository, run
pipenv install
to getgreat_expecations
installed, and then runpython gx_metabolomics.py
.EDIT:
Final Description of Work:
Problem:
We currently have no way to ensure that the data produced by
agora-data-tools
is valid and meets our "expectations" of what the data should be.Solution:
Implement
great_expectations
(GX) for post-processing data validation which runs for each supported dataset when you run theadt <config_file>
command.I played around with a few different ways to implement GX. What I eventually settled on is:
(everything in the
src/agoradatatools/great_expectations
directory within this repo).gx_suite_definitions
directory..py
script so that trying out new expectations is easier.save_expectation_suite
stores the previously executed expectations as a suite in a JSON file, allowing a muchsimpler process of executing all of the expectations for a dataset later.
metabolomics.ipynb
can be used as a template for future expectation suite creationGreatExpectationsRunner
class which is instantiated withinprocess.process_dataset
whenever a dataset has a GX report upload folder (gx_folder
) specified in its configuration.process.process_dataset
function.test_config.yaml
vs.config.yaml
. Subfolders for each dataset will be created in these locations as we add more expectation suites.- Synapse folder for test data validation reports.
- Synapse folder for production data validation reports.
Note: The multi-column unique expectation discussed below is not currently implemented because it was not passing on the example
metabolomics
dataset.Custom Expectations:
During this work, I also created three custom expectations for us to use on our data which contain lists.
ExpectColumnValuesToHaveListLength
: checks to see if the lists in a particular column are the length that we expect.ExpectColumnValuesToHaveListMembers
: checks to see if the lists in a particular column contain only values that we expect.ExpectColumnValuesToHaveListMembersOfType
: checks to see if the lists in a particular column contain members of the type we expect.Note: None of these custom expectations go beyond the first "layer" of nested data. I.e. You cannot currently test the length of a list within a list.
These changes have been successfully tested in the Nextflow Workflow
nf-agora
locally, and I on Tower.