
[IBCDPE-688] Great Expectations Implementation for Metabolomics Data #96

Merged · 74 commits · Nov 22, 2023

Conversation

BWMac (Contributor) commented Nov 8, 2023

Request for Initial Feedback (EDIT: This description is now out-of-date):

I started by looking through the data validation spreadsheet and the example metabolomics.json data from Synapse and applying the expectations that made sense for each column. I have also set up our great_expectations environment in this repo and created the custom expectation expect_column_values_to_have_list_length to evaluate the length of lists in columns containing `list` objects.

I have created an example script, gx_metabolomics.py, which includes all of the expectations I have tested against the dataset, along with the example data itself, so that others can run the script to test the expectations.

The final version of this work will look very different. Expectation runs will be integrated seamlessly with data processing. The example dataset will not be present; instead, the output of data processing will be run through the Great Expectations tests immediately after processing.

With all that being said, I would appreciate feedback on the following:

  1. Are the expectations I have selected for each column appropriate?
  2. Are there any expectations missing?
    • I am particularly interested in potential multi-column expectations as I was not sure about this in
      particular. For example, are there any sets of two columns whose values should be unique from
      other rows when taken in pairs?

In the meantime, I will continue to work on the overall implementation.

Note: If you want to run the example script yourself, you will need to check out this branch of the repository, run pipenv install to get great_expectations installed, and then run python gx_metabolomics.py.

EDIT:

Final Description of Work:

Problem:

We currently have no way to ensure that the data produced by agora-data-tools is valid and meets our "expectations" of what the data should be.

Solution:

Implement great_expectations (GX) for post-processing data validation which runs for each supported dataset when you run the adt <config_file> command.

I played around with a few different ways to implement GX. What I eventually settled on is:

  • We will use a file system-based data context for our GX implementation.
    • This allows us to have version control on all GX configuration files and expectation suites
      (everything in the src/agoradatatools/great_expectations directory within this repo).
  • Expectation suites will be created by running Jupyter Notebooks which are stored in the gx_suite_definitions directory.
    • I chose to use Jupyter Notebooks over a plain .py script so that trying out new expectations is easier.
    • save_expectation_suite stores the previously executed expectations as a suite in a JSON file, allowing a much
      simpler process of executing all of the expectations for a dataset later.
    • Editing an expectation suite is as simple as updating the Notebook and re-running it.
    • metabolomics.ipynb can be used as a template for future expectation suite creation.
  • In order to perform the data validation within a data processing run, I created the GreatExpectationsRunner class which is instantiated within process.process_dataset whenever a dataset has a GX report upload folder (gx_folder) specified in its configuration.
    • The class is instantiated with objects that are already available within the scope of the
      process.process_dataset function.
    • It checks if the requested expectation suite exists before attempting the data validation.
    • It logs its progress as it:
      • Checks if the suite exists
      • Performs data validation on a given dataset
      • Uploads the resulting HTML report to Synapse
  • I created separate locations in Synapse for data validation performed during a run with test_config.yaml vs. config.yaml. Subfolders for each dataset will be created in these locations as we add more expectation suites.
    - Synapse folder for test data validation reports.
    - Synapse folder for production data validation reports.
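The runner described above can be sketched in plain Python. Only the class name GreatExpectationsRunner and the gx_folder configuration key come from this PR; the method names, the suite registry, and the Synapse folder id below are illustrative placeholders, not the actual adt implementation.

```python
# Hypothetical sketch of the control flow: check that a suite exists,
# validate, and upload the report. All internals here are illustrative.
class GreatExpectationsRunner:
    def __init__(self, dataset_name, gx_folder):
        self.dataset_name = dataset_name
        self.gx_folder = gx_folder  # Synapse folder id for the HTML report

    def _suite_exists(self):
        # The real implementation looks for the expectation suite JSON
        # saved by the gx_suite_definitions notebooks.
        return self.dataset_name in {"metabolomics"}

    def run(self):
        if not self._suite_exists():
            print(f"no expectation suite for {self.dataset_name}, skipping")
            return False
        print(f"validating {self.dataset_name}")
        # ... run the suite, render the HTML report, upload to gx_folder ...
        return True

# process.process_dataset would only construct a runner when the dataset's
# configuration specifies a gx_folder
runner = GreatExpectationsRunner("metabolomics", gx_folder="syn-placeholder")
print(runner.run())  # True
```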

Note: The multi-column unique expectation discussed below is not currently implemented because it was not passing on the example metabolomics dataset.

Custom Expectations:

During this work, I also created three custom expectations for us to use on our data which contain lists.

  • ExpectColumnValuesToHaveListLength: checks to see if the lists in a particular column are the length that we expect.
  • ExpectColumnValuesToHaveListMembers: checks to see if the lists in a particular column contain only values that we expect.
  • ExpectColumnValuesToHaveListMembersOfType: checks to see if the lists in a particular column contain members of the type we expect.

Note: None of these custom expectations go beyond the first "layer" of nested data, i.e., you cannot currently test the length of a list within a list.
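The checks the three custom expectations perform can be sketched in plain Python. The real versions are implemented as Great Expectations custom expectations; these helpers and the column data are illustrative only.

```python
# Plain-Python sketches of the three custom list expectations.
def have_list_length(values, length):
    """ExpectColumnValuesToHaveListLength: every list has the given length."""
    return all(isinstance(v, list) and len(v) == length for v in values)

def have_list_members(values, allowed):
    """ExpectColumnValuesToHaveListMembers: lists contain only allowed values."""
    return all(set(v) <= set(allowed) for v in values)

def have_list_members_of_type(values, member_type):
    """ExpectColumnValuesToHaveListMembersOfType: members have the given type."""
    return all(isinstance(m, member_type) for v in values for m in v)

boxplot_group_names = [["CN", "AD"], ["AD", "CN"]]
print(have_list_length(boxplot_group_names, 2))              # True
print(have_list_members(boxplot_group_names, {"CN", "AD"}))  # True
print(have_list_members_of_type(boxplot_group_names, str))   # True
```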

These changes have been successfully tested with the Nextflow workflow nf-agora, both locally and on Tower.

JessterB (Contributor) commented Nov 8, 2023

Your expectations line up really well with what I came up with. The additional regexes are a nice bonus too!

Here's a useful multifield unique for metabolomics:

# multifield uniques
  is_unique(associated_gene_name, metabolite_id),
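The underlying check is that no two rows share the same pair of values. The rows below are illustrative, not real metabolomics data.

```python
# Plain-Python sketch of the multifield unique check: no two rows should
# share the same (associated_gene_name, metabolite_id) pair.
rows = [
    {"associated_gene_name": "A2M", "metabolite_id": "met_1"},
    {"associated_gene_name": "A2M", "metabolite_id": "met_2"},
    {"associated_gene_name": "APOE", "metabolite_id": "met_1"},
]
pairs = [(r["associated_gene_name"], r["metabolite_id"]) for r in rows]
print(len(pairs) == len(set(pairs)))  # True: every pair is unique
```

In Great Expectations this maps naturally to the built-in expect_compound_columns_to_be_unique expectation with column_list=["associated_gene_name", "metabolite_id"].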

General thoughts/feedback (can be addressed later, as appropriate):

  1. It would be nice to have an example implementation for validating list items in list columns.
  • A somewhat artificial check could be to validate that each value in boxplot_group_names[] is one of the two valid values: "CN", "AD".

  • Another option would be to make a more useful check against the genes_biodomains file, but this gets a little complicated: genes_biodomains.gene_biodomains is a nested object that contains the biodomain[] list field. The string values in biodomains[] should always match one of the values in biodomain_info.name, which is a cross-file check.

  2. We are going to need the same set of 5 ensembl_gene_id rules for every dataset. Is there a better way to handle that kind of scenario than copy/pasting them all over?
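One way to avoid the copy/paste is a shared helper that each suite notebook calls. The expectation method names below are standard Great Expectations expectations, but the helper itself and the exact five rules are assumptions, not the actual adt implementation.

```python
# Hypothetical helper that applies the shared ensembl_gene_id rules to any
# validator; the specific five rules shown are illustrative.
def add_ensembl_gene_id_expectations(validator):
    validator.expect_column_values_to_not_be_null("ensembl_gene_id")
    validator.expect_column_values_to_be_of_type("ensembl_gene_id", "str")
    validator.expect_column_value_lengths_to_equal("ensembl_gene_id", 15)
    validator.expect_column_values_to_match_regex("ensembl_gene_id", r"^ENSG\d{11}$")
    validator.expect_column_values_to_be_unique("ensembl_gene_id")

# Tiny stand-in that records which expectations were applied, just to show
# the helper in action without a real GX validator:
class RecordingValidator:
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        return lambda *args, **kwargs: self.calls.append(name)

v = RecordingValidator()
add_ensembl_gene_id_expectations(v)
print(len(v.calls))  # 5
```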

BWMac (Contributor, Author) commented Nov 9, 2023

@JessterB Thanks for the feedback! I will add that multicolumn unique expectation.

Edit: The metabolomics.json file fails the multi-column/compound unique expectation for the associated_gene_name and metabolite_id columns with ~1.5% of the rows being non-unique.

As part of this work, I can add an expectation that checks the values within the lists. I can also add an expectation that checks for a particular data type to exist within the lists (for metabolomics this would probably be string, int, or list as the case may be). For nested fields and cross-file validation, I think those should be separate tasks to be tackled later because of their potential complexity.

For the ensembl_gene_id rules which will be reused, I can see what this can look like in the final implementation, but it is also possible that the better option is to be verbose and readable so that it is crystal clear what is being checked in each dataset. I will probably have more thoughts on this after the initial implementation is done.
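The deferred cross-file check Jess described can at least be sketched in plain Python: every string in a row's nested biodomains list should match a name from biodomain_info. The domain names and rows below are illustrative, not real data.

```python
# Plain-Python sketch of the cross-file check: values in the nested
# biodomains lists must all appear in biodomain_info.name.
biodomain_names = {"Immune Response", "Lipid Metabolism"}  # biodomain_info.name
gene_biodomains = [
    ["Immune Response"],
    ["Lipid Metabolism", "Immune Response"],
]
print(all(set(row) <= biodomain_names for row in gene_biodomains))  # True
```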

jaclynbeck-sage (Contributor) commented:
This looks amazing! To echo Jess, adding the regexes is a great idea.
I compared Jess's validation script to this one, and the only real difference I see is that Jess's also checks for is_unique(associated_gene_name).

However I see your note above that the check for unique(associated_gene_name, metabolite_id) fails, which means that there must be duplicate associated_gene_name values... @JessterB, has the metabolomics.json file we use for Agora been passing this validation rule with your code? I'm confused.

I agree that being able to check if values inside lists are numeric, strings, etc as expected would be really useful.

jaclynbeck-sage (Contributor) left a review comment:

This looks great! Amazing amount of work, and thank you for adding documentation and thorough testing!

BWMac marked this pull request as ready for review November 22, 2023 17:03
thomasyu888 (Member) left a review comment:

💯 LGTM! This is wonderful

BWMac merged commit d4d76fa into dev Nov 22, 2023
7 checks passed
BWMac deleted the bwmac/IBCDPE-688/gx-metabolomics branch November 22, 2023 21:11
6 participants