
[IBCDPE-688] Great Expectations Implementation for Metabolomics Data #96

Merged · 74 commits · Nov 22, 2023

Conversation

BWMac (Contributor) commented Nov 8, 2023

Request for Initial Feedback (EDIT: This description is now out-of-date):

I started by looking through the data validation spreadsheet and the example metabolomics.json data from Synapse and applying the expectations that made sense for each column. I have also set up our great_expectations environment in this repo and created the custom expectation expect_column_values_to_have_list_length to evaluate the length of lists in columns containing `list` objects.

I have created an example script, gx_metabolomics.py, which includes all of the expectations I have tested against the dataset, along with the example data itself, so that others can run the script to test the expectations.

The final version of this work will look very different. Expectation runs will be integrated seamlessly with data processing. The example dataset will not be present; instead, the output of data processing will be run through the Great Expectations tests immediately after processing.

With all that being said, I would appreciate feedback on the following:

  1. Are the expectations I have selected for each column appropriate?
  2. Are there any expectations missing?
    • I am particularly interested in potential multi-column expectations as I was not sure about this in
      particular. For example, are there any sets of two columns whose values should be unique from
      other rows when taken in pairs?

In the meantime, I will continue to work on the overall implementation.

Note: If you want to run the example script yourself, you will need to check out this branch of the repository, run pipenv install to get great_expectations installed, and then run python gx_metabolomics.py.

EDIT:

Final Description of Work:

Problem:

We currently have no way to ensure that the data produced by agora-data-tools is valid and meets our "expectations" of what the data should be.

Solution:

Implement great_expectations (GX) for post-processing data validation which runs for each supported dataset when you run the adt <config_file> command.

I played around with a few different ways to implement GX. What I eventually settled on is:

  • We will use a file system-based data context for our GX implementation.
    • This allows us to have version control on all GX configuration files and expectation suites
      (everything in the src/agoradatatools/great_expectations directory within this repo).
  • Expectation suites will be created by running Jupyter Notebooks which are stored in the gx_suite_definitions directory.
    • I chose to use Jupyter Notebooks over a plain .py script so that trying out new expectations is easier.
    • save_expectation_suite stores the previously executed expectations as a suite in a JSON file, allowing a much
      simpler process of executing all of the expectations for a dataset later.
    • Editing an expectation suite is as simple as updating the Notebook and re-running it.
    • metabolomics.ipynb can be used as a template for future expectation suite creation.
  • In order to perform the data validation within a data processing run, I created the GreatExpectationsRunner class which is instantiated within process.process_dataset whenever a dataset has a GX report upload folder (gx_folder) specified in its configuration.
    • The class is instantiated with objects that are already available within the scope of the
      process.process_dataset function.
    • It checks if the requested expectation suite exists before attempting the data validation.
    • It logs its progress as it:
      • Checks if the suite exists
      • Performs data validation on a given dataset
      • Uploads the resulting HTML report to Synapse
  • I created separate locations in Synapse for data validation performed during a run with test_config.yaml vs. config.yaml. Subfolders for each dataset will be created in these locations as we add more expectation suites.
    - Synapse folder for test data validation reports.
    - Synapse folder for production data validation reports.
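The runner described above can be sketched in plain Python. Only the class name GreatExpectationsRunner and the gx_folder configuration key come from this PR; the method names, the suite registry, and the Synapse folder id below are illustrative placeholders, not the actual adt implementation.

```python
# Hypothetical sketch of the control flow: check that a suite exists,
# validate, and upload the report. All internals here are illustrative.
class GreatExpectationsRunner:
    def __init__(self, dataset_name, gx_folder):
        self.dataset_name = dataset_name
        self.gx_folder = gx_folder  # Synapse folder id for the HTML report

    def _suite_exists(self):
        # The real implementation looks for the expectation suite JSON
        # saved by the gx_suite_definitions notebooks.
        return self.dataset_name in {"metabolomics"}

    def run(self):
        if not self._suite_exists():
            print(f"no expectation suite for {self.dataset_name}, skipping")
            return False
        print(f"validating {self.dataset_name}")
        # ... run the suite, render the HTML report, upload to gx_folder ...
        return True

# process.process_dataset would only construct a runner when the dataset's
# configuration specifies a gx_folder
runner = GreatExpectationsRunner("metabolomics", gx_folder="syn-placeholder")
print(runner.run())  # True
```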

Note: The multi-column unique expectation discussed below is not currently implemented because it was not passing on the example metabolomics dataset.

Custom Expectations:

During this work, I also created three custom expectations for us to use on our data which contain lists.

  • ExpectColumnValuesToHaveListLength: checks to see if the lists in a particular column are the length that we expect.
  • ExpectColumnValuesToHaveListMembers: checks to see if the lists in a particular column contain only values that we expect.
  • ExpectColumnValuesToHaveListMembersOfType: checks to see if the lists in a particular column contain members of the type we expect.

Note: None of these custom expectations go beyond the first "layer" of nested data, i.e., you cannot currently test the length of a list within a list.
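The checks the three custom expectations perform can be sketched in plain Python. The real versions are implemented as Great Expectations custom expectations; these helpers and the column data are illustrative only.

```python
# Plain-Python sketches of the three custom list expectations.
def have_list_length(values, length):
    """ExpectColumnValuesToHaveListLength: every list has the given length."""
    return all(isinstance(v, list) and len(v) == length for v in values)

def have_list_members(values, allowed):
    """ExpectColumnValuesToHaveListMembers: lists contain only allowed values."""
    return all(set(v) <= set(allowed) for v in values)

def have_list_members_of_type(values, member_type):
    """ExpectColumnValuesToHaveListMembersOfType: members have the given type."""
    return all(isinstance(m, member_type) for v in values for m in v)

boxplot_group_names = [["CN", "AD"], ["AD", "CN"]]
print(have_list_length(boxplot_group_names, 2))              # True
print(have_list_members(boxplot_group_names, {"CN", "AD"}))  # True
print(have_list_members_of_type(boxplot_group_names, str))   # True
```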

These changes have been successfully tested with the Nextflow workflow nf-agora, both locally and on Tower.

JessterB (Contributor) commented Nov 8, 2023

Your expectations line up really well with what I came up with. The additional regexes are a nice bonus too!

Here's a useful multifield unique for metabolomics:

# multifield uniques
  is_unique(associated_gene_name, metabolite_id),
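The underlying check is that no two rows share the same pair of values. The rows below are illustrative, not real metabolomics data.

```python
# Plain-Python sketch of the multifield unique check: no two rows should
# share the same (associated_gene_name, metabolite_id) pair.
rows = [
    {"associated_gene_name": "A2M", "metabolite_id": "met_1"},
    {"associated_gene_name": "A2M", "metabolite_id": "met_2"},
    {"associated_gene_name": "APOE", "metabolite_id": "met_1"},
]
pairs = [(r["associated_gene_name"], r["metabolite_id"]) for r in rows]
print(len(pairs) == len(set(pairs)))  # True: every pair is unique
```

In Great Expectations this maps naturally to the built-in expect_compound_columns_to_be_unique expectation with column_list=["associated_gene_name", "metabolite_id"].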

General thoughts/feedback (can be addressed later, as appropriate):

  1. It would be nice to have an example implementation for validating list items in list columns.
  • A somewhat artificial check could be to validate that each value in boxplot_group_names[] is one of the two valid values: "CN", "AD".

  • Another option would be to make a more useful check against the genes_biodomains file, but this gets a little complicated: genes_biodomains.gene_biodomains is a nested object that contains the biodomain[] list field. The string values in biodomains[] should always match one of the values in biodomain_info.name, which is a cross-file check.

  2. We are going to need the same set of 5 ensembl_gene_id rules for every dataset. Is there a better way to handle that kind of scenario than copy/pasting them all over?
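One way to avoid the copy/paste is a shared helper that each suite notebook calls. The expectation method names below are standard Great Expectations expectations, but the helper itself and the exact five rules are assumptions, not the actual adt implementation.

```python
# Hypothetical helper that applies the shared ensembl_gene_id rules to any
# validator; the specific five rules shown are illustrative.
def add_ensembl_gene_id_expectations(validator):
    validator.expect_column_values_to_not_be_null("ensembl_gene_id")
    validator.expect_column_values_to_be_of_type("ensembl_gene_id", "str")
    validator.expect_column_value_lengths_to_equal("ensembl_gene_id", 15)
    validator.expect_column_values_to_match_regex("ensembl_gene_id", r"^ENSG\d{11}$")
    validator.expect_column_values_to_be_unique("ensembl_gene_id")

# Tiny stand-in that records which expectations were applied, just to show
# the helper in action without a real GX validator:
class RecordingValidator:
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        return lambda *args, **kwargs: self.calls.append(name)

v = RecordingValidator()
add_ensembl_gene_id_expectations(v)
print(len(v.calls))  # 5
```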

BWMac (Contributor, Author) commented Nov 9, 2023

@JessterB Thanks for the feedback! I will add that multicolumn unique expectation.

Edit: The metabolomics.json file fails the multi-column/compound unique expectation for the associated_gene_name and metabolite_id columns with ~1.5% of the rows being non-unique.

As part of this work, I can add an expectation that checks the values within the lists. I can also add an expectation that checks for a particular data type to exist within the lists (for metabolomics this would probably be string, int, or list as the case may be). For nested fields and cross-file validation, I think those should be separate tasks to be tackled later because of their potential complexity.

For the ensembl_gene_id rules which will be reused, I can see what this can look like in the final implementation, but it is also possible that the better option is to be verbose and readable so that it is crystal clear what is being checked in each dataset. I will probably have more thoughts on this after the initial implementation is done.
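The deferred cross-file check Jess described can at least be sketched in plain Python: every string in a row's nested biodomains list should match a name from biodomain_info. The domain names and rows below are illustrative, not real data.

```python
# Plain-Python sketch of the cross-file check: values in the nested
# biodomains lists must all appear in biodomain_info.name.
biodomain_names = {"Immune Response", "Lipid Metabolism"}  # biodomain_info.name
gene_biodomains = [
    ["Immune Response"],
    ["Lipid Metabolism", "Immune Response"],
]
print(all(set(row) <= biodomain_names for row in gene_biodomains))  # True
```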

jaclynbeck-sage (Contributor) commented:
This looks amazing! To echo Jess, adding the regexes is a great idea.
I compared Jess's validation script to this one, and the only real difference I see is that Jess's also checks for is_unique(associated_gene_name).

However I see your note above that the check for unique(associated_gene_name, metabolite_id) fails, which means that there must be duplicate associated_gene_name values... @JessterB, has the metabolomics.json file we use for Agora been passing this validation rule with your code? I'm confused.

I agree that being able to check if values inside lists are numeric, strings, etc as expected would be really useful.

jaclynbeck-sage (Contributor) left a review comment:

This looks great! Amazing amount of work, and thank you for adding documentation and thorough testing!

BWMac marked this pull request as ready for review November 22, 2023 17:03
thomasyu888 (Member) left a review comment:

💯 LGTM! This is wonderful

BWMac merged commit d4d76fa into dev Nov 22, 2023
7 checks passed
BWMac deleted the bwmac/IBCDPE-688/gx-metabolomics branch November 22, 2023 21:11
6 participants