[IBCDPE-712] Implement Great Expectations for the `genes_biodomains` Dataset #103

BWMac · 2023-12-06T22:52:34Z

Problem:

We do not have Great Expectations (GX) data validation set up for the genes_biodomains dataset.

Solution:

Made the following changes to support GX data validation for genes_biodomains:

Created new custom expectations ExpectColumnValuesToHaveListLengthInRange and ExpectColumnValuesToHaveListOfDictWithExpectedValues.
Created an expectation suite, saved as a JSON file, via a new Jupyter notebook.
Created folders for genes_biodomains GX report uploads in Synapse.
Added gx_folder to the genes_biodomains dataset in both configuration files.
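
For readers unfamiliar with the first custom expectation, the per-value check behind a list-length-in-range rule can be sketched in plain Python as below. This is only an illustration and not the code in this PR: the function names, the inclusive bounds, the handling of non-list values, and the sample rows are all assumptions, and the real check lives in a GX custom expectation class.

```python
from typing import Any, List


def list_length_in_range(value: Any, min_len: int, max_len: int) -> bool:
    """Return True when `value` is a list whose length is within [min_len, max_len]."""
    # Assumption: non-list values (e.g. None) count as failures instead of raising.
    if not isinstance(value, list):
        return False
    return min_len <= len(value) <= max_len


def unexpected_rows(column: List[Any], min_len: int, max_len: int) -> List[int]:
    """Return the indices of rows that violate the length rule."""
    return [
        i for i, value in enumerate(column)
        if not list_length_in_range(value, min_len, max_len)
    ]


# Hypothetical column values; real rows would hold lists of biodomain records.
sample_column = [["a"], [], ["a", "b", "c"], None]
failing = unexpected_rows(sample_column, min_len=1, max_len=19)
```

Here the empty list and the `None` row are the ones reported as unexpected.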

Please let me know if there are any expectations that I should add to this suite, or if there are any modifications to the expectations I chose that would make the validation more robust.

Note:

The custom ExpectColumnValuesToHaveListOfDictWithExpectedValues expectation was our first shot at validating nested data. If we are not careful we could end up with many narrow expectations which may not generalize well to other contexts (like this one). We may want to take some time to think about whether we should be using GX for nested data validation at all, and if we choose to do so, we should try to come up with some expectations that can generalize better.

src/agoradatatools/great_expectations/gx/checkpoints/agora-test-checkpoint.yml

gx_suite_definitions/genes_biodomains.ipynb

jaclynbeck-sage

I just have a small nitpick and a question. Otherwise this looks good.

I do feel like expect_column_values_to_have_list_of_dict_with_expected_values might be a little over-specific to this dataset, and it isn't scalable to large lists of possible values since they have to be hand-entered. For example, some of the nested data in the gene_info dataset has too many options. We also couldn't validate the go_terms field of the genes_biodomains dataset: the possible values come from a fixed list, but there are just too many of them.

This is usable for the isprimaryinvestigator field (which is true/false) in the nested team_info data, and potentially some of the nested fields in gene_info that have a small fixed number of possible values. So I'm not sure what to say on whether we should keep it as-is, refine/generalize it, or remove it.

BWMac · 2023-12-07T23:22:50Z

@jaclynbeck-sage Yeah I agree about it potentially being too specific. I think this is sort of beyond the limit of what GX is meant to do. We could potentially design expectations that have more logic to make them more generalized, but then I think that raises the question of whether we should do that (validate nested fields) with GX at all.

JessterB · 2023-12-08T19:43:55Z

I also agree that expect_column_values_to_have_list_of_dict_with_expected_values is not a generally useful expectation. I'm not sure if the fact that it's a nested field is the problem though. For example, if we wanted to validate that e.g. the nested n_biodomain_terms is numeric, the expectation would be more generally useful, as we could validate that other nested fields are numeric as well.
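
As a rough sketch of what a more general "nested field is numeric" check might look like, independent of GX: the helper name, the decision to reject booleans and missing keys, and the `biodomain` key in the sample rows are assumptions made for illustration; only `n_biodomain_terms` comes from this thread.

```python
from numbers import Number
from typing import Any, Dict, List


def nested_key_is_numeric(records: List[Dict[str, Any]], key: str) -> bool:
    """Return True when every dict in `records` holds a numeric value under `key`."""
    for record in records:
        value = record.get(key)
        # bool is a subclass of int, so it is excluded explicitly (an assumption).
        # A missing key yields None, which also fails the check.
        if isinstance(value, bool) or not isinstance(value, Number):
            return False
    return True


# Hypothetical cell values for one row of a nested column.
good_row = [{"biodomain": "X", "n_biodomain_terms": 12}]
bad_row = [{"biodomain": "Y", "n_biodomain_terms": "12"}]

good_result = nested_key_is_numeric(good_row, "n_biodomain_terms")
bad_result = nested_key_is_numeric(bad_row, "n_biodomain_terms")
```

Because the key is a parameter, the same helper could be pointed at any other nested numeric field.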

We do need a way to validate nested objects somehow, as a lot of our data is nested. One strategy to generalize this particular expectation could be to allow the list of expected values to be passed in per test, if there is a way to do that. If not, we may have to live with having some of these narrow expectations.
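
One way to picture "expected values passed in per test" is a check that takes both the nested key and the allowed values as arguments. The sketch below is plain Python and makes no claim about how this would be wired into GX; the function name is hypothetical, and the sample data only assumes that `isprimaryinvestigator` is a key inside each nested `team_info` record.

```python
from typing import Any, Dict, Iterable, List


def nested_values_in_allowed(
    records: List[Dict[str, Any]], key: str, allowed: Iterable[Any]
) -> bool:
    """Return True when every dict in `records` has an allowed value under `key`."""
    # A list is used instead of a set so that unhashable values do not raise.
    allowed_values = list(allowed)
    return all(record.get(key) in allowed_values for record in records)


# Hypothetical nested cell: the key and allowed values are supplied by the caller,
# so nothing dataset-specific is baked into the check itself.
team_info_row = [
    {"isprimaryinvestigator": True},
    {"isprimaryinvestigator": False},
]
ok = nested_values_in_allowed(team_info_row, "isprimaryinvestigator", [True, False])
bad = nested_values_in_allowed(
    [{"isprimaryinvestigator": "yes"}], "isprimaryinvestigator", [True, False]
)
```

This still requires the allowed values to be listed by hand, so it does not address fields such as go_terms where the list is very large.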

Another, more global strategy might be to extract nested objects before passing them to GX, so that we can apply generic expectations to them one by one rather than trying to validate the entire nested structure with a single expectation.
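
The extraction idea can be sketched as a flattening step that turns each nested record into its own flat row, so that ordinary column-level checks can then be applied. This is an illustration under assumptions: the helper name, the `ensembl_gene_id` and `biodomain` keys, and the sample rows are hypothetical, and a real implementation would more likely operate on a DataFrame.

```python
from typing import Any, Dict, List


def flatten_nested(
    rows: List[Dict[str, Any]], nested_field: str, parent_key: str
) -> List[Dict[str, Any]]:
    """Turn each nested dict into a flat row that carries the parent identifier."""
    flat: List[Dict[str, Any]] = []
    for row in rows:
        # Treat a missing or None nested field as an empty list.
        for item in row.get(nested_field) or []:
            flat_row = {parent_key: row.get(parent_key)}
            flat_row.update(item)
            flat.append(flat_row)
    return flat


# Hypothetical input rows with a nested list-of-dicts column.
rows = [
    {
        "ensembl_gene_id": "ENSG0001",
        "gene_biodomains": [
            {"biodomain": "A", "n_biodomain_terms": 3},
            {"biodomain": "B", "n_biodomain_terms": 7},
        ],
    },
    {"ensembl_gene_id": "ENSG0002", "gene_biodomains": []},
]
flat_rows = flatten_nested(rows, "gene_biodomains", "ensembl_gene_id")
```

Each entry of `flat_rows` is then an ordinary record, so generic expectations (numeric type, allowed values, non-null) could be applied to `n_biodomain_terms` or `biodomain` as regular columns.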

JessterB

One nitpick on the gene_biodomains list length expectation. Otherwise, this covers all of the validation I'd expect for the unnested data in this dataset.

src/agoradatatools/great_expectations/gx/expectations/genes_biodomains.json

thomasyu888

👍🏼 LGTM! Very straightforward with the tiny exception of the nested data validation.

jaclynbeck-sage

LGTM!

BWMac added 9 commits December 6, 2023 14:22

adds list length less than/greater than expectations

d8fbcc3

create in range expectation instead

ae63679

adds bidomain checking expectation

2ebc268

adds genes_biodomains expectation suite

4abff76

adds gx_folders to genes_biodomains configs

5264eca

adds new custom expectations to gx.py

0a49691

checkpoint created after good run

986c6b0

changing type hints for 3.8 compatibility

d48d53f

clear outputs from notebook

85f02b2

BWMac marked this pull request as ready for review December 7, 2023 17:34

BWMac requested review from thomasyu888, JessterB and jaclynbeck-sage December 7, 2023 17:35

jaclynbeck-sage reviewed Dec 7, 2023

src/agoradatatools/great_expectations/gx/checkpoints/agora-test-checkpoint.yml (outdated, resolved)

jaclynbeck-sage reviewed Dec 7, 2023

gx_suite_definitions/genes_biodomains.ipynb (outdated, resolved)

jaclynbeck-sage requested changes Dec 7, 2023

BWMac added 2 commits December 7, 2023 16:19

fix title in notebook

d78f638

remove duplicate file

e9584f9

JessterB requested changes Dec 8, 2023

src/agoradatatools/great_expectations/gx/expectations/genes_biodomains.json (resolved)

BWMac added 2 commits December 8, 2023 12:53

updates range to 1,19

ff6c6e0

updates notebook

c983e37

JessterB approved these changes Dec 8, 2023

BWMac requested a review from jaclynbeck-sage December 11, 2023 15:18

thomasyu888 approved these changes Dec 13, 2023

jaclynbeck-sage approved these changes Dec 13, 2023

BWMac merged commit 91e65ab into dev Dec 13, 2023
7 checks passed

BWMac deleted the bwmac/IBCDPE-712/gx_genes_biodomains branch December 13, 2023 17:59
