Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IBCDPE-712] Implement Great Expectations for the genes_biodomains Datset #103

Merged
merged 13 commits into from
Dec 13, 2023

Conversation

BWMac
Copy link
Contributor

@BWMac BWMac commented Dec 6, 2023

Problem:

We do not have Great Expectations (GX) data validation set up for the genes_biodomains dataset.

Solution:

Made the following changes to support GX data validation for genes_biodomains:

  • Created new custom expectations ExpectColumnValuesToHaveListLengthInRange and ExpectColumnValuesToHaveListOfDictWithExpectedValues.
  • Created expectation suite saved as JSON file via a new Jupyter Notebook.
  • Created folders for genes_biodomains GX report uploads in Synapse.
  • Added gx_folder to the genes_biodomains dataset in both configuration files.

Please let me know if there are any expectations that I should add to this suite, or if there are any modifications to the expectations I chose that would make the validation more robust.

Note:

The custom ExpectColumnValuesToHaveListOfDictWithExpectedValues expectation was our first shot at validating nested data. If we are not careful we could end up with many narrow expectations which may not generalize well to other contexts (like this one). We may want to take some time to think about whether we should be using GX for nested data validation at all, and if we choose to do so, we should try to come up with some expectations that can generalize better.

@BWMac BWMac marked this pull request as ready for review December 7, 2023 17:34
Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a small nitpick and a question. Otherwise this looks good.

I do feel like expect_column_values_to_have_list_of_dict_with_expected_values might be a little over-specific to this dataset and isn't scalable to large lists of possible values since it has to be hand-entered. For example some of the nested data in the gene_info dataset has too many options. We also couldn't validate the go_terms field of the genes_biodomains dataset even though the possible values are from a fixed list, there are just too many.

This is usable for the isprimaryinvestigator field (which is true/false) in the nested team_info data, and potentially some of the nested fields in gene_info that have a small fixed number of possible values. So I'm not sure what to say on whether we should keep it as-is, refine/generalize it, or remove it.

@BWMac
Copy link
Contributor Author

BWMac commented Dec 7, 2023

@jaclynbeck-sage Yeah I agree about it potentially being too specific. I think this is sort of beyond the limit of what GX is meant to do. We could potentially design expectations that have more logic to make them more generalized, but then I think that raises the question of whether we should do that (validate nested fields) with GX at all.

@JessterB
Copy link
Contributor

JessterB commented Dec 8, 2023

I also agree that expect_column_values_to_have_list_of_dict_with_expected_values is not a generally useful expectation. I'm not sure if the fact that it's a nested field is the problem though. For example, if we wanted to validate that e.g. the nested n_biodomain_terms is numeric, the expectation would be more generally useful, as we could validate that other nested fields are numeric as well.

We do need a way to validate nested objects somehow, as a lot of our data is nested. One strategy to generalize this particular expectation could be to allow the list of expected values to be passed in per test, if there is a way to do that. If not, we may have to live with having some of these narrow expectations.

Another more global strategy might be to extract nested objects prior to passing them to GE so that we can apply generic expectations to them one by one, rather than trying to validate the entire nested structure with a single expectation.

Copy link
Contributor

@JessterB JessterB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nitpick on the gene_biodomains list length expectation. Otherwise, this covers all of the validation I'd expect for the unnested data in this dataset.

Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼 LGTM! Very straightforward with the tiny exception of the nested data validation.

Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@BWMac BWMac merged commit 91e65ab into dev Dec 13, 2023
7 checks passed
@BWMac BWMac deleted the bwmac/IBCDPE-712/gx_genes_biodomains branch December 13, 2023 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants