-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[IBCDPE-712] Implement Great Expectations for the genes_biodomains
Datset
#103
Conversation
src/agoradatatools/great_expectations/gx/checkpoints/agora-test-checkpoint.yml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just have a small nitpick and a question. Otherwise this looks good.
I do feel like expect_column_values_to_have_list_of_dict_with_expected_values
might be a little over-specific to this dataset and isn't scalable to large lists of possible values since it has to be hand-entered. For example some of the nested data in the gene_info
dataset has too many options. We also couldn't validate the go_terms
field of the genes_biodomains
dataset even though the possible values are from a fixed list, there are just too many.
This is usable for the isprimaryinvestigator
field (which is true/false) in the nested team_info
data, and potentially some of the nested fields in gene_info
that have a small fixed number of possible values. So I'm not sure what to say on whether we should keep it as-is, refine/generalize it, or remove it.
@jaclynbeck-sage Yeah I agree about it potentially being too specific. I think this is sort of beyond the limit of what GX is meant to do. We could potentially design expectations that have more logic to make them more generalized, but then I think that raises the question of whether we should do that (validate nested fields) with GX at all. |
I also agree that We do need a way to validate nested objects somehow, as a lot of our data is nested. One strategy to generalize this particular expectation could be to allow the list of expected values to be passed in per test, if there is a way to do that. If not, we may have to live with having some of these narrow expectations. Another more global strategy might be to extract nested objects prior to passing them to GE so that we can apply generic expectations to them one by one, rather than trying to validate the entire nested structure with a single expectation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One nitpick on the gene_biodomains list length expectation. Otherwise, this covers all of the validation I'd expect for the unnested data in this dataset.
src/agoradatatools/great_expectations/gx/expectations/genes_biodomains.json
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏼 LGTM! Very straightforward with the tiny exception of the nested data validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Problem:
We do not have Great Expectations (GX) data validation set up for the
genes_biodomains
dataset.Solution:
Made the following changes to support GX data validation for
genes_biodomains
:ExpectColumnValuesToHaveListLengthInRange
andExpectColumnValuesToHaveListOfDictWithExpectedValues
.genes_biodomains
GX report uploads in Synapse.gx_folder
to thegenes_biodomains
dataset in both configuration files.Please let me know if there are any expectations that I should add to this suite, or if there are any modifications to the expectations I chose that would make the validation more robust.
Note:
The custom
ExpectColumnValuesToHaveListOfDictWithExpectedValues
expectation was our first shot at validating nested data. If we are not careful we could end up with many narrow expectations which may not generalize well to other contexts (like this one). We may want to take some time to think about whether we should be using GX for nested data validation at all, and if we choose to do so, we should try to come up with some expectations that can generalize better.