-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v4 frequency script #376
v4 frequency script #376
Conversation
I'm realizing the non-ukb subset needs to have its own grpmax and age hists and this code doesnt currently account for that. Since we are already splitting by ukb, this should be easy enough to add in but I wont until we get through one round of review. |
Co-authored-by: Mike Wilson <[email protected]>
Co-authored-by: Mike Wilson <[email protected]>
…stions UKB downsampling frequency PR suggestions
Co-authored-by: jkgoodrich <[email protected]>
non_ukb downsamplings for v4 freq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small change, and pinging KC to look at a few lines
faf_index_dict=[ | ||
make_faf_index_dict(hl.eval(x), label_delimiter="-") for x in faf_meta_expr | ||
], | ||
grp_max_meta=[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't follow all the discussions on this so pinging @ch-kr to check this (L778-L782).
Co-authored-by: jkgoodrich <[email protected]>
grpmax_meta=[ | ||
{"dataset": "gnomad"}, | ||
{"dataset": "non_ukb"}, | ||
], # TODO: These seem silly but keeps with the meta/dict theme of globals | ||
grpmax_index_dict=SUBSET_DICT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I committed a suggestion on naming and it looks like it removed @jkgoodrich 's other comment. @ch-kr do you have thoughts on the grpmax_meta and index_dict? L778-782
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm. what about removing these globals and restructuring grpmax to have the same structure as the age hists? something like
grpmax: struct{
gnomad: struct{
AC: int32,
AF: float64,
AN: int32,
homozygote_count: int64,
pop: str,
faf95: float64,
},
non-ukb: struct{
AC: int32,
AF: float64,
AN: int32,
homozygote_count: int64,
pop: str,
faf95: float64,
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with that but just confirming we're dropping age hists so it actually wont look like the above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yes sorry that wasn't clear. we're dropping the non-UKB age hists, so that annotation looks like this now, right?
age_hists: struct{
age_hist_hom: struct {
bin_edges: array<float64>,
bin_freq: array<int64>,
n_smaller: int64,
n_larger: int64
},
age_hist_het: struct {
bin_edges: array<float64>,
bin_freq: array<int64>,
n_smaller: int64,
n_larger: int64
},
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved pending KC's review
KC approved the structure and schema output via slack, will merge once we resolve the pylint check:#416 |
Here is the full frequency script. I've run it on the test dataset using the
--use-test-dataset
arg. The script currently puts out all of the annotations from the split VDSs and any annotation used in correcting for the high AB hets. I plan on removing these once this is ready and have a note before the final write on which fields I plan on removing for now. I also realized this does not account for the downsamplings within the ukb subset which Konrad had asked for so I will work on adding that but figured this is huge so should not hold up review with it. As I mentioned in one of our chats, I copied Tim's freq updates and then updated them here to account for the upstream downsampling annotation. Post v4, I will rewrite the gnomad_methods annotate_freq method. This depends on the code added in broadinstitute/gnomad_methods#565 so will fail checks until that is in.