v4 frequency script #376

mike-w-wilson · 2023-07-20T17:07:19Z

Here is the full frequency script. I've run it on the test dataset using the --use-test-dataset arg. The script currently outputs all of the annotations from the split VDSs, plus every annotation used in correcting for the high AB hets. I plan to remove these once this is ready, and I've left a note before the final write listing the fields I plan to remove for now. I also realized this does not account for the downsamplings within the ukb subset, which Konrad had asked for. I will work on adding that, but this PR is already huge, so it should not hold up review. As I mentioned in one of our chats, I copied Tim's freq updates and then updated them here to account for the upstream downsampling annotation. Post v4, I will rewrite the gnomad_methods annotate_freq method. This depends on the code added in broadinstitute/gnomad_methods#565, so it will fail checks until that is in.

gnomad_qc/v4/annotations/generate_freq.py

mike-w-wilson · 2023-07-31T18:52:51Z

I'm realizing the non-ukb subset needs its own grpmax and age hists, and this code doesn't currently account for that. Since we are already splitting by ukb, this should be easy enough to add, but I won't until we get through one round of review.

jkgoodrich

just a small change, and pinging KC to look at a few lines

gnomad_qc/v4/annotations/generate_freq.py

jkgoodrich · 2023-09-07T19:42:51Z

gnomad_qc/v4/annotations/generate_freq.py

        faf_index_dict=[
            make_faf_index_dict(hl.eval(x), label_delimiter="-") for x in faf_meta_expr
        ],
+        grp_max_meta=[


I didn't follow all the discussions on this so pinging @ch-kr to check this (L778-L782).

mike-w-wilson · 2023-09-07T19:51:35Z

gnomad_qc/v4/annotations/generate_freq.py

+        grpmax_meta=[
+            {"dataset": "gnomad"},
+            {"dataset": "non_ukb"},
+        ],  # TODO: These seem silly but keeps with the meta/dict theme of globals
+        grpmax_index_dict=SUBSET_DICT,


Sorry, I committed a suggestion on naming and it looks like it removed @jkgoodrich's other comment. @ch-kr, do you have thoughts on the grpmax_meta and grpmax_index_dict globals (L778-L782)?
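For readers unfamiliar with the meta/index-dict pattern in question, here is a minimal plain-Python sketch of how a list of meta dicts can be turned into a label-to-position lookup. The helper `make_index_dict` and the `grpmax` values are hypothetical and only illustrate the idea; this is not the gnomad_methods implementation, and the real annotations are Hail expressions rather than Python lists.

```python
from typing import Dict, List


def make_index_dict(meta: List[Dict[str, str]], delimiter: str = "-") -> Dict[str, int]:
    """Map a label built from each meta entry's values to its array position.

    Hypothetical stand-in for the index-dict helpers; not the library code.
    """
    return {delimiter.join(entry.values()): i for i, entry in enumerate(meta)}


# The meta list from the diff above: one entry per grpmax array element.
grpmax_meta = [{"dataset": "gnomad"}, {"dataset": "non_ukb"}]
grpmax_index_dict = make_index_dict(grpmax_meta)

# Made-up grpmax array, ordered to match grpmax_meta.
grpmax = [{"AC": 12, "pop": "afr"}, {"AC": 7, "pop": "nfe"}]

# Entries are looked up by label rather than by a hard-coded position.
non_ukb_grpmax = grpmax[grpmax_index_dict["non_ukb"]]
```

With only two single-key entries the lookup adds little over indexing directly, which is roughly the concern the TODO in the diff raises.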

hmmm. what about removing these globals and restructuring grpmax to have the same structure as the age hists? something like

grpmax: struct {
    gnomad: struct {
        AC: int32,
        AF: float64,
        AN: int32,
        homozygote_count: int64,
        pop: str,
        faf95: float64,
    },
    non-ukb: struct {
        AC: int32,
        AF: float64,
        AN: int32,
        homozygote_count: int64,
        pop: str,
        faf95: float64,
    }
}
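As a rough illustration of the nested layout proposed above, the sketch below builds one grpmax entry per subset in plain Python. It assumes grpmax is simply the group with the highest AF among groups with AC > 0; the function name, the input shape, and the numbers are hypothetical, faf95 is left out because its calculation is not described in this thread, and the subset key is spelled `non_ukb` to match the diff.

```python
from typing import Dict, Optional

Freq = Dict[str, float]


def grpmax_for_subset(freq_by_group: Dict[str, Freq]) -> Optional[dict]:
    """Return the frequency record of the group with the highest AF.

    Assumption: groups with AC == 0 are ignored, and None is returned
    when no group carries the allele.
    """
    carriers = {g: f for g, f in freq_by_group.items() if f["AC"] > 0}
    if not carriers:
        return None
    pop = max(carriers, key=lambda g: carriers[g]["AF"])
    return {**carriers[pop], "pop": pop}


# Made-up per-group frequencies for a single variant in each subset.
freq = {
    "gnomad": {
        "afr": {"AC": 3, "AN": 1000, "AF": 3 / 1000, "homozygote_count": 0},
        "nfe": {"AC": 20, "AN": 4000, "AF": 20 / 4000, "homozygote_count": 1},
    },
    "non_ukb": {
        "afr": {"AC": 3, "AN": 900, "AF": 3 / 900, "homozygote_count": 0},
        "nfe": {"AC": 0, "AN": 500, "AF": 0.0, "homozygote_count": 0},
    },
}

# One struct keyed by subset, so no separate meta or index-dict globals are needed.
grpmax = {subset: grpmax_for_subset(groups) for subset, groups in freq.items()}
```

Keying by subset name makes the annotation self-describing, which is the trade-off against the array-plus-globals layout.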

I'm fine with that, but just confirming we're dropping age hists, so it actually won't look like the above?

oh yes sorry that wasn't clear. we're dropping the non-UKB age hists, so that annotation looks like this now, right?

age_hists: struct {
    age_hist_hom: struct {
        bin_edges: array<float64>,
        bin_freq: array<int64>,
        n_smaller: int64,
        n_larger: int64
    },
    age_hist_het: struct {
        bin_edges: array<float64>,
        bin_freq: array<int64>,
        n_smaller: int64,
        n_larger: int64
    },
}
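To make the four histogram fields concrete, here is a small plain-Python sketch that fills them from a list of ages. The bin range (30 to 80 in 10 bins), the inclusive upper edge, and the sample ages are assumptions for illustration; the script itself builds these with Hail aggregations, which this does not reproduce.

```python
from typing import Dict, List


def age_hist(ages: List[float], start: float = 30.0, end: float = 80.0, bins: int = 10) -> Dict:
    """Bin ages into equal-width bins, counting out-of-range values separately."""
    width = (end - start) / bins
    bin_edges = [start + i * width for i in range(bins + 1)]
    bin_freq = [0] * bins
    n_smaller = 0
    n_larger = 0
    for age in ages:
        if age < start:
            n_smaller += 1
        elif age > end:
            n_larger += 1
        else:
            # Assumption: the last bin includes the upper edge.
            idx = min(int((age - start) / width), bins - 1)
            bin_freq[idx] += 1
    return {
        "bin_edges": bin_edges,
        "bin_freq": bin_freq,
        "n_smaller": n_smaller,
        "n_larger": n_larger,
    }


# Made-up carrier ages, split by genotype as in the schema above.
age_hists = {
    "age_hist_het": age_hist([25, 30, 34.9, 35, 80, 81]),
    "age_hist_hom": age_hist([42, 57.5]),
}
```

Ages below the first edge or above the last edge are tallied in n_smaller and n_larger rather than being forced into the outer bins.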

jkgoodrich

Approved pending KC's review

mike-w-wilson · 2023-09-08T14:09:21Z

KC approved the structure and schema output via Slack. Will merge once we resolve the pylint check: #416

mike-w-wilson added 14 commits June 2, 2023 14:58

Fresh start

6fd410a

Clean main

26df8ea

Add split functionality and clean up main with smaller functions

e378797

popmax -> grpmax

f9ec345

high ab fix

9c5df1d

Merge all histograms

ea9da7a

Update use of test datasets

c40bc51

add test and adjustment to freq resource

7107b71

Change param name in get_freq

36e2d63

Sort order fix in freq_dict

0f6377e

Add final schema describe

fbb7d47

Add proper freq dict creation

b998298

Missing fields and notes

78c2f14

Remove test directory in freq resource

1c3288c

mike-w-wilson added v4 Frequency labels Jul 20, 2023

mike-w-wilson requested a review from jkgoodrich July 20, 2023 17:07

mike-w-wilson assigned mike-w-wilson and jkgoodrich Jul 20, 2023

mike-w-wilson commented Jul 31, 2023

jkgoodrich added 9 commits August 1, 2023 00:11

Generalize frequency annotation function

6906bc9

Small changes to histogram creation

9ce780d

Changes needed during testing

3b26e32

Remove TODOs that I added to PR comments

10787b2

Need to add globals to freq mt

0b3a531

Remaining suggested changes to run_freq_and_dense_annotations

79ba8b3

add eval for freq_meta global annotation

c29f50b

Remove/change comments and TODOs, add possible option for adj before dense

2c6467f

Small fix for _het_ad

9a61cc0

mike-w-wilson and others added 19 commits August 30, 2023 15:28

Add grpmax and faf

76bce57

Add suggestions to UKB downsampling PR

0ebf517

high_ab_hets -> high_ab_hets_by_group

1cbea60

rearrange functions for easier review

dea9c52

rearrange functions for easier review 2

a4d4390

rearrange functions for easier review 3

fd25bde

Apply suggestions from code review

5aaaa97

Co-authored-by: Mike Wilson

Update gnomad_qc/v4/annotations/generate_freq.py

4142cd5

Co-authored-by: Mike Wilson

Merge pull request #414 from broadinstitute/jg/ukb_downsampling_suggestions

08e6640

UKB downsampling frequency PR suggestions

Black post github suggestion commits

508a618

Drop ukb_sample from strata, artifact of past attempt

b1879ee

Move set y to NA to end after freq dict

ed78b35

update resource so temp files are in v4

df1474b

Update gnomad_qc/v4/annotations/generate_freq.py

44be3c8

Co-authored-by: jkgoodrich

Merge pull request #407 from broadinstitute/mw/ukb_downsamplings_v4_freq

c7e6911

non_ukb downsamplings for v4 freq

Flatten faf

c3ac21e

Drop age_hists

2e6444e

Add grpmax meta and index dict

98a66af

Drop age_hist_index_dict global

22ec834

jkgoodrich requested changes Sep 7, 2023

Apply suggestions from code review

dac6ea1

Co-authored-by: jkgoodrich

mike-w-wilson commented Sep 7, 2023

jkgoodrich approved these changes Sep 7, 2023

mike-w-wilson added 3 commits September 7, 2023 16:13

Drop 27k allele site

8942dd8

make grpmax struct, drop globals

c1c0e11

drop new line

7115eab

Add cuKING to ignore-paths

b7e19cb

mike-w-wilson merged commit 9763d8a into main Sep 8, 2023
1 check passed

mike-w-wilson deleted the mw/v4_freq branch September 29, 2023 14:21
