Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v4 frequency script #376

Merged
merged 66 commits into from
Sep 8, 2023
Merged

v4 frequency script #376

merged 66 commits into from
Sep 8, 2023

Conversation

mike-w-wilson
Copy link
Contributor

@mike-w-wilson mike-w-wilson commented Jul 20, 2023

Here is the full frequency script. I've run it on the test dataset using the --use-test-dataset arg. The script currently puts out all of the annotations from the split VDSs and any annotation used in correcting for the high AB hets. I plan on removing these once this is ready and have a note before the final write on which fields I plan on removing for now. I also realized this does not account for the downsamplings within the ukb subset which Konrad had asked for so I will work on adding that but figured this is huge so should not hold up review with it. As I mentioned in one of our chats, I copied Tim's freq updates and then updated them here to account for the upstream downsampling annotation. Post v4, I will rewrite the gnomad_methods annotate_freq method. This depends on the code added in broadinstitute/gnomad_methods#565 so will fail checks until that is in.

@mike-w-wilson
Copy link
Contributor Author

I'm realizing the non-ukb subset needs to have its own grpmax and age hists and this code doesnt currently account for that. Since we are already splitting by ukb, this should be easy enough to add in but I wont until we get through one round of review.

Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small change, and pinging KC to look at a few lines

gnomad_qc/v4/annotations/generate_freq.py Outdated Show resolved Hide resolved
gnomad_qc/v4/annotations/generate_freq.py Outdated Show resolved Hide resolved
faf_index_dict=[
make_faf_index_dict(hl.eval(x), label_delimiter="-") for x in faf_meta_expr
],
grp_max_meta=[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't follow all the discussions on this so pinging @ch-kr to check this (L778-L782).

Comment on lines 778 to 782
grpmax_meta=[
{"dataset": "gnomad"},
{"dataset": "non_ukb"},
], # TODO: These seem silly but keeps with the meta/dict theme of globals
grpmax_index_dict=SUBSET_DICT,
Copy link
Contributor Author

@mike-w-wilson mike-w-wilson Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I committed a suggestion on naming and it looks like it removed @jkgoodrich 's other comment. @ch-kr do you have thoughts on the grpmax_meta and index_dict? L778-782

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm. what about removing these globals and restructuring grpmax to have the same structure as the age hists? something like

grpmax: struct{
    gnomad: struct{
        AC: int32,
        AF: float64,
        AN: int32,
        homozygote_count: int64,
        pop: str,
        faf95: float64,
    },
    non-ukb: struct{
            AC: int32,
            AF: float64,
            AN: int32,
            homozygote_count: int64,
            pop: str,
            faf95: float64,
        }
} 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with that but just confirming we're dropping age hists so it actually wont look like the above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yes sorry that wasn't clear. we're dropping the non-UKB age hists, so that annotation looks like this now, right?

age_hists: struct{
            age_hist_hom: struct {
                bin_edges: array<float64>, 
                bin_freq: array<int64>, 
                n_smaller: int64, 
                n_larger: int64
            }, 
            age_hist_het: struct {
                bin_edges: array<float64>, 
                bin_freq: array<int64>, 
                n_smaller: int64, 
                n_larger: int64
            },
        }

Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved pending KC's review

@mike-w-wilson
Copy link
Contributor Author

KC approved the structure and schema output via slack, will merge once we resolve the pylint check:#416

@mike-w-wilson mike-w-wilson merged commit 9763d8a into main Sep 8, 2023
1 check passed
@mike-w-wilson mike-w-wilson deleted the mw/v4_freq branch September 29, 2023 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants