Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Script to compute combined FAF of v4 genomes and exomes #403

Merged
merged 40 commits into from
Oct 16, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
9adcf5e
Copy of v3 compare freqs to v4
jkgoodrich Aug 22, 2023
b961693
Modify compare freqs for v4
jkgoodrich Aug 23, 2023
08804ef
clean up cochran_mantel_haenszel_test
jkgoodrich Aug 23, 2023
b172649
Add resources and docstrings
jkgoodrich Aug 24, 2023
cc2c665
Merge branch 'main' of https://github.com/broadinstitute/gnomad_qc in…
jkgoodrich Sep 14, 2023
51848ae
change input to freq_ht & test on one gene
KoalaQin Sep 22, 2023
9ff063a
removed combined_faf temporarily
KoalaQin Sep 22, 2023
4fd6064
minor editing for pops
KoalaQin Sep 22, 2023
a16ec9b
Merge remote-tracking branch 'origin' into qh/combined_faf
KoalaQin Sep 22, 2023
62ff299
change to use genomes release_sites
KoalaQin Sep 22, 2023
10881a0
Merge remote-tracking branch 'origin' into qh/combined_faf
KoalaQin Sep 22, 2023
c846c20
calcultate faf/grpmax if unavailable in freq_ht & rename popmax to gr…
KoalaQin Sep 24, 2023
af05bfe
add statsmodels in requirements-dev.txt
KoalaQin Sep 25, 2023
8cae8ca
reformat with black
KoalaQin Sep 25, 2023
190d517
remove potential idx duplicates
KoalaQin Sep 25, 2023
52ae80f
Merge branch 'main' of https://github.com/broadinstitute/gnomad_qc in…
jkgoodrich Sep 25, 2023
9b3573d
Merge branch 'jg/combined_faf' of https://github.com/broadinstitute/g…
jkgoodrich Sep 25, 2023
3b60904
add function to calculate fafmax/pop for faf95 and faf99
KoalaQin Sep 25, 2023
8a737cc
fafmax must be greater than 0
KoalaQin Sep 25, 2023
19815ed
address review comments
KoalaQin Sep 28, 2023
e2533bd
add TODO for passing variant_qc HT for filtering
KoalaQin Sep 28, 2023
813c5b3
annotate joint_fafmax directly from fax_expr output
KoalaQin Sep 28, 2023
0050c1c
change name of faf_max function
KoalaQin Oct 5, 2023
01d35a7
Merge pull request #446 from broadinstitute/qh/combined_faf
jkgoodrich Oct 9, 2023
8892dc3
Merge branch 'main' of https://github.com/broadinstitute/gnomad_qc in…
jkgoodrich Oct 12, 2023
ad5f413
Get PCSK9 test working with test genomes HT
jkgoodrich Oct 13, 2023
4458498
Change to work with exome and genome frequency data
jkgoodrich Oct 13, 2023
33a82fa
Address review comments
jkgoodrich Oct 13, 2023
44ac53e
Merge branch 'main' of https://github.com/broadinstitute/gnomad_qc in…
jkgoodrich Oct 13, 2023
894028f
Temp change for testing
jkgoodrich Oct 13, 2023
ca0e9a2
fix use of filter_arrays_by_meta
jkgoodrich Oct 13, 2023
27f8a20
Add checkpoint to avoid error
jkgoodrich Oct 13, 2023
15e93ba
apply_keep_to_only_items_in_filter -> exact_match
jkgoodrich Oct 13, 2023
9f805f4
Merge branch 'main' of https://github.com/broadinstitute/gnomad_qc in…
jkgoodrich Oct 13, 2023
b176054
Add versions to global annotation and finalize
jkgoodrich Oct 15, 2023
8045e37
Merge pull request #486 from broadinstitute/jg/combined_faf_final
jkgoodrich Oct 16, 2023
2fadc4a
Apply suggestions from code review
jkgoodrich Oct 16, 2023
775a42c
Add review suggestions
jkgoodrich Oct 16, 2023
187041e
Update gnomad_qc/v4/create_release/create_combined_faf_release_ht.py
jkgoodrich Oct 16, 2023
90aeff0
faf_meta to literal
jkgoodrich Oct 16, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 19 additions & 22 deletions gnomad_qc/v4/create_release/create_combined_faf_release_ht.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
pop_max_expr,
)
from gnomad.utils.filtering import filter_arrays_by_meta
from gnomad.utils.release import make_faf_index_dict, make_freq_index_dict_from_meta
from gnomad.utils.release import make_freq_index_dict_from_meta
from gnomad.utils.slack import slack_notifications
from statsmodels.stats.contingency_tables import StratifiedTable

Expand Down Expand Up @@ -59,15 +59,10 @@ def filter_gene_to_test(ht: hl.Table) -> hl.Table:

def extract_freq_info(
ht: hl.Table,
faf_pops: List[str],
prefix: str,
) -> hl.Table:
"""
Extract frequencies and FAF for populations in `pops` and `faf_pops` respectively.

.. note::

The reduced frequency arrays include adj and raw in addition to the `pops`.
Extract frequencies and FAF for adj, raw (only for frequencies), adj by pop, adj by sex, and adj by pop/sex.

The following annotations are filtered and renamed:
- freq: {prefix}_freq
Expand All @@ -77,15 +72,12 @@ def extract_freq_info(
- freq_meta: {prefix}_freq_meta
- faf_meta: {prefix}_faf_meta

This only keeps variants that PASS filtering.
This only keeps variants that pass variant QC filtering.

:param ht: Table with frequency and FAF information.
:param faf_pops: List of populations to include in the reduced FAF arrays.
:param prefix: Prefix to add to each of the filtered annotations.
:return: Table with filtered frequency and FAF information.
"""
# Keep full gnomAD callset adj [0], raw [1], and ancestry and/or sex specific adj
# frequencies.
logger.info(
"Keeping only frequencies for adj, raw, adj by pop, adj by sex, and adj by "
"pop/sex..."
Expand All @@ -107,25 +99,26 @@ def extract_freq_info(
combine_operator="or",
exact_match=True,
)
logger.info("Filtering to only adj frequencies for FAF...")
faf_meta, faf = filter_arrays_by_meta(
hl.literal(faf_meta),
{"faf": faf["faf"]},
{"gen_anc": faf_pops},
{"group": ["adj"]},
combine_operator="or",
)

# Rename filtered annotations with supplied prefix and remove "raw" group from faf.
# Rename filtered annotations with supplied prefix.
ht = ht.select(
**{
f"{prefix}_freq": array_exprs["freq"],
f"{prefix}_faf": ht.faf[:1].extend(faf["faf"][2:]),
f"{prefix}_faf": faf["faf"][2:],
jkgoodrich marked this conversation as resolved.
Show resolved Hide resolved
}
)
ht = ht.select_globals(
**{
f"{prefix}_freq_meta": freq_meta,
f"{prefix}_freq_meta_sample_count": array_exprs["freq_meta_sample_count"],
f"{prefix}_faf_meta": ht.faf_meta[:1].extend(ht.faf_meta[2:]),
f"{prefix}_faf_meta": faf_meta,
}
)

Expand Down Expand Up @@ -193,13 +186,13 @@ def get_joint_freq_and_faf(
),
joint_grpmax=grpmax,
)
ht = ht.checkpoint(hl.utils.new_temp_file("combine_faf", "ht"), True)
ht = ht.checkpoint(hl.utils.new_temp_file("combine_faf", "ht"))

ht = ht.annotate_globals(
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
joint_freq_meta=freq_meta,
joint_freq_index_dict=make_freq_index_dict_from_meta(freq_meta),
joint_faf_meta=faf_meta,
joint_faf_index_dict=make_faf_index_dict(faf_meta),
joint_faf_index_dict=make_freq_index_dict_from_meta(faf_meta),
joint_freq_meta_sample_count=count_arrays_dict["counts"],
)

Expand Down Expand Up @@ -248,7 +241,11 @@ def perform_cmh_test(
Perform the Cochran–Mantel–Haenszel test on the alleles counts between two frequency expressions using population as the stratification.

This is done by creating a list of 2x2 matrices of freq1/freq2 reference and
alternate allele counts for each population in pops.
alternate allele counts for each population in pops. The stats used in
`perform_contingency_table_test` can only be used on 2x2 matrices, so we perform
that per population to get one statistic per population. The CMH test allows for
multiple 2x2 matrices for a specific stratification, giving a single statistic
across all populations.
jkgoodrich marked this conversation as resolved.
Show resolved Hide resolved

`freq1_expr` and `freq2_expr` should be ArrayExpressions of structs with 'AN' and
'AC' annotations.
Expand Down Expand Up @@ -330,8 +327,8 @@ def get_combine_faf_resources(
"--create-combined-frequency-table",
output_resources={"comb_freq_ht": get_combined_frequency(test=test)},
input_resources={
"generate_freq.py": {"exomes_ht": get_freq()},
"create_release_sites_ht_genomes.py": {
"generate_freq.py": {"exomes_ht": get_freq(test=test)},
"generate_freq_genomes.py": {
"genomes_ht": get_freq(test=test, data_type="genomes")
},
},
Expand Down Expand Up @@ -405,8 +402,8 @@ def main(args):
lambda x: x.annotate(homozygote_count=hl.int32(x.homozygote_count))
)
)
exomes_ht = extract_freq_info(exomes_ht, faf_pops, "exomes")
genomes_ht = extract_freq_info(genomes_ht, faf_pops, "genomes")
exomes_ht = extract_freq_info(exomes_ht, "exomes")
genomes_ht = extract_freq_info(genomes_ht, "genomes")

ht = get_joint_freq_and_faf(genomes_ht, exomes_ht)
ht = ht.annotate_globals(
Expand Down
2 changes: 1 addition & 1 deletion gnomad_qc/v4/resources/annotations.py
Original file line number Diff line number Diff line change
Expand Up @@ -333,7 +333,7 @@ def get_combined_frequency(test: bool = False) -> VersionedTableResource:
Get the combined v4 genome and exome frequency annotation VersionedTableResource.

:param test: Whether to use a tmp path for testing.
:return: Hail Table containing combined frequency annotations
:return: Hail Table containing combined frequency annotations.
"""
return VersionedTableResource(
CURRENT_COMBINED_FAF_RELEASE,
Expand Down