FilterIntervals will get rid of Y chromosome intervals if there are >50% of female samples #9043

Open
NotAPoetButACriminal opened this issue Nov 14, 2024 · 2 comments


NotAPoetButACriminal commented Nov 14, 2024

Bug Report

FilterIntervals

gatk FilterIntervals \
  -L ${OUTPUT}/bins.interval_list \
  --annotated-intervals ${OUTPUT}/bins_annotated.interval_list \
  -imr OVERLAPPING_ONLY \
  $INPUTHDF5S \
  -O ${OUTPUT}/bins_filtered.interval_list

Description

I've been running the gCNV pipeline as per this article on WES samples and have noticed that in some of my runs all of the Y chromosome intervals are removed. This throws off sex estimation during ploidy determination, which in turn corrupts the CNV calls on the sex chromosomes.
Correct me if I'm wrong, but it seems that the low-count filter (i.e. "intervals with a count < 10 in > 50.0% of samples fail") will remove the Y chromosome from any batch in which more than half of the samples are female. Raising the percentage (e.g. to 55%, 60%, etc.) until it exceeds the percentage of female samples avoids the problem, but it also changes the interval filtering for all other contigs.
It seems that there should be special handling for sex chromosomes, for example stating "--allosomal-contig Y" like when using DetermineGermlineContigPloidy, or an always-keep-intervals option, like the -XL flag just in reverse.
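To make the effect concrete, here is a minimal sketch of the low-count rule as it is worded in the tool's help text. It is not GATK code: the function name and the read counts are made up for illustration, and only the defaults (count threshold 10, percentage of samples 50.0) come from the tool.

```python
def fails_low_count_filter(counts, count_threshold=10, percentage_of_samples=50.0):
    """Return True if an interval would be dropped under the low-count rule.

    Both comparisons are strict, following the wording of the help text.
    """
    n_low = sum(1 for c in counts if c < count_threshold)
    return 100.0 * n_low / len(counts) > percentage_of_samples


# Hypothetical read counts for a single chrY interval.
female_count, male_count = 0, 40
mostly_female = [female_count] * 6 + [male_count] * 4  # 60% female
mostly_male = [female_count] * 4 + [male_count] * 6    # 40% female

print(fails_low_count_filter(mostly_female))  # True: interval is dropped
print(fails_low_count_filter(mostly_male))    # False: interval is kept
print(fails_low_count_filter(mostly_female, percentage_of_samples=90.0))  # False
```

The same well-covered interval is kept or dropped purely as a function of the cohort's sex ratio, which is the behavior described above.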

Contributor

gokalpcelik commented Nov 14, 2024

Hi @NotAPoetButACriminal
This is not a bug but the intended behavior of the tool. You can adjust the parameters below to keep Y intervals from being eliminated because of the proportion of female samples.

--exclude-intervals,-XL <String>
                              One or more genomic intervals to exclude from processing  This argument may be specified 0
                              or more times. Default value: null. 

--extreme-count-filter-maximum-percentile <Double>
                              Maximum-percentile parameter for the extreme-count filter.  Intervals with a count that
                              has a percentile strictly greater than this in a percentage of samples strictly greater
                              than extreme-count-filter-percentage-of-samples will be filtered out.  (This is the second
                              count-based filter applied.)  Default value: 99.0. 

--extreme-count-filter-minimum-percentile <Double>
                              Minimum-percentile parameter for the extreme-count filter.  Intervals with a count that
                              has a percentile strictly less than this in a percentage of samples strictly greater than
                              extreme-count-filter-percentage-of-samples will be filtered out.  (This is the second
                              count-based filter applied.)  Default value: 1.0. 

--extreme-count-filter-percentage-of-samples <Double>
                              Percentage-of-samples parameter for the extreme-count filter.  Intervals with a count that
                              has a percentile outside of [extreme-count-filter-minimum-percentile,
                              extreme-count-filter-maximum-percentile] in a percentage of samples strictly greater than
                              this will be filtered out.  (This is the second count-based filter applied.)  Default
                              value: 90.0. 


--low-count-filter-count-threshold <Integer>
                              Count-threshold parameter for the low-count filter.  Intervals with a count strictly less
                              than this threshold in a percentage of samples strictly greater than
                              low-count-filter-percentage-of-samples will be filtered out.  (This is the first
                              count-based filter applied.)  Default value: 10. 

--low-count-filter-percentage-of-samples <Double>
                              Percentage-of-samples parameter for the low-count filter.  Intervals with a count strictly
                              less than low-count-filter-count-threshold in a percentage of samples strictly greater
                              than this will be filtered out.  (This is the first count-based filter applied.)  Default
                              value: 50.0.

We suggest setting --low-count-filter-percentage-of-samples to a value well above the percentage of female samples (e.g. 90) so that Y intervals remain regardless of the female count. You may also wish to avoid running this filter on the Y chromosome by passing -XL chrY as a parameter.

Alternatively, you may run this tool in three rounds: first on autosomes only, second on the X chromosome, and finally on the Y chromosome. Beware that males carry a single copy of the X chromosome, so counts may be affected and you may end up removing more of X than you expect. Once you have produced filtered intervals for the autosomes, X, and Y, you can combine them and proceed to the next stage. We do not have special handling for X and Y in this step because it would require knowing the sex and chromosome counts of all samples beforehand, verified by other orthogonal methods.
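One way to prepare the inputs for such a three-round run is to split the interval list by contig group before filtering. The sketch below is only an illustration: the helper names are hypothetical, it assumes the usual interval_list layout (header lines starting with `@`, then tab-separated records with the contig in the first column), and it has not been tested against GATK.

```python
def contig_group(contig):
    """Classify a contig as 'X', 'Y', or 'autosome' (accepts 'chrX' or 'X' style names).

    Anything that is not X or Y (including chrM or decoys) is lumped with the
    autosomes here for simplicity.
    """
    name = contig[3:] if contig.startswith("chr") else contig
    return name if name in ("X", "Y") else "autosome"


def split_interval_list(lines):
    """Split interval_list lines into header lines and per-group interval records."""
    header = []
    groups = {"autosome": [], "X": [], "Y": []}
    for line in lines:
        if line.startswith("@"):
            header.append(line)
        elif line.strip():
            groups[contig_group(line.split("\t")[0])].append(line)
    return header, groups


# Tiny made-up interval list for demonstration.
example = [
    "@HD\tVN:1.6",
    "@SQ\tSN:chr1\tLN:1000",
    "chr1\t1\t100\t+\tbin1",
    "chrX\t1\t100\t+\tbin2",
    "chrY\t1\t100\t+\tbin3",
]
header, groups = split_interval_list(example)
print({group: len(records) for group, records in groups.items()})
```

Each group, written out with the shared header, could then be passed to its own FilterIntervals run, and the three filtered outputs concatenated afterwards.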

Setting the percentage of samples to a higher value may produce more CNV calls in common CNV polymorphic regions, but most of these will not be false positives if your samples are well balanced for FOLD80 base penalty, AT/GC dropout rates, and insert sizes.

Author

NotAPoetButACriminal commented Nov 14, 2024

This intended behavior is not very useful when the outcome changes every time with the sex ratio of the cohort. Running with -XL chrY will always remove chrY, which is the problem to begin with. Some kind of reverse -XL would be needed as an always-include option, but even that would not be ideal, as chrY bins would then never be filtered.

I understand that FilterIntervals itself is not aware of sex. However, DetermineGermlineContigPloidy also estimates sex, so perhaps in future versions FilterIntervals could be expanded with a --contig-ploidy-calls flag similar to GermlineCNVCaller's. It could then be run downstream of DetermineGermlineContigPloidy and filter chrY intervals using only the samples that have ploidy 1 on chrY, i.e. males.
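A rough sketch of what such a sex-aware rule could look like is shown below. This is purely hypothetical: FilterIntervals has no such option today, and the function name, the per-sample ploidy list, and the choice to drop an interval when no sample carries the contig are all assumptions made for illustration.

```python
def fails_sex_aware_low_count_filter(counts, ploidies, count_threshold=10,
                                     percentage_of_samples=50.0):
    """Apply the low-count rule only over samples that carry the contig.

    counts and ploidies are parallel per-sample lists for one interval;
    ploidies would come from a prior ploidy-calling step.
    """
    eligible = [c for c, p in zip(counts, ploidies) if p >= 1]
    if not eligible:
        # Assumption: with no sample carrying the contig, nothing supports keeping it.
        return True
    n_low = sum(1 for c in eligible if c < count_threshold)
    return 100.0 * n_low / len(eligible) > percentage_of_samples


# Hypothetical chrY interval in a cohort of 6 females and 4 males.
ploidies = [0] * 6 + [1] * 4
well_covered = [0] * 6 + [40, 38, 45, 41]
poorly_covered = [0] * 6 + [2, 3, 40, 1]

print(fails_sex_aware_low_count_filter(well_covered, ploidies))    # False: kept
print(fails_sex_aware_low_count_filter(poorly_covered, ploidies))  # True: dropped
```

With this rule, chrY bins are still filtered when they are poorly covered in males, but the number of females in the cohort no longer affects the outcome.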

For now, a quick fix is to increase --low-count-filter-percentage-of-samples, but this may end up producing more CNV calls, as you said.
