Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

high quality batch level variants resulting in low quality merged variant. #4

Open
batzler opened this issue Nov 7, 2024 · 3 comments

Comments

@batzler
Copy link

batzler commented Nov 7, 2024

We are finding examples where batch level R2 > 0.3 for all batches but the output R2 from hds-util is < 0.3 and thus the variants do not end up in the output data when using the -m 0.3 option. While this is a very small number of variants (just a few out of 22K HLA imputed variants) we are curious how this can happen in the calculation of the merged R2. Thanks for any explanation you can provide.

Example

hds-util -f GT,HDS,GP,DS -m 0.3 data1.dose.vcf.gz data2.dose.vcf.gz data3.dose.vcf.gz data4.dose.vcf.gz -O vcf.gz > merged.dose.vcf.gz

data1
AA_B_167_31323991_exon3_SW
6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.999851;MAF=0.00014919;AVG_CS=0.999931;R2=0.543911

data2
[batzler@mforgehn5 TAPphase1QCimp]$ zcat v4/HLA/HLAoutput/2/chr6.info.gz | grep AA_B_167_31323991_exon3_SW
6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.99989;MAF=0.000109911;AVG_CS=0.999933;R2=0.402355

data3
6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.999909;MAF=9.11355e-05;AVG_CS=0.999944;R2=0.384272

data4
6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.999825;MAF=0.000174999;AVG_CS=0.999938;R2=0.646454

merged
CHROM POS SNP REF ALT N ALT_Frq MAF minRsq minEmpRsq vtype RSQ

1 6 31324002 AA_B_167_31323991_exon3_SW A T 4 1.00 0.000131 0.384 NA AA NA

Thanks

@jonathonl
Copy link
Contributor

Can you please proved the INFO fields for the merged variant record as you did for data[1-4]?

@batzler
Copy link
Author

batzler commented Nov 8, 2024

Apologies for the confusion. When I ran using "-m 0.3" the hds-util did not provide the output for this variant. I just ran the merge for this variant separately removing the -m filter and this is the result.

6 31324002 AA_B_167_31323991_exon3_SW A T . . IMPUTED;AF=0.99989;MAF=0.00010
9732;AVG_CS=0.999958;R2=0.174244

@jonathonl
Copy link
Contributor

Thanks.

The R2 field is an estimated r-squared based on the mean the HDS values. It's plausible for the chunked data to "fit" their respective means better than the merged data. This is also plausible in empirical r-squared. In an extreme example where the x is unimodal and the y is bimodal due to two underlying distributions with different means, the r-squared of the combined data could be much worse than if you were to compute r-squared on the two underlying distributions separately.

I just double checked that the R2 INFO field was being computed correctly by separating an imputed VCF into two sample sets and then merging back together with hds-util. This will give equivalent results (barring precision errors).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants