high quality batch level variants resulting in low quality merged variant. #4

batzler · 2024-11-07T20:21:39Z

We are finding examples where batch level R2 > 0.3 for all batches but the output R2 from hds-util is < 0.3 and thus the variants do not end up in the output data when using the -m 0.3 option. While this is a very small number of variants (just a few out of 22K HLA imputed variants) we are curious how this can happen in the calculation of the merged R2. Thanks for any explanation you can provide.

Example

hds-util -f GT,HDS,GP,DS -m 0.3 data1.dose.vcf.gz data2.dose.vcf.gz data3.dose.vcf.gz data4.dose.vcf.gz -O vcf.gz > merged.dose.vcf.gz

data1
AA_B_167_31323991_exon3_SW
6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.999851;MAF=0.00014919;AVG_CS=0.999931;R2=0.543911

data2
[batzler@mforgehn5 TAPphase1QCimp]$ zcat v4/HLA/HLAoutput/2/chr6.info.gz | grep AA_B_167_31323991_exon3_SW
6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.99989;MAF=0.000109911;AVG_CS=0.999933;R2=0.402355

data3
6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.999909;MAF=9.11355e-05;AVG_CS=0.999944;R2=0.384272

data4
6 31324002 AA_B_167_31323991_exon3_SW A T . .IMPUTED;AF=0.999825;MAF=0.000174999;AVG_CS=0.999938;R2=0.646454

merged
CHROM POS SNP REF ALT N ALT_Frq MAF minRsq minEmpRsq vtype RSQ

1 6 31324002 AA_B_167_31323991_exon3_SW A T 4 1.00 0.000131 0.384 NA AA NA

Thanks

jonathonl · 2024-11-07T23:35:58Z

Can you please proved the INFO fields for the merged variant record as you did for data[1-4]?

batzler · 2024-11-08T00:55:28Z

Apologies for the confusion. When I ran using "-m 0.3" the hds-util did not provide the output for this variant. I just ran the merge for this variant separately removing the -m filter and this is the result.

6 31324002 AA_B_167_31323991_exon3_SW A T . . IMPUTED;AF=0.99989;MAF=0.00010
9732;AVG_CS=0.999958;R2=0.174244

jonathonl · 2024-11-08T20:35:04Z

Thanks.

The R2 field is an estimated r-squared based on the mean the HDS values. It's plausible for the chunked data to "fit" their respective means better than the merged data. This is also plausible in empirical r-squared. In an extreme example where the x is unimodal and the y is bimodal due to two underlying distributions with different means, the r-squared of the combined data could be much worse than if you were to compute r-squared on the two underlying distributions separately.

I just double checked that the R2 INFO field was being computed correctly by separating an imputed VCF into two sample sets and then merging back together with hds-util. This will give equivalent results (barring precision errors).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

high quality batch level variants resulting in low quality merged variant. #4

high quality batch level variants resulting in low quality merged variant. #4

batzler commented Nov 7, 2024

jonathonl commented Nov 7, 2024

batzler commented Nov 8, 2024

jonathonl commented Nov 8, 2024

high quality batch level variants resulting in low quality merged variant. #4

high quality batch level variants resulting in low quality merged variant. #4

Comments

batzler commented Nov 7, 2024

jonathonl commented Nov 7, 2024

batzler commented Nov 8, 2024

jonathonl commented Nov 8, 2024