Bcftools view——locate problematic genotype in GT field #2309

Lloyd-LiuSiyi · 2024-10-24T05:13:38Z

Hi! I was trying to split multiallelic variants and keep only SNPs in my vcf file when I encountered an error message saying:

bcftools norm -m -any chr13.hg38.vcf.gz -Ou |bcftools view --types snps -o chr13_split_snp.vcf.gz
[E::vcf_parse_format_fill5] Couldn't read GT data: value not a number or '.' at 13:19451111
Lines   total/split/joined/realigned/removed/skipped:   403267/58535/0/0/0/0

It seems the genotypes of some individuals were corrupted (or coded in the wrong way I guess), so I retrieved the GT with the following command:
bcftools view --regions 13:19451111 chr13.hg38.vcf.gz -o chr13_error_snp.vcf and zcat chr13.hg38.vcf.gz|grep "19451111" > error_snp.tsv
A preview of the tsv is presented here, and I also attached the genotype of 13:19451111 for all individuals in text format.
error_snp.txt

13	19451111	.	G	A	.	.	.	GT	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	./.	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	./.	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	./.	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0

The genotypes seems fine at first glance. Since there were ~460k individuals (from the UK Biobank) in the vcf file, I was wondering how I could locate the individual with the problematic genotype. Or was the error message given by mistake?
Thank you!

The text was updated successfully, but these errors were encountered:

pd3 · 2024-10-31T10:14:42Z

When the underlying htslib parser fails, using bcftools will not help with investigating a corrupted file. Instead try something like this

zcat chr13.hg38.vcf.gz | awk '$1=="13" && $2==19451111' > invalid-row.txt
cat invalid-row.txt | sed 's,\t,\n,g' | less

to check what the genotypes look like and how many there are

Lloyd-LiuSiyi · 2024-11-01T04:32:41Z

Thanks @pd3! I tried as you suggested, based on which I also deleted all 0/0 and ./. genotypes to get fewer rows (or there will be 460k rows) with the following command:

zcat chr13.hg38.vcf.gz | awk '$1=="13" && $2==19451111' > invalid-row.txt
cat invalid-row.txt | sed 's,\t,\n,g' > invalid-row-vertical.txt
sed -i 's/0\/0//g' invalid-row-vertical.txt
sed -i 's/\.\/\.//g' invalid-row-vertical.txt
sed -ir '/^\s*$/d' invalid-row-vertical.txt

This should leave only 0/1, 1/1 and problematic genotypes within the file, and I get:

13
19451111
.
G
A
.
.
.
GT
0/1
0/1
0/1
0/1

It seems I failed to locate the corrupted genotype. I was wondering if it's OK if I neglect the error and keep working with the vcf since (majority of) it looks fine?

pd3 · 2024-11-06T08:56:18Z

No, I wouldn't want to work with a file like that. Any chance you could share the invalid-row.txt file?

Lloyd-LiuSiyi · 2024-11-06T09:20:37Z

Sure! Thank you so much for spending time on this issue.
Here is the invalid-row.txt file:
invalid-row.txt
I also would like to complement that all procedures were performed on Ubuntu 20.04.3 LTS of wsl2.

pd3 · 2024-11-16T14:53:08Z

Mmm, I can't find anything wrong with that line. You mentioned it's a UKB file. I have an access, can you show which file is that? I can try to debug the problem right there

Lloyd-LiuSiyi · 2024-11-19T05:10:44Z

Hi @pd3! The file I'm working with is the whole exome sequencing 450k final release in pVCF format for chr13. We have performed genotype QC and merged blocks into a whole chromosome.
To avoid wasting your time I re-ran the command yesterday and surprisingly this time no error was given:

bcftools norm -m -any chr13.hg38.vcf.gz -Ou|bcftools view --types snps -Ou|bcftools annotate --set-id '%CHROM\:%POS\:%REF\:%ALT' -Oz -o chr13_split_snp_rename.vcf.gz
Lines   total/split/joined/realigned/removed/skipped:   403268/58535/0/0/0/0

Now I'm not sure why I faced the error during the first run, but now the result seems fine. Probably this issue can be closed?
Thanks again for your time and patience!
Best wishes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bcftools view——locate problematic genotype in GT field #2309

Bcftools view——locate problematic genotype in GT field #2309

Lloyd-LiuSiyi commented Oct 24, 2024

pd3 commented Oct 31, 2024

Lloyd-LiuSiyi commented Nov 1, 2024

pd3 commented Nov 6, 2024

Lloyd-LiuSiyi commented Nov 6, 2024

pd3 commented Nov 16, 2024

Lloyd-LiuSiyi commented Nov 19, 2024

Bcftools view——locate problematic genotype in GT field #2309

Bcftools view——locate problematic genotype in GT field #2309

Comments

Lloyd-LiuSiyi commented Oct 24, 2024

pd3 commented Oct 31, 2024

Lloyd-LiuSiyi commented Nov 1, 2024

pd3 commented Nov 6, 2024

Lloyd-LiuSiyi commented Nov 6, 2024

pd3 commented Nov 16, 2024

Lloyd-LiuSiyi commented Nov 19, 2024