Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bcftools view——locate problematic genotype in GT field #2309

Open
Lloyd-LiuSiyi opened this issue Oct 24, 2024 · 6 comments
Open

Bcftools view——locate problematic genotype in GT field #2309

Lloyd-LiuSiyi opened this issue Oct 24, 2024 · 6 comments

Comments

@Lloyd-LiuSiyi
Copy link

Hi! I was trying to split multiallelic variants and keep only SNPs in my vcf file when I encountered an error message saying:

bcftools norm -m -any chr13.hg38.vcf.gz -Ou |bcftools view --types snps -o chr13_split_snp.vcf.gz
[E::vcf_parse_format_fill5] Couldn't read GT data: value not a number or '.' at 13:19451111
Lines   total/split/joined/realigned/removed/skipped:   403267/58535/0/0/0/0

It seems the genotypes of some individuals were corrupted (or coded in the wrong way I guess), so I retrieved the GT with the following command:
bcftools view --regions 13:19451111 chr13.hg38.vcf.gz -o chr13_error_snp.vcf and zcat chr13.hg38.vcf.gz|grep "19451111" > error_snp.tsv
A preview of the tsv is presented here, and I also attached the genotype of 13:19451111 for all individuals in text format.
error_snp.txt

13	19451111	.	G	A	.	.	.	GT	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	./.	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	./.	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	./.	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	0/0	

The genotypes seems fine at first glance. Since there were ~460k individuals (from the UK Biobank) in the vcf file, I was wondering how I could locate the individual with the problematic genotype. Or was the error message given by mistake?
Thank you!

@pd3
Copy link
Member

pd3 commented Oct 31, 2024

When the underlying htslib parser fails, using bcftools will not help with investigating a corrupted file. Instead try something like this

zcat chr13.hg38.vcf.gz | awk '$1=="13" && $2==19451111' > invalid-row.txt
cat invalid-row.txt | sed 's,\t,\n,g' | less

to check what the genotypes look like and how many there are

@Lloyd-LiuSiyi
Copy link
Author

Thanks @pd3! I tried as you suggested, based on which I also deleted all 0/0 and ./. genotypes to get fewer rows (or there will be 460k rows) with the following command:

zcat chr13.hg38.vcf.gz | awk '$1=="13" && $2==19451111' > invalid-row.txt
cat invalid-row.txt | sed 's,\t,\n,g' > invalid-row-vertical.txt
sed -i 's/0\/0//g' invalid-row-vertical.txt
sed -i 's/\.\/\.//g' invalid-row-vertical.txt
sed -ir '/^\s*$/d' invalid-row-vertical.txt

This should leave only 0/1, 1/1 and problematic genotypes within the file, and I get:

13
19451111
.
G
A
.
.
.
GT
0/1
0/1
0/1
0/1

It seems I failed to locate the corrupted genotype. I was wondering if it's OK if I neglect the error and keep working with the vcf since (majority of) it looks fine?

@pd3
Copy link
Member

pd3 commented Nov 6, 2024

No, I wouldn't want to work with a file like that. Any chance you could share the invalid-row.txt file?

@Lloyd-LiuSiyi
Copy link
Author

Sure! Thank you so much for spending time on this issue.
Here is the invalid-row.txt file:
invalid-row.txt
I also would like to complement that all procedures were performed on Ubuntu 20.04.3 LTS of wsl2.

@pd3
Copy link
Member

pd3 commented Nov 16, 2024

Mmm, I can't find anything wrong with that line. You mentioned it's a UKB file. I have an access, can you show which file is that? I can try to debug the problem right there

@Lloyd-LiuSiyi
Copy link
Author

Hi @pd3! The file I'm working with is the whole exome sequencing 450k final release in pVCF format for chr13. We have performed genotype QC and merged blocks into a whole chromosome.
To avoid wasting your time I re-ran the command yesterday and surprisingly this time no error was given:

bcftools norm -m -any chr13.hg38.vcf.gz -Ou|bcftools view --types snps -Ou|bcftools annotate --set-id '%CHROM\:%POS\:%REF\:%ALT' -Oz -o chr13_split_snp_rename.vcf.gz
Lines   total/split/joined/realigned/removed/skipped:   403268/58535/0/0/0/0

Now I'm not sure why I faced the error during the first run, but now the result seems fine. Probably this issue can be closed?
Thanks again for your time and patience!
Best wishes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants