Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcf2db without samples #38

Open
parkesK opened this issue Feb 1, 2018 · 3 comments
Open

vcf2db without samples #38

parkesK opened this issue Feb 1, 2018 · 3 comments

Comments

@parkesK
Copy link

parkesK commented Feb 1, 2018

Hi there,

I annotated a VCF of Clinvar variants using vcfanno and would like to create a database using vcf2db. However, there are no samples in the VCF. I tried adding a fake sample in the header and then making a ped file to go along with this, but am getting an error with this:

File "cyvcf2/cyvcf2.pyx", line 932, in cyvcf2.cyvcf2.Variant.gt_bases.get (cyvcf2/cyvcf2.c:22059)
ValueError: range() step argument must not be zero

I am guessing this is because there are no genotypes? Is there anyway to upload this vcf without samples? I have attached 2 VCFs that are each 1000 lines. The one that has _Copy in the title is the original; the other one has the fake sample added. I also attached my fake ped file.

Thanks,
Parkes

clinvar_noncoding_pathogenic_headed_NoAnno_Split_VEP_annotated_HeaderFixed_Copy_1000.vcf.gz

fake.ped.gz

clinvar_noncoding_pathogenic_headed_NoAnno_Split_VEP_annotated_HeaderFixed_1000.vcf.gz

@Phillip-a-richmond
Copy link

+1 on this request

@davemcg
Copy link

davemcg commented Mar 1, 2018

Your faux vcf is wrong - you need to add, bare minimum, GT to the FORMAT column and a fake genotype (0/1) to the fake sample 20 column. You can't just leave them blank.

Also, you need to add a FORMAT header field to define what GT is (https://samtools.github.io/hts-specs/VCFv4.2.pdf).

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

@davemcg
Copy link

davemcg commented Mar 1, 2018

Here's a super barebones python script that will add a fake FORMAT and sample field to a vcf. You have to give a gzipped vcf as the first argument and a fake sample name (no spaces or special characters!) as the second argument to the script.

python3 this_script.py clinvar.vcf.gz FAKESAMPLENAME | bgzip > new_clinvar.vcf.gz

#!/usr/local/bin/python3

import sys
import gzip

vcf = sys.argv[1]
sample_name = sys.argv[2]

for line in gzip.open(vcf,'r'):
    line = line.decode('utf-8')
    if line[0:2] == '##':
        print(line[:-1])
    elif line[0:2] == '#C':
        print('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">')
        print('##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">')
        print('##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">')
        print(line[:-1] + '\tFORMAT\t' + sample_name)
    else:
        line = line.split()
        line = line[:-1]
        print('\t'.join(line) + '\t.\tGT:GQ:DP\t0/1:100:100')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants