Issue with Allele Count Exceeding 99 in UKBB WGS VCF Files #32

Open
Zhangliubin opened this issue Nov 15, 2024 · 0 comments


Dear Genozip Team,

I am currently working with the UK Biobank (UKBB) Whole Genome Sequencing (WGS) dataset, which includes approximately 490,000 samples and over a billion variant sites. While compressing a VCF.GZ file with genozip, I encountered the following error:

genozip chr21.samples_1.hg38.vcf.gz : 25% (0 seconds)
Error vcf_seg_FORMAT_in variant 21:9028016: VCF file sample 1 - genozip currently supports only alleles up to 99
The error occurs at variant sites that have more than 99 alleles. In rare cases, a single site has more than 300 alleles.

Dataset: UKBB WGS data, with 490,000 samples and 1 billion variant sites.

I could not find any options in the documentation to handle variant sites with such a high allele count. Specifically, I would like to know:

- Is there an option to split these multi-allelic variant sites into multiple biallelic sites?
- Or is there an option to discard variants with an excessive number of alleles (e.g., more than 99)?
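
To make the second option concrete, here is a minimal sketch of discarding such sites before compression, done outside genozip. It is not a genozip feature. The function names and the threshold of 99 ALT alleles are assumptions for illustration: the error message only says "alleles up to 99", so the exact boundary genozip enforces may differ.

```python
import gzip
from typing import Iterable, Iterator

# Assumed threshold, taken from the wording of the error message.
MAX_ALT_ALLELES = 99


def count_alt_alleles(record: str) -> int:
    """Return the number of ALT alleles in one VCF data line."""
    alt = record.split("\t", 5)[4]  # ALT is the 5th tab-separated column
    return 0 if alt == "." else alt.count(",") + 1


def drop_high_allele_sites(
    lines: Iterable[str], max_alt: int = MAX_ALT_ALLELES
) -> Iterator[str]:
    """Yield header lines and records with at most max_alt ALT alleles."""
    for line in lines:
        if line.startswith("#") or count_alt_alleles(line) <= max_alt:
            yield line


def filter_vcf_gz(src: str, dst: str, max_alt: int = MAX_ALT_ALLELES) -> None:
    """Stream a gzip-compressed VCF from src to dst, dropping large sites."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        fout.writelines(drop_high_allele_sites(fin, max_alt))
```

A call such as `filter_vcf_gz("chr21.samples_1.hg38.vcf.gz", "chr21.filtered.vcf.gz")` would stream the file line by line (the output name is hypothetical). The output is plain gzip rather than BGZF, so it would need recompressing with bgzip before indexing.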

I appreciate your help in resolving this issue and would be grateful for any guidance or potential solutions.

Thank you for your time and support.

Best regards,
Liubin Zhang
