Pangenome of human nonreference sequences from population-scale long-read sequencing
GNRS
To ensure that the results are reproducible, the pipeline is implemented in the Snakemake framework, with the software environment managed by Anaconda. The pipeline can also be applied to other cohorts with long-read sequencing data.
The GNRS workflow for population-scale long-read sequencing data is as follows:
- a, Long-read sequencing data from different platforms are de novo assembled and polished.
- b, The NRSs are anchored to GRCh38. Placed NRSs are clustered to select the representative NRSs, and unplaced NRSs are clustered after filtering out contaminants and centromeric repeats. Then, we merge the placed and the unplaced NRSs to obtain the nonredundant NRSs of the whole population.
- c, vg is used to construct the graph pangenome, and each NRS is genotyped in every individual.
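To make the idea in step b concrete, here is a toy Python sketch of clustering placed NRSs and choosing one representative per cluster. It is not the method used by GNRS: the `PlacedNRS` record, the 50 bp `window`, and the "longest sequence wins" rule are all assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical record: one placed NRS, its GRCh38 anchor position and its length.
PlacedNRS = namedtuple("PlacedNRS", ["sample", "chrom", "pos", "length"])


def cluster_placed_nrs(records, window=50):
    """Group NRSs on the same chromosome whose anchors are within `window` bp.

    `window` is an assumed illustrative threshold, not a GNRS parameter.
    """
    clusters = []
    for rec in sorted(records, key=lambda r: (r.chrom, r.pos)):
        last = clusters[-1] if clusters else None
        if last and last[-1].chrom == rec.chrom and rec.pos - last[-1].pos <= window:
            last.append(rec)  # close to the previous anchor: same cluster
        else:
            clusters.append([rec])  # start a new cluster
    return clusters


def pick_representatives(clusters):
    """Choose the longest member of each cluster (an arbitrary illustrative rule)."""
    return [max(cluster, key=lambda r: r.length) for cluster in clusters]


records = [
    PlacedNRS("s1", "chr1", 1000, 320),
    PlacedNRS("s2", "chr1", 1010, 350),
    PlacedNRS("s3", "chr1", 5000, 120),
    PlacedNRS("s1", "chr2", 1005, 200),
]
clusters = cluster_placed_nrs(records)
representatives = pick_representatives(clusters)
for rep in representatives:
    print(rep.chrom, rep.pos, rep.sample, rep.length)
```

In this toy input, the two chr1 insertions anchored 10 bp apart fall into one cluster, while the others each form their own cluster.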
1. wtdbg2 v2.5
2. MarginPolish v1.3.0
3. Hifiasm v0.16.1-r375
4. NextPolish v1.4.0
5. QUAST v5.0.2
6. AGE v0.4
7. Kalign v3.3
8. Jasmine v1.1.0
9. vg toolkit v1.33.1
10. GraphAligner v1.0.13
11. snakemake v7.2.1
Install the software listed above and configure the environment.
Please read the comments in the pipeline file, and change the sample paths in the configuration file before running.
usage: snakemake -p -s GNRS.pipeline.py --configfile GNRS.pipeline.yaml --cores
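The usage line above can also be assembled programmatically, for example to try a dry run first. The sketch below only builds and prints the command; it does not run it. The helper name `build_snakemake_command` and its `cores` and `dry_run` parameters are hypothetical. It assumes Snakemake's standard `-n` (dry-run) flag and that `--cores` without a number lets Snakemake use all available cores, as in the usage line.

```python
import shlex


def build_snakemake_command(snakefile="GNRS.pipeline.py",
                            configfile="GNRS.pipeline.yaml",
                            cores=None, dry_run=False):
    """Return the snakemake invocation as an argument list (hypothetical helper)."""
    cmd = ["snakemake", "-p"]
    if dry_run:
        cmd.append("-n")  # only show what would be executed
    cmd += ["-s", snakefile, "--configfile", configfile, "--cores"]
    if cores is not None:
        cmd.append(str(cores))  # explicit core limit; omitted means all cores
    return cmd


# Dry run limited to 8 cores, then the command exactly as in the usage line.
dry_run_cmd = build_snakemake_command(cores=8, dry_run=True)
default_cmd = build_snakemake_command()
print(shlex.join(dry_run_cmd))
print(shlex.join(default_cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the dry run looks right.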
Our pipeline works for any species with a reference genome. We tested the pipeline on a yeast dataset:
- S288C reference genome: GCF_000146045.2
- ONT: Whole genome sequencing of 741-7-Nanopore SRR18365591
- CLR: Whole genome sequencing of JSC20-1_Pacbio SRR18365586
- HiFi: HiFi-Seq of S288C SRR18210286
- NGS: Whole genome sequencing of 741-7-NGS SRR18365940
- NGS: BGISEQ of S288C SRR17374239
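One possible way to fetch the test runs listed above is with `fasterq-dump` from the SRA Toolkit, which is not part of this pipeline and is assumed here only for illustration. The sketch below merely generates the commands. The `raw_data` output directory, the per-platform layout, and the `download_commands` helper are hypothetical.

```python
import shlex

# Run accessions from the list above, grouped by sequencing platform.
TEST_RUNS = {
    "ONT": ["SRR18365591"],
    "CLR": ["SRR18365586"],
    "HiFi": ["SRR18210286"],
    "NGS": ["SRR18365940", "SRR17374239"],
}


def download_commands(runs, outdir="raw_data"):
    """Build one fasterq-dump command per accession (commands are not executed)."""
    commands = []
    for platform, accessions in runs.items():
        for accession in accessions:
            # Assumed layout: one subdirectory per platform under `outdir`.
            commands.append(
                ["fasterq-dump", accession, "--outdir", f"{outdir}/{platform}"]
            )
    return commands


commands = download_commands(TEST_RUNS)
for cmd in commands:
    print(shlex.join(cmd))
```

The downloaded FASTQ paths would then need to be entered in the configuration file by hand.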
The test results are in the output folder.
We provide the NRS callsets of the 539 individuals produced by GNRS from three long-read sequencing platforms (PacBio CLR, PacBio HiFi, and ONT). The sequences and genotypes of the NRSs are publicly available at the National Genomics Data Center (NGDC), China National Center for Bioinformation (CNCB), under project accession number PRJCA007976. The sequences and genotypes of the placed NRSs are available under accession number GVM000324, and the sequences of the unplaced NRSs are available under accession number GWHBHSK00000000.
Wu Z, Li T, Jiang Z, Zheng J, Gu Y, Liu Y, Liu Y, Xie Z. Human pangenome analysis of sequences missing from the reference genome reveals their widespread evolutionary, phenotypic, and functional roles. Nucleic Acids Res. 2024 Feb 14:gkae086. doi: 10.1093/nar/gkae086. Epub ahead of print. PMID: 38364871.
For advice, bug reports, or help, please open a GitHub issue or contact [email protected].