Kmanjor/GNRS


GNRS (Graph nonreference sequences)

Pangenome of human nonreference sequences from population-scale long-read sequencing


 ___________         ____         __         ____________         __________
|  _________|       |    \       |  |       |  ________  |       |   ____   | 
|  |                |  |\ \      |  |       |  |      |  |       |  |    |__|
|  |                |  | \ \     |  |       |  |      |  |       |  |        
|  |   _____        |  |  \ \    |  |       |  |______|  |       |  |_______  
|  |  |__   |       |  |   \ \   |  |       |   ___   ___|       |_______   |
|  |     |  |       |  |    \ \  |  |       |  |   \  \           __     |  |          
|  |     |  |       |  |     \ \ |  |       |  |    \  \         |  |    |  |           
|  |_____|  |       |  |      \ \|  |       |  |     \  \        |  |____|  |            
|___________|       |__|       \____|       |__|      \__\       |__________|   


Description

To make the results reproducible, the pipeline is implemented in the Snakemake framework, with the software environment managed by Anaconda. The pipeline can also be applied to other cohorts with long-read sequencing data.

The GNRS workflow for population-scale long-read sequencing data is shown below: [image]

Schematic representation of GraphNRS

  • a, Long-read sequencing data from different platforms are de novo assembled and polished.
  • b, The NRSs are anchored to GRCh38. Placed NRSs are clustered to select the representative NRSs, and unplaced NRSs are clustered after filtering out contaminants and centromeric repeats. Then, we merge the placed and the unplaced NRSs to obtain the nonredundant NRSs of the whole population.
  • c, vg is used to construct the graph pangenome, and each NRS is then genotyped in every individual.

Requirements

1. wtdbg2 v2.5
2. MarginPolish v1.3.0
3. Hifiasm v0.16.1-r375
4. NextPolish v1.4.0 
5. QUAST v5.0.2
6. AGE v0.4
7. Kalign v3.3
8. Jasmine v1.1.0
9. vg toolkit v1.33.1
10. GraphAligner v1.0.13
11. snakemake v7.2.1
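
Before running the pipeline, it can help to confirm that the tools above are reachable from the shell. The following is a minimal sketch, not part of GNRS: the executable names are assumptions based on how each tool is commonly installed (for example `quast.py` for QUAST and `age_align` for AGE) and may differ in your environment. It only checks that each executable is on `PATH`; it does not verify versions.

```python
import shutil

# Executable names are ASSUMPTIONS based on each tool's usual installation;
# adjust them to match your environment.
ASSUMED_EXECUTABLES = {
    "wtdbg2": "wtdbg2",
    "MarginPolish": "marginPolish",
    "Hifiasm": "hifiasm",
    "NextPolish": "nextPolish",
    "QUAST": "quast.py",
    "AGE": "age_align",
    "Kalign": "kalign",
    "Jasmine": "jasmine",
    "vg toolkit": "vg",
    "GraphAligner": "GraphAligner",
    "snakemake": "snakemake",
}


def find_missing(executables, which=shutil.which):
    """Return the tool names whose executable is not found on PATH."""
    return [tool for tool, exe in executables.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out for testing; in normal use the default `shutil.which` is enough.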

Configure the environment

Install the software listed above and configure the environment.

Please read the comments in the pipeline file (GNRS.pipeline.py), and change the sample paths in the configuration file (GNRS.pipeline.yaml) to point to your own data.
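
A wrong sample path is easy to miss until a job fails, so a quick check of the configuration can save time. The sketch below is an illustration only: it walks an already-parsed configuration and reports string values that look like paths (they contain `/`) but do not exist on disk. The keys in `example_config` are hypothetical; the real keys are the ones defined in GNRS.pipeline.yaml.

```python
import os


def missing_paths(config, prefix=""):
    """Recursively collect (key, value) pairs whose string value contains '/'
    (so it looks like a path) but does not exist on disk."""
    problems = []
    if isinstance(config, dict):
        items = config.items()
    elif isinstance(config, list):
        items = enumerate(config)
    else:
        items = []
    for key, value in items:
        label = f"{prefix}{key}"
        if isinstance(value, (dict, list)):
            # Descend into nested sections, extending the dotted key label.
            problems.extend(missing_paths(value, prefix=label + "."))
        elif isinstance(value, str) and "/" in value and not os.path.exists(value):
            problems.append((label, value))
    return problems


# HYPOTHETICAL configuration; the real keys live in GNRS.pipeline.yaml.
example_config = {
    "samples": {"sample1": "/no/such/dir/sample1.fastq.gz"},
    "threads": 8,
}
print(missing_paths(example_config))
```

To check the real file, the configuration can be parsed with PyYAML's `yaml.safe_load` and the resulting dictionary passed to `missing_paths`.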


Quick start for the pipeline

usage: snakemake -p -s GNRS.pipeline.py --configfile GNRS.pipeline.yaml --cores
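
In the command above, `--cores` is given without a number, which Snakemake interprets as "use all available cores". On a shared machine you may prefer an explicit limit, and a dry run (`-n`) is a cheap way to preview the jobs before starting. The sketch below assembles such a command; the helper name `build_command` and the core count of 8 are assumptions for illustration, not part of GNRS.

```python
import shlex


def build_command(cores, dry_run=False,
                  snakefile="GNRS.pipeline.py",
                  configfile="GNRS.pipeline.yaml"):
    """Assemble the snakemake invocation as an argument list."""
    cmd = ["snakemake", "-p", "-s", snakefile,
           "--configfile", configfile, "--cores", str(cores)]
    if dry_run:
        cmd.append("-n")  # dry run: list the planned jobs without executing them
    return cmd


if __name__ == "__main__":
    # Print the dry-run command; 8 cores is an arbitrary example value.
    print(shlex.join(build_command(cores=8, dry_run=True)))
```

The returned list can be passed directly to `subprocess.run(..., check=True)` to launch the pipeline from Python instead of printing the command.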

Our pipeline works for any species with a reference genome. We tested the pipeline on a yeast dataset.

The test results are in the output folder.


Datasets generated from GNRS

We provide the NRS callsets that GNRS produced for 539 individuals sequenced on three long-read platforms (PacBio CLR, PacBio HiFi, and ONT). The sequences and genotypes of the NRSs are publicly available at the National Genomics Data Center (NGDC), China National Center for Bioinformation (CNCB), under project accession number PRJCA007976. The sequences and genotypes of the placed NRSs are available under accession number GVM000324, and the sequences of the unplaced NRSs are available under accession number GWHBHSK00000000.


Citation

Wu Z, Li T, Jiang Z, Zheng J, Gu Y, Liu Y, Liu Y, Xie Z. Human pangenome analysis of sequences missing from the reference genome reveals their widespread evolutionary, phenotypic, and functional roles. Nucleic Acids Res. 2024 Feb 14:gkae086. doi: 10.1093/nar/gkae086. Epub ahead of print. PMID: 38364871.


Contact

For advice, bug reports, or help, please open a GitHub issue or contact [email protected].
