GNRS (Graph nonreference sequences)

Pangenome of human nonreference sequences from population-scale long-read sequencing

 ___________         ____         __         ____________         __________
|  _________|       |    \       |  |       |  ________  |       |   ____   | 
|  |                |  |\ \      |  |       |  |      |  |       |  |    |__|
|  |                |  | \ \     |  |       |  |      |  |       |  |        
|  |   _____        |  |  \ \    |  |       |  |______|  |       |  |_______  
|  |  |__   |       |  |   \ \   |  |       |   ___   ___|       |_______   |
|  |     |  |       |  |    \ \  |  |       |  |   \  \           __     |  |          
|  |     |  |       |  |     \ \ |  |       |  |    \  \         |  |    |  |           
|  |_____|  |       |  |      \ \|  |       |  |     \  \        |  |____|  |            
|___________|       |__|       \____|       |__|      \__\       |__________|

Description

To keep the results reproducible, the pipeline is implemented with the Snakemake workflow framework, and its software environment is managed with Anaconda. The pipeline can also be applied to other cohorts with long-read sequencing data.

The GNRS workflow for population-scale long-read sequencing is shown below:

Schematic representation of GraphNRS

a, Long-read sequencing data from different platforms are de novo assembled and polished.
b, The NRSs are anchored to GRCh38. Placed NRSs are clustered to select the representative NRSs, and unplaced NRSs are clustered after filtering out contaminants and centromeric repeats. Then, we merge the placed and the unplaced NRSs to obtain the nonredundant NRSs of the whole population.
c, vg is used to construct the graph pangenome, and NRS genotyping is performed for each NRS of the individual.
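As a rough illustration of the "cluster placed NRSs and select a representative" idea in step b, here is a toy Python sketch. It is not the pipeline's implementation: the `PlacedNRS` fields, the 50 bp `max_dist` window, and the "longest sequence wins" rule are all assumptions made for this example, since the description above does not specify how clusters are formed or how representatives are chosen.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlacedNRS:
    """Hypothetical record for one NRS anchored to the reference."""
    chrom: str   # reference chromosome the NRS is anchored to
    pos: int     # anchored insertion position
    sample: str  # individual the NRS was assembled from
    seq: str     # inserted (nonreference) sequence


def cluster_placed(nrss: List[PlacedNRS], max_dist: int = 50) -> List[List[PlacedNRS]]:
    """Greedy clustering: chain NRSs on the same chromosome whose
    positions lie within max_dist of the previous cluster member."""
    clusters: List[List[PlacedNRS]] = []
    for nrs in sorted(nrss, key=lambda n: (n.chrom, n.pos)):
        last = clusters[-1][-1] if clusters else None
        if last is not None and last.chrom == nrs.chrom and nrs.pos - last.pos <= max_dist:
            clusters[-1].append(nrs)
        else:
            clusters.append([nrs])
    return clusters


def representative(cluster: List[PlacedNRS]) -> PlacedNRS:
    """Pick the longest sequence in the cluster (assumed rule)."""
    return max(cluster, key=lambda n: len(n.seq))


if __name__ == "__main__":
    calls = [
        PlacedNRS("chr1", 100, "s1", "ACGT"),
        PlacedNRS("chr1", 120, "s2", "ACGTAC"),
        PlacedNRS("chr1", 1000, "s1", "AA"),
    ]
    for cluster in cluster_placed(calls):
        rep = representative(cluster)
        print(rep.chrom, rep.pos, rep.sample, len(cluster))
```

In the actual workflow this step is handled by the tools listed under Requirements; the sketch only shows the shape of the data flow from per-sample calls to one nonredundant entry per locus.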

Requirements

1. wtdbg2 v2.5
2. MarginPolish v1.3.0
3. Hifiasm v0.16.1-r375
4. NextPolish v1.4.0 
5. QUAST v5.0.2
6. AGE v0.4
7. Kalign v3.3
8. Jasmine v1.1.0
9. vg toolkit v1.33.1
10. GraphAligner v1.0.13
11. Snakemake v7.2.1
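Before running the pipeline, it can help to confirm that the tools above are visible on `PATH`. The sketch below uses only the Python standard library. The executable names in `ASSUMED_EXECUTABLES` are assumptions about how each tool is typically installed (for example `quast.py` for QUAST and `age_align` for AGE); adjust them to match your installation. The check does not verify versions.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Assumed executable names for the tools listed under Requirements.
# These are guesses at common install names, not taken from the pipeline.
ASSUMED_EXECUTABLES = [
    "wtdbg2", "marginPolish", "hifiasm", "nextPolish", "quast.py",
    "age_align", "kalign", "jasmine", "vg", "GraphAligner", "snakemake",
]


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if absent."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the sorted names that could not be found on PATH."""
    return sorted(n for n, path in find_tools(names).items() if path is None)


if __name__ == "__main__":
    missing = missing_tools(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All assumed executables were found.")
```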

Configure the environment

Install the software listed above and configure the environment.

Follow the comments in the pipeline file (GNRS.pipeline.py), and set the sample paths in the configuration file (GNRS.pipeline.yaml).

Quick start for the pipeline

usage: snakemake -p -s GNRS.pipeline.py --configfile GNRS.pipeline.yaml --cores
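The usage line ends with `--cores` and no number; per Snakemake's documentation, omitting the value lets Snakemake use all available cores. The sketch below is a hypothetical Python wrapper (the `build_command` helper is not part of GNRS) that assembles the same command, optionally with an explicit core count or Snakemake's `-n` dry-run flag, which lists the planned jobs without executing them.

```python
import subprocess
from typing import List, Optional


def build_command(
    cores: Optional[int] = None,
    dry_run: bool = False,
    snakefile: str = "GNRS.pipeline.py",
    configfile: str = "GNRS.pipeline.yaml",
) -> List[str]:
    """Assemble the snakemake invocation from the usage line above."""
    if cores is not None and cores < 1:
        raise ValueError("cores must be a positive integer")
    cmd = ["snakemake"]
    if dry_run:
        cmd.append("-n")  # dry run: show planned jobs, execute nothing
    cmd += ["-p", "-s", snakefile, "--configfile", configfile, "--cores"]
    if cores is not None:
        cmd.append(str(cores))  # omit the number to use all available cores
    return cmd


if __name__ == "__main__":
    # Print the dry-run command; to execute it, pass the list to
    # subprocess.run(cmd, check=True) in an environment with snakemake.
    print(" ".join(build_command(cores=8, dry_run=True)))
```

Running a dry run first is a cheap way to catch wrong sample paths in the configuration file before any assembly job starts.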

The pipeline works for any species with a reference genome. We tested it on the following yeast dataset:

S288C reference genome: GCF_000146045.2
ONT: Whole genome sequencing of 741-7-Nanopore SRR18365591
CLR: Whole genome sequencing of JSC20-1_Pacbio SRR18365586
HiFi: HiFi-Seq of S288C SRR18210286
NGS: Whole genome sequencing of 741-7-NGS SRR18365940
NGS: BGISEQ of S288C SRR17374239

The results of the tests are in the output folder.
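To reproduce the test, the sequencing runs listed above have to be fetched from SRA. The sketch below only builds the command lines for `prefetch` and `fasterq-dump` from sra-tools. Note that sra-tools is not listed in the Requirements, and the `reads` output directory and per-platform subfolders are hypothetical choices for this example; place the files wherever your configuration file expects them. The reference genome (GCF_000146045.2) is an assembly accession rather than an SRA run, so it is not covered here.

```python
from typing import Dict, List

# Run accession -> platform label, copied from the list above.
TEST_RUNS: Dict[str, str] = {
    "SRR18365591": "ONT",
    "SRR18365586": "CLR",
    "SRR18210286": "HiFi",
    "SRR18365940": "NGS",
    "SRR17374239": "NGS",
}


def download_commands(runs: Dict[str, str], outdir: str = "reads") -> List[List[str]]:
    """Build prefetch + fasterq-dump commands for each run accession.

    Output goes to <outdir>/<platform>, a layout assumed for this sketch.
    """
    commands: List[List[str]] = []
    for accession, platform in runs.items():
        commands.append(["prefetch", accession])
        commands.append(
            ["fasterq-dump", accession, "--outdir", f"{outdir}/{platform}"]
        )
    return commands


if __name__ == "__main__":
    # Print the commands; run them manually or with subprocess.run.
    for cmd in download_commands(TEST_RUNS):
        print(" ".join(cmd))
```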

Datasets generated from GNRS

We provide the NRS callsets of 539 individuals, produced by GNRS from three long-read sequencing platforms (PacBio CLR, PacBio HiFi, and ONT). The sequences and genotypes of the NRSs are publicly available at the National Genomics Data Center (NGDC), China National Center for Bioinformation (CNCB), under project accession number PRJCA007976. The sequences and genotypes of the placed NRSs are available under accession number GVM000324, and the sequences of the unplaced NRSs are available under accession number GWHBHSK00000000.

Citation

Wu Z, Li T, Jiang Z, Zheng J, Gu Y, Liu Y, Liu Y, Xie Z. Human pangenome analysis of sequences missing from the reference genome reveals their widespread evolutionary, phenotypic, and functional roles. Nucleic Acids Res. 2024 Feb 14:gkae086. doi: 10.1093/nar/gkae086. Epub ahead of print. PMID: 38364871.

Contact

For advice, bug reports, or help, please open a GitHub issue or contact [email protected].

