This repository contains material for constructing pangenome graphs from a small bacterial dataset, detailing every step in the NeSI environment. The study analyses Neisseria genome sequence data with 4Sim data samples, constructing pangenome graphs to identify genetic variation and structural variance.
A pangenome graph is a graphical representation of the collective genomic information of a set of related organisms. Unlike a traditional linear genome assembly, which represents a single consensus genome sequence, a pangenome graph captures the genetic variation and structural diversity within a population. Pangenome graphs are constructed by integrating multiple genome sequences into a single graph structure that represents the entire set of genetic elements present in the population. This graph structure allows for the identification and visualization of genomic variations, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variations, as well as the relationships between different genomic regions. Pangenome graphs have become an important tool in genomics research, especially in the study of bacterial and viral populations, where genetic diversity is often high.
Neisseria is a genus of Gram-negative bacteria that are typically found in the mucous membranes and epithelial tissues of humans and animals. There are various species of Neisseria bacteria, some of which are harmless commensals, while others can cause diseases. Two well-known pathogenic species are Neisseria meningitidis, which can cause meningitis and septicemia, and Neisseria gonorrhoeae, which is the causative agent of the sexually transmitted infection gonorrhea. Other species of Neisseria are generally considered harmless and reside as commensals in the oral cavity and/or nasopharynx of humans.
PanGenome Graph Builder (PGGB)
To investigate and analyse the Neisseria data, the PanGenome Graph Builder (PGGB) has been used. PGGB is a computational tool used in genomics research to construct pangenome graphs from large sets of genomic data. A pangenome is the collection of all the genes and non-coding sequences present in a species or a group of related organisms. PGGB constructs a pangenome graph, a data structure that represents the entire set of genes and genetic variations in a population. This graph can be used to study genetic diversity, gene function, and evolution. PGGB is designed to be scalable and efficient, making it suitable for large-scale genomic analyses, and it is an open-source tool that can be used freely by researchers in the field of genomics.
The pggb algorithm facilitates the construction of these graphs by progressively integrating genetic variants into a reference genome. Its key features include:
- Progressive Approach: The pggb algorithm follows a step-by-step approach, iteratively adding genetic variants to the graph one at a time.
- Variant Integration: The tool efficiently integrates genetic variants, including single nucleotide variants (SNVs), insertions, deletions, and larger structural variations, into the graph representation.
- Optimal Placement: For each variant, the pggb algorithm determines the most suitable position to incorporate it into the graph. This involves aligning the variant sequence with the existing graph structure while minimizing conflicts with the nodes and edges.
- Graph Expansion: Once the optimal placement is determined, the pggb algorithm expands the graph by adding new nodes and edges to represent the variant sequence. The overall graph structure is modified to connect the variant nodes with the adjacent reference nodes.
- Large-Scale Graph Construction: The pggb algorithm is designed to handle large-scale genomes and can efficiently construct genome graphs containing extensive genetic variations.
All-to-all alignment: generally, this refers to the process of aligning all sequences in a given set against each other, rather than aligning them to a single reference sequence. We begin with this alignment, using wfmash, which compares all sequences to each other, finds the best N mappings for each, and produces base-level alignments.
Graph induction: this refers to the process of constructing the genome graph by progressively integrating genetic variants into a reference genome. The base-level alignments are converted into a graph with seqwish. A filter is applied to remove short matches, which anchors the graph on confident, longer exact matches.
Normalization: this refers to a process that aims to optimize the structure and representation of the genome graph by resolving redundant or overlapping elements. This step is typically performed after the initial construction of the graph. The normalization process in pggb involves several steps, which may vary depending on the specific implementation or version of the tool. Common steps involved in normalizing the graph include:
- Removal of Redundant Nodes: During the construction of the genome graph, it is possible that some nodes become redundant due to overlapping or repetitive sequences. Normalization involves identifying and removing these redundant nodes, streamlining the graph structure.
- Edge Optimization: In the graph, edges represent connections between nodes. During normalization, the edges are optimized to minimize redundancy and improve the efficiency of the graph. This can include merging or repositioning edges to create a more streamlined and accurate representation of the genome.
- Compact Representation: Normalization aims to reduce the overall size of the graph by compacting the representation. This can involve compressing repetitive regions or simplifying complex structures while preserving the essential information and variant representation.
- Graph Refinement: The normalization process also involves refining the graph structure by resolving inconsistencies, correcting errors, and improving the overall quality of the graph representation. This may include resolving conflicts between nodes and edges, addressing mismatches, and ensuring the graph accurately reflects the underlying genetic variations.
To normalize the graph and harmonize the allele representation, we use smoothxg to apply a local MSA across all parts of the graph.
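For orientation, the sketch below shows roughly how the first two stages could be run on their own, using wfmash for the all-to-all alignment and seqwish for graph induction. The flags and file names here are illustrative assumptions only; pggb chains these tools (plus smoothxg for normalization) automatically, so in practice you do not run them by hand.
#!/usr/bin/bash
# Illustrative sketch only -- pggb runs these steps for you with tuned settings.
# All-to-all alignment: map every sequence against every other, skipping self-mappings (-X).
wfmash -X -s 1000 -p 95 -t 8 ASM19152v1_pgsim.fa ASM19152v1_pgsim.fa > alignments.paf
# Graph induction: turn the base-level alignments into a variation graph,
# keeping only exact matches of at least 79 bp (the short-match filter discussed above).
seqwish -s ASM19152v1_pgsim.fa -p alignments.paf -k 79 -g induced.gfa -t 8
# smoothxg then applies a local MSA (POA) across the graph to normalize it;
# inside pggb this happens automatically after seqwish.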
These graphs offer a wide range of capabilities. First, we can generate several diagnostic visualizations derived from the graphs, providing a user-friendly way to comprehend the alignments at a broader level. Additionally, using the pggb tool, we can generate variant calls by leveraging vg deconstruct. The graphs produced by pggb serve as reference systems for aligning short reads with vg giraffe or long reads with GraphAligner. Furthermore, odgi allows us to use the graphs as reference systems to describe homology relationships across entire genomes.
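As a small, hedged taste of this downstream use, the commands below call vg deconstruct to emit a VCF against the reference path and odgi stats for a graph summary. The file names are placeholders for the pggb outputs produced later in this workshop (real names include run-specific hash strings), and it assumes vg and odgi are available, for example via the pggb container.
# Placeholder names: real pggb outputs include run-specific hash strings
GFA=output/ASM19152v1_pgsim.fa.smooth.final.gfa
OG=output/ASM19152v1_pgsim.fa.smooth.final.og
# Variant calls against paths starting with the reference name (this is what pggb's -V option wraps)
vg deconstruct -P NC_017518.1 ${GFA} > variants.vcf
# Graph summary statistics (length, nodes, edges, paths)
odgi stats -i ${OG} -S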
PanGenome Graph Evaluator (PGGE)
The pangenome graph evaluation (PGGE) is a pipeline that measures the reconstruction accuracy of a pangenome graph. It helps to find the best pangenome graph for the given input data and tasks.
In this workshop we will be:
- Creating scripts in a specific working directory in the NeSI environment.
- Downloading and preparing sequencing data (in FASTA format).
- Creating pangenome graphs using the PGGB and PGGE tools.
- Identifying genetic variation and structural variance.
The NeSI HPC environment is used for the analysis. Please make sure you have a NeSI account and are able to log in.
# Create a new directory somewhere and change to that directory
mkdir pg_workshop
cd pg_workshop
# Keep a note of the absolute path of your directory
pwd
/nesi/nobackup/ga03793/pg_workshop
The National Library of Medicine is the largest biomedical library in the world and serves as the central hub for biomedical informatics and computational biology. It hosts many genome assemblies; the assembly ASM19152v1 will be used for this workshop.
Please follow the procedure described in this page.
#Use samtools to create the index file
#In the NeSI environment you will have to load the module first
$ module load SAMtools
$ samtools faidx ASM19152v1_pgsim.fa
$ cat ASM19152v1_pgsim.fa.fai
NC_017518.1 2248966 64 80 81
NC_017518.1_SNP_5000 2248966 2277165 2248966 2248967
NC_017518.1_INDEL_5000 2249048 4526156 2249048 2249049
NC_017518.1_SNP_4000_INDEL_4000 2242147 6775238 2242147 2242148
NC_017518.1_SNP_4000_INDEL_4000_INV_4 2242147 9017425 2242147 2242148
NC_017518.1_SNP_4000_INDEL_4000_CNV_4 2415498 11259612 2415498 2415499
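The five columns of the .fai index are the sequence name, sequence length, byte offset of the first base, bases per line, and bytes per line. For a quick overview of just the names and lengths:
# Print only the sequence names and their lengths from the index
cut -f1,2 ASM19152v1_pgsim.fa.fai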
As per the index, this assembly consists of 6 samples, described in the table below.
Name | Length | SNPs | INDELs | INV | CNV |
---|---|---|---|---|---|
NC_017518.1 (Reference) | 2,248,966 | N/A | N/A | N/A | N/A |
NC_017518.1_SNP_5000 | 2,248,966 | 5,000 | 0 | 0 | 0 |
NC_017518.1_INDEL_5000 | 2,249,048 | 0 | 5,000 | 0 | 0 |
NC_017518.1_SNP_4000_INDEL_4000 | 2,153,883 | 4,000 | 4,000 | 0 | 0 |
NC_017518.1_SNP_4000_INDEL_4000_INV_4 | 2,242,147 | 4,000 | 4,000 | 4 | 0 |
NC_017518.1_SNP_4000_INDEL_4000_CNV_4 | 2,415,498 | 4,000 | 4,000 | 0 | 4 |
We can follow the procedure in https://github.com/pangenome/pggb#singularity to set up the Singularity image. This has already been done, and the image for version 0.5.3 is available in the /nesi/project/ga03793/software/pggb/ directory.
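If you are working outside this project and need to build the image yourself, a hedged sketch is shown below; the container registry path and tag are assumptions based on the pggb README, so check there for the current procedure.
# Build a local Singularity image from the pggb Docker image (tag is an assumption; see the pggb README)
module load Singularity
singularity pull pggb.simg docker://ghcr.io/pangenome/pggb:latest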
The following script (pggb_test.sh) can be used to run pggb on the downloaded sequence.
#!/usr/bin/bash
module load Singularity
#export container to a variable for convenience
WD=/nesi/nobackup/ga03793/pg_workshop #Working Directory
container=/nesi/project/ga03793/software/pggb/pggb_0.5.3.simg
data=${WD}/ASM19152v1_pgsim.fa
#Bind filesystem to container image
export SINGULARITY_BIND="${WD},/nesi/project/ga03793/"
singularity exec ${container} pggb -i $data -s 1000 -p 95 -n 6 -k 79 -t 2 -S -m -o output -V 'NC_017518.1:#'
In pggb, -i specifies the sequence file, -s the segment length for mapping, and -p the percent identity for mapping/alignment. -n is the number of haplotypes (or number of samples), -k the minimum match length, and -t the number of threads to be used for the execution. -S generates the stats, -m generates a MultiQC report of the graphs' statistics and visualizations, and -o specifies the output directory name. -V 'NC_017518.1:#' creates a VCF file and its stats, treating NC_017518.1 as the reference sequence.
You can run pggb without parameters to get information on the meaning of each parameter. Now take a look at the files in the "output" folder. We get a graph in GFA (.gfa) and odgi (.og) formats. These can be used downstream in many methods, including those in vg, like vg giraffe. You can visualize the GFA format graph with BandageNG, and use odgi directly on the *.gfa or *.og output.
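A quick, hedged way to get a first look at these outputs from the shell is sketched below; it reuses the container path from pggb_test.sh, and the glob is needed because the file names include run-specific hashes.
module load Singularity
container=/nesi/project/ga03793/software/pggb/pggb_0.5.3.simg
# List everything pggb wrote to the output directory
ls -lh output/
# Summary statistics for the final smoothed graph (odgi is included in the pggb container)
singularity exec ${container} odgi stats -i output/*.smooth.final.og -S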
Executing pggb as a SLURM Job
Executing shell scripts interactively in the NeSI environment might not be the best way to handle larger files, which require more memory, CPU power, and time. We can modify the previously explained script as below (pggb_slurm_1K95.sh) to run it as a SLURM job. Note the additional parameters specified by #SBATCH, which indicate the maximum resource limits.
#!/usr/bin/bash
#SBATCH --account ga03793
#SBATCH --job-name NC_017518.1_1K95
#SBATCH --cpus-per-task 8
#SBATCH --mem 4G
#SBATCH --time 1:00:00
module purge
module load Singularity
#export container to a variable for convenience
WD=/nesi/nobackup/ga03793/pg_workshop #Working Directory
container=/nesi/project/ga03793/software/pggb/pggb_0.5.3.simg
data=${WD}/ASM19152v1_pgsim.fa
#Bind filesystem to container image
export SINGULARITY_BIND="${WD},/nesi/project/ga03793/"
singularity exec ${container} pggb -i $data -s 1000 -p 95 -n 6 -k 79 -t 8 -S -m -o output_1K95 -V 'NC_017518.1:#'
The job can be submitted using the sbatch command, which will return a job ID (in this case 35887085).
$ sbatch pggb_slurm_1K95.sh
Submitted batch job 35887085
We can monitor the job status using seff and squeue, specifying the job ID.
$ seff 35887085
Job ID: 35887085
Cluster: mahuika
User/Group: ismnu81p/ismnu81p
State: RUNNING
Nodes: 1
Cores per node: 8
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:04:16 core-walltime
Job Wall-clock time: 00:00:32
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 4.00 GB (4.00 GB/node)
WARNING: Efficiency statistics may be misleading for RUNNING jobs.
$ squeue --job 35887085
JOBID USER ACCOUNT NAME CPUS MIN_MEM PARTITI START_TIME TIME_LEFT STATE NODELIST(REASON)
35887085 ismnu81p ga03793 NC_017518.1_ 8 4G large 2023-05-21T0 58:35 RUNNING wbn063
SLURM will also create an output log file, which we can monitor in real time using tail -f.
$ tail -f slurm-35887085.out
[smoothxg::(1-3)::prep] writing graph output_1K95/ASM19152v1_pgsim.fa.2ab4142.c2fac19.seqwish.gfa.prep.0.gfa
[smoothxg::(1-3)::main] building xg index
[smoothxg::(1-3)::smoothable_blocks] computing blocks
[smoothxg::(1-3)::smoothable_blocks] computing blocks for 54747 handles: 100.00% @ 5.47e+04/s elapsed: 00:00:00:01 remain: 00:00:00:00
[smoothxg::(1-3)::break_and_split_blocks] cutting blocks that contain sequences longer than max-poa-length (1400) and depth >= 0
[smoothxg::(1-3)::break_and_split_blocks] splitting 3862 blocks at identity 0.950 (WFA-based clustering) and at estimated-identity 0.950 (mash-based clustering)
[smoothxg::(1-3)::break_and_split_blocks] cutting and splitting 3862 blocks: 100.00% @ 1.49e+04/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::(1-3)::break_and_split_blocks] cut 0 blocks of which 0 had repeats
[smoothxg::(1-3)::break_and_split_blocks] split 0 blocks
[smoothxg::(1-3)::smooth_and_lace] applying local SPOA to 3862 blocks: 78.40% @ 5.77e+01/s elapsed: 00:00:00:52 remain: 00:00:00:14
When the job has completed, the seff command will show a summary report with the details below. The job used 785.61 MB of memory and took 7 minutes and 50 seconds to complete.
$ seff 35887085
Job ID: 35887085
Cluster: mahuika
User/Group: ismnu81p/ismnu81p
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:41:23
CPU Efficiency: 66.04% of 01:02:40 core-walltime
Job Wall-clock time: 00:07:50
Memory Utilized: 785.61 MB
Memory Efficiency: 19.18% of 4.00 GB
Now we can try the same script with different values for the pggb parameters -s, -p and -k, and compare the results. Please refer to the scripts folder for the scripts.
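One way to organize these comparisons is a small parameter sweep, with each combination writing to its own output directory. The sketch below is only an illustration: it reuses the same container and data as before, and the chosen values are examples, not recommendations.
#!/usr/bin/bash
module load Singularity
WD=/nesi/nobackup/ga03793/pg_workshop
container=/nesi/project/ga03793/software/pggb/pggb_0.5.3.simg
data=${WD}/ASM19152v1_pgsim.fa
export SINGULARITY_BIND="${WD},/nesi/project/ga03793/"
# Sweep a few segment lengths (-s) and identities (-p); -k could be looped the same way
for s in 1000 5000 10000; do
  for p in 90 95; do
    singularity exec ${container} pggb -i $data -s $s -p $p -n 6 -k 79 -t 8 \
      -S -m -o output_${s}_p${p} -V 'NC_017518.1:#'
  done
done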
We obtain a series of diagnostic images that represent the pangenome alignment. These are created with odgi viz (1D matrix) and odgi layout with odgi draw (2D graph drawings). First, the 2D layout gives us a view of the total alignment. For small graphs, we can look at the version that shows where specific paths go (*.draw_multiqc.png). For larger ones, the *.draw.png result is usually more legible.
What happens if we set a lower -n? This parameter determines how many mappings we have: each sequence is aligned against its n-1 best matches. Setting a lower -n causes sequences to cluster into groups that are more similar to each other.
Another key parameter is -k, which affects the behavior of seqwish. This filter removes exact matches from alignments that are shorter than -k. Short matches occur in regions of high diversity. In practice, these short matches contribute little to the overall structure of the graph, and we remove them to further simplify the base graph structure.
Setting a lower mapping segment length affects the behavior of the very first step in the pipeline, wfmash's mapping step (itself based on a heavily modified version of MashMap). The segment length defaults to -s 5k. We can use -s 1k to guarantee we pick up smaller homology segments, leading to a more complete alignment.
The output directory generated by the script contains a comprehensive and interactive MultiQC report that describes all of the results. Open the file multiqc_report.html, located in the output folder, in your browser.
Note: To download the output folder from the NeSI environment, you can first zip it using the command zip -r output.zip output
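You can then copy the archive to your own machine, for example with scp run from your local terminal (the hostname below is a placeholder; use whatever host or SSH alias you normally use to connect to NeSI).
# Run from your local machine; "nesi" is a placeholder for your NeSI SSH host/alias
scp nesi:/nesi/nobackup/ga03793/pg_workshop/output.zip .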