
Computational Microbial Genomics, University of Trento, 2023 - 2024

Oral Microbiome & Metagenomic Analysis for the Characterization of a uSGB. See the presentation.

This text was generated with ChatGPT-4, using the code chunks from the project as input.

Authors: Zehra Korkusuz, Eevi Sipponen

This project is centered on the computational exploration of metagenome-assembled genomes (MAGs) derived from a dental plaque Species-Level Genome Bin (SGB), specifically SGB 985.

Objectives

The primary objective is to extract valuable insights from the MAGs of SGB 985. To achieve this, we will execute a series of computational genomic analyses, which include:

  • Quality-Checking: Using CheckM to assess the quality of the assembled genomes, ensuring their suitability for further analysis.
  • Taxonomic Assignment: Employing PhyloPhlAn to categorize the MAGs into taxonomic ranks, providing a clearer understanding of their biological context.
  • Genome Annotation: Annotating the genomes with Prokka to identify genes and infer their function.
  • Pangenome Analysis: Applying Roary to compare gene content across different genomes, generating plots to visualize the pan-genome landscape.
  • Phylogenetic Analysis: Constructing phylogenetic trees with Roary and FastTree to elucidate evolutionary relationships, accompanied by plots for clarity.
  • Association with Host Metadata: Correlating genomic data with host metadata to explore potential associations.

ANALYSIS

1. CheckM Quality Analysis

This section provides a step-by-step guide for downloading, decompressing, and analyzing the metagenome-assembled genomes (MAGs) belonging to a specific species-level genome bin (SGB).

Prerequisites

  • Ensure you have conda and bunzip2 installed on your system (a sketch for installing CheckM via conda follows after this list).
  • Download the archive containing the MAGs.
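
If CheckM itself is not installed yet, it is available from Bioconda as checkm-genome. A minimal sketch (the environment name checkm is arbitrary; CheckM also needs its reference data, which is downloaded separately as described in the CheckM documentation):

conda create -n checkm -c bioconda -c conda-forge checkm-genome
conda activate checkm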

Steps

  1. Decompress the Downloaded Files

    Use bunzip2 to decompress the downloaded files. The download contains the MAGs of the SGB, each as a bzip2-compressed (.bz2) FASTA file, which is the format bunzip2 handles.

  2. Prepare for Quality Analysis with CheckM

    First, create a directory to store output files generated by CheckM analysis:

    mkdir checkm_output
  3. Run CheckM

    bunzip2 mags/* && checkm taxonomy_wf domain Bacteria mags checkm_output -t 4

    This decompresses the MAGs in mags/ (if they are still compressed) and runs the CheckM taxonomy workflow with bacterial marker genes on 4 threads, writing completeness and contamination estimates to checkm_output. A sketch for filtering the resulting quality table follows below.
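
To turn the CheckM output into a filterable table, the qa step can export a tab-separated summary, which can then be screened against a medium-quality threshold (at least 50% completeness, less than 10% contamination). This is a minimal sketch: it assumes the taxonomy_wf run left the Bacteria.ms marker file in checkm_output/, and it looks up the Completeness and Contamination columns by header name rather than by position.

# Export an extended, tab-separated quality table
checkm qa checkm_output/Bacteria.ms checkm_output --tab_table -o 2 -f checkm_output/quality.tsv

# Keep bins with >= 50% completeness and < 10% contamination
awk -F'\t' '
    NR == 1 {
        for (i = 1; i <= NF; i++) {
            if ($i == "Completeness")  c = i
            if ($i == "Contamination") k = i
        }
        print
        next
    }
    $c >= 50 && $k < 10
' checkm_output/quality.tsv > checkm_output/quality_filtered.tsv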

2. Taxonomic Assignment using PhyloPhlAn

The following steps outline how to assign taxonomy to metagenome-assembled genomes (MAGs) using PhyloPhlAn 3.0.

Prerequisites

  • Ensure you have conda installed on your system.

Steps

  1. Set Conda Channel Priority

    Adjust the conda channel priority to flexible for compatibility:

    conda config --set channel_priority flexible
  2. Create a New Conda Environment

    Create a new conda environment with the necessary packages:

    conda create -n ppa phylophlan=3.0 biopython=1.83 -c bioconda -c conda-forge
  3. Activate the New Environment

    Activate the conda environment named ppa:

    conda activate ppa
  4. List Available Databases

    Check the available databases in PhyloPhlAn:

    (ppa) phylophlan_metagenomic --database_list
  5. Create Output Directory

    Create a directory to store the output from PhyloPhlAn:

    (ppa) mkdir phylophlan_output
  6. Run PhyloPhlAn Metagenomic Analysis

    Execute the PhyloPhlAn metagenomic analysis with the specified parameters:

    (ppa) phylophlan_metagenomic -i mags -d CMG2324 --verbose --nproc 4 -n 1 -o phylophlan_output/ppa_ms
    • -i mags: Input folder where the MAGs are located.
    • -d CMG2324: Specifies the database to be used for the analysis.
    • --verbose: Enables verbose mode, providing detailed log messages.
    • --nproc 4: Number of processors to use.
    • -n 1: Report only the single closest SGB for each MAG (this option sets how many of the closest SGBs are reported).
    • -o phylophlan_output/ppa_ms: Specifies the output directory and prefix for the output files.

After running these commands, your MAGs will be processed and the taxonomic assignment will be saved in the specified output directory.
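
To get a quick look at the assignments from the command line, the main output table can be inspected directly. A minimal sketch, assuming the result file is phylophlan_output/ppa_ms.tsv and that the second column reports the closest match as SGB_id:taxonomic_level:taxonomy:average_distance (the exact layout can differ between PhyloPhlAn versions, so check the file header first):

# Show, for each MAG, the closest SGB identifier and the average distance to it
grep -v '^#' phylophlan_output/ppa_ms.tsv | awk -F'\t' '{ split($2, a, ":"); print $1, a[1], a[4] }' | column -t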

3. Gene Annotation with Prokka

Prerequisites

  • Ensure Prokka is installed and available in your environment.
  • The MAGs are assumed to be in shortened_fasta/*.fna.

Prokka Annotation

To annotate the MAGs, use the following script. It iterates over each .fna file, runs Prokka, and skips files that have already been processed. Each run writes its results to a Prokka_<basename> directory containing, among other outputs, the .txt summary, the .tsv feature table, and the .gff annotation used in the later steps.

for i in shortened_fasta/*.fna; do
    base=$(basename "$i" .fna)
    output_dir="Prokka_${base}"

    if [ ! -d "$output_dir" ]; then
        echo "Processing: $i..."
        prokka --outdir "$output_dir" \
               --prefix "${base}" \
               --force \
               --centre "Project_" \
               --compliant \
               --kingdom Bacteria "$i"
        echo "Completed: $i"
    else
        echo "Output directory $output_dir already exists, skipping..."
    fi
done

Compiling Annotation Data

Use the following shell script to compile the statistics from each Prokka .txt summary into a single .tsv file, with one row per MAG:

#!/bin/bash

temp_data="temp_data.txt"          # temporary long-format file: ID<TAB>key<TAB>value
output_file="compiled_data.tsv"    # final wide-format table

# Remove the temporary file if it already exists
[ -e "$temp_data" ] && rm "$temp_data"

# Parse the "key: value" lines of each Prokka .txt summary
for dir in Prokka_*_short; do
    ID=$(echo "$dir" | sed 's/Prokka_\(.*\)_short/\1/')
    file="${dir}/${ID}_short.txt"

    while IFS=": " read -r key value; do
        echo -e "${ID}\t${key}\t${value}" >> "$temp_data"
    done < "$file"
done

# Header row: ID followed by the unique keys, in order of first appearance
echo -e "ID\t$(awk -F'\t' '!seen[$2]++ {keys=keys"\t"$2} END {print substr(keys,2)}' "$temp_data")" > "$output_file"

# Pivot the long-format data into one row per MAG
awk -F'\t' '{
    id_key = $1 OFS $2; data[id_key] = $3
    if (!key_seen[$2]++) keys[key_order++] = $2
    if (!id_seen[$1]++)  ids[id_order++]  = $1
}
END {
    for (i = 0; i < id_order; i++) {
        printf ids[i]
        for (j = 0; j < key_order; j++) printf "\t%s", data[ids[i] OFS keys[j]]
        printf "\n"
    }
}' OFS='\t' "$temp_data" >> "$output_file"

rm "$temp_data"

Counting Proteins

To count known and hypothetical proteins per MAG and merge the counts into the compiled table, use the following script:

#!/bin/bash

output_file="compiled_data.tsv"
protein_counts_file="protein_counts.tsv"

# Header for the per-MAG protein counts
echo -e "ID\tknown_proteins\thypothetical_proteins" > "$protein_counts_file"

for dir in Prokka_*_short; do
    ID=$(echo "$dir" | sed 's/Prokka_\(.*\)_short/\1/')
    prokka_tsv="${dir}/${ID}_short.tsv"

    # Lines of the Prokka feature table without/with the "hypothetical protein" product label
    # (the "known" count also includes the header line and non-CDS features such as tRNAs)
    known=$(grep -v "hypothetical protein" "$prokka_tsv" | wc -l)
    hypothetical=$(grep "hypothetical protein" "$prokka_tsv" | wc -l)

    echo -e "${ID}\t${known}\t${hypothetical}" >> "$protein_counts_file"
done

sort -o "$protein_counts_file" "$protein_counts_file"

# Join the counts onto the compiled table by ID; the -o field list assumes
# compiled_data.tsv has 12 columns (ID plus 11 summary fields) -- adjust it if yours differs
join -1 1 -2 1 -t $'\t' -o '1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.10 1.11 1.12 2.2 2.3' <(sort -k1,1 "$output_file") <(sort -k1,1 "$protein_counts_file") > "temp_$output_file"

mv "temp_$output_file" "$output_file"
rm "$protein_counts_file"

echo "Protein counts have been merged into $output_file."


Running the bash scripts

foo@bar:~$ chmod +x run_prokka.sh
foo@bar:~$ ./run_prokka.sh
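
The compilation and protein-counting scripts are run the same way. The file names below are only examples; use whatever names you saved the scripts under:

foo@bar:~$ chmod +x compile_annotation_data.sh count_proteins.sh
foo@bar:~$ ./compile_annotation_data.sh && ./count_proteins.sh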

4. Run Roary for Pan-genome Analysis

The following step details how to use Roary with GFF files produced by Prokka annotation for pan-genome analysis.

Prerequisites

  • Make sure Roary is installed in your environment.
  • Have the Prokka output directories with their .gff files ready (a sketch for collecting them into one folder follows after this list).
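
Because Prokka writes each MAG's annotation into its own directory, it is convenient to collect the .gff files into a single folder first. A minimal sketch (gff_files is an arbitrary name; the Prokka_*_short pattern matches the directories created in the annotation step):

mkdir -p gff_files
cp Prokka_*_short/*.gff gff_files/

The Roary command below can then be pointed at gff_files/*.gff.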

Pan-genome Analysis with Roary

Execute Roary on the .gff files generated by Prokka:

roary -f roary_output -e -n -p 4 -i 95 prokka_output_dir/*.gff
  • -f roary_output: Specifies the output directory where Roary will write the results.
  • -e: Produces a multiFASTA alignment of core genes.
  • -n: Use MAFFT to align core genes.
  • -p 4: Use 4 CPU cores.
  • -i 95: Set the minimum percentage identity for BLASTp at 95%.
  • prokka_output_dir/*.gff: Input all .gff files from the Prokka output directories.

Make sure to replace prokka_output_dir with the actual directory containing your Prokka-generated .gff files. Roary will collate the .gff files, create pan-genome analysis results, and place them in the roary_output directory.
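
The objectives also call for a core-gene phylogeny and pan-genome plots. A minimal follow-up sketch, assuming Roary was run with -e -n so that roary_output/ contains core_gene_alignment.aln, and that the roary_plots.py helper script shipped with the Roary repository is available (its location and exact options may vary):

# Approximate maximum-likelihood tree from the core gene alignment
FastTree -nt -gtr roary_output/core_gene_alignment.aln > roary_output/core_tree.newick

# Pan-genome plots (pie chart, gene frequency plot, presence/absence matrix)
python roary_plots.py roary_output/core_tree.newick roary_output/gene_presence_absence.csv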
