
Computational Microbial Genomics, University of Trento, 2023 - 2024

Oral Microbiome & Metagenomic Analysis for the Characterization of a uSGB. See the presentation.

This text was generated with ChatGPT-4, using the code chunks from the project as input.

Authors: Zehra Korkusuz, Eevi Sipponen

This project is centered on the computational exploration of metagenome-assembled genomes (MAGs) derived from a dental plaque Species-Level Genome Bin (SGB), specifically SGB 985.

Objectives

The primary objective is to extract valuable insights from the MAGs of SGB 985. To achieve this, we will execute a series of computational genomic analyses, which include:

  • Quality-Checking: Using CheckM to assess the quality of the assembled genomes, ensuring their suitability for further analysis.
  • Taxonomic Assignment: Employing PhyloPhlAn to categorize the MAGs into taxonomic ranks, providing a clearer understanding of their biological context.
  • Genome Annotation: Annotating the genomes with Prokka to identify genes and infer their function.
  • Pangenome Analysis: Applying Roary to compare gene content across different genomes, generating plots to visualize the pan-genome landscape.
  • Phylogenetic Analysis: Constructing phylogenetic trees with Roary and FastTree to elucidate evolutionary relationships, accompanied by plots for clarity.
  • Association with Host Metadata: Correlating genomic data with host metadata to explore potential associations.

ANALYSIS

1. CheckM Quality Analysis

This section provides a step-by-step guide for downloading, decompressing, and analyzing the metagenome-assembled genomes (MAGs) belonging to a specific species-level genome bin (SGB).

Prerequisites

  • Ensure you have conda and bunzip2 installed on your system (a sketch for installing CheckM via conda follows after this list).
  • Download the archive containing the MAGs.
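
If CheckM itself is not installed yet, it is available from Bioconda as checkm-genome. A minimal sketch (the environment name checkm is arbitrary; CheckM also needs its reference data, which is downloaded separately as described in the CheckM documentation):

conda create -n checkm -c bioconda -c conda-forge checkm-genome
conda activate checkm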

Steps

  1. Decompress the Downloaded Files

    Use bunzip2 to decompress the downloaded files. The download contains the MAGs of the SGB, each as a bzip2-compressed (.bz2) FASTA file, which is the format bunzip2 handles.

  2. Prepare for Quality Analysis with CheckM

    First, create a directory to store output files generated by CheckM analysis:

    mkdir checkm_output
  3. Run CheckM

    bunzip2 mags/* && checkm taxonomy_wf domain Bacteria mags checkm_output -t 4

    This decompresses the MAGs in mags/ (if they are still compressed) and runs the CheckM taxonomy workflow with bacterial marker genes on 4 threads, writing completeness and contamination estimates to checkm_output. A sketch for filtering the resulting quality table follows below.
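
To turn the CheckM output into a filterable table, the qa step can export a tab-separated summary, which can then be screened against a medium-quality threshold (at least 50% completeness, less than 10% contamination). This is a minimal sketch: it assumes the taxonomy_wf run left the Bacteria.ms marker file in checkm_output/, and it looks up the Completeness and Contamination columns by header name rather than by position.

# Export an extended, tab-separated quality table
checkm qa checkm_output/Bacteria.ms checkm_output --tab_table -o 2 -f checkm_output/quality.tsv

# Keep bins with >= 50% completeness and < 10% contamination
awk -F'\t' '
    NR == 1 {
        for (i = 1; i <= NF; i++) {
            if ($i == "Completeness")  c = i
            if ($i == "Contamination") k = i
        }
        print
        next
    }
    $c >= 50 && $k < 10
' checkm_output/quality.tsv > checkm_output/quality_filtered.tsv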

2. Taxonomic Assignment using PhyloPhlAn

The following steps outline how to assign taxonomy to metagenome-assembled genomes (MAGs) using PhyloPhlAn 3.0.

Prerequisites

  • Ensure you have conda installed on your system.

Steps

  1. Set Conda Channel Priority

    Adjust the conda channel priority to flexible for compatibility:

    conda config --set channel_priority flexible
  2. Create a New Conda Environment

    Create a new conda environment with the necessary packages:

    conda create -n ppa phylophlan=3.0 biopython=1.83 -c bioconda -c conda-forge
  3. Activate the New Environment

    Activate the conda environment named ppa:

    conda activate ppa
  4. List Available Databases

    Check the available databases in PhyloPhlAn:

    (ppa) phylophlan_metagenomic --database_list
  5. Create Output Directory

    Create a directory to store the output from PhyloPhlAn:

    (ppa) mkdir phylophlan_output
  6. Run PhyloPhlAn Metagenomic Analysis

    Execute the PhyloPhlAn metagenomic analysis with the specified parameters:

    (ppa) phylophlan_metagenomic -i mags -d CMG2324 --verbose --nproc 4 -n 1 -o phylophlan_output/ppa_ms
    • -i mags: Input folder where the MAGs are located.
    • -d CMG2324: Specifies the database to be used for the analysis.
    • --verbose: Enables verbose mode, providing detailed log messages.
    • --nproc 4: Number of processors to use.
    • -n 1: Report only the single closest SGB for each MAG (this option sets how many of the closest SGBs are reported).
    • -o phylophlan_output/ppa_ms: Specifies the output directory and prefix for the output files.

After running these commands, your MAGs will be processed and the taxonomic assignment will be saved in the specified output directory.
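
To get a quick look at the assignments from the command line, the main output table can be inspected directly. A minimal sketch, assuming the result file is phylophlan_output/ppa_ms.tsv and that the second column reports the closest match as SGB_id:taxonomic_level:taxonomy:average_distance (the exact layout can differ between PhyloPhlAn versions, so check the file header first):

# Show, for each MAG, the closest SGB identifier and the average distance to it
grep -v '^#' phylophlan_output/ppa_ms.tsv | awk -F'\t' '{ split($2, a, ":"); print $1, a[1], a[4] }' | column -t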

3. Gene Annotation with Prokka

Prerequisites

  • Ensure Prokka is installed and available in your environment.
  • The MAGs are assumed to be in shortened_fasta/*.fna.

Prokka Annotation

To annotate the MAGs, use the following script. It iterates over each .fna file, runs Prokka, and skips files that have already been processed. Each run writes its results to a Prokka_<basename> directory containing, among other outputs, the .txt summary, the .tsv feature table, and the .gff annotation used in the later steps.

for i in shortened_fasta/*.fna; do
    base=$(basename "$i" .fna)
    output_dir="Prokka_${base}"

    if [ ! -d "$output_dir" ]; then
        echo "Processing: $i..."
        prokka --outdir "$output_dir" \
               --prefix "${base}" \
               --force \
               --centre "Project_" \
               --compliant \
               --kingdom Bacteria "$i"
        echo "Completed: $i"
    else
        echo "Output directory $output_dir already exists, skipping..."
    fi
done

Compiling Annotation Data

Use the following shell script to compile the statistics from each Prokka .txt summary into a single .tsv file, with one row per MAG:

#!/bin/bash

temp_data="temp_data.txt"          # temporary long-format file: ID<TAB>key<TAB>value
output_file="compiled_data.tsv"    # final wide-format table

# Remove the temporary file if it already exists
[ -e "$temp_data" ] && rm "$temp_data"

# Parse the "key: value" lines of each Prokka .txt summary
for dir in Prokka_*_short; do
    ID=$(echo "$dir" | sed 's/Prokka_\(.*\)_short/\1/')
    file="${dir}/${ID}_short.txt"

    while IFS=": " read -r key value; do
        echo -e "${ID}\t${key}\t${value}" >> "$temp_data"
    done < "$file"
done

# Header row: ID followed by the unique keys, in order of first appearance
echo -e "ID\t$(awk -F'\t' '!seen[$2]++ {keys=keys"\t"$2} END {print substr(keys,2)}' "$temp_data")" > "$output_file"

# Pivot the long-format data into one row per MAG
awk -F'\t' '{
    id_key = $1 OFS $2; data[id_key] = $3
    if (!key_seen[$2]++) keys[key_order++] = $2
    if (!id_seen[$1]++)  ids[id_order++]  = $1
}
END {
    for (i = 0; i < id_order; i++) {
        printf ids[i]
        for (j = 0; j < key_order; j++) printf "\t%s", data[ids[i] OFS keys[j]]
        printf "\n"
    }
}' OFS='\t' "$temp_data" >> "$output_file"

rm "$temp_data"

Counting Proteins

To count known and hypothetical proteins per MAG and merge the counts into the compiled table, use the following script:

#!/bin/bash

output_file="compiled_data.tsv"
protein_counts_file="protein_counts.tsv"

# Header for the per-MAG protein counts
echo -e "ID\tknown_proteins\thypothetical_proteins" > "$protein_counts_file"

for dir in Prokka_*_short; do
    ID=$(echo "$dir" | sed 's/Prokka_\(.*\)_short/\1/')
    prokka_tsv="${dir}/${ID}_short.tsv"

    # Lines of the Prokka feature table without/with the "hypothetical protein" product label
    # (the "known" count also includes the header line and non-CDS features such as tRNAs)
    known=$(grep -v "hypothetical protein" "$prokka_tsv" | wc -l)
    hypothetical=$(grep "hypothetical protein" "$prokka_tsv" | wc -l)

    echo -e "${ID}\t${known}\t${hypothetical}" >> "$protein_counts_file"
done

sort -o "$protein_counts_file" "$protein_counts_file"

# Join the counts onto the compiled table by ID; the -o field list assumes
# compiled_data.tsv has 12 columns (ID plus 11 summary fields) -- adjust it if yours differs
join -1 1 -2 1 -t $'\t' -o '1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.10 1.11 1.12 2.2 2.3' <(sort -k1,1 "$output_file") <(sort -k1,1 "$protein_counts_file") > "temp_$output_file"

mv "temp_$output_file" "$output_file"
rm "$protein_counts_file"

echo "Protein counts have been merged into $output_file."


Running the bash scripts

foo@bar:~$ chmod +x run_prokka.sh
foo@bar:~$ ./run_prokka.sh
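
The compilation and protein-counting scripts are run the same way. The file names below are only examples; use whatever names you saved the scripts under:

foo@bar:~$ chmod +x compile_annotation_data.sh count_proteins.sh
foo@bar:~$ ./compile_annotation_data.sh && ./count_proteins.sh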

4. Run Roary for Pan-genome Analysis

The following step details how to use Roary with GFF files produced by Prokka annotation for pan-genome analysis.

Prerequisites

  • Make sure Roary is installed in your environment.
  • Have the Prokka output directories with their .gff files ready (a sketch for collecting them into one folder follows after this list).
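
Because Prokka writes each MAG's annotation into its own directory, it is convenient to collect the .gff files into a single folder first. A minimal sketch (gff_files is an arbitrary name; the Prokka_*_short pattern matches the directories created in the annotation step):

mkdir -p gff_files
cp Prokka_*_short/*.gff gff_files/

The Roary command below can then be pointed at gff_files/*.gff.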

Pan-genome Analysis with Roary

Execute Roary on the .gff files generated by Prokka:

roary -f roary_output -e -n -p 4 -i 95 prokka_output_dir/*.gff
  • -f roary_output: Specifies the output directory where Roary will write the results.
  • -e: Produces a multiFASTA alignment of core genes.
  • -n: Use MAFFT to align core genes.
  • -p 4: Use 4 CPU cores.
  • -i 95: Set the minimum percentage identity for BLASTp at 95%.
  • prokka_output_dir/*.gff: Input all .gff files from the Prokka output directories.

Make sure to replace prokka_output_dir with the actual directory containing your Prokka-generated .gff files. Roary will collate the .gff files, create pan-genome analysis results, and place them in the roary_output directory.
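
The objectives also call for a core-gene phylogeny and pan-genome plots. A minimal follow-up sketch, assuming Roary was run with -e -n so that roary_output/ contains core_gene_alignment.aln, and that the roary_plots.py helper script shipped with the Roary repository is available (its location and exact options may vary):

# Approximate maximum-likelihood tree from the core gene alignment
FastTree -nt -gtr roary_output/core_gene_alignment.aln > roary_output/core_tree.newick

# Pan-genome plots (pie chart, gene frequency plot, presence/absence matrix)
python roary_plots.py roary_output/core_tree.newick roary_output/gene_presence_absence.csv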
