Oral Microbiome & Metagenomic Analysis for the Characterization of a uSGB
See the presentation
This text was generated with ChatGPT-4, using code chunks from the project as input.
Authors: Zehra Korkusuz, Eevi Sipponen
This project is centered on the computational exploration of metagenome-assembled genomes (MAGs) derived from a dental plaque Species-Level Genome Bin (SGB), specifically SGB 985.
The primary objective is to extract valuable insights from the MAGs of SGB 985. To achieve this, we will execute a series of computational genomic analyses, which include:
- Quality-Checking: Using CheckM to assess the quality of the assembled genomes, ensuring their suitability for further analysis.
- Taxonomic Assignment: Employing PhyloPhlAn to categorize the MAGs into taxonomic ranks, providing a clearer understanding of their biological context.
- Genome Annotation: Annotating the genomes with Prokka to identify genes and infer their function.
- Pangenome Analysis: Applying Roary to compare gene content across different genomes, generating plots to visualize the pan-genome landscape.
- Phylogenetic Analysis: Constructing phylogenetic trees with Roary and FastTree to elucidate evolutionary relationships, accompanied by plots for clarity.
- Association with Host Metadata: Correlating genomic data with host metadata to explore potential associations.
This document provides a step-by-step guide for downloading, decompressing, and analyzing metagenome-assembled genomes (MAGs) belonging to a specific Species-Level Genome Bin (SGB).
- Ensure you have conda and bunzip2 installed on your system.
- Download the bzip2-compressed file containing the MAGs.
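- Unzip the Downloaded File
  Use bunzip2 to decompress the downloaded bzip2-compressed MAG files, which belong to a specific SGB. The CheckM command below performs this decompression as its first step.
- Prepare for Quality Analysis with CheckM
  First, create a directory to store the output files generated by the CheckM analysis:
      mkdir checkm_output
- Run CheckM
  Decompress the MAGs, then run the CheckM taxonomy workflow for the domain Bacteria on the mags folder with 4 threads:
      bunzip2 mags/* && checkm taxonomy_wf domain Bacteria mags checkm_output -t 4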
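Once CheckM has run, the quality estimates are usually used to decide which MAGs to keep. The following is a minimal sketch, not part of the original project: it assumes the CheckM results were written as a tab-separated table (for example with CheckM's --tab_table and -f options) containing Completeness and Contamination columns. The function name filter_mags and the thresholds in the usage note are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical helper: print the IDs of bins that pass quality thresholds.
# Usage: filter_mags <checkm_table.tsv> <min_completeness> <max_contamination>
filter_mags() {
    awk -F'\t' -v min_comp="$2" -v max_cont="$3" '
        NR == 1 {
            # Locate the columns by header name rather than by position
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            next
        }
        # Print the bin ID (first column) when both thresholds are met
        comp && cont && $comp + 0 >= min_comp + 0 && $cont + 0 <= max_cont + 0 { print $1 }
    ' "$1"
}
```

For example, filter_mags checkm_output/quality.tsv 50 5 would list the bins with at least 50% completeness and at most 5% contamination; both the file name and the cut-offs are placeholders to adapt.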
The following steps outline how to assign taxonomy to metagenome-assembled genomes (MAGs) using PhyloPhlAn 3.0.
- Ensure you have conda installed on your system.
- Set Conda Channel Priority
  Adjust the conda channel priority to flexible for compatibility:
      conda config --set channel_priority flexible
- Create a New Conda Environment
  Create a new conda environment with the necessary packages:
      conda create -n ppa phylophlan=3.0 biopython=1.83 -c bioconda -c conda-forge
- Activate the New Environment
  Activate the conda environment named ppa:
      conda activate ppa
- List Available Databases
  Check the available databases in PhyloPhlAn:
      (ppa) phylophlan_metagenomic --database_list
- Create Output Directory
  Create a directory to store the output from PhyloPhlAn:
      (ppa) mkdir phylophlan_output
- Run PhyloPhlAn Metagenomic Analysis
  Execute the PhyloPhlAn metagenomic analysis with the specified parameters:
      (ppa) phylophlan_metagenomic -i mags -d CMG2324 --verbose --nproc 4 -n 1 -o phylophlan_output/ppa_ms
The options used are:
- -i mags: Input folder where the MAGs are located.
- -d CMG2324: Specifies the database to be used for the analysis.
- --verbose: Enables verbose mode, providing detailed log messages.
- --nproc 4: Number of processors to use.
- -n 1: Number of closest SGBs to report for each MAG (here, only the single closest one).
- -o phylophlan_output/ppa_ms: Specifies the output directory and prefix for the output files.
After running these commands, your MAGs will be processed and the taxonomic assignment will be saved in the specified output directory.
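To inspect the assignments quickly, the result table can be reduced to one line per MAG. This is only a sketch under stated assumptions: it presumes a tab-separated output file in which comment lines start with #, the first column is the input MAG, and the second column has the colon-separated form SGB_id:taxa_level:taxonomy:avg_dist. The function name closest_sgb is hypothetical, and the layout should be checked against your own output file before relying on it.

```shell
#!/bin/bash
# Hypothetical helper: print MAG name, closest SGB ID, and average distance.
# Usage: closest_sgb <phylophlan_result.tsv>
closest_sgb() {
    awk -F'\t' '
        /^#/ || NF < 2 { next }        # skip comment lines and malformed rows
        {
            # Assumed layout of column 2: SGB_id:taxa_level:taxonomy:avg_dist
            n = split($2, parts, ":")
            printf "%s\t%s\t%s\n", $1, parts[1], parts[n]
        }
    ' "$1"
}
```

With the command above, the result file would presumably be phylophlan_output/ppa_ms.tsv, so closest_sgb phylophlan_output/ppa_ms.tsv would print one summary line per MAG.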
- Ensure Prokka is installed and available in your environment.
- The MAGs are assumed to be in shortened_fasta/*.fna.
To annotate the MAGs, use the following script. It iterates over each .fna file, runs Prokka, and skips files that have already been processed.
for i in shortened_fasta/*.fna; do
    base=$(basename "$i" .fna)
    output_dir="Prokka_${base}"
    # Skip MAGs whose output directory already exists
    if [ ! -d "$output_dir" ]; then
        echo "Processing: $i..."
        prokka --outdir "$output_dir" \
               --prefix "${base}" \
               --force \
               --centre "Project_" \
               --compliant \
               --kingdom Bacteria "$i"
        echo "Completed: $i"
    else
        echo "Output directory $output_dir already exists, skipping..."
    fi
done
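Before launching a long annotation run, it can help to see which MAGs the loop would still process. The sketch below mirrors the loop's skip logic (a MAG counts as done when a Prokka_<name> directory exists in the current directory) without calling Prokka. The function name pending_mags is an illustrative assumption.

```shell
#!/bin/bash
# Hypothetical helper: list .fna files that have no Prokka_<name> directory yet.
# Usage: pending_mags [fasta_dir]   (fasta_dir defaults to shortened_fasta)
pending_mags() {
    local fasta_dir="${1:-shortened_fasta}" f base
    for f in "$fasta_dir"/*.fna; do
        [ -e "$f" ] || continue        # the glob matched no files
        base=$(basename "$f" .fna)
        # Same test as the annotation loop: an existing directory means "done"
        [ -d "Prokka_${base}" ] || echo "$f"
    done
}
```

Running pending_mags | wc -l gives the number of MAGs left to annotate. Note that, like the loop itself, this treats a directory left behind by an interrupted run as finished.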
Use the following shell script to compile annotation data into a .tsv file:
#!/bin/bash
temp_data="temp_data.txt"      # long-format scratch file: ID, key, value
output_file="compiled_data.tsv"
[ -e "$temp_data" ] && rm "$temp_data"
# Collect the key/value lines of each Prokka summary (.txt) file
for dir in Prokka_*_short; do
    ID=$(echo "$dir" | sed 's/Prokka_\(.*\)_short/\1/')
    file="${dir}/${ID}_short.txt"
    while IFS=": " read -r key value; do
        echo -e "${ID}\t${key}\t${value}" >> "$temp_data"
    done < "$file"
done
# Header row: ID followed by every key in order of first appearance
echo -e "ID\t$(awk -F'\t' '!seen[$2]++ {keys=keys"\t"$2} END {print substr(keys,2)}' "$temp_data")" > "$output_file"
# One row per ID, one column per key
awk -F'\t' '{
    data[$1 OFS $2] = $3
    if (!key_seen[$2]++) keys[key_order++] = $2
    if (!id_seen[$1]++) ids[id_order++] = $1
}
END {
    for (i = 0; i < id_order; i++) {
        printf "%s", ids[i]
        for (j = 0; j < key_order; j++) printf "\t%s", data[ids[i] OFS keys[j]]
        printf "\n"
    }
}' OFS='\t' "$temp_data" >> "$output_file"
rm "$temp_data"
To count proteins and merge them into the compiled data, use the following script:
#!/bin/bash
output_file="compiled_data.tsv"
protein_counts_file="protein_counts.tsv"
echo -e "ID\tknown_proteins\thypothetical_proteins" > "$protein_counts_file"
for dir in Prokka_*_short; do
    ID=$(echo "$dir" | sed 's/Prokka_\(.*\)_short/\1/')
    prokka_tsv="${dir}/${ID}_short.tsv"
    # Every row without "hypothetical protein" counts as known
    # (this includes the header row and any non-CDS features)
    known=$(grep -v "hypothetical protein" "$prokka_tsv" | wc -l)
    hypothetical=$(grep "hypothetical protein" "$prokka_tsv" | wc -l)
    echo -e "${ID}\t${known}\t${hypothetical}" >> "$protein_counts_file"
done
sort -o "$protein_counts_file" "$protein_counts_file"
# join needs both inputs sorted on the ID column; -o keeps the 12 compiled columns plus the two counts
join -1 1 -2 1 -t $'\t' -o '1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.10 1.11 1.12 2.2 2.3' <(sort -k1,1 "$output_file") <(sort -k1,1 "$protein_counts_file") > "temp_$output_file"
mv "temp_$output_file" "$output_file"
rm "$protein_counts_file"
echo "Protein counts have been merged into $output_file."
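To run a script, save it to a file (for example, the Prokka annotation loop as run_prokka.sh), make it executable, and execute it:
foo@bar:~$ chmod +x run_prokka.sh
foo@bar:~$ ./run_prokka.sh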
The following step details how to use Roary with GFF files produced by Prokka annotation for pan-genome analysis.
- Make sure Roary is installed in your environment.
- Have the Prokka output directories with .gff files ready.
Execute Roary on the .gff files generated by Prokka:
roary -f roary_output -e -n -p 4 -i 95 prokka_output_dir/*.gff
The options used are:
- -f roary_output: Specifies the output directory where Roary will write the results.
- -e: Produces a multiFASTA alignment of core genes.
- -n: Use MAFFT to align core genes.
- -p 4: Use 4 CPU cores.
- -i 95: Set the minimum percentage identity for BLASTp at 95%.
- prokka_output_dir/*.gff: Input all .gff files from the Prokka output directories.
Make sure to replace prokka_output_dir with the actual directory containing your Prokka-generated .gff files. Roary will collate the .gff files, create pan-genome analysis results, and place them in the roary_output directory.
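A quick way to check the result is to compare the number of core genes with the total number of gene clusters. The sketch below assumes Roary wrote a tab-separated summary_statistics.txt file whose rows carry a category name in the first column (such as "Core genes" and "Total genes") and the count in the third column. The function name core_fraction is an illustrative assumption.

```shell
#!/bin/bash
# Hypothetical helper: report core genes as a share of all gene clusters.
# Usage: core_fraction <summary_statistics.txt>
core_fraction() {
    awk -F'\t' '
        $1 == "Core genes"  { core = $3 }
        $1 == "Total genes" { total = $3 }
        END {
            # Print nothing if the total is missing, to avoid dividing by zero
            if (total > 0) printf "%d/%d (%.1f%%)\n", core, total, 100 * core / total
        }
    ' "$1"
}
```

For example, core_fraction roary_output/summary_statistics.txt prints a line of the form core/total (percent). A small core fraction can indicate a diverse set of genomes, but it can also reflect incomplete or fragmented MAGs, so it is best read alongside the CheckM results.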