This is a method for identification of Calvin cycle-positive microbial genomes and ranking of genetic features (Enzyme Commission numbers and Pfam families) that represent adaptations to the Calvin cycle.
Linux operating system (Tested on Ubuntu 18.04.5 LTS and 20.04.1 LTS)
bash 4.0 (Tested with 4.4.20(1)-release and 5.0.17(1)-release)
Python 3.7 (Tested with 3.7.6)
R ≥ 3.6.3 (Tested with 3.6.3)
GNU parallel 20161222 (Tested with 20161222)
pigz 2.4 (Tested with 2.4)
seqmagick 0.6.2 (Tested with 0.6.2)
ORFfinder 0.4.3 (Tested with 0.4.3)
FastTreeMP 2.1.8 SSE3
DeepEC (Tested with commit b7e4546)
Python libraries: sklearn, numpy, pandas
R libraries: doMC, egg, foreach, ggnewscale, ggrepel, ggtree, MidpointRooter, optparse, phytools, RColorBrewer, scales, tidyverse, viridis
This analysis used 24,706 species representative archaeal and bacterial genomes listed in GTDB release 89 (https://gtdb.ecogenomic.org/). The genomes were downloaded from NCBI in gzip-compressed nucleotide FASTA format followed by ORF annotation with stand-alone ORFfinder v.0.4.3 (https://www.ncbi.nlm.nih.gov/orffinder/) using the following command:
ls /path/to/ncbi/genomes/ | parallel --no-notice --jobs 16 '
Outfile=`echo "data/orf/{}" | sed -e "s/fna.gz$/fasta/"`;
zcat /path/to/ncbi/genomic/{} > intermediate/{};
ORFfinder -g 11 -ml 300 -n true -in intermediate/{} -out $Outfile;
gzip $Outfile;
rm intermediate/{}
'
Note that the input nucleotide files must be named with the GTDB accession IDs, i.e. <GTDB accession ID>.fna.gz
, which is not the case directly after download from NCBI. If populated correctly, the directory data/orf/
should contain 24,706 gzip-compressed amino acid FASTA files with ORFs for each of the genomes. Each file must be named <GTDB accession ID>.fasta.gz
for the analysis to work.
The analysis also relies on the GTDB core protein alignments and Pfam-A HMMs (not included; see steps 02, 03, and 05).
Files containing example genome ORFs and their annotations can be downloaded from Figshare (DOI 10.6084/m9.figshare.13013309).
These are short descriptions of each step in the analysis. For details, see the corresponding scripts. Some output files have been omitted in the descriptions.
Script: source/00_Extract_accession_to_group_association.R
Output: intermediate/accession_taxonomy.tab
Creates a tab-delimited table with taxonomic information for each accession in GTDB.
Script: source/01_Identify_Rubisco_and_Prk.sh
Output: data/rubisco.txt
, data/prk.txt
, data/positive_genomes.txt
, and more
This script contains a long series of commands used to identify and filter Rubisco and Prk sequences in the GTDB genomes, followed by designating genomes as CBB-positive if they carry both Rubisco and Prk. Manual intervention is required for downloading sequences from UniProt, submitting sequences to KEGG BlastKOALA, and selecting the appropriate position of the K174 amino acid in the Rubisco alignment.
Script: source/02_Create_trees.sh
Output: intermediate/archaea.tree
and intermediate/bacteria.tree
Creates phylogenetic trees for use in the analyses below. Requires the GTDB core protein alignments ar122_msa.faa
and bac120_msa.faa
saved as data/archaea_msa.fasta
and data/bacteria_msa.fasta
.
Script: source/03_Create_distance_matrices.sh
Output: intermediate/archaea.dist
and intermediate/bacteria.dist
Creates phylogenetic distance matrices for use in the analyses below. Requires the GTDB core protein alignments ar122_msa.faa
and bac120_msa.faa
saved as data/archaea_msa.fasta
and data/bacteria_msa.fasta
(not included).
Script: source/04_Select_closest_relatives_as_negative.R
Output: intermediate/example_genomes.tab
Selects as many CBB-negative genomes as there are CBB-positive genomes, maintaining a phylogenetic distance that is as short as possible between the two groups, based on the distance matrices generated in the preceding step.
Script: source/05_Identify_functions_in_example_genomes.sh
Output: intermediate/deepec.tab.gz
and intermediate/pfam.tab.gz
Creates a one-column list of all example genomes (CBB-positive and CBB-negative), i.e. intermediate/example_genomes.txt
, and then runs DeepEC and hmmsearch with Pfam on the example genome ORFs to acquire EC and Pfam annotations. Requires the Pfam-A database with HMMs saved as data/Pfam-A.hmm.gz
(not included).
Script: source/06_Format_random_forest_input.R
Output: Files containing EC and Pfam count matrices, CBB-status, and names.
Filters out all Rubisco and Prk EC and Pfam annotations and formats a training dataset for the random forest analysis.
Script: source/07_Perform_random_forest_analysis.sh
Output: Log files with accuracies, importance and prediction reports.
Runs the random forest analysis on EC and Pfam features using source/Rank_features_with_random_forest.py
, yielding importance values and predictions for every genome from 100 random forests. Train:test split is 3:1 and logistic regression is used to select the 600 most promising EC and Pfam features.
Script: source/08_Create_feature_heatmap.R
Output: results/Feature_heatmap.png
Creates a heatmap with genomes sorted by average random forest prediction accuracy (columns), features sorted by average random forest importance (rows), and feature count represented by fill color.
Script: source/09_Feature_enrichment.R
Output: intermediate/feature_enrichment.tab
Compare EC and Pfam counts between CBB-positive and CBB-negative genomes using a Wilcoxon rank sum test.
Script: source/10_Taxonomic_distribution.R
Output: results/taxonomic_distribution.pdf
Generates a set of colors for taxonomic groups (intermediate/accession_organism_colours.tab
) and plots the distribution of CBB-positive and CBB-negative genomes.
Script: source/11_Correlate_feature_histories_in_Archaea.R
Output: Feature history correlation results in Archaea and trees.
Performs ancestral character estimation in Archaea, and correlates the emergence of the Calvin cycle to other genetic features. Plots the highest absolute correlations on phylogenetic trees.
Script: source/12_Correlate_feature_histories_in_Bacteria.R
Output: Subtree table, feature history correlation results in Bacteria, and trees.
Selects bacterial subtrees and plots them on the bacterial phylogenetic tree. Then performs ancestral character estimation in subtrees of Bacteria, and correlates the emergence of the Calvin cycle to other genetic features. Plots the highest absolute correlations on phylogenetic trees.
Script: source/13_Create_supplementary_and_compare.R
Output: Tables for enrichment, ACE, and random forest analyses, and comparisons.
Creates supplementary tables for the enrichment, ACE, and random forest analyses, with feature ranks. Correlates ranks between methods. Compares methods with a plot of the ranks.
Script: source/14_Create_consensus_ranking.R
Output: results/Supplementary_Consensus_rank.tab
and top 20 tables.
Ranks the sum of ranks for all methods to generate a consensus rank for each genetic feature (EC or Pfam). Highlights the top 20 EC and top 20 Pfam features for Table 1 (see step 31).
Script: source/15_Compare_accuracy_to_completeness.R
Output: results/EC_count_features.acc_vs_comp.png
Compares average per-genome random forest accuracy and genome completeness in a plot.
Script: source/16_Gene_proximity.R
Output: results/Supplementary_Gene_proximity.tab
Count the distance in number of ORFs between all genetic features and Rubisco and Prk in CBB-positive genomes.
Script: source/17_Comparison_figure.R
Output: results/method_feature_rank_comparison_2.pdf
Plots consensus rank versus method ranks for every genetic feature, with color based on proximity to Rubisco in CBB-positive genomes.
Script: source/18_Create_Genomes_dataset.R
Output: results/Supplementary_Example_genomes.tab.gz
Creates a supplementary dataset with all feature counts (except for ECs and Pfams representing Rubisco and Prk) as well as information about each example genome; closest selected relative, distance to closest selected relative, subtree affiliation, CBB status, genome completeness, and GTDB taxonomy.
Script: source/19_Correlate_importance_and_difference.R
Output: A Spearman correlation r value.
Correlate difference in mean count between CBB-positive and CBB-negative genomes for each feature and the random forest importance values of those features.
Script: source/20_Create_ORF_annotations_dataset.sh
Output: results/Supplementary_ORF_annotations.tab.gz
Creates a supplementary dataset with all ORF annotations (EC and Pfam). ORF names can be used to track position of ORFs in official NCBI contigs.
Script: source/21_Cyano_completeness.R
Output: Test and summary results for cyanobacterial completeness versus CBB status.
Performs a Wilcoxon rank sum test to check whether Cyanobacteria with the CBB-positive and CBB-negative classifications have significantly different genome completeness. Also compares CBB status of highly complete and less complete genomes.
Script: source/22_EC_count_colours_for_KEGG.R
Output: results/EC_count_features.KEGG_difference_colours.discrete.txt
Generate colors based on enrichment/depletion of ECs to use with KEGG's pathway mapping tool.
Script: source/23_Consensus_rank_colours_for_KEGG.R
Output: results/EC_count_features.KEGG_consensus_rank_colours.txt
Generate colors based on EC consensus ranks to use with KEGG's pathway mapping tool.
Script: source/24_Find_transport_features.R
Output: Number of transport features that were significant for each method.
Searches for transport-related terms in feature descriptions to count total number of transport features and how many were significant or ranked in the top 1000 (consensus rank).
Script: source/25_Photosynthetic_genomes.R
Output: results/Supplementary_Photosynthesis.tab
and results/genome_photosynthesis_status.tab
Identify photosynthesis-related features and determine how many CBB-positive and CBB-negative genomes may be photosynthetic.
Script: source/26_Make_Rubisco_tree.sh
Output: intermediate/rubisco_with_Tabita_examples.tree
Aligns the identified Rubisco sequences with examples from Tabita et al. (2007) and constructs a phylogenetic tree to identify Rubisco forms and ensure exclusion of Rubisco-like proteins.
Script: source/27_Plot_Rubisco_tree.R
Output: results/rubisco_with_Tabita_examples.pdf
Plots the tree constructed in step 26.
Script: source/28_Count_DUFs.R
Output: Counts of domains of unknown function among top ranks of the methods.
Counts the domains of unknown function (DUFs/UPFs) in the top 100 of enrichment and ACE analyses, and among all selected 600 Pfams in the random forest analysis.
Script: source/29_Range_of_fold_enrichment.R
Output: Maximum and minimum log2 fold enrichment.
Calculate the maximum and minimum log2 fold enrichment of all significant features and also identify the smallest absolute ratio that is still significant.
Script: source/30_Count_significant_features_in_subtrees.R
Output: The number of significant features in ACE in each subtree.
Counts the number of significant features in the ACE analysis for each subtree.
Script: source/31_Create_Table_1.R
Output: results/Table_1_Top_consensus_ranks.tab
Tidy up the top 20 EC and Pfam features output from step 14 to create Table 1.
Script: source/32_List_cyanobacteria_with_CBB.R
Output: intermediate/cyano_with_CBB.txt
Make a list of Cyanobacteria that were identified to carry the Calvin cycle.
Script: source/33_Select_cyano_closest_relatives.R
Output: intermediate/cyano_relatives.tab
Select closest CBB-negative relatives of CBB-positive Cyanobacteria as in step 04.
Script: source/34_Cyano_vs_others_relative_dist.R
Output: results/cyano_vs_others_relative_dist.tab
Compare the distance between CBB-positive Cyanobacteria and their CBB-negative relatives to the distance between other CBB-positive and CBB-negative microbes with a Wilcoxon rank sum test.
Script: source/Plot_ACE.R
Output: A PDF with a phylogenetic (sub)tree displaying ACE data.
Plots ACE results of any feature on any subtree (as long as the feature is present in the subtree organisms) to allow creating phylogenetic tree figures displaying correlations between features and the CBB status.
Johannes Asplund-Samuelsson ([email protected])