-
Notifications
You must be signed in to change notification settings - Fork 81
Combining metagenomes and metatranscriptomes with SqueezeMeta
This tutorial is aimed to give you some hints on the best way to analyze your combination of metagenomes and metatranscriptomes. We will illustrate the example with a set of three metagenomes and three metatranscriptomes taken from thermal microbial mats (for a reference, see https://www.ncbi.nlm.nih.gov/pubmed/30333812). Each sample was taken at a different temperature (44, 55 and 66ºC), and we have paired samples, that is, each metagenome has its corresponding metatranscriptome. Everything was sequenced with Illumina HiSeq, paired-end for the metagenomes, and single-read for the metatranscriptomes.
Our goal is to correlate the values of gene abundance (from the metagenome) and gene expression (from the metatranscriptome), so that we could infer the difference between the genes that are present and the genes that are active. We could plan the analysis in several ways.
We could analyze separately metagenomes and metatranscriptomes, creating functional/taxonomic profiles for both and just comparing these profiles. The problem of this approach is that we will not have a connection between the genes in both datasets, that is, we could not directly know that gene A in the metagenome is the same than gene B in the metatranscriptome, and therefore comparison between gene sets would be difficult.
We could think of doing a co-assembly of all samples, metagenomes and metatranscriptomes all together, so that we would have a single set of genes coming from both datasets. But take into account that while metagenomic sequencing is random, in the sense that all fragments from the same genome have equal probability of being sequenced (excluding some biases for particular compositions), metatranscriptomics is not. Some parts of the genome will be sequenced with a high coverage (corresponding to genes that are very active) while some other will not. As the assemblers assume that parts from the same genome will have approximately the same coverage, mixing metagenome and metatranscriptome will create havoc. Yes, you can assemble metatrancriptome data separately, because you are constructing independent transcripts, and therefore the assembler will deal with it. But please do not co-assemble metagenomes and metatranscriptomes together.
One approach that we and others have followed in the past was co-assembling the metagenomes and simply mapping the transcripts on to them. This indeed let us correlate the abundance of the genes in the metagenomes with their expression levels in the metatranscriptome, but (big but) only for the genes that are represented in the metagenome. We are missing the transcripts corresponding to genes that were not assembled in the metagenome (because they belong to a genomic region that was not assembled, for instance because they belong to low-abundance species). And we do not want to lose that valuable information!
A solution would be to recover the transcripts that were not mapped, assemble them separately, add the resulting contigs to the contig dataset from the metagenome, and remap. This is certainly doable, but SqueezeMeta can help you to have similar results easily and painlessly. We will use the merge mode to do this.
We will use the merge mode to assemble all data sets separately, and then merging together in a single dataset. In this way, we will have a single dataset as a result, where contigs will come from either the metagenomes and the metatranscriptomes, and contigs that were present in more than one dataset will be detected and merged (We are preparing a more sophisticated approach that we expect to improve results even better, where we will co-assemble metagenomes and metatrancriptomes separately, and then we will merged both datasets)
Let us do it. First, we create a samples file specifying all our datasets:
Then we run SqueezeMeta in merged mode:
/media/mcm/jtamames/squeezeM/repo/SqueezeM/scripts/SqueezeMeta.pl -m merged -s Huinay.samples -f /dir/to/fastq -p test_merged -c 200 --nopfam --nobins
Notice that we set the minimum contig length to 200 to get as much contigs as possible (this is especially important when working with metatranscriptomes). We also skipped the pfam annotations and the binning steps, since it is not the subject of this tutorial.
Let's start checking the mapping percentage, that is, how many reads we could map back to the contigs. This will give us an idea of the goodness of our assembly, or better, how well our assembly represents the diversity found in the individual samples. We can find this information in the results' file 09.<project>.mappingstat:
We see that we have rather good mapping percentages, all datasets above 85%. These are good news. The lowest percentages are achieved for the 66DNA sample, which was the one having less sequences in all the DNA samples, so it is likely that the assembly was less complete for it.
In this tutorial, we will use the wonderful STAMP software for the statistical analysis and comparison between samples. You can find it in http://kiwi.cs.dal.ca/Software/STAMP
We can inspect the differences in phylogenetic composition between the different datasets using the results in the file 10.<project>.mcount. This file counts how many reads and bases were assigned to each taxa in each sample, and it looks like this (only part of the file is shown):
From that file, we can easily extract a phylogenetic profile for each sample. We choose the phylum level for making the analysis. The resulting table is like this:
Notice that we are introducing raw counts for the analysis, since STAMP will normalize the data itself. We load the table in STAMP, and select the comparison between two samples using G-test with Yates + Fisher, correcting for multiple testing with Benjamini-Hochberg FDR, to discover the significant differences between both. We first addressed the differences between temperatures, comparing the 44 and 66ºC samples. The results are shown in the histogram below:
We see that at lower temperatures, three phyla are abundant (Proteobacteria, Chloroflexi and Cyanobacteria), while at higher temperatures the phylum Chloroflexi is clearly dominant, and Cyanobacteria keeps its importance, with similar numbers at both temperatures. The comparison of the RNA datasets at the same temperatures show very interesting differences:
We can see how the activity is focused in the members of the phyla Chloroflexi at high temperatures, and Cyanobacteria at lower ones. We then can envision a microbiome structure in which members of these two dominant phyla are present at the whole range of temperatures in the mat, but they are active only at a pàrticular range of temperatures: the cooler parts of the mat for Cyanobacteria, and in the hotter ones for Chloroflexi.
Next, we can perform a similar analysis focusing in the functions instead of the taxa. The question here could be: which functions are active and which ones are more active at different temperatures? For this, we need a table of the abundance of functions in each sample. As we can find different copies of the same gene in the datasets (corresponding to the same gene in different species or to paralogs within the same species), we need to add the abundances of all genes with the same function, thus obtaining a single value of abundance for each. SqueezeMeta has already done this for you, creating the file 11.<project>.kegg.rawcounts:
There you have the raw counts of all functions in all samples, allowing you to compare between datasets. You can input that table directly on STAMP. The differences between temperatures that you see may correspond to the differences between the composition of the microbiomes at each temperatures. It is more interesting to look at the differences between DNA and RNA at the same temperatures. This will inform about the processes that are significantly more active than expected according to their abundance.The functions that are significantly overexpressed at 48 ºC are shown below (only part of the file is shown):
Not surprisingly, these are cyanobacterial genes. Remember that Cyanobacteria were the most active phylum at this temperature. As you can see, the functions are focused in generating energy througth photosynthesis (this sample was taken during the daytime, therefore photosynthesis is rampant) and protein synthesis (some ribosomal proteins)
At 66ºC, the differences between the metagenome and the metatranscriptome are more difficult to interpret. The table below shows the most overexpressed genes at that temperature (only part of the file is shown):
We can see a pili asssembly protein (related to adhesion and mat formation, maybe?), some fatty acid enzimes involved in membrane generation, some transporters, and especially, a big representation of transcprition and protein synthesis genes. There is no evident metabolic pattern as in the case of 44ºC, therefore a most detailed study would be needed.