-
Notifications
You must be signed in to change notification settings - Fork 0
Post processing
Once (meta)genomic information (i.e. reads or contigs) were accurately integrated into a whole-genome alignment for a given species, a series of assessing and post-processing steps may help improve the reliability for downstream phylogenetics and molecular clocking analysis. Therefore, we provide rich lightweight utility scripts for manual curation.
Assessing the missing information in the genome alignment reconstruction
While one of outputs( Mac_stats.tsv
) from metaclock_mac
manifests basic statistics about the reconstructed alignment regarding gap ratio, alignment length, reconstructed sequence length, and GC contents, a genome-wide visualization can enhance the details about reconstruction quality along different genomic regions. Thus, we provide metaclock_visualizer
for assessing the reconstruction quality along the whole genome.
metaclock_visualizer -h
usage: metaclock_visualizer [-h] [-w [WINDOW_SIZE]] [-s [STEP_SIZE]]
[-l_dist [LABEL_DISTANCE]]
[-opt_text [OUTPUT_TEXT]]
[input_alignment] [output_name]
positional arguments:
input_alignment Input a MSA alignment in fasta
output_name Specify the output name for resulted figure
optional arguments:
-h, --help show this help message and exit
-w [WINDOW_SIZE], --window_size [WINDOW_SIZE]
Sliding window size. default:[200bp]
-s [STEP_SIZE], --step_size [STEP_SIZE]
Moving steps of sliding window. default:[50bp]
-l_dist [LABEL_DISTANCE], --label_distance [LABEL_DISTANCE]
The distance betwen text label and the plotting.
default: [0.15]
-opt_text [OUTPUT_TEXT], --output_text [OUTPUT_TEXT]
Output a statistics text.
metaclock_visualizer.py Mac_genome_MSA.fna genome_wide_visual.png -w 10000 -s 2000
"Reconstruction score" indicates the proportion of non-gap part in the genomic region within sliding windows. It provides the information of how well specific genomic regions were reconstructed using metagenomics data along the whole microbial genome. A high "reconstruction score" reflects a good reconstruction result. As the figure above, except for sample ERR3003623, ERR3003647, SRR6877339, and SRR6877399, we managed to reconstruct a near complete genome for remaining samples.
Minimizing the missing information in the genome alignment reconstruction
To increase the phylogenetic and molecular clocking analysis in the downstream analysis, we are going to apply some tailoring work using metaclock_tailor basic
in order to minimize missing information:
metaclock_tailor basic -h
usage: landscape basic [-h] [-a] [-s STATS] [-tt TARGET_TAILORING]
[whole_genome_alignment] [output_file]
Assess mutational landscape of whole genome alignment and select interested
regions.
positional arguments:
whole_genome_alignment
Input the whole genome alignment generated by
metaclock_mac.
output_file Specify output name for the tailored genome alignment
in fasta format.
optional arguments:
-h, --help show this help message and exit
-a, --automated_tailoring
This option allows automated tailoring using trimAl
for minimizing missing information in the alignment
-s STATS, --stats STATS
Specify a file name for statistic report of tailored
alignment.
-tt TARGET_TAILORING, --target_tailoring TARGET_TAILORING
Input a list of samples to keep for alignment
tailoring. The limit for missing info in each column
can be given and it is delimited by comma (e.g.
shortlist.txt,0.1), or column tailoring will be
performed automatically using trimAl (e.g. list.txt).
Here, we will choose target tailoring -tt
, namely keep samples we want (those having least missing information) and then perform trimming on columns.
First, prepare a short list shortlist.txt
including all interesting samples:
cat shortlist.txt
ERR3003614.fna
ERR3003615.fna
ERR3003619.fna
ERR3003621.fna
ERR3003622.fna
ERR3003632.fna
ERR3003637.fna
ERR3003640.fna
ERR3003644.fna
ERR3003646.fna
ERR3003651.fna
SRR6877288.fna
GCF_000529525.fna
GCF_902384065.fna
GCF_900289035.fna
Second, run metaclock_tailor basic
and remove all columns containing gaps:
metaclock_tailor basic Mac_opt.fna Mac_opt_tailored.fna -tt shortlist.txt,0
If you would like to use automated trimming on target samples:
metaclock_tailor basic Mac_opt.fna Mac_opt_tailored.fna -tt shortlist.txt
Thirdly, assess tailored alignment using metaclock_visualizer
:
metaclock_visualizer Mac_opt_tailored.fna genome_wide_visual_tailored.png -opt_text assessment.txt -w 10000 -s 2000
cat assessment.txt
SeqHeader GapRatio AlignmentLength SequenceLength GCcontent
ERR3003614.fna 0.0 1745214 1745214 0.28
ERR3003615.fna 0.0 1745214 1745214 0.28
ERR3003619.fna 0.0 1745214 1745214 0.28
ERR3003621.fna 0.0 1745214 1745214 0.28
ERR3003622.fna 0.0 1745214 1745214 0.28
ERR3003632.fna 0.0 1745214 1745214 0.28
ERR3003637.fna 0.0 1745214 1745214 0.28
ERR3003640.fna 0.0 1745214 1745214 0.28
ERR3003644.fna 0.0 1745214 1745214 0.28
ERR3003646.fna 0.0 1745214 1745214 0.28
ERR3003651.fna 0.0 1745214 1745214 0.28
SRR6877288.fna 0.0 1745214 1745214 0.28
GCF_000529525.fna 0.0 1745214 1745214 0.28
GCF_902384065.fna 0.0 1745214 1745214 0.28
GCF_900289035.fna 0.0 1745214 1745214 0.28
Assessing the whole genome landscape
Different genomic regions commonly have been through various evolutionary trajectory (e.g. purifying selection, position selection, lateral gene transfer), resulting clear heterogeneity in mutational landscape along the whole genome. Hence, in this framework, we provide another sophisticate utility metaclock_tailor landscape
for exploring the genome landscape region by region and selecting/concatenating interesting genomic regions in order to avoid noisy evolutionary signals.
metaclock_tailor landscape -h
usage: landscape landscape [-h] [-o OPT_DIR] [-gff3 GFF3_ANNOTATION] [-s]
[-nproc NUMBER_OF_PROCESSORS]
[-ns NUMBER_OF_SAMPLES] [-l MINIMUM_LENGTH]
[-mi MAXIMUM_MISSING_INFORMATION]
[-snv_density_max MAXIMUM_SNV_DENSITY]
[-snv_density_min MINIMUM_SNV_DENSITY]
[-Pdist_mean AVERAGE_PAIRWISE_DISTANCES]
[-Pdist_stdv PAIRWISE_DISTANCES_STDV] [-f FEATURE]
[-fast_TempEst FAST_TEMPORAL_SIGNAL_TEST] [-raxml]
[whole_genome_alignment]
Assess mutational landscape of whole genome alignment and select interesting
regions.
positional arguments:
whole_genome_alignment
Input the whole genome alignment generated by
metaclock_mac.
optional arguments:
-h, --help show this help message and exit
-o OPT_DIR, --opt_dir OPT_DIR
Sepecify a directory name for outputs. default:
"opt_dir" in current working directory.
-gff3 GFF3_ANNOTATION, --gff3_annotation GFF3_ANNOTATION
Input the gff3 annotation file. If no gff3 file was
provided automatic annotation would be performed.
(Please use gff file from the same reference sequence
in resconstructing whole genome alignment in order to
maintain the consistency.)
-s, --select Flag this option to select your interesting regions.
Note: other options need to be specificed.
-nproc NUMBER_OF_PROCESSORS, --number_of_processors NUMBER_OF_PROCESSORS
Specify the number of processor to be used. default:
[1]
-ns NUMBER_OF_SAMPLES, --number_of_samples NUMBER_OF_SAMPLES
Specify the minimum number of samples with <10 percent
missing information to select the child alignment.
Default: [1]
-l MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
Specify the minimum length of a child alignment to be
selected. default: [100]
-mi MAXIMUM_MISSING_INFORMATION, --maximum_missing_information MAXIMUM_MISSING_INFORMATION
Specify the minimum missing information of a child
alignment to be selected. default: [1.0]
-snv_density_max MAXIMUM_SNV_DENSITY, --maximum_snv_density MAXIMUM_SNV_DENSITY
Specify the maximum SNV density for selecting a child
alignment in order to exclude hypervariable regions.
default: [1.0]
-snv_density_min MINIMUM_SNV_DENSITY, --minimum_snv_density MINIMUM_SNV_DENSITY
Specify the minimum SNV density for selecting a child
alignment in order to exclude hypervariable regions.
default: [0.0]
-Pdist_mean AVERAGE_PAIRWISE_DISTANCES, --average_pairwise_distances AVERAGE_PAIRWISE_DISTANCES
Specify the maximum average pairwise distances for
selecting a child alignment. default: [1.0]
-Pdist_stdv PAIRWISE_DISTANCES_STDV, --pairwise_distances_stdv PAIRWISE_DISTANCES_STDV
Specify the maximum standard deviation of pairwise
distances. default: [1.0]
-f FEATURE, --feature FEATURE
Specify the types [CDS, tRNA, nCDS] to select. If more
than one type comma should be used to delimit. e.g.
CDS,tRNA,nCDS: Output all kinds of alignment.
-fast_TempEst FAST_TEMPORAL_SIGNAL_TEST, --fast_temporal_signal_test FAST_TEMPORAL_SIGNAL_TEST
Inputting a mapping file containing sample names and
sample dates will allow a fast estimation for temporal
signal. Note: this option has to be used together with
-raxml.
-raxml, --raxml_tree Reconstruct a maximum likelihood tree using raxmlHPC.
Basic usage:
metaclock_tailor landscape Mac_opt.fna -o landscape_opt_folder
Output_1 assessment.txt
:
cat assessment.txt
Child_alignment Type Function Length SNV_density Missing_value #Samples(missing information < 10%) Avg_genetic_distances Stdv_genetic_distances
nCDS_0_282 nCDS NA 282 0.0035460992907801418 0.23439716312056738 10 0.3338185890257559 0.3107757517495592
CDS_282_477 CDS hypothetical protein 195 0.010256410256410256 0.16589743589743589 14 0.28531713900134953 0.3600023929274135
nCDS_477_1002 nCDS NA 525 0.007619047619047619 0.23038095238095238 13 0.3688721804511278 0.38124077678156876
CDS_1002_1299 CDS hypothetical protein 297 0.0 0.22457912457912457 14 0.3473684210526316 0.37025930338082963
CDS_1335_1923 CDS hypothetical protein 588 0.006802721088435374 0.16717687074829932 16 0.28956319369853206 0.38168537617993875
CDS_1895_2309 CDS hypothetical protein 414 0.007246376811594203 0.23647342995169082 14 0.3760742435799644 0.3986969844821037
nCDS_2309_2435 nCDS NA 126 0.007936507936507936 0.20952380952380953 14 0.34870509607351713 0.44260770953203554
CDS_2435_2804 CDS hypothetical protein 369 0.024390243902439025 0.22466124661246611 8 0.2841392098131508 0.24579265715767315
CDS_2811_3126 CDS hypothetical protein 315 0.009523809523809525 0.11507936507936507 16 0.2030409356725146 0.28389965848294024
.
.
.
.
.
where:
-
Child_alignment
: The label of sub-alignments partitioned based on annotation file from prokka. -
Function
: The functional feature assigned to the sub-alignment. -
Length
: The length of the sub-alignment. [bp] -
SNV_density
: The density of SNV in the sub-alignment (The number of SNVs/The length of the sub-alignment). -
Missing_value
: The proportion of gaps in the sub-alignment. -
#Samples(missing information < 10%)
: The number of genomes containing <10% gaps. -
Avg_genetic_distances
: The mean of pairwise genetic distances. Note: gaps here are not ignored. -
Stdv_genetic_distances
: The standard deviation of pairwise distances.
Output_2 assess_stats.png
:
If we want to improve alignment quality removing gappy part: Advanced usage (Select & concatenate specific regions, build maximum likelihood tree, and estimate temporal signals)
metaclock_tailor landscape Mac_opt.fna -o landscape_opt_folder -s -nproc 8 -l 100 -mi 0.1 -fast_TempEst mp_file.tsv -raxml
where:
-
mp_file.tsv
is the mapping file which have genome names in 1st column and the corresponding age (years ago) in 2nd column:
GCA_001639275.fna 0
ERR3003614.fna 207.5
ERR3003615.fna 207.5
ERR3003619.fna 207.5
ERR3003621.fna 207.5
ERR3003622.fna 207.5
ERR3003632.fna 207.5
ERR3003637.fna 207.5
ERR3003640.fna 207.5
ERR3003644.fna 207.5
ERR3003646.fna 207.5
ERR3003651.fna 207.5
SRR6877288.fna 835
GCF_000529525.fna 0
GCF_902384065.fna 0
GCF_900289035.fna 0
Output_3 selected_region_concatenation.fna
:
The alignment concatenation from the selected sub-alignments in FASTA format.
Output_4 raxml
:
The folder containing inferred maximum likelihood tree.
Output_5 selected_region_concatenation_TempEst.png
:
The estimation of temporal signal from the data set given.