Skip to content

Post processing

Kun Huang edited this page Mar 8, 2021 · 1 revision

Once (meta)genomic information (i.e. reads or contigs) were accurately integrated into a whole-genome alignment for a given species, a series of assessing and post-processing steps may help improve the reliability for downstream phylogenetics and molecular clocking analysis. Therefore, we provide rich lightweight utility scripts for manual curation.

Assessing the missing information in the genome alignment reconstruction

While one of outputs( Mac_stats.tsv) from metaclock_mac manifests basic statistics about the reconstructed alignment regarding gap ratio, alignment length, reconstructed sequence length, and GC contents, a genome-wide visualization can enhance the details about reconstruction quality along different genomic regions. Thus, we provide metaclock_visualizer for assessing the reconstruction quality along the whole genome.

metaclock_visualizer -h

usage: metaclock_visualizer [-h] [-w [WINDOW_SIZE]] [-s [STEP_SIZE]]
                            [-l_dist [LABEL_DISTANCE]]
                            [-opt_text [OUTPUT_TEXT]]
                            [input_alignment] [output_name]

positional arguments:
  input_alignment       Input a MSA alignment in fasta
  output_name           Specify the output name for resulted figure

optional arguments:
  -h, --help            show this help message and exit
  -w [WINDOW_SIZE], --window_size [WINDOW_SIZE]
                        Sliding window size. default:[200bp]
  -s [STEP_SIZE], --step_size [STEP_SIZE]
                        Moving steps of sliding window. default:[50bp]
  -l_dist [LABEL_DISTANCE], --label_distance [LABEL_DISTANCE]
                        The distance betwen text label and the plotting.
                        default: [0.15]
  -opt_text [OUTPUT_TEXT], --output_text [OUTPUT_TEXT]
                        Output a statistics text.
metaclock_visualizer.py Mac_genome_MSA.fna genome_wide_visual.png -w 10000 -s 2000

Assessment of genome reconstruction

"Reconstruction score" indicates the proportion of non-gap part in the genomic region within sliding windows. It provides the information of how well specific genomic regions were reconstructed using metagenomics data along the whole microbial genome. A high "reconstruction score" reflects a good reconstruction result. As the figure above, except for sample ERR3003623, ERR3003647, SRR6877339, and SRR6877399, we managed to reconstruct a near complete genome for remaining samples.

Minimizing the missing information in the genome alignment reconstruction

To increase the phylogenetic and molecular clocking analysis in the downstream analysis, we are going to apply some tailoring work using metaclock_tailor basic in order to minimize missing information:

metaclock_tailor basic -h

usage: landscape basic [-h] [-a] [-s STATS] [-tt TARGET_TAILORING]
                       [whole_genome_alignment] [output_file]

Assess mutational landscape of whole genome alignment and select interested
regions.

positional arguments:
  whole_genome_alignment
                        Input the whole genome alignment generated by
                        metaclock_mac.
  output_file           Specify output name for the tailored genome alignment
                        in fasta format.

optional arguments:
  -h, --help            show this help message and exit
  -a, --automated_tailoring
                        This option allows automated tailoring using trimAl
                        for minimizing missing information in the alignment
  -s STATS, --stats STATS
                        Specify a file name for statistic report of tailored
                        alignment.
  -tt TARGET_TAILORING, --target_tailoring TARGET_TAILORING
                        Input a list of samples to keep for alignment
                        tailoring. The limit for missing info in each column
                        can be given and it is delimited by comma (e.g.
                        shortlist.txt,0.1), or column tailoring will be
                        performed automatically using trimAl (e.g. list.txt).

Here, we will choose target tailoring -tt, namely keep samples we want (those having least missing information) and then perform trimming on columns. First, prepare a short list shortlist.txt including all interesting samples:

cat shortlist.txt

ERR3003614.fna
ERR3003615.fna
ERR3003619.fna
ERR3003621.fna
ERR3003622.fna
ERR3003632.fna
ERR3003637.fna
ERR3003640.fna
ERR3003644.fna
ERR3003646.fna
ERR3003651.fna
SRR6877288.fna
GCF_000529525.fna
GCF_902384065.fna
GCF_900289035.fna

Second, run metaclock_tailor basic and remove all columns containing gaps:

metaclock_tailor basic Mac_opt.fna Mac_opt_tailored.fna -tt shortlist.txt,0

If you would like to use automated trimming on target samples:

metaclock_tailor basic Mac_opt.fna Mac_opt_tailored.fna -tt shortlist.txt

Thirdly, assess tailored alignment using metaclock_visualizer:

metaclock_visualizer Mac_opt_tailored.fna genome_wide_visual_tailored.png -opt_text assessment.txt -w 10000 -s 2000

Assessment of genome reconstruction

cat assessment.txt

SeqHeader GapRatio  AlignmentLength SequenceLength  GCcontent
ERR3003614.fna  0.0 1745214 1745214 0.28
ERR3003615.fna  0.0 1745214 1745214 0.28
ERR3003619.fna  0.0 1745214 1745214 0.28
ERR3003621.fna  0.0 1745214 1745214 0.28
ERR3003622.fna  0.0 1745214 1745214 0.28
ERR3003632.fna  0.0 1745214 1745214 0.28
ERR3003637.fna  0.0 1745214 1745214 0.28
ERR3003640.fna  0.0 1745214 1745214 0.28
ERR3003644.fna  0.0 1745214 1745214 0.28
ERR3003646.fna  0.0 1745214 1745214 0.28
ERR3003651.fna  0.0 1745214 1745214 0.28
SRR6877288.fna  0.0 1745214 1745214 0.28
GCF_000529525.fna 0.0 1745214 1745214 0.28
GCF_902384065.fna 0.0 1745214 1745214 0.28
GCF_900289035.fna 0.0 1745214 1745214 0.28

Assessing the whole genome landscape

Different genomic regions commonly have been through various evolutionary trajectory (e.g. purifying selection, position selection, lateral gene transfer), resulting clear heterogeneity in mutational landscape along the whole genome. Hence, in this framework, we provide another sophisticate utility metaclock_tailor landscape for exploring the genome landscape region by region and selecting/concatenating interesting genomic regions in order to avoid noisy evolutionary signals.

metaclock_tailor landscape -h

usage: landscape landscape [-h] [-o OPT_DIR] [-gff3 GFF3_ANNOTATION] [-s]
                           [-nproc NUMBER_OF_PROCESSORS]
                           [-ns NUMBER_OF_SAMPLES] [-l MINIMUM_LENGTH]
                           [-mi MAXIMUM_MISSING_INFORMATION]
                           [-snv_density_max MAXIMUM_SNV_DENSITY]
                           [-snv_density_min MINIMUM_SNV_DENSITY]
                           [-Pdist_mean AVERAGE_PAIRWISE_DISTANCES]
                           [-Pdist_stdv PAIRWISE_DISTANCES_STDV] [-f FEATURE]
                           [-fast_TempEst FAST_TEMPORAL_SIGNAL_TEST] [-raxml]
                           [whole_genome_alignment]

Assess mutational landscape of whole genome alignment and select interesting
regions.

positional arguments:
  whole_genome_alignment
                        Input the whole genome alignment generated by
                        metaclock_mac.

optional arguments:
  -h, --help            show this help message and exit
  -o OPT_DIR, --opt_dir OPT_DIR
                        Sepecify a directory name for outputs. default:
                        "opt_dir" in current working directory.
  -gff3 GFF3_ANNOTATION, --gff3_annotation GFF3_ANNOTATION
                        Input the gff3 annotation file. If no gff3 file was
                        provided automatic annotation would be performed.
                        (Please use gff file from the same reference sequence
                        in resconstructing whole genome alignment in order to
                        maintain the consistency.)
  -s, --select          Flag this option to select your interesting regions.
                        Note: other options need to be specificed.
  -nproc NUMBER_OF_PROCESSORS, --number_of_processors NUMBER_OF_PROCESSORS
                        Specify the number of processor to be used. default:
                        [1]
  -ns NUMBER_OF_SAMPLES, --number_of_samples NUMBER_OF_SAMPLES
                        Specify the minimum number of samples with <10 percent
                        missing information to select the child alignment.
                        Default: [1]
  -l MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
                        Specify the minimum length of a child alignment to be
                        selected. default: [100]
  -mi MAXIMUM_MISSING_INFORMATION, --maximum_missing_information MAXIMUM_MISSING_INFORMATION
                        Specify the minimum missing information of a child
                        alignment to be selected. default: [1.0]
  -snv_density_max MAXIMUM_SNV_DENSITY, --maximum_snv_density MAXIMUM_SNV_DENSITY
                        Specify the maximum SNV density for selecting a child
                        alignment in order to exclude hypervariable regions.
                        default: [1.0]
  -snv_density_min MINIMUM_SNV_DENSITY, --minimum_snv_density MINIMUM_SNV_DENSITY
                        Specify the minimum SNV density for selecting a child
                        alignment in order to exclude hypervariable regions.
                        default: [0.0]
  -Pdist_mean AVERAGE_PAIRWISE_DISTANCES, --average_pairwise_distances AVERAGE_PAIRWISE_DISTANCES
                        Specify the maximum average pairwise distances for
                        selecting a child alignment. default: [1.0]
  -Pdist_stdv PAIRWISE_DISTANCES_STDV, --pairwise_distances_stdv PAIRWISE_DISTANCES_STDV
                        Specify the maximum standard deviation of pairwise
                        distances. default: [1.0]
  -f FEATURE, --feature FEATURE
                        Specify the types [CDS, tRNA, nCDS] to select. If more
                        than one type comma should be used to delimit. e.g.
                        CDS,tRNA,nCDS: Output all kinds of alignment.
  -fast_TempEst FAST_TEMPORAL_SIGNAL_TEST, --fast_temporal_signal_test FAST_TEMPORAL_SIGNAL_TEST
                        Inputting a mapping file containing sample names and
                        sample dates will allow a fast estimation for temporal
                        signal. Note: this option has to be used together with
                        -raxml.
  -raxml, --raxml_tree  Reconstruct a maximum likelihood tree using raxmlHPC.

Basic usage:

metaclock_tailor landscape Mac_opt.fna -o landscape_opt_folder

Output_1 assessment.txt:

cat assessment.txt

Child_alignment Type  Function  Length  SNV_density Missing_value #Samples(missing information < 10%) Avg_genetic_distances Stdv_genetic_distances
nCDS_0_282  nCDS  NA  282 0.0035460992907801418 0.23439716312056738 10  0.3338185890257559  0.3107757517495592
CDS_282_477 CDS hypothetical protein  195 0.010256410256410256  0.16589743589743589 14  0.28531713900134953 0.3600023929274135
nCDS_477_1002 nCDS  NA  525 0.007619047619047619  0.23038095238095238 13  0.3688721804511278  0.38124077678156876
CDS_1002_1299 CDS hypothetical protein  297 0.0 0.22457912457912457 14  0.3473684210526316  0.37025930338082963
CDS_1335_1923 CDS hypothetical protein  588 0.006802721088435374  0.16717687074829932 16  0.28956319369853206 0.38168537617993875
CDS_1895_2309 CDS hypothetical protein  414 0.007246376811594203  0.23647342995169082 14  0.3760742435799644  0.3986969844821037
nCDS_2309_2435  nCDS  NA  126 0.007936507936507936  0.20952380952380953 14  0.34870509607351713 0.44260770953203554
CDS_2435_2804 CDS hypothetical protein  369 0.024390243902439025  0.22466124661246611 8 0.2841392098131508  0.24579265715767315
CDS_2811_3126 CDS hypothetical protein  315 0.009523809523809525  0.11507936507936507 16  0.2030409356725146  0.28389965848294024
.
.
.
.
.

where:

  • Child_alignment: The label of sub-alignments partitioned based on annotation file from prokka.
  • Function: The functional feature assigned to the sub-alignment.
  • Length: The length of the sub-alignment. [bp]
  • SNV_density: The density of SNV in the sub-alignment (The number of SNVs/The length of the sub-alignment).
  • Missing_value: The proportion of gaps in the sub-alignment.
  • #Samples(missing information < 10%): The number of genomes containing <10% gaps.
  • Avg_genetic_distances: The mean of pairwise genetic distances. Note: gaps here are not ignored.
  • Stdv_genetic_distances: The standard deviation of pairwise distances.

Output_2 assess_stats.png:

Assessment of genome landscape

If we want to improve alignment quality removing gappy part: Advanced usage (Select & concatenate specific regions, build maximum likelihood tree, and estimate temporal signals)

metaclock_tailor landscape Mac_opt.fna -o landscape_opt_folder -s -nproc 8 -l 100 -mi 0.1 -fast_TempEst mp_file.tsv -raxml

where:

  • mp_file.tsv is the mapping file which have genome names in 1st column and the corresponding age (years ago) in 2nd column:
GCA_001639275.fna 0
ERR3003614.fna  207.5
ERR3003615.fna  207.5
ERR3003619.fna  207.5
ERR3003621.fna  207.5
ERR3003622.fna  207.5
ERR3003632.fna  207.5
ERR3003637.fna  207.5
ERR3003640.fna  207.5
ERR3003644.fna  207.5
ERR3003646.fna  207.5
ERR3003651.fna  207.5
SRR6877288.fna  835
GCF_000529525.fna 0
GCF_902384065.fna 0
GCF_900289035.fna 0

Output_3 selected_region_concatenation.fna : The alignment concatenation from the selected sub-alignments in FASTA format.

Output_4 raxml: The folder containing inferred maximum likelihood tree.

Output_5 selected_region_concatenation_TempEst.png: The estimation of temporal signal from the data set given.

Temporal signal

Clone this wiki locally