
User manual

Kun Huang edited this page Mar 8, 2021 · 25 revisions

MetaClock

MetaClock is an integrated framework for reconstructing the time-resolved evolutionary history of microbiome species using large-scale (meta)genomic data from ancient and contemporary populations.

Installation

Of the three installation options below, we strongly recommend installing MetaClock through Conda, which avoids having to manage dependencies by hand.

1. Conda environment (recommended)

Note: we strongly suggest installing MetaClock in a new, isolated conda environment so that Conda can resolve the dependencies automatically.

conda create -n "metaclock" -c bioconda metaclock

2. PyPi

pip install metaclock

3. Repository from GitHub

git clone https://github.com/SegataLab/metaclock.git
cd metaclock
python setup.py install 

Solutions 2 and 3 both require the dependencies to be installed separately.
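After a pip or GitHub installation, a quick check that the external tools are reachable can save a failed run. The sketch below is illustrative only: it looks up the executables named elsewhere in this manual (metaclock_mac, bowtie2, blastn) on the PATH. The full dependency list is not given here, so the TOOLS list is an assumption and is probably incomplete.

```python
import shutil

# Executable names taken from this manual. The complete dependency list is an
# assumption; check the MetaClock repository for the authoritative one.
TOOLS = ["metaclock_mac", "bowtie2", "blastn"]


def find_tools(names):
    """Map each executable name to its full path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Any tool reported as NOT FOUND needs to be installed, or its environment activated, before running MetaClock.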

Constructing genome alignment

metaclock_mac <configuration_file> -r <reference_genome> \
    --ancient_metagenomes <ancient_metagenomic_data> \
    --modern_metagenomes <modern_metagenomic_data> \
    --genome_assemblies <genome_assemblies> \
    --output_dir <output_directory> \
    --intermediate_dir <intermediate_directory>
  • <configuration_file> is a mandatory input used to configure parameters (see the detailed description below). A template configuration file, configs.json, can be generated automatically with metaclock_mac_template_configs and then customized directly by the user.
  • <reference_genome> is a mandatory input, usually a single-amplified genome (SAG) or a high-quality metagenome-assembled genome (MAG) whose whole sequence is stored in a FASTA file. It represents a species from a microbiome. The input path can be specified either on the command line or in the configuration file.
  • <ancient_metagenomic_data> is an optional input: a folder of sub-folders, each of which holds the metagenomic reads of one ancient sample as fastq files (plain, compressed as .gz or .bzip2, or a mix). The input path can also be specified in the corresponding section of the configuration file.
  • <modern_metagenomic_data> is an optional input: a folder of sub-folders, each of which holds the metagenomic reads of one contemporary sample as fastq files (plain, compressed as .gz or .bzip2, or a mix). The input path can also be specified in the corresponding section of the configuration file.
  • <genome_assemblies> is an optional input: a folder containing multiple assembled genomes as FASTA files, each file holding the nucleotide sequences of one genome. The assembled genomes mostly come from contemporary samples, but in rare cases ancient assembled genomes are also available. If ancient assembled genomes are used as input here, please be cautious about genome quality. The input path can also be specified in the corresponding section of the configuration file.
  • <output_directory> is an optional argument: the folder in which the outputs of metaclock_mac are stored. If no folder path is given, a new folder is created in the working directory by default.
  • <intermediate_directory> is an optional argument: the folder in which the intermediate files generated during the run are stored. If no folder path is given, a new folder is created in the working directory by default.
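The folder layout expected for <ancient_metagenomic_data> and <modern_metagenomic_data> (one sub-folder per sample, each holding that sample's reads) can be checked before a long run. This is a minimal sketch, not part of MetaClock: the function name is hypothetical, and recognising read files by a .fastq or .fq component in the file name is an assumption about naming.

```python
from pathlib import Path

# Assumption: read files carry ".fastq" or ".fq" somewhere in their suffixes,
# e.g. reads.fastq, reads.fastq.gz, reads.fq.bz2.
READ_MARKERS = {".fastq", ".fq"}


def samples_without_reads(root):
    """Return the names of sample sub-folders that hold no FASTQ-like file."""
    missing = []
    for sample_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        has_reads = any(
            f.is_file() and READ_MARKERS & set(f.suffixes)
            for f in sample_dir.iterdir()
        )
        if not has_reads:
            missing.append(sample_dir.name)
    return missing
```

Calling samples_without_reads on the folder passed to --ancient_metagenomes or --modern_metagenomes returns an empty list when every sample sub-folder contains at least one read file.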

Note: although <ancient_metagenomic_data>, <modern_metagenomic_data> and <genome_assemblies> are three independent, optional inputs, at least one of the three must be given in order to construct a genome alignment.
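When metaclock_mac is launched from a script, the rule that at least one of the three inputs must be present can be enforced before the command is run. The sketch below only assembles an argument list from the flags shown above. The function name and the example paths are hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_metaclock_mac_command(config, reference, ancient=None, modern=None,
                                assemblies=None, output_dir=None,
                                intermediate_dir=None):
    """Assemble a metaclock_mac argument list from the documented flags."""
    if not any([ancient, modern, assemblies]):
        raise ValueError(
            "give at least one of: ancient metagenomes, modern metagenomes, "
            "genome assemblies"
        )
    cmd = ["metaclock_mac", str(config), "-r", str(reference)]
    optional = [
        ("--ancient_metagenomes", ancient),
        ("--modern_metagenomes", modern),
        ("--genome_assemblies", assemblies),
        ("--output_dir", output_dir),
        ("--intermediate_dir", intermediate_dir),
    ]
    for flag, value in optional:
        if value:  # skip inputs that were not supplied
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_metaclock_mac_command(
        "configs.json", "reference.fna",
        ancient="ancient_samples", output_dir="results",
    )
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the paths point at real data.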

Other optional settings

  • --clean reruns the analysis from the intermediate BAM files. Setting this flag skips re-mapping the reads, which saves considerable time when the metagenomic data are large.
  • --authentication authenticates the ancient origin of metagenomic reads from ancient specimens.
  • --SNV_rate estimates pairwise SNV rates from the genome sequences in the reconstructed alignment.
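To make the idea of a pairwise SNV rate concrete, the sketch below computes one on a toy alignment. It is not MetaClock's implementation: the definition used here (differing positions divided by positions where both sequences carry A, C, G or T) is an assumption, and the function names are hypothetical.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def snv_rate(seq_a, seq_b):
    """Fraction of differing positions among those resolved in both sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Gaps, N and other ambiguity codes are skipped (an assumption).
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            differing += a != b
    return differing / compared if compared else None


def pairwise_snv_rates(alignment):
    """Return {(name_1, name_2): rate} for every pair in a name -> sequence dict."""
    return {
        (n1, n2): snv_rate(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


if __name__ == "__main__":
    toy = {"s1": "ACGT-A", "s2": "ACCTNA", "s3": "ACGTAA"}
    for pair, rate in pairwise_snv_rates(toy).items():
        print(pair, rate)
```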

Configuration description

The configuration file configs.json is divided into three sections, ancient_reads, modern_reads and contigs, whose parameter settings correspond to the inputs <ancient_metagenomic_data>, <modern_metagenomic_data> and <genome_assemblies> respectively.

{
  "ancient_reads": {
    "input_type":"reads",
    "reference_genome":"",
    "age_type":1,
    "intermediate":"",
    "samples":"",
    "parameter_set":{
      "search_report_mode":"-k,5",
      "bowtie2_threads":10,
      "minimum_mapping_quality":30,
      "minimum_mapping_length":30,
      "maximum_snp_edit_distance":0.03,
      "nproc":5,
      "minimum_coverage":5,
      "trim_distance":"5:5",
      "dominant_allele_frequency":0.8,
      "output_trimmed_reads":0
    }
  },
  "modern_reads": {
    "input_type":"reads",
    "reference_genome":"",
    "age_type":2,
    "intermediate":"",
    "samples":"",
    "parameter_set":{
      "search_report_mode":"-k,1",
      "bowtie2_threads":15,
      "minimum_mapping_quality":30,
      "minimum_mapping_length":30,
      "maximum_snp_edit_distance":0.03,
      "nproc":5,
      "minimum_coverage":5,
      "dominant_allele_frequency":0.8
    }
  },
  "contigs": {
    "input_type":"contigs",
    "reference_genome":"",
    "intermediate":"",
    "samples":"",
    "parameter_set":{
      "homolog_length":500,
      "homolog_identity":95,
      "blastn_threads":10
    }
  }
}
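The template can be edited by hand, but filling in the path fields from a script keeps several runs consistent. The sketch below works on a cut-down stand-in for the template rather than the real file. The function name and paths are hypothetical, and how MetaClock treats sections whose samples field is left empty is not covered here.

```python
import copy
import json

# Cut-down stand-in for the generated configs.json (two sections, few keys).
TEMPLATE = {
    "ancient_reads": {
        "input_type": "reads", "reference_genome": "", "age_type": 1,
        "intermediate": "", "samples": "",
        "parameter_set": {"trim_distance": "5:5"},
    },
    "contigs": {
        "input_type": "contigs", "reference_genome": "",
        "intermediate": "", "samples": "",
        "parameter_set": {"homolog_length": 500},
    },
}


def fill_config(template, reference, samples_by_section,
                intermediate="intermediates"):
    """Return a copy of the template with path fields set for chosen sections."""
    config = copy.deepcopy(template)  # leave the template itself untouched
    for section, samples in samples_by_section.items():
        if section not in config:
            raise KeyError(f"unknown section: {section}")
        config[section]["reference_genome"] = str(reference)
        config[section]["intermediate"] = str(intermediate)
        config[section]["samples"] = str(samples)
    return config


if __name__ == "__main__":
    filled = fill_config(TEMPLATE, "reference.fna",
                         {"ancient_reads": "ancient_samples"})
    print(json.dumps(filled, indent=2))
```

With the real template, the dictionary would come from json.load on the generated configs.json and the result would be written back with json.dump.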

1. Ancient data input as reads (ancient_reads):

This section is specific to the configuration of processing sequencing reads from ancient (meta)genomic samples.

  • input_type <string> [optional]:   The type of genomic information. Default: "reads".
  • reference_genome <string> [required]:   The reference genome representative of a microbial species, in FASTA format. This parameter is equivalent to --reference in the --help menu.
  • age_type <int> [optional]:   Age type of the sample (1 indicates ancient and 2 indicates contemporary). Default: 1.
  • intermediate <string> [optional]:   The directory for storing intermediate files. Default: "intermediates" in the current directory. This parameter is equivalent to --intermediate_dir in the --help menu.
  • samples <string> [required]:   The directory holding the folders of ancient metagenomic reads. This parameter is equivalent to --ancient_metagenomes in the --help menu.
  • search_report_mode <"-k,int"> [optional]:   The upper limit on the number of alignments Bowtie 2 reports for each read. Default: "-k,5". (A higher value requires more computation time but returns more accurate alignments.)
  • bowtie2_threads <int> [optional]:   The number of threads Bowtie 2 uses to process each sample. Default: 1.
  • minimum_mapping_quality <int> [optional]:   Bases with a quality score lower than the specified value are ignored when reconstructing the genome alignment. Default: 30.
  • minimum_mapping_length <int> [optional]:   Aligned reads shorter than the specified length are ignored when reconstructing the genome alignment. Default: 30.
  • maximum_snp_edit_distance <float> [optional]:   Reads with an SNP edit distance greater than the specified value are ignored when reconstructing the genome alignment. Default: 0.03.
  • nproc <int> [optional]:   The number of processors used to handle multiple samples in parallel. Default: 1.
  • minimum_coverage <int> [optional]:   Positions with a coverage depth lower than the specified value are ignored when reconstructing the genome alignment. Default: 5.
  • trim_distance <"int:int"> [optional]:   The number of nucleotides to trim from the two ends of ancient reads. These positions are the most likely to carry post-mortem damage. Default: "5:5".
  • dominant_allele_frequency <float> [optional]:   Positions where the frequency of the dominant allele is lower than the specified value are ignored when reconstructing the genome alignment. Default: 0.8.
  • output_trimmed_reads <int> [optional]:   Specify 1 to output the aligned, trimmed reads used in reconstructing the genome alignment; specify 0 to skip this output. Default: 0.
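The "int:int" format of trim_distance is easy to get wrong, so a small illustration may help. The sketch below parses the value and trims a read sequence accordingly. It is not MetaClock's code: reading the first number as the 5' end and the second as the 3' end is an assumption, and both function names are hypothetical.

```python
def parse_trim_distance(value):
    """Split a "left:right" string such as "5:5" into two non-negative ints."""
    left, right = (int(part) for part in value.split(":"))
    if left < 0 or right < 0:
        raise ValueError("trim distances must be non-negative")
    return left, right


def trim_read(sequence, trim_distance="5:5"):
    """Drop `left` bases from the start and `right` bases from the end."""
    # Assumption: the first number applies to the 5' end, the second to the 3' end.
    left, right = parse_trim_distance(trim_distance)
    end = len(sequence) - right
    # Reads too short to survive trimming come back empty.
    return sequence[left:end] if end > left else ""


if __name__ == "__main__":
    print(trim_read("AACCGGTTAACCGGTT", "5:5"))
```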


2. Modern data input as reads (modern_reads):

This section is specific to the configuration of processing sequencing reads from modern (meta)genomic samples.

  • input_type <string> [optional]:   The type of genomic information. Default: "reads".

  • reference_genome <string> [required]:   The reference genome representative of a microbial species, in FASTA format. This parameter is equivalent to --reference in the --help menu.
  • age_type <int> [optional]:   Age type of the sample (1 indicates ancient and 2 indicates contemporary). Default: 2.
  • intermediate <string> [optional]:   The directory for storing intermediate files. Default: "intermediates" in the current directory. This parameter is equivalent to --intermediate_dir in the --help menu.
  • samples <string> [required]:   The directory holding the folders of modern metagenomic reads. This parameter is equivalent to --modern_metagenomes in the --help menu.
  • search_report_mode <"-k,int"> [optional]:   The upper limit on the number of alignments Bowtie 2 reports for each read. Default: "-k,5". (A higher value requires more computation time but returns more accurate alignments.)
  • bowtie2_threads <int> [optional]:   The number of threads Bowtie 2 uses to process each sample. Default: 1.
  • minimum_mapping_quality <int> [optional]:   Bases with a quality score lower than the specified value are ignored when reconstructing the genome alignment. Default: 30.
  • minimum_mapping_length <int> [optional]:   Aligned reads shorter than the specified length are ignored when reconstructing the genome alignment. Default: 30.
  • maximum_snp_edit_distance <float> [optional]:   Reads with an SNP edit distance greater than the specified value are ignored when reconstructing the genome alignment. Default: 0.03.
  • nproc <int> [optional]:   The number of processors used to handle multiple samples in parallel. Default: 1.
  • minimum_coverage <int> [optional]:   Positions with a coverage depth lower than the specified value are ignored when reconstructing the genome alignment. Default: 5.
  • dominant_allele_frequency <float> [optional]:   Positions where the frequency of the dominant allele is lower than the specified value are ignored when reconstructing the genome alignment. Default: 0.8.
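The interaction between minimum_coverage and dominant_allele_frequency can be illustrated on a single alignment column. The sketch below takes the bases piled up at one position and either returns the dominant base or None when the position would be ignored. It is a simplified illustration, not MetaClock's implementation: the function name is hypothetical, and counting only A, C, G and T towards depth is an assumption.

```python
from collections import Counter


def call_position(bases, minimum_coverage=5, dominant_allele_frequency=0.8):
    """Return the dominant base at one position, or None if it is ignored."""
    # Assumption: only unambiguous bases count towards the coverage depth.
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth == 0 or depth < minimum_coverage:
        return None  # coverage too low
    base, count = counts.most_common(1)[0]
    if count / depth < dominant_allele_frequency:
        return None  # no allele is dominant enough
    return base


if __name__ == "__main__":
    for column in ("AAAAA", "AAAA", "AAAAAC", "AAACC"):
        print(column, call_position(column))
```

With the default thresholds, a column such as AAAA is dropped for low coverage and AAACC is dropped because the dominant allele reaches only 60%.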


3. Assembled genomes input as contigs (contigs):

This section is specific to the configuration of processing assembled genomes as input.

  • input_type <string> [optional]:   The type of genomic information. Default: "contigs".

  • reference_genome <string> [required]:   The reference genome representative of a microbial species, in FASTA format. This parameter is equivalent to --reference in the --help menu.
  • intermediate <string> [optional]:   The directory for storing intermediate files. Default: "". This parameter is equivalent to --intermediate_dir in the --help menu.
  • samples <string> [required]:   The directory holding the assembled contig files. Default: "". This parameter is equivalent to --genome_assemblies in the --help menu.
  • homolog_length <int> [optional]:   The minimum length the aligned part of a contig must have to be considered a homolog and used in reconstructing the genome alignment. Default: 500.
  • homolog_identity <float> [optional]:   The minimum identity the aligned part of a contig must have to be considered a homolog and used in reconstructing the genome alignment. Default: 95.0.
  • blastn_threads <int> [optional]:   The number of threads used for blastn. Note: please use fewer than 8 CPUs if you are using a blastn version later than 2.7.0.
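The effect of homolog_length and homolog_identity can be pictured as a filter over blastn hits. The sketch below applies the two thresholds to lines in blastn's standard tabular format (-outfmt 6), where the third column is percent identity and the fourth is alignment length. Whether MetaClock works on this output format internally is an assumption, and the function name is hypothetical.

```python
def filter_homologs(lines, homolog_length=500, homolog_identity=95.0):
    """Keep blastn -outfmt 6 hits that meet both length and identity cut-offs."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        # outfmt 6 columns: qseqid sseqid pident length mismatch gapopen ...
        identity, length = float(fields[2]), int(fields[3])
        if length >= homolog_length and identity >= homolog_identity:
            kept.append((fields[0], fields[1], identity, length))
    return kept


if __name__ == "__main__":
    # Made-up hits for illustration only.
    hits = [
        "contig_1\tref\t98.20\t1200\t21\t0\t1\t1200\t501\t1700\t0.0\t2100",
        "contig_2\tref\t99.00\t300\t3\t0\t1\t300\t10\t309\t1e-150\t540",
        "contig_3\tref\t91.50\t900\t76\t0\t1\t900\t2000\t2899\t0.0\t1200",
    ]
    for hit in filter_homologs(hits):
        print(hit)
```

In this toy input only contig_1 passes: contig_2 is shorter than 500 bp and contig_3 falls below 95% identity.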
