This script maps metagenomic reads to bacterial reference genomes and quantifies nucleotide variation along the entire genome (i.e. the read depth and observed alleles at each position). After running this script for multiple samples, you can perform pooled-sample, core-genome SNP calling. You can either target one or more specific species (--species_id), or provide this script with a species abundance file to automatically target abundant species.
The pipeline can be broken down into the following steps:
- build a database of genome sequences for abundant bacterial species (1 representative genome/species)
- map high-quality reads to the database
- generate pileups and count alleles at each genomic site
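The three steps above can be sketched as the command lines a wrapper might issue. The tool names (bowtie2-build, bowtie2, samtools mpileup) match the steps described here, but the exact paths and flags below are illustrative assumptions, not MIDAS's actual internals:

```python
import shlex

def build_pipeline_commands(outdir, reads1, reads2,
                            db_fasta="rep_genomes.fa", threads=1):
    """Sketch of the pipeline stages as underlying tool invocations.

    File layout and flags are hypothetical; SAM-to-sorted-BAM conversion
    between steps 2 and 3 is omitted for brevity.
    """
    index = f"{outdir}/temp/genomes"
    return [
        # 1) build a bowtie2 index over one representative genome per species
        f"bowtie2-build {db_fasta} {index}",
        # 2) map high-quality reads to the genome database
        f"bowtie2 --very-sensitive -p {threads} -x {index} "
        f"-1 {reads1} -2 {reads2} -S {outdir}/temp/genomes.sam",
        # 3) pile up aligned reads and count alleles at each genomic site
        f"samtools mpileup -f {db_fasta} {outdir}/temp/genomes.bam",
    ]

cmds = build_pipeline_commands("/tmp/out", "reads_1.fq.gz", "reads_2.fq.gz",
                               threads=4)
for c in cmds:
    print(shlex.split(c)[0])
```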
Usage: run_midas.py snps <outdir> [options]
positional arguments:
outdir Path to directory to store results.
Directory name should correspond to sample identifier
optional arguments:
-h, --help show this help message and exit
--remove_temp Remove intermediate files generated by MIDAS (False).
Useful to reduce disk space of MIDAS output
Pipeline options (choose one or more; default=all):
--build_db Build bowtie2 database of representative genomes
--align Align reads to the genome database
--pileup Run samtools mpileup and count the 4 nucleotide alleles at each genomic site
Database options (if using --build_db):
-d DB Path to reference database
By default, the MIDAS_DB environmental variable is used
--species_cov FLOAT Include species with >X coverage (3.0)
--species_topn INT Include top N most abundant species
--species_id CHAR Include specified species. Separate ids with a comma
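When a species abundance file is supplied instead of --species_id, selection presumably reduces to a coverage filter plus an optional top-N cut. The following sketch assumes a simple species_id-to-coverage mapping; the input format and example ids are made up for illustration:

```python
def select_species(profile, species_cov=3.0, species_topn=None, species_ids=None):
    """Select species to include in the local genome database.

    `profile` maps species_id -> estimated genome coverage (format assumed).
    """
    if species_ids is not None:
        # an explicit list always wins, mirroring --species_id
        return [s for s in species_ids if s in profile]
    # keep species above the coverage threshold (--species_cov)
    kept = [(s, cov) for s, cov in profile.items() if cov >= species_cov]
    # most abundant first
    kept.sort(key=lambda sc: sc[1], reverse=True)
    if species_topn is not None:  # --species_topn
        kept = kept[:species_topn]
    return [s for s, _ in kept]

# invented coverages for three hypothetical species ids
profile = {"Bacteroides_vulgatus_57955": 12.4,
           "Escherichia_coli_58110": 4.1,
           "Prevotella_copri_61740": 1.2}
print(select_species(profile))                  # coverage >= 3.0 keeps two
print(select_species(profile, species_topn=1))  # single most abundant
```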
Read alignment options (if using --align):
-1 M1 FASTA/FASTQ file containing 1st mate if using paired-end reads.
Otherwise FASTA/FASTQ containing unpaired reads.
Can be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2)
-2 M2 FASTA/FASTQ file containing 2nd mate if using paired-end reads.
Can be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2)
--interleaved Reads in the FASTA/FASTQ file specified by -1 are paired and contain both forward AND reverse mates
-s {very-fast,fast,sensitive,very-sensitive}
Bowtie2 alignment speed/sensitivity (very-sensitive)
-m {local,global} Global/local read alignment (global)
-n MAX_READS # reads to use from input file(s) (use all)
-t THREADS Number of threads to use (1)
Pileup options (if using --pileup):
--mapid FLOAT Discard reads with alignment identity < MAPID (94.0)
--mapq INT Discard reads with mapping quality < MAPQ (20)
--baseq INT Discard bases with quality < BASEQ (30)
--readq INT Discard reads with mean quality < READQ (20)
--aln_cov FLOAT Discard reads with alignment coverage < ALN_COV (0.75)
--trim INT Trim N base-pairs from 3'/right end of read (0)
--discard Discard discordant read-pairs (False)
--baq Enable BAQ: per-base alignment quality (False)
--adjust_mq Adjust MAPQ (False)
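The read-level pileup filters above amount to a per-read predicate. This sketch shows how --mapid, --mapq, --readq, and --aln_cov would combine; the field names on the read record are assumptions, not MIDAS's data model:

```python
from dataclasses import dataclass

@dataclass
class AlignedRead:
    mapq: int          # mapping quality reported by the aligner
    mean_baseq: float  # mean base quality over the read
    matches: int       # matching bases within the aligned region
    aligned_len: int   # length of the aligned portion of the read
    read_len: int      # full read length

def keep_read(r, mapid=94.0, mapq=20, readq=20.0, aln_cov=0.75):
    """Return True if the read passes the default pileup filters."""
    identity = 100.0 * r.matches / r.aligned_len  # percent identity (--mapid)
    coverage = r.aligned_len / r.read_len         # alignment coverage (--aln_cov)
    return (identity >= mapid and r.mapq >= mapq
            and r.mean_baseq >= readq and coverage >= aln_cov)

good = AlignedRead(mapq=42, mean_baseq=35.0, matches=98, aligned_len=100, read_len=100)
clipped = AlignedRead(mapq=42, mean_baseq=35.0, matches=60, aligned_len=62, read_len=100)
print(keep_read(good))     # True: 98% identity, full-length alignment
print(keep_read(clipped))  # False: only 62% of the read aligned
```

--baseq, by contrast, is applied per base within a retained read rather than to the read as a whole.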
Examples:
1) run entire pipeline using defaults:
run_midas.py snps /path/to/outdir -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz
2) run entire pipeline for a specific species:
run_midas.py snps /path/to/outdir --species_id Bacteroides_vulgatus_57955 -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz
3) just align reads, use faster alignment, only use the first 10M reads, use 4 CPUs:
run_midas.py snps /path/to/outdir --align -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz -s very-fast -n 10000000 -t 4
4) just count variants, keep reads with >=95% alignment identity and keep bases with quality scores >=35:
run_midas.py snps /path/to/outdir --pileup --mapid 95 --baseq 35
Output: directory of per-species output files; files are tab-delimited, gzip-compressed, with header
species.txt: list of species_ids included in the local database
summary.txt: tab-delimited with header; summarizes alignment results per species
log.txt: log file containing parameters used
temp: directory of intermediate files; run with --remove_temp to remove these files
output/<species_id>.snps.gz:
- ref_id: id of reference scaffold/contig/genome
- ref_pos: position in reference scaffold (1-indexed)
- ref_allele: reference nucleotide
- depth: number of mapped reads
- count_a: count of A allele
- count_c: count of C allele
- count_g: count of G allele
- count_t: count of T allele
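A <species_id>.snps.gz file in the column layout above can be consumed row by row; for example, to get the frequency of the most common non-reference allele at each site. The rows below are invented for illustration; for a real file, replace the StringIO with gzip.open(path, "rt"):

```python
import io

# a made-up two-row example in the documented column layout
RAW = ("ref_id\tref_pos\tref_allele\tdepth\tcount_a\tcount_c\tcount_g\tcount_t\n"
       "scaffold_1\t101\tA\t40\t37\t0\t3\t0\n"
       "scaffold_1\t102\tC\t38\t0\t38\t0\t0\n")

def iter_sites(handle):
    """Yield (ref_id, ref_pos, depth, alt_freq) for each genomic site."""
    header = handle.readline().rstrip("\n").split("\t")
    for line in handle:
        row = dict(zip(header, line.rstrip("\n").split("\t")))
        depth = int(row["depth"])
        ref = row["ref_allele"].lower()
        # highest count among the three non-reference alleles
        alt = max(int(row[f"count_{b}"]) for b in "acgt" if b != ref)
        yield row["ref_id"], int(row["ref_pos"]), depth, alt / depth

for site in iter_sites(io.StringIO(RAW)):
    print(site)
# ('scaffold_1', 101, 40, 0.075)  <- 3 of 40 reads carry G
# ('scaffold_1', 102, 38, 0.0)
```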
summary.txt:
- species_id: species id
- genome_length: number of base pairs in representative genome
- covered_bases: number of reference sites with at least 1 mapped read
- fraction_covered: proportion of reference sites with at least 1 mapped read
- mean_coverage: average read-depth across reference sites with at least 1 mapped read
- aligned_reads: number of aligned reads BEFORE quality filtering
- mapped_reads: number of aligned reads AFTER quality filtering
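The summary fields are related: fraction_covered is covered_bases divided by genome_length, and mean_coverage averages read depth over covered sites only (not the whole genome). A sketch with made-up depths:

```python
def summarize(depths, genome_length):
    """Compute summary.txt-style statistics from per-site read depths.

    `depths` holds the depth at each covered reference site; sites with
    zero depth are simply absent. Values here are invented for illustration.
    """
    covered_bases = len(depths)
    fraction_covered = covered_bases / genome_length
    # mean depth across *covered* sites only
    mean_coverage = sum(depths) / covered_bases if covered_bases else 0.0
    return covered_bases, fraction_covered, mean_coverage

depths = [12, 8, 20, 10]  # four covered sites on a toy 10 bp genome
print(summarize(depths, genome_length=10))  # (4, 0.4, 12.5)
```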
- Memory usage will depend on the number of species you search and the number of reference genomes sequenced per species.
- In practice, peak memory usage should not exceed 3.5 GB for most samples.
- Speed will likewise depend on the number of species you search and the number of sequenced reference genomes per species.
- For a single species with 1 reference genome, expect ~16,000 reads/second.
- Use -n and -t to increase throughput.