This script maps metagenomic reads to bacterial reference genomes and quantifies nucleotide variation along the entire genome (i.e. the read depth and observed alleles at each position). After running this script for multiple samples, you can perform pooled-sample, core-genome SNP calling. You can either target one or more specific species (--species_id), or provide this script with a species abundance file to automatically target abundant species.
The pipeline can be broken down into the following steps:
- build a database of genome sequences for abundant bacterial species (1 representative genome/species)
- map high-quality reads to the database
- generate pileups and count alleles at each genomic site
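The three steps above can be sketched as the command lines a wrapper might issue. The tool names (bowtie2-build, bowtie2, samtools mpileup) match the steps described here, but the exact paths and flags below are illustrative assumptions, not MIDAS's actual internals:

```python
import shlex

def build_pipeline_commands(outdir, reads1, reads2,
                            db_fasta="rep_genomes.fa", threads=1):
    """Sketch of the pipeline stages as underlying tool invocations.

    File layout and flags are hypothetical; SAM-to-sorted-BAM conversion
    between steps 2 and 3 is omitted for brevity.
    """
    index = f"{outdir}/temp/genomes"
    return [
        # 1) build a bowtie2 index over one representative genome per species
        f"bowtie2-build {db_fasta} {index}",
        # 2) map high-quality reads to the genome database
        f"bowtie2 --very-sensitive -p {threads} -x {index} "
        f"-1 {reads1} -2 {reads2} -S {outdir}/temp/genomes.sam",
        # 3) pile up aligned reads and count alleles at each genomic site
        f"samtools mpileup -f {db_fasta} {outdir}/temp/genomes.bam",
    ]

cmds = build_pipeline_commands("/tmp/out", "reads_1.fq.gz", "reads_2.fq.gz",
                               threads=4)
for c in cmds:
    print(shlex.split(c)[0])
```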
Usage: run_midas.py snps <outdir> [options]
positional arguments:
outdir Path to directory to store results.
Directory name should correspond to sample identifier
optional arguments:
-h, --help show this help message and exit
--remove_temp Remove intermediate files generated by MIDAS (False).
Useful to reduce disk space of MIDAS output
Pipeline options (choose one or more; default=all):
--build_db Build bowtie2 database of representative genomes
--align Align reads to the genome database
--pileup Run samtools mpileup and count the 4 nucleotide alleles at each genomic site
Database options (if using --build_db):
-d DB Path to reference database
By default, the MIDAS_DB environmental variable is used
--species_cov FLOAT Include species with >X coverage (3.0)
--species_topn INT Include top N most abundant species
--species_id CHAR Include specified species. Separate ids with a comma
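When a species abundance file is supplied instead of --species_id, selection presumably reduces to a coverage filter plus an optional top-N cut. The following sketch assumes a simple species_id-to-coverage mapping; the input format and example ids are made up for illustration:

```python
def select_species(profile, species_cov=3.0, species_topn=None, species_ids=None):
    """Select species to include in the local genome database.

    `profile` maps species_id -> estimated genome coverage (format assumed).
    """
    if species_ids is not None:
        # an explicit list always wins, mirroring --species_id
        return [s for s in species_ids if s in profile]
    # keep species above the coverage threshold (--species_cov)
    kept = [(s, cov) for s, cov in profile.items() if cov >= species_cov]
    # most abundant first
    kept.sort(key=lambda sc: sc[1], reverse=True)
    if species_topn is not None:  # --species_topn
        kept = kept[:species_topn]
    return [s for s, _ in kept]

# invented coverages for three hypothetical species ids
profile = {"Bacteroides_vulgatus_57955": 12.4,
           "Escherichia_coli_58110": 4.1,
           "Prevotella_copri_61740": 1.2}
print(select_species(profile))                  # coverage >= 3.0 keeps two
print(select_species(profile, species_topn=1))  # single most abundant
```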
Read alignment options (if using --align):
-1 M1 FASTA/FASTQ file containing 1st mate if using paired-end reads.
Otherwise FASTA/FASTQ containing unpaired reads.
Can be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2)
-2 M2 FASTA/FASTQ file containing 2nd mate if using paired-end reads.
Can be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2)
--interleaved Reads in the FASTA/FASTQ file specified by -1 are paired and contain both forward AND reverse mates
-s {very-fast,fast,sensitive,very-sensitive}
Bowtie2 alignment speed/sensitivity (very-sensitive)
-m {local,global} Global/local read alignment (global)
-n MAX_READS # reads to use from input file(s) (use all)
-t THREADS Number of threads to use (1)
Pileup options (if using --pileup):
--mapid FLOAT Discard reads with alignment identity < MAPID (94.0)
--mapq INT Discard reads with mapping quality < MAPQ (20)
--baseq INT Discard bases with quality < BASEQ (30)
--readq INT Discard reads with mean quality < READQ (20)
--aln_cov FLOAT Discard reads with alignment coverage < ALN_COV (0.75)
--trim INT Trim N base-pairs from 3'/right end of read (0)
--discard Discard discordant read-pairs (False)
--baq Enable BAQ: per-base alignment quality (False)
--adjust_mq Adjust MAPQ (False)
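The read-level pileup filters above amount to a per-read predicate. This sketch shows how --mapid, --mapq, --readq, and --aln_cov would combine; the field names on the read record are assumptions, not MIDAS's data model:

```python
from dataclasses import dataclass

@dataclass
class AlignedRead:
    mapq: int          # mapping quality reported by the aligner
    mean_baseq: float  # mean base quality over the read
    matches: int       # matching bases within the aligned region
    aligned_len: int   # length of the aligned portion of the read
    read_len: int      # full read length

def keep_read(r, mapid=94.0, mapq=20, readq=20.0, aln_cov=0.75):
    """Return True if the read passes the default pileup filters."""
    identity = 100.0 * r.matches / r.aligned_len  # percent identity (--mapid)
    coverage = r.aligned_len / r.read_len         # alignment coverage (--aln_cov)
    return (identity >= mapid and r.mapq >= mapq
            and r.mean_baseq >= readq and coverage >= aln_cov)

good = AlignedRead(mapq=42, mean_baseq=35.0, matches=98, aligned_len=100, read_len=100)
clipped = AlignedRead(mapq=42, mean_baseq=35.0, matches=60, aligned_len=62, read_len=100)
print(keep_read(good))     # True: 98% identity, full-length alignment
print(keep_read(clipped))  # False: only 62% of the read aligned
```

--baseq, by contrast, is applied per base within a retained read rather than to the read as a whole.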
Examples:
1) run entire pipeline using defaults:
run_midas.py snps /path/to/outdir -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz
2) run entire pipeline for a specific species:
run_midas.py snps /path/to/outdir --species_id Bacteroides_vulgatus_57955 -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz
3) just align reads, use faster alignment, only use the first 10M reads, use 4 CPUs:
run_midas.py snps /path/to/outdir --align -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz -s very-fast -n 10000000 -t 4
4) just count variants, keep reads with >=95% alignment identity and keep bases with quality scores >=35:
run_midas.py snps /path/to/outdir --pileup --mapid 95 --baseq 35
Output: directory of per-species output files; files are tab-delimited, gzip-compressed, with header
species.txt: list of species_ids included in the local database
summary.txt: tab-delimited with header; summarizes alignment results per species
log.txt: log file containing parameters used
temp: directory of intermediate files; run with --remove_temp to remove these files
output/<species_id>.snps.gz:
- ref_id: id of reference scaffold/contig/genome
- ref_pos: position in reference scaffold (1-indexed)
- ref_allele: reference nucleotide
- depth: number of mapped reads
- count_a: count of A allele
- count_c: count of C allele
- count_g: count of G allele
- count_t: count of T allele
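A <species_id>.snps.gz file in the column layout above can be consumed row by row; for example, to get the frequency of the most common non-reference allele at each site. The rows below are invented for illustration; for a real file, replace the StringIO with gzip.open(path, "rt"):

```python
import io

# a made-up two-row example in the documented column layout
RAW = ("ref_id\tref_pos\tref_allele\tdepth\tcount_a\tcount_c\tcount_g\tcount_t\n"
       "scaffold_1\t101\tA\t40\t37\t0\t3\t0\n"
       "scaffold_1\t102\tC\t38\t0\t38\t0\t0\n")

def iter_sites(handle):
    """Yield (ref_id, ref_pos, depth, alt_freq) for each genomic site."""
    header = handle.readline().rstrip("\n").split("\t")
    for line in handle:
        row = dict(zip(header, line.rstrip("\n").split("\t")))
        depth = int(row["depth"])
        ref = row["ref_allele"].lower()
        # highest count among the three non-reference alleles
        alt = max(int(row[f"count_{b}"]) for b in "acgt" if b != ref)
        yield row["ref_id"], int(row["ref_pos"]), depth, alt / depth

for site in iter_sites(io.StringIO(RAW)):
    print(site)
# ('scaffold_1', 101, 40, 0.075)  <- 3 of 40 reads carry G
# ('scaffold_1', 102, 38, 0.0)
```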
summary.txt:
- species_id: species id
- genome_length: number of base pairs in representative genome
- covered_bases: number of reference sites with at least 1 mapped read
- fraction_covered: proportion of reference sites with at least 1 mapped read
- mean_coverage: average read-depth across reference sites with at least 1 mapped read
- aligned_reads: number of aligned reads BEFORE quality filtering
- mapped_reads: number of aligned reads AFTER quality filtering
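The summary fields are related: fraction_covered is covered_bases divided by genome_length, and mean_coverage averages read depth over covered sites only (not the whole genome). A sketch with made-up depths:

```python
def summarize(depths, genome_length):
    """Compute summary.txt-style statistics from per-site read depths.

    `depths` holds the depth at each covered reference site; sites with
    zero depth are simply absent. Values here are invented for illustration.
    """
    covered_bases = len(depths)
    fraction_covered = covered_bases / genome_length
    # mean depth across *covered* sites only
    mean_coverage = sum(depths) / covered_bases if covered_bases else 0.0
    return covered_bases, fraction_covered, mean_coverage

depths = [12, 8, 20, 10]  # four covered sites on a toy 10 bp genome
print(summarize(depths, genome_length=10))  # (4, 0.4, 12.5)
```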
- Memory usage will depend on the number of species you search and the number of reference genomes sequenced per species.
- In practice, peak memory usage should not exceed 3.5 GB for most samples.
- Speed will likewise depend on the number of species you search and the number of sequenced reference genomes per species.
- For a single species with 1 reference genome, expect ~16,000 reads/second.
- Use -n and -t to increase throughput.