fix typos and use cpu_count for vsearch num threads #94

Open: wants to merge 1 commit into base: master
14 changes: 7 additions & 7 deletions docs/overview.md
100644 → 100755
@@ -1,13 +1,13 @@
# Overview

-MIDAS is an integrated pipeline that leverages >30,000 reference genomes to estimate bacterial species abundance and strain-level genomic variation, including gene content and SNPs, from shotgun metagnomes. Given a metagenome, MIDAS produces the following:
+MIDAS is an integrated pipeline that leverages >30,000 reference genomes to estimate bacterial species abundance and strain-level genomic variation, including gene content and SNPs, from shotgun metagenomes. Given a metagenome, MIDAS produces the following:

* Species relative abundance profile
* Nucleotide variants in abundant species (e.g. >10x depth)
* Gene copy number variants in abundant species (e.g. >5x depth)

## How it works

<b>Reference database:</b> MIDAS relies on a reference database that contains 31,007 bacterial genomes. These genomes were hierarchically clustered into 5,952 species groups based on the sequence similarity between genomes at 30 universal genes. For each species, we extracted three genomic features: 15 universal-single-copy genes, a representative genome, and a pan-genome [...read more](ref_db.md).

<img src="../images/database.jpg" width="700" align="middle"/>
@@ -24,15 +24,15 @@ MIDAS is an integrated pipeline that leverages >30,000 reference genomes to esti
Before going through the trouble of installing and running the program, it's worth considering these things:

1. MIDAS requires at least one sequenced genome to quantify the abundance or population genomics of a species. Most species from the human microbiome meet this critereon, but species from other communities may not. For poorly characterized communities, try [building your own reference database](build_db.md).
-2. MIDAS does NOT estimate the abundance of strains. It simply identifies species abundance and strain-level genomic variants (SNPs and CNVs).
-3. Strain-level phylogenetic trees are based on consensus alleles. This means that there will be only one leaf per species per sample. These trees are accurate when within-sample intra-species heterogentity is low (i.e. one dominant strain) but may not be accurate when there is a mixture of two or more common strains. You can use SNPs to quantify the level of within-sample heterogeneity.
-4. The pan-genomes of many species are imcomplete. If you're concerned about this, try [building your own reference database](build_db.md).
+2. MIDAS does NOT estimate the abundance of strains. It simply identifies species abundance and strain-level genomic variants (SNPs and CNVs).
+3. Strain-level phylogenetic trees are based on consensus alleles. This means that there will be only one leaf per species per sample. These trees are accurate when within-sample intra-species heterogeneity is low (i.e. one dominant strain) but may not be accurate when there is a mixture of two or more common strains. You can use SNPs to quantify the level of within-sample heterogeneity.
+4. The pan-genomes of many species are incomplete. If you're concerned about this, try [building your own reference database](build_db.md).
5. The default MIDAS database is 17Gb and requires 34Gb of free space to decompress

## Examples
<img src="../images/enrichment.jpg" width="600" align="middle"/>
-<b>Comparative genomics of <i>Bacteroides ovatus</i> strains across host microbiomes</b>
+<b>Comparative genomics of <i>Bacteroides ovatus</i> strains across host microbiomes</b>
<sub> <b>A)</b> Presence or absence of genes in the <i>Bacteroides ovatus</i> pangenome across human faecal metagenomes. Column colors indicate whether a gene is core (blue; occurs in >95% of samples), auxiliary (red; occurs in 1-95% of samples ), or absent (green; occurs in < 1% of samples). <b>B)</b> Gene set enrichment analysis identifies functions overrepresented in the core genome, auxiliary genome, and genes that only occur in reference genomes.</sub>

## Next step
-[Install or update software](docs/install.md)
+[Install or update software](install.md)
3 changes: 2 additions & 1 deletion test/test_midas.py
@@ -6,6 +6,7 @@
import subprocess
import sys
from distutils.version import StrictVersion
+from multiprocessing import cpu_count

def run(command):
    """ run shell command & return unix exit code """
@@ -147,7 +148,7 @@ class _15_BuildDB(unittest.TestCase):
     def test_class(self):
         command = 'tar -zxvf genomes.tar.gz'
         err, code = run(command)
-        command = 'build_midas_db.py genomes genomes.mapfile db --threads 10'
+        command = 'build_midas_db.py genomes genomes.mapfile db --threads ' + str(cpu_count())
         err, code = run(command)
         self.assertTrue(code==0, msg=err)
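
A note on the `--threads` change: `cpu_count()` reports the number of CPUs in the machine, which can be larger than what the test process is actually allowed to use, and it may raise `NotImplementedError` on some platforms. A minimal sketch of a more defensive variant follows. The helper name `build_db_command` and its `max_threads` parameter are hypothetical and not part of this PR; only `build_midas_db.py` and `--threads` come from the diff.

```python
from multiprocessing import cpu_count


def build_db_command(indir, mapfile, outdir, max_threads=None):
    """Assemble the build_midas_db.py command string (hypothetical helper).

    Falls back to a single thread when the CPU count cannot be determined,
    and optionally caps the thread count at max_threads.
    """
    try:
        threads = cpu_count()
    except NotImplementedError:
        # cpu_count() may raise this when the count is unavailable
        threads = 1
    if max_threads is not None:
        threads = min(threads, max_threads)
    threads = max(1, threads)  # never pass a non-positive thread count
    return 'build_midas_db.py %s %s %s --threads %d' % (indir, mapfile, outdir, threads)
```

In the test this would be used as `command = build_db_command('genomes', 'genomes.mapfile', 'db')`, which keeps the same command line as the diff while making the fallback explicit.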
