Blaxter Lab, Institute of Evolutionary Biology, University of Edinburgh
Goal: To create blobplots or Taxon-Annotated-GC-Coverage plots (TAGC plots) to visualise the contents of genome assembly data sets as a QC step.
This repository accompanies the paper:
Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots.
Sujai Kumar, Martin Jones, Georgios Koutsovoulos, Michael Clarke, Mark Blaxter
(submitted 2013-10-01 to Frontiers in Bioinformatics and Computational Biology special issue: Quality assessment and control of high-throughput sequencing data).
It contains bash/perl/R scripts for running the analysis presented in the paper: creating a preliminary assembly, then creating and collating GC content, read coverage and taxon annotation for that assembly. The result can be visualised as TAGC plots/blobplots, as in Figure 2a of the paper for Caenorhabditis sp. 5.
Note: This is an update to the code at github.com/sujaikumar/assemblage which was used in my thesis. I could have updated the code in that repository, but enough things have changed (the basic file formats as well) that I thought it made sense to create a new repo. Please use this version from now on.
git clone https://github.com/blaxterlab/blobology.git
# add this directory to your path, e.g.:
# export PATH=$PATH:/path/to/blobology
You also need the following software in your path:
- samtools (tested with version 0.1.19) http://sourceforge.net/projects/samtools/files/samtools/0.1.19/
- R (tested with version 2.15.2)
- ggplot2, an R graphics package (tested with ggplot2_0.9.3.1)
- ABySS - optional if you already have a preliminary assembly from another assembler (tested with version 1.3.6, compiled with mpi) - http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/1.3.6
- fastq-mcf - not needed if you already have quality and adapter trimmed reads (from the ea-utils suite, tested with version 1.1.2-537) - https://ea-utils.googlecode.com/files/ea-utils.1.1.2-537.tar.gz
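Before running the pipeline it can help to confirm that the tools listed above are visible on your path. The following is a minimal sketch, not part of this repository: `check_tools` is a hypothetical helper name, and the command names in the example (`Rscript` for R, `abyss-pe` for ABySS) are assumptions about how those packages are installed on your system.

```shell
#!/usr/bin/env bash
# check_tools NAME...
# Prints one "found:" or "missing:" line per command name and
# returns non-zero if at least one command is not on PATH.
# (Hypothetical helper, not shipped with blobology.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (command names are assumptions, adjust to your install):
#   check_tools samtools Rscript abyss-pe fastq-mcf
```

The optional tools (ABySS, fastq-mcf) can be left out of the call if you already have a preliminary assembly or trimmed reads.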
And the following databases:
- NCBI nt BLAST database (download from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ nt.??.tar.gz)
    wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz"
    for a in nt.*.tar.gz; do tar xzf "$a"; done
    # or, if you have GNU parallel installed (highly recommended):
    # parallel tar xzf ::: nt.*.tar.gz
- NCBI taxonomy dump (download from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz)
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
    tar xzf taxdump.tar.gz
Only the nodes.dmp and names.dmp files are needed (nodes.dmp stores the taxon ids and their parent-child relationships, whereas names.dmp stores their common and scientific names)
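To illustrate how the two files fit together, here is a minimal sketch that walks from one taxon id up to the root, printing the scientific name at each step. It is not the code used by the scripts in this repository; `lineage` is a hypothetical function name, and it assumes the standard dump layout in which fields are separated by tab, pipe, tab, the first two fields of nodes.dmp are the taxon id and its parent id, and the fourth field of names.dmp is the name class.

```shell
#!/usr/bin/env bash
# lineage NODES_DMP NAMES_DMP TAXID
# Prints "taxid<TAB>scientific name" for TAXID and each of its ancestors.
# (Hypothetical helper, not shipped with blobology.)
lineage() {
    awk -F'\t[|]\t' -v start="$3" '
        # First file (nodes.dmp): remember taxon id -> parent id.
        FNR == NR { parent[$1] = $2; next }
        # Second file (names.dmp): keep only scientific names,
        # ignoring synonyms, common names and other name classes.
        $4 ~ /^scientific name/ { name[$1] = $2 }
        END {
            id = start
            # Walk child -> parent; the root is its own parent.
            while (id in parent) {
                print id "\t" name[id]
                if (parent[id] == id) break
                id = parent[id]
            }
        }
    ' "$1" "$2"
}

# Example call (the taxon id is a placeholder):
#   lineage nodes.dmp names.dmp 6239
```

Both files are read fully into memory, which is acceptable for a one-off lookup but slow if repeated for many taxon ids.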
Run blobology.bash from this repository. Comment out the lines that you don't need (e.g., if you prefer using a different assembler for the preliminary assembly, or a different alignment tool for mapping the reads)
Broad overview of the pipeline (Figure 1 in the paper)
Run blobology.bash from this repository. The only things you should really need to change are the read files in the ABySS and Bowtie 2 steps. Even these steps won't be needed if you already have a preliminary assembly and BAM files from aligning your raw reads back to this assembly.
gc_cov_annotate.pl creates a tab-separated values (TSV) text file, which the ggplot2 R script makeblobplot.R then plots.
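As an illustration of the GC content part of such a table, the sketch below computes length and GC fraction per contig from a FASTA file. It is not the implementation in gc_cov_annotate.pl: `gc_per_contig` is a hypothetical function name, the three-column output is an assumption made for this example, and Ns are simply counted towards the length.

```shell
#!/usr/bin/env bash
# gc_per_contig FASTA
# Prints "seqid<TAB>length<TAB>gc_fraction" for every record.
# (Hypothetical helper, not shipped with blobology.)
gc_per_contig() {
    awk '
        # Print the record collected so far, if there is one.
        function emit() {
            if (id != "") printf "%s\t%d\t%.4f\n", id, len, (len ? gc / len : 0)
        }
        /^>/ {
            emit()
            id = substr($1, 2)   # header up to the first whitespace, without ">"
            len = 0; gc = 0
            next
        }
        {
            seq = toupper($0)
            len += length(seq)
            gc  += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases
        }
        END { emit() }
    ' "$1"
}

# Example call (the file name is a placeholder):
#   gc_per_contig assembly.fa > gc.tsv
```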
The Blobsplorer visualiser for the TSV file created above was coded by Martin Jones, and is available at github.com/mojones/blobsplorer
See separate_reads.bash from this repository for the commands that were used to remove contaminants from the Caenorhabditis sp. 5 preliminary assembly.
Your own data set will require you to devise your own positive (to keep contigs, and reads, of interest) and negative (to discard contigs and reads belonging to contaminants) filters.
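A positive or negative filter of this kind can be sketched as a selection on the annotation table. The example below is only an illustration and is not taken from separate_reads.bash: `select_contigs` is a hypothetical function name, and the column layout (contig id in column 1, taxon label in column 5, one header row) is an assumption that you should adapt to your own TSV.

```shell
#!/usr/bin/env bash
# select_contigs TSV MODE TAXON
#   MODE=keep : print ids of contigs annotated as TAXON (positive filter)
#   MODE=drop : print ids of contigs NOT annotated as TAXON (negative filter)
# Assumes: header row, contig id in column 1, taxon label in column 5.
# (Hypothetical helper and hypothetical column layout.)
select_contigs() {
    awk -F'\t' -v mode="$2" -v taxon="$3" '
        NR == 1 { next }                          # skip the header row
        mode == "keep" && $5 == taxon { print $1 }
        mode == "drop" && $5 != taxon { print $1 }
    ' "$1"
}

# Example calls (file name and taxon labels are placeholders):
#   select_contigs blobplot.tsv keep Nematoda       > keep_ids.txt
#   select_contigs blobplot.tsv drop Proteobacteria > clean_ids.txt
```

In practice a filter usually combines taxon annotation with GC and coverage thresholds, so treat the single-column test here as a starting point.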
If you want to use the extensively modified version of the blobology toolset R script (makeblobplot.R), use the new version, makeblobplot_LB_2.0.R. Changes made in comparison to the original R script:
- additional script arguments: sequence length cutoff and plot title
- legend now includes the number of assigned sequences in each bin
- added colours to the plot
- added a new bin "below threshold"
- apply sequence cutoff steps before nr-of-colours cutoff