Many methods have been developed to detect transposable element (TE) insertions from whole genome shotgun next-generation sequencing (NGS) data, each of which has different dependencies, run interfaces, and output formats. Here, we have developed a meta-pipeline to download, install and run six available methods for detecting TE insertions in NGS data, which generates output in the UCSC Browser extensible data (BED) format.
The pipeline requires a reference genome in fasta format, a fasta file of consensus TE sequences for the organism, and paired-end sequencing reads in fastq format. Optionally, if the TE sequences in the reference genome have been annotated in detail, you can also supply a GFF file with the locations of known TEs in the reference genome and a tab-delimited hierarchy file linking each of these insertions to the consensus sequence it belongs to (an example of this file is included in the test folder as sac_cer_te_families.tsv).
The six methods are:

- ngs_te_mapper - Linheiro and Bergman (2012)
- PoPoolationTE - Kofler et al. (2012)
- RelocaTE - Robb et al. (2013)
- [TE-locate](http://zendto.gmi.oeaw.ac.at/pickup.php?claimID=Y3tZVfN5xipYyBDN&claimPasscode=NArXMbTjmkorWjSM&emailAddr=te_locate%40gmx.at "Click to go to download location") - Platzer et al. (2012)
- RetroSeq - Keane et al. (2012)
- TEMP - Zhuang et al. (2014)
All of the methods must be run on a Unix-based system with the software dependencies listed per method below. FastQC is optional: if it is not installed, mcclintock skips that step and no quality report for your fastq input appears in the results folder. The versions used to run this pipeline are indicated in parentheses; no guarantee is made that it will function with other versions.
- Optional software for the pipeline
  - FastQC (Command line installation v0.11.2)
  - RepeatMasker (v.4.0.2) (Necessary if no GFF is supplied)
- ngs_te_mapper
  - R (v.3.0.2)
  - BWA (v.0.7.4-r385)
- PoPoolationTE
  - Perl (v.5.14.2)
  - RepeatMasker (v.4.0.2)
  - SAMtools (v.0.1.19-44428cd)
  - BWA (v.0.7.4-r385)
- RelocaTE
  - Perl (v.5.14.2)
  - BioPerl (v.1.006901)
  - SAMtools (v.0.1.19-44428cd)
  - Blat (v.35x1)
  - Bowtie (v.1.0.0)
- TE-locate
  - Perl (v.5.14.2)
  - Java (v.1.6.0_24)
  - BWA (v.0.7.4-r385)
- RetroSeq
  - Perl (v.5.14.2)
  - BEDTools (v.2.17.0)
  - SAMtools (v.0.1.19-44428cd)
  - BCFTools (v.0.1.19-44428cd)
  - Exonerate (v.2.2.0)
  - BWA (v.0.7.4-r385)
- TEMP
  - Perl (v.5.14.2)
  - BioPerl (v.1.006901)
  - BWA (v.0.7.4-r385)
  - SAMtools (v.0.1.19-44428cd)
  - BEDTools (v.2.17.0)
  - twoBitToFa (ucsc-tools v.294)

### Installation

To install the software, first clone the repository:

    git clone git@github.com:bergmanlab/mcclintock.git

Then cd into the project directory and run the script install.sh with no arguments:

    cd mcclintock
    sh install.sh

This will download and unpack all of the TE detection methods and check that the required dependencies are available in your path. Missing dependencies will be reported; you must install them, or make sure they are available in your path, before running the full pipeline.
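As a rough illustration of this kind of dependency check, the sketch below tests whether a set of tools can be found in the path. It is not the logic of install.sh itself: the function name `check_deps` is hypothetical, and the executable names (`bwa`, `samtools`, and so on) are assumed to correspond to the dependency list above.

```shell
#!/bin/sh
# Print the name of every listed tool that cannot be found in the path.
# Returns 0 when all tools are present and 1 otherwise.
# NOTE: check_deps is a hypothetical helper, not part of mcclintock.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call; the executable names are assumptions based on the list above.
check_deps perl bwa samtools bedtools R \
    || echo "Install the tools reported above before running the full pipeline."
```

Running a check like this before install.sh can make it easier to see which method's dependencies are still outstanding.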
### Running on a test dataset

A script is included to run the full pipeline on a test Illumina resequencing dataset from the yeast genome. To run it, change directory into the folder named test and run the script runtest.sh:

    cd test
    sh runtest.sh

This script will download the UCSC sacCer2 yeast reference genome, an annotation of TEs in the yeast reference genome from Carr, Bensasson and Bergman (2012), and a pair of fastq files from SRA, then run the full pipeline.
### Running the pipeline

The pipeline is invoked by running the mcclintock.sh script in the main project folder. This script takes the following options:

- -m : The methods that the user wishes to run (for example adding -m "RelocaTE TEMP ngs_te_mapper" will launch only those three methods). The default behaviour is to run all six methods.
- -r : A reference genome sequence in fasta format. (Required)
- -c : The consensus sequences of the TEs for the species in fasta format. (Required)
- -g : The locations of known TEs in the reference genome in GFF 3 format. This must include a unique ID attribute for every entry. (Optional)
- -t : A tab delimited file with one entry per ID in the GFF file and two columns: the first containing the ID and the second containing the TE family it belongs to. The family should correspond to the names of the sequences in the consensus fasta file. (Optional - required if GFF (option -g) is supplied)
- -b : Retain the sorted and indexed BAM file of the paired end data aligned to the reference genome.
- -i : If this option is specified then all sample specific intermediate files will be removed, leaving only the overall results. The default is to leave sample specific intermediate files (may require large amounts of disk space).
- -1 : The absolute path of the first fastq file of the paired-end reads. The file name should end in _1.fastq. (Required)
- -2 : The absolute path of the second fastq file of the paired-end reads. The file name should end in _2.fastq. (Required)
- -p : The number of processors to use for parallel stages of the pipeline. (Default is 1)
- -h : Prints this help guide.
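
Because the -t file must contain one entry for every ID in the -g file, it can help to compare the two before a run. The following is a minimal sketch of such a check, not part of mcclintock: the function name `missing_families` and the example IDs are hypothetical, and it assumes the ID is stored as an `ID=` attribute in column 9 of the GFF.

```shell
#!/bin/sh
# Print every GFF ID that has no row in the tab-delimited hierarchy file.
# NOTE: missing_families is a hypothetical helper, not part of mcclintock.
missing_families() {
    gff=$1
    tsv=$2
    awk -F '\t' '
        # First file (hierarchy TSV): remember each ID found in column 1.
        FILENAME == ARGV[1] { known[$1] = 1; next }
        # Second file (GFF): skip comment lines and malformed rows.
        /^#/ || NF < 9 { next }
        {
            # Column 9 holds semicolon-separated attributes; find ID=...
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) {
                    id = substr(attrs[i], 4)
                    if (!(id in known)) print id
                }
        }
    ' "$tsv" "$gff"
}

# Example with made-up records: te_2 has no family assigned.
tmp=$(mktemp -d)
printf 'chrI\t.\ttransposable_element\t100\t600\t.\t+\t.\tID=te_1\n' > "$tmp/te.gff"
printf 'chrI\t.\ttransposable_element\t900\t1500\t.\t-\t.\tID=te_2;Name=example\n' >> "$tmp/te.gff"
printf 'te_1\tTY1\n' > "$tmp/families.tsv"
missing_families "$tmp/te.gff" "$tmp/families.tsv"   # prints te_2
rm -r "$tmp"
```

An empty result suggests that every annotated TE has a family assigned. The sketch does not check that the family names match the sequence names in the consensus fasta file.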
Example pipeline run:

    sh mcclintock.sh -m "RelocaTE TEMP ngs_te_mapper" -r reference.fasta -c te_consensus.fasta -g te_locations.gff -t te_families.tsv -1 sample_1.fastq -2 sample_2.fastq -p 2 -i -b

Data created during pre-processing are stored in a folder in the main directory named after the reference genome, with a separate sub-directory for each sample.
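
When several samples share one reference genome, the -1 and -2 options have to be filled in once per read pair, using absolute paths and the _1.fastq/_2.fastq naming described above. The sketch below prints one command per pair without executing anything. The function name `print_runs` is hypothetical, and reference.fasta and te_consensus.fasta are the placeholder names from the example run.

```shell
#!/bin/sh
# Print one mcclintock.sh command for each read pair found in a directory.
# Nothing is executed, so the commands can be reviewed first.
# NOTE: print_runs is a hypothetical helper, not part of mcclintock.
print_runs() {
    # Resolve the directory to an absolute path, as -1 and -2 require.
    reads_dir=$(cd "$1" && pwd) || return 1
    for r1 in "$reads_dir"/*_1.fastq; do
        [ -e "$r1" ] || continue            # no matches: the glob stays literal
        r2="${r1%_1.fastq}_2.fastq"         # derive the mate's file name
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: no matching _2.fastq" >&2
            continue
        fi
        echo "sh mcclintock.sh -r reference.fasta -c te_consensus.fasta -1 $r1 -2 $r2 -p 2"
    done
}

# Example with empty placeholder files: sampleB has no mate and is skipped.
demo=$(mktemp -d)
touch "$demo/sampleA_1.fastq" "$demo/sampleA_2.fastq" "$demo/sampleB_1.fastq"
print_runs "$demo"
rm -r "$demo"
```

The sketch assumes that file paths contain no spaces. Paths with spaces would need quoting before the printed commands could be run.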
### Output format

The output of the run scripts is a BED format file in which the 4th column contains the name of the TE and whether it is a novel insertion (new) or a TE shared with the reference (old). The outputs also include a header line for use with the UCSC genome browser. The final results files are located in a results folder inside the sample's folder, within the directory named after the reference genome. If FastQC was present on the system, its output is stored in the folder fastqc_analysis, within the results folder. The original results files produced by each method can also be viewed; these are stored in the folder originalmethodresults, within the results folder.
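
As a small example of working with this output, the sketch below counts novel and reference insertions in one BED file. The exact layout of the 4th column is an assumption: the sketch supposes that the name is underscore-separated and contains the token `new` or `old` (for example `TY1_new`), which should be checked against a real results file. The function name `count_insertions` and the example records are hypothetical.

```shell
#!/bin/sh
# Count records whose 4th column carries the token "new" or "old".
# ASSUMPTION: names are underscore-separated, e.g. TY1_new (not confirmed).
# NOTE: count_insertions is a hypothetical helper, not part of mcclintock.
count_insertions() {
    awk '
        /^track/ || /^browser/ || /^#/ { next }   # skip UCSC header lines
        {
            n = split($4, parts, "_")
            for (i = 1; i <= n; i++) {
                if (parts[i] == "new") { n_new++; break }
                if (parts[i] == "old") { n_old++; break }
            }
        }
        END { printf "new\t%d\nold\t%d\n", n_new, n_old }
    ' "$1"
}

# Example with made-up records: two novel insertions, one shared.
demo=$(mktemp)
cat > "$demo" <<'EOF'
track name="demo" description="hypothetical example"
chrI 100 105 TY1_new 0 +
chrI 900 1500 TY3_old 0 -
chrII 50 55 TY1_new 0 -
EOF
count_insertions "$demo"
rm "$demo"
```

Matching whole tokens rather than substrings avoids counting a TE whose own name happens to contain the letters "old" or "new".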
### Running individual TE detection methods

Each folder contains one of the TE detection methods tested in the review. In addition to the standard software, each folder contains a file named runXXXX.sh. Running this file without arguments prints an explanation of the input files needed to execute the method. These arguments should be supplied after the script name, separated by spaces, as follows:

    sh runXXXX.sh argument1 argument2 argument3 ...