This repository tests and compares the performance of our released transcript assembly method, Scallop, against two other leading transcript assemblers, StringTie and TransComb. Please refer to our paper published in Nature Biotechnology. A podcast about Scallop (thanks to Roman Cheplyaka) is available at bioinformatics.chat and iTunes. Here we provide scripts to download the datasets, run the three methods, evaluate the predicted transcripts, and reproduce the results and figures in the paper.
The pipeline involves the following four steps:
- Download necessary datasets (`data` directory).
- Download and/or compile necessary programs (`programs` directory).
- Run the methods to produce results (`results` directory).
- Summarize results and produce figures (`plots` directory).
We compare the three methods on three datasets, namely encode10, encode65, and sequin.
Besides, we also need the annotation files for evaluation purposes.
In the `data` directory, we provide metadata for these datasets, along with scripts to download them.
The first dataset, encode10, contains 10 human RNA-seq samples downloaded from the ENCODE project (2003--2012). All of these samples were sequenced with strand-specific, paired-end protocols. Each of the 10 samples is aligned with three RNA-seq aligners: TopHat2, STAR, and HISAT2. Of these, the STAR and HISAT2 alignments are available at doi:10.26208/8c06-w247 (the same data were used in another research work).
The second dataset, encode65, contains 65 human RNA-seq samples downloaded from the ENCODE project (2013--present).
This dataset includes 50 strand-specific samples and 15 non-strand-specific samples.
These samples have pre-computed read alignments, which can be downloaded with the following script in the `data` directory:
./download.encode65.sh
The downloaded files will appear under `data/encode65`.
NOTE: The 65 read alignment files take about 390GB of storage space in total.
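Given the size of this download, it may be worth checking the available disk space first. The following is a minimal sketch, not part of this repository: `has_free_space` is a hypothetical helper, and it compares against GiB as reported by `df`, which only approximates the 390GB figure above.

```shell
# Sketch only: has_free_space is a hypothetical helper, not a script in this repo.
# Usage: has_free_space REQUIRED_GB [DIR]
# Succeeds if the filesystem holding DIR has at least REQUIRED_GB GiB available.
has_free_space() {
    local required_gb="$1"
    local dir="${2:-.}"
    local avail_kb
    # df -P prints one POSIX-format line per filesystem; with -k, column 4
    # of the second line is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}

# Example (run from the data directory):
#   has_free_space 390 . && ./download.encode65.sh
```

Chaining the check with `&&` means the download only starts when the check succeeds.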
The third dataset, sequin, contains 8 spike-in RNA-seq samples (see the paper).
As with encode10, each of these 8 samples is aligned with three RNA-seq aligners: TopHat2, STAR, and HISAT2.
We have uploaded all of these read alignments to CMU Box.
Use this link to download these files.
Please keep the same directory structure and file names (i.e., `data/sequin/ACCESSION/ALIGNER.sort.bam`) that we used there.
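Because the later steps depend on this layout, a quick check after downloading can catch misplaced files. The sketch below is an illustration under assumptions: `check_sequin_layout` is a hypothetical helper, the accession list is assumed to be a plain text file with one accession per line, and the aligner names are whatever file prefixes you pass in (the exact prefixes used in the upload are not stated here).

```shell
# Sketch only: check_sequin_layout is a hypothetical helper, not a script in this repo.
# Usage: check_sequin_layout ROOT ACCESSION_LIST ALIGNER...
#   ROOT            e.g. data/sequin
#   ACCESSION_LIST  text file with one accession per line (assumed format)
#   ALIGNER...      file-name prefixes to expect for each accession (assumed)
# Prints every missing ROOT/ACCESSION/ALIGNER.sort.bam and fails if any is missing.
check_sequin_layout() {
    local root="$1" list="$2"
    shift 2
    local missing=0 acc aligner bam
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue          # skip blank lines
        for aligner in "$@"; do
            bam="$root/$acc/$aligner.sort.bam"
            if [ ! -f "$bam" ]; then
                echo "missing: $bam"
                missing=$((missing + 1))
            fi
        done
    done < "$list"
    [ "$missing" -eq 0 ]
}
```

The function reports all missing files rather than stopping at the first one, so a partial download can be fixed in one pass.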
For the encode10 and encode65 datasets, we use the human annotation database as the reference;
for sequin, we use the known synthetic annotation as the reference.
Use the following script in the `data` directory to download the annotations:
./download.annotation.sh
The downloaded files will appear under `data/ensembl`.
Our experiments involve the following programs:
Program | Version | Description |
---|---|---|
Scallop | v0.9.8 | Transcript assembler |
StringTie | v1.3.2d | Transcript assembler |
TransComb | v.1.0 | Transcript assembler |
Cufflinks | v2.2.1 | Transcript assembler |
gffcompare | v0.9.9c | Evaluate assembled transcripts |
gtfcuff | | RNA-seq tool |
gtfformat | | RNA-seq tool |
gtfmerge | | RNA-seq tool |
You need to download and/or compile them, and then link them into the `programs` directory.
Make sure that the program names are in lowercase (i.e., `scallop`, `stringtie`, `transcomb`, and `gffcompare`) in the `programs` directory.
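One way to do the linking is with symbolic links whose names are forced to lowercase. This is a minimal sketch under assumptions: `link_program` is a hypothetical helper, and it assumes each program is a single executable file whose base name, once lowercased, is the name expected in `programs`.

```shell
# Sketch only: link_program is a hypothetical helper, not a script in this repo.
# Usage: link_program /path/to/Executable PROGRAMS_DIR
# Creates PROGRAMS_DIR/<lowercased basename> as a symlink to the executable.
link_program() {
    local src="$1" dest_dir="$2"
    local name src_dir
    if [ ! -x "$src" ]; then
        echo "not executable: $src" >&2
        return 1
    fi
    # Lowercase the file name, e.g. TransComb -> transcomb.
    name=$(basename "$src" | tr '[:upper:]' '[:lower:]')
    # Resolve the directory to an absolute path so the symlink stays valid
    # regardless of where it is created.
    src_dir=$(cd "$(dirname "$src")" && pwd)
    mkdir -p "$dest_dir"
    ln -sf "$src_dir/$(basename "$src")" "$dest_dir/$name"
}

# Example (the source path is a placeholder):
#   link_program /path/to/TransComb programs
```

Symlinks keep the original builds in place, so a program can be upgraded by re-running the helper with a new path.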
Once the datasets and programs are all available, use the following scripts in the `results` directory to run the assemblers on the datasets:
./run.encode10.sh
./run.encode65.sh
./run.sequin.sh
Each of these three scripts can be modified to run different
methods (Scallop, StringTie, TransComb, and Cufflinks) and to use
a different minimum coverage threshold. For each run,
you need to specify a `run-id`, which is used later when
collecting the results. You can also modify the scripts to set
how many CPU cores are used to run the jobs in parallel.
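To illustrate how a run-id, a coverage threshold, and a core count fit together, here is a dry-run sketch. It does not reproduce the repository's run scripts: `make_jobs` and `run_jobs` are hypothetical helpers, the `SAMPLE.RUN_ID` output naming is an assumption, and each job only echoes what it would do instead of calling an assembler. It relies on `xargs -P`, which GNU and BSD `xargs` both support.

```shell
# Sketch only: make_jobs and run_jobs are hypothetical helpers, not scripts in this repo.

# Usage: make_jobs RUN_ID COVERAGE SAMPLE...
# Prints one shell command per sample. Here each command is a placeholder
# echo; a real job line would invoke an assembler instead.
make_jobs() {
    local run_id="$1" coverage="$2"
    shift 2
    local sample
    for sample in "$@"; do
        echo "echo would assemble $sample with coverage=$coverage into $sample.$run_id"
    done
}

# Usage: run_jobs CORES < job_list
# Runs each input line as a shell command, at most CORES at a time.
# Job lines must not contain quote characters, which xargs treats specially.
run_jobs() {
    local cores="$1"
    xargs -P "$cores" -I {} sh -c '{}'
}

# Example: two placeholder jobs tagged with run-id "test1", on 2 cores.
#   make_jobs test1 1.0 sampleA sampleB | run_jobs 2
```

Because jobs run concurrently, their output order is not guaranteed; tagging every output with the run-id is what lets results be collected reliably afterwards.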
Once the results have been generated, use the following script in the `plots` directory to reproduce the figures:
./build.figures.sh
You may need to install the R packages `VennDiagram` and `tikzDevice`.
You may also need to modify this script to match the `run-id`(s) you specified when running the assemblers.