*** This file is autogenerated. Don't edit it directly. ***
Vignette for DETONATE
This document provides a brief tutorial on how to run DETONATE. We
focus on the basic commands, inputs, and outputs involved. For
information on the definitions, motivations, and validation of
DETONATE's methods, see our paper (cited below) and the rest of the
DETONATE website [1].
Bo Li*, Nathanael Fillmore*, Yongsheng Bai, Mike Collins, James
A. Thompson, Ron Stewart, Colin N. Dewey. Evaluation of de novo
transcriptome assemblies from RNA-Seq data.
* = equal contributions
Step 1: Acquire RNA-Seq data and build de novo assemblies
Before we can evaluate assemblies, we need to acquire RNA-Seq data
and construct the assemblies, using our favorite transcriptome
assembly software (Trinity, Oases, SOAPdenovo-Trans, Trans-ABySS,
etc.). For the purposes of this vignette, we will use a tiny example
dataset, available in examples/toy_SE.fq, and three assemblies of
this data, examples/toy_assembly_1.fa, examples/toy_assembly_2.fa,
and examples/toy_assembly_3.fa.
(All paths in this vignette are relative to the root directory of the
DETONATE distribution.)
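If you would like a quick look at the toy inputs before proceeding,
standard shell tools suffice (a minimal sketch; the exact sequences
are not important for this tutorial):
$ head -n 4 examples/toy_SE.fq              # first FASTQ record of the read set
$ grep -c '^>' examples/toy_assembly_1.fa   # number of contigs in the first assembly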
Step 2: Build the DETONATE software
Another preliminary step is to build the DETONATE software. To do so,
simply run make in the root directory of the DETONATE distribution.
This will build both RSEM-EVAL and REF-EVAL, plus several
dependencies.
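For example, assuming the distribution was unpacked into a directory
named detonate (a hypothetical path):
$ cd detonate   # root directory of the DETONATE distribution
$ make          # builds RSEM-EVAL, REF-EVAL, and the bundled dependencies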
Step 3: Run RSEM-EVAL on each assembly
Now we are ready to run either RSEM-EVAL or REF-EVAL to evaluate our
assemblies. Let's start with RSEM-EVAL. We run RSEM-EVAL on the three
assemblies of our reads as follows.
$ ./rsem-eval/rsem-eval-calculate-score examples/toy_SE.fq examples/toy_assembly_1.fa examples/rsem_eval_1 76 --transcript-length-parameters rsem-eval/true_transcript_length_distribution/mouse.txt -p 16
$ ./rsem-eval/rsem-eval-calculate-score examples/toy_SE.fq examples/toy_assembly_2.fa examples/rsem_eval_2 76 --transcript-length-parameters rsem-eval/true_transcript_length_distribution/mouse.txt -p 16
$ ./rsem-eval/rsem-eval-calculate-score examples/toy_SE.fq examples/toy_assembly_3.fa examples/rsem_eval_3 76 --transcript-length-parameters rsem-eval/true_transcript_length_distribution/mouse.txt -p 16
Above, the first argument to RSEM-EVAL specifies the reads, the
second argument specifies the assembly, and the third argument
specifies the prefix of RSEM-EVAL's output.
The fourth argument, 76, is the read length in our data. If the reads
were of varying lengths, the fourth argument would be the average
read length. For paired-end data, the fourth argument would be the
(average) fragment length.
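If you do not know the (average) read length of your data, you can
compute it directly from the FASTQ file. Here is a minimal sketch,
assuming standard four-line FASTQ records (sequence on the second
line of each record); for our toy data it prints 76:
$ awk 'NR % 4 == 2 { total += length($0); n++ } END { print total / n }' examples/toy_SE.fq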
The --transcript-length-parameters option instructs RSEM-EVAL to
parameterize its prior distribution using the mean and standard
deviation of the transcript lengths in the Ensembl mouse annotation.
These parameters can also be estimated from a species more closely
related to the one you are interested in, using
./rsem-eval/rsem-eval-estimate-transcript-length-distribution. If
--transcript-length-parameters is not provided, default
transcript-length parameters, estimated from the human Ensembl
annotation, will be used.
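For example, to estimate transcript-length parameters from a FASTA
file of full-length transcripts from a related species (here, a
hypothetical file related_transcripts.fa), the invocation looks
roughly like this:
$ ./rsem-eval/rsem-eval-estimate-transcript-length-distribution related_transcripts.fa related_species.txt
The resulting parameter file (related_species.txt) can then be passed
to --transcript-length-parameters in place of mouse.txt.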
The -p option tells RSEM-EVAL how many threads to use, 16 in this
example.
After running RSEM-EVAL as above, the RSEM-EVAL scores for our
assemblies will be in the files examples/rsem_eval_1.score,
examples/rsem_eval_2.score, and examples/rsem_eval_3.score. Let's
look at one of these score files:
$ cat examples/rsem_eval_1.score
Score -87426.14
BIC_penalty -8.25
Prior_score_on_contig_lengths -7.91
Prior_score_on_contig_sequences -867.82
Data_likelihood_in_log_space_without_correction -86542.17
Correction_term -0.00
Number_of_contigs 1
Expected_number_of_aligned_reads_given_the_data 3812.00
Number_of_contigs_smaller_than_expected_read/fragment_length 0
Number_of_contigs_with_no_read_aligned_to 0
Maximum_data_likelihood_in_log_space -86541.97
Number_of_alignable_reads 3812
Number_of_alignments_in_total 3812
Transcript_length_distribution_related_factors -7.91
The first line contains the RSEM-EVAL score. The remaining lines
break down this score and provide other related information.
Now let's compare the three assemblies' RSEM-EVAL scores:
$ cat examples/rsem_eval_1.score | awk '$1 == "Score"'
Score -87426.14
$ cat examples/rsem_eval_2.score | awk '$1 == "Score"'
Score -201465.39
$ cat examples/rsem_eval_3.score | awk '$1 == "Score"'
Score -254935.35
The first assembly has a substantially better RSEM-EVAL score than
the other two assemblies, and the second assembly has a slightly
better RSEM-EVAL score than the third.
NOTE: Higher RSEM-EVAL scores are better than lower scores. This is
true despite the fact that the scores are always negative.
Concretely, in the above example, -87426.14 is better than
-201465.39, since -87426.14 is greater than -201465.39.
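One convenient way to see this ranking directly is to extract the
score lines and sort them in decreasing numeric order, so that the
best assembly comes first (a sketch; -g is GNU sort's general numeric
sort):
$ grep -H '^Score' examples/rsem_eval_?.score | sort -k2,2gr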
Step 4: Estimate the "true" assembly
We now proceed to a reference-based comparison of the assemblies. A
reference for our toy example is available in examples/toy_ref.fa. It
contains a single transcript.
We will compare our assemblies to an estimated "true" assembly. To do
so, we first need to construct this estimate, and this can be done
using ref-eval/ref-eval-estimate-true-assembly.
A preliminary step, though, is to run RSEM relative to the set of
full-length reference transcripts. (The point here is not to compute
the expression levels of each transcript, which is RSEM's primary
purpose, but rather to compute the posterior probability of each
read's alignment to the reference.) To do so, download RSEM from its
website [2], unpack it, and build it by typing make in its root
directory. Let's say that you have installed RSEM in the directory
/path/to/rsem. (This example uses version 1.2.17.) Then we run RSEM
as follows:
$ /path/to/rsem/rsem-prepare-reference --bowtie examples/toy_ref.fa examples/toy_rsem_ref
$ /path/to/rsem/rsem-calculate-expression -p 12 examples/toy_SE.fq examples/toy_rsem_ref examples/toy_rsem_expr
Now we are ready to estimate the "true" assembly:
$ ./ref-eval/ref-eval-estimate-true-assembly --reference examples/toy_rsem_ref --expression examples/toy_rsem_expr --assembly examples/ta --alignment-policy best
The first two options (--reference and --expression) tell REF-EVAL
where to find the alignment information output by RSEM.
The third option (--assembly) tells REF-EVAL to output the estimated
"true" assembly in a file with prefix examples/ta. Specifically, the
estimated "true" assembly will be output to examples/ta_0.fa.
The fourth option (--alignment-policy best) tells REF-EVAL to use the
highest-probability alignment of each read in constructing the
estimated "true" assembly. In the paper, we used --alignment-policy
sample, but we use best here so that our results are deterministic.
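For reference, the sampling variant used in the paper would be
invoked as follows (we do not run it here, since its output is
nondeterministic):
$ ./ref-eval/ref-eval-estimate-true-assembly --reference examples/toy_rsem_ref --expression examples/toy_rsem_expr --assembly examples/ta --alignment-policy sample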
The estimated "true" assembly contains one contig in this case; it is
a bit shorter than the full-length transcript.
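You can check this by comparing total sequence lengths (a minimal
sketch that sums the lengths of all non-header lines in each FASTA
file):
$ awk '!/^>/ { len += length($0) } END { print len }' examples/toy_ref.fa
$ awk '!/^>/ { len += length($0) } END { print len }' examples/ta_0.fa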
Step 5: Compute the kmer-compression score for each assembly
Next, we will compare each assembly to the estimated "true" assembly
using the kmer-compression (KC) score.
To do so, a preliminary step is to run RSEM again. This time, we will
run RSEM to estimate the expression levels of each sequence in the
estimated "true" assembly, as follows:
$ /path/to/rsem/rsem-prepare-reference --bowtie examples/ta_0.fa examples/ta_0_ref
$ /path/to/rsem/rsem-calculate-expression -p 12 examples/toy_SE.fq examples/ta_0_ref examples/ta_0_expr
We now compute the KC score of each assembly as follows:
$ ./ref-eval/ref-eval --scores kc --A-seqs examples/toy_assembly_1.fa --B-seqs examples/ta_0.fa --B-expr examples/ta_0_expr.isoforms.results --kmerlen 76 --readlen 76 --num-reads 46988 | tee examples/kc_1.txt
$ ./ref-eval/ref-eval --scores kc --A-seqs examples/toy_assembly_2.fa --B-seqs examples/ta_0.fa --B-expr examples/ta_0_expr.isoforms.results --kmerlen 76 --readlen 76 --num-reads 46988 | tee examples/kc_2.txt
$ ./ref-eval/ref-eval --scores kc --A-seqs examples/toy_assembly_3.fa --B-seqs examples/ta_0.fa --B-expr examples/ta_0_expr.isoforms.results --kmerlen 76 --readlen 76 --num-reads 46988 | tee examples/kc_3.txt
The above commands instruct REF-EVAL to compute the KC score
(--scores kc) of the assembly (e.g., --A-seqs
examples/toy_assembly_1.fa) versus the estimated "true" assembly
(--B-seqs examples/ta_0.fa). The expression profile of the estimated
"true" assembly is given by --B-expr
examples/ta_0_expr.isoforms.results. We also provide REF-EVAL with
the kmer length (--kmerlen 76) to use in computing the KC; this will
typically be the read length or average read length. Finally, we
provide REF-EVAL with the number of reads (--num-reads 46988) and the
read length (--readlen 76); here, what is important is that the
number of reads times the read length equals the total number of
nucleotides in the read set.
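If you need the read count for your own data, you can derive it from
the FASTQ file itself; assuming the standard four-line-per-record
format, the following should print 46988 for examples/toy_SE.fq:
$ echo $(( $(wc -l < examples/toy_SE.fq) / 4 ))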
Each of these output files, e.g., examples/kc_1.txt, contains the KC
score and its two constituent terms:
$ cat examples/kc_1.txt
weighted_kmer_recall 0.862069
inverse_compression_rate 0.000175297
kmer_compression_score 0.861894
Now let's compare the three assemblies' KC scores:
$ cat examples/kc_1.txt | awk '$1 == "kmer_compression_score"'
kmer_compression_score 0.861894
$ cat examples/kc_2.txt | awk '$1 == "kmer_compression_score"'
kmer_compression_score 0.520331
$ cat examples/kc_3.txt | awk '$1 == "kmer_compression_score"'
kmer_compression_score 0.509861
As with the RSEM-EVAL scores, the first assembly has a substantially
better KC score than the other two assemblies, and the second
assembly has a slightly better KC score than the third.
Step 6: Compute the alignment-based scores for each assembly
Finally, we compute the contig and nucleotide F1 scores.
To do so, we need to align each assembly to the estimated "true"
assembly, and vice versa, using Blat. Download Blat from its website
[3] (here, we use version 34), unpack it, and build it. Let's
say that you have installed Blat at /path/to/blat. Then we run Blat
as follows:
$ /path/to/blat -minIdentity=80 examples/ta_0.fa examples/toy_assembly_1.fa examples/toy_assembly_1_to_ta_0.psl
$ /path/to/blat -minIdentity=80 examples/ta_0.fa examples/toy_assembly_2.fa examples/toy_assembly_2_to_ta_0.psl
$ /path/to/blat -minIdentity=80 examples/ta_0.fa examples/toy_assembly_3.fa examples/toy_assembly_3_to_ta_0.psl
$ /path/to/blat -minIdentity=80 examples/toy_assembly_1.fa examples/ta_0.fa examples/ta_0_to_toy_assembly_1.psl
$ /path/to/blat -minIdentity=80 examples/toy_assembly_2.fa examples/ta_0.fa examples/ta_0_to_toy_assembly_2.psl
$ /path/to/blat -minIdentity=80 examples/toy_assembly_3.fa examples/ta_0.fa examples/ta_0_to_toy_assembly_3.psl
We can now compute the contig and nucleotide scores as follows:
$ ./ref-eval/ref-eval --scores contig,nucl --weighted no --A-seqs examples/toy_assembly_1.fa --B-seqs examples/ta_0.fa --A-to-B examples/toy_assembly_1_to_ta_0.psl --B-to-A examples/ta_0_to_toy_assembly_1.psl --min-frac-identity 0.90 | tee examples/contig_nucl_1.txt
$ ./ref-eval/ref-eval --scores contig,nucl --weighted no --A-seqs examples/toy_assembly_2.fa --B-seqs examples/ta_0.fa --A-to-B examples/toy_assembly_2_to_ta_0.psl --B-to-A examples/ta_0_to_toy_assembly_2.psl --min-frac-identity 0.90 | tee examples/contig_nucl_2.txt
$ ./ref-eval/ref-eval --scores contig,nucl --weighted no --A-seqs examples/toy_assembly_3.fa --B-seqs examples/ta_0.fa --A-to-B examples/toy_assembly_3_to_ta_0.psl --B-to-A examples/ta_0_to_toy_assembly_3.psl --min-frac-identity 0.90 | tee examples/contig_nucl_3.txt
Each of these output files, e.g., examples/contig_nucl_1.txt,
contains the (unweighted) contig and nucleotide precision, recall,
and F1 scores:
$ cat examples/contig_nucl_1.txt
unweighted_nucl_precision 0.998403
unweighted_nucl_recall 0.998403
unweighted_nucl_F1 0.998403
unweighted_contig_recall 1
unweighted_contig_precision 1
unweighted_contig_F1 1
Now let's compare the three assemblies' contig F1 scores:
$ cat examples/contig_nucl_1.txt | awk '$1 == "unweighted_contig_F1"'
unweighted_contig_F1 1
$ cat examples/contig_nucl_2.txt | awk '$1 == "unweighted_contig_F1"'
unweighted_contig_F1 0
$ cat examples/contig_nucl_3.txt | awk '$1 == "unweighted_contig_F1"'
unweighted_contig_F1 0
The first assembly recovered the single contig in the estimated
"true" assembly, but the other two assemblies did not recover it, at
least not at greater than 90 percent identity (the threshold set by
--min-frac-identity 0.90 above). This fact is also obvious from
looking at the assemblies.
Now let's compare the three assemblies' nucleotide F1 scores:
$ cat examples/contig_nucl_1.txt | awk '$1 == "unweighted_nucl_F1"'
unweighted_nucl_F1 0.998403
$ cat examples/contig_nucl_2.txt | awk '$1 == "unweighted_nucl_F1"'
unweighted_nucl_F1 0.419405
$ cat examples/contig_nucl_3.txt | awk '$1 == "unweighted_nucl_F1"'
unweighted_nucl_F1 0.815516
Note that by nucleotide F1 the third assembly scores better than the
second, the opposite of their ordering under the RSEM-EVAL and KC
scores; the various scores capture different aspects of assembly
quality, so such disagreements can occur.
References
[1] DETONATE website: http://deweylab.biostat.wisc.edu/detonate/
[2] RSEM website: http://deweylab.biostat.wisc.edu/rsem/
[3] Blat website: http://users.soe.ucsc.edu/~kent/src/