*** This file is autogenerated. Don't edit it directly. ***
Vignette for DETONATE
This document provides a brief tutorial on how to run DETONATE. We
focus on the basic commands, inputs, and outputs involved. For
information on the definitions, motivations, and validation of
DETONATE's methods, see our paper (cited below) and the rest of the
DETONATE website [1].
Bo Li*, Nathanael Fillmore*, Yongsheng Bai, Mike Collins, James
A. Thompson, Ron Stewart, Colin N. Dewey. Evaluation of de novo
transcriptome assemblies from RNA-Seq data.
* = equal contributions
Step 1: Acquire RNA-Seq data and build de novo assemblies
Before we can evaluate assemblies, we need to acquire RNA-Seq data
and construct the assemblies, using our favorite transcriptome
assembly software (Trinity, Oases, SOAPdenovo-Trans, Trans-ABySS,
etc.). For the purposes of this vignette, we will use a tiny example
dataset, available in examples/toy_SE.fq, and three assemblies of
this data, examples/toy_assembly_1.fa, examples/toy_assembly_2.fa,
and examples/toy_assembly_3.fa.
(All paths in this vignette are relative to the root directory of the
DETONATE distribution.)
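If you would like a quick look at the toy inputs before proceeding,
standard shell tools suffice (a minimal sketch; the exact sequences
are not important for this tutorial):
$ head -n 4 examples/toy_SE.fq              # first FASTQ record of the read set
$ grep -c '^>' examples/toy_assembly_1.fa   # number of contigs in the first assembly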
Step 2: Build the DETONATE software
Another preliminary step is to build the DETONATE software. To do so,
simply run make in the root directory of the DETONATE distribution.
This will build both RSEM-EVAL and REF-EVAL, plus several
dependencies.
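For example, assuming the distribution was unpacked into a directory
named detonate (a hypothetical path):
$ cd detonate   # root directory of the DETONATE distribution
$ make          # builds RSEM-EVAL, REF-EVAL, and the bundled dependencies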
Step 3: Run RSEM-EVAL on each assembly
Now we are ready to run either RSEM-EVAL or REF-EVAL to evaluate our
assemblies. Let's start with RSEM-EVAL. We run RSEM-EVAL on the three
assemblies of our reads as follows.
$ ./rsem-eval/rsem-eval-calculate-score examples/toy_SE.fq examples/toy_assembly_1.fa examples/rsem_eval_1 76 --transcript-length-parameters rsem-eval/true_transcript_length_distribution/mouse.txt -p 16
$ ./rsem-eval/rsem-eval-calculate-score examples/toy_SE.fq examples/toy_assembly_2.fa examples/rsem_eval_2 76 --transcript-length-parameters rsem-eval/true_transcript_length_distribution/mouse.txt -p 16
$ ./rsem-eval/rsem-eval-calculate-score examples/toy_SE.fq examples/toy_assembly_3.fa examples/rsem_eval_3 76 --transcript-length-parameters rsem-eval/true_transcript_length_distribution/mouse.txt -p 16
Above, the first argument to RSEM-EVAL specifies the reads, the
second argument specifies the assembly, and the third argument
specifies the prefix of RSEM-EVAL's output.
The fourth argument, 76, is the read length in our data. If the reads
were of varying lengths, the fourth argument would be the average
read length. For paired-end data, the fourth argument would be the
(average) fragment length.
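If you do not know the (average) read length of your data, you can
compute it directly from the FASTQ file. Here is a minimal sketch,
assuming standard four-line FASTQ records (sequence on the second
line of each record); for our toy data it prints 76:
$ awk 'NR % 4 == 2 { total += length($0); n++ } END { print total / n }' examples/toy_SE.fq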
The --transcript-length-parameters option instructs RSEM-EVAL to
parameterize its prior distribution using the mean and standard
deviation of the transcript lengths in the Ensembl mouse annotation.
These parameters can also be estimated from a species more closely
related to the one you are interested in, using
./rsem-eval/rsem-eval-estimate-transcript-length-distribution. If
--transcript-length-parameters is not provided, default
transcript-length parameters, estimated from the human Ensembl
annotation, will be used.
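For example, to estimate transcript-length parameters from a FASTA
file of full-length transcripts from a related species (here, a
hypothetical file related_transcripts.fa), the invocation looks
roughly like this:
$ ./rsem-eval/rsem-eval-estimate-transcript-length-distribution related_transcripts.fa related_species.txt
The resulting parameter file (related_species.txt) can then be passed
to --transcript-length-parameters in place of mouse.txt.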
The -p option tells RSEM-EVAL how many threads to use, 16 in this
example.
After running RSEM-EVAL as above, the RSEM-EVAL scores for our
assemblies will be in the files examples/rsem_eval_1.score,
examples/rsem_eval_2.score, and examples/rsem_eval_3.score. Let's
look at one of these score files:
$ cat examples/rsem_eval_1.score
Score -87426.14
BIC_penalty -8.25
Prior_score_on_contig_lengths -7.91
Prior_score_on_contig_sequences -867.82
Data_likelihood_in_log_space_without_correction -86542.17
Correction_term -0.00
Number_of_contigs 1
Expected_number_of_aligned_reads_given_the_data 3812.00
Number_of_contigs_smaller_than_expected_read/fragment_length 0
Number_of_contigs_with_no_read_aligned_to 0
Maximum_data_likelihood_in_log_space -86541.97
Number_of_alignable_reads 3812
Number_of_alignments_in_total 3812
Transcript_length_distribution_related_factors -7.91
The first line contains the RSEM-EVAL score. The remaining lines
break down this score and provide other related information.
Now let's compare the three assemblies' RSEM-EVAL scores:
$ cat examples/rsem_eval_1.score | awk '$1 == "Score"'
Score -87426.14
$ cat examples/rsem_eval_2.score | awk '$1 == "Score"'
Score -201465.39
$ cat examples/rsem_eval_3.score | awk '$1 == "Score"'
Score -254935.35
The first assembly has a substantially better RSEM-EVAL score than
the other two assemblies, and the second assembly has a slightly
better RSEM-EVAL score than the third.
NOTE: Higher RSEM-EVAL scores are better than lower scores. This is
true despite the fact that the scores are always negative.
Concretely, in the above example, -87426.14 is better than
-201465.39, since -87426.14 is greater than -201465.39.
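One convenient way to see this ranking directly is to extract the
score lines and sort them in decreasing numeric order, so that the
best assembly comes first (a sketch; -g is GNU sort's general numeric
sort):
$ grep -H '^Score' examples/rsem_eval_?.score | sort -k2,2gr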
Step 4: Estimate the "true" assembly
We now proceed to a reference-based comparison of the assemblies. A
reference for our toy example is available in examples/toy_ref.fa. It
contains a single transcript.
We will compare our assemblies to an estimated "true" assembly. To do
so, we first need to construct this estimate, and this can be done
using ref-eval/ref-eval-estimate-true-assembly.
A preliminary step, though, is to run RSEM relative to the set of
full-length reference transcripts. (The point here is not to compute
the expression levels of each transcript, which is RSEM's primary
purpose, but rather to compute the posterior probability of each
read's alignment to the reference.) To do so, download RSEM from its
website [2], unpack it, and build it by typing make in its root
directory. Let's say that you have installed RSEM in the directory
/path/to/rsem. (This example uses version 1.2.17.) Then we run RSEM
as follows:
$ /path/to/rsem/rsem-prepare-reference --bowtie examples/toy_ref.fa examples/toy_rsem_ref
$ /path/to/rsem/rsem-calculate-expression -p 12 examples/toy_SE.fq examples/toy_rsem_ref examples/toy_rsem_expr
Now we are ready to estimate the "true" assembly:
$ ./ref-eval/ref-eval-estimate-true-assembly --reference examples/toy_rsem_ref --expression examples/toy_rsem_expr --assembly examples/ta --alignment-policy best
The first two options (--reference and --expression) tell REF-EVAL
where to find the alignment information output by RSEM.
The third option (--assembly) tells REF-EVAL to output the estimated
"true" assembly in a file with prefix examples/ta. Specifically, the
estimated "true" assembly will be output to examples/ta_0.fa.
The fourth option (--alignment-policy best) tells REF-EVAL to use the
highest-probability alignment of each read in constructing the
estimated "true" assembly. In the paper, we used --alignment-policy
sample, but we use best here so that our results are deterministic.
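For reference, the sampling variant used in the paper would be
invoked as follows (we do not run it here, since its output is
nondeterministic):
$ ./ref-eval/ref-eval-estimate-true-assembly --reference examples/toy_rsem_ref --expression examples/toy_rsem_expr --assembly examples/ta --alignment-policy sample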
The estimated "true" assembly contains one contig in this case; it is
a bit shorter than the full-length transcript.
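You can check this by comparing total sequence lengths (a minimal
sketch that sums the lengths of all non-header lines in each FASTA
file):
$ awk '!/^>/ { len += length($0) } END { print len }' examples/toy_ref.fa
$ awk '!/^>/ { len += length($0) } END { print len }' examples/ta_0.fa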
Step 5: Compute the kmer-compression score for each assembly
Next, we will compare each assembly to the estimated "true" assembly
using the kmer-compression (KC) score.
To do so, a preliminary step is to run RSEM again. This time, we will
run RSEM to estimate the expression levels of each sequence in the
estimated "true" assembly, as follows:
$ /path/to/rsem/rsem-prepare-reference --bowtie examples/ta_0.fa examples/ta_0_ref
$ /path/to/rsem/rsem-calculate-expression -p 12 examples/toy_SE.fq examples/ta_0_ref examples/ta_0_expr
We now compute the KC score of each assembly as follows:
$ ./ref-eval/ref-eval --scores kc --A-seqs examples/toy_assembly_1.fa --B-seqs examples/ta_0.fa --B-expr examples/ta_0_expr.isoforms.results --kmerlen 76 --readlen 76 --num-reads 46988 | tee examples/kc_1.txt
$ ./ref-eval/ref-eval --scores kc --A-seqs examples/toy_assembly_2.fa --B-seqs examples/ta_0.fa --B-expr examples/ta_0_expr.isoforms.results --kmerlen 76 --readlen 76 --num-reads 46988 | tee examples/kc_2.txt
$ ./ref-eval/ref-eval --scores kc --A-seqs examples/toy_assembly_3.fa --B-seqs examples/ta_0.fa --B-expr examples/ta_0_expr.isoforms.results --kmerlen 76 --readlen 76 --num-reads 46988 | tee examples/kc_3.txt
The above commands instruct REF-EVAL to compute the KC score
(--scores kc) of the assembly (e.g., --A-seqs
examples/toy_assembly_1.fa) versus the estimated "true" assembly
(--B-seqs examples/ta_0.fa). The expression profile of the estimated
"true" assembly is given by --B-expr
examples/ta_0_expr.isoforms.results. We also provide REF-EVAL with
the kmer length (--kmerlen 76) to use in computing the KC; this will
typically be the read length or average read length. Finally, we
provide REF-EVAL with the number of reads (--num-reads 46988) and the
read length (--readlen 76); here, what is important is that the
number of reads times the read length equals the total number of
nucleotides in the read set.
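If you need the read count for your own data, you can derive it from
the FASTQ file itself; assuming the standard four-line-per-record
format, the following should print 46988 for examples/toy_SE.fq:
$ echo $(( $(wc -l < examples/toy_SE.fq) / 4 ))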
Each of these output files, e.g., examples/kc_1.txt, contains the KC
score and its two constituent terms:
$ cat examples/kc_1.txt
weighted_kmer_recall 0.862069
inverse_compression_rate 0.000175297
kmer_compression_score 0.861894
Now let's compare the three assemblies' KC scores:
$ cat examples/kc_1.txt | awk '$1 == "kmer_compression_score"'
kmer_compression_score 0.861894
$ cat examples/kc_2.txt | awk '$1 == "kmer_compression_score"'
kmer_compression_score 0.520331
$ cat examples/kc_3.txt | awk '$1 == "kmer_compression_score"'
kmer_compression_score 0.509861
As with the RSEM-EVAL scores, the first assembly has a substantially
better KC score than the other two assemblies, and the second
assembly has a slightly better KC score than the third.
Step 6: Compute the alignment-based scores for each assembly
Finally, we compute the contig and nucleotide F1 scores.
To do so, we need to align each assembly to the estimated "true"
assembly, and vice versa, using Blat. Download Blat from its website
[3] (here, we use version 34), unpack it, and build it. Let's
say that you have installed Blat at /path/to/blat. Then we run Blat
as follows:
$ /path/to/blat -minIdentity=80 examples/ta_0.fa examples/toy_assembly_1.fa examples/toy_assembly_1_to_ta_0.psl
$ /path/to/blat -minIdentity=80 examples/ta_0.fa examples/toy_assembly_2.fa examples/toy_assembly_2_to_ta_0.psl
$ /path/to/blat -minIdentity=80 examples/ta_0.fa examples/toy_assembly_3.fa examples/toy_assembly_3_to_ta_0.psl
$ /path/to/blat -minIdentity=80 examples/toy_assembly_1.fa examples/ta_0.fa examples/ta_0_to_toy_assembly_1.psl
$ /path/to/blat -minIdentity=80 examples/toy_assembly_2.fa examples/ta_0.fa examples/ta_0_to_toy_assembly_2.psl
$ /path/to/blat -minIdentity=80 examples/toy_assembly_3.fa examples/ta_0.fa examples/ta_0_to_toy_assembly_3.psl
We can now compute the contig and nucleotide scores as follows:
$ ./ref-eval/ref-eval --scores contig,nucl --weighted no --A-seqs examples/toy_assembly_1.fa --B-seqs examples/ta_0.fa --A-to-B examples/toy_assembly_1_to_ta_0.psl --B-to-A examples/ta_0_to_toy_assembly_1.psl --min-frac-identity 0.90 | tee examples/contig_nucl_1.txt
$ ./ref-eval/ref-eval --scores contig,nucl --weighted no --A-seqs examples/toy_assembly_2.fa --B-seqs examples/ta_0.fa --A-to-B examples/toy_assembly_2_to_ta_0.psl --B-to-A examples/ta_0_to_toy_assembly_2.psl --min-frac-identity 0.90 | tee examples/contig_nucl_2.txt
$ ./ref-eval/ref-eval --scores contig,nucl --weighted no --A-seqs examples/toy_assembly_3.fa --B-seqs examples/ta_0.fa --A-to-B examples/toy_assembly_3_to_ta_0.psl --B-to-A examples/ta_0_to_toy_assembly_3.psl --min-frac-identity 0.90 | tee examples/contig_nucl_3.txt
Each of these output files, e.g., examples/contig_nucl_1.txt,
contains the (unweighted) contig and nucleotide precision, recall,
and F1 scores:
$ cat examples/contig_nucl_1.txt
unweighted_nucl_precision 0.998403
unweighted_nucl_recall 0.998403
unweighted_nucl_F1 0.998403
unweighted_contig_recall 1
unweighted_contig_precision 1
unweighted_contig_F1 1
Now let's compare the three assemblies' contig F1 scores:
$ cat examples/contig_nucl_1.txt | awk '$1 == "unweighted_contig_F1"'
unweighted_contig_F1 1
$ cat examples/contig_nucl_2.txt | awk '$1 == "unweighted_contig_F1"'
unweighted_contig_F1 0
$ cat examples/contig_nucl_3.txt | awk '$1 == "unweighted_contig_F1"'
unweighted_contig_F1 0
The first assembly recovered the single contig in the estimated
"true" assembly, but the other two assemblies did not recover it, at
least not at greater than 90 percent identity (the threshold set by
--min-frac-identity 0.90 above). This fact is also obvious from
looking at the assemblies.
Now let's compare the three assemblies' nucleotide F1 scores:
$ cat examples/contig_nucl_1.txt | awk '$1 == "unweighted_nucl_F1"'
unweighted_nucl_F1 0.998403
$ cat examples/contig_nucl_2.txt | awk '$1 == "unweighted_nucl_F1"'
unweighted_nucl_F1 0.419405
$ cat examples/contig_nucl_3.txt | awk '$1 == "unweighted_nucl_F1"'
unweighted_nucl_F1 0.815516
Note that by nucleotide F1 the third assembly scores better than the
second, the opposite of their ordering under the RSEM-EVAL and KC
scores; the various scores capture different aspects of assembly
quality, so such disagreements can occur.
References
[1] DETONATE website: http://deweylab.biostat.wisc.edu/detonate/
[2] RSEM website: http://deweylab.biostat.wisc.edu/rsem/
[3] Blat website: http://users.soe.ucsc.edu/~kent/src/