* YASS
* Similarity search in DNA sequences
* version 1.16
** LICENSE
YASS is distributed under the dual-licence of the CeCILL (version 2 or any later
version), and the GPL (version 2 or any later version), so you can use, modify
and redistribute the program under these licences with almost no restriction.
** COMMAND SYNTAX
* Usage :
yass [options] { file.mfas | file1.mfas file2.mfas }
-h display this Help screen
-d <int> 0 : Display alignment positions (kept for compatibility)
1 : Display alignment positions + alignments + stats (default)
2 : Display blast-like tabular output
3 : Display light tabular output (better for post-processing)
4 : Display BED file output
5 : Display PSL file output
-r <int> 0 : process forward (query) strand
1 : process Reverse complement strand
2 : process both forward and Reverse complement strands (default)
-o <str> Output file
-l mask Lowercase regions (seed algorithm only)
-s <int> Sort according to
0 : alignment scores
1 : entropy
2 : mutual information (experimental)
3 : both entropy and score
4 : positions on the 1st file
5 : positions on the 2nd file
6 : alignment % id
7 : 1st file sequence % id
8 : 2nd file sequence % id
10-18 : (0-8) + sort by "first fasta-chunk order"
20-28 : (0-8) + sort by "second fasta-chunk order"
30-38 : (0-8) + sort by "both first/second chunks order"
40-48 : equivalent to (10-18) where "best score for fst fasta-chunk" replaces "..."
50-58 : equivalent to (20-28) where "best score for snd fasta-chunk" replaces "..."
60-68 : equivalent to (70-78), where "sort by best score of fst fasta-chunk" replaces "..."
70-78 : BLAST-like behavior : "keep fst fasta-chunk order", then trigger all hits for all good snd fasta-chunks
80-88 : equivalent to (30-38) where "best score for fst/snd fasta-chunk" replaces "..."
-v display the current Version
-M <int> select a scoring Matrix (default 3):
[Match,Transversion,Transition],(Gopen,Gext)
0 : [ 1, -3, -2],( -8, -2) 1 : [ 2, -3, -2],(-12, -4)
2 : [ 3, -3, -2],(-16, -4) 3 : [ 5, -4, -3],(-16, -4)
4 : [ 5, -4, -2],(-16, -4)
-C <int>[,<int>[,<int>[,<int>]]]
reset match/mismatch/transition/other Costs (penalties)
you can also give the 16 values of matrix (ACGT order)
-G <int>,<int> reset Gap opening/extension penalties
-L <real>,<real> reset Lambda and K parameters of Gumbel law
-X <int> Xdrop threshold score (default 25)
-E <int> E-value threshold (default 10)
-e <real> low complexity filter :
minimal allowed Entropy of trinucleotide distribution
ranging between 0 (no filter) and 6 (default 2.80)
-O <int> memory limit of the number of ungapped alignments (default 1000000)
-S <int> Select sequence from the first multi-fasta file (default 0)
* use 0 to select the full first multi-fasta file
-T <int> forbid aligning too close regions (e.g. Tandem repeats)
valid for single sequence comparison only (default 16 bp)
-p <str> seed Pattern(s)
* use '#' for match
* use '@' for match or transition
* use '-' or '_' for joker
* use ',' for seed separator (max: 32 seeds)
- example with one seed :
yass file.fas -p "#@#--##--#-##@#"
- example with two complementary seeds :
yass file.fas -p "##-#-#@#-##@-##,##@#--@--##-#--###"
(default "###-#@-##@##,###--#-#--#-###")
-c <int> seed hit Criterion : 1 or 2 seeds to consider a hit (default 2)
-t <real> Trim out over-represented seeds codes
ranging between 0.0 (no trim) and +inf (default 0.001)
-a <int> statistical tolerance Alpha (%) (default 5%)
-i <int> Indel rate (%) (default 8%)
-m <int> Mutation rate (%) (default 25%)
-W <int>,<int> Window <min,max> range for post-processing and grouping
alignments (default <64,65536>)
-w <real> Window size coefficient for post-processing and grouping
alignments (default 16)
NOTE : -w 0 disables post-processing
Scoring System :
-M <int> : choose a preselected matrix and gap opening and extension penalty
-----------------------------------------------------------------------------
the default "-M 3" gives the following scores :
+5 for match reward,
-4 for transversion penalty,
-3 for transition penalty,
and
-16 for gap opening penalty,
-4 for gap extension penalty.
but you can also set the scores yourself with -C and -G :
-C <int>(,<int>(,<int>(,<int>))) : choose a scoring system
-----------------------------------------------------------------------------
if you give
  2 parameters : match reward, mismatch penalty
  3 parameters : match reward, transversion penalty, transition penalty
  4 parameters : match reward, transversion penalty, transition penalty,
                 other penalty (non ACGTU letters)
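
The usage screen also says that -C accepts the 16 values of the matrix in
ACGT order. The sketch below shows how such a list could be derived from a
[Match,Transversion,Transition] triple, using the -M 3 row of the usage table
as input. It is only an illustration: the function names are hypothetical, and
the row-major layout of the 16 values is an assumption, not something this
README states.

```python
# Transitions stay within purines (A<->G) or within pyrimidines (C<->T);
# every other mismatch is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def substitution_matrix(match, transversion, transition, order="ACGT"):
    """Return the 16 scores as a flat list (assumed row-major, ACGT order)."""
    values = []
    for a in order:
        for b in order:
            if a == b:
                values.append(match)
            elif (a, b) in TRANSITIONS:
                values.append(transition)
            else:
                values.append(transversion)
    return values


def ungapped_score(seq1, seq2, matrix, order="ACGT"):
    """Sum the substitution scores of two equal-length, gap-free sequences."""
    index = {letter: i for i, letter in enumerate(order)}
    return sum(matrix[4 * index[a] + index[b]] for a, b in zip(seq1, seq2))


# Values taken from the "-M 3" row of the usage table: [5, -4, -3].
m3 = substitution_matrix(5, -4, -3)
print(",".join(str(v) for v in m3))
print(ungapped_score("ACGT", "ACAT", m3))
```

The comma-separated line printed first has the shape of a 16-value -C
argument; check it against your own YASS build before relying on it.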
-G <int>,<int> : choose a gap opening and extension penalty
------------------------------------------------------------------------------
the two values are the gap opening penalty and the gap extension penalty,
in that order.
-d <int> : choose Display preferences
-------------------------------------
-d 0 gives the positions of the alignments only
-d 1 gives the positions + statistical parameters and quick alignments
(these alignments are computed using seeds as anchors, which is
the fastest solution but not always the best one)
-d 2 blast-like tabular output
-d 3 "position - size" light tabular output
-d 4 BED file output
-d 5 PSL file output
-s <int> : choose Sorting
-------------------------
it is possible to sort according to score (-s 0), positions on
the first (-s 4) or the second file (-s 5) and, for multifasta
files, according to the first file chunks (-s 10 to 18), the
second file chunks (-s 20 to 28) or both of them (-s 30 to 38)
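
The two-digit -s codes listed in the usage screen combine a chunk-ordering
mode (tens digit) with one of the base criteria 0-8 (units digit). A minimal
sketch of that decomposition follows; the function name and the returned
tuple are hypothetical, and the labels are copied from the usage screen.

```python
# Base criteria 0-8, as listed under "-s <int>" in the usage screen.
BASE_CRITERIA = {
    0: "alignment scores",
    1: "entropy",
    2: "mutual information (experimental)",
    3: "both entropy and score",
    4: "positions on the 1st file",
    5: "positions on the 2nd file",
    6: "alignment % id",
    7: "1st file sequence % id",
    8: "2nd file sequence % id",
}


def decode_sort_code(code):
    """Split an -s value into (chunk mode 0-8, base criterion label)."""
    mode, criterion = divmod(code, 10)
    if not 0 <= mode <= 8 or criterion not in BASE_CRITERIA:
        raise ValueError(f"not a documented -s value: {code}")
    return mode, BASE_CRITERIA[criterion]


print(decode_sort_code(34))
```

Mode 0 means no chunk ordering; modes 1 to 8 correspond to the 10-18 through
80-88 ranges of the usage screen.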
-S <int> : choose the first multifasta chunk
--------------------------------------------
this parameter selects one chunk of the first multifasta file :
you can choose any valid chunk number, or set it to "0"
(the default) if you want to process all of them
-p <pattern> : modify the seed Pattern
--------------------------------------
This element represents the seed pattern to be searched
* examples of seed patterns:
- PatternHunter : ###__##_#__###
- Mandala : ###__#_##_### (forcing small span)
####_##_###
- YASS : #@##_#__##__#@# (on Bernoulli model with constraints on
seed jokers/tr-free elements)
##@_#@#__#_### (on pure Bernoulli model)
model with mow (minimally overlapping words) seeds
RYNNNNNNnnnNNNNN
- 2 seeds :
some examples to detect :
- very small random regions :
###-###-####,###-#--#--#-####
- small random regions :
##-#----###-####,###--#-#-#--#--###
- random regions :
##-#-##---##-##--#,##-###--#---#----###
- large random regions :
##--#---#-#--###---##,##-#--#-#--#--##-##
- very small transition rich random regions :
###@##-##-#@#,###-@-#--@--#####
- small transition rich random regions :
##@#-#-##@-###,#-##--#--#@#--@-###
- transition rich random regions :
###-#----#-#--#@#-@#,##-#--@#-#-###--@#
- large transition rich random regions :
#@#--#----###--@-##,#-#-#@---#-#-@-##-##
other examples :
mixed regions (chr-inver-fungi) :
#@-##-##-#--@-###,#-##@---##---#---#@#-#
mixed regions (chr-best) :
#-###@#--#-#-@##,###@-#-#------@#-#--##
mixed regions (chr-avg) :
##-#-#@#-#--@###,#-##@--#----#--##@##
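
To make the pattern symbols concrete ('#' match, '@' match or transition,
'-' or '_' joker, ',' seed separator), here is a small sketch that parses a
-p string and tests whether a seed hits two aligned windows. It is not the
YASS implementation: all function names are hypothetical, and counting '@' as
half a position in the weight is a convention assumed here, not stated in
this README.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def parse_seeds(pattern):
    """Split a -p argument into seeds and reject unknown symbols."""
    seeds = pattern.split(",")
    if len(seeds) > 32:
        raise ValueError("at most 32 seeds are allowed")
    for seed in seeds:
        if not seed or set(seed) - set("#@-_"):
            raise ValueError(f"invalid seed: {seed!r}")
    return seeds


def seed_span(seed):
    """Number of positions covered by the seed."""
    return len(seed)


def seed_weight(seed):
    """Assumed convention: '#' counts 1, '@' counts 1/2, jokers count 0."""
    return seed.count("#") + 0.5 * seed.count("@")


def seed_hits(seed, window1, window2):
    """True if the two windows satisfy every position of the seed."""
    if len(window1) != len(seed) or len(window2) != len(seed):
        return False
    for symbol, a, b in zip(seed, window1, window2):
        if symbol == "#" and a != b:
            return False
        if symbol == "@" and a != b and (a, b) not in TRANSITIONS:
            return False
    return True


for seed in parse_seeds("###-#@-##@##,###--#-#--#-###"):
    print(seed, seed_span(seed), seed_weight(seed))
```

Under this convention the two default seeds have spans 12 and 15 and the same
weight.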
-e : alignment Entropy
-----------------------
YASS does a filtering step based on the triplet composition
of the alignment
-e 0.0 : filter is off.
-e 2.80 : filter is set to its default level
-e 5.99 : filter is set to its highest level (don't use this)
PS : this option is subject to change in future versions
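
For intuition about the 0 to 6 range, the sketch below computes the Shannon
entropy, in bits, of the trinucleotide distribution of a single sequence;
with 64 possible triplets the maximum is log2(64) = 6. The use of overlapping
triplets, of base-2 logarithms and of a single sequence rather than a full
alignment are assumptions made for this illustration, and the function names
are hypothetical.

```python
import math
from collections import Counter


def trinucleotide_entropy(seq):
    """Shannon entropy (bits) of the overlapping triplets of seq."""
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if not triplets:
        return 0.0
    total = len(triplets)
    counts = Counter(triplets)
    # Sum of p * log2(1/p) over the observed triplets.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def passes_filter(seq, min_entropy=2.80):
    """Keep a sequence only if its entropy reaches the threshold."""
    return trinucleotide_entropy(seq) >= min_entropy


print(trinucleotide_entropy("AAAAAAAAAA"))
print(trinucleotide_entropy("ACGTAC"))
```

A homopolymer has a single triplet and therefore zero entropy, so any
non-zero threshold rejects it.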
-r <int> : Reverse complement strand
------------------------------------
YASS considers both the forward and the reverse complement strands
of the first sequence when -r 2 is set (the default). Use -r 0 to
keep only direct repeats, or -r 1 to keep only reverse complement
(inverted) ones.
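
As a reminder of what the reverse complement strand is, here is a short
sketch that builds the strand(s) selected by a given -r value. The function
names are hypothetical, and letters other than A, C, G, T are left unchanged,
which is a simplification.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strands(seq, r=2):
    """Return the strands to process for -r 0, 1 or 2."""
    if r == 0:
        return [seq]
    if r == 1:
        return [reverse_complement(seq)]
    if r == 2:
        return [seq, reverse_complement(seq)]
    raise ValueError("-r expects 0, 1 or 2")


print(strands("AACG"))
```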
-S <int> Select query sequence from a multifasta file (default 0)
----------------------------------------------------------------
If the first file provided is not fasta but multifasta, you can choose
(using this parameter) which fasta sequence will be the query ;
-S 0 (the default) uses the full multifasta file.
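
The selection can be pictured with the following sketch, which reads
multifasta text and returns the chunk(s) designated by an -S value. It only
mirrors the documented behaviour (a 1-based chunk number, 0 for the full
file); the parser and the function names are hypothetical and much simpler
than a real FASTA reader.

```python
def read_fasta_records(text):
    """Parse multifasta text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def select_chunks(records, s=0):
    """Return every record for s == 0, else the s-th record (1-based)."""
    if s == 0:
        return list(records)
    if not 1 <= s <= len(records):
        raise ValueError(f"no chunk number {s} in this file")
    return [records[s - 1]]


example = ">seq1\nACGT\nACGT\n>seq2\nTTGA\n"
print(select_chunks(read_fasta_records(example), 2))
```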
-T <int> "anti Tandem trick"
------------------------------
Forbid aligning too close regions (e.g. Tandem repeats)
valid only for comparing a single-sequence file against itself
-a -i -m : seed grouping parameters
-----------------------------------
YASS groups seeds before extension, then extends them once grouped.
These parameters let you modify the grouping criteria and the extension
limits that are fixed (before post-processing).
"-m" gives the expected mutation rate (25%) : it can be increased to 30%
or 40%, but no more in practice.
"-i" gives the expected indel rate (8%) : it can also be raised to ~ 10%
"-a" gives the statistical tolerance : when set to 5%, the bounds are chosen
so that 95% of the distribution of indels/mutations between seeds is
captured during the extension process.
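
To give a feel for what a 95% bound means, the sketch below uses a plain
binomial model: if each of n positions carries an event (mutation or indel)
independently with the given rate, it returns the smallest count k such that
at least 1 - alpha of the distribution lies at or below k. This is a
simplified stand-in chosen for illustration, not the computation YASS
performs, and the function name is hypothetical.

```python
from math import comb


def binomial_bound(n, rate, alpha=0.05):
    """Smallest k with P(X <= k) >= 1 - alpha for X ~ Binomial(n, rate)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * rate**k * (1 - rate) ** (n - k)
        if cumulative >= 1 - alpha:
            return k
    return n


# Default rates from the usage screen: -i 8%, -m 25%, -a 5%.
print(binomial_bound(100, 0.08))
print(binomial_bound(100, 0.25))
```

Raising the rate or lowering alpha widens the bound, which matches the idea
that more permissive parameters let more distant seeds be grouped.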
-W <min,max> -w <mul> : Post-processing parameters
--------------------------------------------------
In order to group consecutive alignments into a better-scoring
one, post-processing tries to group neighboring alignments in an
iterative process : it applies a sliding window to the text several
times and estimates the score of the possible groups formed.
The window size follows a geometric progression controlled by
<mul> and bounded by <min,max>.
-w 0 (or 1) disables this post-processing step.
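
One possible reading of the geometric window sizes is sketched below: start
at <min> and multiply by <mul> while the size stays within <max>. Whether
YASS starts exactly at <min>, or clamps the last window to <max>, is not
stated here, so both points are assumptions; the function name is
hypothetical.

```python
def window_sizes(w_min=64, w_max=65536, mul=16):
    """List the window sizes w_min, w_min*mul, ... that do not exceed w_max."""
    if mul <= 1:
        # -w 0 (or 1) disables post-processing: no window at all.
        return []
    sizes = []
    size = w_min
    while size <= w_max:
        sizes.append(size)
        size *= mul
    return sizes


print(window_sizes())
print(window_sizes(mul=0))
```

With the default values (-W 64,65536 and -w 16) this reading gives three
passes, since the next size after 16384 would exceed the upper bound.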
References
==========
YASS is presented in the following papers. If you use YASS, please cite
one (or more) of them in your work :
[1] L. Noe, G. Kucherov,
YASS: enhancing the sensitivity of DNA similarity search,
2005, Nucleic Acids Research, 33(2):W540-W543.
[2] L. Noe, G. Kucherov,
Improved hit criteria for DNA local alignment,
2004, BMC Bioinformatics, 5:149.