######################################################################
RepeatMasker
Developed by Arian Smit and Robert Hubley
Please refer to: Smit, AFA, Hubley, R. & Green, P "RepeatMasker" at
http://www.repeatmasker.org
######################################################################
RepeatMasker is a program that screens DNA sequences for interspersed
repeats and low complexity DNA sequences. The output of the program is
a detailed annotation of the repeats that are present in the query
sequence as well as a modified version of the query sequence in which
all the annotated repeats have been masked (default: replaced by
Ns). Sequence comparisons in RepeatMasker are performed by the program
cross_match, an efficient implementation of the Smith-Waterman-Gotoh
algorithm developed by Phil Green, or by WU-Blast developed by Warren
Gish.
This help file discusses the following topics:
0 Basic input and output
1 Options
1.1 Species and contamination check options
1.2 Options affecting which repeats get masked
1.3 Speed and search parameters
1.4 Output and formatting
1.5 ProcessRepeats options
2 Methodology and quality of output
2.1 Methodology
2.2 Scoring matrices
2.3 Databases
2.4 Sensitivity and speed
2.5 Selectivity and matches to coding sequences
2.6 Low complexity DNA and simple repeats
3 How to read the results
3.1 The annotation (.out) file
3.2 Alignments
3.3 The summary (.tbl) file
4 Applications
4.1 Use in database searches
4.2 Identification of DNA source and bacterial insertions
4.3 DateRepeats - Masking lineage-specific repeats for genomic alignments
4.4 Use with gene prediction programs and other applications
5 References
0 INPUT and OUTPUT
Input format:
Sequences have to be in 'FASTA format':
>sequencename all kind of info
AGCGATCGCATCGAGCGCATTCGCATGGGG
>sequencename2 all kind of info
GCCCATGCGATCGAGCTTCGCTAGCATAGCGATCA
The program tolerates errors in the FASTA format and accepts raw
sequence files, but does not work with other formats such as GenBank,
Staden, etc.
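As an illustration of the input format, the following Python sketch
splits FASTA text into records. It is not part of RepeatMasker: the
function name parse_fasta is hypothetical, and treating the first word
after '>' as the sequence name is an assumption based on the example
above. Lines that precede any '>' header (raw sequence) are ignored by
this sketch.

```python
def parse_fasta(text):
    """Split FASTA text into (name, description, sequence) tuples."""
    records = []
    name, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, desc, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, desc, "".join(chunks)))
    return records


example = """>sequencename all kind of info
AGCGATCGCATCGAGCGCATTCGCATGGGG
>sequencename2 all kind of info
GCCCATGCGATCGAGCTTCGCTAGCATAGCGATCA
"""

for name, desc, seq in parse_fasta(example):
    print(name, desc, len(seq))
```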
You can use RepeatMasker on a file containing multiple FASTA format
sequences and on multiple sequence files at the same time:
RepeatMasker *.fasta
This command will mask all files that end with .fasta in the current
directory and give separate reports for each file. Note that if you
have multiple small sequences it is considerably faster to run
RepeatMasker on one batch file than on many single sequence files. The
summary file will be more informative as well. However, analysis on
single files (when larger than 2 kb each) can be slightly more
accurate, since GC levels for each sequence will be calculated and
used to choose appropriate parameters.
Standard output:
RepeatMasker returns a .masked file containing the query sequence(s)
with all identified repeats and low complexity sequences masked. These
masked sequences are listed and annotated in the .out file. The masked
sequences are printed in the same order as they are in the submitted
file, whereas the sequences are presented alphabetically in the
annotation table. The .tbl file is a summary of the repeat content of
the analyzed sequence.
1 OPTIONS
1.1 Species options
-species <query species>  Indicate source species of query DNA
-lib [filename]           Allows the use of a custom library
Contamination checking options:
-is_only   only clips E. coli insertion elements out of FASTA and .qual files
-is_clip   clips IS elements before analysis (default: IS only reported)
-no_is     skips bacterial insertion element check
-rodspec   only checks for rodent specific repeats (no RepeatMasker run)
-primspec  only checks for primate specific repeats (no RepeatMasker run)
For detailed explanation of the contamination detection options, see
"4.2 Identification of DNA source" below.
-species
Interspersed repeats mostly are copies of transposable elements in
different states of erosion. Thus, depending on the time of activity
of the source transposable element, interspersed repeats generally are
specific to a (clade of) species, and different repeat libraries are
needed for different species. Species and clade names are taken from
the NCBI Taxonomy database
(http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html). In
principle, all unique clade names occurring in this database can be
used. Examples are:
-species "sus scrofa"
-species chimpanzee
-species arabidopsis
-species canidae
-species mammals
Capitalization is ignored; names consisting of multiple words need to
be enclosed in quotation marks.
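When RepeatMasker is launched from a script, the quoting of
multi-word species names is easy to get wrong. The Python sketch below
(Python 3.8 or later) builds a shell-safe command line. The helper
name and the file name pig_contigs.fasta are hypothetical; only the
-species option described above is used.

```python
import shlex


def repeatmasker_command(species, fasta_files):
    """Return a shell-safe RepeatMasker command line as one string."""
    argv = ["RepeatMasker", "-species", species, *fasta_files]
    # shlex.join quotes any argument that contains whitespace.
    return shlex.join(argv)


print(repeatmasker_command("sus scrofa", ["pig_contigs.fasta"]))
```

When the command is run with subprocess rather than through a shell,
the argument list can be passed directly and no quoting is needed.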
RepeatMasker builds one or more repeat consensus files the first time
a species/group has been chosen, or when a new database has been
downloaded. These will be written in a subdirectory of the Libraries
directory named after the date of the repeat database version and the
Latin name of the clade. For example, "-species monocotyledons"
creates the file
"..../RepeatMasker/Libraries/20040616/liliopsida/specieslib".
Currently, multiple files are created only for mammalian species,
bearing names like "shortcutlib" and "longlib", to which the queries
are compared sequentially.
The creation of these files takes some time (sometimes a few seconds),
but the next time RepeatMasker is run on the same species these
existing files will be used. When WU-BLAST is used as the search
engine (see 1.3), blastable libraries are built, again as a one-time
event for each species.
After multiple database updates, the libraries could hog some space,
and you may consider deleting the older
"..../RepeatMasker/Libraries/<date>" directories.
The files contain all repeats of the RepeatMasker database that have
been found in the genome of the given species, or have been found in a
related species and are thought to predate the speciation time of the
two. For example, -species gorilla will create a gorilla repeat file
that is almost as big as the human file, because almost all repeats in
human predate the 6-10 million years that separates us from the
gorilla, though none of the consensus sequences have been derived from
Gorilla DNA. A repeat file for hyraxes, for which order no repeats
have been submitted to the database yet, will contain all repeats
found in the human genome that are thought to be older than the origin
of most mammalian orders.
If a group of species is indicated, all repeats are included that are
found in any species belonging to this clade. Thus, "-species diptera"
leads to comparison against repeats found in the genomes of any
diptera species, currently primarily represented by fruitfly and
mosquitoes, and "-species murinae" compares the query to all known
murine repeats, including rat and mouse.
Not all "common" English names occur in the taxonomy database. For
example, "chimp", "squirrels", "grasses", or "carnivores" are not
present. The program will suggest functional names using Soundex, with
oftentimes unexpected results. Using Latin names is always safest.
famdb.py
famdb.py provides an interface to the Dfam and RepeatMasker libraries
in FamDB format, and allows you to see what species are covered and
how many repeats are assigned to them. For example:
`./famdb.py -i Libraries/Dfam.h5 lineage --ancestors --descendants human --format totals`
shows that there are 8 lineage-specific repeats and 1337 ancestral
(e.g. hominids, primates, mammals, and insertion artifacts).
You can run `./famdb.py --help` for more information about the available
commands.
These are the numbers and bp of repeat consensus sequences (excluding
simple repeats and RNAs) as of May 2009 for the best represented clades:
species                 # of consensi   total bp
All mammals combined             3081    4253979
Primates *                        585     902148
Rodents *                         606     931299
Carnivores *                      130     158362
Perissodactyls *                  130     220814
Ruminants *                       112     130320
Bats *                            131     112724
Marsupials                        554     863923
Monotremes                        102     159182
Birds                             425     644078
Amphibia (mostly frog)            230     428828
Teleost fish                     1140    2807233
Tunicates                         134     368438
Sea urchins                       211     560185
Flies                             306     906766
Mosquitos                         363     914943
Other insects                     356    1080649
Nematodes                         461     698036
Flatworms                         209     641758
Cnidarians                        911    3057775
Fungi                             256     695278
Arabidopsis                       544    1460558
Other dicot plants                742    2563646
Rice                              575    1430176
Maize / corn                      439    1566688
Other monocot plants              303     912057
Algae                             186     533952
* Only order-specific elements; these genomes are also matched to 400+
consensus sequences for elements active before the origin of orders.
-lib
The majority of species are of course not yet covered in the repeat
databases and many are far from complete, but you may have your own
collection. At other times you may want to mask or study only a
particular type of repeat.
For these types of situations, you can use the -lib option to
specify a custom library of sequences to be masked in the query. The
library file needs to contain sequences in FASTA format. Unless a full
path is given on the command line, the file is assumed to be in the
same directory as the sequence file.
The recommended format for IDs in a custom library is:
>repeatname#class/subclass
or simply
>repeatname#class
In this format, the data will be processed (overlapping repeats are
merged etc), alternative output (.ace or .gff) can be created and an
overview .tbl file will be created. Classes that will be displayed in
the .tbl file are 'SINE', 'LINE', 'LTR', 'DNA', 'Satellite', anything
with 'RNA' in it, 'Simple_repeat', and 'Other' or 'Unknown' (the
latter defaults when class is missing). Subclasses are plentiful. They
are not all tabulated in the .tbl file or necessarily spelled
identically as in the repeat files, so check the RepeatMasker.embl
file for names that can be parsed into the .tbl file.
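To check that the IDs in a custom library follow the recommended
format before a run, a header can be split as in the Python sketch
below. The function name and the repeat names in the usage lines are
hypothetical; falling back to 'Unknown' mirrors the default described
above for a missing class.

```python
def parse_library_id(header):
    """Parse '>repeatname#class/subclass' into (name, class, subclass)."""
    # Keep only the ID, dropping '>' and any description after whitespace.
    ident = header.lstrip(">").split()[0]
    name, _, classification = ident.partition("#")
    if not classification:
        # No '#class' part: the class defaults to 'Unknown'.
        return name, "Unknown", None
    rep_class, _, subclass = classification.partition("/")
    return name, rep_class, subclass or None


print(parse_library_id(">MyAlu#SINE/Alu"))
print(parse_library_id(">MyRepeat#LTR"))
print(parse_library_id(">Mystery"))
```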
You can combine the repeats available in the RepeatMasker library
with a custom set of consensus sequences. To accomplish this
use the famdb.py tool:
`./famdb.py -i Libraries/RepeatMaskerLib.h5 families --format fasta_name --ancestors --descendants 'species name' --include-class-in-name`
The resulting sequences can be concatenated to your own set of sequences in a
new library file.
1.2 Masking options (options that determine what kind of repeats are masked)
-cutoff [number]  sets cutoff score for masking repeats when using -lib
                  (default cutoff 225)
-nolow            does not mask low complexity DNA or simple repeats
-l(ow)            same as -nolow (historical)
-(no)int          only masks low complexity/simple repeats (no interspersed repeats)
-alu              only masks Alus (and 7SLRNA, SVA and LTR5) (only for primate DNA)
-div [number]     masks only those repeats that are less than [number] percent
                  diverged from the consensus sequence
-cutoff
When using a local library you may want to change the minimum score
for reporting a match. The default is 225; lowering it below 200 will
usually start to give you significant numbers of false matches, while
raising it to 250 will guarantee that all matches are real. Note that
low complexity regions in otherwise complex repeat sequences in your
library are most likely to give false matches.
-nolow / -l(ow)
With the option -nolow or -l(ow) only interspersed repeats are
masked. By default simple tandem repeats and low complexity
(polypurine, AT-rich) regions are masked besides the interspersed
repeats. For database searches the default setting is recommended, but
sometimes, e.g. when using the masked sequence to predict the presence
of exons, it may be better to skip the low complexity masking.
-noint / -int
When using the -noint or -int option only low complexity DNA and
simple repeats will be masked in the query sequence.
Inexact simple repeats may be spanned and hidden by an interspersed
repeat annotation. In particular, most A-rich simple repeats derived
from the poly A tails of SINEs and LINEs are merged with the
annotation of the SINE or LINE (i.e. you can't tell there is a simple
repeat). Thus, if you're interested in finding the location of
potentially polymorphic simple repeats, this option is recommended.
-norna
Because of their close similarity to SINEs and the abundance of some
of their pseudogenes, RepeatMasker by default screens for matches to
small pol III transcribed RNAs (mostly tRNAs and snRNAs). When you're
interested in small RNA genes, you should use the -norna option that
leaves these sequences unmasked, while still masking SINEs.
-alu
-div
You can limit the masking and annotation to (primate) Alu repeats with
the -alu option and to a subset of less diverged (younger) repeats
with the option -div. For example,
"RepeatMasker -div 20 -mus mysequence"
will mask only those rodent repeats and simple repeats that are less
than 20% diverged from the consensus sequence and
"RepeatMasker -div 10 -alu mysequence"
will mask Alus that are less than 10% diverged from the Alu consensus
sequences and no other repeats.
The -div option may be used to limit the masking to those repeats that
are specific to a species group for use in subsequent comparison of
orthologous genomic loci. Notice that a more sophisticated method to
mask lineage-specific repeats (currently only in mammals) is now
available with the script DateRepeats (4.3).
1.3 Options affecting speed and search parameters
-q    Quick search; 5-10% less sensitive, 3-4 times faster than default
-qq   Rush job; about 10% less sensitive,
-s    Slow search; 0-5% more sensitive, 2.5 times slower than default.
-pa(rallel) [number]
      Number of processors to use in parallel (only works for
      batch files or sequences larger than 50 kb)
-engine [crossmatch|wublast|decypher]
      Select a non-default search engine to use. If not specified
      RepeatMasker will use the default configured at install time.
-w(ublast)
      Use WU-blast, rather than cross_match as engine
      **DEPRECATED** Use -engine [crossmatch|wublast|decypher] now.
-frag [number]
      Maximum sequence length masked without fragmenting
      (default 40000).
-gc [number]
      Use matrices calculated for 'number' percentage background
      GC level.
-gccalc
      Program calculates the GC content even for batch files/small
      sequences.
-nocut
      Skips the steps in which repeats are excised.
-noisy
      Prints cross_match progress report to screen (defaults to
      .stderr file)
-s -q -qq
RepeatMasker can be run at four different sensitivity/speed levels,
with the option -q providing quick (less sensitive) and -s slow
(sensitive) results compared to default. The option -qq has been added
for when you're in a frightful hurry. Each higher gear is about 2-3
times faster, and 90% as sensitive as the next lower gear. See "2.4
Sensitivity and speed" below for details.
-w(ublast)
**DEPRECATED** See -engine.
-engine [crossmatch|wublast|decypher]
By default, RepeatMasker uses the search engine that was configured
during installation. To use a different search engine, specify it
with the -engine parameter.
Before June 2004, the script MaskerAid (written by Joey Bedell, Ian
Korf and Warren Gish at the St Louis Washington University Genome
Center) was necessary to use WU-BLAST with RepeatMasker, but that
functionality is now built in. RepeatMasker includes a search engine
object that allows relatively straightforward integration of other
search engines. Currently only WU-BLAST has the flexibility to accept
all cross_match options.
For longer sequences, default RepeatMasker runs with WU-BLAST take
about as long as cross_match powered runs at -qq settings (see "2.4
Sensitivity and speed"). The speed settings have relatively little
effect on the speed when using WU-BLAST, with the fastest settings
1.25-1.75 times as fast as the slowest settings, while the sensitivity
increases significantly at the slower settings. Thus, I recommend
always running RepeatMasker in sensitive (-s) or default mode when
using WU-BLAST. I've made the difference in parameters between
sensitive and default settings larger at -w settings, to make these
speed options more meaningful and gain more sensitivity (with little
cost in speed).
Even with these more extreme parameters, the sensitivity can't quite
reach that of the sensitive settings using cross_match, but it comes
very close, and the huge difference in speed makes this option
very attractive.
The output format with the -w option is identical to default and
scores are comparable, as the same complexity adjustment is applied.
The only difference is that, when using the wublast option, hyphens
in the sequence are retained (in default mode all non-letters are
deleted from the sequence). WU-BLAST uses hyphens to indicate
insurmountable barriers and alignments will not span hyphens.
-pa(rallel)
For sequences over 50 kb long or files with multiple sequences,
RepeatMasker can use multiple processors. When you type:
RepeatMasker -par 10 <file>
a batch file of sequences will run with up to 10 sequences at a
time, until all sequences are done, while a file with one large
sequence will be analyzed in up to 10 fragments at the same
time. The minimum fragment size is 25 or 33 kb, the maximum 66 kb (all
sequences over 100 kb are divided into 33-66 kb fragments). For
batch files no minimum size exists. Thus,
If <file> contains:       RM runs in parallel:
one 60 kb sequence        two 30 kb fragments
one 400 kb sequence       ten 40 kb fragments
one 1 Mb sequence         ten 50 kb fragments, twice
ten 500 bp sequences      ten 500 bp sequences
two 500 kb sequences      ten 50 kb fragments, twice
Processing of the detected matches takes place after all batches or
fragments have been cross-matched with the databases.
Beware that, generally, you are allotted a limited number of process
IDs (PIDs). RepeatMasker uses 4 PIDs for each parallel job, so if
you're allotted 64 user PIDs, you can 'only' run 16 fragments/batches
in parallel.
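The arithmetic above can be written out as a small Python sketch for
choosing a safe -pa value. The function name is hypothetical; the
figure of 4 PIDs per parallel job is taken from the text, and the PID
limit itself has to be looked up on your own system.

```python
def usable_parallel_jobs(requested, user_pid_limit, pids_per_job=4):
    """Cap the requested -pa value by the number of PIDs available."""
    # Each parallel job is assumed to need 'pids_per_job' process IDs.
    max_jobs = user_pid_limit // pids_per_job
    return min(requested, max_jobs)


# With 64 user PIDs, at most 64 // 4 = 16 jobs fit.
print(usable_parallel_jobs(requested=20, user_pid_limit=64))
```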
-frag
Even when the -par option is not used, RepeatMasker transparently
splits sequences over 40 kb into fragments of equal size with 1 kb
overlaps. Similarly, sequence batches containing more than 51 kb are
subdivided into batches of 40 kb or less. The -frag option sets the
maximum fragment and batch size.
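To make the idea of equal-sized, overlapping fragments concrete, here
is a Python sketch of one way such a split could be planned. It is an
illustration under stated assumptions, not RepeatMasker's actual
code: the function name is hypothetical, and the rule used (the
smallest number of fragments such that none exceeds the maximum
length) is an assumption. Coordinates are 0-based and half-open.

```python
import math


def plan_fragments(length, max_frag=40000, overlap=1000):
    """Return (start, end) pairs covering 'length' bp with overlaps."""
    if overlap >= max_frag:
        raise ValueError("overlap must be smaller than max_frag")
    if length <= max_frag:
        return [(0, length)]
    # Find the smallest fragment count whose fragment length fits.
    n = 2
    while True:
        frag_len = math.ceil((length + (n - 1) * overlap) / n)
        if frag_len <= max_frag:
            break
        n += 1
    step = frag_len - overlap  # distance between fragment starts
    return [(i * step, min(i * step + frag_len, length)) for i in range(n)]


for start, end in plan_fragments(100000):
    print(start, end, end - start)
```

Because of rounding, the last fragment can be a few bases shorter
than the others.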
The only visible effect of the fragmentation is in the alignment
files, where alignments at the edges of the fragments can be
duplicated and/or truncated. The 1 kb overlap between fragments
almost guarantees that there is no loss in sensitivity at the
edges. Fragmentation initially was implemented to allow the size of
sequences and sequence batches to be unlimited. Cross_match can be
very memory intensive when SW alignments have to be performed in large
matrices. This may happen with short minmatch and large bandwidth
settings. Note that RepeatMasker should not croak when cross_match
runs out of memory; it will redo the failed search with a higher word
length or smaller bandwidth until it succeeds. However, this will lead
to gradually less sensitive comparisons.
Fragmentation also can improve repeat detection when a genomic
sequence contains large regions of DNA with significantly different GC
levels (isochores), since sets of scoring matrices are chosen based on
the GC level of a fragment.
-gc
-gccalc
Neutral mutation patterns differ significantly depending on the GC
richness of a locus and we have calculated optimal scoring matrices
for the alignment to consensus sequences in a range of background GC
levels (see 2.2). Usually, RepeatMasker calculates the percentage of
the sequence consisting of Gs and Cs and uses the appropriate
matrices. However, the program defaults to using 'average' 43% GC
matrices when the query is shorter than 2000 bp or a batch file is
analyzed. This is because short sequences can diverge greatly from the
GC level of the locus. For example, CpG islands and exons are more GC
rich than the surrounding DNA, whereas a LINE-1 element can be more AT
rich than the background. In a batch file, RepeatMasker analyses all
sequences together with the same matrices. The percentage GC in all
the sequences combined may be inappropriate for some sequence entries;
using high GC level matrices in AT rich sequences (and vice versa) may
result in false masking.
One can override this behavior in two ways:
With the option -gc you can set the GC level to a certain percentage:
RepeatMasker -gc 37 mybatchofsequences.fa
lets the program use matrices appropriate for 37% GC background. The
batch could, for example, contain ESTs from a single locus with a
known GC level.
Alternatively, the -gccalc option forces RepeatMasker to use the
actual GC level of a short sequence or the average GC level of a batch
of sequences. The latter sequences, for example, may be contigs in a
sequencing project.
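The choice of GC level described above can be summarized in a short
Python sketch. It is a simplified model, not RepeatMasker's code: the
function names are hypothetical, ambiguous bases are simply ignored
when computing the percentage, and a "batch" is taken to mean any
input with more than one sequence.

```python
def gc_percent(seq):
    """Percentage of G and C among the A, C, G and T bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if (gc + at) else 0.0


def choose_gc_level(seqs, gc=None, gccalc=False):
    """Pick the background GC level used to select scoring matrices."""
    if gc is not None:
        return float(gc)  # -gc: the user sets the level explicitly
    is_batch = len(seqs) > 1
    is_short = len(seqs) == 1 and len(seqs[0]) < 2000
    if (is_batch or is_short) and not gccalc:
        return 43.0  # default 'average' matrices
    # Long single sequence, or -gccalc: use the measured GC level.
    return gc_percent("".join(seqs))


print(choose_gc_level(["ACGT" * 1000]))             # long single sequence
print(choose_gc_level(["ACGT" * 10]))               # short sequence
print(choose_gc_level(["GGCC" * 10], gccalc=True))  # short, with -gccalc
```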
-nocut
The option -nocut skips a step in the default procedure for human and
rodent queries, in which full-length younger inserts are spliced out of
the query to reconstruct a pre-insertion situation. RepeatMasker is
generally more sensitive and efficient when the deletion step is
included, as it can unearth older repeats that were interrupted by
these younger elements.
1.4 Output options
-a        shows the alignments in a .align output file; -ali(gnments) also works
-inv      alignments are presented in the orientation of the repeat (with option -a)
-cut      saves a sequence (in file.cut) from which full-length repeats are excised
          (temporarily dysfunctional)
-small    returns complete .masked sequence in lower case
-xsmall   returns repetitive regions in lowercase (rest capitals) rather than masked
-x        returns repetitive regions masked with Xs rather than Ns
-poly     reports simple repeats that may be polymorphic (in file.poly)
-ace      creates an additional output file in ACeDB format
-gff      creates an additional output file in General Feature Format (GFF)
-u        creates an untouched annotation file besides the manipulated file
-xm       creates an additional output file in cross_match format (for parsing)
-fixed    creates an (old style) annotation file with fixed width columns
-no_id    leaves out final column with unique ID for each element
-e(xcln)  calculates repeat densities (in .tbl) excluding runs of >25 Ns in query
-noisy    prints cross_match progress report to screen (defaults to .stderr file)
-a / -ali(gnments)
-inv
Alignments are saved in a .align file when using the option -a. They
are shown in the orientation of the query sequence, unless you use the
option -inv as well, which will return alignments in the orientation
of the repeats (see 3.2 Alignments).
-cut
The -cut option to RepeatMasker is not supported in this release. It
will be rolled into a new annotation utility in the near future. If
you need this functionality sooner please send an email to Robert
Hubley ( [email protected] ). Thanks for your patience.
The option made the program save a file "file.cut" which contains
an intermediate sequence in the masking progress. In this sequence all
full-length elements, young LINE-1 3' ends, and close to perfect simple
repeats were deleted.
-x
When -x is used the repeat sequences are replaced by Xs instead of
Ns. This allows one to distinguish the masked areas from
possibly existing ambiguous bases in the original sequence. However,
when running BLAST searches (and maybe other programs) Xs are deleted
from the query and the returned BLAST matches will have position
numbers that do not necessarily correspond to those of the original
sequence.
-xsmall
When the option -xsmall is used a sequence is returned in the .masked
file in which repeat regions are in lower case and non-repetitive
regions are in capitals.
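The difference between the default masking, -x and -xsmall can be
illustrated with a short Python sketch. The function and its interval
convention (0-based, half-open) are hypothetical and only mimic the
effect on the .masked sequence; they say nothing about how
RepeatMasker implements it.

```python
def mask_sequence(seq, repeats, mode="N"):
    """Mask the (start, end) intervals in seq.

    mode "N": default, repeats replaced by Ns
    mode "X": like -x, repeats replaced by Xs
    mode "xsmall": like -xsmall, repeats in lower case, rest in capitals
    """
    if mode not in ("N", "X", "xsmall"):
        raise ValueError("mode must be 'N', 'X' or 'xsmall'")
    out = list(seq.upper()) if mode == "xsmall" else list(seq)
    for start, end in repeats:
        for i in range(start, end):
            out[i] = out[i].lower() if mode == "xsmall" else mode
    return "".join(out)


query = "ACGTACGT"
for mode in ("N", "X", "xsmall"):
    print(mode, mask_sequence(query, [(2, 5)], mode))
```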
-poly
You can get a list of potentially polymorphic microsatellites with the
option -poly. This is simply a subset of the list in .out, with
dimeric to tetrameric repeats less than 10% diverged from perfection.
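The selection rule for -poly can be expressed as a simple filter. In
the Python sketch below, the record layout (a dict with the repeat
unit and the percent divergence) is hypothetical and chosen only for
illustration; the thresholds come from the description above.

```python
def potentially_polymorphic(simple_repeats, max_divergence=10.0):
    """Keep di- to tetrameric repeats below the divergence threshold."""
    return [
        rep for rep in simple_repeats
        if 2 <= len(rep["unit"]) <= 4 and rep["divergence"] < max_divergence
    ]


annotations = [
    {"unit": "CA", "divergence": 3.2},      # dimer, near perfect: kept
    {"unit": "A", "divergence": 0.0},       # monomer: dropped
    {"unit": "GATA", "divergence": 12.5},   # too diverged: dropped
    {"unit": "TTAGGG", "divergence": 1.0},  # hexamer: dropped
]
for rep in potentially_polymorphic(annotations):
    print(rep["unit"], rep["divergence"])
```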
-xm
When using the -xm option an additional output file (.out.xm) is
created that contains the same information as the .out file (excluding
the low-complexity/simple DNA), but then in the original cross_match
format. This output is harder to read but there are programs that
require the exact cross_match output format.
-u
The script ProcessRepeats adjusts the original RepeatMasker output so
that the annotation more closely reflects reality. With the option -u
a .ori.out file is created that contains the original (but sorted)
cross_match summary lines.
-ace
With the -ace option the script creates an .ace file. This is merely a
suggestion. The columns in the table currently are:
Motif_homol <repeat-name> RepeatMasker(method) <percent divergence>
<start in query> <end in query> <orientation> <start in consensus>
<end in consensus>
-gff
The script creates a .gff file with the annotation in 'General Feature
Format' (GFF). See http://www.sanger.ac.uk/Software/GFF for
details. The current output follows a Sanger convention:
<seqname> RepeatMasker Similarity <start in query> <end in query>
<percent divergence> <orientation> . Target "Motif:<repeat-name>"
<start in consensus> <end in consensus>
In this line, 'RepeatMasker' becomes 'RepeatMasker_SINE' if the match
is against an Alu. I don't know why.
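For scripts that need to produce or check lines in this layout, the
Python sketch below assembles one such line. The function name and
the example values are hypothetical; tab-separated fields follow
general GFF practice, and the one-decimal formatting of the divergence
is an assumption.

```python
def gff_line(seqname, start, end, divergence, orientation,
             repeat_name, cons_start, cons_end, source="RepeatMasker"):
    """Format one annotation in the GFF layout shown above."""
    fields = [
        seqname,
        source,
        "Similarity",
        str(start),
        str(end),
        f"{divergence:.1f}",
        orientation,
        ".",
        f'Target "Motif:{repeat_name}" {cons_start} {cons_end}',
    ]
    return "\t".join(fields)


print(gff_line("seq1", 100, 400, 12.3, "+", "AluY", 1, 300,
               source="RepeatMasker_SINE"))
```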
-fixed
Since April 1999 the column widths in the annotation table are
adjusted to the maximum length of any string occurring in a column;
this allows long sequence names to be spelled out completely.
Previously, a fixed column width table was returned, which can still
be obtained by using the -fixed option. Parsing should not be affected
by this change of default behavior, as the same number of columns with
the same formatted text are still separated by white space.
-no_id
Since September 2000 a column displaying a unique number (ID) for each
integrated element is printed by default. This used to be optional
(-id). Fragments of a single element, separated from each other by
subsequent insertions of other elements, deletions or recombinations,
carry the same number. This feature allows better interpretation of
the data and should greatly help proper graphical display of the
repeats.
The column follows all other columns, except for the (rare) indication
that an annotation overlaps another annotation (*). This change, which
was announced in the previous release, should not hinder most parsing
scripts. If it causes problems, the old format can be retrieved with
the option -no_id.
-excln
With this option, the percentages displayed in the .tbl file are
calculated using a total sequence length that excludes runs of 25 Ns
or more. This is useful when analyzing draft sequences that are often
concatenated contigs separated by (sometimes very) long stretches of
Ns. This option can be used with ProcessRepeats as well: the number
of Ns in long runs in the query is apparent in the .tbl file, so if
you decide afterwards that they should be excluded, you only need to
run ProcessRepeats with the option on the .cat file.
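The effect of excluding long runs of Ns on the reported densities can
be sketched in Python as follows. The function names are
hypothetical, and the threshold of 25 or more Ns follows the
description above.

```python
import re


def effective_length(seq, min_run=25):
    """Sequence length with runs of at least min_run Ns left out."""
    long_runs = re.findall(r"N{%d,}" % min_run, seq.upper())
    return len(seq) - sum(len(run) for run in long_runs)


def repeat_density(masked_bp, seq, exclude_n_runs=True):
    """Percentage of the (effective) sequence length that is masked."""
    total = effective_length(seq) if exclude_n_runs else len(seq)
    return 100.0 * masked_bp / total if total else 0.0


# Two contigs joined by a 30 bp gap, followed by 5 ambiguous bases.
draft = "ACGT" * 5 + "N" * 30 + "ACGT" + "N" * 5
print(len(draft), effective_length(draft))
print(repeat_density(10, draft), repeat_density(10, draft, False))
```

The short run of 5 Ns still counts towards the total length; only the
30 bp gap is excluded.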
-noisy
RepeatMasker used to print the voluminous cross_match progress reports
to the screen. Since the Dec 1998 version this output is stored in a
.stderr file and a more informative much smaller progress report is
printed to the screen. The option -noisy allows one to see the
cross-match reports coming by on the screen (yeah).
1.5 ProcessRepeats options
When you have already run RepeatMasker and want to recreate the .out
or .tbl file, you only need to rerun ProcessRepeats on the .cat
file(s), which will take just a small fraction of the time required to
rerun RepeatMasker. Such a situation can occur when you've
accidentally deleted the .out or .tbl file or want additional or
differently formatted output files. Note that alignment files
cannot be created unless RepeatMasker was run with the -a option and
that the original .tbl and .out file will be overwritten unless you
rename them.
ProcessRepeats -species mus -nolow -gff -excln myhumongousmousesequence.cat
Repeat matches are processed differently for different query species,
so the -species mus option is necessary. With the -nolow option, the
.out file will not contain information on simple repeats and low
complexity DNA anymore. The -gff option creates an additional output
file in GFF format, and the -excln option displays the density of
repeats in the .tbl file as a percentage of those bp that are not
contained in long stretches of Ns.
The options/flags for ProcessRepeats are:
-species <query species> Identical as for the RepeatMasker script
-lib skips most of processing, does not produce a .tbl file unless the
custom library is in the >name#class format.
-nolow does not display simple repeats or low_complexity DNA in the annotation
-noint skips steps specific to interspersed repeats, saving lots of time
-u creates an untouched annotation file besides the manipulated file
-xm creates an additional output file in cross_match format (for parsing)
-ace creates an additional output file in ACeDB format
-gff creates an additional output file in Gene Feature Finding (GFF) format
-poly creates an output file listing only potentially polymorphic simple repeats
-no_id leaves out final column with unique number for each element (was default)
-fixed creates an (old style) annotation file with fixed width columns
-excln calculates repeat densities excluding long stretches of Ns in the query
-orf2 results in sometimes negative coordinates for L1 elements; all L1 subfamilies
are aligned over the ORF2 region, sometimes improving interpretation of data
-a shows the alignments in a .align output file
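When ProcessRepeats has to be rerun on many .cat files, it can help to
assemble the command line in a script. The following is only a minimal
sketch: the helper name process_repeats_command is hypothetical, and it
simply passes through the option names listed above without checking
them.

```python
import shlex

def process_repeats_command(cat_file, species=None, flags=()):
    """Build an argument list for ProcessRepeats (hypothetical helper)."""
    args = ["ProcessRepeats"]
    if species is not None:
        args += ["-species", species]
    # Flags are passed through as given, e.g. "nolow" becomes "-nolow".
    args += ["-" + flag for flag in flags]
    args.append(cat_file)
    return args

cmd = process_repeats_command(
    "myhumongousmousesequence.cat",
    species="mus",
    flags=("nolow", "gff", "excln"),
)
# Reproduces the example command line shown earlier in this section.
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) if
ProcessRepeats is on the PATH.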
2 METHODOLOGY AND QUALITY OF OUTPUT
2.1 Methodology
RepeatMasker compares the query sequence against one or more files of
FASTA sequences. The sequences in the libraries provided with
RepeatMasker are consensus sequences derived from alignment of
multiple copies of interspersed or satellite repeats. For interspersed
repeats, a consensus tends to approach the sequence of the
transposable element from which the repeat is derived.
Both cross_match and WU-blast perform their Smith-Waterman (SW)
alignments by first identifying exact word matches and restricting the
alignment to a band or matrix surrounding each exact
match. Overlapping matrices are merged. The speed settings of
RepeatMasker are purely changes in the minimum word length from which
an alignment can be seeded and, in some cases, changes in the width of
the band. A wider bandwidth allows more gaps in the alignment and,
more importantly, increases the likelihood that neighboring matrices
overlap.
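The seeding idea described above can be illustrated with a small
sketch. This is not the cross_match or WU-blast implementation; the
function names, the diagonal-based definition of a band, and the merge
rule are assumptions chosen for illustration only.

```python
from collections import defaultdict

def find_seeds(query, subject, word_len):
    """Return (query_pos, subject_pos) pairs where an exact word match starts."""
    index = defaultdict(list)
    for j in range(len(subject) - word_len + 1):
        index[subject[j:j + word_len]].append(j)
    seeds = []
    for i in range(len(query) - word_len + 1):
        for j in index.get(query[i:i + word_len], ()):
            seeds.append((i, j))
    return seeds

def bands(seeds, bandwidth):
    """Turn seeds into diagonal bands [lo, hi] and merge bands that overlap."""
    intervals = sorted((i - j - bandwidth, i - j + bandwidth) for i, j in seeds)
    merged = []
    for lo, hi in intervals:
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged

query = "AAACGTACGTTT"
subject = "CGTACG"
print(find_seeds(query, subject, 6))   # longer word: fewer seeds
print(find_seeds(query, subject, 4))   # shorter word: more seeds
# Two seeds on diagonals 0 and 5: a wider band makes them overlap and merge.
print(bands([(0, 0), (5, 0)], 2))
print(bands([(0, 0), (5, 0)], 3))
```

A shorter minimum word length yields more seeds (more sensitive, slower),
and a wider band makes neighboring search regions more likely to merge.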
Cross_match does a low complexity adjustment of the raw SW score. When
WU-blast is used, the RepeatMasker script performs this adjustment. Low
complexity matches are the primary cause of false matches, and this
adjustment contributes significantly to the high selectivity of
RepeatMasker (see 2.5).
As a result of the existence of many related consensus sequences in
the database, usually multiple repeats match one region in the query
at the same time. Generally, cross_match and WU-blast report to the
script only those matches that are less than 80-90% overlapped by a
higher scoring match. This implies that, at first approximation, names
are assigned to repeats based on the highest SW score. Given
appropriate consensus sequences and alignment parameters, this is
intuitively correct as well. However, the scripts have a lot of code
to improve on this first approximation, primarily to deal with partial
matches.
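The "first approximation" described above, where a match is dropped
when a higher scoring match covers most of it, might be sketched as
follows. The tuple layout, the function name and the 80% threshold are
assumptions for illustration, and a match is only compared against
higher scoring matches that were themselves kept, which is a
simplification.

```python
def filter_overlapped(matches, max_overlap=0.8):
    """Keep matches less than max_overlap covered by a kept, higher scoring match.

    Each match is a (start, end, score, name) tuple with 1-based,
    inclusive coordinates on the query.
    """
    kept = []
    for start, end, score, name in sorted(matches, key=lambda m: -m[2]):
        length = end - start + 1
        dominated = False
        for k_start, k_end, _, _ in kept:
            overlap = min(end, k_end) - max(start, k_start) + 1
            if overlap > 0 and overlap / length >= max_overlap:
                dominated = True
                break
        if not dominated:
            kept.append((start, end, score, name))
    return kept

matches = [
    (105, 395, 1900, "AluSx"),   # fully inside the higher scoring AluY match
    (100, 400, 2100, "AluY"),
    (350, 700, 800, "L1PA4"),    # overlaps AluY by only 51 of 351 bp
]
for match in filter_overlapped(matches):
    print(match)
```

Here the AluSx match is discarded and the region is named AluY, while
the partially overlapping L1PA4 match is retained.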
The cut-off SW score above which matches are reported is empirically
derived (see '2.5 selectivity' below). Note that there is no cut-off
divergence level; reported matches can be less than 60% identical.
The alignment parameters (substitution matrices, and gap initiation
and extension penalties) are derived from data harbored in multiple
alignments of a special subset of interspersed repeats. The derived
matrices are theoretically optimal for a series of conditions (see
below). The gap penalties are sub-optimal, primarily because gap
lengths have a non-linear distribution and are poorly represented by a
single gap-extension penalty.
For primate, rodent and other mammalian DNA, the query is compared to
consecutive subsets of repeat libraries. For primates, perfect simple
repeats, full-length Alus, full-length short interspersed repeats, and
young L1 3' ends are first (and in that order) clipped from the
sequence to expose underlying older elements. Subsequently, the query
is compared to most repeats, a set of ancient elements under
especially sensitive settings, a large set of long retroviral
sequences under faster settings (to save time), and AT-rich L1 3' ends
that may have been discarded earlier as low complexity
matches. Finally, simple repeats and low complexity regions are
masked.
2.2 Scoring matrices
We have calculated statistically optimal scoring matrices for the
alignment of neutrally diverging (non-selected) sequences in human DNA
to their original sequence. These matrices have been in use since the
May 1998 release. The matrices were derived from alignments of DNA
transposon fossils to their consensus sequences. A series of different
matrices are used dependent on the divergence level (14-25%) of the
repeats and the background GC level (35-53%, neutral mutation patterns
differ significantly in different isochores).
These matrices are (close to) optimal for human genomic sequences
longer than 10 kb, for which length the GC level usually is
representative of the isochore in which the sequence lives. However,
the GC level of small fragments can diverge a lot from that of the
surrounding region (e.g. a fragment spanning a CpG island, a GC rich
exon or an AT-rich LINE-1 element), and RepeatMasker defaults to using
matrices derived for a 43% GC background when a sequence is shorter
than 2000 bp or when a batch file is submitted. When the appropriate
background GC level is known, this can be entered with the -gc option.
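The choice of background GC level described above can be summarized in
a short sketch. The function name and arguments are hypothetical, and
clamping the measured value to the 35-53% range for which matrices
exist is an assumption, not a documented behavior.

```python
def background_gc(sequence, is_batch=False, user_gc=None):
    """Return the background GC percentage used to pick a matrix (sketch)."""
    if user_gc is not None:
        return user_gc               # value supplied with the -gc option
    if is_batch or len(sequence) < 2000:
        return 43                    # default background stated in the text
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    percent = 100.0 * gc / acgt if acgt else 43
    # Assumption: clamp to the 35-53% range covered by the matrices.
    return min(53, max(35, percent))

print(background_gc("ACGT" * 100))                  # short sequence
print(background_gc("ACGT" * 1000))                 # 4 kb sequence, measured GC
print(background_gc("ACGT" * 1000, is_batch=True))  # batch file
print(background_gc("GC" * 10, user_gc=51))         # user-supplied level
```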
(Note that these matrices are an integral portion of RepeatMasker and
are covered under the same restrictions as the scripts and databases
as described in the signed software agreement).
2.3 Repeat databases
RepeatMasker is distributed with a copy of the
Dfam database ( www.dfam.org ). Dfam is a small but growing "open"
database of Transposable Element seed alignments, profile Hidden
Markov Models and consensus sequences.
RepeatMasker is also compatible with the RepBase database, which is
managed by the Genetic Information Research Institute and requires a
license to use. Up until 2019 we maintained the "Repbase RepeatMasker
Edition" libraries as co-editor of RepBase Update. For newer versions
of RepBase, users will need to use the sequences in FASTA format with
RepeatMasker's "-lib" option.
2.4 Sensitivity and speed
The program can be run at four levels of sensitivity. The only
difference between these settings is the minimum match or word length
in the initial (not quite) hashing step of the cross_match program
(see the cross_match/PHRAP documentation). For mammalian queries, the
"slow" setting will find and mask 0-5% more repetitive DNA sequences
than by default, whereas the "quick" settings miss 5-10%, and the
"rush" (-qq) settings may miss 10-25% of the sequences masked by
default. The alignments may extend more or be somewhat more accurate
in the more sensitive settings as well.
Following are benchmark times for a random 1 Mbp of sequence from each
of a variety of species, run in parallel on 4 Pentium4 2.4 GHz
processors with 3 GB RAM with June 2004 RepeatMasker databases. Times
are given as mm:ss (h:mm:ss above one hour). The percentage of the
query masked is given in parentheses.
                               ------------------------ cross_match ------------------------
Species         WUBlast (Def)  Rush           Quick          Default          Slow
--------------  -------------  -------------  -------------  -------------  ---------------
Human           02:54 (39.26)  01:54 (33.91)  05:05 (36.85)  22:15 (39.92)    57:54 (40.58)
Human-reversed  01:09 ( 1.98)  01:05 ( 2.00)  03:39 ( 2.06)  18:44 ( 2.07)    53:37 ( 2.09)
Chimpanzee      03:00 (40.83)  01:50 (35.24)  04:45 (38.70)  20:22 (41.59)    53:14 (42.24)
Mouse           03:31 (54.02)  01:47 (48.65)  04:21 (51.74)  18:54 (54.15)    47:26 (55.18)
Rat             04:46 (66.07)  02:05 (62.07)  04:32 (63.84)  19:41 (65.97)    48:23 (67.20)
Dog             02:24 (34.62)  01:32 (29.15)  03:07 (32.44)  12:29 (35.09)    30:14 (35.69)
Arabidopsis     01:01 ( 3.02)  00:51 ( 2.95)  04:41 ( 3.00)  46:52 ( 3.12)  1:46:53 ( 3.13)
Ciona savignyi  01:25 (15.64)  01:02 (13.12)  01:30 (14.45)  06:13 (15.90)    15:24 (16.30)
C. elegans      02:35 (22.63)  01:38 (20.84)  02:39 (22.52)  12:12 (23.21)    25:15 (23.59)
Drosophila      01:59 (47.21)  01:23 (43.08)  02:30 (45.60)  15:49 (47.51)    39:24 (48.38)
Chicken         00:42 ( 6.52)  00:35 ( 6.18)  00:58 ( 6.42)  04:59 ( 6.53)    11:48 ( 6.58)
Fugu            00:35 ( 5.89)  00:34 ( 5.40)  00:49 ( 5.70)  03:51 ( 5.89)    09:20 ( 6.05)
The human-reversed sequence is the "human" sequence reversed but not
complemented. 2% of this sequence is (properly) masked as simple
repeats or low complexity DNA.
Note that for many non-mammalian species the slower settings do not
dramatically increase the percentage recognized as interspersed
repeats. Most of the repeats in the databases for these species are
relatively young and thus are easily detected. This particular 1Mbp
Arabidopsis sequence is an extreme example, where at slow settings in
almost two hours only 1800 bp more is masked than at rush settings in
51 seconds (the Arabidopsis database is large).
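The Arabidopsis figures quoted above follow directly from the benchmark
table; the short calculation below simply redoes that arithmetic with
the table values (the variable names are ours).

```python
query_length_bp = 1_000_000
masked_rush_pct = 2.95     # Arabidopsis, cross_match rush
masked_slow_pct = 3.13     # Arabidopsis, cross_match slow

# Extra sequence masked at slow settings compared with rush settings.
extra_bp = round((masked_slow_pct - masked_rush_pct) / 100 * query_length_bp)

rush_seconds = 51                          # 00:51
slow_seconds = 1 * 3600 + 46 * 60 + 53     # 1:46:53
slowdown = slow_seconds / rush_seconds

print(extra_bp)            # 1800 bp
print(slow_seconds)        # 6413 seconds
print(round(slowdown))     # roughly 126 times slower
```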
The speed is also dependent on the repeat content of the sequence. For
human sequences, Alu rich sequences are analyzed fastest, LINE rich
sequences somewhat slower, repeat poor regions slower still, and long
satellite regions can take a while.
If you have several shorter sequences it is much faster to run
RepeatMasker on a batch file (all sequences in one file). On the above
computer, in the rush mode (cross_match), a batch of ten 5 kb sequences
is analyzed in 23 seconds, twenty 5 kb sequences in 34 seconds, etc.
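As a rough illustration of why batching pays off, the two timings
quoted above can be read as a fixed start-up cost plus a per-sequence
cost. The straight-line model below is an assumption based on just
those two data points, not a measured result.

```python
def fit_line(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

# (number of 5 kb sequences in the batch, seconds) from the text above.
per_sequence_s, overhead_s = fit_line((10, 23), (20, 34))

print(round(per_sequence_s, 2))   # about 1.1 s per additional sequence
print(round(overhead_s, 2))       # about 12 s of fixed overhead per run
```

Under this reading, running each sequence separately would pay the
fixed overhead every time, which is what a batch file avoids.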
The user time for larger sequences or sequence batches (50 kb and up)
is linearly related to the length of the query due to the
fragmentation of the query sequence.
The increase in speed by using multiple processors is dependent on the
usage of the computer and the above-mentioned non-linear relationships
of sequence length and processing time. However, under the right
circumstances, using 2 processors can increase the speed close to
twofold, because the most time-consuming processes are performed in
parallel.
2.5 Selectivity and matches to coding sequences
The cutoff Smith-Waterman scores for masking interspersed repeats are
conservative, since masking of one short potentially interesting
region generally is more harmful than not masking a number of hard to
find matches. If there are any false matches, they tend to have
scores close to the cutoff, which is 225 for most repeats, 300 for the
low-complexity LINE-1 search*, and 180 for the very old MIR, LINE2 and
MER5 sequences.
* most LINE-1s are detected with a 225 cut-off, but in one step in
RepeatMasker the low-complexity score adjustment is turned off to find
ancient A-rich L1 elements.
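The cutoffs listed above can be written down as a small lookup. This is
only a sketch: the function names are hypothetical, the family labels
are taken from the text, and treating a score equal to the cutoff as
reported is an assumption.

```python
DEFAULT_CUTOFF = 225
LOW_COMPLEXITY_L1_CUTOFF = 300
ANCIENT_CUTOFF = 180
ANCIENT_FAMILIES = ("MIR", "LINE2", "MER5")

def score_cutoff(family, low_complexity_l1_search=False):
    """Return the Smith-Waterman score cutoff for a repeat family."""
    if low_complexity_l1_search:
        return LOW_COMPLEXITY_L1_CUTOFF
    if family in ANCIENT_FAMILIES:
        return ANCIENT_CUTOFF
    return DEFAULT_CUTOFF

def is_reported(family, sw_score, low_complexity_l1_search=False):
    """Assumption: a match scoring exactly at the cutoff is reported."""
    return sw_score >= score_cutoff(family, low_complexity_l1_search)

print(is_reported("MIR", 200))    # True: above the 180 cutoff
print(is_reported("Alu", 200))    # False: below the 225 cutoff
```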
With each release, we test for the occurrence of false matches in
randomized and in inverted (but not complemented) DNA including a
range of isochores from 36% to 54% GC. To retain seeds for Smith
Waterman alignments, sequences are randomized at the 10 bp word
level. Note that the inverted sequences retain the low complexity and
simple repeat patterns of the original sequences. Even at sensitive
settings, for which false matches are most likely, the 1998-2004
versions of RepeatMasker have reported no (false) matches at all to
interspersed repeats in the randomized or inverted sequences. No
simple repeats were reported in the randomized queries.
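Control sequences of the two kinds described above could be generated
along the following lines. The function names are hypothetical, and
shuffling the order of consecutive 10 bp words is one reading of
"randomized at the 10 bp word level"; the procedure actually used for
the release tests may differ.

```python
import random

def inverted(sequence):
    """Reverse the sequence without complementing it."""
    return sequence[::-1]

def randomized_by_word(sequence, word_len=10, seed=0):
    """Shuffle the order of consecutive word_len-bp words (assumed procedure)."""
    words = [sequence[i:i + word_len] for i in range(0, len(sequence), word_len)]
    rng = random.Random(seed)   # fixed seed so the control is reproducible
    rng.shuffle(words)
    return "".join(words)

seq = "ACGTTGCAAT" * 3 + "GGGGGCCCCC" + "ATATATATAT"
print(inverted("AACGT"))
print(randomized_by_word(seq))
```

Both controls keep the base composition of the original, and the
inverted sequence also keeps its low complexity and simple repeat
patterns, as noted above.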
In a 1999 test, RepeatMasker returned only a single probably false
match (71 bp) when analyzing a batch of 4440 coding regions in human
mRNAs (7.2 Mb) at sensitive settings. The coding regions were
collected from GenBank, based on annotations, filtered for the
presence of complete ORFs and initiator methionines, and made more or
less non-redundant. When each coding region was analyzed individually
using the -gccalc option, 5 matches (414 bp, 0.006%) were falsely
masked (156 bp at default speed, 76 bp at quick settings). In this
analysis each sequence was analyzed with matrices chosen based on the
actual GC level, even for very short sequences, while in the batch
analysis of the coding regions the 'average' 43% GC matrices were
used.
The 1998 and later versions of RepeatMasker show somewhat more false
masking when a pre-1998 version of cross_match is used. These are
primarily the result of improper assumptions of the background
nucleotide frequency used in the scoring matrix calculation when
adjusting for the complexity of a match. Specifically, a very GC rich
region in an AT-rich isochore (like an exon) may improperly match a GC
rich repeat, since the scores for C/G matches are higher in the used
scoring matrix than for AT matches (calculated for this AT rich
background) whereas the old cross_match assumed a 50% GC
background in these calculations and equal scores for A/T and G/C
matches. The new version of cross_match reads the
correct nucleotide background level from the matrix used.
2.6 Simple repeats and low complexity DNA
Low-complexity DNA
By default, along with the interspersed repeats, RepeatMasker masks
low-complexity DNA. Simple repeats (micro-satellites) can originate at
any site in the genome, and therefore have an interspersed
character. Other low-complexity DNA, primarily poly-purine/
poly-pyrimidine stretches, or regions of extremely high AT or GC
content will result in spurious matches in some database searches as
well (especially in the ungapped BLASTN searches). For example,
extremely AT-rich regions consistently will give very low probability
matches to mitochondrial DNA in BLASTN searches. The settings are very
stringent, and we think that few if any sequences informative in
database searches are masked as low-complexity DNA. However, you can
skip the low-complexity DNA masking using the option -nolow or -l(ow).
Under the current settings a 100 bp stretch of DNA is masked when it
is >87% AT or >89% GC; a 30 bp stretch has to contain 29 A/T (or G/C)
nucleotides. The settings are slightly more stringent than the
original settings, partly because the gapped BLAST programs are less
sensitive to short regions of low complexity than the old gapless
BLAST. In coding regions I have not yet found extensive regions (>10
bp) masked as low complexity DNA that would not be masked by the
combined XNU and SEG filters routinely used in BLASTX.
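The composition thresholds quoted above can be checked with a simple
sliding window. This is a naive sketch, not the method RepeatMasker
uses internally; the function name is hypothetical, and the 30 bp rule
is expressed as "more than 28 of 30" bases.

```python
def low_complexity_windows(sequence, window=100,
                           at_threshold=0.87, gc_threshold=0.89):
    """Return (start, end, kind) for windows above the AT or GC threshold.

    Coordinates are 0-based with an exclusive end.
    """
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        at = (chunk.count("A") + chunk.count("T")) / window
        gc = (chunk.count("G") + chunk.count("C")) / window
        if at > at_threshold:
            hits.append((start, start + window, "AT"))
        elif gc > gc_threshold:
            hits.append((start, start + window, "GC"))
    return hits

# 100 bp window: 88% AT exceeds the threshold, 87% does not.
print(low_complexity_windows("A" * 88 + "G" * 12))
print(low_complexity_windows("A" * 87 + "G" * 13))
# 30 bp window: 29 of 30 A/T bases are required.
print(low_complexity_windows("AT" * 14 + "AG", window=30,
                             at_threshold=28 / 30, gc_threshold=28 / 30))
```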
Annotation of simple repeats
Although RepeatMasker does a good job in masking simple repeats to
avoid spurious matches in database searches, it is not written to find
and indicate all possibly polymorphic simple repeat sequences. Only
di- to pentameric and some hexameric repeats are scanned for and
simple repeats shorter than 20 bp are ignored. The -poly option prints
out a separate list of simple repeats of < 10% divergence from a
perfect repeat. However, even long perfect repeats may not be
presented in this list; e.g. two perfect 40 bp long (CA)n repeats
interrupted by 10 Ts are aligned in one piece and may be reported as
having > 10% divergence from the consensus. Many perfect hexameric or
longer unit repeats will be listed as more or less diverged smaller
unit repeats and may not appear in the .polyout file.
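The (CA)n example above can be made concrete with a simple ungapped
comparison against a perfect repeat. This is only an illustration of
why the 10% limit is exceeded; the function name is hypothetical, and
RepeatMasker's own divergence figure comes from its alignments and may
be calculated differently.

```python
def divergence_from_perfect(sequence, unit):
    """Percentage of positions differing from a perfect repeat of unit."""
    copies = len(sequence) // len(unit) + 1
    perfect = (unit * copies)[:len(sequence)]
    mismatches = sum(a != b for a, b in zip(sequence, perfect))
    return 100.0 * mismatches / len(sequence)

# Two perfect 40 bp (CA)n repeats interrupted by 10 Ts, treated as one piece.
interrupted = "CA" * 20 + "T" * 10 + "CA" * 20

print(round(divergence_from_perfect(interrupted, "CA"), 1))   # 11.1
print(divergence_from_perfect("CA" * 20, "CA"))               # 0.0
```

Ten differing positions out of 90 give about 11%, so the combined
repeat falls outside the < 10% limit even though each half is perfect.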
Also note that, in the default output, simple repeats expanded from
the poly A tails of Alus and LINE-1 are now included in the Alu or
LINE-1 annotation. This cleans up the annotation a bit and lets the
stand-alone poly A regions stand out (they may indicate the presence
of a processed pseudogene). However, even perfect simple repeats in
such tails will be hidden in the .out file.
A program optimized to quickly find all dimeric to pentameric repeats
is sputnik, available at http://espressosoftware.com/pages/sputnik.jsp.
Local duplications, tandem repeats and satellites.
Gary Benson's program "Tandem Repeats Finder" (another catchy name)
currently is the standard for finding satellites and all other direct
repeats (http://tandem.bu.edu/trf/trf.html).
Any local duplications (tandem, inverted, interrupted) can be detected
with the program miropeats (http://www.genome.ou.edu/miropeats.html),
which presents this similarity information graphically.
3 HOW TO READ THE RESULTS
3.1 The annotation (.out) file
The annotation file contains the cross_match summary lines. It lists
all best matches (above a set minimum score) between the query