Skip to content

Commit

Permalink
first commit
Browse files Browse the repository at this point in the history
  • Loading branch information
Robert Hubley committed Nov 18, 2024
1 parent 4553692 commit df54426
Show file tree
Hide file tree
Showing 16 changed files with 276,038 additions and 0 deletions.
54 changes: 54 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
#####################################################################
#
# RepeatScout Project Build Script
#
#####################################################################
# $Log: Makefile,v $
# Revision 1.4 2008/08/07 21:59:12 rhubley
# - An error reported by Eric Ganko in the filter-stage-1.prl script.
#
#
####################################################################

# Set the version here
VERSION = 1.0.6

# Installation Directory
INSTDIR = /usr/local/RepeatScout-$(VERSION)

CFLAGS = -O3 -Wall
LIBS = -lm
OBJ= cmd_line_opts.o version.o

HDR= cmd_line_opts.h

all: RepeatScout build_lmer_table

RepeatScout: build_repeat_families.o build_repeat_families.h $(OBJ) $(HDR)
$(CC) build_repeat_families.o $(OBJ) -o $@ $(LIBS)

build_lmer_table: build_lmer_table.o build_lmer_table.h $(OBJ) $(HDR)
$(CC) build_lmer_table.o $(OBJ) -o $@ $(LIBS)

version.c: Makefile
echo "char const* Version = \"$(VERSION)\";" > version.c

clean:
@rm *.o build_lmer_table RepeatScout

.c.o:
$(CC) $(CFLAGS) -c $< -o $*.o $(CCINCLUDES)

install: all
@mkdir $(INSTDIR)
cp RepeatScout $(INSTDIR)
cp README $(INSTDIR)
cp build_lmer_table $(INSTDIR)
cp filter-stage-1.prl $(INSTDIR)
cp filter-stage-2.prl $(INSTDIR)
cp merge-lmer-tables.prl $(INSTDIR)
cp compare-out-to-gff.prl $(INSTDIR)

distribution:
rm *~
(cd ../; tar zcvf RepeatScout-$(VERSION).tar.gz RepeatScout-1 --exclude RepeatScout-1/orig --exclude RepeatScout-1/tests --exclude RepeatScout-1/CVS --exclude RepeatScout-1/rc-change-w-debug)
144 changes: 144 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
# RepeatScout

RepeatScout is a software tool for the de novo identication of repetitive element families in DNA sequences.

## Description

This repository contains the Dfam consortium maintained version of RepeatScout. The original RepeatScout
software was developed by Price A.L., Jones N.C. and Pevzner P.A. and described in detail in the following paper:

Price, Alkes L., Neil C. Jones, and Pavel A. Pevzner.
"De novo identification of repeat families in large genomes."
Bioinformatics 21.suppl_1 (2005): i351-i358.

RepeatScout identifies repeats by first identifying short, exact matches (seeds). The seed consensus is extended
into the flanking regions by peforming (banded) pairwise alignment between each seed location and each possible
consensus extension {A,C,G,T}. A "fit-preferred" alignment score is calculated for each extension, capping the
penalties for partial repeat instances in the extending set. The most abundant seeds are processed first, and
following extension the consensus is used to update the seed counts to reduce the chance that a family is
rediscovered in subsequent iterations. In addition, seeds instances are only counted if they are a minimum
distance apart, to reduce the chance of building consensi for satellite repeats.

The output of RepeatScout is a FASTA file containing the consensus sequences of the repeat families identified.

We have made some modifications to the original RepeatScout code to improve its performance and to make it easier
to use. These include:

* Adding boundary checks to the seed extension process to prevent seeds from extending across sequence boundaries.
* Switched post-processing simple-repeat checking from NSEG to DUSTMASKER.
* Fixed a few minor bugs in the scripts/code.

## Getting Started

### Dependencies

* C-compiler, Make
* Perl 5.5 or better (see http://www.perl.com)
* dustmasker (part of NCBI BLAST+ - ftp://ftp.ncbi.nlm.nih.gov/blast/executables/)
* trf (Tandem Repeats Finder - http://tandem.bu.edu/trf/trf.html)

### Installing

To build the RepeatScout software, follow these steps:
* download the source code tarball RepeatScout-#.#.#.tar.gz
from https://github.com/Dfam-consortium/RepeatScout/releases
* gunzip and untar it (e.g., tar -xvfz RepeatScout-#.#.#.tar.gz).
A directory named RepeatScout-### will be created.
* build the software by typing the following commands:
```
cd RepeatScout-###
make
```
This will generate two programs: build_lmer_table and RepeatScout.
You may leave these binaries where they are or copy them to any other location;
no external libraries are needed.

### Executing program

Running RepeatScout proceeds in four phases. First, build_lmer_table
creates a file that tabulates the frequency of all l-mers in the
sequence to be analyzed. Second, RepeatScout takes this table and
the sequence and produces a fasta file that contains all the repetitive
elements that it could find. Third, the "filter-stage-1.prl" script
is run on the output of RepeatScout to remove low-complexity and
tandem elements; RepeatMasker is run on the sequence of interest using
this filtered RepeatScout library. The program "filter-stage-2.prl"
then filters out any repeat element that does not appear a certain number
of times (by default, 10). Finally, the locations of the repeats found
by RepeatMasker are used, in conjuction with GFF files that describe
segmental duplications or exons or other such "uninteresting" regions
to remove sequences from the library that are likely to not be mobile
elements; the program "compare-out-to-gff.prl" does exactly this.

The RepeatScout program requires a substantial amount of memory
and a fair amount of time. On the human X chromosome, it requires
approximately _ hours on a 3 Ghz PC while using _ Gb of memory. We are
currently working on a way to decrease the memory usage of the program so
that it can run on much larger sequences (whole genomes) in a reasonable
amount of time.

You can see a list of command line parameters for each program by calling
the program with the "--h" flag.

### Parameter Choices

The repeat library, as constructed in the paper, was created with the
parameter "-stopafter" set to 500. ("-stopafter" is essentially how far
RepeatScout will continue searching for an alignment when the alignment
score is not improving.) The current default value for this parameter
is 100, which decreases running time significantly.

The default value of l, which is the "length of l-mer to consider", is set
to be ceil(log_4(L)+1) where:
ceil(x) = smallest integer greater than x
log_4(x) = log base 4 of x
L is the length of the input sequence
This value can be adjusted by giving the "-l" parameter, but it is essential
that the same value of -l be given to both build_lmer_table and RepeatScout.
It is not clear that values of l other than the default are sensible, but
the options are there if you need them.

See the help file for the RepeatScout program (--h) for a list of other tunable
parameters.


## Authors

Original Code Authors:
* Alexander L. Price
* Neil C. Jones
* Pavel A. Pevzner

Contributors to the Dfam consortium maintained version of RepeatScout include:
* Robert Hubley
* Arian Smit


## Version History

* 1.0.6
* Switched filter-stage-1.prl from deprecated NSEG to the
DUSTMASKER tool.
* 1.0.5
* Bug fix to filter-stage-1.prl reported by Eric Ganko.
* 1.0.4
* filter-stage-1.prl skipped the last sequence in the input file.
Thanks to Gyorgy Abrusan for reporting this.
* 1.0.3
* Sequence boundaries are now honored in calculations. The previous
version concatenates the sequences together and allowed seeds to
extend across sequence boundaries.
( Submitted by Robert Hubley, Institute for Systems Biology
<[email protected]> )
* 1.0.2
* Bug fix to handle IUPAC codes in build_lmer_table
* 1.0.1
* Bug fix (parameter settings)

## Original Code

Original code/links include [although some appear to be broken]:
* http://www-cse.ucsd.edu/groups/bioinformatics/software.html
* http://repeatscout.bioprojects.org/
* http://bix.ucsd.edu/repeatscout/

Loading

0 comments on commit df54426

Please sign in to comment.