Nextflow DSL2 script for running RepeatMasker on large assemblies/chromosomes/contigs in a cluster environment.
Workflow Process:
- Break up the input sequence into N-sized non-overlapping batches
- Search each batch using RepeatMasker with the provided options
- Adjust batch-local output sequence names/coordinates to global sequence names/coordinates
- Combine files and fix linkage IDs in both the out and align files (if alignments were requested)
- Generate a summary file (similar to the 'tbl' file)
- Compress output files
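The batching and coordinate-adjustment steps above can be sketched as follows. This is a minimal Python illustration, not the code the workflow runs. The function names `make_batches` and `to_global` are hypothetical, and the coordinate conventions (0-based half-open batches, 1-based inclusive hits) are assumptions made for the example. It also splits each sequence on its own, whereas the actual script may pack several short contigs into one batch.

```python
def make_batches(seq_lengths, batch_size):
    """Split each sequence into non-overlapping windows of at most batch_size bp.

    seq_lengths: dict mapping sequence name -> length in bp.
    Returns a list of (name, start, end) tuples, 0-based and half-open.
    """
    batches = []
    for name, length in seq_lengths.items():
        for start in range(0, length, batch_size):
            batches.append((name, start, min(start + batch_size, length)))
    return batches


def to_global(batch, local_start, local_end):
    """Map a 1-based inclusive hit inside a batch back to the full sequence."""
    name, offset, _ = batch
    return name, local_start + offset, local_end + offset


# Hypothetical input: two sequences, batch size of 50 bp.
batches = make_batches({"chr1": 120, "chr2": 40}, batch_size=50)
# chr1 is cut into [0,50), [50,100), [100,120); chr2 fits in a single batch.

# A hit reported at positions 5..20 of the second chr1 batch.
hit = to_global(batches[1], 5, 20)
```

Because the batches do not overlap, shifting by the batch start offset is enough to recover global coordinates. Annotations that span a batch boundary still need the linkage-ID fix-up described in the combine step.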
Prerequisites:
- Java JDK 11-19
- Nextflow 24.10+
- Three UCSC utilities (twoBitToFa, faToTwoBit, and bedSort), available from:
  - linux/windows: https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64
  - macos: https://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64
- RepeatMasker 4.x installed and configured
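Before launching the workflow, it can help to confirm that the prerequisite executables are present. The sketch below is one hedged way to do that in Python using `shutil.which`. The helper name `missing_tools` is hypothetical, and the check assumes the tools are reachable through PATH or through a directory list you pass in (for example the directories you configure as "ucscToolsDir" and "repeatMaskerDir"). It does not check the Java, Nextflow, or RepeatMasker versions.

```python
import shutil

# Executables named in the prerequisites above.
REQUIRED_TOOLS = ["java", "nextflow", "twoBitToFa", "faToTwoBit", "bedSort", "RepeatMasker"]


def missing_tools(tools, path=None):
    """Return the tools that cannot be found as executables.

    path: optional os.pathsep-separated directory list to search instead of
    the PATH environment variable.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisite executables were found.")
```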
Configuration:
- Edit the RepeatMasker_Nextflow.nf script and make the following customizations
  for your environment:
  - Set the dependency locations: "ucscToolsDir" and "repeatMaskerDir"
  - Set up a cluster environment for your cluster
Parameters:
--species       : Dfam species library (or use --inputLibrary for a custom library)
--nolow         : Use RepeatMasker '-nolow' option. Not recommended under normal
                  circumstances; it substantially increases false positives.
--xsmall        : Use RepeatMasker '-xsmall' option.
--s             : Use RepeatMasker '-s' option (little impact with RMBlast).
--inputSequence : FASTA file, optionally compressed with gzip.
--inputLibrary  : Uncompressed FASTA file containing consensi.
--outputDir     : Directory to store the results. It should already exist.
--engine        : Search engine to use [ default: rmblast ]
--batchSize     : Size of each cluster job in bp [ default: 50mb ]
--cluster       : One of "local", "quanah", "nocona", or "griz"
Examples:
NOTE: On some clusters it will be necessary to use full paths to all files specified as parameters.
o Run with standard libraries and a specified species:
nextflow run /path/RepeatMasker_Nextflow.nf \
--inputSequence /full_path_required/GCA_003113815.1.fna.gz \
--species "human" \
--cluster nocona
o Run with a custom library:
nextflow run /path/RepeatMasker_Nextflow.nf \
--inputSequence /full_path_required/GCA_003113815.1.fna.gz \
--inputLibrary /full_path_required/GCA_003113815.1-consensi.fa \
--cluster griz
Robert Hubley, 2020-2024