Softmasking genome with custom de novo library failure: WARNING: Comparison failed. Retrying with larger minmatch (10) #298

Open
Windyxia11 opened this issue Dec 10, 2024 · 1 comment

Describe the issue
Hello, I'm having trouble using a custom de novo library to softmask the sugarcane genome. When I used the Saccharum custom de novo library created by RepeatModeler to softmask the genome, it ran with no problems. But when I merged RepeatMasker's Saccharum library with the Saccharum custom de novo library and used the merged library to softmask the genome, it failed and output "WARNING: Comparison failed. Retrying with larger minmatch (10)". I don't know how to solve this. I ran it on a single chromosome (~100M); my command line and errors follow. Any help would be greatly appreciated!

Reproduction steps

    RepeatMasker -nolow -norna -pa 8 -xsmall -dir softmasked_genome/Saccharum/softmask-genome/test/ -lib softmasked_genome/Saccharum/softmask-genome/test/Saccharum_merged.lib softmasked_genome/Saccharum/softmask-genome/test/Chr8A.fa

Log output

WARNING: Comparison failed. Retrying with larger minmatch (10)
[the same warning appears 24 times in total]
Environment (please include as much of the following information as you can find out):

  • How did you install RepeatMasker? e.g. manual installation from repeatmasker.org, bioconda, the Dfam TE Tools container, or as part of another bioinformatics tool?
    manual installation

  • Which version of RepeatMasker do you have? The output of RepeatMasker -v can be used to find this.
    RepeatMasker version open-4.0.3

  • Have you installed RepBase RepeatMasker Edition, or the full Dfam database?
    RepBase RepeatMasker Edition

  • Operating system and version. The output of uname -a and lsb_release -a can be used to find this.
    Linux loginb2 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
@Windyxia11 Windyxia11 added the bug label Dec 10, 2024
rmhubley (Member) commented Dec 10, 2024

That is a very, very old version of RepeatMasker. While that is probably not causing this issue, I would highly recommend upgrading to 4.1.7 soon. The message you are getting is caused by the search engine exiting with an error. You didn't paste the start of your run, so I am not sure which search engine you are using (nhmmer, RMBlast, crossmatch, abblast, etc.). These are pretty stable tools and typically only fail if they run out of resources and are killed by the operating system. You noticed this after increasing the size of the library, so I suspect this is a memory issue.

There are a couple of different ways this could be going wrong. The first is that you are using "-nolow", an option that should only be used in special circumstances. This option prevents RepeatMasker from screening out potential false attractors for TE alignment, which can lead to a large increase in alignments. I am not sure why people commonly use this option; my guess is that they simply do not want to see simple repeat annotation in the final output. In recent versions of RepeatMasker, the tool generates a clarifying warning when it's used.

Another problem might be with the library itself. A library should be curated to some extent before use with any annotation tool. Redundancy within the library will have several undesirable impacts on the search: 1) Redundant sequences are redundant experiments and lead to higher false positives (multiple testing problem), 2) Redundant sequences (same family, different identifier) lead to mixed/incorrect annotation for a given insertion and, in turn, to imprecise summary statistics, and 3) Redundancy costs alignment time and generates a massive amount of alignment data that needs to be adjudicated. I suspect one or more of these may be problematic for your run.
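
As a rough illustration of the redundancy point above, here is a minimal Python sketch that removes exact duplicate sequences from a merged FASTA library and reports identifiers that occur more than once. It is not part of RepeatMasker; the function names and the demo records are hypothetical. It only catches byte-identical sequences (ignoring case) and does not handle reverse complements or near-identical family members, which would need a proper clustering step.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_library(records):
    """Drop exact duplicate sequences (case-insensitive); report repeated IDs.

    Returns (kept_records, repeated_ids). The ID is taken to be the first
    whitespace-delimited token of the header (an assumption of this sketch).
    """
    seen_seqs, seen_ids = set(), set()
    kept, repeated_ids = [], []
    for header, seq in records:
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        key = seq.upper()
        if key in seen_seqs:
            continue  # identical sequence already kept under another header
        seen_seqs.add(key)
        if seq_id in seen_ids:
            repeated_ids.append(seq_id)  # same ID, different sequence
        seen_ids.add(seq_id)
        kept.append((header, seq))
    return kept, repeated_ids


# Hypothetical demo records, not taken from any real library.
demo = """>rnd-1_family-1#LTR/Gypsy
ACGTACGT
>SomeRepeat#LTR/Gypsy
acgtacgt
>rnd-1_family-1#LTR/Gypsy
TTTTGGGG
""".splitlines()

kept, repeated = dedupe_library(read_fasta(demo))
for header, seq in kept:
    print(f">{header}\n{seq}")
print("repeated IDs:", repeated)
```

To run it on a real library, pass an open file handle to `read_fasta` in place of `demo`. Any identifiers reported as repeated would still need to be reviewed by hand.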

Typically in cases like this the problem occurs during the alignment adjudication phase (ProcessRepeats), when the massive amount of excess alignments is processed at once. The way around that has been to run RepeatMasker on a broken-up genome and then combine the results manually. Here you can't use that approach, as the problem occurs in the search phase and would still occur. If you have access to a machine with more main memory, that might be worth a try.

@rmhubley rmhubley added question and removed bug labels Dec 10, 2024