Error in Clustering Step of LTR Pipeline #241

Open
simone-says opened this issue Mar 25, 2024 · 5 comments
@simone-says

Describe the issue

I'm trying to run the LTR pipeline on its own to add to some libraries and then re-mask some genomes. So far, this is only happening with one genome. I use the -LTRStruct flag for RepeatModeler runs with no issues, so I'm not sure what the issue is here.

Reproduction steps

My exact commands are:
srun apptainer exec --bind=/projects:/projects /common/contrib/containers/tetools-v1.88.sif LTRPipeline ${species_name}.genome.fa -threads 40

I don't know how to reproduce this exactly; I ran it twice and got the same error message both times.
This is the genome that's giving an error: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_025583915.1/

Log output

Running LtrHarvest...     : 01:05:25 (hh:mm:ss) Elapsed Time
JANCLY010000001 is not in /projects/tollis_lab/busco_phylo/squamates/ref/omes/anolisSagrei.genome.fa.2bit
Running Ltr_retriever...  : 00:46:13 (hh:mm:ss) Elapsed Time
Aligning instances...     : 00:09:39 (hh:mm:ss) Elapsed Time
Clustering...LTRPipeline: Error - could not cluster MAFFT results.
             : 00:00:01 (hh:mm:ss) Elapsed Time
LTRPipeline : Error - could not open /projects/tollis_lab/busco_phylo/squamates/ref/omes/LTR_3321156.MonMar251237332024/clusters.dat! at /opt/RepeatModeler/LTRPipeline line 333.
srun: error: cn8: task 0: Exited with exit code 2

Environment (please include as much of the following information as you can find out):
Using TETools apptainer on Slurm HPC

  • How did you install RepeatModeler? e.g. manual installation from repeatmasker.org, bioconda, the Dfam TE Tools container, or as part of another bioinformatics tool?
    TE Tools v1.88

  • Which version of RepeatModeler do you have? The output of RepeatModeler without any options will be a help page with the version of the program displayed at the top.

    Version 2.0.5

  • Which version of RepeatMasker is this RepeatModeler installation using? Have you installed RepBase RepeatMasker Edition for RepeatMasker, or the full Dfam database?

    RepeatMasker 4.1.6

  • Operating system and version. The output of uname -a and lsb_release -a can be used to find this.
    Linux wind 4.18.0-513.9.1.el8_9.x86_64 #1 SMP Thu Nov 16 10:29:04 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

@rmhubley
Member

rmhubley commented Jun 18, 2024

Sorry for the delay. If you still have these files, could you check that "JANCLY010000001" is in fact a sequence in your input file:

% fgrep JANCLY010000001 ${species_name}.genome.fa

and that it also occurs in the twobit file:

% twoBitInfo /projects/tollis_lab/busco_phylo/squamates/ref/omes/anolisSagrei.genome.fa.2bit stdout | grep JANCLY010000001

When I download the assembly from the link you provided (GCF_025583915.1_AnoSag2.1_genomic.fna.gz) I do not see a sequence named JANCLY010000001. Did you alter the assembly in any way, or get it from a different source?
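Note that fgrep and grep match substrings, so a hit does not prove the name is an exact sequence identifier. As a rough sketch of a stricter check, the Python below compares the query against the first whitespace-delimited token of each FASTA header. The function name classify_id_match and the example header are made up for illustration and are not part of RepeatModeler.

```python
def classify_id_match(query, fasta_lines):
    """Return 'exact', 'prefix', or 'absent' for `query` among FASTA header IDs.

    The ID of a record is taken to be the first whitespace-delimited
    token after '>' (an assumption that matches common FASTA usage).
    """
    result = "absent"
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header
        seq_id = fields[0]
        if seq_id == query:
            return "exact"
        if seq_id.startswith(query):
            result = "prefix"  # e.g. query lacks a ".1" version suffix
    return result


# Illustrative input only; the header text is hypothetical.
example = [">JANCLY010000001.1 illustrative header", "ACGT"]
print(classify_id_match("JANCLY010000001", example))  # prefix
```

A "prefix" result would mean the name reported in the error is a shortened form of an identifier in the file rather than an identifier itself.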

@simone-says
Author

For that genome I ended up using an older version of RepeatModeler (a standalone Singularity container), but that one gave me a segmentation fault on several assemblies (I believe related to this), so I'm still troubleshooting the TE Tools version.

Yes, it is in the FASTA file (this is a different assembly, but the exact same error):

Running LtrHarvest...     : 21:50:53 (hh:mm:ss) Elapsed Time
JAUPFR010000001 is not in /projects/tollis_lab/TE/birds/repeatmasking/2_Results/1_RepMod/oreortyxPictus/oreortyxPictus.genome.fa.2bit
Running Ltr_retriever...  : 00:17:21 (hh:mm:ss) Elapsed Time
Aligning instances...     : 00:05:10 (hh:mm:ss) Elapsed Time
Clustering...LTRPipeline: Error - could not cluster MAFFT results.
             : 00:00:00 (hh:mm:ss) Elapsed Time
LTRPipeline : Error - could not open /projects/tollis_lab/TE/birds/repeatmasking/2_Results/1_RepMod/oreortyxPictus/LTR_4104555.ThuJul40845152024/clusters.dat! at /opt/RepeatModeler/LTRPipeline line 333.
srun: error: cn25: task 0: Exited with exit code 2

(base) [smg655@wind /projects/tollis_lab/TE/birds/genomes/2_refs_container/new_genomes ]$ grep "JAUPFR010000001" oreortyxPictus.genome.fa
JAUPFR010000001.1 Oreortyx pictus voucher MVZ:Bird:192823 isolate RCKB1079 SCAF_1, whole genome shotgun sequence

And the twobit file:

(base) [smg655@wind /projects/tollis_lab/TE/birds/repeatmasking/2_Results/1_RepMod/oreortyxPictus ]$ twoBitInfo oreortyxPictus.genome.fa.2bit stdout | grep "JAUPFR010000001"
JAUPFR010000001.1 184626880

@simone-says
Author

Oh also, I forgot to mention that initially the error I got from the Anolis sagrei assembly was because I was using the GenBank version and then switched to the RefSeq version but forgot to remove the output directory before re-running RepeatModeler. But the "clustering" error I submitted I'm getting on several assemblies, and it doesn't seem to be related to size or content.

@rmhubley
Member

Ok...I have tracked down the issue here. The problem is in LTR_retriever. It deals with long sequence identifiers (>13 characters) in a strange way. It attempts to truncate the identifiers (provided the truncated names still form a unique set for the genome) and warns in the log output that it worked around the issue on its own. The problem with that approach is that RepeatModeler doesn't know how to translate those new shortened identifiers back to the full-length ones. I will have to add some code to RepeatModeler to fix this. In the meantime, the only way to get this to work is to make sure you only feed RepeatModeler genomes with sequence identifiers <= 13 characters long, or to leave out the LTR pipeline from the run.
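For anyone wanting to apply the workaround before a fix lands, here is a minimal Python sketch that lists identifiers longer than 13 characters and builds a short-name mapping that can be saved for translating results back afterwards. The helper names and the seqN naming scheme are illustrative assumptions, not part of RepeatModeler or LTR_retriever; only the 13-character limit comes from the comment above.

```python
MAX_ID_LEN = 13  # limit quoted in the comment above


def read_ids(fasta_lines):
    """Collect the first token of every FASTA header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def long_ids(ids, max_len=MAX_ID_LEN):
    """Return the identifiers that exceed max_len characters."""
    return [seq_id for seq_id in ids if len(seq_id) > max_len]


def build_rename_map(ids, prefix="seq"):
    """Map each original ID to a short unique one (seq1, seq2, ...).

    The naming scheme is hypothetical; keep the mapping so that
    results can be translated back to the original identifiers.
    """
    return {original: f"{prefix}{n}" for n, original in enumerate(ids, start=1)}


def rename_fasta(fasta_lines, mapping):
    """Yield FASTA lines with each header replaced by its short ID."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            yield ">" + mapping[seq_id]
        else:
            yield line.rstrip("\n")


# Small in-memory example; sequences are placeholders.
records = [">JAUPFR010000001.1 Oreortyx pictus", "ACGT", ">chr1", "GG"]
ids = read_ids(records)
print(long_ids(ids))  # ['JAUPFR010000001.1']
mapping = build_rename_map(ids)
print(list(rename_fasta(records, mapping)))  # ['>seq1', 'ACGT', '>seq2', 'GG']
```

Renaming every record, not only the long ones, keeps the scheme uniform. The descriptions after the ID are dropped in the renamed file, so the mapping is the only link back to the original headers.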

@xo2003

xo2003 commented Oct 9, 2024

Hi,

I encountered the same problem, but only with TEtools 1.88.5. The container was set up using Docker Engine.
Here are my results from testing TEtools 1.88, 1.88.5, and 1.89.2:

# TEtools 1.88: LTRPipeline finishes
RepeatModeler Version 2.0.5
===========================
Using output directory = /work/RModeler_1.88/RM_33.TueOct80422222024
Search Engine = rmblast 2.14.1+
Threads = 96
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.6, RepeatMasker 4.1.6
LTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever v2.9.0,
                                   Ninja 0.97-cluster_only, MAFFT 7.471,
                                   CD-HIT 4.8.1 )
LTR Structural Analysis
=======================
Running LtrHarvest...     : 00:06:42 (hh:mm:ss) Elapsed Time
Running Ltr_retriever...  : 00:05:41 (hh:mm:ss) Elapsed Time
Aligning instances...     : 00:04:11 (hh:mm:ss) Elapsed Time
Clustering...             : 00:00:02 (hh:mm:ss) Elapsed Time
Refining families...      : 00:16:36 (hh:mm:ss) Elapsed Time
Program Time: 00:33:12 (hh:mm:ss) Elapsed Time
  -- Clustering results with previous rounds...
       - 393 RepeatScout/RECON families
       - 125 LTRPipeline families
       - Removed 18 redundant LTR families.
       - Final family count = 500
LTRPipeline Time: 00:33:51 (hh:mm:ss) Elapsed Time
# TEtools 1.88.5: LTRPipeline fails
RepeatModeler Version 2.0.5
===========================
Using output directory = /work/RModeler/RM_19.FriOct41654342024
Search Engine = rmblast 2.14.1+
Threads = 70
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.6, RepeatMasker 4.1.6
LTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever v2.9.0,
                                   Ninja /work/RModeler/../ftw2.1_genome, MAFFT 7.471,
                                   CD-HIT 4.8.1 )
LTR Structural Analysis
=======================
Running LtrHarvest...     : 00:06:59 (hh:mm:ss) Elapsed Time
Running Ltr_retriever...  : 00:04:59 (hh:mm:ss) Elapsed Time
Aligning instances...     : 00:04:04 (hh:mm:ss) Elapsed Time
Clustering...LTRPipeline: Error - could not cluster MAFFT results.
             : 00:00:01 (hh:mm:ss) Elapsed Time
LTRPipeline : Error - could not open /work/RModeler/RM_19.FriOct41654342024/LTR_463105.FriOct42143312024/clusters.dat! at /opt/RepeatModeler/LTRPipeline line 333.
LTRPipeline Time: 00:16:05 (hh:mm:ss) Elapsed Time
# TEtools 1.89.2: LTRPipeline finishes
RepeatModeler Version 2.0.5
===========================
Using output directory = /work/RModeler/RM_29.FriOct40950072024
Search Engine = rmblast 2.14.1+
Threads = 70
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.6, RepeatMasker 4.1.7
LTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever v2.9.0,
                                   Ninja 1.00-cluster_only, MAFFT 7.471,
                                   CD-HIT 4.8.1 )
LTR Structural Analysis
=======================
Running LtrHarvest...     : 00:06:57 (hh:mm:ss) Elapsed Time
Running Ltr_retriever...  : 00:05:01 (hh:mm:ss) Elapsed Time
Aligning instances...     : 00:04:10 (hh:mm:ss) Elapsed Time
Clustering...             : 00:00:03 (hh:mm:ss) Elapsed Time
Refining families...      : 00:15:46 (hh:mm:ss) Elapsed Time
Program Time: 00:31:57 (hh:mm:ss) Elapsed Time
  -- Clustering results with previous rounds...
       - 359 RepeatScout/RECON families
       - 125 LTRPipeline families
       - Removed 24 redundant LTR families.
       - Final family count = 460
LTRPipeline Time: 00:32:21 (hh:mm:ss) Elapsed Time

The LTR pipeline failed to run in TEtools 1.88.5, but it worked fine in both 1.88 and 1.89.2.
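One visible difference between the three headers is the Ninja field: the two working runs report a version string, while the 1.88.5 run shows a filesystem path there. Whether that explains the clustering failure is not established in this thread. As a rough sketch, the Python below pulls the dependency list out of the "LTR Structural Analysis: Enabled ( ... )" header and flags any entry that does not look like a version. The function names and the version pattern are assumptions for illustration.

```python
import re

# Loose pattern for things like "1.6.4", "v2.9.0", "0.97-cluster_only".
VERSION_RE = re.compile(r"^v?\d+(\.\d+)*[-\w]*$")


def parse_ltr_dependencies(header_text):
    """Return {tool: version_string} from the 'Enabled ( ... )' header."""
    match = re.search(
        r"LTR Structural Analysis:\s*Enabled\s*\((.*?)\)", header_text, re.S
    )
    if match is None:
        return {}
    deps = {}
    for item in match.group(1).split(","):
        parts = item.split(None, 1)  # tool name, then the rest
        if len(parts) == 2:
            deps[parts[0]] = parts[1].strip()
    return deps


def suspicious_versions(deps):
    """Return the entries whose value does not look like a version."""
    return {name: ver for name, ver in deps.items() if not VERSION_RE.match(ver)}


# Header copied from the TEtools 1.88.5 run above.
failing_header = """LTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever v2.9.0,
                                   Ninja /work/RModeler/../ftw2.1_genome, MAFFT 7.471,
                                   CD-HIT 4.8.1 )"""

print(suspicious_versions(parse_ltr_dependencies(failing_header)))
# {'Ninja': '/work/RModeler/../ftw2.1_genome'}
```

Running the same check on the 1.88 and 1.89.2 headers flags nothing, since their Ninja entries (0.97-cluster_only and 1.00-cluster_only) match the pattern.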
