
rule pfam_annotation error: customising hmmsearch log directory location #15

Open

mhyleung opened this issue Mar 25, 2022 · 6 comments

@mhyleung

Dear all

We have encountered the following error when running db_update on the head node of an AWS EC2 instance with SLURM, with the worker nodes also running on EC2 instances.

Error in rule pfam_annotation:
    jobid: 4
    output: /shared-efs/marcus2/database/pfam_annotation/pfam_annotations.tsv
    log: logs/pfannot_stdout.log, logs/pfannot_stderr.err (check log file(s) for error message)
    conda-env: /shared-efs/marcus2/agnostos-wf/db_update/.snakemake/conda/ae0ca3f79801625c00fe17fbeced98d9
    shell:
 
        set -x
        set -e

        export OMPI_MCA_btl=^openib
        export OMP_NUM_THREADS=28
        export OMP_PROC_BIND=FALSE

        NPFAM=$(grep -c '^NAME' /shared-efs/marcus2/agnostos-wf/databases/Pfam-A.hmm)
        NSEQS=$(grep -c '^>' /shared-efs/marcus2/database/gene_prediction/orf_seqs.fasta)
        N=$(($NSEQS * $NPFAM))

        # Run hmmsearch (MPI-mode)
        srun --mpi=pmi2 /shared-efs/marcus2/agnostos-wf/bin/hmmsearch --mpi --cut_ga -Z "${N}" --domtblout /shared-efs/marcus2/database/pfam_annotation/hmmsearch_pfam_annot.out -o /shared-efs/marcus2/database/pfam_annotation/hmmsearch_pfam_annot.log /shared-efs/marcus2/agnostos-wf/databases/Pfam-A.hmm /shared-efs/marcus2/database/gene_prediction/orf_seqs.fasta 2>logs/pfannot_stderr.err 1>logs/pfannot_stdout.log

        # Collect the results
        grep -v '^#' /shared-efs/marcus2/database/pfam_annotation/hmmsearch_pfam_annot.out > /shared-efs/marcus2/database/pfam_annotation/pfam_annotations.tsv

 
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: Submitted batch job 6

Error executing rule pfam_annotation on cluster (jobid: 4, external: Submitted batch job 6, jobscript: /shared-efs/marcus2/agnostos-wf/db_update/.snakemake/tmp.3xj2r3io/snakejob.pfam_annotation.4.sh). For error details see the cluster log and the log files of the involved rule(s).
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-03-25T144023.337224.snakemake.log

We do not see the /shared-efs/marcus2/database/pfam_annotation/ directory being generated, which suggests that something went wrong with hmmsearch. However, when we ran the hmmsearch command interactively (without srun) and changed the path of the log directory, it seemed to run without any errors.

This suggests that AGNOSTOS is having trouble accessing the log directory location used in the original script. If this is the core of the problem, how do we go about changing the log directory? Is there a config file where we can change its location?
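(For illustration, a quick check, assuming the workflow is launched from the db_update directory: the rule writes its logs to the relative logs/ path, so that directory has to exist in the working directory the cluster job runs from.)

# quick sanity check (sketch): make sure the relative logs/ directory exists
# in the directory from which the Snakemake workflow is launched
cd /shared-efs/marcus2/agnostos-wf/db_update
mkdir -p logs
ls -ld logs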

Thank you very much

Marcus

@genomewalker
Contributor

Hi @mhyleung
It seems the problem might be related to Snakemake not starting the MPI job properly. What is in your cluster config file for this rule? First, I would check whether you can run the bash script that was submitted through SLURM. Some of the variables we export are very specific to our system. If it fails, check whether removing the following lines helps:

export OMPI_MCA_btl=^openib
export OMP_NUM_THREADS=28
export OMP_PROC_BIND=FALSE
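For illustration, one way to test this might be to remove those exports from the generated job script and submit it by hand (a sketch; the tmp.* directory name comes from the error above and is specific to that failed run):

# submit the Snakemake-generated job script directly and watch its logs
sbatch /shared-efs/marcus2/agnostos-wf/db_update/.snakemake/tmp.3xj2r3io/snakejob.pfam_annotation.4.sh
squeue -u "$USER"                  # check that the job starts
tail -f logs/pfannot_stderr.err    # inspect the rule's stderr log once it runs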

Antonio

@mhyleung
Author

Thanks Antonio. Since my last message, we have changed the log directory on lines 16 and 17 of the pfam_annotation.smk file, i.e. from logs/pfannot_stdout.log, logs/pfannot_stderr.err to /shared-efs/marcus2/agnostos-wf/db_update/logs/pfannot_stdout.log, /shared-efs/marcus2/agnostos-wf/db_update/logs/pfannot_stderr.err.

We have also commented out the three export lines as recommended above. Now we have encountered another error. The pfannot_stderr.err file shows the following:

[que1-dy-workerr58xlarge-2:25635] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[que1-dy-workerr58xlarge-2:25635] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

This message repeats itself multiple times in the file, and after the repeats we see the following in the same file:

--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[que1-dy-workerr58xlarge-1:29346] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[que1-dy-workerr58xlarge-1:29347] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Is this error related to MPI infrastructure? Thanks

Marcus

@genomewalker
Contributor

genomewalker commented Mar 25, 2022

Yes, this is an MPI problem. You might want to change the mpi_runner entry in the workflow config to something that works on your system. Also, from the error it looks like Open MPI was built without SLURM support. You may find some hints here: https://jiaweizhuang.github.io/blog/aws-hpc-guide/#install-hpc-software-stack-with-spack
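For illustration, a couple of diagnostic commands that may help confirm what PMI support is actually available on the cluster (output will depend on how SLURM and Open MPI were built):

# list the PMI types this SLURM installation supports (e.g. pmi2, pmix)
srun --mpi=list
# check whether this Open MPI build includes PMI/PMIx components
ompi_info | grep -i pmix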

@mhyleung
Author

Thank you Antonio. We will check this out and keep you updated. You are a legend.

@genomewalker
Contributor

@mhyleung do you have any updates on this?

@mhyleung
Author

mhyleung commented Apr 8, 2022

Hi Antonio @genomewalker

We have been looking into this, and it seems the SLURM version we have on AWS does not support the "pmi2" setting used in the agnostos config file. Anything after SLURM v16.05 (which is what we have) should use pmix instead. We got carried away with other things, so we will try setting this up soon and see how it goes. We will keep you posted.
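For illustration, the switch we plan to try amounts to changing the PMI type passed to srun by the mpi_runner setting; the hmmsearch launch from the failing rule would then look roughly like this (a sketch, assuming SLURM was built with PMIx support):

# same hmmsearch launch as in the failing rule, but using PMIx instead of PMI-2
srun --mpi=pmix /shared-efs/marcus2/agnostos-wf/bin/hmmsearch --mpi --cut_ga \
    -Z "${N}" \
    --domtblout /shared-efs/marcus2/database/pfam_annotation/hmmsearch_pfam_annot.out \
    -o /shared-efs/marcus2/database/pfam_annotation/hmmsearch_pfam_annot.log \
    /shared-efs/marcus2/agnostos-wf/databases/Pfam-A.hmm \
    /shared-efs/marcus2/database/gene_prediction/orf_seqs.fasta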

Marcus
