Problem spawning processes on ubuntu-18.04 with openmpi 2.1.1 #12

Open
hndgzkn opened this issue Feb 18, 2021 · 0 comments

Unit tests fail on Ubuntu 18.04 with Open MPI 2.1.1 after renaming dicodile.py to _dicodile.py and exposing the dicodile function in __init__.py as

from ._dicodile import dicodile

__all__ = ['dicodile']

Running the test

dicodile/update_z/tests/test_dicod.py::test_stopping_criterion[6-signal_support0-atom_support0]

produces the following output:

0 Exception
[hande-VirtualBox:04908] [[59073,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87
1 Exception
[hande-VirtualBox:04908] [[59073,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 6 slots
that were requested by the application:
  /home/hande/dev/dicodile/env/bin/python

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
2 Exception
[hande-VirtualBox:04908] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
[hande-VirtualBox:04908] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
3 Exception
[hande-VirtualBox:04908] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
4 Exception
[hande-VirtualBox:04908] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
5 Exception
[hande-VirtualBox:04932] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 195
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Each exception is reported by the line

print(i, "Exception")

which runs when spawning the workers fails at the line

comm = MPI.COMM_SELF.Spawn(

The code spawns the specified number of processes (6 in this case). The processes start executing the specified main_worker.py script. However, each one stops at

from dicodile.utils import constants

where it tries to import from the dicodile package.

I tried adding lines before the import line, and they all run. The worker then fails silently at the import itself.
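
One way to make such a silent failure visible is to have each worker write its own log file around the import. The sketch below is only an illustration, not code from dicodile: the function name `import_with_diagnostics` and the log path are hypothetical. It records a Python exception if one is raised, and uses the standard-library `faulthandler` module to dump the stack if the import is still running after a timeout. If Open MPI aborts the process from inside MPI_Init, neither may trigger, but the "importing" line in the log still shows how far the worker got.

```python
import faulthandler
import importlib
import os
import traceback


def import_with_diagnostics(module_name, log_path, hang_timeout=30.0):
    """Import module_name, recording any exception or hang in log_path.

    Hypothetical helper for debugging; not part of dicodile or mpi4py.
    """
    with open(log_path, "a") as log:
        log.write(f"pid {os.getpid()}: importing {module_name}\n")
        log.flush()
        # If the import is still running after hang_timeout seconds,
        # dump the stacks of all threads to the log (covers a hang).
        faulthandler.dump_traceback_later(hang_timeout, file=log)
        try:
            module = importlib.import_module(module_name)
        except BaseException:
            # Record the full traceback before re-raising.
            traceback.print_exc(file=log)
            raise
        finally:
            # Must be cancelled before the log file is closed.
            faulthandler.cancel_dump_traceback_later()
        log.write(f"pid {os.getpid()}: imported {module_name}\n")
    return module
```

In a worker script this could replace the failing import, for example `import_with_diagnostics("dicodile.utils.constants", f"/tmp/worker_{os.getpid()}.log")`, where the log location is again just an assumption.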

For nb_workers = [1, 2], the code runs without problems.

For nb_workers = 6, it raises an exception while spawning the processes.

At first I thought the code was not able to access hostfile_test. However, I realized that the loop starting at the line

for i in range(10):

keeps running and spawns the specified number of processes in each iteration. It complains about an insufficient number of slots only in the iteration where the total number of spawned processes would exceed the number of slots in hostfile_test.
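
The retry behaviour described above can be sketched as follows. This is a simplified stand-in and not the actual dicodile code: `spawn_with_retries` is a hypothetical name, and the spawn call is passed in as a callable so that the loop itself can run without an MPI runtime.

```python
def spawn_with_retries(spawn, n_attempts=10):
    """Call spawn() until it succeeds, printing one line per failed attempt.

    Hypothetical helper that mirrors the shape of the loop described
    in this issue.
    """
    for i in range(n_attempts):
        try:
            return spawn()
        except Exception:
            # As observed in this issue, workers launched by a failed
            # attempt keep running, so every retry uses up more slots.
            print(i, "Exception")
    raise RuntimeError(f"could not spawn workers after {n_attempts} attempts")


# With mpi4py, the callable could look like this (needs an MPI runtime,
# so it is left as a comment):
#
#   import sys
#   from mpi4py import MPI
#   comm = spawn_with_retries(
#       lambda: MPI.COMM_SELF.Spawn(
#           sys.executable, args=["main_worker.py"], maxprocs=6))
```

Because nothing in the loop releases the workers from an earlier attempt, retrying cannot recover from this particular failure. It only moves the error from the spawn itself to the slot allocation.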

For the test above, hostfile_test specifies 16 slots. In the first iteration, it spawns 6 processes and then raises an exception, but the processes continue to run. In the second iteration, it starts 6 more processes, 12 in total. In the third iteration, with only 3 slots left, it complains that there are not enough slots.

I tried the same with 20 slots, and it complained in the fourth iteration, after starting 18 processes in the first three.
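
The slot arithmetic can be checked with a few lines of Python. This is a sketch under one assumption that the issue does not confirm: the `reserved` parameter models one slot being held by the parent process, which would explain why 3 rather than 4 slots remain after 12 workers out of 16. The function name is hypothetical.

```python
def first_failing_iteration(total_slots, workers_per_spawn, reserved=1,
                            max_iterations=10):
    """Return (iteration, slots_left) for the first spawn that cannot fit.

    Assumes that workers from earlier attempts are never released and that
    `reserved` slots are held by the parent process (an assumption).
    Returns (None, slots_left) if every iteration fits.
    """
    used = reserved
    for iteration in range(1, max_iterations + 1):
        if used + workers_per_spawn > total_slots:
            return iteration, total_slots - used
        used += workers_per_spawn
    return None, total_slots - used


# 16 slots, 6 workers per attempt: the third attempt fails with 3 slots left.
print(first_failing_iteration(16, 6))
# 20 slots, 6 workers per attempt: the fourth attempt fails.
print(first_failing_iteration(20, 6))
```

Setting `reserved=0` gives the same failing iteration for both cases, so the reported behaviour does not depend on that assumption. Only the number of slots left changes.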

A similar problem occurs when running the plot_mandrill.py example with 16 slots in the hostfile, using the command:
mpirun -np 1 --hostfile hostfile python -m mpi4py examples/plot_mandrill.py

Replace is False and data exists, so doing nothing. Use replace=True to re-download the data.
[DEBUG:DICODILE] Lambda_max = 11.274413430904202
0 Exception
[hande-VirtualBox:05655] [[58362,0],0] ORTE_ERROR_LOG: Not found in file orted/pmix/pmix_server_dyn.c at line 87
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 9 slots
that were requested by the application:
  /home/hande/dev/dicodile/env/bin/python

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
1 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
[hande-VirtualBox:05655] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
3 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
4 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
5 Exception
[hande-VirtualBox:05655] 1 more process has sent help message help-orte-rmaps-base.txt / orte-rmaps-base:alloc-error
6 Exception
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[hande-VirtualBox:05664] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 195