
Test mpi versions #20

Draft · hndgzkn wants to merge 11 commits into main from test_mpi_versions
Conversation

@hndgzkn (Collaborator) commented Mar 2, 2021

Runs the tests:

  • on ubuntu-18.04 and ubuntu-20.04
  • using the mpich and openmpi implementations
  • with system and conda installations of the MPI implementations

Tests with openmpi on ubuntu-18.04 fail due to #12.

Tests with mpich on both ubuntu-18.04 and ubuntu-20.04 fail due to #19.

@codecov (bot) commented Mar 2, 2021

Codecov Report

Merging #20 (75a5004) into main (0aad2ea) will not change coverage.
The diff coverage is n/a.

❗ Current head 75a5004 differs from pull request most recent head 909cdcf. Consider uploading reports for the commit 909cdcf to get more accurate results

@@           Coverage Diff           @@
##             main      #20   +/-   ##
=======================================
  Coverage   74.29%   74.29%           
=======================================
  Files          41       41           
  Lines        2587     2587           
=======================================
  Hits         1922     1922           
  Misses        665      665           
Flag        Coverage Δ
unittests   74.29% <ø> (ø)

Flags with carried forward coverage won't be shown.


Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0aad2ea...909cdcf.

@hndgzkn hndgzkn force-pushed the test_mpi_versions branch 2 times, most recently from 2f39e76 to aaa4638 Compare March 11, 2021 15:41
@hndgzkn hndgzkn force-pushed the test_mpi_versions branch from aaa4638 to 75a5004 Compare March 25, 2021 11:04
@tomMoral (Owner) commented Mar 26, 2021

I am not sure why it is still in fail-fast mode. Did you rebase on master? I see you did, so I am not sure why the tests are stopped then.

It would be nice to have all these tests, potentially with xfail on the configurations known to cause problems?
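One possible shape for such a marker, as a sketch rather than existing dicodile code (the environment variable and its values are hypothetical, e.g. something exported by the CI matrix):

import os

import pytest

# Hypothetical variable exported by the CI job to say which MPI
# implementation is installed (name and values are made up for this sketch).
MPI_IMPL = os.environ.get("MPI_IMPLEMENTATION", "")


# run=False avoids executing a test that is expected to hang; it is simply
# reported as xfail on that configuration.
@pytest.mark.xfail(MPI_IMPL == "mpich",
                   reason="hangs with mpich, see #19",
                   run=False)
def test_dicodile():
    ...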

@hndgzkn (Collaborator, Author) commented Mar 26, 2021

> I am not sure why it is still in fail-fast mode. Did you rebase on master? I see you did, so I am not sure why the tests are stopped then.
>
> It would be nice to have all these tests, potentially with xfail on the configurations known to cause problems?

They are stopped because of a timeout.

The tests with mpich hang at some point due to #19 and then wait until the maximum timeout for GitHub Actions; when it is reached, they are cancelled.

@hndgzkn (Collaborator, Author) commented Mar 26, 2021

@tomMoral The main problem for the tests with mpich is that we need to run them with mpiexec -np 1 pytest .. due to pmodels/mpich#4853.
But when the tests are run with mpiexec (for both openmpi and mpich), there is a problem with stopping the spawned processes. I do not know how to release the resources started by MPI for the tests (the problem appears only when running the tests). Do you have any idea?
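For context, the usual mpi4py shutdown handshake for spawned workers looks roughly like the sketch below; this is illustrative only, not dicodile's actual teardown code, and the stop message and tag are made up:

from mpi4py import MPI


def stop_workers(intercomm):
    # Tell every spawned worker to leave its main loop (message and tag
    # are illustrative).
    for rank in range(intercomm.Get_remote_size()):
        intercomm.send("stop", dest=rank, tag=0)
    # Both sides must call Disconnect so the MPI runtime can release the
    # resources tied to the intercommunicator.
    intercomm.Disconnect()


def worker_main():
    parent = MPI.Comm.Get_parent()
    while True:
        if parent.recv(source=0, tag=0) == "stop":
            break
    parent.Disconnect()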

@tomMoral (Owner) commented

The problem seems to be in the init of MPI, with an issue on an argument, no?

It seems that the process hangs just before calling dicodile/tests/test_dicodile.py::test_dicodile.
[screenshot: CI log showing the test run hanging at this point]
I think one of the issues is that from mpi4py import MPI only returns once MPI_Init completes. This call is triggered by the import, so it is hard to think of a way to detect the failure if the call itself does not return.

One way to detect this would be to wrap the import with faulthandler.dump_traceback_later(timeout=120) and faulthandler.cancel_dump_traceback_later(), so that the process exits if the import hangs for more than 2 minutes, with info that might help with debugging.
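A minimal sketch of that wrapping, placed wherever dicodile first imports mpi4py (the exact location is left open):

import faulthandler

# Dump the traceback of all threads and exit the process if the import hangs
# for more than 2 minutes, e.g. if MPI_Init never returns.
faulthandler.dump_traceback_later(timeout=120, exit=True)
try:
    from mpi4py import MPI  # blocks until MPI_Init completes
finally:
    # The import returned (or raised): cancel the pending traceback dump.
    faulthandler.cancel_dump_traceback_later()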

WDYT?

@hndgzkn (Collaborator, Author) commented Mar 29, 2021

> The problem seems to be in the init of MPI, with an issue on an argument, no?
>
> It seems that the process hangs just before calling dicodile/tests/test_dicodile.py::test_dicodile.
> [screenshot: CI log showing the test run hanging at this point]
> I think one of the issues is that from mpi4py import MPI only returns once MPI_Init completes. This call is triggered by the import, so it is hard to think of a way to detect the failure if the call itself does not return.
>
> One way to detect this would be to wrap the import with faulthandler.dump_traceback_later(timeout=120) and faulthandler.cancel_dump_traceback_later(), so that the process exits if the import hangs for more than 2 minutes, with info that might help with debugging.
>
> WDYT?

@tomMoral As far as I understand, this message is due to the singleton init feature not being implemented in mpich; see the mpich issue on GitHub (pmodels/mpich#4853).

Details are explained in #19.

I think with mpich we need to run the tests with:

mpirun -np 1 --host localhost:16 pytest

Note: we can actually use the same command for both mpich and openmpi. As the hostfile formats for mpich and openmpi are not the same, --host localhost:16 avoids having to set a hostfile.

When we use the above command with:

  • openmpi: all tests pass; however, it cannot stop the processes spawned by the last test and it hangs.
  • mpich: some test_dicodile tests pass, but, as with openmpi, it cannot stop the spawned processes. test_dicod has another problem.

I would expect the openmpi version to stop the spawned processes properly, which makes me think that the code that stops the spawned processes might not be reliable.

@hndgzkn (Collaborator, Author) commented Mar 30, 2021

@tomMoral I tried using mpich with a very simple MPI program that spawns a number of processes (it gets the hostfile from the environment) to see whether the problem arises from the dicodile code.
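A minimal sketch of such a program, assuming mpi4py; the file name prog.py, the environment variable, and the worker count are illustrative, and the "hostfile" info key may be implementation-specific:

# prog.py -- hypothetical minimal reproducer.
import os
import sys

from mpi4py import MPI

if "worker" not in sys.argv:
    # Parent: optionally forward a hostfile through the spawn info object.
    info = MPI.Info.Create()
    hostfile = os.environ.get("MPI_HOSTFILE")
    if hostfile:
        info.Set("hostfile", hostfile)
    script = os.path.abspath(__file__)
    comm = MPI.COMM_SELF.Spawn(sys.executable, args=[script, "worker"],
                               maxprocs=4, info=info)
    print("worker ranks:", comm.gather(None, root=MPI.ROOT))
    comm.Disconnect()
else:
    # Worker: report the local rank to the parent and disconnect.
    comm = MPI.Comm.Get_parent()
    comm.gather(comm.Get_rank(), root=0)
    comm.Disconnect()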

With openmpi I can run the program as:

python prog.py

If I do the same with mpich, I get the above error, i.e. unrecognized argument pmi_args. I need to run it as:

mpirun -np 1 python prog.py

I think this is really due to singleton init not being implemented in mpich.

I propose to change the testing command to

mpirun -np 1 --host localhost:16 python -m pytest

and fix the hanging problem and other possible problems afterwards.

WDYT?

@hndgzkn hndgzkn force-pushed the test_mpi_versions branch from fac864b to cf55093 Compare April 2, 2021 16:04