Raptor termination failure #3119
This is intentional, really: from the perspective of RP and the pilot, Raptor masters and workers are just tasks, and the pilot should indeed not terminate if those tasks die. It would be up to the application to watch the state of the submitted Raptor entities and take action (like pilot termination) if a worker or master fails.
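A minimal sketch of that application-side watching, assuming a TaskManager state callback with a (task, state) signature and handles named tmgr and pilot; the names and the uid check are illustrative, not a prescribed pattern:

import radical.pilot as rp

def raptor_state_cb(task, state):
    # Raptor masters and workers are ordinary tasks from the pilot's point of
    # view, so the application decides what one of them failing should mean.
    if task.uid.startswith('raptor.') and state == rp.FAILED:
        print('raptor entity %s failed - cancelling pilot' % task.uid)
        pilot.cancel()

tmgr.register_callback(raptor_state_cb)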
@andre-merzky, it seems like something is wrong, or I am missing something. This is a small test I did to check the state changes of the MPI worker; on purpose, there is no mpi4py installed in the worker environment:

In [17]: worker = raptor.submit_workers(rp.TaskDescription(
    ...:     {'mode': rp.RAPTOR_WORKER,
    ...:      'raptor_class': 'MPIWorker'}))[0]
task raptor.0000.0001 : TMGR_SCHEDULING_PENDING
task raptor.0000.0001 : TMGR_SCHEDULING
task raptor.0000.0001 : TMGR_STAGING_INPUT_PENDING
task raptor.0000.0001 : TMGR_STAGING_INPUT
task raptor.0000.0001 : AGENT_STAGING_INPUT_PENDING
task raptor.0000.0001 : AGENT_STAGING_INPUT
task raptor.0000.0001 : AGENT_SCHEDULING_PENDING
task raptor.0000.0001 : AGENT_SCHEDULING
task raptor.0000.0001 : AGENT_EXECUTING_PENDING
task raptor.0000.0001 : AGENT_EXECUTING
task raptor.0000.0001 : AGENT_STAGING_OUTPUT_PENDING
task raptor.0000.0001 : AGENT_STAGING_OUTPUT
task raptor.0000.0001 : TMGR_STAGING_OUTPUT_PENDING
task raptor.0000.0001 : TMGR_STAGING_OUTPUT
task raptor.0000.0001 : DONE
In [18]: worker.state
Out[18]: 'DONE'
In [19]: worker.exception
In [20]: worker.exit_code
Out[20]: 0
In [21]: worker.exception_detail

In the case above, cat raptor.0000.0001.err shows:
Traceback (most recent call last):
File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 52, in <module>
run(sys.argv[1], sys.argv[2], sys.argv[3])
File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 30, in run
worker = cls(raptor_id)
File "/home/aymen/ve/test_rpex_final/lib/python3.8/site-packages/radical/pilot/raptor/worker_mpi.py", line 592, in __init__
from mpi4py import MPI # noqa
ModuleNotFoundError: No module named 'mpi4py'

So the worker clearly failed. Why, instead, am I getting a DONE state with exit code 0 and no exception?
@andre-merzky While this is a valid approach to terminating on a worker's failure, unfortunately it is not sufficient for RPEX, at least from my understanding. I have my main …
The point is that the master can react to the worker's demise by terminating itself (…)
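A rough sketch of that master-side reaction; the worker_state_cb hook name below is an assumption about how a Master subclass could be notified of worker state changes, not a confirmed part of the Raptor API:

import radical.pilot as rp

class MyMaster(rp.raptor.Master):

    # hypothetical hook: assume the master is called back when one of its
    # workers changes state (the actual Raptor mechanism and name may differ)
    def worker_state_cb(self, worker, state):
        if state == rp.FAILED:
            # a dead worker makes this master useless: stop it, so the master
            # task itself reaches a final state the application can observe
            self.stop()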
This is addressed in #3123
That (worker.exception_detail) is only available in the FAILED state (so it should work with the above patch).
Hotfix release 1.46.2 was pushed to PyPI, which resolves the invalid state transition: the worker now ends up in FAILED.
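With that in place, checking for the failure on the application side reduces to something like the following; the attributes are the ones shown in the session above:

if worker.state == rp.FAILED:
    print(worker.exit_code)           # exit code of the worker task
    print(worker.exception)           # short summary of the raised exception
    print(worker.exception_detail)    # full traceback, e.g. the mpi4py ModuleNotFoundError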
This makes sense now. Thanks, Andre.
I can confirm this is working now, and the state is reported correctly. Thanks @andre-merzky, @mtitov:

In [14]: tmgr.submit_raptors(rp.TaskDescription({'mode': rp.RAPTOR_MASTER}))
Out[14]: [<Raptor object, uid raptor.0000>]
task raptor.0000 : TMGR_SCHEDULING_PENDING
task raptor.0000 : TMGR_SCHEDULING
task raptor.0000 : TMGR_STAGING_INPUT_PENDING
task raptor.0000 : TMGR_STAGING_INPUT
task raptor.0000 : AGENT_STAGING_INPUT_PENDING
task raptor.0000 : AGENT_STAGING_INPUT
task raptor.0000 : AGENT_SCHEDULING_PENDING
task raptor.0000 : AGENT_SCHEDULING
task raptor.0000 : AGENT_EXECUTING_PENDING
task raptor.0000 : AGENT_EXECUTING
In [15]: tmgr.submit_workers(rp.TaskDescription({'mode': rp.RAPTOR_WORKER, 'raptor_class': 'MPIWorker', 'raptor_id': 'raptor.0000'}))
Out[15]: [<RaptorWorker object, uid raptor.0000.0000>]
task raptor.0000.0000 : TMGR_SCHEDULING_PENDING
task raptor.0000.0000 : TMGR_SCHEDULING
task raptor.0000.0000 : TMGR_STAGING_INPUT_PENDING
task raptor.0000.0000 : TMGR_STAGING_INPUT
task raptor.0000.0000 : AGENT_STAGING_INPUT_PENDING
task raptor.0000.0000 : AGENT_STAGING_INPUT
task raptor.0000.0000 : AGENT_SCHEDULING_PENDING
task raptor.0000.0000 : AGENT_SCHEDULING
task raptor.0000.0000 : AGENT_EXECUTING_PENDING
task raptor.0000.0000 : AGENT_EXECUTING
task raptor.0000.0000 : AGENT_STAGING_OUTPUT_PENDING
task raptor.0000.0000 : AGENT_STAGING_OUTPUT
task raptor.0000.0000 : TMGR_STAGING_OUTPUT_PENDING
task raptor.0000.0000 : TMGR_STAGING_OUTPUT
task raptor.0000.0000 : FAILED
This is related to #3116 and Parsl/parsl#3013.
The main issue is that if a module is missing, or any error occurs during the initialization of the master or the worker, Raptor does not send a termination signal to the pilot, and as a consequence everything hangs.
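Until that is fixed, one way to avoid the indefinite hang on the application side is to bound the wait on the worker; a sketch, where the 600 s timeout and the tmgr/worker/session handles are illustrative:

# wait at most 10 minutes for the worker to start executing or to finish
tmgr.wait_tasks(uids=[worker.uid],
                state=[rp.AGENT_EXECUTING] + rp.FINAL,
                timeout=600)

if worker.state != rp.AGENT_EXECUTING:
    # the worker never came up (or already failed): tear the session down
    # instead of letting the run hang indefinitely
    session.close(download=True)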