
Raptor termination failure #3119

Closed
AymenFJA opened this issue Jan 18, 2024 · 9 comments

Comments

@AymenFJA (Contributor)

This is related to #3116 and Parsl/parsl#3013.

The main issue is that if a module is missing or an error occurs during the initialization of the master or worker, RAPTOR does not send a termination signal to the pilot, and as a consequence everything hangs.

@andre-merzky (Member)

The main issue is that if a module is missing or an error occurs during the initialization of the master or worker, RAPTOR does not send a termination signal to the pilot, and as a consequence everything hangs.

This is intentional, really: from the perspective of RP and the pilot, Raptor masters and workers are just tasks, and the pilot should indeed not terminate if those tasks die. It would be up to the application to watch the state of submitted Raptor entities and take action (like pilot termination) if a FAILED state is detected.
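
For illustration, a minimal client-side sketch of that pattern (not the authoritative API - it assumes rp.TaskManager.register_callback() with a (task, state) callback signature and a local pilot description; the 'raptor.*' uid check matches the task uids shown later in this thread):

import radical.pilot as rp

session = rp.Session()
pmgr    = rp.PilotManager(session=session)
tmgr    = rp.TaskManager(session=session)

pilot = pmgr.submit_pilots(rp.PilotDescription(
            {'resource': 'local.localhost', 'cores': 4, 'runtime': 30}))
tmgr.add_pilots(pilot)

def state_cb(task, state):
    # raptor masters and workers are plain tasks with 'raptor.*' uids;
    # cancel the pilot as soon as one of them ends up FAILED
    if state == rp.FAILED and task.uid.startswith('raptor.'):
        pilot.cancel()

tmgr.register_callback(state_cb)

raptor = tmgr.submit_raptors(rp.TaskDescription({'mode': rp.RAPTOR_MASTER}))[0]
raptor.submit_workers(rp.TaskDescription({'mode': rp.RAPTOR_WORKER,
                                          'raptor_class': 'MPIWorker'}))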

@andre-merzky (Member)

@AymenFJA : please have a look at #3121. It changes the raptor example code to demonstrate how worker failures can be caught by the master, which then terminates itself, and how the client side reacts to the master's termination. Is that approach usable for RPEX?

@AymenFJA (Contributor, Author) commented Feb 1, 2024

@andre-merzky, it seems like something is wrong, or I am missing something.

Here is a small test of the state change of the MPIWorker; on purpose, there is no mpi4py in the RAPTOR env:

In [17]: worker = raptor.submit_workers(rp.TaskDescription(
    ...:             {'mode': rp.RAPTOR_WORKER,
    ...:              'raptor_class': 'MPIWorker'}))[0]

  task raptor.0000.0001              : TMGR_SCHEDULING_PENDING
  task raptor.0000.0001              : TMGR_SCHEDULING
  task raptor.0000.0001              : TMGR_STAGING_INPUT_PENDING
  task raptor.0000.0001              : TMGR_STAGING_INPUT
  task raptor.0000.0001              : AGENT_STAGING_INPUT_PENDING
  task raptor.0000.0001              : AGENT_STAGING_INPUT
  task raptor.0000.0001              : AGENT_SCHEDULING_PENDING
  task raptor.0000.0001              : AGENT_SCHEDULING
  task raptor.0000.0001              : AGENT_EXECUTING_PENDING
  task raptor.0000.0001              : AGENT_EXECUTING
  task raptor.0000.0001              : AGENT_STAGING_OUTPUT_PENDING
  task raptor.0000.0001              : AGENT_STAGING_OUTPUT
  task raptor.0000.0001              : TMGR_STAGING_OUTPUT_PENDING
  task raptor.0000.0001              : TMGR_STAGING_OUTPUT
  task raptor.0000.0001              : DONE
In [18]: worker.state
Out[18]: 'DONE'

In [19]: worker.exception

In [20]: worker.exit_code
Out[20]: 0

In [21]: worker.exception_detail

In the case above the worker should (and must) fail because there is no mpi4py, and the worker's .err file did show the exception:

cat raptor.0000.0001.err
Traceback (most recent call last):
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 52, in <module>
    run(sys.argv[1], sys.argv[2], sys.argv[3])
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 30, in run
    worker = cls(raptor_id)
  File "/home/aymen/ve/test_rpex_final/lib/python3.8/site-packages/radical/pilot/raptor/worker_mpi.py", line 592, in __init__
    from mpi4py import MPI                                            # noqa
ModuleNotFoundError: No module named 'mpi4py'

Instead, I am getting a DONE state. Any ideas? I am happy to open a corresponding ticket regarding why the RAPTOR worker (which is a task) has:

  1. Wrong state
  2. No Exception or Exception details.

@AymenFJA (Contributor, Author) commented Feb 1, 2024

@AymenFJA : please have a look at #3121. It changes the raptor example code to demonstrate how worker failures can be caught by the master, which then terminates itself, and how the client side reacts to the master's termination. Is that approach usable for RPEX?

@andre-merzky, while this is a valid approach to terminating on a worker's failure, it is unfortunately not sufficient for RPEX, at least from my understanding. My main state_cb, which checks for failures and so on, lives in the main Parsl executor and should trigger the shutdown from Parsl. Doing that at the master level, which is in a separate file and namespace, gives me no way to tell Parsl that we failed.

@andre-merzky (Member)

The point is that the master can react to the worker's demise by terminating itself (self.stop()), which then triggers the respective callback on the client side, i.e., your state_cb in the Parsl executor.
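
A rough master-side sketch of that idea follows; self.stop() and the overall flow are taken from the comments above, while the hook name and the shape of its argument are assumptions - the example changed in #3121 is the authoritative reference:

import radical.pilot as rp

class MyMaster(rp.raptor.Master):

    # NOTE: the hook name and its argument shape are assumed for
    #       illustration; see the example updated in #3121 for the actual API.
    def state_cb(self, tasks):
        for task in tasks:
            if task['state'] == rp.FAILED:
                # a worker died: stop the master, which in turn fires the
                # task state callback (state_cb) on the client side
                self.stop()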

@andre-merzky (Member)

  • Wrong state

This is addressed in #3123

  • No Exception or Exception details.

That is only available in the FAILED state (so it should work with the above patch).
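
With the patch applied, a check along these lines should then show the exception (a sketch that reuses the worker handle and tmgr from the snippets above, and assumes tmgr.wait_tasks() accepts explicit uids):

# wait for the worker task to reach a final state, then inspect it
tmgr.wait_tasks(uids=[worker.uid])

if worker.state == rp.FAILED:
    print(worker.exit_code)          # expected to be non-zero now
    print(worker.exception)          # e.g. the ModuleNotFoundError from above
    print(worker.exception_detail)   # full traceback, if available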

@andre-merzky (Member)

Hotfix release 1.46.2 was pushed to PyPI; it resolves the invalid state transition - the worker now ends up in FAILED on missing module dependencies.

@AymenFJA (Contributor, Author) commented Feb 2, 2024

The point is that the master can react to the worker's demise by terminating itself (self.stop()), which then triggers the respective callback on the client side, i.e., your state_cb in the Parsl executor.

This makes sense now. Thanks, Andre.

@AymenFJA (Contributor, Author) commented Feb 2, 2024

I can confirm this is working now and the state is reported correctly. Thanks @andre-merzky, @mtitov:

In [14]: tmgr.submit_raptors(rp.TaskDescription({'mode': rp.RAPTOR_MASTER}))
Out[14]: [<Raptor object, uid raptor.0000>]

  task raptor.0000                   : TMGR_SCHEDULING_PENDING
  task raptor.0000                   : TMGR_SCHEDULING
  task raptor.0000                   : TMGR_STAGING_INPUT_PENDING
  task raptor.0000                   : TMGR_STAGING_INPUT
  task raptor.0000                   : AGENT_STAGING_INPUT_PENDING
  task raptor.0000                   : AGENT_STAGING_INPUT
  task raptor.0000                   : AGENT_SCHEDULING_PENDING
  task raptor.0000                   : AGENT_SCHEDULING
  task raptor.0000                   : AGENT_EXECUTING_PENDING
  task raptor.0000                   : AGENT_EXECUTING
In [15]: tmgr.submit_workers(rp.TaskDescription({'mode': rp.RAPTOR_WORKER, 'raptor_class': 'MPIWorker', 'raptor_id': 'raptor.0000'}))
Out[15]: [<RaptorWorker object, uid raptor.0000.0000>]
  task raptor.0000.0000              : TMGR_SCHEDULING_PENDING
  task raptor.0000.0000              : TMGR_SCHEDULING
  task raptor.0000.0000              : TMGR_STAGING_INPUT_PENDING
  task raptor.0000.0000              : TMGR_STAGING_INPUT
  task raptor.0000.0000              : AGENT_STAGING_INPUT_PENDING
  task raptor.0000.0000              : AGENT_STAGING_INPUT
  task raptor.0000.0000              : AGENT_SCHEDULING_PENDING
  task raptor.0000.0000              : AGENT_SCHEDULING
  task raptor.0000.0000              : AGENT_EXECUTING_PENDING
  task raptor.0000.0000              : AGENT_EXECUTING
  task raptor.0000.0000              : AGENT_STAGING_OUTPUT_PENDING
  task raptor.0000.0000              : AGENT_STAGING_OUTPUT
  task raptor.0000.0000              : TMGR_STAGING_OUTPUT_PENDING
  task raptor.0000.0000              : TMGR_STAGING_OUTPUT
  task raptor.0000.0000              : FAILED
