
Raptor termination failure #3119

Closed
AymenFJA opened this issue Jan 18, 2024 · 9 comments

Comments

@AymenFJA (Contributor)

This is related to #3116 and Parsl/parsl#3013.

The main issue is that if a module is missing or an error occurs during the initialization of the master or worker, RAPTOR does not send a termination signal to the pilot, and as a consequence everything hangs.

@andre-merzky (Member)

The main issue is that if a module is missing or an error occurs during the initialization of the master or worker, RAPTOR does not send a termination signal to the pilot, and as a consequence everything hangs.

This is intentional, really: from the perspective of RP and the pilot, Raptor masters and workers are just tasks, and the pilot should indeed not terminate if those tasks die. It would be up to the application to watch the state of submitted Raptor entities and take action (like pilot termination) if a FAILED state is detected.
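
For illustration, a minimal client-side sketch of that pattern (not the authoritative API - it assumes rp.TaskManager.register_callback() with a (task, state) callback signature and a local pilot description; the 'raptor.*' uid check matches the task uids shown later in this thread):

import radical.pilot as rp

session = rp.Session()
pmgr    = rp.PilotManager(session=session)
tmgr    = rp.TaskManager(session=session)

pilot = pmgr.submit_pilots(rp.PilotDescription(
            {'resource': 'local.localhost', 'cores': 4, 'runtime': 30}))
tmgr.add_pilots(pilot)

def state_cb(task, state):
    # raptor masters and workers are plain tasks with 'raptor.*' uids;
    # cancel the pilot as soon as one of them ends up FAILED
    if state == rp.FAILED and task.uid.startswith('raptor.'):
        pilot.cancel()

tmgr.register_callback(state_cb)

raptor = tmgr.submit_raptors(rp.TaskDescription({'mode': rp.RAPTOR_MASTER}))[0]
raptor.submit_workers(rp.TaskDescription({'mode': rp.RAPTOR_WORKER,
                                          'raptor_class': 'MPIWorker'}))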

@andre-merzky (Member)

@AymenFJA : please have a look at #3121. It changes the raptor example code to demonstrate how worker failures can be caught by the master, which then terminates itself, and how the client side reacts to the master's termination. Is that approach usable for RPEX?

@AymenFJA (Contributor, Author) commented Feb 1, 2024

@andre-merzky, it seems like something is wrong, or I am missing something.

Here is a small test of the state change of the MPIWorker; on purpose, there is no mpi4py in the RAPTOR env:

In [17]: worker = raptor.submit_workers(rp.TaskDescription(
    ...:             {'mode': rp.RAPTOR_WORKER,
    ...:              'raptor_class': 'MPIWorker'}))[0]

  task raptor.0000.0001              : TMGR_SCHEDULING_PENDING
  task raptor.0000.0001              : TMGR_SCHEDULING
  task raptor.0000.0001              : TMGR_STAGING_INPUT_PENDING
  task raptor.0000.0001              : TMGR_STAGING_INPUT
  task raptor.0000.0001              : AGENT_STAGING_INPUT_PENDING
  task raptor.0000.0001              : AGENT_STAGING_INPUT
  task raptor.0000.0001              : AGENT_SCHEDULING_PENDING
  task raptor.0000.0001              : AGENT_SCHEDULING
  task raptor.0000.0001              : AGENT_EXECUTING_PENDING
  task raptor.0000.0001              : AGENT_EXECUTING
  task raptor.0000.0001              : AGENT_STAGING_OUTPUT_PENDING
  task raptor.0000.0001              : AGENT_STAGING_OUTPUT
  task raptor.0000.0001              : TMGR_STAGING_OUTPUT_PENDING
  task raptor.0000.0001              : TMGR_STAGING_OUTPUT
  task raptor.0000.0001              : DONE
In [18]: worker.state
Out[18]: 'DONE'

In [19]: worker.exception

In [20]: worker.exit_code
Out[20]: 0

In [21]: worker.exception_detail

In the case above the worker should (and must) fail because there is no mpi4py, and the worker's .err file did show the exception:

cat raptor.0000.0001.err
Traceback (most recent call last):
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 52, in <module>
    run(sys.argv[1], sys.argv[2], sys.argv[3])
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 30, in run
    worker = cls(raptor_id)
  File "/home/aymen/ve/test_rpex_final/lib/python3.8/site-packages/radical/pilot/raptor/worker_mpi.py", line 592, in __init__
    from mpi4py import MPI                                            # noqa
ModuleNotFoundError: No module named 'mpi4py'

Instead, I am getting a DONE state. Any ideas? I am happy to open a corresponding ticket regarding why the RAPTOR worker (which is a task) has:

  1. Wrong state
  2. No Exception or Exception details.

@AymenFJA (Contributor, Author) commented Feb 1, 2024

@AymenFJA : please have a look at #3121. It changes the raptor example code to demonstrate how worker failures can be caught by the master, which then terminates itself, and how the client side reacts to the master's termination. Is that approach usable for RPEX?

@andre-merzky, while this is a valid approach to terminating on a worker's failure, it is unfortunately not sufficient for RPEX, at least from my understanding. My main state_cb, which checks for failures and so on, lives in the main Parsl executor and should trigger the shutdown from Parsl. Doing that at the master level, which is in a separate file and namespace, gives me no way to tell Parsl that we failed.

@andre-merzky (Member)

The point is that the master can react to the worker's demise by terminating itself (self.stop()), which then triggers the respective callback on the client side, i.e., your state_cb in the Parsl executor.
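
A rough master-side sketch of that idea follows; self.stop() and the overall flow are taken from the comments above, while the hook name and the shape of its argument are assumptions - the example changed in #3121 is the authoritative reference:

import radical.pilot as rp

class MyMaster(rp.raptor.Master):

    # NOTE: the hook name and its argument shape are assumed for
    #       illustration; see the example updated in #3121 for the actual API.
    def state_cb(self, tasks):
        for task in tasks:
            if task['state'] == rp.FAILED:
                # a worker died: stop the master, which in turn fires the
                # task state callback (state_cb) on the client side
                self.stop()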

@andre-merzky (Member)

  • Wrong state

This is addressed in #3123

  • No Exception or Exception details.

That is only available in the FAILED state (so it should work with the above patch).
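
With the patch applied, a check along these lines should then show the exception (a sketch that reuses the worker handle and tmgr from the snippets above, and assumes tmgr.wait_tasks() accepts explicit uids):

# wait for the worker task to reach a final state, then inspect it
tmgr.wait_tasks(uids=[worker.uid])

if worker.state == rp.FAILED:
    print(worker.exit_code)          # expected to be non-zero now
    print(worker.exception)          # e.g. the ModuleNotFoundError from above
    print(worker.exception_detail)   # full traceback, if available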

@andre-merzky (Member)

Hotfix release 1.46.2 was pushed to PyPI; it resolves the invalid state transition - the worker now ends up in FAILED on missing module dependencies.

@AymenFJA (Contributor, Author) commented Feb 2, 2024

The point is that the master can react to the worker's demise by terminating itself (self.stop()), which then triggers the respective callback on the client side, i.e., your state_cb in the Parsl executor.

This makes sense now. Thanks, Andre.

@AymenFJA (Contributor, Author) commented Feb 2, 2024

I can confirm this is working now and the state is reported correctly. Thanks @andre-merzky, @mtitov:

In [14]: tmgr.submit_raptors(rp.TaskDescription({'mode': rp.RAPTOR_MASTER}))
Out[14]: [<Raptor object, uid raptor.0000>]

  task raptor.0000                   : TMGR_SCHEDULING_PENDING
  task raptor.0000                   : TMGR_SCHEDULING
  task raptor.0000                   : TMGR_STAGING_INPUT_PENDING
  task raptor.0000                   : TMGR_STAGING_INPUT
  task raptor.0000                   : AGENT_STAGING_INPUT_PENDING
  task raptor.0000                   : AGENT_STAGING_INPUT
  task raptor.0000                   : AGENT_SCHEDULING_PENDING
  task raptor.0000                   : AGENT_SCHEDULING
  task raptor.0000                   : AGENT_EXECUTING_PENDING
  task raptor.0000                   : AGENT_EXECUTING
In [15]: tmgr.submit_workers(rp.TaskDescription({'mode': rp.RAPTOR_WORKER, 'raptor_class': 'MPIWorker', 'raptor_id': 'raptor.0000'}))
Out[15]: [<RaptorWorker object, uid raptor.0000.0000>]
  task raptor.0000.0000              : TMGR_SCHEDULING_PENDING
  task raptor.0000.0000              : TMGR_SCHEDULING
  task raptor.0000.0000              : TMGR_STAGING_INPUT_PENDING
  task raptor.0000.0000              : TMGR_STAGING_INPUT
  task raptor.0000.0000              : AGENT_STAGING_INPUT_PENDING
  task raptor.0000.0000              : AGENT_STAGING_INPUT
  task raptor.0000.0000              : AGENT_SCHEDULING_PENDING
  task raptor.0000.0000              : AGENT_SCHEDULING
  task raptor.0000.0000              : AGENT_EXECUTING_PENDING
  task raptor.0000.0000              : AGENT_EXECUTING
  task raptor.0000.0000              : AGENT_STAGING_OUTPUT_PENDING
  task raptor.0000.0000              : AGENT_STAGING_OUTPUT
  task raptor.0000.0000              : TMGR_STAGING_OUTPUT_PENDING
  task raptor.0000.0000              : TMGR_STAGING_OUTPUT
  task raptor.0000.0000              : FAILED
