-
Mode 2 failure resilience requires EnTK to recognize that the runtime system has failed or been canceled and to restart it. I want to clarify that a timeout of the resource, due to insufficient walltime, is not considered a runtime system failure. In the case of a pilot, EnTK will use the pilot's final state to understand whether the pilot failed, was canceled, or timed out. Based on the final state, it will submit another pilot and execute the rest of the tasks. Q: What state does the pilot transition to when it times out?
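A minimal sketch, not EnTK's actual code, of how the pilot's final state could drive the decision to resubmit; it reuses `pmgr`, `umgr` and `pd_init` as defined in the test script further down, and the resubmission policy itself is an assumption:

```python
import radical.pilot as rp

def resubmit_if_needed(pilot, pmgr, umgr, pd_init):
    # block until the pilot reaches a final state (DONE, FAILED or CANCELED)
    pilot.wait(rp.FINAL)

    if pilot.state in [rp.FAILED, rp.CANCELED]:
        # NOTE: a walltime expiration also ends in a final state, so we still
        #       need a way to tell a timeout apart from a genuine failure here
        new_pilot = pmgr.submit_pilots(rp.ComputePilotDescription(pd_init))
        umgr.add_pilots(new_pilot)
        return new_pilot

    return None
```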
-
Mode 1 includes failures from:
Let us consider 1 as fatal for the whole application. Failure 2 (the task manager fails) would require bringing the task manager back up. The task manager is responsible for creating a unit manager and submitting tasks to RP. When we create a new unit manager, we lose any information about units that are currently executing. As a result, we may have to resubmit all tasks that are not in a final state. I executed two tests with RP that resemble the effect of a failing task manager in terms of communication with the runtime system. The test creates a unit manager in a forked process, submits a set of units for execution, and waits for the units to finish. However, before the units finish, the process gets killed and a new unit manager is created. The first test checked whether the new unit manager can discover the units that already exist. This was not the case: the new unit manager could not see any of the units that are in the database. The second test submitted additional units for execution and waited for the units to finish. As a consequence, the second unit manager waits both for the units that are already in the agent and for the newly submitted units. If the units that are already in the agent are not going to run for a significant amount of time, it is acceptable to wait for them to finish. However, if the remaining units in the agent are going to execute for a significant amount of time, waiting may cause issues, since we have to rerun everything. The test is:

```python
import radical.pilot as rp
import multiprocessing as mp
import time
import os


def create_umgr(session, pilot, cus):
    # the "task manager": creates a unit manager, submits units and waits
    umgr = rp.UnitManager(session=session)
    print(umgr.uid)
    umgr.add_pilots(pilot)
    umgr.submit_units(cus)
    umgr.wait_units()


if __name__ == "__main__":

    session = rp.Session()
    pmgr    = rp.PilotManager(session=session)

    pd_init = {'resource'      : 'local.localhost_anaconda',
               'runtime'       : 30,  # pilot runtime (min)
               'exit_on_error' : True,
               'cores'         : 1
              }
    pdesc = rp.ComputePilotDescription(pd_init)

    # Launch the pilot.
    pilot = pmgr.submit_pilots(pdesc)
    pilot.wait(['PMGR_ACTIVE'])

    cuds = list()
    for i in range(0, 12):
        # create a new CU description, and fill it.
        # Here we don't use dict initialization.
        cud = rp.ComputeUnitDescription()
        cud.executable    = 'stress'
        cud.cpu_processes = 1
        cud.arguments     = ['--cpu', '1', '--timeout', '70s']
        cuds.append(cud)

    # run the "task manager" in a separate process and kill it before the
    # units finish
    tmgr_process = mp.Process(target=create_umgr, name='task-manager',
                              args=(session, pilot, cuds))
    tmgr_process.start()
    time.sleep(60)
    os.kill(tmgr_process.pid, 9)

    # create a second unit manager after the first one died
    umgr = rp.UnitManager(session=session)
    print(umgr.uid)
    umgr.add_pilots(pilot)
    umgr.list_units()

    cuds = list()
    for i in range(0, 5):
        # create a new CU description, and fill it.
        # Here we don't use dict initialization.
        cud = rp.ComputeUnitDescription()
        cud.executable    = 'stress'
        cud.cpu_processes = 1
        cud.arguments     = ['--cpu', '1', '--timeout', '70s']
        cud.post_exec     = ['echo "I am " $RP_UNIT_ID']
        cuds.append(cud)

    umgr.submit_units(cuds)
    umgr.wait_units()

    session.close()
```

Q: Do we bite the bullet and report to the user that there was a failure, or do we find a way to get the units that are already in the agent, cancel or wait for them, and continue?

Failure 3, a resource manager failure, means that we also lost the session and the pilot manager. In this case, we need to create a new resource manager and ask the task manager to register its unit manager again.
-
The second mode of failure is when the runtime system fails. RP has three final states (DONE, FAILED, and CANCELED), and the agent ends up in one of them when it terminates. There are two levels of testing here. First, test the methods that communicate between the resource manager and the task manager. Second, an integration test that introduces the communication layer and executes the actual transaction between the two EnTK components.
-
While working on recovering from component failure, I came across a case where I had a set of EnTK task ids and, as a result, a set of RP task ids. When tasks were resubmitted to RP, RP issued an error for duplicate keys. The solution I thought of was for the task manager to know the RP task ids of its tasks and hold everything in a dictionary. This works when the workflow processor or the RTS fails, but not when the task manager fails: when the task manager restarts, that dictionary is empty. I am thinking of moving it one level up to the application manager. The task manager and the application manager are different processes, and I am not sure whether the move creates race conditions or not. I can add a callback, but that may overcomplicate the implementation. Finally, there can be another thread in the appmanager that is responsible for maintaining that dictionary and receiving/sending RMQ messages with the necessary information. Any opinions?
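To make the last option concrete, here is a rough sketch of such a thread, assuming the RabbitMQ instance EnTK already uses is reachable through pika; the queue name, the message layout and the helper name are made up for illustration:

```python
import json
import threading
import pika

def id_map_listener(rmq_host, id_map, stop_event):
    # a thread owned by the appmanager that keeps the EnTK-id -> RP-id
    # dictionary up to date from messages published by the task manager
    conn    = pika.BlockingConnection(pika.ConnectionParameters(host=rmq_host))
    channel = conn.channel()
    channel.queue_declare(queue='rp-id-map')

    while not stop_event.is_set():
        method, _, body = channel.basic_get(queue='rp-id-map')
        if body:
            id_map.update(json.loads(body))   # e.g. {'task.0001': 'unit.0007'}
            channel.basic_ack(delivery_tag=method.delivery_tag)
        else:
            stop_event.wait(0.1)              # avoid busy-waiting on an empty queue

    conn.close()

id_map     = dict()
stop_event = threading.Event()
listener   = threading.Thread(target=id_map_listener,
                              args=('localhost', id_map, stop_event))
listener.start()
```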
-
Wenjie asked to add a task timeout value that cancels the task after a specific amount of time. When set to zero, the task does not time out.
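Until such an attribute exists, one way to approximate the requested behaviour is sketched below, assuming coreutils' `timeout` is available on the target resource and that `Task.executable` is a string; the helper name is made up:

```python
from radical.entk import Task

def apply_timeout(task, seconds):
    # wrap the task's executable in coreutils' 'timeout'; a value of zero
    # means the task never times out, matching the requested semantics
    if seconds:
        task.arguments  = [str(seconds), task.executable] + list(task.arguments)
        task.executable = 'timeout'
    return task

t            = Task()
t.executable = 'stress'
t.arguments  = ['--cpu', '1', '--timeout', '70s']
t            = apply_timeout(t, 120)   # cancel the task after 120 seconds
```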
-
During our last devel meeting, we agreed that there are three modes of failure for EnTK:

1. One of EnTK's components fails.
2. The runtime system (RP) fails.
3. One or more tasks fail.
Currently, EnTK offers some level of failure resilience for modes 1 and 3. In case one of EnTK's components fails, EnTK can restart it. When tasks fail, EnTK can submit them again for execution based on a flag. I believe, based on the code here, that EnTK tries to resubmit a failed task until that task succeeds. Mode 2 is not supported at all.
During failure modes 1 and 2, EnTK will resubmit all tasks that are not in a final state for execution.
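A small sketch of that resubmission step, assuming a task object exposes its current state as a string; the state names mirror RP's final states and are an assumption about EnTK's internals:

```python
FINAL_STATES = {'DONE', 'FAILED', 'CANCELED'}

def tasks_to_resubmit(tasks):
    # after a mode 1 or mode 2 failure, every task that did not reach a
    # final state is considered lost and is submitted again
    return [t for t in tasks if t.state not in FINAL_STATES]
```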