-
Mode 2 failure resilience requires EnTK to recognize that the runtime system has failed or been canceled and to restart it. I want to clarify that a timeout of the resource, due to insufficient walltime, is not considered a runtime system failure. In the case of a pilot, EnTK will use the pilot's final state to understand whether the pilot failed, was canceled, or timed out. Based on the final state, it will submit another pilot and execute the rest of the tasks. Q: What state does the pilot transition to when it times out?
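A minimal sketch, not EnTK's actual code, of how the pilot's final state could drive the decision to resubmit; it reuses `pmgr`, `umgr` and `pd_init` as defined in the test script further down, and the resubmission policy itself is an assumption:

```python
import radical.pilot as rp

def resubmit_if_needed(pilot, pmgr, umgr, pd_init):
    # block until the pilot reaches a final state (DONE, FAILED or CANCELED)
    pilot.wait(rp.FINAL)

    if pilot.state in [rp.FAILED, rp.CANCELED]:
        # NOTE: a walltime expiration also ends in a final state, so we still
        #       need a way to tell a timeout apart from a genuine failure here
        new_pilot = pmgr.submit_pilots(rp.ComputePilotDescription(pd_init))
        umgr.add_pilots(new_pilot)
        return new_pilot

    return None
```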
-
Mode 1 includes failures from:
Let us consider 1 as fatal for the whole application. Failure 2 (the task manager fails) would require bringing the task manager back up. The task manager is responsible for creating a unit manager and submitting tasks to RP. When we create a new unit manager, we lose any information about units that are currently executing. As a result, we may have to resubmit all tasks that are not in a final state. I executed two tests with RP that resemble the effect of a failing task manager in terms of communication with the runtime system. The test creates a unit manager in a forked process, submits a set of units for execution, and waits for the units to finish. However, before the units finish, the process gets killed and a new unit manager is created. The first test checked whether the new unit manager can discover the units that already exist. This was not the case: the new unit manager could not see any of the units that are in the database. The second test submitted additional units for execution and waited for the units to finish. As a consequence, the second unit manager waits both for the units that are already in the agent and for the newly submitted units. If the units that are already in the agent are not going to run for a significant amount of time, it is acceptable to wait for them to finish. However, if the remaining units in the agent are going to execute for a significant amount of time, waiting may cause issues, since we have to rerun everything. The test is:

```python
import radical.pilot as rp
import multiprocessing as mp
import time
import os


def create_umgr(session, pilot, cus):
    # the "task manager": creates a unit manager, submits units and waits
    umgr = rp.UnitManager(session=session)
    print(umgr.uid)
    umgr.add_pilots(pilot)
    umgr.submit_units(cus)
    umgr.wait_units()


if __name__ == "__main__":

    session = rp.Session()
    pmgr    = rp.PilotManager(session=session)

    pd_init = {'resource'      : 'local.localhost_anaconda',
               'runtime'       : 30,  # pilot runtime (min)
               'exit_on_error' : True,
               'cores'         : 1
              }
    pdesc = rp.ComputePilotDescription(pd_init)

    # Launch the pilot.
    pilot = pmgr.submit_pilots(pdesc)
    pilot.wait(['PMGR_ACTIVE'])

    cuds = list()
    for i in range(0, 12):
        # create a new CU description, and fill it.
        # Here we don't use dict initialization.
        cud = rp.ComputeUnitDescription()
        cud.executable    = 'stress'
        cud.cpu_processes = 1
        cud.arguments     = ['--cpu', '1', '--timeout', '70s']
        cuds.append(cud)

    # run the "task manager" in a separate process and kill it before the
    # units finish
    tmgr_process = mp.Process(target=create_umgr, name='task-manager',
                              args=(session, pilot, cuds))
    tmgr_process.start()
    time.sleep(60)
    os.kill(tmgr_process.pid, 9)

    # create a second unit manager after the first one died
    umgr = rp.UnitManager(session=session)
    print(umgr.uid)
    umgr.add_pilots(pilot)
    umgr.list_units()

    cuds = list()
    for i in range(0, 5):
        # create a new CU description, and fill it.
        # Here we don't use dict initialization.
        cud = rp.ComputeUnitDescription()
        cud.executable    = 'stress'
        cud.cpu_processes = 1
        cud.arguments     = ['--cpu', '1', '--timeout', '70s']
        cud.post_exec     = ['echo "I am " $RP_UNIT_ID']
        cuds.append(cud)

    umgr.submit_units(cuds)
    umgr.wait_units()

    session.close()
```

Q: Do we bite the bullet and report to the user that there was a failure, or do we find a way to get the units that are already in the agent, cancel or wait for them, and continue?

Failure 3, a resource manager failure, means that we also lost the session and the pilot manager. In this case, we need to create a new resource manager and ask the task manager to register its unit manager again.
-
The second mode of failure is when the runtime system fails. RP has three final states (DONE, FAILED, and CANCELED), and the agent ends up in one of them when it terminates. There are two levels of testing here. First, test the methods that communicate between the resource manager and the task manager. Second, an integration test that introduces the communication layer and executes the actual transaction between the two EnTK components.
-
While working on recovering from component failure, I came across a case where I had a set of EnTK task ids and, as a result, a set of RP task ids. When tasks were resubmitted to RP, RP issued an error for duplicate keys. The solution I thought of was for the task manager to know the RP task ids of its tasks and hold everything in a dictionary. This works when the workflow processor or the RTS fails, but not when the task manager fails: when the task manager restarts, that dictionary is empty. I am thinking of moving it one level up to the application manager. The task manager and the application manager are different processes, and I am not sure whether the move creates race conditions or not. I can add a callback, but that may overcomplicate the implementation. Finally, there can be another thread in the appmanager that is responsible for maintaining that dictionary and receiving/sending RMQ messages with the necessary information. Any opinions?
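To make the last option concrete, here is a rough sketch of such a thread, assuming the RabbitMQ instance EnTK already uses is reachable through pika; the queue name, the message layout and the helper name are made up for illustration:

```python
import json
import threading
import pika

def id_map_listener(rmq_host, id_map, stop_event):
    # a thread owned by the appmanager that keeps the EnTK-id -> RP-id
    # dictionary up to date from messages published by the task manager
    conn    = pika.BlockingConnection(pika.ConnectionParameters(host=rmq_host))
    channel = conn.channel()
    channel.queue_declare(queue='rp-id-map')

    while not stop_event.is_set():
        method, _, body = channel.basic_get(queue='rp-id-map')
        if body:
            id_map.update(json.loads(body))   # e.g. {'task.0001': 'unit.0007'}
            channel.basic_ack(delivery_tag=method.delivery_tag)
        else:
            stop_event.wait(0.1)              # avoid busy-waiting on an empty queue

    conn.close()

id_map     = dict()
stop_event = threading.Event()
listener   = threading.Thread(target=id_map_listener,
                              args=('localhost', id_map, stop_event))
listener.start()
```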
-
Wenjie asked to add a task timeout value that cancels the task after a specific amount of time. When set to zero, the task does not time out.
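Until such an attribute exists, one way to approximate the requested behaviour is sketched below, assuming coreutils' `timeout` is available on the target resource and that `Task.executable` is a string; the helper name is made up:

```python
from radical.entk import Task

def apply_timeout(task, seconds):
    # wrap the task's executable in coreutils' 'timeout'; a value of zero
    # means the task never times out, matching the requested semantics
    if seconds:
        task.arguments  = [str(seconds), task.executable] + list(task.arguments)
        task.executable = 'timeout'
    return task

t            = Task()
t.executable = 'stress'
t.arguments  = ['--cpu', '1', '--timeout', '70s']
t            = apply_timeout(t, 120)   # cancel the task after 120 seconds
```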
-
During our last devel meeting, we agreed that there are three modes of failure for EnTK:

1. One of EnTK's components fails.
2. The runtime system (RP) fails.
3. One or more tasks fail.
Currently, EnTK offers some level of failure resilience for modes 1 and 3. In case one of EnTK's components fails, EnTK can restart it. When tasks fail, EnTK can submit them again for execution based on a flag. I believe, based on the code here, that EnTK tries to resubmit a failed task until that task succeeds. Mode 2 is not supported at all.
During failure modes 1 and 2, EnTK will resubmit all tasks that are not in a final state for execution.
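A small sketch of that resubmission step, assuming a task object exposes its current state as a string; the state names mirror RP's final states and are an assumption about EnTK's internals:

```python
FINAL_STATES = {'DONE', 'FAILED', 'CANCELED'}

def tasks_to_resubmit(tasks):
    # after a mode 1 or mode 2 failure, every task that did not reach a
    # final state is considered lost and is submitted again
    return [t for t in tasks if t.state not in FINAL_STATES]
```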