ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. #6

Closed
544211707 opened this issue Apr 14, 2021 · 7 comments

Comments

@544211707

Hello. With last year's Ray-based LMAPF training version, my own trained model could not achieve good path planning with more than 8 agents. With the latest training version, I get an error about a Ray worker dying, shown below. What could the problem be?

(pid=5131) starting episode 5 on metaAgent 5
(pid=5137) running imitation job
(pid=5131) 2021-04-14 10:22:36.118770: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
(pid=5138) 2021-04-14 10:22:36.543190: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
(pid=5135) terminate called after throwing an instance of 'std::bad_alloc'
(pid=5135)   what():  std::bad_alloc
(pid=5135) *** Aborted at 1618366956 (unix time) try "date -d @1618366956" if you are using GNU date ***
(pid=5135) PC: @                0x0 (unknown)
(pid=5135) *** SIGABRT (@0x3e80000140f) received by PID 5135 (TID 0x7f1c9df65700) from PID 5135; stack trace: ***
(pid=5135)     @     0x7f1c9db7b390 (unknown)
(pid=5135)     @     0x7f1c9d7d5438 gsignal
(pid=5135)     @     0x7f1c9d7d703a abort
(pid=5135)     @     0x7f1c9745f84a __gnu_cxx::__verbose_terminate_handler()
(pid=5135)     @     0x7f1c9745df47 __cxxabiv1::__terminate()
(pid=5135)     @     0x7f1c9745df7d std::terminate()
(pid=5135)     @     0x7f1c9745e15a __cxa_throw
(pid=5135)     @     0x7f1c9745e522 operator new()
(pid=5135)     @     0x7f1c974ad68c std::__cxx11::basic_string<>::_M_construct()
(pid=5135)     @     0x7f1b4f654a09 tensorflow::SerializeToStringDeterministic()
(pid=5135)     @     0x7f1b4f2ed5ab tensorflow::(anonymous namespace)::TensorProtoHash()
(pid=5135)     @     0x7f1b4f2ed6d8 tensorflow::(anonymous namespace)::FastTensorProtoHash()
(pid=5135)     @     0x7f1b4f2e9a33 tensorflow::(anonymous namespace)::AttrValueHash()
(pid=5135)     @     0x7f1b4f2e9e27 tensorflow::FastAttrValueHash()
(pid=5135)     @     0x7f1b110208f7 tensorflow::grappler::UniqueNodes::ComputeSignature()
(pid=5135)     @     0x7f1b11023160 tensorflow::grappler::ArithmeticOptimizer::DedupComputations()
(pid=5135)     @     0x7f1b1103e522 tensorflow::grappler::ArithmeticOptimizer::Optimize()
(pid=5135)     @     0x7f1b1100ea30 tensorflow::grappler::MetaOptimizer::RunOptimizer()
(pid=5135)     @     0x7f1b1100f969 tensorflow::grappler::MetaOptimizer::OptimizeGraph()
(pid=5135)     @     0x7f1b11010e5d tensorflow::grappler::MetaOptimizer::Optimize()
(pid=5135)     @     0x7f1b11013b77 tensorflow::grappler::RunMetaOptimizer()
(pid=5135)     @     0x7f1b11005afc tensorflow::GraphExecutionState::OptimizeGraph()
(pid=5135)     @     0x7f1b1100742a tensorflow::GraphExecutionState::BuildGraph()
(pid=5135)     @     0x7f1b0e312549 tensorflow::DirectSession::CreateGraphs()
(pid=5135)     @     0x7f1b0e313ea5 tensorflow::DirectSession::CreateExecutors()
(pid=5135)     @     0x7f1b0e316120 tensorflow::DirectSession::GetOrCreateExecutors()
(pid=5135)     @     0x7f1b0e31788f tensorflow::DirectSession::Run()
(pid=5135)     @     0x7f1b0bb95251 tensorflow::SessionRef::Run()
(pid=5135)     @     0x7f1b0bd8dd41 TF_Run_Helper()
(pid=5135)     @     0x7f1b0bd8e53e TF_SessionRun
(pid=5135)     @     0x7f1b0bb90dc9 tensorflow::TF_SessionRun_wrapper_helper()
(pid=5135)     @     0x7f1b0bb90e62 tensorflow::TF_SessionRun_wrapper()
(pid=5137) cannot allocate memory for thread-local data: ABORT
E0414 10:22:37.032392  5052  5191 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=15c675b22d037e3bf66d17ba0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=f66d17ba0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2021-04-14 10:22:37,038	WARNING worker.py:1134 -- A worker died or was killed while executing task fffffffffffffffff66d17ba0100.
Traceback (most recent call last):
  File "/XX/PRIMAL2-main-re/driver.py", line 232, in <module>
    main()
  File "/XX/PRIMAL2-main-re/driver.py", line 173, in main
    jobResults, metrics, info = ray.get(done_id)[0]
  File "/XX/anaconda3/envs/p2/lib/python3.6/site-packages/ray/worker.py", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
E0414 10:22:37.053786  5052  5191 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=9d28cb176c7f7501ef0a6c220100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2021-04-14 10:22:37,059	WARNING worker.py:1134 -- A worker died or was killed while executing task ffffffffffffffffef0a6c220100.
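For context, the RayActorError above is what ray.get() raises when the worker process backing an actor dies mid-task (here, after the std::bad_alloc / out-of-memory abort in the log). Below is a minimal, self-contained sketch of that pattern, written against the Ray 1.x API; it only mirrors the Runner.imitationRunner.job names from the function_descriptor in the log and is not the actual PRIMAL2 code.

```python
import os

import ray
from ray.exceptions import RayActorError


@ray.remote
class ImitationRunner:
    """Stand-in for the imitationRunner actor named in the log."""

    def job(self):
        # Simulate the worker process being killed mid-task (the log shows the
        # real worker aborting with std::bad_alloc, i.e. it ran out of memory).
        os._exit(1)


if __name__ == "__main__":
    ray.init()
    runner = ImitationRunner.remote()
    try:
        ray.get(runner.job.remote())
    except RayActorError as err:
        # Same "actor died unexpectedly before finishing this task" error that
        # driver.py hits at ray.get(done_id).
        print("actor died:", err)
```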
@544211707
Author

I found a possible reason, related to the line max_time += time_limit in Env_Builder. When I delete this line to match the latest version, the error above appears. So is this line of code needed? And is the other change, time_limit = time_limit - c_time, also needed? I don't fully understand this.

@fire-keeper

@544211707
Hi, I've run into the same problem as you. Have you dealt with it? Any ideas on how to solve it?

@544211707
Author

@fire-keeper

I found a possible reason, related to the line max_time += time_limit in Env_Builder. When I delete this line to match the latest version, the error above appears. So is this line of code needed? And is the other change, time_limit = time_limit - c_time, also needed? I don't fully understand this.

I revised it like this; it works, but I don't know whether deleting it is correct.

@fire-keeper

@544211707
I think I have found the essence of the problem. It is cpp_mstar, the compiled Python wrapper of od_mstar, that consumes too much memory.
In "max_time += time_limit", max_time has never been defined, so an exception is raised, which means the program never calls cpp_mstar and falls back to od_mstar. So the huge memory consumption is avoided by this odd line, "max_time += time_limit".

@544211707
Author

@fire-keeper OK, I got it.

GilesLuo pinned this issue Sep 25, 2021
@Qiutianyun456

Qiutianyun456 commented Sep 28, 2021

@544211707 I think I have found the essence of the problem. It is cpp_mstar, the compiled Python wrapper of od_mstar, that consumes too much memory. In "max_time += time_limit", max_time has never been defined, so an exception is raised, which means the program never calls cpp_mstar and uses od_mstar instead. So the huge memory consumption is avoided by this odd line "max_time += time_limit".
Hi, why can't I find the line "max_time += time_limit" in Env_Builder.py?

@greipicon

@544211707 I think I have found the essence of the problem. It is cpp_mstar, the compiled Python wrapper of od_mstar, that consumes too much memory. In "max_time += time_limit", max_time has never been defined, so an exception is raised, which means the program never calls cpp_mstar and uses od_mstar instead. So the huge memory consumption is avoided by this odd line "max_time += time_limit". Hi, why can't I find the line "max_time += time_limit" in Env_Builder.py?

Hi, I'm getting the same error and also can't find the line "max_time += time_limit" in Env_Builder.py. Have you solved this problem yet?
