Segmentation fault (core dumped)
2021-02-24 14:39:39.456448: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1614195579.456255335","description":"Error received from peer ipv4:127.0.0.1:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
Traceback (most recent call last):
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnavailableError: From /job:worker/replica:0/task:1:
Socket closed
Additional GRPC error information:
{"created":"@1614195579.456825677","description":"Error received from peer ipv4:10.20.41.65:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}
[[{{node scoped_allocator_1_2_CollectiveReduce}}]]
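To make the log easier to triage, the "Additional GRPC error information" payload can be decoded to see which peer dropped the connection and with what status. The following is only a reading aid, not part of AutoDist or TensorFlow: the helper name `summarize_grpc_error` and the small status table are hypothetical, and the payload is copied verbatim from the traceback. Status 14 is `UNAVAILABLE` in the gRPC status code list, which would be consistent with the process on that peer having exited, possibly because of the segmentation fault reported at the top of the output.

```python
import json

# Small subset of gRPC status codes; 14 is UNAVAILABLE in the gRPC spec.
# This table is an illustrative assumption, not an exhaustive mapping.
GRPC_STATUS_NAMES = {0: "OK", 4: "DEADLINE_EXCEEDED", 14: "UNAVAILABLE"}


def summarize_grpc_error(payload: str) -> dict:
    """Extract the peer address and status from a gRPC error JSON string."""
    info = json.loads(payload)
    # The description looks like "Error received from peer ipv4:HOST:PORT".
    peer = info["description"].rsplit("peer ", 1)[-1]
    status = info["grpc_status"]
    return {
        "peer": peer,
        "status": status,
        "status_name": GRPC_STATUS_NAMES.get(status, "UNKNOWN"),
        "message": info["grpc_message"],
    }


# Payload copied from the "Additional GRPC error information" line in the log.
payload = (
    '{"created":"@1614195579.456825677",'
    '"description":"Error received from peer ipv4:10.20.41.65:15000",'
    '"file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc",'
    '"file_line":1056,"grpc_message":"Socket closed","grpc_status":14}'
)

summary = summarize_grpc_error(payload)
print(summary)
```

Here the peer is the worker at 10.20.41.65:15000, so that node's own logs and core dump are the likely place to look next.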
Please describe the expected behavior
System information and environment
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04
TensorFlow version: 2.2.0
Python version: 3.6.12
GCC/Compiler version (if compiling from source):
CUDA version: 10.1
NCCL version: 10
cuDNN version: 10.1
GPU model and memory: GTX 1080 Ti, 12G
AutoDist version: github master
To Reproduce
Steps to reproduce the behavior:
Run example/linear_regression.py on a multi-node multi-CPU cluster.
Screenshots
If applicable, add screenshots to help explain your problem.
Code snippet to reproduce the problem
Additional information
Add any other context about the problem here or include any logs that would be helpful to diagnose the problem.
Works on the GPUs of the same cluster.
Please describe the bug
example/linear_regression.py with the AllReduce strategy crashes when run on a CPU-only multi-node cluster with a resource spec like:
Output