`AllReduce` strategy crashes on multinode CPU-only cluster #60

odp · 2021-02-24T19:53:11Z

Please describe the bug

example/linear_regression.py with the AllReduce strategy crashes when run on a CPU-only multinode cluster with a resource spec like:

```yaml
nodes:
  - address: X.X.X.X
    cpus: [0]
    chief: true
  - address: X.X.X.X
    cpus: [0]
    ssh_config: conf
ssh:
  conf:
    username: XXX
    key_file: YYY.pem
    shared_envs:
      LD_LIBRARY_PATH: '$LD_LIBRARY_PATH:/usr/local/cuda/lib64'
```
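
To make the "CPU-only" condition concrete, here is a minimal sketch that checks a resource spec of the shape shown above. The helper names `is_cpu_only` and `chief_count` are hypothetical and not part of AutoDist. The sketch assumes the spec has already been parsed into a Python dict and that GPU nodes would carry a `gpus` list next to `cpus`, which is an assumption about the spec format.

```python
# Hypothetical helpers for illustration only; not part of AutoDist.
# Assumption: a node with GPUs would list them under a "gpus" key.

def is_cpu_only(spec):
    """Return True if no node in the spec declares any GPUs."""
    return all(not node.get("gpus") for node in spec.get("nodes", []))


def chief_count(spec):
    """Count the nodes marked as chief."""
    return sum(1 for node in spec.get("nodes", []) if node.get("chief"))


# The resource spec from this report, written out as a dict
# (addresses and credentials are redacted in the original).
spec = {
    "nodes": [
        {"address": "X.X.X.X", "cpus": [0], "chief": True},
        {"address": "X.X.X.X", "cpus": [0], "ssh_config": "conf"},
    ],
    "ssh": {
        "conf": {
            "username": "XXX",
            "key_file": "YYY.pem",
            "shared_envs": {
                "LD_LIBRARY_PATH": "$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
            },
        }
    },
}

print("CPU-only:", is_cpu_only(spec))
print("Chief nodes:", chief_count(spec))
```

With this spec, both nodes expose a single CPU device and no GPUs, which is the configuration that triggers the crash.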

Output

```
Segmentation fault (core dumped)
2021-02-24 14:39:39.456448: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1614195579.456255335","description":"Error received from peer ipv4:127.0.0.1:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
Traceback (most recent call last):
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnavailableError: From /job:worker/replica:0/task:1:
Socket closed
Additional GRPC error information:
{"created":"@1614195579.456825677","description":"Error received from peer ipv4:10.20.41.65:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}
	 [[{{node scoped_allocator_1_2_CollectiveReduce}}]]
```
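
For anyone triaging the log, the "Additional GRPC error information" payload is plain JSON and can be unpacked with the standard library. The sketch below is illustrative only: `summarize_grpc_error` is a hypothetical helper, and the status table is deliberately limited to the one code that appears in this report (gRPC status 14 is `UNAVAILABLE`).

```python
import json

# Copied verbatim from the "Additional GRPC error information" line above.
raw = (
    '{"created":"@1614195579.456825677",'
    '"description":"Error received from peer ipv4:10.20.41.65:15000",'
    '"file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc",'
    '"file_line":1056,"grpc_message":"Socket closed","grpc_status":14}'
)

# Only the status code seen in this report; not a complete table.
GRPC_STATUS_NAMES = {14: "UNAVAILABLE"}


def summarize_grpc_error(payload):
    """Extract the peer, status code, and message from a gRPC error JSON string."""
    info = json.loads(payload)
    # The peer address follows the word "peer " in the description field.
    peer = info["description"].rsplit("peer ", 1)[-1]
    status = info["grpc_status"]
    return {
        "peer": peer,
        "status": status,
        "status_name": GRPC_STATUS_NAMES.get(status, "NOT_IN_THIS_SKETCH"),
        "message": info["grpc_message"],
    }


summary = summarize_grpc_error(raw)
print(summary)
```

Here the failing peer is `ipv4:10.20.41.65:15000` with status `UNAVAILABLE`, which means the caller lost its connection to that worker. That is consistent with, though not proof of, the remote process having died with the segmentation fault shown at the top of the output.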

Please describe the expected behavior

System information and environment

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04
TensorFlow version: 2.2.0
Python version: 3.6.12
GCC/Compiler version (if compiling from source):
CUDA version: 10.1
NCCL version: 10
cuDNN version: 10.1
GPU model and memory: GTX 1080 Ti, 12G
AutoDist version: github master

To Reproduce
Steps to reproduce the behavior:
Run example/linear_regression.py on a multi-node multi-CPU cluster.

Screenshots
If applicable, add screenshots to help explain your problem.

Code snippet to reproduce the problem

Additional information

Works on the GPUs of the same cluster.
