AllReduce strategy crashes on multinode CPU-only cluster #60

Open
odp opened this issue Feb 24, 2021 · 0 comments

Please describe the bug

example/linear_regression.py with the AllReduce strategy crashes when run on a CPU-only multinode cluster with a resource spec like:

nodes:
  - address: X.X.X.X
    cpus: [0]
    chief: true
  - address: X.X.X.X
    cpus: [0]
    ssh_config: conf
ssh:
  conf:
    username: XXX
    key_file: YYY.pem
    shared_envs:
      LD_LIBRARY_PATH: '$LD_LIBRARY_PATH:/usr/local/cuda/lib64'
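To make the failing configuration explicit, here is a minimal sketch that mirrors the spec above as a Python dict and checks that every node is CPU-only. The helper `summarize_spec` is hypothetical and not part of AutoDist. It assumes that GPU devices, when present, would be listed under a per-node `gpus` key.

```python
# Hypothetical helper, not part of AutoDist.
# The dict mirrors the YAML resource spec above (placeholders kept as-is).
resource_spec = {
    "nodes": [
        {"address": "X.X.X.X", "cpus": [0], "chief": True},
        {"address": "X.X.X.X", "cpus": [0], "ssh_config": "conf"},
    ],
    "ssh": {
        "conf": {
            "username": "XXX",
            "key_file": "YYY.pem",
            "shared_envs": {
                "LD_LIBRARY_PATH": "$LD_LIBRARY_PATH:/usr/local/cuda/lib64",
            },
        }
    },
}


def summarize_spec(spec):
    """Return (node_count, cpu_only, chief_count) for a resource spec dict."""
    nodes = spec.get("nodes", [])
    # Assumption: a node is CPU-only when it lists no GPU indices under "gpus".
    cpu_only = all(not node.get("gpus") for node in nodes)
    chief_count = sum(1 for node in nodes if node.get("chief"))
    return len(nodes), cpu_only, chief_count


node_count, cpu_only, chief_count = summarize_spec(resource_spec)
print(f"nodes={node_count} cpu_only={cpu_only} chiefs={chief_count}")
# prints: nodes=2 cpu_only=True chiefs=1
```

The spec reported here is the two-node, CPU-only case with a single chief, which is the configuration that crashes.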

Output

Segmentation fault (core dumped)
2021-02-24 14:39:39.456448: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1614195579.456255335","description":"Error received from peer ipv4:127.0.0.1:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
Traceback (most recent call last):
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnavailableError: From /job:worker/replica:0/task:1:
Socket closed
Additional GRPC error information:
{"created":"@1614195579.456825677","description":"Error received from peer ipv4:10.20.41.65:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}
	 [[{{node scoped_allocator_1_2_CollectiveReduce}}]]

Please describe the expected behavior

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04
  • TensorFlow version: 2.2.0
  • Python version: 3.6.12
  • GCC/Compiler version (if compiling from source):
  • CUDA version: 10.1
  • NCCL version: 10
  • cuDNN version: 10.1
  • GPU model and memory: GTX 1080 Ti, 12G
  • AutoDist version: github master

To Reproduce
Steps to reproduce the behavior:
Run example/linear_regression.py with the AllReduce strategy on a multi-node, CPU-only cluster.


Code snippet to reproduce the problem

Additional information

The same example works on the GPUs of the same cluster.
