Autodist on Ray using RaySGD API #61

Open · wants to merge 17 commits into master
Conversation

@odp commented Mar 12, 2021

This PR adds a RaySGD-style API to Autodist, enabling it to train models on a Ray cluster. The API defines a TFTrainer class that takes a model creator, a data creator, a train step, and a strategy builder, and runs the training job on a distributed Ray cluster. It follows the RaySGD API and is compatible with Ray Tune.

trainer = TFTrainer(strategy_builder, model_creator, data_creator, train_step)
trainer.step()
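
For readers unfamiliar with the creator pattern, here is a minimal sketch of what the three callables might look like, modeled on the RaySGD TFTrainer conventions; the exact signatures expected by this PR's TFTrainer (in particular the arguments passed to train_step) are assumptions and may differ:

import numpy as np
import tensorflow as tf

def model_creator():
    # Build and return the Keras model replicated on each worker.
    return tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])

def data_creator():
    # Return the dataset consumed by each replica.
    x = np.random.rand(1024, 1).astype(np.float32)
    y = 2.0 * x + 1.0
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

def train_step(model, optimizer, batch):
    # One gradient update; the strategy built by strategy_builder decides how
    # it is distributed across replicas.
    x, y = batch
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss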

Internally, it implements a TFRunner class that represents a replica. All communication between the master and the worker replicas happens through Ray's in-memory object store, so there is no dependence on remote file system locations or access rights, and SSH is not needed.
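
As a rough illustration of that design, a per-replica runner can be expressed as a Ray actor whose results come back through the object store; the class name TFRunner comes from the PR, but the methods and fields below are assumptions rather than the PR's actual implementation:

import ray
import tensorflow as tf

@ray.remote
class TFRunner:
    # One training replica. Method names and state are illustrative only.
    def setup(self, model_creator, data_creator, train_step):
        self.model = model_creator()
        self.dataset = data_creator()
        self.optimizer = tf.keras.optimizers.SGD(0.1)
        self.train_step = train_step

    def step(self):
        # Run one pass over the local data and return the mean loss through
        # Ray's in-memory object store (no shared filesystem or SSH needed).
        losses = [float(self.train_step(self.model, self.optimizer, batch))
                  for batch in self.dataset]
        return sum(losses) / len(losses)

A driver would then create one such actor per replica, call setup remotely, and collect results with, for example, ray.get(runner.step.remote()).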

Moreover, the client code executed by each worker is also replicated through Ray, eliminating the need to copy the model code to a remote filesystem on each node. Users can run the example by installing Ray and running $ python linear_regression_ray.py.

Reference: https://docs.ray.io/en/master/raysgd/raysgd_tensorflow.html

Fixes #57

@zhisbug requested review from pengwu22 and ZeyaWang on April 13, 2021, 14:05
@zhisbug (Contributor) left a comment

@@ -77,15 +77,15 @@ def l(predicted_y, desired_y):
 # Only save the model on master node if autodist is used with NFS.
 checkpoint_suffix = 'c10'
 checkpoint_name = checkpoint_dir + checkpoint_suffix
-if IS_AUTODIST_CHIEF:
+if IS_AUTODIST_CHIEF():

Could you add a test case (e.g. case c11) that uses the above linear regression code plus the Ray backend, so the CI can test against it on every new change? You might want to add it to both the single-node multi-GPU tests and the distributed tests.
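
For concreteness, a hypothetical shape such a c11 case could take; the module name, entry point, and threshold below are invented and would need to match the repository's existing single-node multi-GPU and distributed test layout:

import pytest

@pytest.mark.integration
def test_linear_regression_ray_backend():
    # Assumes the example exposes a main() that returns the final training loss.
    from linear_regression_ray import main
    final_loss = main()
    assert final_loss < 1.0  # the toy regression should converge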


def spawn_replica(replica_host, strategy_builder, strategy=None, env=None):
    # Enforce actor placement on the provided host
    runner = ray.remote(resources={f"node:{replica_host}": 0.01},

I believe this requires a custom resource specification when you run ray up to start the Ray cluster?
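
For context, Ray automatically registers a custom resource of the form node:<ip_address> for every node in the cluster, so placement like the above should typically work without declaring extra resources in the ray up cluster config, though this is worth confirming for the Ray version targeted here. A minimal sketch of the placement pattern (the target-node selection below is illustrative):

import ray

ray.init(address="auto")

@ray.remote
def report_ip():
    import socket
    return socket.gethostbyname(socket.gethostname())

# Pin a task (or actor) to a specific host via its node:<ip> resource.
target_ip = ray.nodes()[0]["NodeManagerAddress"]
pinned = report_ip.options(resources={f"node:{target_ip}": 0.01})
print(ray.get(pinned.remote()))  # expected to report target_ip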

Successfully merging this pull request may close these issues: Ray Integration.