Autodist on Ray using RaySGD API #61

Open · wants to merge 17 commits into master
Conversation

@odp commented Mar 12, 2021

This PR adds a RaySGD-style API to Autodist, enabling it to train models on a Ray cluster. The API defines a TFTrainer class that takes a model creator, a data creator, a train step, and a strategy builder, and runs the training job on a distributed Ray cluster. It follows the RaySGD API and is compatible with Ray Tune.

trainer = TFTrainer(strategy_builder, model_creator, data_creator, train_step)
trainer.step()
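
For readers unfamiliar with the creator pattern, here is a minimal sketch of what the three callables might look like, modeled on the RaySGD TFTrainer conventions; the exact signatures expected by this PR's TFTrainer (in particular the arguments passed to train_step) are assumptions and may differ:

import numpy as np
import tensorflow as tf

def model_creator():
    # Build and return the Keras model replicated on each worker.
    return tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])

def data_creator():
    # Return the dataset consumed by each replica.
    x = np.random.rand(1024, 1).astype(np.float32)
    y = 2.0 * x + 1.0
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

def train_step(model, optimizer, batch):
    # One gradient update; the strategy built by strategy_builder decides how
    # it is distributed across replicas.
    x, y = batch
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss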

Internally, it implements a TFRunner class that represents a replica. All communication between the master and the worker replicas happens through Ray's in-memory object store, so there is no dependence on remote file system locations or access rights, and SSH is not needed.
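
As a rough illustration of that design, a per-replica runner can be expressed as a Ray actor whose results come back through the object store; the class name TFRunner comes from the PR, but the methods and fields below are assumptions rather than the PR's actual implementation:

import ray
import tensorflow as tf

@ray.remote
class TFRunner:
    # One training replica. Method names and state are illustrative only.
    def setup(self, model_creator, data_creator, train_step):
        self.model = model_creator()
        self.dataset = data_creator()
        self.optimizer = tf.keras.optimizers.SGD(0.1)
        self.train_step = train_step

    def step(self):
        # Run one pass over the local data and return the mean loss through
        # Ray's in-memory object store (no shared filesystem or SSH needed).
        losses = [float(self.train_step(self.model, self.optimizer, batch))
                  for batch in self.dataset]
        return sum(losses) / len(losses)

A driver would then create one such actor per replica, call setup remotely, and collect results with, for example, ray.get(runner.step.remote()).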

Moreover, the client code executed by each worker is also replicated through Ray, eliminating the need to copy the model code to a remote filesystem on each node. Users can run the example by installing Ray and running $ python linear_regression_ray.py.

Reference: https://docs.ray.io/en/master/raysgd/raysgd_tensorflow.html

Fixes #57

@zhisbug requested review from pengwu22 and ZeyaWang on April 13, 2021, 14:05
@zhisbug (Contributor) left a comment

@@ -77,15 +77,15 @@ def l(predicted_y, desired_y):
 # Only save the model on master node if autodist is used with NFS.
 checkpoint_suffix = 'c10'
 checkpoint_name = checkpoint_dir + checkpoint_suffix
-if IS_AUTODIST_CHIEF:
+if IS_AUTODIST_CHIEF():

Could you add a test case (e.g. case c11) that uses the above linear regression code plus the Ray backend, so the CI can test against it on every new change? You might want to add it to both the single-node multi-GPU tests and the distributed tests.
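
For concreteness, a hypothetical shape such a c11 case could take; the module name, entry point, and threshold below are invented and would need to match the repository's existing single-node multi-GPU and distributed test layout:

import pytest

@pytest.mark.integration
def test_linear_regression_ray_backend():
    # Assumes the example exposes a main() that returns the final training loss.
    from linear_regression_ray import main
    final_loss = main()
    assert final_loss < 1.0  # the toy regression should converge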


def spawn_replica(replica_host, strategy_builder, strategy=None, env=None):
    # Enforce actor placement on the provided host
    runner = ray.remote(resources={f"node:{replica_host}": 0.01},

I believe this requires a custom resource specification when you run ray up to start the Ray cluster?
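
For context, Ray automatically registers a custom resource of the form node:<ip_address> for every node in the cluster, so placement like the above should typically work without declaring extra resources in the ray up cluster config, though this is worth confirming for the Ray version targeted here. A minimal sketch of the placement pattern (the target-node selection below is illustrative):

import ray

ray.init(address="auto")

@ray.remote
def report_ip():
    import socket
    return socket.gethostbyname(socket.gethostname())

# Pin a task (or actor) to a specific host via its node:<ip> resource.
target_ip = ray.nodes()[0]["NodeManagerAddress"]
pinned = report_ip.options(resources={f"node:{target_ip}": 0.01})
print(ray.get(pinned.remote()))  # expected to report target_ip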

Successfully merging this pull request may close these issues: Ray Integration.