# Creating GPU Workers on Google Cloud
Tyler edited this page Jul 26, 2019
Note: this assumes you are a member of the AutoDL project on the Google Cloud Platform. The ready-to-use worker image is not currently public.
- Start from the Google Cloud Console and click **VM Instances** under **Compute Engine** in the side menu
- Select **Create New Instance** from the top menu
- Set up your machine name, region, and specs. We recommend one of the higher-memory options; 4 vCPUs and 26 GB of RAM should be fine
- Under CPU Platform and GPU:
  - Select Intel Broadwell or better
  - Select **Add GPU** and add an NVIDIA Tesla P100
- Select **Change** under Boot disk:
  - Choose **Custom Images**
  - Select `codalab-nvidia-gpu-4` (the size of the disk should adjust automatically)
- No other options need to be selected; submit the form to create the instance
- SSH into the GPU worker by using the SSH button in the Google Cloud Control Panel
- Add your user to Docker's user group:

```
sudo usermod -aG docker $USER
```
- Log back in through SSH so the group change takes effect
- Stop the worker container that was started automatically:

```
nvidia-docker stop compute_worker
```

- Remove the worker container that was created automatically:

```
nvidia-docker rm compute_worker
```
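Once you have logged back in, you can optionally confirm that the group change took effect before starting the new worker. A minimal sketch (the echoed messages are just illustrative):

```shell
# Check whether the current user is a member of the docker group;
# if so, plain docker commands should work without sudo.
if id -nG | tr ' ' '\n' | grep -qx docker; then
    echo "docker group active"
else
    echo "log back in, or re-run the usermod step"
fi
```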
- Run the following command after replacing `<YOUR_BROKER_URL>` with the correct broker URL:

```
mkdir -p /tmp/codalab && nvidia-docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/nvidia-docker/nvidia-docker.sock:/var/lib/nvidia-docker/nvidia-docker.sock \
    -v /tmp/codalab:/tmp/codalab \
    -d \
    --name compute_worker \
    --env BROKER_URL=<YOUR_BROKER_URL> \
    --restart unless-stopped \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    codalab/competitions-v1-nvidia-worker:latest
```
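After the run command returns, a quick status check can confirm the container came up before you tail the logs. This just uses standard `docker ps` filters, which `nvidia-docker` forwards to `docker`:

```shell
# Show the worker container's name and current status; an empty
# result means the container failed to start.
nvidia-docker ps --filter name=compute_worker --format '{{.Names}}: {{.Status}}'
```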
- Check the logs to ensure it connected successfully:

```
nvidia-docker logs -f --tail=100 compute_worker
```
If you see:

```
[2019-07-26 00:35:48,575: ERROR/MainProcess] consumer: Cannot connect to amqp://fb691c30-8941-40d9-8ea4-034f15340034:**@backup.chalearn.org:5672/aa2897e9-64fd-447d-b1cb-7159d8e3aed8: timed out.
Trying again in x seconds...
```

and you are working under the AutoDL project on the Google Cloud Platform, ensure that the network tag `allow-rabbitmq-and-flower` is added to the WEBSERVER (not the GPU worker). This allows RabbitMQ/Flower to connect to the workers.

If under a different project or operating on your own, make sure the ports used by RabbitMQ, the RabbitMQ management UI, and Flower (5672, 15672, and 5555 by default) are open.
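If you need to open those ports yourself on Google Cloud, something along these lines should work. The rule name and target tag below are assumptions (they mirror the tag used in the AutoDL project), and the exact `gcloud` flags may vary by SDK version:

```shell
# Hypothetical firewall rule opening RabbitMQ (5672), the RabbitMQ
# management UI (15672), and Flower (5555) to instances tagged
# allow-rabbitmq-and-flower.
gcloud compute firewall-rules create allow-rabbitmq-and-flower \
    --allow tcp:5672,tcp:15672,tcp:5555 \
    --target-tags allow-rabbitmq-and-flower
```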