Creating GPU Workers on Google Cloud

Note: this guide assumes you are a member of the AutoDL project on the Google Cloud Platform. The ready-to-use worker image is not currently public.

Creating the new GPU Worker

  • Start from the Google Cloud Console and click VM Instances under Compute Engine in the side menu

  • Select Create Instance from the top menu

  • Set up your machine name, region, and specs. We recommend one of the higher-memory options; 4 vCPUs and 26 GB of RAM should be fine

  • Under CPU Platform and GPU:
    • Select Intel Broadwell or later
    • Select Add GPU and add one NVIDIA Tesla P100

  • Select Change under Boot disk:

  • Choose Custom Images

  • Select codalab-nvidia-gpu-4 (the disk size should adjust automatically)

  • No other options need to be changed; submit the form to create the instance. (If you prefer the command line, an equivalent gcloud invocation is sketched below.)
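
As an alternative to the web form, the same instance can be created with the gcloud CLI. This is a sketch only: the instance name, zone, and project placeholder below are illustrative (pick a zone that offers P100s), while n1-highmem-4 matches the 4 vCPU / 26 GB recommendation above:

    gcloud compute instances create gpu-worker-1 \
        --zone us-central1-c \
        --machine-type n1-highmem-4 \
        --min-cpu-platform "Intel Broadwell" \
        --accelerator type=nvidia-tesla-p100,count=1 \
        --maintenance-policy TERMINATE \
        --image codalab-nvidia-gpu-4 \
        --image-project <YOUR_PROJECT_ID>

GPU instances require --maintenance-policy TERMINATE because they cannot be live-migrated.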

Configuring the new GPU Worker

  • SSH into the GPU worker using the SSH button in the Google Cloud Console

  • Add your user to the docker group: sudo usermod -aG docker $USER

  • Log out and SSH back in so the group change takes effect
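
Once you are back in, you can confirm the group change took effect by running a Docker command without sudo:

    docker ps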

  • Stop the worker container that is started automatically: nvidia-docker stop compute_worker

  • Remove the stopped worker container: nvidia-docker rm compute_worker

  • Run the following command after replacing <YOUR_BROKER_URL> with the correct broker URL:
    mkdir -p /tmp/codalab && nvidia-docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/nvidia-docker/nvidia-docker.sock:/var/lib/nvidia-docker/nvidia-docker.sock \
    -v /tmp/codalab:/tmp/codalab \
    -d \
    --name compute_worker \
    --env BROKER_URL=<YOUR_BROKER_URL> \
    --restart unless-stopped \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    codalab/competitions-v1-nvidia-worker:latest

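  • (Optional) Confirm the container is up before checking the logs; nvidia-docker passes standard Docker commands such as ps straight through, so this should list compute_worker: nvidia-docker ps --filter name=compute_worker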

  • Check the logs with nvidia-docker logs -f --tail=100 compute_worker to make sure the worker connected successfully

If you see:

[2019-07-26 00:35:48,575: ERROR/MainProcess] consumer: Cannot connect to amqp://fb691c30-8941-40d9-8ea4-034f15340034:**@backup.chalearn.org:5672/aa2897e9-64fd-447d-b1cb-7159d8e3aed8: timed out.
Trying again in x seconds...

Ensure that the network tag allow-rabbitmq-and-flower is added to the WEBSERVER (not the GPU worker) if you are working under the AutoDL project on the Google Cloud Platform. This tag allows the workers and RabbitMQ/Flower to communicate.
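
Assuming the webserver instance is named webserver-instance (a placeholder; substitute your actual instance name and zone), the tag can be added from the command line:

    gcloud compute instances add-tags webserver-instance \
        --zone <WEBSERVER_ZONE> \
        --tags allow-rabbitmq-and-flower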

If you are under a different project or operating on your own, make sure the default ports used by RabbitMQ, the RabbitMQ management interface, and Flower (5672, 15672, and 5555 respectively) are open.
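
A minimal sketch of such a firewall rule using gcloud; the rule name here matches the network tag above, and in practice you may want to restrict --source-ranges to your workers' addresses rather than the placeholder shown:

    gcloud compute firewall-rules create allow-rabbitmq-and-flower \
        --allow tcp:5672,tcp:15672,tcp:5555 \
        --target-tags allow-rabbitmq-and-flower \
        --source-ranges <WORKER_IP_RANGE>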
