Creating GPU Workers on Google Cloud

Note: this guide assumes you are a member of the AutoDL project on the Google Cloud Platform. The ready-to-use worker image is not currently public.

Creating the new GPU Worker

  • Start from the Google Cloud Console and click VM Instances under Compute Engine in the side menu

  • Select Create Instance from the top menu

  • Set up your machine name, region, and specs. We recommend one of the higher-memory options; 4 vCPUs and 26 GB of RAM should be fine

  • Under CPU Platform and GPU:
    • Select Intel Broadwell or later
    • Select Add GPU and add one NVIDIA Tesla P100

  • Select Change under Boot disk:

  • Choose Custom Images

  • Select codalab-nvidia-gpu-4 (the disk size should adjust automatically)

  • No other options need to be changed; submit the form to create the instance. (If you prefer the command line, an equivalent gcloud invocation is sketched below.)
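
As an alternative to the web form, the same instance can be created with the gcloud CLI. This is a sketch only: the instance name, zone, and project placeholder below are illustrative (pick a zone that offers P100s), while n1-highmem-4 matches the 4 vCPU / 26 GB recommendation above:

    gcloud compute instances create gpu-worker-1 \
        --zone us-central1-c \
        --machine-type n1-highmem-4 \
        --min-cpu-platform "Intel Broadwell" \
        --accelerator type=nvidia-tesla-p100,count=1 \
        --maintenance-policy TERMINATE \
        --image codalab-nvidia-gpu-4 \
        --image-project <YOUR_PROJECT_ID>

GPU instances require --maintenance-policy TERMINATE because they cannot be live-migrated.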

Configuring the new GPU Worker

  • SSH into the GPU worker using the SSH button in the Google Cloud Console

  • Add your user to the docker group: sudo usermod -aG docker $USER

  • Log out and SSH back in so the group change takes effect
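
Once you are back in, you can confirm the group change took effect by running a Docker command without sudo:

    docker ps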

  • Stop the worker container that is started automatically: nvidia-docker stop compute_worker

  • Remove the stopped worker container: nvidia-docker rm compute_worker

  • Run the following command after replacing <YOUR_BROKER_URL> with the correct broker URL:
    mkdir -p /tmp/codalab && nvidia-docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/nvidia-docker/nvidia-docker.sock:/var/lib/nvidia-docker/nvidia-docker.sock \
    -v /tmp/codalab:/tmp/codalab \
    -d \
    --name compute_worker \
    --env BROKER_URL=<YOUR_BROKER_URL> \
    --restart unless-stopped \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    codalab/competitions-v1-nvidia-worker:latest

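  • (Optional) Confirm the container is up before checking the logs; nvidia-docker passes standard Docker commands such as ps straight through, so this should list compute_worker: nvidia-docker ps --filter name=compute_worker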

  • Check the logs with nvidia-docker logs -f --tail=100 compute_worker to make sure the worker connected successfully

If you see:

[2019-07-26 00:35:48,575: ERROR/MainProcess] consumer: Cannot connect to amqp://fb691c30-8941-40d9-8ea4-034f15340034:**@backup.chalearn.org:5672/aa2897e9-64fd-447d-b1cb-7159d8e3aed8: timed out.
Trying again in x seconds...

Ensure that the network tag allow-rabbitmq-and-flower is added to the WEBSERVER (not the GPU worker) if you are working under the AutoDL project on the Google Cloud Platform. This tag allows the workers and RabbitMQ/Flower to communicate.
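
Assuming the webserver instance is named webserver-instance (a placeholder; substitute your actual instance name and zone), the tag can be added from the command line:

    gcloud compute instances add-tags webserver-instance \
        --zone <WEBSERVER_ZONE> \
        --tags allow-rabbitmq-and-flower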

If you are under a different project or operating on your own, make sure the default ports used by RabbitMQ, the RabbitMQ management interface, and Flower (5672, 15672, and 5555 respectively) are open.
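
A minimal sketch of such a firewall rule using gcloud; the rule name here matches the network tag above, and in practice you may want to restrict --source-ranges to your workers' addresses rather than the placeholder shown:

    gcloud compute firewall-rules create allow-rabbitmq-and-flower \
        --allow tcp:5672,tcp:15672,tcp:5555 \
        --target-tags allow-rabbitmq-and-flower \
        --source-ranges <WORKER_IP_RANGE>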
