refresh gpu docs; add azure types, update oci/examples/links (#764)
kwmonroe authored Apr 13, 2023
1 parent 825d781 commit 82680bc
Showing 2 changed files with 71 additions and 67 deletions.
2 changes: 1 addition & 1 deletion assets/nvidia-test.yaml
@@ -9,7 +9,7 @@ spec:
spec:
restartPolicy: Never
containers:
- image: nvidia/cuda
- image: nvidia/cuda:12.1.0-base-ubuntu22.04
name: nvidia-smi
args:
- nvidia-smi
136 changes: 70 additions & 66 deletions pages/k8s/gpu-workers.md
@@ -13,96 +13,101 @@ layout: [base, ubuntu-com]
toc: False
---

**Charmed Kubernetes** supports GPU-enabled
instances for applications which can use them. The kubernetes-worker charm will
automatically detect NVIDIA hardware and enable the appropriate support.
However, the implementation of GPU-enabled instances differs greatly between
public clouds. This page outlines the recommended methods for running GPU
enabled hardware for different public clouds.
**Charmed Kubernetes** supports GPU-enabled instances for applications that
can use them. The `kubernetes-worker` application will automatically detect
NVIDIA hardware and enable the appropriate support. This page describes
recommended deployment and verification steps when using GPU workers with
Charmed Kubernetes.

### Deploying Charmed Kubernetes with GPU workers on AWS
### Deploying Charmed Kubernetes with GPU workers

If you are installing Charmed Kubernetes using a bundle, you can use constraints to specify
that the worker units are deployed on GPU-enabled machines. Because GPU support
varies considerably depending on the underlying cloud, this requires specifying
a particular instance type.
When deploying the Charmed Kubernetes bundle, you can use a YAML overlay file
with constraints to ensure worker units are deployed on GPU-enabled machines.
Because GPU support varies depending on the underlying cloud, this requires
specifying a particular instance type.

This can be done with a YAML overlay file. For example, when deploying Charmed
Kubernetes on AWS, you may decide you wish to use AWS's 'p3.2xlarge' instances
(you can check the AWS instance definitions on the
[AWS website][aws-instance]). NVIDIA also updates its list of supported GPUs
frequently, so be sure to look at [NVIDIA GPU support docs][nvidia-gpu-support]
before installing on a specific AWS instance.
For example, when deploying to AWS, you may decide to use a `p3.2xlarge`
instance from the available [AWS GPU-enabled instance types][aws-instance].
Similarly, you could choose Azure's `Standard_NC6s_v3` instance from the
available [Azure GPU-enabled instance types][azure-instance].

NVIDIA updates its list of supported GPUs frequently, so be sure to
cross-reference the GPU included in a specific cloud instance against the
[Supported NVIDIA GPUs and Systems][nvidia-gpu-support] documentation.

A YAML overlay file can be constructed like this:
Example overlay files that set GPU worker constraints:

```yaml
#gpu-overlay.yaml
# AWS gpu-overlay.yaml
applications:
kubernetes-worker:
constraints: instance-type=p3.2xlarge
```
And then deployed with Charmed Kubernetes like this:
```yaml
# Azure gpu-overlay.yaml
applications:
kubernetes-worker:
constraints: instance-type=Standard_NC6s_v3
```
Deploy Charmed Kubernetes with an overlay like this:
```bash
juju deploy charmed-kubernetes --overlay ~/path/aws-overlay.yaml --overlay ~/path/gpu-overlay.yaml
juju deploy charmed-kubernetes --overlay ~/path/my-overlay.yaml --overlay ~/path/gpu-overlay.yaml
```

As demonstrated here, you can use multiple overlay files when deploying, so you
can combine GPU support with an integrator charm or other custom configuration.

You may then want to [test a GPU workload](#test)
You may then want to [test a GPU workload](#test).
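
Once the deployment settles, you can verify that GPU resources were registered
with the cluster. This is a sketch, assuming `kubectl` is configured for the
new cluster and that GPU support exposes the standard `nvidia.com/gpu`
resource name (a customised setup may differ):

```bash
# Show each node with its allocatable GPU count
# (the GPU column is empty on nodes without registered GPUs)
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```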

### Adding GPU workers with AWS
### Adding GPU workers post deployment

It isn't necessary for all the worker units to have GPU support. You can simply
add GPU-enabled workers to a running cluster. The recommended way to do this is
to first create a new constraint for the kubernetes-worker:
It isn't necessary for all worker units to have GPU support. You can add
GPU-enabled workers to an existing cluster. The recommended way to do this is
to first set a new constraint for the `kubernetes-worker` application:

```bash
juju set-constraints kubernetes-worker instance-type=p3.2xlarge
```

Then you can add as many new worker units as required. For example, to add two
new units.
Then add as many new GPU worker units as required. For example, to add two new
units:

```bash
juju add-unit kubernetes-worker -n2
```
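
Note that the constraint persists on the application, so any units added later
will also request the GPU instance type. If only some workers should be
GPU-enabled, you can reset the constraint after scaling. A sketch; clearing a
constraint by setting an empty value is assumed to behave this way on recent
Juju versions:

```bash
# Check that the new units are coming up on the expected instance type
juju status kubernetes-worker

# Reset the constraint so future worker units use the model default
juju set-constraints kubernetes-worker instance-type=
```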

### Adding GPU workers with GCP

Google supports GPUs slightly differently to most clouds. There are no GPUs
included in any of the default instance templates, and therefore they have
Google supports GPUs slightly differently to most clouds. There are no GPU
variations included in the general instance templates, and therefore they have
to be added manually.

To begin, add a new machine with Juju. Include any desired constraints for
memory,cores,etc :
cpu cores, memory, etc:

```bash
juju add-machine --constraints cores=2
juju add-machine --constraints 'cores=4 mem=16G'
```

The command will return, telling you the number of the machine that was
created - keep a note of this number.
The command will return the number of the machine that was created - take
note of this number.

Next you will need to use the gcloud tool or the GCP console to stop the
instance, edit its configuration and then restart the machine.
newly created instance, edit its configuration and then restart the machine.
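
The stop/edit/restart cycle can be sketched with the `gcloud` CLI. The
instance name, zone, and GPU type below are placeholders, and attaching the
accelerator itself may need to be done from the GCP console depending on your
`gcloud` release:

```bash
# Stop the Juju-created instance so its configuration can be changed
gcloud compute instances stop juju-machine-10 --zone us-east1-c

# GPU instances must not live-migrate during host maintenance
gcloud compute instances set-scheduling juju-machine-10 \
  --zone us-east1-c --maintenance-policy TERMINATE

# Attach the desired GPU (e.g. an NVIDIA T4) via the console:
#   Edit instance -> Machine configuration -> GPUs

gcloud compute instances start juju-machine-10 --zone us-east1-c
```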

Once it is up and running, you can then add it as a worker:
Once it is up and running, add the `kubernetes-worker` application to it:

```bash
juju add-unit kubernetes-worker --to 10
```

...replacing '10' in the above with the number of the machine you created.
...replacing `10` in the above with the previously noted number. As the charm
installs, the GPU will be detected and the relevant support will be installed.

As the charm installs, the GPU will be detected and the relevant drivers will
also be installed.
<a id="test"> </a>

## Testing

As GPU instances can be costly, it is useful to test that they can actually be
@@ -124,7 +129,7 @@ spec:
spec:
restartPolicy: Never
containers:
- image: nvidia/cuda:11.6.0-base-ubuntu20.04
- image: nvidia/cuda:12.1.0-base-ubuntu22.04
name: nvidia-smi
args:
- nvidia-smi
@@ -145,7 +150,6 @@ spec:
- name: libraries
hostPath:
path: /usr/lib/x86_64-linux-gnu

```
Download the file and run it with:
@@ -159,35 +163,35 @@
You can inspect the logs to find the hardware report.
```bash
kubectl logs job.batch/nvidia-smi

Thu Mar 3 14:52:26 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 39C P0 24W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Tue Apr 11 22:46:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-16GB On | 00000000:00:1E.0 Off | 0 |
| N/A 36C P0 23W / 300W| 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
```
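
As GPU instances are costly, it is worth cleaning up once the test passes. A
sketch, where the job name matches the manifest above and the `juju` unit
number is a placeholder:

```bash
# Confirm the job ran to completion
kubectl get job nvidia-smi

# Remove the test job once the logs have been inspected
kubectl delete job nvidia-smi

# If a GPU worker was added only for testing, remove it to stop billing,
# substituting the real unit name:
# juju remove-unit kubernetes-worker/3
```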

<!-- LINKS -->
[asset-nvidia]: https://raw.githubusercontent.com/juju-solutions/kubernetes-docs/main/assets/nvidia-test.yaml
[asset-nvidia]: https://raw.githubusercontent.com/charmed-kubernetes/kubernetes-docs/main/assets/nvidia-test.yaml
[nvidia-supported-tags]: https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/README.md#supported-tags
[quickstart]: /kubernetes/docs/quickstart
[aws-instance]: https://aws.amazon.com/ec2/instance-types/
[azure-instance]: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu
[nvidia-gpu-support]: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#supported-nvidia-gpus-systems
[aws-instance]: https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing
[azure-instance]: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu
[nvidia-gpu-support]: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#supported-nvidia-gpus-and-systems

<!-- FEEDBACK -->
<div class="p-notification--information">
