refresh gpu docs; add azure types, update oci/examples/links (#764)
kwmonroe authored Apr 13, 2023
1 parent 825d781 commit 82680bc
Showing 2 changed files with 71 additions and 67 deletions.
2 changes: 1 addition & 1 deletion assets/nvidia-test.yaml
@@ -9,7 +9,7 @@ spec:
spec:
restartPolicy: Never
containers:
- image: nvidia/cuda
- image: nvidia/cuda:12.1.0-base-ubuntu22.04
name: nvidia-smi
args:
- nvidia-smi
136 changes: 70 additions & 66 deletions pages/k8s/gpu-workers.md
@@ -13,96 +13,101 @@ layout: [base, ubuntu-com]
toc: False
---

**Charmed Kubernetes** supports GPU-enabled
instances for applications which can use them. The kubernetes-worker charm will
automatically detect NVIDIA hardware and enable the appropriate support.
However, the implementation of GPU-enabled instances differs greatly between
public clouds. This page outlines the recommended methods for running GPU
enabled hardware for different public clouds.
**Charmed Kubernetes** supports GPU-enabled instances for applications that
can use them. The `kubernetes-worker` application will automatically detect
NVIDIA hardware and enable the appropriate support. This page describes
recommended deployment and verification steps when using GPU workers with
Charmed Kubernetes.

### Deploying Charmed Kubernetes with GPU workers on AWS
### Deploying Charmed Kubernetes with GPU workers

If you are installing Charmed Kubernetes using a bundle, you can use constraints to specify
that the worker units are deployed on GPU-enabled machines. Because GPU support
varies considerably depending on the underlying cloud, this requires specifying
a particular instance type.
When deploying the Charmed Kubernetes bundle, you can use a YAML overlay file
with constraints to ensure worker units are deployed on GPU-enabled machines.
Because GPU support varies depending on the underlying cloud, this requires
specifying a particular instance type.

This can be done with a YAML overlay file. For example, when deploying Charmed
Kubernetes on AWS, you may decide you wish to use AWS's 'p3.2xlarge' instances
(you can check the AWS instance definitions on the
[AWS website][aws-instance]). NVIDIA also updates its list of supported GPUs
frequently, so be sure to look at [NVIDIA GPU support docs][nvidia-gpu-support]
before installing on a specific AWS instance.
For example, when deploying to AWS, you may decide to use a `p3.2xlarge`
instance from the available [AWS GPU-enabled instance types][aws-instance].
Similarly, you could choose Azure's `Standard_NC6s_v3` instance from the
available [Azure GPU-enabled instance types][azure-instance].

NVIDIA updates its list of supported GPUs frequently, so be sure to
cross-reference the GPU included in a specific cloud instance against the
[Supported NVIDIA GPUs and Systems][nvidia-gpu-support] documentation.

A YAML overlay file can be constructed like this:
Example overlay files that set GPU worker constraints:

```yaml
#gpu-overlay.yaml
# AWS gpu-overlay.yaml
applications:
kubernetes-worker:
constraints: instance-type=p3.2xlarge
```
And then deployed with Charmed Kubernetes like this:
```yaml
# Azure gpu-overlay.yaml
applications:
kubernetes-worker:
constraints: instance-type=Standard_NC6s_v3
```
Deploy Charmed Kubernetes with an overlay like this:
```bash
juju deploy charmed-kubernetes --overlay ~/path/aws-overlay.yaml --overlay ~/path/gpu-overlay.yaml
juju deploy charmed-kubernetes --overlay ~/path/my-overlay.yaml --overlay ~/path/gpu-overlay.yaml
```

As demonstrated here, you can use multiple overlay files when deploying, so you
can combine GPU support with an integrator charm or other custom configuration.

You may then want to [test a GPU workload](#test)
You may then want to [test a GPU workload](#test).
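
Once the deployment settles, you can verify that GPU resources were registered
with the cluster. This is a sketch, assuming `kubectl` is configured for the
new cluster and that GPU support exposes the standard `nvidia.com/gpu`
resource name (a customised setup may differ):

```bash
# Show each node with its allocatable GPU count
# (the GPU column is empty on nodes without registered GPUs)
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```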

### Adding GPU workers with AWS
### Adding GPU workers post deployment

It isn't necessary for all the worker units to have GPU support. You can simply
add GPU-enabled workers to a running cluster. The recommended way to do this is
to first create a new constraint for the kubernetes-worker:
It isn't necessary for all worker units to have GPU support. You can add
GPU-enabled workers to an existing cluster. The recommended way to do this is
to first set a new constraint for the `kubernetes-worker` application:

```bash
juju set-constraints kubernetes-worker instance-type=p3.2xlarge
```

Then you can add as many new worker units as required. For example, to add two
new units.
Then add as many new GPU worker units as required. For example, to add two new
units:

```bash
juju add-unit kubernetes-worker -n2
```
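
Note that the constraint persists on the application, so any units added later
will also request the GPU instance type. If only some workers should be
GPU-enabled, you can reset the constraint after scaling. A sketch; clearing a
constraint by setting an empty value is assumed to behave this way on recent
Juju versions:

```bash
# Check that the new units are coming up on the expected instance type
juju status kubernetes-worker

# Reset the constraint so future worker units use the model default
juju set-constraints kubernetes-worker instance-type=
```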

### Adding GPU workers with GCP

Google supports GPUs slightly differently to most clouds. There are no GPUs
included in any of the default instance templates, and therefore they have
Google supports GPUs slightly differently to most clouds. There are no GPU
variations included in the general instance templates, and therefore they have
to be added manually.

To begin, add a new machine with Juju. Include any desired constraints for
memory,cores,etc :
cpu cores, memory, etc:

```bash
juju add-machine --constraints cores=2
juju add-machine --constraints 'cores=4 mem=16G'
```

The command will return, telling you the number of the machine that was
created - keep a note of this number.
The command will return the number of the machine that was created - take
note of this number.

Next you will need to use the gcloud tool or the GCP console to stop the
instance, edit its configuration and then restart the machine.
newly created instance, edit its configuration and then restart the machine.
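
The stop/edit/restart cycle can be sketched with the `gcloud` CLI. The
instance name, zone, and GPU type below are placeholders, and attaching the
accelerator itself may need to be done from the GCP console depending on your
`gcloud` release:

```bash
# Stop the Juju-created instance so its configuration can be changed
gcloud compute instances stop juju-machine-10 --zone us-east1-c

# GPU instances must not live-migrate during host maintenance
gcloud compute instances set-scheduling juju-machine-10 \
  --zone us-east1-c --maintenance-policy TERMINATE

# Attach the desired GPU (e.g. an NVIDIA T4) via the console:
#   Edit instance -> Machine configuration -> GPUs

gcloud compute instances start juju-machine-10 --zone us-east1-c
```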

Once it is up and running, you can then add it as a worker:
Once it is up and running, add the `kubernetes-worker` application to it:

```bash
juju add-unit kubernetes-worker --to 10
```

...replacing '10' in the above with the number of the machine you created.
...replacing `10` in the above with the previously noted number. As the charm
installs, the GPU will be detected and the relevant support will be installed.

As the charm installs, the GPU will be detected and the relevant drivers will
also be installed.
<a id="test"> </a>

## Testing

As GPU instances can be costly, it is useful to test that they can actually be
@@ -124,7 +129,7 @@ spec:
spec:
restartPolicy: Never
containers:
- image: nvidia/cuda:11.6.0-base-ubuntu20.04
- image: nvidia/cuda:12.1.0-base-ubuntu22.04
name: nvidia-smi
args:
- nvidia-smi
@@ -145,7 +150,6 @@ spec:
- name: libraries
hostPath:
path: /usr/lib/x86_64-linux-gnu

```
Download the file and run it with:
@@ -159,35 +163,35 @@
You can inspect the logs to find the hardware report.
```bash
kubectl logs job.batch/nvidia-smi

Thu Mar 3 14:52:26 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 39C P0 24W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Tue Apr 11 22:46:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-16GB On | 00000000:00:1E.0 Off | 0 |
| N/A 36C P0 23W / 300W| 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
```
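
As GPU instances are costly, it is worth cleaning up once the test passes. A
sketch, where the job name matches the manifest above and the `juju` unit
number is a placeholder:

```bash
# Confirm the job ran to completion
kubectl get job nvidia-smi

# Remove the test job once the logs have been inspected
kubectl delete job nvidia-smi

# If a GPU worker was added only for testing, remove it to stop billing,
# substituting the real unit name:
# juju remove-unit kubernetes-worker/3
```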

<!-- LINKS -->
[asset-nvidia]: https://raw.githubusercontent.com/juju-solutions/kubernetes-docs/main/assets/nvidia-test.yaml
[asset-nvidia]: https://raw.githubusercontent.com/charmed-kubernetes/kubernetes-docs/main/assets/nvidia-test.yaml
[nvidia-supported-tags]: https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/README.md#supported-tags
[quickstart]: /kubernetes/docs/quickstart
[aws-instance]: https://aws.amazon.com/ec2/instance-types/
[azure-instance]: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu
[nvidia-gpu-support]: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#supported-nvidia-gpus-systems
[aws-instance]: https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing
[azure-instance]: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu
[nvidia-gpu-support]: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#supported-nvidia-gpus-and-systems

<!-- FEEDBACK -->
<div class="p-notification--information">
