From e01f34582b33ad52b1af8f87d21a645a42a4dc65 Mon Sep 17 00:00:00 2001
From: SherinDaher-Runai
Date: Mon, 16 Dec 2024 12:15:04 +0200
Subject: [PATCH] Merge pull request #1292 from run-ai/oci-support

OCI support
---
 .../cluster-setup/cluster-prerequisites.md | 25 ++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md b/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
index 7a81e8e6cd..540ecff65c 100644
--- a/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
+++ b/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
@@ -56,7 +56,8 @@ Run:ai Cluster requires Kubernetes. The following Kubernetes distributions are s
 * NVIDIA Base Command Manager (BCM)
 * Elastic Kubernetes Engine (EKS)
 * Google Kubernetes Engine (GKE)
-* Azure Kubernetes Service (AKS)
+* Azure Kubernetes Service (AKS)
+* Oracle Kubernetes Engine (OKE)
 * Rancher Kubernetes Engine (RKE1)
 * Rancher Kubernetes Engine 2 (RKE2)
 
@@ -130,6 +131,23 @@ There are many ways to install and configure different ingress controllers. A si
         --namespace nginx-ingress --create-namespace
     ```
 
+=== "Oracle Kubernetes Engine (OKE)"
+
+    Run the following commands:
+
+    ``` bash
+    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
+    helm repo update
+    helm install nginx-ingress ingress-nginx/ingress-nginx \
+        --namespace ingress-nginx --create-namespace \
+        --set controller.service.annotations.oci.oraclecloud.com/load-balancer-type=nlb \
+        --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/is-preserve-source=True \
+        --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/security-list-management-mode=None \
+        --set controller.service.externalTrafficPolicy=Local \
+        --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/subnet= # Replace with the subnet ID of one of your cluster
+    ```
+
+
 ### NVIDIA GPU Operator
 
 Run:ai Cluster requires NVIDIA GPU Operator to be installed on the Kubernetes Cluster, supports version 22.9 to 24.6
@@ -195,6 +213,11 @@ kubectl patch clusterPolicy cluster-policy -n gpu-operator --type=merge -p '{"sp
     `/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl` even though the file may not exist in your system.
 
+??? "Oracle Kubernetes Engine (OKE)"
+
+    * During cluster setup, [create a nodepool](https://docs.oracle.com/en-us/iaas/tools/python/latest/api/container_engine/models/oci.container_engine.models.NodePool.html#oci.container_engine.models.NodePool.initial_node_labels), and set `initial_node_labels` to include `oci.oraclecloud.com/disable-gpu-device-plugin=true` which disables the NVIDIA GPU device plugin.
+    * For GPU nodes, OKE defaults to Oracle Linux, which is incompatible with NVIDIA drivers. To resolve this, use a custom Ubuntu image instead.
+
 For troubleshooting information, see the [NVIDIA GPU Operator Troubleshooting Guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html).
 
 ### Prometheus
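
Review note on the OKE ingress tab: Helm's `--set` splits keys on unescaped dots and converts values such as `True` to non-string types, so the annotation keys in the added command may not reach the chart as written. A values file avoids both concerns. The following is a minimal sketch, not part of the patch; the file name `oke-ingress-values.yaml` and the `SUBNET_ID` variable are hypothetical, and the annotation keys and values are copied from the added command.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder: set SUBNET_ID to the ID of a subnet in your cluster.
SUBNET_ID="${SUBNET_ID:-<subnet-ocid>}"
# Hypothetical file name for the generated Helm values.
VALUES_FILE="${VALUES_FILE:-oke-ingress-values.yaml}"

# Write the same settings the --set chain carries, as quoted YAML strings
# so the dotted annotation keys stay flat and the values stay strings.
cat > "$VALUES_FILE" <<EOF
controller:
  service:
    externalTrafficPolicy: Local
    annotations:
      oci.oraclecloud.com/load-balancer-type: "nlb"
      oci-network-load-balancer.oraclecloud.com/is-preserve-source: "True"
      oci-network-load-balancer.oraclecloud.com/security-list-management-mode: "None"
      oci-network-load-balancer.oraclecloud.com/subnet: "${SUBNET_ID}"
EOF

# Print (rather than run) the install command, so the sketch works without a cluster.
echo "helm install nginx-ingress ingress-nginx/ingress-nginx \\"
echo "    --namespace ingress-nginx --create-namespace -f ${VALUES_FILE}"
```

Running the printed command after `helm repo add` and `helm repo update` should be equivalent in intent to the `--set` form in the patch; whether the resulting Service gets the expected network load balancer still needs checking on a real OKE cluster.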