From 73483d7f023e77d55586e68f69f87b1507bd038a Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 3 Jan 2024 16:20:28 +0000 Subject: [PATCH 01/19] Add first draft of template workflow --- .../training/L4_template_workflow.md | 314 ++++++++++++++++++ mkdocs.yml | 2 + 2 files changed, 316 insertions(+) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 2114bfda7..16e008145 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -1 +1,315 @@ # Template workflow + +An example workflow for code development using K8s is outlined below. + +The workflow requires a GitHub account and GitHub Actions for CI/CD, (this can be adapted for other platforms such as GitLab). + +The workflow is separated into three sections: + +1) Data Loading + +1) Preparing a custom Docker image + +1) Code development with K8s + +## Data loading + +### Create a persistent volume + +Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec yaml below). + +``` bash +kubectl create -f +``` + +##### Example PyTorch PersistentVolumeClaim + +``` yaml +kind: PersistentVolumeClaim +apiVersion: v1 +metadata: + name: template-workflow-pvc +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 100Gi + storageClassName: csi-rbd-sc +``` + +### Create a lightweight pod to tranfer data to the persistent volume + +1. Check PVC has been created + + ``` bash + kubectl get pvc + ``` + +1. Create a lightweight pod with PV mounted (example pod below) + + ``` bash + kubectl create -f lightweight-pod.yaml + ``` + +1. Download data set (If the data set download time is estimated to be hours or days you may want to run this code within a [screen](https://www.gnu.org/software/screen/manual/screen.html) instance on your VM so you can track the progress asynchronously) + + ``` bash + kubectl exec lightweight-pod -- wget /mnt/ceph_rdb/ + ``` + +1. Delete lightweight pod + + ``` bash + kubectl delete pod lightweight-pod + ``` + +##### Example lightweight pod specification + +``` yaml +apiVersion: v1 +kind: Pod +metadata: + name: lightweight-pod +spec: + containers: + - name: data-loader + image: ubuntu-latest + command: ["sleep", "infinity"] + resources: + requests: + cpu: 1 + memory: "1Gi" + limits: + cpu: 1 + memory: "1Gi" + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: + claimName: template-workflow-pvc +``` + +## Preparing a custom Docker image + +Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. Typical use cases require some custom modifications of a base image. 
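A quick way to sanity check a candidate base image is to pull it and inspect it locally before writing any modifications. The snippet below is a sketch that assumes Docker is installed on your VM or laptop; the tag shown is the RAPIDS base image used later in this guide, so substitute whichever base image you select.

``` bash
# Pull the base image locally and confirm the download completed
docker pull nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10
docker image ls nvcr.io/nvidia/rapidsai/base
```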
+ +1) Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) + +1) Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image + + ```txt + FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10 + RUN pip install pandas + RUN pip install scikit-learn + ``` + +1) Build the Docker container locally or on a VM (You will need to install [Docker](https://docs.docker.com/)) + + ```bash + docker build + ``` + +1) Push Docker image to Docker Hub (You will need to create and setup an account) + + ```bash + docker push template-docker-image + ``` + +## Code development with K8s + +A rapid development cycle from code writing to testing requires some initial setup within k8s. + +The first step is to automatically pull the latest code version before running any tests in a pod. + +This allows development to be conducted on any device/VM with access to the repo (GitHub/GitLab) and testing to be completed on the cluster with just one `kubectl create` command. + +This allows custom code/models to be prototyped on the cluster, but typically within a standard base image. + +However, if the Docker container also needs to be developed then GitHub actions can be used to automatically build a new image and publish it to Docker Hub if any changes to a Dockerfile is detected. + +A template GitHub repo with sample code, k8s yaml files and github actions is available [here](https://github.com/DimmestP/template-EIDFGPU-workflow). + +### Create a job that downloads and runs the latest code version at runtime + +1) Create a standard job with the required resources and custom docker image (example below) + + ```yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: template-workflow-job + spec: + completions: 1 + parallelism: 1 + template: + spec: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ["sleep", "infinity"] + resources: + requests: + cpu: 10 + memory: "40Gi" + limits: + cpu: 10 + memory: "80Gi" + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: + claimName: template-workflow-pvc + ``` + +1) Add an initial container that runs before the main container to download the latest version of the code. + + ```yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: template-workflow-job + spec: + completions: 1 + parallelism: 1 + template: + spec: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ["sleep", "infinity"] + resources: + requests: + cpu: 10 + memory: "40Gi" + limits: + cpu: 10 + memory: "80Gi" + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + - mountPath: /code + name: github-code + initContainers: + - name: lightweight-git-container + image: cicirello/alpine-plus-plus + command: ['sh', '-c', "cd /code; git clone "] + resources: + requests: + cpu: 1 + memory: "4Gi" + limits: + cpu: 1 + memory: "8Gi" + volumeMounts: + - mountPath: /code + name: github-code + volumes: + - name: volume + persistentVolumeClaim: + claimName: benchmark-imagenet-pvc + - name: github-code + emptyDir: + sizeLimit: 1Gi + ``` + +1) Change the command argument in the main container to run the code once started. 
+ + ```yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: template-workflow-job + spec: + completions: 1 + parallelism: 1 + template: + spec: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ['sh', '-c', "python3 /code/"] + resources: + requests: + cpu: 10 + memory: "40Gi" + limits: + cpu: 10 + memory: "80Gi" + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + - mountPath: /code + name: github-code + initContainers: + - name: lightweight-git-container + image: cicirello/alpine-plus-plus + command: ['sh', '-c', "cd /code; git clone "] + resources: + requests: + cpu: 1 + memory: "4Gi" + limits: + cpu: 1 + memory: "8Gi" + volumeMounts: + - mountPath: /code + name: github-code + volumes: + - name: volume + persistentVolumeClaim: + claimName: benchmark-imagenet-pvc + - name: github-code + emptyDir: + sizeLimit: 1Gi + ``` + +### Setup GitHub actions to build and publish any changes to a Dockerfile + +1) Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. + +1) Add the Dockerfile to a code/docker folder within the active GitHub repo + +1) Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. + + ```yaml + name: ci + on: + push: + paths: + - 'code/docker/**' + + jobs: + docker: + runs-on: ubuntu-latest + steps: + - + name: Set up QEMU + uses: docker/setup-qemu-action@v3 + - + name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + - + name: Login to Docker Hub + uses: docker/login-action@v3 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + - + name: Build and push + uses: docker/build-push-action@v5 + with: + context: "{{defaultContext}}:code/docker" + push: true + tags: + ``` diff --git a/mkdocs.yml b/mkdocs.yml index fb602f696..85ba97978 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -68,6 +68,8 @@ nav: - "Getting Started": services/gpuservice/training/L1_getting_started.md - "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md - "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md + - "Template K8s Workflow": +services/gpuservice/training/L4_template_workflow.md - "GPU Service FAQ": services/gpuservice/faq.md - "Graphcore Bow Pod64": - "Overview": services/graphcore/index.md From 1cdc440482e11cce4b536e6ebc10c4527fd219d1 Mon Sep 17 00:00:00 2001 From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com> Date: Thu, 4 Jan 2024 09:24:40 +0000 Subject: [PATCH 02/19] Clarify workflow with git pull --- docs/services/gpuservice/training/L4_template_workflow.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 16e008145..54189c521 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -134,7 +134,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av ### Create a job that downloads and runs the latest code version at runtime -1) Create a standard job with the required resources and custom docker image (example below) +1) Write a standard yaml file for a k8s job with 
the required resources and custom docker image (example below) ```yaml apiVersion: batch/v1 @@ -274,6 +274,11 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av sizeLimit: 1Gi ``` +1) Submit the yaml file to kubernetes + ```bash + kubectl create -f + ``` + ### Setup GitHub actions to build and publish any changes to a Dockerfile 1) Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. From 5ea7d16831d187a9940d61321d9d55a35a7cd181 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 4 Jan 2024 10:08:32 +0000 Subject: [PATCH 03/19] Fix bug with loading template --- conda-requirements.yaml | 33 ------------------- .../training/L4_template_workflow.md | 32 +++++++++--------- mkdocs.yml | 3 +- 3 files changed, 17 insertions(+), 51 deletions(-) delete mode 100644 conda-requirements.yaml diff --git a/conda-requirements.yaml b/conda-requirements.yaml deleted file mode 100644 index 566edc658..000000000 --- a/conda-requirements.yaml +++ /dev/null @@ -1,33 +0,0 @@ -name: mkdocs -channels: - - conda-forge -dependencies: - - backports=1.1 - - cfgv=3.3.0 - - click=8.0.1 - - distlib=0.3.2 - - filelock=3.0.12 - - ghp-import=2.0.1 - - identify=2.2.11 - - importlib-metadata=4.6.1 - - Jinja2=3.0.1 - - Markdown=3.3.4 - - MarkupSafe=2.0.1 - - mergedeep=1.3.4 - - mkdocs=1.2.1 - - mkdocs-material=7.1.10 - - mkdocs-material-extensions=1.0.1 - - nodeenv=1.6.0 - - packaging=21.0 - - platformdirs=3.2 - - pre-commit=2.13.0 - - Pygments=2.9.0 - - pymdown-extensions=8.2 - - pyparsing=2.4.7 - - python-dateutil=2.8.1 - - PyYAML=5.4.1 - - pyyaml-env-tag=0.1 - - six=1.16.0 - - toml=0.10.2 - - watchdog=2.1.3 - - zipp=3.5.0 diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 54189c521..9d00cbdb1 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -6,11 +6,11 @@ The workflow requires a GitHub account and GitHub Actions for CI/CD, (this can b The workflow is separated into three sections: -1) Data Loading +1. Data Loading -1) Preparing a custom Docker image +1. Preparing a custom Docker image -1) Code development with K8s +1. Code development with K8s ## Data loading @@ -22,7 +22,7 @@ Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec kubectl create -f ``` -##### Example PyTorch PersistentVolumeClaim +#### Example PyTorch PersistentVolumeClaim ``` yaml kind: PersistentVolumeClaim @@ -64,7 +64,7 @@ spec: kubectl delete pod lightweight-pod ``` -##### Example lightweight pod specification +#### Example lightweight pod specification ``` yaml apiVersion: v1 @@ -96,9 +96,9 @@ spec: Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. Typical use cases require some custom modifications of a base image. -1) Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) +1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) -1) Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image +1. 
Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image ```txt FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10 @@ -106,13 +106,13 @@ Kubernetes requires Docker images to be pre-built and available for download fro RUN pip install scikit-learn ``` -1) Build the Docker container locally or on a VM (You will need to install [Docker](https://docs.docker.com/)) +1. Build the Docker container locally or on a VM (You will need to install [Docker](https://docs.docker.com/)) ```bash docker build ``` -1) Push Docker image to Docker Hub (You will need to create and setup an account) +1. Push Docker image to Docker Hub (You will need to create and setup an account) ```bash docker push template-docker-image @@ -134,7 +134,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av ### Create a job that downloads and runs the latest code version at runtime -1) Write a standard yaml file for a k8s job with the required resources and custom docker image (example below) +1. Write a standard yaml file for a k8s job with the required resources and custom docker image (example below) ```yaml apiVersion: batch/v1 @@ -168,7 +168,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av claimName: template-workflow-pvc ``` -1) Add an initial container that runs before the main container to download the latest version of the code. +1. Add an initial container that runs before the main container to download the latest version of the code. ```yaml apiVersion: batch/v1 @@ -221,7 +221,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av sizeLimit: 1Gi ``` -1) Change the command argument in the main container to run the code once started. +1. Change the command argument in the main container to run the code once started. ```yaml apiVersion: batch/v1 @@ -274,18 +274,18 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av sizeLimit: 1Gi ``` -1) Submit the yaml file to kubernetes +1. Submit the yaml file to kubernetes ```bash kubectl create -f ``` ### Setup GitHub actions to build and publish any changes to a Dockerfile -1) Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. +1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. -1) Add the Dockerfile to a code/docker folder within the active GitHub repo +1. Add the Dockerfile to a code/docker folder within the active GitHub repo -1) Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. +1. Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. 
```yaml name: ci diff --git a/mkdocs.yml b/mkdocs.yml index 85ba97978..b2837cbb7 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -68,8 +68,7 @@ nav: - "Getting Started": services/gpuservice/training/L1_getting_started.md - "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md - "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md - - "Template K8s Workflow": -services/gpuservice/training/L4_template_workflow.md + - "Template K8s Workflow": services/gpuservice/training/L4_template_workflow.md - "GPU Service FAQ": services/gpuservice/faq.md - "Graphcore Bow Pod64": - "Overview": services/graphcore/index.md From 588cf06275a63228bd46a09615b247498f97e8f7 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 4 Jan 2024 10:25:51 +0000 Subject: [PATCH 04/19] Changed template worflow md in response to pre-commit --- .../training/L4_template_workflow.md | 21 ++++++++++--------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 9d00cbdb1..6dc74df01 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -51,13 +51,13 @@ spec: ``` bash kubectl create -f lightweight-pod.yaml ``` - + 1. Download data set (If the data set download time is estimated to be hours or days you may want to run this code within a [screen](https://www.gnu.org/software/screen/manual/screen.html) instance on your VM so you can track the progress asynchronously) ``` bash kubectl exec lightweight-pod -- wget /mnt/ceph_rdb/ ``` - + 1. Delete lightweight pod ``` bash @@ -99,7 +99,7 @@ Kubernetes requires Docker images to be pre-built and available for download fro 1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) 1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image - + ```txt FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10 RUN pip install pandas @@ -122,11 +122,11 @@ Kubernetes requires Docker images to be pre-built and available for download fro A rapid development cycle from code writing to testing requires some initial setup within k8s. -The first step is to automatically pull the latest code version before running any tests in a pod. +The first step is to automatically pull the latest code version before running any tests in a pod. This allows development to be conducted on any device/VM with access to the repo (GitHub/GitLab) and testing to be completed on the cluster with just one `kubectl create` command. -This allows custom code/models to be prototyped on the cluster, but typically within a standard base image. +This allows custom code/models to be prototyped on the cluster, but typically within a standard base image. However, if the Docker container also needs to be developed then GitHub actions can be used to automatically build a new image and publish it to Docker Hub if any changes to a Dockerfile is detected. 
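For this build-on-push approach the repository needs the Dockerfile and the action yaml in the locations the workflow expects. The sketch below sets up that layout; note that GitHub Actions reads workflow files from the `.github/workflows/` directory, and the `build-docker.yaml` filename is an arbitrary choice rather than anything required by this guide.

``` bash
# Create the folder layout assumed by the automatic Docker build
mkdir -p .github/workflows code/docker
touch code/docker/Dockerfile               # the custom image definition
touch .github/workflows/build-docker.yaml  # holds the GitHub action yaml shown in this guide
```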
@@ -210,7 +210,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av cpu: 1 memory: "8Gi" volumeMounts: - - mountPath: /code + - mountPath: /code name: github-code volumes: - name: volume @@ -263,7 +263,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av cpu: 1 memory: "8Gi" volumeMounts: - - mountPath: /code + - mountPath: /code name: github-code volumes: - name: volume @@ -273,13 +273,14 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av emptyDir: sizeLimit: 1Gi ``` - + 1. Submit the yaml file to kubernetes + ```bash kubectl create -f ``` - -### Setup GitHub actions to build and publish any changes to a Dockerfile + +### Setup GitHub actions to build and publish any changes to a Dockerfile 1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. From a83cdad2babbb5018ee6b3a006cc72f41d452d31 Mon Sep 17 00:00:00 2001 From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com> Date: Thu, 4 Jan 2024 11:02:10 +0000 Subject: [PATCH 05/19] Remove reference to pytorch in L4_template_workflow.md --- docs/services/gpuservice/training/L4_template_workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 6dc74df01..360731d61 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -22,7 +22,7 @@ Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec kubectl create -f ``` -#### Example PyTorch PersistentVolumeClaim +#### Example PersistentVolumeClaim ``` yaml kind: PersistentVolumeClaim From 20f75e0550d042d1f999c49bcf6f28c4c62819a8 Mon Sep 17 00:00:00 2001 From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com> Date: Thu, 4 Jan 2024 12:12:49 +0000 Subject: [PATCH 06/19] Fix bugs with data loading --- docs/services/gpuservice/training/L4_template_workflow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 360731d61..81431fb3a 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -55,7 +55,7 @@ spec: 1. Download data set (If the data set download time is estimated to be hours or days you may want to run this code within a [screen](https://www.gnu.org/software/screen/manual/screen.html) instance on your VM so you can track the progress asynchronously) ``` bash - kubectl exec lightweight-pod -- wget /mnt/ceph_rdb/ + kubectl exec lightweight-pod -- curl /mnt/ceph_rbd/ ``` 1. 
Delete lightweight pod @@ -74,7 +74,7 @@ metadata: spec: containers: - name: data-loader - image: ubuntu-latest + image: alpine/curl:latest command: ["sleep", "infinity"] resources: requests: From ced7729697b74621cb777c0eeda806074ab01578 Mon Sep 17 00:00:00 2001 From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com> Date: Thu, 4 Jan 2024 12:35:47 +0000 Subject: [PATCH 07/19] Fix indent in basic workflow container --- .../training/L4_template_workflow.md | 42 +++++++++---------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 81431fb3a..35647c9d9 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -140,31 +140,31 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av apiVersion: batch/v1 kind: Job metadata: - name: template-workflow-job + name: template-workflow-job spec: - completions: 1 - parallelism: 1 - template: + completions: 1 + parallelism: 1 + template: spec: - restartPolicy: Never - containers: - - name: template-docker-image - image: /template-docker-image:latest - command: ["sleep", "infinity"] - resources: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ["sleep", "infinity"] + resources: requests: - cpu: 10 - memory: "40Gi" + cpu: 10 + memory: "40Gi" limits: - cpu: 10 - memory: "80Gi" - nvidia.com/gpu: 1 - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - volumes: - - name: volume - persistentVolumeClaim: + cpu: 10 + memory: "80Gi" + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: claimName: template-workflow-pvc ``` From a5322b132a68cd62f5290a5e2d8cf64e698411c7 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 7 Feb 2024 15:14:34 +0000 Subject: [PATCH 08/19] Add notes highlighting the importance of specifying GPU types --- docs/services/gpuservice/index.md | 5 +++++ docs/services/gpuservice/training/L1_getting_started.md | 5 +++++ 2 files changed, 10 insertions(+) diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 7dde82aaf..99629d1b0 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -33,6 +33,11 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: Changes to the default quota must be discussed and agreed with the EIDF Services team. +> **NOTE** +> +> If you request a GPU on the EIDF GPU Service you will be assigned one at random unless you specify a GPU type. +> Please see [Getting started with Kubernetes](training/L1_getting_started.md) to learn about specifying GPU resources. + ## Service Access Users should have an [EIDF Account](../../access/project.md). 
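As the note above explains, a GPU request that does not name a product type is scheduled onto whichever GPU happens to be free. A minimal sketch of pinning a type with a node selector is shown below; the `nvidia.com/gpu.product` value comes from the list in the Getting Started lesson (see the change that follows), and the pod name, image, command and resource sizes are placeholders taken from examples elsewhere in this guide.

``` yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-type-example
spec:
  restartPolicy: Never
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: 1
        memory: "4Gi"
      limits:
        cpu: 1
        memory: "8Gi"
        nvidia.com/gpu: 1
  nodeSelector:
    # pin the GPU product type; the value must match the node label exactly
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
```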
diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index 9ebd1bea7..71fa10d1b 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -237,6 +237,11 @@ The GPU resource requests can be made more specific by adding the type of GPU pr - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'` - `nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'` +> **WARNING** +> +> If you request a GPU but do not specify a GPU type you will be assigned one at random. +> Please check you are requesting a GPU with the correct memory and double check spelling. + ### Example yaml file ```yaml From ba2546df4600fab52723a3cd58dc6ce7006a6514 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Tue, 13 Feb 2024 17:19:35 +0000 Subject: [PATCH 09/19] Add options for all three stages of the workflow --- .../training/L4_template_workflow.md | 379 ++++++++++++------ 1 file changed, 265 insertions(+), 114 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 35647c9d9..0b7768954 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -1,136 +1,327 @@ # Template workflow +## Overview + An example workflow for code development using K8s is outlined below. -The workflow requires a GitHub account and GitHub Actions for CI/CD, (this can be adapted for other platforms such as GitLab). +In theory, users can create docker images with all the code, software and data included to complete their analysis. -The workflow is separated into three sections: +In practice, docker images with the required software alone can be several gigabytes in size and can be lead to unacceptable download times when ~100GB of data and code is included. -1. Data Loading +Therefore, it is recommended to separate code, software and data preparation into distinct steps: -1. Preparing a custom Docker image +1. Data Loading: Loading large data sets asynchronously. -1. Code development with K8s +1. Developing a Docker environment: Manually or automatically building Docker images. -## Data loading +1. Code development with K8s: Iteratively changing and testing code in a job. + +The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service. -### Create a persistent volume +The three stages are interchangeable and may not be relevant to every project. + +Some strategies in the workflow require a [GitHub](https://github.com) account and [Docker Hub](https://hub.docker.com/) account for automatic building (this can be adapted for other platforms such as GitLab). + +## Data loading -Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec yaml below). +The EIDF GPU service contains GPUs with 40Gb/80Gb of on board memory and it is expected that data sets of > 100 Gb will be loaded onto the service to utilise this hardware. -``` bash -kubectl create -f -``` +Ensure persistent volume claims are of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO. -#### Example PersistentVolumeClaim +Read the [requesting persistent volumes with Kubernetes](L2_requesting_persistent_volumes.md) lesson to learn how to request and mount persistent volumes to pods. 
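As a reminder, the claim used throughout this guide can be created from a short spec like the one below and submitted with `kubectl create -f <pvc-spec-yaml>`. The name, 100Gi request and `csi-rbd-sc` storage class follow the earlier example in this guide; size the request for your own input data plus the outputs you expect to write back.

``` yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: template-workflow-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi # input data + expected outputs + a little free space for IO
  storageClassName: csi-rbd-sc
```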
-``` yaml -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: template-workflow-pvc -spec: - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 100Gi - storageClassName: csi-rbd-sc -``` +Downloading data sets of 1/2 TB or more to a persistent volume often takes several hours or days and needs to be completed asynchronously. -### Create a lightweight pod to tranfer data to the persistent volume +### Asynchronous data downloading with a lightweight job -1. Check PVC has been created +1. Check a PVC has been created. ``` bash - kubectl get pvc + kubectl get pvc template-workflow-pvc + ``` + +1. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest. + + ``` yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: lightweight-job + labels: + kueue.x-k8s.io/queue-name: + spec: + completions: 1 + parallelism: 1 + template: + metadata: + name: lightweight-job + spec: + restartPolicy: Never + containers: + - name: data-loader + image: alpine/curl:latest + command: ['sh', '-c', "cd /mnt/ceph_rbd; curl https://archive.ics.uci.edu/static/public/53/iris.zip -o iris.zip"] + resources: + requests: + cpu: 1 + memory: "1Gi" + limits: + cpu: 1 + memory: "1Gi" + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: + claimName: template-workflow-pvc ``` -1. Create a lightweight pod with PV mounted (example pod below) +1. Run the data download job. ``` bash kubectl create -f lightweight-pod.yaml ``` -1. Download data set (If the data set download time is estimated to be hours or days you may want to run this code within a [screen](https://www.gnu.org/software/screen/manual/screen.html) instance on your VM so you can track the progress asynchronously) +1. Check if the download has completed. ``` bash - kubectl exec lightweight-pod -- curl /mnt/ceph_rbd/ + kubectl get jobs ``` -1. Delete lightweight pod +1. Delete lightweight job once completed. ``` bash - kubectl delete pod lightweight-pod + kubectl delete job lightweight-job + ``` + +### Asynchronous data downloading within a screen session + +[Screen](https://www.gnu.org/software/screen/manual/screen.html#Overview) is a window manager available in Linux that allows you to create multiple interactive shells and swap between then. + +Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect. + +This allows you to start a task, such as downloading a data set, and check in on it asynchronously. + +Once you have started a screen session, you can create a new window with `ctrl-a c`, swap between windows with `ctrl-a 0-9` and exit screen (but keep any task running) with `ctrl-a d`. + +Using screen rather than a single download job can be helpful if downloading multiple data sets or if you intend to do some simple QC or tidying up before/after downloading. + +1. Start a screen session. + + ```bash + screen ``` -#### Example lightweight pod specification - -``` yaml -apiVersion: v1 -kind: Pod -metadata: - name: lightweight-pod -spec: - containers: - - name: data-loader - image: alpine/curl:latest - command: ["sleep", "infinity"] - resources: - requests: - cpu: 1 - memory: "1Gi" - limits: - cpu: 1 - memory: "1Gi" - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - volumes: - - name: volume - persistentVolumeClaim: - claimName: template-workflow-pvc -``` +1. Create an interactive lightweight job session. 
+ + ``` yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: lightweight-job + labels: + kueue.x-k8s.io/queue-name: + spec: + completions: 1 + parallelism: 1 + template: + metadata: + name: lightweight-pod + spec: + restartPolicy: Never + containers: + - name: data-loader + image: alpine/curl:latest + command: ['sleep','infinity'] + resources: + requests: + cpu: 1 + memory: "1Gi" + limits: + cpu: 1 + memory: "1Gi" + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: + claimName: template-workflow-pvc + ``` + +1. Download data set. Change the curl URL to your data set of interest. + + ``` bash + kubectl exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip + ``` + +1. Exit the remote session by either ending the session or `ctrl-a d`. + +1. Reconnect at a later time and reattach the screen window. + + ```bash + screen -list + + screen -r + ``` + +1. Check the download was successful and delete the job. + + ```bash + kubectl exec -- ls /mnt/ceph_rbd/ + + kubectl delete job lightweight-job + ``` + +1. Exit the screen session. + + ```bash + exit + ``` ## Preparing a custom Docker image -Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. Typical use cases require some custom modifications of a base image. +Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. -1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) +It does not provide functionality to build images and create pods from docker files. -1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image +However, use cases may require some custom modifications of a base image, such as adding a python library. + +These custom images need to be built locally (using docker) or online (using a GitHub/GitLab worker) and pushed to a repository such as Docker Hub. + +This is not an introduction to building docker images, please see the [Docker tutorial](https://docs.docker.com/get-started/) for a general overview. + +### Manually building a Docker image locally + +1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks). We'll use to base [RAPIDS image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/rapidsai/containers/base). + +1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image. ```txt FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10 RUN pip install pandas - RUN pip install scikit-learn + RUN pip install plotly ``` -1. Build the Docker container locally or on a VM (You will need to install [Docker](https://docs.docker.com/)) +1. Build the Docker container locally (You will need to install [Docker](https://docs.docker.com/)) ```bash - docker build + cd + + docker build . -t /template-docker-image:latest ``` + + > **NOTE** + > + > Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture. 
+ > If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. + +1. Create a repository to hold the image on [Docker Hub](https://hub.docker.com) (You will need to create and setup an account). -1. Push Docker image to Docker Hub (You will need to create and setup an account) +1. Push the Docker image to the repository. ```bash - docker push template-docker-image + docker push /template-docker-image:latest ``` + +1. Finally, specify your Docker image in the `image:` tag of the job specification yaml file. + + ```yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: template-workflow-job + labels: + kueue.x-k8s.io/queue-name: + spec: + completions: 1 + parallelism: 1 + template: + spec: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ["sleep", "infinity"] + resources: + requests: + cpu: 1 + memory: "4Gi" + limits: + cpu: 1 + memory: "8Gi" + ``` + +### Automatically building docker images using GitHub Actions + +In cases where the Docker image needs to be built and tested iteratively (i.e. to check for comparability issues), git version control and [GitHub Actions](https://github.com/features/actions) can simplify the build process. + +A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the dockerfile in a git repo. + +This process requires you to already have a [GitHub](https://github.com) and [Docker Hub](https://hub.docker.com) account. + +1. Create an [access token](https://docs.docker.com/security/for-developers/access-tokens/) on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo. + +1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. + +1. Add the dockerfile to a code/docker folder within an active GitHub repo. + +1. Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. + + ```yaml + name: ci + on: + push: + paths: + - 'code/docker/**' + + jobs: + docker: + runs-on: ubuntu-latest + steps: + - + name: Set up QEMU + uses: docker/setup-qemu-action@v3 + - + name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + - + name: Login to Docker Hub + uses: docker/login-action@v3 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + - + name: Build and push + uses: docker/build-push-action@v5 + with: + context: "{{defaultContext}}:code/docker" + push: true + tags: + ``` + +1. Push a change to the dockerfile and check the Docker Hub image is updated. ## Code development with K8s -A rapid development cycle from code writing to testing requires some initial setup within k8s. +Production code can be included within a Docker image to aid reproducibility as the specific software versions required to run the code are packaged together. -The first step is to automatically pull the latest code version before running any tests in a pod. +However, binding the code to the docker image during development can delay the testing cycle as re-downloading all of the software for every change in a code block can take time. 
-This allows development to be conducted on any device/VM with access to the repo (GitHub/GitLab) and testing to be completed on the cluster with just one `kubectl create` command. +If the docker image is consistent across tests, then it can be cached locally on the EIDFGPU Service instead of being re-downloaded (this occurs automatically although the cache is node specific and is not shared across nodes). -This allows custom code/models to be prototyped on the cluster, but typically within a standard base image. +A pod yaml file can be defined to automatically pull the latest code version before running any tests. -However, if the Docker container also needs to be developed then GitHub actions can be used to automatically build a new image and publish it to Docker Hub if any changes to a Dockerfile is detected. +Reducing the download time to fractions of a second allows rapid testing to be completed on the cluster with just the `kubectl create` command. -A template GitHub repo with sample code, k8s yaml files and github actions is available [here](https://github.com/DimmestP/template-EIDFGPU-workflow). +You must already have a [GitHub](https://github.com) account to follow this process. + +This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab). + +An alternative method for remote code development using the DevSpace toolkit is described is the next lesson, [Getting started with DevSpace](L5_devspace.md). + +A template GitHub repo with sample code, k8s yaml files and a Docker build Github Action is available [here](https://github.com/DimmestP/template-EIDFGPU-workflow). ### Create a job that downloads and runs the latest code version at runtime @@ -279,43 +470,3 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av ```bash kubectl create -f ``` - -### Setup GitHub actions to build and publish any changes to a Dockerfile - -1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. - -1. Add the Dockerfile to a code/docker folder within the active GitHub repo - -1. Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. 
- - ```yaml - name: ci - on: - push: - paths: - - 'code/docker/**' - - jobs: - docker: - runs-on: ubuntu-latest - steps: - - - name: Set up QEMU - uses: docker/setup-qemu-action@v3 - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Login to Docker Hub - uses: docker/login-action@v3 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - - name: Build and push - uses: docker/build-push-action@v5 - with: - context: "{{defaultContext}}:code/docker" - push: true - tags: - ``` From c361982b569772dab8a3156b13f3356d26725ebc Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 14 Feb 2024 16:45:14 +0000 Subject: [PATCH 10/19] Test all example code --- .../training/L4_template_workflow.md | 33 +++++++++++-------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 0b7768954..3064bd6d6 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -6,7 +6,7 @@ An example workflow for code development using K8s is outlined below. In theory, users can create docker images with all the code, software and data included to complete their analysis. -In practice, docker images with the required software alone can be several gigabytes in size and can be lead to unacceptable download times when ~100GB of data and code is included. +In practice, docker images with the required software alone can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is included. Therefore, it is recommended to separate code, software and data preparation into distinct steps: @@ -332,6 +332,8 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu kind: Job metadata: name: template-workflow-job + labels: + kueue.x-k8s.io/queue-name: spec: completions: 1 parallelism: 1 @@ -344,12 +346,11 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu command: ["sleep", "infinity"] resources: requests: - cpu: 10 - memory: "40Gi" + cpu: 1 + memory: "4Gi" limits: - cpu: 10 - memory: "80Gi" - nvidia.com/gpu: 1 + cpu: 1 + memory: "8Gi" volumeMounts: - mountPath: /mnt/ceph_rbd name: volume @@ -366,6 +367,8 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu kind: Job metadata: name: template-workflow-job + labels: + kueue.x-k8s.io/queue-name: spec: completions: 1 parallelism: 1 @@ -378,12 +381,11 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu command: ["sleep", "infinity"] resources: requests: - cpu: 10 - memory: "40Gi" + cpu: 1 + memory: "4Gi" limits: - cpu: 10 - memory: "80Gi" - nvidia.com/gpu: 1 + cpu: 1 + memory: "8Gi" volumeMounts: - mountPath: /mnt/ceph_rbd name: volume @@ -406,19 +408,22 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu volumes: - name: volume persistentVolumeClaim: - claimName: benchmark-imagenet-pvc + claimName: template-workflow-pvc - name: github-code emptyDir: sizeLimit: 1Gi ``` -1. Change the command argument in the main container to run the code once started. +1. Change the command argument in the main container to run the code once started. +Add the URL of the GitHub repo of interest to the `initContainers: command:` tag. 
```yaml apiVersion: batch/v1 kind: Job metadata: name: template-workflow-job + labels: + kueue.x-k8s.io/queue-name: spec: completions: 1 parallelism: 1 @@ -459,7 +464,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu volumes: - name: volume persistentVolumeClaim: - claimName: benchmark-imagenet-pvc + claimName: template-workflow-pvc - name: github-code emptyDir: sizeLimit: 1Gi From b76e683a4b9e8e8624fe0f6bd44c38c7b7770828 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 21 Mar 2024 13:36:30 +0000 Subject: [PATCH 11/19] Add -n to kubectl usage --- .../training/L4_template_workflow.md | 26 ++++++++++--------- 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 3064bd6d6..67fc33c87 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -6,9 +6,9 @@ An example workflow for code development using K8s is outlined below. In theory, users can create docker images with all the code, software and data included to complete their analysis. -In practice, docker images with the required software alone can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is included. +In practice, docker images with the required software can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is then added. -Therefore, it is recommended to separate code, software and data preparation into distinct steps: +Therefore, it is recommended to separate code, software, and data preparation into distinct steps: 1. Data Loading: Loading large data sets asynchronously. @@ -26,18 +26,20 @@ Some strategies in the workflow require a [GitHub](https://github.com) account a The EIDF GPU service contains GPUs with 40Gb/80Gb of on board memory and it is expected that data sets of > 100 Gb will be loaded onto the service to utilise this hardware. -Ensure persistent volume claims are of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO. +Persistent volume claims need to be of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO. Read the [requesting persistent volumes with Kubernetes](L2_requesting_persistent_volumes.md) lesson to learn how to request and mount persistent volumes to pods. -Downloading data sets of 1/2 TB or more to a persistent volume often takes several hours or days and needs to be completed asynchronously. +It often takes several hours or days to download data sets of 1/2 TB or more to a persistent volume. + +Therefore, the data download step needs to be completed asynchronously as maintaining a contention to the server for long periods of time can be unreliable. ### Asynchronous data downloading with a lightweight job 1. Check a PVC has been created. ``` bash - kubectl get pvc template-workflow-pvc + kubectl -n get pvc template-workflow-pvc ``` 1. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest. @@ -80,19 +82,19 @@ Downloading data sets of 1/2 TB or more to a persistent volume often takes sever 1. Run the data download job. ``` bash - kubectl create -f lightweight-pod.yaml + kubectl -n create -f lightweight-pod.yaml ``` 1. 
Check if the download has completed. ``` bash - kubectl get jobs + kubectl -n get jobs ``` 1. Delete lightweight job once completed. ``` bash - kubectl delete job lightweight-job + kubectl -n delete job lightweight-job ``` ### Asynchronous data downloading within a screen session @@ -153,7 +155,7 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Download data set. Change the curl URL to your data set of interest. ``` bash - kubectl exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip + kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip ``` 1. Exit the remote session by either ending the session or `ctrl-a d`. @@ -169,9 +171,9 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Check the download was successful and delete the job. ```bash - kubectl exec -- ls /mnt/ceph_rbd/ + kubectl -n exec -- ls /mnt/ceph_rbd/ - kubectl delete job lightweight-job + kubectl -n delete job lightweight-job ``` 1. Exit the screen session. @@ -473,5 +475,5 @@ Add the URL of the GitHub repo of interest to the `initContainers: command:` tag 1. Submit the yaml file to kubernetes ```bash - kubectl create -f + kubectl -n create -f ``` From 9df91f9c9340fc6a4b3d0597c2307b6639b92b84 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Mon, 25 Mar 2024 10:10:17 +0000 Subject: [PATCH 12/19] Simplify to project namespace --- .../gpuservice/training/L4_template_workflow.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 67fc33c87..bd5e069ee 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -39,7 +39,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta 1. Check a PVC has been created. ``` bash - kubectl -n get pvc template-workflow-pvc + kubectl -n get pvc template-workflow-pvc ``` 1. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest. @@ -82,19 +82,19 @@ Therefore, the data download step needs to be completed asynchronously as mainta 1. Run the data download job. ``` bash - kubectl -n create -f lightweight-pod.yaml + kubectl -n create -f lightweight-pod.yaml ``` 1. Check if the download has completed. ``` bash - kubectl -n get jobs + kubectl -n get jobs ``` 1. Delete lightweight job once completed. ``` bash - kubectl -n delete job lightweight-job + kubectl -n delete job lightweight-job ``` ### Asynchronous data downloading within a screen session @@ -155,7 +155,7 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Download data set. Change the curl URL to your data set of interest. ``` bash - kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip + kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip ``` 1. Exit the remote session by either ending the session or `ctrl-a d`. @@ -171,9 +171,9 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Check the download was successful and delete the job. 
```bash - kubectl -n exec -- ls /mnt/ceph_rbd/ + kubectl -n exec -- ls /mnt/ceph_rbd/ - kubectl -n delete job lightweight-job + kubectl -n delete job lightweight-job ``` 1. Exit the screen session. @@ -475,5 +475,5 @@ Add the URL of the GitHub repo of interest to the `initContainers: command:` tag 1. Submit the yaml file to kubernetes ```bash - kubectl -n create -f + kubectl -n create -f ``` From 9d9afd9cc734b710e2ce93a48e08586bd40775b8 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 17:04:38 +0000 Subject: [PATCH 13/19] Restore yaml file --- conda-requirements.yaml | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) create mode 100644 conda-requirements.yaml diff --git a/conda-requirements.yaml b/conda-requirements.yaml new file mode 100644 index 000000000..566edc658 --- /dev/null +++ b/conda-requirements.yaml @@ -0,0 +1,33 @@ +name: mkdocs +channels: + - conda-forge +dependencies: + - backports=1.1 + - cfgv=3.3.0 + - click=8.0.1 + - distlib=0.3.2 + - filelock=3.0.12 + - ghp-import=2.0.1 + - identify=2.2.11 + - importlib-metadata=4.6.1 + - Jinja2=3.0.1 + - Markdown=3.3.4 + - MarkupSafe=2.0.1 + - mergedeep=1.3.4 + - mkdocs=1.2.1 + - mkdocs-material=7.1.10 + - mkdocs-material-extensions=1.0.1 + - nodeenv=1.6.0 + - packaging=21.0 + - platformdirs=3.2 + - pre-commit=2.13.0 + - Pygments=2.9.0 + - pymdown-extensions=8.2 + - pyparsing=2.4.7 + - python-dateutil=2.8.1 + - PyYAML=5.4.1 + - pyyaml-env-tag=0.1 + - six=1.16.0 + - toml=0.10.2 + - watchdog=2.1.3 + - zipp=3.5.0 From f82513b1697661e6a7332b797fb394dd7e13387d Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 17:22:04 +0000 Subject: [PATCH 14/19] Restore L1 to previous version --- docs/services/gpuservice/training/L1_getting_started.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index 71fa10d1b..9ebd1bea7 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -237,11 +237,6 @@ The GPU resource requests can be made more specific by adding the type of GPU pr - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'` - `nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'` -> **WARNING** -> -> If you request a GPU but do not specify a GPU type you will be assigned one at random. -> Please check you are requesting a GPU with the correct memory and double check spelling. - ### Example yaml file ```yaml From cf0ca63c16d1e8f2d78bfb60fb73039792347fd0 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 17:35:20 +0000 Subject: [PATCH 15/19] Respond to Alistair comments --- .../training/L4_template_workflow.md | 42 ++++++++++--------- 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index bd5e069ee..b5a70eede 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -1,5 +1,9 @@ # Template workflow +## Requirements + + It is recommended that users complete [Getting started with Kubernetes](../L1_getting_started/#requirements) and [Requesting persistent volumes With Kubernetes](../L2_requesting_persistent_volumes/#requirements) before proceeding with this tutorial. + ## Overview An example workflow for code development using K8s is outlined below. 
@@ -50,7 +54,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta metadata: name: lightweight-job labels: - kueue.x-k8s.io/queue-name: + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 parallelism: 1 @@ -123,7 +127,7 @@ Using screen rather than a single download job can be helpful if downloading mul metadata: name: lightweight-job labels: - kueue.x-k8s.io/queue-name: + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 parallelism: 1 @@ -155,7 +159,7 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Download data set. Change the curl URL to your data set of interest. ``` bash - kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip + kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip ``` 1. Exit the remote session by either ending the session or `ctrl-a d`. @@ -171,7 +175,7 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Check the download was successful and delete the job. ```bash - kubectl -n exec -- ls /mnt/ceph_rbd/ + kubectl -n exec -- ls /mnt/ceph_rbd/ kubectl -n delete job lightweight-job ``` @@ -209,22 +213,22 @@ This is not an introduction to building docker images, please see the [Docker tu 1. Build the Docker container locally (You will need to install [Docker](https://docs.docker.com/)) ```bash - cd + cd - docker build . -t /template-docker-image:latest + docker build . -t /template-docker-image:latest ``` - > **NOTE** - > - > Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture. - > If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. +!!! important "Building images for different CPU architectures" + Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture. + + If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. 1. Create a repository to hold the image on [Docker Hub](https://hub.docker.com) (You will need to create and setup an account). 1. Push the Docker image to the repository. ```bash - docker push /template-docker-image:latest + docker push /template-docker-image:latest ``` 1. Finally, specify your Docker image in the `image:` tag of the job specification yaml file. 
@@ -235,7 +239,7 @@ This is not an introduction to building docker images, please see the [Docker tu
     metadata:
      name: template-workflow-job
      labels:
-      kueue.x-k8s.io/queue-name:
+      kueue.x-k8s.io/queue-name: -user-queue
     spec:
      completions: 1
      parallelism: 1
@@ -244,7 +248,7 @@ This is not an introduction to building docker images, please see the [Docker tu
        restartPolicy: Never
        containers:
        - name: template-docker-image
-         image: /template-docker-image:latest
+         image: /template-docker-image:latest
          command: ["sleep", "infinity"]
          resources:
           requests:
@@ -335,7 +339,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu
     metadata:
      name: template-workflow-job
      labels:
-      kueue.x-k8s.io/queue-name:
+      kueue.x-k8s.io/queue-name: -user-queue
     spec:
      completions: 1
      parallelism: 1
@@ -344,7 +348,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu
        restartPolicy: Never
        containers:
        - name: template-docker-image
-         image: /template-docker-image:latest
+         image: /template-docker-image:latest
          command: ["sleep", "infinity"]
          resources:
           requests:
@@ -370,7 +374,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu
     metadata:
      name: template-workflow-job
      labels:
-      kueue.x-k8s.io/queue-name:
+      kueue.x-k8s.io/queue-name: -user-queue
     spec:
      completions: 1
      parallelism: 1
@@ -379,7 +383,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu
        restartPolicy: Never
        containers:
        - name: template-docker-image
-         image: /template-docker-image:latest
+         image: /template-docker-image:latest
          command: ["sleep", "infinity"]
          resources:
           requests:
@@ -425,7 +429,7 @@ Add the URL of the GitHub repo of interest to the `initContainers: command:` tag
     metadata:
      name: template-workflow-job
      labels:
-      kueue.x-k8s.io/queue-name:
+      kueue.x-k8s.io/queue-name: -user-queue
     spec:
      completions: 1
      parallelism: 1
@@ -434,7 +438,7 @@ Add the URL of the GitHub repo of interest to the `initContainers: command:` tag
        restartPolicy: Never
        containers:
        - name: template-docker-image
-         image: /template-docker-image:latest
+         image: /template-docker-image:latest
          command: ['sh', '-c', "python3 /code/"]
          resources:
           requests:

From c6ae85863108310af144dd4a03d17f68a3a28de2 Mon Sep 17 00:00:00 2001
From: Sam Haynes
Date: Wed, 27 Mar 2024 17:40:16 +0000
Subject: [PATCH 16/19] Add workflow lesson to overview table

---
 docs/services/gpuservice/index.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md
index 99629d1b0..53bbc6949 100644
--- a/docs/services/gpuservice/index.md
+++ b/docs/services/gpuservice/index.md
@@ -87,6 +87,7 @@ This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it
 | [Getting started with Kubernetes](training/L1_getting_started.md) | a. What is Kubernetes? <br> b. How to send a task to a GPU node. <br> c. How to define the GPU resources needed. |
 | [Requesting persistent volumes with Kubernetes](training/L2_requesting_persistent_volumes.md) | a. What is a persistent volume? <br> b. How to request a PV resource. |
 | [Running a PyTorch task](training/L3_running_a_pytorch_task.md) | a. Accessing a Pytorch container. <br> b. Submitting a PyTorch task to the cluster. <br> c. Inspecting the results. |
+| [Template workflow](training/L4_template workflow.md) | a. Loading large data sets asynchronously. <br> b. Manually or automatically building Docker images. <br> c. Iteratively changing and testing code in a job. |

 ## Further Reading and Help

From 2da5e49d49070ea5bab0b2e804bfb7a8e0f29a90 Mon Sep 17 00:00:00 2001
From: Sam Haynes
Date: Wed, 27 Mar 2024 17:45:38 +0000
Subject: [PATCH 17/19] Fix typos

---
 docs/services/gpuservice/training/L4_template_workflow.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md
index b5a70eede..6362e1704 100644
--- a/docs/services/gpuservice/training/L4_template_workflow.md
+++ b/docs/services/gpuservice/training/L4_template_workflow.md
@@ -95,7 +95,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta
     kubectl -n get jobs
     ```

-1. Delete lightweight job once completed.
+1. Delete the lightweight job once completed.

     ``` bash
     kubectl -n delete job lightweight-job
@@ -200,7 +200,7 @@ This is not an introduction to building docker images, please see the [Docker tu
 ### Manually building a Docker image locally

-1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks). We'll use to base [RAPIDS image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/rapidsai/containers/base).
+1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks). We'll use the base [RAPIDS image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/rapidsai/containers/base).

 1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image.
@@ -263,7 +263,7 @@ This is not an introduction to building docker images, please see the [Docker tu
 In cases where the Docker image needs to be built and tested iteratively (i.e. to check for comparability issues), git version control and [GitHub Actions](https://github.com/features/actions) can simplify the build process.

-A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the dockerfile in a git repo.
+A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the docker file in a git repo.

 This process requires you to already have a [GitHub](https://github.com) and [Docker Hub](https://hub.docker.com) account.

From e5d7cfc481cc94322a835a495ec5d6aeab89f4b8 Mon Sep 17 00:00:00 2001
From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com>
Date: Tue, 9 Apr 2024 15:24:08 +0100
Subject: [PATCH 18/19] Removed reference to devspace and new line in bullet point.

---
 docs/services/gpuservice/training/L4_template_workflow.md | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md
index 6362e1704..73348097f 100644
--- a/docs/services/gpuservice/training/L4_template_workflow.md
+++ b/docs/services/gpuservice/training/L4_template_workflow.md
@@ -325,8 +325,6 @@ You must already have a [GitHub](https://github.com) account to follow this proc

 This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab).

-An alternative method for remote code development using the DevSpace toolkit is described is the next lesson, [Getting started with DevSpace](L5_devspace.md).
-
 A template GitHub repo with sample code, k8s yaml files and a Docker build Github Action is available [here](https://github.com/DimmestP/template-EIDFGPU-workflow).

 ### Create a job that downloads and runs the latest code version at runtime
@@ -420,8 +418,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu
         sizeLimit: 1Gi
     ```

-1. Change the command argument in the main container to run the code once started.
-Add the URL of the GitHub repo of interest to the `initContainers: command:` tag.
+1. Change the command argument in the main container to run the code once started. Add the URL of the GitHub repo of interest to the `initContainers: command:` tag.

     ```yaml
     apiVersion: batch/v1

From d9a80c84f0452dd43cb87e6ffdcfa508b6e30f5e Mon Sep 17 00:00:00 2001
From: Sam Haynes
Date: Tue, 9 Apr 2024 15:42:59 +0100
Subject: [PATCH 19/19] Fixed whitespaces and incorrect link to lesson 4 in summary table

---
 docs/access/virtualmachines-vdi.md | 1 -
 docs/services/gpuservice/index.md | 2 +-
 .../training/L4_template_workflow.md | 36 +++++++++----------
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/docs/access/virtualmachines-vdi.md b/docs/access/virtualmachines-vdi.md
index abc7a18a2..390e4007b 100644
--- a/docs/access/virtualmachines-vdi.md
+++ b/docs/access/virtualmachines-vdi.md
@@ -85,4 +85,3 @@ For users who do not have standard `English (UK)` keyboard layouts, key presses
 are transmitted to your VM. Please contact the EIDF helpdesk at
 [eidf@epcc.ed.ac.uk](mailto:eidf@epcc.ed.ac.uk) if you are experiencing difficulties with your
 keyboard mapping, and we will help to resolve this by changing some settings in the Guacamole
 VDI connection configuration.
-

diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md
index 53bbc6949..bca3f0dea 100644
--- a/docs/services/gpuservice/index.md
+++ b/docs/services/gpuservice/index.md
@@ -87,7 +87,7 @@ This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it
 | [Getting started with Kubernetes](training/L1_getting_started.md) | a. What is Kubernetes? <br> b. How to send a task to a GPU node. <br> c. How to define the GPU resources needed. |
 | [Requesting persistent volumes with Kubernetes](training/L2_requesting_persistent_volumes.md) | a. What is a persistent volume? <br> b. How to request a PV resource. |
 | [Running a PyTorch task](training/L3_running_a_pytorch_task.md) | a. Accessing a Pytorch container. <br> b. Submitting a PyTorch task to the cluster. <br> c. Inspecting the results. |
-| [Template workflow](training/L4_template workflow.md) | a. Loading large data sets asynchronously. <br> b. Manually or automatically building Docker images. <br> c. Iteratively changing and testing code in a job. |
+| [Template workflow](training/L4_template_workflow.md) | a. Loading large data sets asynchronously. <br> b. Manually or automatically building Docker images. <br> c. Iteratively changing and testing code in a job. |

 ## Further Reading and Help

diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md
index 73348097f..8c410c839 100644
--- a/docs/services/gpuservice/training/L4_template_workflow.md
+++ b/docs/services/gpuservice/training/L4_template_workflow.md
@@ -3,7 +3,7 @@
 ## Requirements

 It is recommended that users complete [Getting started with Kubernetes](../L1_getting_started/#requirements) and [Requesting persistent volumes With Kubernetes](../L2_requesting_persistent_volumes/#requirements) before proceeding with this tutorial.
-
+
 ## Overview

 An example workflow for code development using K8s is outlined below.
@@ -20,7 +20,7 @@ Therefore, it is recommended to separate code, software, and data preparation in
 1. Code development with K8s: Iteratively changing and testing code in a job.

-The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service.
+The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service.

 The three stages are interchangeable and may not be relevant to every project.
@@ -45,7 +45,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta
     ``` bash
     kubectl -n get pvc template-workflow-pvc
     ```
-
+
 1. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest.

     ``` yaml
     apiVersion: batch/v1
@@ -105,7 +105,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta
 [Screen](https://www.gnu.org/software/screen/manual/screen.html#Overview) is a window manager available in Linux that allows you to create multiple interactive shells and swap between then.

-Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect.
+Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect.

 This allows you to start a task, such as downloading a data set, and check in on it asynchronously.
@@ -128,7 +128,7 @@ Using screen rather than a single download job can be helpful if downloading mul
      name: lightweight-job
      labels:
       kueue.x-k8s.io/queue-name: -user-queue
     spec:
      completions: 1
      parallelism: 1
@@ -162,13 +162,13 @@ Using screen rather than a single download job can be helpful if downloading mul
     kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip
     ```

-1. Exit the remote session by either ending the session or `ctrl-a d`.
+1. Exit the remote session by either ending the session or `ctrl-a d`.

 1. Reconnect at a later time and reattach the screen window.
- + ```bash screen -list - + screen -r ``` @@ -176,7 +176,7 @@ Using screen rather than a single download job can be helpful if downloading mul ```bash kubectl -n exec -- ls /mnt/ceph_rbd/ - + kubectl -n delete job lightweight-job ``` @@ -188,7 +188,7 @@ Using screen rather than a single download job can be helpful if downloading mul ## Preparing a custom Docker image -Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. +Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. It does not provide functionality to build images and create pods from docker files. @@ -214,15 +214,15 @@ This is not an introduction to building docker images, please see the [Docker tu ```bash cd - + docker build . -t /template-docker-image:latest ``` - + !!! important "Building images for different CPU architectures" Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture. - If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. - + If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. + 1. Create a repository to hold the image on [Docker Hub](https://hub.docker.com) (You will need to create and setup an account). 1. Push the Docker image to the repository. @@ -230,7 +230,7 @@ This is not an introduction to building docker images, please see the [Docker tu ```bash docker push /template-docker-image:latest ``` - + 1. Finally, specify your Docker image in the `image:` tag of the job specification yaml file. ```yaml @@ -258,7 +258,7 @@ This is not an introduction to building docker images, please see the [Docker tu cpu: 1 memory: "8Gi" ``` - + ### Automatically building docker images using GitHub Actions In cases where the Docker image needs to be built and tested iteratively (i.e. to check for comparability issues), git version control and [GitHub Actions](https://github.com/features/actions) can simplify the build process. @@ -267,7 +267,7 @@ A GitHub action can build and push a Docker image to Docker Hub whenever it dete This process requires you to already have a [GitHub](https://github.com) and [Docker Hub](https://hub.docker.com) account. -1. Create an [access token](https://docs.docker.com/security/for-developers/access-tokens/) on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo. +1. Create an [access token](https://docs.docker.com/security/for-developers/access-tokens/) on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo. 1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token.
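> Editor's note: as a sketch of how the two GitHub secrets mentioned above could be created from the command line, assuming the GitHub CLI (`gh`) is installed and authenticated against the repository. The secret names below are illustrative placeholders; whatever names are chosen must match what the workflow file references:

```bash
# Store the Docker Hub username and access token as repository secrets
gh secret set DOCKERHUB_USERNAME --body "your-dockerhub-username"
gh secret set DOCKERHUB_TOKEN --body "your-dockerhub-access-token"

# Confirm the secrets were created
gh secret list
```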