diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md
index 7dde82aaf..bca3f0dea 100644
--- a/docs/services/gpuservice/index.md
+++ b/docs/services/gpuservice/index.md
@@ -33,6 +33,11 @@ The current full specification of the EIDF GPU Service as of 14 February 2024:
Changes to the default quota must be discussed and agreed with the EIDF Services team.
+> **NOTE**
+>
+> If you request a GPU on the EIDF GPU Service, you will be assigned one at random unless you specify a GPU type.
+> Please see [Getting started with Kubernetes](training/L1_getting_started.md) to learn about specifying GPU resources.
+
## Service Access
Users should have an [EIDF Account](../../access/project.md).
@@ -82,6 +87,7 @@ This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it
| [Getting started with Kubernetes](training/L1_getting_started.md) | a. What is Kubernetes? <br> b. How to send a task to a GPU node. <br> c. How to define the GPU resources needed. |
| [Requesting persistent volumes with Kubernetes](training/L2_requesting_persistent_volumes.md) | a. What is a persistent volume? <br> b. How to request a PV resource. |
| [Running a PyTorch task](training/L3_running_a_pytorch_task.md) | a. Accessing a Pytorch container. <br> b. Submitting a PyTorch task to the cluster. <br> c. Inspecting the results. |
+| [Template workflow](training/L4_template_workflow.md) | a. Loading large data sets asynchronously. <br> b. Manually or automatically building Docker images. <br> c. Iteratively changing and testing code in a job. |
## Further Reading and Help
diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md
index 2114bfda7..8c410c839 100644
--- a/docs/services/gpuservice/training/L4_template_workflow.md
+++ b/docs/services/gpuservice/training/L4_template_workflow.md
@@ -1 +1,480 @@
# Template workflow
+
+## Requirements
+
+It is recommended that users complete [Getting started with Kubernetes](../L1_getting_started/#requirements) and [Requesting persistent volumes with Kubernetes](../L2_requesting_persistent_volumes/#requirements) before proceeding with this tutorial.
+
+## Overview
+
+An example workflow for code development using K8s is outlined below.
+
+In theory, users can create Docker images with all of the code, software and data included to complete their analysis.
+
+In practice, Docker images with the required software can be several gigabytes in size, which can lead to unacceptable download times when ~100 GB of data and code is then added.
+
+Therefore, it is recommended to separate code, software, and data preparation into distinct steps:
+
+1. Data Loading: Loading large data sets asynchronously.
+
+1. Developing a Docker environment: Manually or automatically building Docker images.
+
+1. Code development with K8s: Iteratively changing and testing code in a job.
+
+The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service.
+
+The three stages are independent of one another and may not all be relevant to every project.
+
+Some strategies in the workflow require a [GitHub](https://github.com) account and [Docker Hub](https://hub.docker.com/) account for automatic building (this can be adapted for other platforms such as GitLab).
+
+## Data loading
+
+The EIDF GPU Service contains GPUs with 40 GB/80 GB of on-board memory and it is expected that data sets of >100 GB will be loaded onto the service to utilise this hardware.
+
+Persistent volume claims need to be of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO.
+
+Read the [requesting persistent volumes with Kubernetes](L2_requesting_persistent_volumes.md) lesson to learn how to request and mount persistent volumes to pods.
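+
+For example, a claim named `template-workflow-pvc` (the name used throughout this tutorial) could be requested as below; the 200Gi size is only illustrative and the storage class is left to the cluster default, so adjust both to match your project and replace `<project-namespace>` with your project namespace.
+
+```bash
+kubectl -n <project-namespace> apply -f - <<EOF
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: template-workflow-pvc
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 200Gi
+EOF
+```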
+
+It often takes several hours or days to download data sets of 1/2 TB or more to a persistent volume.
+
+Therefore, the data download step needs to be completed asynchronously, as maintaining a connection to the server for long periods of time can be unreliable.
+
+### Asynchronous data downloading with a lightweight job
+
+1. Check a PVC has been created.
+
+ ``` bash
+ kubectl -n get pvc template-workflow-pvc
+ ```
+
+1. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest.
+
+ ``` yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+ name: lightweight-job
+ labels:
+ kueue.x-k8s.io/queue-name: -user-queue
+ spec:
+ completions: 1
+ parallelism: 1
+ template:
+ metadata:
+ name: lightweight-job
+ spec:
+ restartPolicy: Never
+ containers:
+ - name: data-loader
+ image: alpine/curl:latest
+ command: ['sh', '-c', "cd /mnt/ceph_rbd; curl https://archive.ics.uci.edu/static/public/53/iris.zip -o iris.zip"]
+ resources:
+ requests:
+ cpu: 1
+ memory: "1Gi"
+ limits:
+ cpu: 1
+ memory: "1Gi"
+ volumeMounts:
+ - mountPath: /mnt/ceph_rbd
+ name: volume
+ volumes:
+ - name: volume
+ persistentVolumeClaim:
+ claimName: template-workflow-pvc
+ ```
+
+1. Run the data download job.
+
+ ``` bash
+ kubectl -n create -f lightweight-pod.yaml
+ ```
+
+1. Check if the download has completed.
+
+ ``` bash
+ kubectl -n get jobs
+ ```
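+
+    To see the `curl` output itself, the job's logs can also be inspected (replace `<project-namespace>` with your project namespace):
+
+    ``` bash
+    kubectl -n <project-namespace> logs job/lightweight-job
+    ```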
+
+1. Delete the lightweight job once completed.
+
+ ``` bash
+ kubectl -n delete job lightweight-job
+ ```
+
+### Asynchronous data downloading within a screen session
+
+[Screen](https://www.gnu.org/software/screen/manual/screen.html#Overview) is a window manager available in Linux that allows you to create multiple interactive shells and swap between them.
+
+Screen has the added benefit that if your remote session is interrupted, the screen session persists and can be reattached when you manage to reconnect.
+
+This allows you to start a task, such as downloading a data set, and check in on it asynchronously.
+
+Once you have started a screen session, you can create a new window with `ctrl-a c`, swap between windows with `ctrl-a 0-9` and detach from screen (leaving any tasks running) with `ctrl-a d`.
+
+Using screen rather than a single download job can be helpful if downloading multiple data sets or if you intend to do some simple QC or tidying up before/after downloading.
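+
+For example, a named session is easier to find and reattach to later (the name `data-download` below is only an example):
+
+```bash
+# start a named screen session
+screen -S data-download
+
+# list running sessions after detaching
+screen -ls
+
+# reattach to the named session
+screen -r data-download
+```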
+
+1. Start a screen session.
+
+ ```bash
+ screen
+ ```
+
+1. Create an interactive lightweight job session by writing the yaml below, then submitting it as shown after the yaml.
+
+ ``` yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+ name: lightweight-job
+ labels:
+ kueue.x-k8s.io/queue-name: -user-queue
+ spec:
+ completions: 1
+ parallelism: 1
+ template:
+ metadata:
+ name: lightweight-pod
+ spec:
+ restartPolicy: Never
+ containers:
+ - name: data-loader
+ image: alpine/curl:latest
+ command: ['sleep','infinity']
+ resources:
+ requests:
+ cpu: 1
+ memory: "1Gi"
+ limits:
+ cpu: 1
+ memory: "1Gi"
+ volumeMounts:
+ - mountPath: /mnt/ceph_rbd
+ name: volume
+ volumes:
+ - name: volume
+ persistentVolumeClaim:
+ claimName: template-workflow-pvc
+ ```
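+
+    Submit the job and find the name of the pod it creates so that you can exec into it; the file name `lightweight-job.yaml` and `<project-namespace>` below are placeholders for your own file name and project namespace:
+
+    ``` bash
+    kubectl -n <project-namespace> create -f lightweight-job.yaml
+
+    kubectl -n <project-namespace> get pods
+    ```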
+
+1. Download the data set. Change the curl URL to your data set of interest.
+
+ ``` bash
+ kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip
+ ```
+
+1. Detach from the screen window with `ctrl-a d`, or simply end your remote session; the download will continue in the background.
+
+1. Reconnect at a later time and reattach the screen window.
+
+ ```bash
+ screen -list
+
+ screen -r
+ ```
+
+1. Check the download was successful and delete the job.
+
+ ```bash
+ kubectl -n exec -- ls /mnt/ceph_rbd/
+
+ kubectl -n delete job lightweight-job
+ ```
+
+1. Exit the screen session.
+
+ ```bash
+ exit
+ ```
+
+## Preparing a custom Docker image
+
+Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub.
+
+It does not provide functionality to build images and create pods directly from Dockerfiles.
+
+However, use cases may require some custom modifications of a base image, such as adding a Python library.
+
+These custom images need to be built locally (using Docker) or online (using a GitHub/GitLab worker) and pushed to a repository such as Docker Hub.
+
+This is not an introduction to building Docker images; please see the [Docker tutorial](https://docs.docker.com/get-started/) for a general overview.
+
+### Manually building a Docker image locally
+
+1. Select a suitable base image (the [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU-accelerated tasks). We'll use the base [RAPIDS image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/rapidsai/containers/base).
+
+1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image.
+
+ ```txt
+ FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10
+ RUN pip install pandas
+ RUN pip install plotly
+ ```
+
+1. Build the Docker image locally (you will need to install [Docker](https://docs.docker.com/)).
+
+ ```bash
+ cd
+
+ docker build . -t /template-docker-image:latest
+ ```
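+
+    A quick, optional sanity check that the extra libraries were installed is to run a throwaway container from the new image (this assumes `python3` is available on the image's default path, as it is in the RAPIDS base image):
+
+    ```bash
+    docker run --rm /template-docker-image:latest python3 -c "import pandas, plotly; print('ok')"
+    ```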
+
+!!! important "Building images for different CPU architectures"
+    Be aware that Docker images built for Apple ARM64 architectures will not function optimally on the EIDF GPU Service's AMD64-based architecture.
+
+    If building Docker images locally on an Apple device, you must tell the Docker daemon to use AMD64-based images by passing the `--platform linux/amd64` flag to the build command.
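+
+    For example, the build command from the previous step becomes:
+
+    ```bash
+    docker build --platform linux/amd64 . -t /template-docker-image:latest
+    ```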
+
+1. Create a repository to hold the image on [Docker Hub](https://hub.docker.com) (you will need to create and set up an account).
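+
+    If you have not already done so, log the local Docker client in to Docker Hub before pushing (you will be prompted for your username and password or access token):
+
+    ```bash
+    docker login
+    ```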
+
+1. Push the Docker image to the repository.
+
+ ```bash
+ docker push /template-docker-image:latest
+ ```
+
+1. Finally, specify your Docker image in the `image:` tag of the job specification yaml file.
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+ name: template-workflow-job
+ labels:
+ kueue.x-k8s.io/queue-name: -user-queue
+ spec:
+ completions: 1
+ parallelism: 1
+ template:
+ spec:
+ restartPolicy: Never
+ containers:
+ - name: template-docker-image
+ image: /template-docker-image:latest
+ command: ["sleep", "infinity"]
+ resources:
+ requests:
+ cpu: 1
+ memory: "4Gi"
+ limits:
+ cpu: 1
+ memory: "8Gi"
+ ```
+
+### Automatically building docker images using GitHub Actions
+
+In cases where the Docker image needs to be built and tested iteratively (e.g. to check for compatibility issues), git version control and [GitHub Actions](https://github.com/features/actions) can simplify the build process.
+
+A GitHub Action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the Dockerfile in a git repo.
+
+This process requires you to already have a [GitHub](https://github.com) and [Docker Hub](https://hub.docker.com) account.
+
+1. Create an [access token](https://docs.docker.com/security/for-developers/access-tokens/) on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo.
+
+1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token.
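+
+    The secrets can be added through the repository settings page on GitHub or, if you have the [GitHub CLI](https://cli.github.com/) installed, from the command line; the secret names must match those referenced in the workflow file below:
+
+    ```bash
+    gh secret set DOCKERHUB_USERNAME
+
+    gh secret set DOCKERHUB_TOKEN
+    ```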
+
+1. Add the Dockerfile to a `code/docker` folder within an active GitHub repo.
+
+1. Add the GitHub Action yaml file below to the `.github/workflows` folder to automatically push a new image to Docker Hub if any changes to files in the `code/docker` folder are detected.
+
+ ```yaml
+ name: ci
+ on:
+ push:
+ paths:
+ - 'code/docker/**'
+
+ jobs:
+ docker:
+ runs-on: ubuntu-latest
+ steps:
+ -
+ name: Set up QEMU
+ uses: docker/setup-qemu-action@v3
+ -
+ name: Set up Docker Buildx
+ uses: docker/setup-buildx-action@v3
+ -
+ name: Login to Docker Hub
+ uses: docker/login-action@v3
+ with:
+ username: ${{ secrets.DOCKERHUB_USERNAME }}
+ password: ${{ secrets.DOCKERHUB_TOKEN }}
+ -
+ name: Build and push
+ uses: docker/build-push-action@v5
+ with:
+ context: "{{defaultContext}}:code/docker"
+ push: true
+ tags:
+ ```
+
+1. Push a change to the Dockerfile and check that the Docker Hub image is updated.
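+
+    For example, committing and pushing an edit should trigger the workflow (assuming the Dockerfile lives at `code/docker/Dockerfile`):
+
+    ```bash
+    git add code/docker/Dockerfile
+
+    git commit -m "Update Dockerfile"
+
+    git push
+    ```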
+
+## Code development with K8s
+
+Production code can be included within a Docker image to aid reproducibility as the specific software versions required to run the code are packaged together.
+
+However, binding the code to the Docker image during development can delay the testing cycle, as re-downloading all of the software for every change in a code block can take time.
+
+If the Docker image is consistent across tests, then it can be cached locally on the EIDF GPU Service instead of being re-downloaded (this occurs automatically, although the cache is node specific and is not shared across nodes).
+
+A job yaml file can be defined to automatically pull the latest code version before running any tests.
+
+Reducing the download time to fractions of a second allows rapid testing to be completed on the cluster with just the `kubectl create` command.
+
+You must already have a [GitHub](https://github.com) account to follow this process.
+
+This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab).
+
+A template GitHub repo with sample code, K8s yaml files and a Docker build GitHub Action is available [here](https://github.com/DimmestP/template-EIDFGPU-workflow).
+
+### Create a job that downloads and runs the latest code version at runtime
+
+1. Write a standard yaml file for a K8s job with the required resources and custom Docker image (example below).
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+ name: template-workflow-job
+ labels:
+ kueue.x-k8s.io/queue-name: -user-queue
+ spec:
+ completions: 1
+ parallelism: 1
+ template:
+ spec:
+ restartPolicy: Never
+ containers:
+ - name: template-docker-image
+ image: /template-docker-image:latest
+ command: ["sleep", "infinity"]
+ resources:
+ requests:
+ cpu: 1
+ memory: "4Gi"
+ limits:
+ cpu: 1
+ memory: "8Gi"
+ volumeMounts:
+ - mountPath: /mnt/ceph_rbd
+ name: volume
+ volumes:
+ - name: volume
+ persistentVolumeClaim:
+ claimName: template-workflow-pvc
+ ```
+
+1. Add an init container that runs before the main container to download the latest version of the code.
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+ name: template-workflow-job
+ labels:
+ kueue.x-k8s.io/queue-name: -user-queue
+ spec:
+ completions: 1
+ parallelism: 1
+ template:
+ spec:
+ restartPolicy: Never
+ containers:
+ - name: template-docker-image
+ image: /template-docker-image:latest
+ command: ["sleep", "infinity"]
+ resources:
+ requests:
+ cpu: 1
+ memory: "4Gi"
+ limits:
+ cpu: 1
+ memory: "8Gi"
+ volumeMounts:
+ - mountPath: /mnt/ceph_rbd
+ name: volume
+ - mountPath: /code
+ name: github-code
+ initContainers:
+ - name: lightweight-git-container
+ image: cicirello/alpine-plus-plus
+ command: ['sh', '-c', "cd /code; git clone "]
+ resources:
+ requests:
+ cpu: 1
+ memory: "4Gi"
+ limits:
+ cpu: 1
+ memory: "8Gi"
+ volumeMounts:
+ - mountPath: /code
+ name: github-code
+ volumes:
+ - name: volume
+ persistentVolumeClaim:
+ claimName: template-workflow-pvc
+ - name: github-code
+ emptyDir:
+ sizeLimit: 1Gi
+ ```
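+
+    If the job's pod sits in an `Init` state for longer than expected, the init container's logs can be checked on the pod the job creates (`<pod-name>` and `<project-namespace>` below are placeholders):
+
+    ```bash
+    kubectl -n <project-namespace> get pods
+
+    kubectl -n <project-namespace> logs <pod-name> -c lightweight-git-container
+    ```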
+
+1. Change the command argument in the main container to run the code once started. Add the URL of the GitHub repo of interest to the `initContainers: command:` tag.
+
+ ```yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+ name: template-workflow-job
+ labels:
+ kueue.x-k8s.io/queue-name: -user-queue
+ spec:
+ completions: 1
+ parallelism: 1
+ template:
+ spec:
+ restartPolicy: Never
+ containers:
+ - name: template-docker-image
+ image: /template-docker-image:latest
+ command: ['sh', '-c', "python3 /code/"]
+ resources:
+ requests:
+ cpu: 10
+ memory: "40Gi"
+ limits:
+ cpu: 10
+ memory: "80Gi"
+ nvidia.com/gpu: 1
+ volumeMounts:
+ - mountPath: /mnt/ceph_rbd
+ name: volume
+ - mountPath: /code
+ name: github-code
+ initContainers:
+ - name: lightweight-git-container
+ image: cicirello/alpine-plus-plus
+ command: ['sh', '-c', "cd /code; git clone "]
+ resources:
+ requests:
+ cpu: 1
+ memory: "4Gi"
+ limits:
+ cpu: 1
+ memory: "8Gi"
+ volumeMounts:
+ - mountPath: /code
+ name: github-code
+ volumes:
+ - name: volume
+ persistentVolumeClaim:
+ claimName: template-workflow-pvc
+ - name: github-code
+ emptyDir:
+ sizeLimit: 1Gi
+ ```
+
+1. Submit the yaml file to Kubernetes.
+
+ ```bash
+ kubectl -n create -f
+ ```
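+
+1. Check the job has been admitted and read its output once the pod has completed (`<project-namespace>` is a placeholder and the job name matches the yaml above):
+
+    ```bash
+    kubectl -n <project-namespace> get jobs
+
+    kubectl -n <project-namespace> logs job/template-workflow-job
+    ```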
diff --git a/mkdocs.yml b/mkdocs.yml
index fb602f696..b2837cbb7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -68,6 +68,7 @@ nav:
- "Getting Started": services/gpuservice/training/L1_getting_started.md
- "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md
- "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md
+ - "Template K8s Workflow": services/gpuservice/training/L4_template_workflow.md
- "GPU Service FAQ": services/gpuservice/faq.md
- "Graphcore Bow Pod64":
- "Overview": services/graphcore/index.md