Merge pull request #111 from josephleekl/graphcore-docs

JL: added graphcore service documentation
EPCCed · Nov 17, 2023 · babbdcc · babbdcc
2 parents 33bb6d6 + 076300c
commit babbdcc

Showing 8 changed files with 732 additions and 0 deletions.
diff --git a/docs/services/graphcore/faq.md b/docs/services/graphcore/faq.md
@@ -0,0 +1,15 @@
+# Graphcore FAQ
+
+## Graphcore Questions
+
+### How do I delete a running/terminated pod?
+
+An `IPUJob` manages its launcher and worker `pods`, so deleting the `IPUJob` with `kubectl delete ipujobs <IPUJob-name>` also deletes its pods. If only the `pod` is deleted via `kubectl delete pod`, the `IPUJob` may respawn it.
+
+To see running or terminated `IPUJobs`, run `kubectl get ipujobs`.
+
+### My IPUJob died with a message: `'poptorch_cpp_error': Failed to acquire X IPU(s)`. Why?
+
+This error may appear when the IPUJob name is too long.
+
+We have observed this error for IPUJobs whose `metadata:name` is longer than 36 characters. A solution is to shorten the name to under 36 characters.
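A name-length check can be run before submitting the job. The sketch below is plain bash; the function name `check_ipujob_name` is hypothetical, and the 36-character threshold comes from the observation above rather than from a documented limit. Names of exactly 36 characters pass here, since the error was reported for names over 36; tighten the comparison to follow the "under 36" advice strictly.

```bash
#!/usr/bin/env bash
# Hypothetical helper: flag IPUJob names longer than 36 characters,
# the length above which the "Failed to acquire" error was observed.
MAX_IPUJOB_NAME_LEN=36

check_ipujob_name() {
  local name="$1"
  if [ "${#name}" -gt "$MAX_IPUJOB_NAME_LEN" ]; then
    echo "too long: ${#name} characters (limit ${MAX_IPUJOB_NAME_LEN})"
    return 1
  fi
  echo "ok: ${#name} characters"
}
```

For example, `check_ipujob_name mnist-training` prints `ok: 14 characters`.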
diff --git a/docs/services/graphcore/index.md b/docs/services/graphcore/index.md
@@ -0,0 +1,37 @@
+# Overview
+
+EIDF hosts a Graphcore Bow Pod64 system for AI acceleration.
+
+The specification of the Bow Pod64 is:
+
+- 16x Bow-2000 machines
+- 64x Bow IPUs (4 IPUs per Bow-2000)
+- 94,208 IPU cores (1472 cores per IPU)
+- 57.6GB of In-Processor-Memory (0.9GB per IPU)
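The totals in this list follow from the per-unit figures. A small shell check of the arithmetic, using only the numbers listed above (the variable names are illustrative):

```bash
#!/usr/bin/env bash
# Recompute the Bow Pod64 totals from the per-unit figures in the list above.
machines=16
ipus_per_machine=4
cores_per_ipu=1472

total_ipus=$((machines * ipus_per_machine))
total_cores=$((total_ipus * cores_per_ipu))
# Bash arithmetic is integer-only, so awk handles the 0.9GB per-IPU memory.
total_mem_gb=$(awk -v n="$total_ipus" 'BEGIN { printf "%.1f", n * 0.9 }')

echo "IPUs: ${total_ipus}"
echo "IPU cores: ${total_cores}"
echo "In-Processor-Memory: ${total_mem_gb}GB"
```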
+
+For more details about the IPU architecture, see [documentation from Graphcore](https://docs.graphcore.ai/projects/ipu-programmers-guide/en/latest/about_ipu.html#).
+
+The smallest unit of compute resource that can be requested is a single IPU.
+
+As with the EIDF GPU Service, access to the Graphcore system is managed using [Kubernetes](https://kubernetes.io).
+
+## Service Access
+
+## Project Quotas
+
+## Graphcore Tutorial
+
+The following tutorial teaches users how to submit tasks to the Graphcore system. It assumes basic familiarity with submitting jobs via Kubernetes. For a tutorial on using Kubernetes, see the [GPU service tutorial](../gpuservice/training/L1_getting_started.md). For more in-depth lessons about developing applications for Graphcore, see [the general documentation](https://docs.graphcore.ai/en/latest/) and the [guide for creating IPU jobs via Kubernetes](https://docs.graphcore.ai/projects/kubernetes-user-guide/en/latest/creating-ipujob.html).
+
+| Lesson | Objective |
+|--------|-----------|
+| [Getting started with IPU jobs](training/L1_getting_started.md) | a. How to send an IPUJob.<br>b. Monitoring and cancelling your IPUJob. |
+| [Multi-IPU Jobs](training/L2_multiple_IPU.md) | a. Using multiple IPUs for distributed training. |
+| [Profiling with PopVision](training/L3_profiling.md) | a. Enabling profiling in your code.<br>b. Downloading the profile reports. |
+| [Other Frameworks](training/L4_other_frameworks.md) | a. Using TensorFlow and PopART.<br>b. Writing IPU programs with PopLibs (C++). |
+
+## Further Reading and Help
+
+- The [Graphcore documentation](https://docs.graphcore.ai/en/latest/) provides information about using the Graphcore system.
+
+- The [Graphcore examples repository on GitHub](https://github.com/graphcore/examples/tree/master) provides a catalogue of application examples that have been optimised to run on Graphcore IPUs for both training and inference. It also contains tutorials for using various frameworks.
diff --git a/docs/services/graphcore/training/L1_getting_started.md b/docs/services/graphcore/training/L1_getting_started.md
@@ -0,0 +1,125 @@
+# Getting started with Graphcore IPU Jobs
+
+This guide assumes basic familiarity with Kubernetes (K8s) and usage of `kubectl`. See the [GPU service tutorial](../../gpuservice/training/L1_getting_started.md) to get started.
+
+## Introduction
+
+Graphcore provides prebuilt Docker containers (full list [here](https://hub.docker.com/u/graphcore)) which contain the required libraries (PyTorch, TensorFlow, Poplar, etc.) and can be used directly within K8s to run on the Graphcore IPUs.
+
+In this tutorial we will cover running training with a single IPU. The subsequent tutorial will cover using multiple IPUs, which can be used for distributed training jobs.
+
+## Creating your first IPU job
+
+For our first IPU job, we will be using the Graphcore PyTorch (PopTorch) container image (`graphcore/pytorch:3.3.0`) to run a simple example of training a neural network for classification on the MNIST dataset, which is provided [here](https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/pytorch/mnist). More applications can be found in the repository <https://github.com/graphcore/examples>.
+
+To get started:
+
+1. to specify the job - create the file `mnist-training-ipujob.yaml`, then copy and save the following content into the file:
+
+    ``` yaml
+    apiVersion: graphcore.ai/v1alpha1
+    kind: IPUJob
+    metadata:
+      name: mnist-training
+    spec:
+      # jobInstances defines the number of job instances.
+      # More than 1 job instance is usually useful for inference jobs only.
+      jobInstances: 1
+      # ipusPerJobInstance refers to the number of IPUs required per job instance.
+      # A separate IPU partition of this size will be created by the IPU Operator
+      # for each job instance.
+      ipusPerJobInstance: "1"
+      workers:
+        template:
+          spec:
+            containers:
+            - name: mnist-training
+              image: graphcore/pytorch:3.3.0
+              command: [/bin/bash, -c, --]
+              args:
+                - |
+                  cd;
+                  mkdir build;
+                  cd build;
+                  git clone https://github.com/graphcore/examples.git;
+                  cd examples/tutorials/simple_applications/pytorch/mnist;
+                  python -m pip install -r requirements.txt;
+                  python mnist_poptorch_code_only.py --epochs 1
+              securityContext:
+                capabilities:
+                  add:
+                  - IPC_LOCK
+              volumeMounts:
+              - mountPath: /dev/shm
+                name: devshm
+            restartPolicy: Never
+            hostIPC: true
+            volumes:
+            - emptyDir:
+                medium: Memory
+                sizeLimit: 10Gi
+              name: devshm
+    ```
+
+1. to submit the job - run `kubectl create -f mnist-training-ipujob.yaml`, which will give the following output:
+
+    ``` bash
+    ipujob.graphcore.ai/mnist-training created
+    ```
+
+1. to monitor progress of the job - run `kubectl get pods`, which will give the following output:
+
+    ``` bash
+    NAME                      READY   STATUS      RESTARTS   AGE
+    mnist-training-worker-0   0/1     Completed   0          2m56s
+    ```
+
+1. to read the result - run `kubectl logs mnist-training-worker-0`, which will give the following output (or similar):
+
+    ``` bash
+    ...
+    Graph compilation: 100%|██████████| 100/100 [00:23<00:00]
+    Epochs: 100%|██████████| 1/1 [00:34<00:00, 34.18s/it]
+    ...
+    Accuracy on test set: 97.08%
+    ```
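The submit, monitor and read-result steps can be chained in a small script. This is a sketch, not part of the service tooling: the function name `run_ipujob`, the 600-second timeout and the 5-second polling interval are assumptions. It polls the pod phase, where `Succeeded` corresponds to the `Completed` status shown by `kubectl get pods`.

```bash
#!/usr/bin/env bash
# Sketch: submit an IPUJob manifest, poll its worker pod, then print the logs.
# The function name, timeout and polling interval are illustrative assumptions.

run_ipujob() {
  local manifest="$1"   # e.g. mnist-training-ipujob.yaml
  local pod="$2"        # e.g. mnist-training-worker-0
  local timeout_s="${3:-600}" interval_s="${4:-5}" waited=0 phase=""

  kubectl create -f "$manifest" || return 1

  # Poll the pod phase until it is terminal or the timeout expires.
  while [ "$waited" -lt "$timeout_s" ]; do
    # The pod may not exist yet just after submission, so tolerate errors.
    phase=$(kubectl get pod "$pod" -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
    case "$phase" in
      Succeeded) kubectl logs "$pod"; return 0 ;;
      Failed)    kubectl logs "$pod"; return 1 ;;
    esac
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "timed out after ${timeout_s}s (last phase: ${phase:-unknown})" >&2
  return 1
}
```

With the files from this lesson, the call would be `run_ipujob mnist-training-ipujob.yaml mnist-training-worker-0`.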
+
+## Monitoring and Cancelling your IPU job
+
+The IPU Operator manages each `IPUJob` and creates the worker or launcher pods it requires. To see running or completed `IPUJobs`, run `kubectl get ipujobs`, which will show:
+
+``` bash
+NAME             STATUS      CURRENT   DESIRED   LASTMESSAGE          AGE
+mnist-training   Completed   0         1         All instances done   10m
+```
+
+To delete the `IPUJob`, run `kubectl delete ipujobs <job-name>`, e.g. `kubectl delete ipujobs mnist-training`. This will also delete the associated worker pod `mnist-training-worker-0`.
+
+Note: simply deleting the pod via `kubectl delete pods mnist-training-worker-0` does not delete the IPU job, which will need to be deleted separately.
+
+Note: `kubectl get all` and `kubectl get pods` list pods but do not show IPUJobs. Use `kubectl get ipujobs` to list those.
+
+Note: `kubectl describe pod <pod-name>` provides a verbose description of a specific pod.
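Deleting the job and confirming that its pods are gone can be wrapped in one helper. This is a minimal sketch: `cancel_ipujob` is a hypothetical name, and the filter assumes that pod names start with the job name, as `mnist-training-worker-0` does in the example above.

```bash
#!/usr/bin/env bash
# Sketch: delete an IPUJob, then report any pods that still carry its name.
# Assumes pod names are prefixed with the job name (as in the example above).

cancel_ipujob() {
  local job="$1" leftover   # job: e.g. mnist-training
  kubectl delete ipujobs "$job" || return 1

  # `-o name` prints one "pod/<name>" entry per line.
  leftover=$(kubectl get pods -o name | grep "^pod/${job}-" || true)
  if [ -n "$leftover" ]; then
    echo "pods still present (they may be terminating):"
    echo "$leftover"
  else
    echo "no pods left for ${job}"
  fi
}
```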
+
+## Description
+
+The Graphcore IPU Operator (Kubernetes interface) extends the Kubernetes API by introducing a custom resource definition (CRD) named `IPUJob`, which can be seen at the beginning of the included yaml file:
+
+``` yaml
+apiVersion: graphcore.ai/v1alpha1
+kind: IPUJob
+```
+
+An `IPUJob` allows users to define workloads that can use IPUs. There are several fields specific to an `IPUJob`:
+
+**jobInstances** : This defines the number of job instances. In the case of training it should be 1.
+
+**ipusPerJobInstance** : This defines the size of IPU partition that will be created for each job instance.
+
+**workers** : This defines a Pod specification that will be used for `Worker` Pods, including the container image and commands.
+
+These fields have been populated in the example .yaml file. For distributed training (with multiple IPUs), additional fields need to be included, which will be described in the [next lesson](./L2_multiple_IPU.md).
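The field values of a submitted job can be read back with `kubectl`'s JSONPath output. This is a sketch under the assumption that the cluster stores the fields at the same paths as in the manifest (`.spec.jobInstances`, `.spec.ipusPerJobInstance`); the function name `show_ipujob_spec` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: print the IPUJob-specific fields of a submitted job.
# The field paths mirror the manifest layout; this is an assumption about
# how the custom resource is stored, not a documented guarantee.

show_ipujob_spec() {
  local job="$1"   # e.g. mnist-training
  kubectl get ipujobs "$job" \
    -o jsonpath='jobInstances={.spec.jobInstances}{"\n"}ipusPerJobInstance={.spec.ipusPerJobInstance}{"\n"}'
}
```

For the job in this lesson, the call would be `show_ipujob_spec mnist-training`.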
+
+## Additional Information
+
+The restart policy (`Always`/`OnFailure`/`Never`/`ExitCode`) and clean-up policy (`Workers`/`All`/`None`) can also be specified; see [here](https://docs.graphcore.ai/projects/kubernetes-user-guide/en/latest/creating-ipujob.html).