-
Notifications
You must be signed in to change notification settings - Fork 22
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #111 from josephleekl/graphcore-docs
JL: added graphcore service documentation
- Loading branch information
Showing
8 changed files
with
732 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# Graphcore FAQ | ||
|
||
## Graphcore Questions | ||
|
||
### How do I delete a running/terminated pod? | ||
|
||
`IPUJobs` manages the launcher and worker `pods`, therefore the pods will be deleted when the `IPUJob` is deleted, using `kubectl delete ipujobs <IPUJob-name>`. If only the `pod` is deleted via `kubectl delete pod`, the `IPUJob` may respawn the `pod`. | ||
|
||
To see running or terminated `IPUJobs`, run `kubectl get ipujobs`. | ||
|
||
### My IPUJob died with a message: `'poptorch_cpp_error': Failed to acquire X IPU(s)`. Why? | ||
|
||
This error may appear when the IPUJob name is too long. | ||
|
||
We have identified that for IPUJobs with `metadata:name` length over 36 characters, this error may appear. A solution is to reduce the name to under 36 characters. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# Overview | ||
|
||
EIDF hosts a Graphcore Bow Pod64 system for AI acceleration. | ||
|
||
The specification of the Bow Pod64 is: | ||
|
||
- 16x Bow-2000 machines | ||
- 64x Bow IPUs (4 IPUs per Bow-2000) | ||
- 94,208 IPU cores (1472 cores per IPU) | ||
- 57.6GB of In-Processor-Memory (0.9GB per IPU) | ||
|
||
For more details about the IPU architecture, see [documentation from Graphcore](https://docs.graphcore.ai/projects/ipu-programmers-guide/en/latest/about_ipu.html#). | ||
|
||
The smallest unit of compute resource that can be requested is a single IPU. | ||
|
||
Similarly to the EIDF GPU Service, usage of the graphcore is managed using [Kubernetes](https://kubernetes.io). | ||
|
||
## Service Access | ||
|
||
## Project Quotas | ||
|
||
## Graphcore Tutorial | ||
|
||
The following tutorial teaches users how to submit tasks to the graphcore system. This tutorial assumes basic familiary with submitting jobs via Kubernetes. For a tutorial on using Kubernetes, see the [GPU service tutorial](../gpuservice/training/L1_getting_started.md). For more in-depth lessons about developing applications for graphcore, see [the general documentation](https://docs.graphcore.ai/en/latest/) and [guide for creating IPU jobs via Kubernetes](https://docs.graphcore.ai/projects/kubernetes-user-guide/en/latest/creating-ipujob.html). | ||
|
||
| Lesson | Objective | | ||
|-----------------------------------|-------------------------------------| | ||
| [Getting started with IPU jobs](training/L1_getting_started.md) | a. How to send an IPUJob.<br>b. Monitoring and Cancelling your IPUJob. | | ||
| [Multi-IPU Jobs](training/L2_multiple_IPU.md) | a. Using multiple IPUs for distributed training. | | ||
| [Profiling with PopVision](training/L3_profiling.md) | a. Enabling profiling in your code.<br>b. Downloading the profile reports. | | ||
| [Other Frameworks](training/L4_other_frameworks.md) | a. Using Tensorflow and PopART.<br>b. Writing IPU programs with PopLibs (C++).| | ||
|
||
## Further Reading and Help | ||
|
||
- The [Graphcore documentation](https://docs.graphcore.ai/en/latest/) provides information about using the Graphcore system. | ||
|
||
- The [Graphcore examples repository on github](https://github.com/graphcore/examples/tree/master) provides a catalogue of application examples that have been optimised to run on Graphcore IPUs for both training and inference. It also contains tutorials for using various frameworks. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
# Getting started with Graphcore IPU Jobs | ||
|
||
This guide assumes basic familiarity with Kubernetes (K8s) and usage of `kubectl`. See [GPU service tutorial](../gpuservice/training/L1_getting_started.md) to get started. | ||
|
||
## Introduction | ||
|
||
Graphcore provides prebuilt docker containers (full lists [here](https://hub.docker.com/u/graphcore)) which contain the required libraries (pytorch, tensorflow, poplar etc.) and can be used directly within the K8s to run on the Graphcore IPUs. | ||
|
||
In this tutorial we will cover running training with a single IPU. The subsequent tutorial will cover using multiple IPUs, which can be used for distrubed training jobs. | ||
|
||
## Creating your first IPU job | ||
|
||
For our first IPU job, we will be using the Graphcore PyTorch (PopTorch) container image (`graphcore/pytorch:3.3.0`) to run a simple example of training a neural network for classification on the MNIST dataset, which is provided [here](https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/pytorch/mnist). More applications can be found in the repository <https://github.com/graphcore/examples>. | ||
|
||
To get started: | ||
|
||
1. to specify the job - create the file `mnist-training-ipujob.yaml`, then copy and save the following content into the file: | ||
|
||
``` yaml | ||
apiVersion: graphcore.ai/v1alpha1 | ||
kind: IPUJob | ||
metadata: | ||
name: mnist-training | ||
spec: | ||
# jobInstances defines the number of job instances. | ||
# More than 1 job instance is usually useful for inference jobs only. | ||
jobInstances: 1 | ||
# ipusPerJobInstance refers to the number of IPUs required per job instance. | ||
# A separate IPU partition of this size will be created by the IPU Operator | ||
# for each job instance. | ||
ipusPerJobInstance: "1" | ||
workers: | ||
template: | ||
spec: | ||
containers: | ||
- name: mnist-training | ||
image: graphcore/pytorch:3.3.0 | ||
command: [/bin/bash, -c, --] | ||
args: | ||
- | | ||
cd; | ||
mkdir build; | ||
cd build; | ||
git clone https://github.com/graphcore/examples.git; | ||
cd examples/tutorials/simple_applications/pytorch/mnist; | ||
python -m pip install -r requirements.txt; | ||
python mnist_poptorch_code_only.py --epochs 1 | ||
securityContext: | ||
capabilities: | ||
add: | ||
- IPC_LOCK | ||
volumeMounts: | ||
- mountPath: /dev/shm | ||
name: devshm | ||
restartPolicy: Never | ||
hostIPC: true | ||
volumes: | ||
- emptyDir: | ||
medium: Memory | ||
sizeLimit: 10Gi | ||
name: devshm | ||
``` | ||
1. to submit the job - run `kubectl create -f mnist-training-ipujob.yaml`, which will give the following output: | ||
|
||
``` bash | ||
ipujob.graphcore.ai/mnist-training created | ||
``` | ||
|
||
1. to monitor progress of the job - run `kubectl get pods`, which will give the following output | ||
|
||
``` bash | ||
NAME READY STATUS RESTARTS AGE | ||
mnist-training-worker-0 0/1 Completed 0 2m56s | ||
``` | ||
|
||
1. to read the result - run `kubectl logs mnist-training-worker-0`, which will give the following output (or similar) | ||
|
||
``` bash | ||
... | ||
Graph compilation: 100%|██████████| 100/100 [00:23<00:00] | ||
Epochs: 100%|██████████| 1/1 [00:34<00:00, 34.18s/it] | ||
... | ||
Accuracy on test set: 97.08% | ||
``` | ||
|
||
## Monitoring and Cancelling your IPU job | ||
|
||
An IPU job creates an IPU Operator, which manages the required worker or launcher pods. To see running or complete `IPUjobs`, run `kubectl get ipujobs`, which will show: | ||
|
||
``` bash | ||
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE | ||
mnist-training Completed 0 1 All instances done 10m | ||
``` | ||
|
||
To delete the `IPUjob`, run `kubectl delete ipujobs <job-name>`, e.g. `kubectl delete ipujobs mnist-training`. This will also delete the associated worker pod `mnist-training-worker-0`. | ||
|
||
Note: simply deleting the pod via `kubectl delete pods mnist-training-worker-0` does not delete the IPU job, which will need to be deleted separately. | ||
|
||
Note: you can list all pods via `kubectl get all` or `kubectl get pods`, but they do not show the ipujobs. These can be obtained using `kubectl get ipujobs`. | ||
|
||
Note: `kubectl describe <pod-name>` provides verbose description of a specific pod. | ||
|
||
## Description | ||
|
||
The Graphcore IPU Operator (Kubernetes interface) extends the Kubernetes API by introducing a custom resource definition (CRD) named `IPUJob`, which can be seen at the beginning of the included yaml file: | ||
|
||
``` yaml | ||
apiVersion: graphcore.ai/v1alpha1 | ||
kind: IPUJob | ||
``` | ||
|
||
An `IPUJob` allows users to defineworkloads that can use IPUs. There are several fields specific to an `IPUJob`: | ||
|
||
**job instances** : This defines the number of jobs. In the case of training it should be 1. | ||
|
||
**ipusPerJobInstance** : This defines the size of IPU partition that will be created for each job instance. | ||
|
||
**workers** : This defines a Pod specification that will be used for `Worker` Pods, including the container image and commands. | ||
|
||
These fields have been populated in the example .yaml file. For distributed training (with multiple IPUs), additional fields need to be included, which will be described in the [next lesson](./L2_multiple_IPU.md). | ||
|
||
## Additional Information | ||
|
||
It is possible to further specify the restart policy (`Always`/`OnFailure`/`Never`/`ExitCode`) and clean up policy (`Workers`/`All`/`None`); see [here](https://docs.graphcore.ai/projects/kubernetes-user-guide/en/latest/creating-ipujob.html). |
Oops, something went wrong.