Merge pull request #140 from DimmestP/gpu-service-specify-namespace
Addresses namespace issue with kubectl usage
agngrant authored Apr 9, 2024
2 parents 5bb9958 + 6b0753a commit f2ab068
Showing 2 changed files with 42 additions and 9 deletions.
8 changes: 8 additions & 0 deletions docs/services/gpuservice/faq.md
@@ -10,6 +10,14 @@ The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM wi

Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. Project Leads and Managers can provide the file on any of the project VMs or give it to individuals within the project.

### Access to GPU Service resources in default namespace is 'Forbidden'

```bash
Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User <user> cannot create resource "jobs" in API group "" in the namespace "default"
```

Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. It arises when the job/pod is submitted without specifying your project namespace, so the request goes to the "default" namespace, which you do not have permission to use. Resubmitting the job/pod with `kubectl -n <project-namespace> create -f myjobfile.yml` should solve the issue.
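
A minimal sketch of two ways to avoid this error, assuming a placeholder namespace `eidf989ns` and a job file named `myjobfile.yml`. The helper function names are illustrative and not part of the service. `kubectl create` reads a manifest through its `-f` flag, and `kubectl config set-context --current --namespace=...` changes the default namespace of the current kubeconfig context (this only works if you are able to modify the kubeconfig file).

```bash
# Illustrative helpers; the function names are hypothetical, not part of the EIDF service.

# Submit a job file to an explicit namespace instead of "default".
# usage: submit_job <project-namespace> <job-file>
submit_job() {
    kubectl -n "$1" create -f "$2"
}

# Make the project namespace the default for the current kubeconfig context,
# so that later kubectl commands no longer need -n.
# usage: use_namespace <project-namespace>
use_namespace() {
    kubectl config set-context --current --namespace="$1"
}

# Example (placeholder namespace):
#   submit_job eidf989ns myjobfile.yml
```

Passing `-n` on every command is the more explicit option and leaves the kubeconfig file untouched, which may matter when the file is shared between project members.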

### I can't mount my PVC in multiple containers or pods at the same time

The current PVC provisioner is based on Ceph RBD. The block devices provided by Ceph to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time. They can only be accessed by one pod at a time, once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation.
43 changes: 34 additions & 9 deletions docs/services/gpuservice/index.md
@@ -9,7 +9,7 @@ The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximate
The service provides access to:

- Nvidia A100 40GB
- Nvidia 80GB
- Nvidia A100 80GB
- Nvidia MIG A100 1G.5GB
- Nvidia MIG A100 3G.20GB
- Nvidia H100 80GB
@@ -27,6 +27,7 @@ The current full specification of the EIDF GPU Service as of 14 February 2024:
- 32 Nvidia H100 80 GB

!!! important "Quotas"

This is the full configuration of the cluster.

Each project will have access to a quota across this shared configuration.
@@ -40,16 +41,31 @@ The current full specification of the EIDF GPU Service as of 14 February 2024:
## Service Access

Users should have an [EIDF Account](../../access/project.md).
Users should have an [EIDF Account](../../access/project.md) as the EIDF GPU Service is only accessible through EIDF Virtual Machines.

Existing projects can request access to the EIDF GPU Service through a service request to the [EIDF helpdesk](https://portal.eidf.ac.uk/queries/submit) or emailing [email protected] .

New projects wanting to use the GPU Service should include this in their EIDF Project Application.

Each project will be given a namespace within the EIDF GPU service to operate in.

Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk.
This namespace will normally be the EIDF Project code appended with 'ns', e.g. `eidf989ns` for a project with code 'eidf989'.
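
The naming convention can be sketched as a small shell helper. The function name `project_namespace` is hypothetical, and the result is only a guess based on the usual convention; the namespace actually assigned to a project should be confirmed with the Project Lead or the EIDF helpdesk.

```bash
# Derive the usual namespace name from an EIDF project code.
# Assumption: the project follows the normal "<project code>ns" convention.
project_namespace() {
    printf '%sns\n' "$1"
}

# Example with the project code used in the text above.
project_namespace eidf989   # prints: eidf989ns
```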

Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md).
Once access to the EIDF GPU service has been confirmed, Project Leads will be given the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md).

All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not require to be GPU-enabled.
All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl command line tool.

The VM does not need to be GPU-enabled.

A quick check to see if a VM has access to the EIDF GPU service can be completed by typing `kubectl -n <project-namespace> get jobs` into the command line.

If this is the first time you have connected to the GPU service, the response should be `No resources found in <project-namespace> namespace`.
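
The check can be wrapped in a short shell function, shown here as a sketch. The function name `check_gpu_service_access` and the messages it prints are illustrative assumptions; only the `kubectl -n <project-namespace> get jobs` call comes from the text above.

```bash
# Report whether jobs can be listed in the given project namespace.
# usage: check_gpu_service_access <project-namespace>
check_gpu_service_access() {
    if kubectl -n "$1" get jobs; then
        echo "access to namespace $1 confirmed"
    else
        # A non-zero exit from kubectl usually means a missing kubeconfig
        # file, a wrong namespace, or insufficient permissions.
        echo "could not list jobs in namespace $1" >&2
        return 1
    fi
}

# Example (placeholder namespace):
#   check_gpu_service_access eidf989ns
```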

!!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs"
The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types.

The EIDF GPU Service is a container-based service which is accessed from EIDF Virtual Desktop VMs.

This allows a project to access multiple GPUs of different types.

An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.

@@ -64,16 +80,25 @@ A standard project namespace has the following initial quota (subject to ongoing
- GPU: 12

!!! important "Quota is a maximum on a Shared Resource"
A project quota is the maximum proportion of the service available for use by that project.

During periods of high demand, Jobs will be queued awaiting resource availability on the Service.
A project quota is the maximum proportion of the service available for use by that project.

This means that a project has access up to 12 GPUs but due to demand may only be able to access a smaller number at any given time.
Any submitted job requests that would exceed the total project quota will be queued.

## Project Queues

EIDF GPU Service is introducing the Kueue system in February 2024. The use of this is detailed in the [Kueue](kueue.md) documentation.

!!! important "Job Queuing"

During periods of high demand, jobs will be queued awaiting resource availability on the Service.

As a general rule, the higher the GPU/CPU/Memory resource request of a single job, the longer it will wait in the queue before enough resources are free on a single node for it to be allocated.

GPUs in high demand, such as Nvidia H100s, typically have longer wait times.

Furthermore, a project may have a quota of up to 12 GPUs but due to demand may only be able to access a smaller number at any given time.

## Additional Service Policy Information

Additional information on service policies can be found [here](policies.md).