Commit

Merge pull request #109 from EPCCed/gpu_faq_branch
AG: added specific FAQ to the GPU Service
agngrant authored Oct 12, 2023
2 parents a1dee3b + 06c1112 commit 9889465
Showing 2 changed files with 32 additions and 0 deletions.
31 changes: 31 additions & 0 deletions docs/services/gpuservice/faq.md
@@ -0,0 +1,31 @@
# GPU Service FAQ

## GPU Service Frequently Asked Questions

### How do I access the GPU Service?

The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM has access to all EIDF resources for your project and can be reached through the VDI (SSH or, if enabled, RDP) or via the EIDF SSH Gateway.

### How do I obtain my project kubeconfig file?

Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. They can place the file on any of the project VMs or give it to individual members of the project.
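
Once the file is available on a VM, kubectl needs to be pointed at it. The following is a minimal sketch, assuming the file has been placed somewhere readable by your user; the helper name `use_kubeconfig` and the example path are hypothetical and not part of the GPU Service itself.

```bash
# Point kubectl at a project kubeconfig file.
# The path passed in is an assumption: use the location where the file was provided.
use_kubeconfig() {
    local path="$1"
    if [ ! -f "$path" ]; then
        echo "kubeconfig not found: $path" >&2
        return 1
    fi
    # The file holds credentials, so restrict it to the current user.
    chmod 600 "$path"
    # kubectl reads the KUBECONFIG environment variable to locate its configuration.
    export KUBECONFIG="$path"
}

# Example usage (commented out so that sourcing this sketch has no side effects):
# use_kubeconfig "$HOME/.kube/config" && kubectl get pods
```

If `kubectl get pods` returns without an authorisation error, the configuration is being picked up.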

### I can't mount my PVC in multiple containers or pods at the same time

The current PVC provisioner is based on Ceph RBD. The block devices that Ceph provides to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time; only one pod can access them at a time. Once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation.
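
When a pod fails to start because its PVC is still in use, it can help to find out which pod references the claim. The sketch below is one possible approach rather than an official GPU Service tool: the function names and the claim name `my-pvc` are hypothetical, and it assumes kubectl is already configured for your project namespace.

```bash
# List each pod in the current namespace alongside the PVC names it references.
pods_and_claims() {
    kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}'
}

# Print the names of pods that reference the given claim.
pods_using_claim() {
    pods_and_claims | awk -F '\t' -v claim="$1" '
        {
            # A pod may reference several claims, separated by spaces.
            n = split($2, names, " ")
            for (i = 1; i <= n; i++) if (names[i] == claim) print $1
        }'
}

# Example usage (hypothetical claim name):
# pods_using_claim my-pvc
```

The list includes pods that have already completed, so check each pod's status before concluding that it still holds the volume.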

### How many GPUs can I use in a pod?

The current limit is 8 GPUs per pod. Each underlying host has 8 GPUs.

### Why did a validation error occur when submitting a pod or job with a valid specification file?

If an error like the one below occurs:

```bash
error: error validating "myjobfile.yml": error validating data: the server does not allow access to the requested resource; if you choose to ignore these errors, turn validation off with --validate=false
```
There may be an issue with the version of kubectl being run. This can occur if kubectl was installed in a virtual environment or from a package repository.
The current version verified to operate with the GPU Service is v1.24.10. kubectl and the Kubernetes API server can suffer from version skew if they are not within a defined number of releases of each other. More information can be found in the [Kubernetes Version Skew Policy](https://kubernetes.io/releases/version-skew-policy/).
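
Running `kubectl version` reports both the client and the server version, which can then be compared. As a rough sketch, the helpers below check whether two version strings are within one minor release of each other, which is the range the Kubernetes skew policy supports for kubectl. The function names are hypothetical, and the server version in the usage example is illustrative only.

```bash
# Extract the minor version number from a string such as "v1.24.10".
minor_of() {
    echo "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# Succeed when the two versions are within one minor release of each other.
within_skew() {
    local diff=$(( $(minor_of "$1") - $(minor_of "$2") ))
    # Strip a leading minus sign to obtain the absolute difference.
    [ "${diff#-}" -le 1 ]
}

# Example usage (the second version is illustrative only):
# within_skew v1.24.10 v1.25.3 && echo "within supported skew"
```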
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -67,6 +67,7 @@ nav:
- "Getting Started": services/gpuservice/training/L1_getting_started.md
- "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md
- "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md
- "GPU Service FAQ": services/gpuservice/faq.md
- "Data Management Services":
- "Data Catalogue":
- "Metadata information": services/datacatalogue/metadata.md