From a1db9537ba2fba38a2c69bbfdf3ee6bcb352322a Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 21 Mar 2024 12:30:31 +0000 Subject: [PATCH 1/8] Clarify namespace flag in kubectl usage --- docs/services/gpuservice/faq.md | 6 +++++ docs/services/gpuservice/index.md | 45 ++++++++++++++++++++++++------- 2 files changed, 41 insertions(+), 10 deletions(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 456870b7a..ccec549ab 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -10,6 +10,12 @@ The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM wi Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. Project Leads and Managers can provide the file on any of the project VMs or give it to individuals within the project. +### Access to GPU Service resources in default namespace is 'Forbidden' + +```Error from server (Forbidden): error when creating : jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` + +Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create ` should solve the issue. + ### I can't mount my PVC in multiple containers or pods at the same time The current PVC provisioner is based on Ceph RBD. The block devices provided by Ceph to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time. They can only be accessed by one pod at a time, once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation. 
diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index bca3f0dea..6ac8f5b6c 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -9,7 +9,7 @@ The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximate The service provides access to: - Nvidia A100 40GB -- Nvidia 80GB +- Nvidia A100 80GB - Nvidia MIG A100 1G.5GB - Nvidia MIG A100 3G.20GB - Nvidia H100 80GB @@ -27,6 +27,7 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: - 32 Nvidia H100 80 GB !!! important "Quotas" + This is the full configuration of the cluster. Each project will have access to a quota across this shared configuration. @@ -40,16 +41,29 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: ## Service Access -Users should have an [EIDF Account](../../access/project.md). +Users should have an [EIDF Account](../../access/project.md) as access to the EIDF GPU Service can only be obtained through an EIDF virtual machine. + +Project Leads can request access to the EIDF GPU Service from VMs in an existing project through a service request to the EIDF helpdesk. + +Otherwise, Project Leads need to apply for a new EIDF project and specify access to the EIDF GPU service. + +Each project will be given a namespace within the EIDF GPU service to operate in. + +Once access to the EIDF GPU service has been confirmed, Project Leads will be give the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). + +All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl API. -Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk. +The VM does not require to be GPU-enabled. 
-Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). +A quick check to see if a VM has access to the EIDF GPU service can be completed by typing `kubectl -n get jobs` in to the command line. -All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not require to be GPU-enabled. +If this is first time you have connected to the GPU service the response should be `No resources found in namespace`. !!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs" - The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types. + + The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. + + This allows a project to access multiple GPUs of different types. An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type. @@ -64,16 +78,27 @@ A standard project namespace has the following initial quota (subject to ongoing - GPU: 12 !!! important "Quota is a maximum on a Shared Resource" + A project quota is the maximum proportion of the service available for use by that project. - - During periods of high demand, Jobs will be queued awaiting resource availability on the Service. - - This means that a project has access up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. + + This is a sum of all requested resources across all submitted jobs/pods/deployments within a project. + + Any submitted resource requests that would exceed the total project quota will be rejected. ## Project Queues EIDF GPU Service is introducing the Kueue system in February 2024. The use of this is detailed in the [Kueue](kueue.md). +!!! 
important "Job Queuing" + + During periods of high demand, jobs will be queued awaiting resource availability on the Service. + + As a general rule, the higher the GPU/CPU/Memory resource request of a single job the longer it will wait in the queue before enough resources are free on a single node for it be allocated. + + GPUs in high demand, such as Nvidia H100s, typically have longer wait times. + + Furthermore, a project may have a quota of up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. + ## Additional Service Policy Information Additional information on service policies can be found [here](policies.md). From 1de7fc678964919f26b52dbcc67cf3ffdb678e2e Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 21 Mar 2024 12:42:43 +0000 Subject: [PATCH 2/8] Pre commit checks --- docs/services/gpuservice/index.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 6ac8f5b6c..115fd2f76 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -61,8 +61,8 @@ If this is first time you have connected to the GPU service the response should !!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs" - The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. - + The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. + This allows a project to access multiple GPUs of different types. An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type. @@ -78,11 +78,11 @@ A standard project namespace has the following initial quota (subject to ongoing - GPU: 12 !!! important "Quota is a maximum on a Shared Resource" - + A project quota is the maximum proportion of the service available for use by that project. 
- + This is a sum of all requested resources across all submitted jobs/pods/deployments within a project. - + Any submitted resource requests that would exceed the total project quota will be rejected. ## Project Queues @@ -92,9 +92,9 @@ EIDF GPU Service is introducing the Kueue system in February 2024. The use of th !!! important "Job Queuing" During periods of high demand, jobs will be queued awaiting resource availability on the Service. - - As a general rule, the higher the GPU/CPU/Memory resource request of a single job the longer it will wait in the queue before enough resources are free on a single node for it be allocated. - + + As a general rule, the higher the GPU/CPU/Memory resource request of a single job the longer it will wait in the queue before enough resources are free on a single node for it be allocated. + GPUs in high demand, such as Nvidia H100s, typically have longer wait times. Furthermore, a project may have a quota of up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. From 8b048c670ab109bf5060824d8dac9db441447370 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 21 Mar 2024 13:15:43 +0000 Subject: [PATCH 3/8] Adds typical namespace example --- docs/services/gpuservice/index.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 115fd2f76..52523dc65 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -49,6 +49,8 @@ Otherwise, Project Leads need to apply for a new EIDF project and specify access Each project will be given a namespace within the EIDF GPU service to operate in. +Typically, the namespace is the same as the EIDF project code but with 'ns' appended, i.e. `eidf989ns` for a project with code 'eidf989'. 
+ Once access to the EIDF GPU service has been confirmed, Project Leads will be give the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl API. From cfd144c31fb2d476cf15e10010479a1fab480434 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Mon, 25 Mar 2024 09:49:36 +0000 Subject: [PATCH 4/8] Swap manifest-filename to more specific myjobyaml --- docs/services/gpuservice/faq.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index ccec549ab..4a26e42ed 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -12,9 +12,9 @@ Project Leads and Managers can access the kubeconfig file from the Project page ### Access to GPU Service resources in default namespace is 'Forbidden' -```Error from server (Forbidden): error when creating : jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` +```Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` -Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create ` should solve the issue. +Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. 
Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue. ### I can't mount my PVC in multiple containers or pods at the same time From f3f73871025b602b7ac8dc94568fb66d071ae8fc Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Mon, 25 Mar 2024 10:08:43 +0000 Subject: [PATCH 5/8] Simplify to project namespace --- docs/services/gpuservice/faq.md | 2 +- docs/services/gpuservice/index.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 4a26e42ed..40692bb65 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -14,7 +14,7 @@ Project Leads and Managers can access the kubeconfig file from the Project page ```Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` -Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue. +Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue. 
### I can't mount my PVC in multiple containers or pods at the same time diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 52523dc65..cc7836ff0 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -57,9 +57,9 @@ All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU The VM does not require to be GPU-enabled. -A quick check to see if a VM has access to the EIDF GPU service can be completed by typing `kubectl -n get jobs` in to the command line. +A quick check to see if a VM has access to the EIDF GPU service can be completed by typing `kubectl -n get jobs` in to the command line. -If this is first time you have connected to the GPU service the response should be `No resources found in namespace`. +If this is first time you have connected to the GPU service the response should be `No resources found in namespace`. !!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs" From 7f23046bfee15b1a684bd8ba5aa760844c514621 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 16:40:42 +0000 Subject: [PATCH 6/8] Respond to alistair comments --- docs/services/gpuservice/index.md | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index cc7836ff0..8e9992334 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -41,19 +41,19 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: ## Service Access -Users should have an [EIDF Account](../../access/project.md) as access to the EIDF GPU Service can only be obtained through an EIDF virtual machine. +Users should have an [EIDF Account](../../access/project.md) as the EIDF GPU Service is only accessible through EIDF Virtual Machines. 
-Project Leads can request access to the EIDF GPU Service from VMs in an existing project through a service request to the EIDF helpdesk. +Existing projects can request access to the EIDF GPU Service through a service request to the [EIDF helpdesk](https://portal.eidf.ac.uk/queries/submit) or emailing eidf@epcc.ed.ac.uk . -Otherwise, Project Leads need to apply for a new EIDF project and specify access to the EIDF GPU service. +New projects wanting to using the GPU Service should include this in their EIDF Project Application. Each project will be given a namespace within the EIDF GPU service to operate in. -Typically, the namespace is the same as the EIDF project code but with 'ns' appended, i.e. `eidf989ns` for a project with code 'eidf989'. +This namespace will normally be the EIDF Project code appended with ’ns’, i.e. `eidf989ns` for a project with code 'eidf989'. Once access to the EIDF GPU service has been confirmed, Project Leads will be give the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). -All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl API. +All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl command line tool. The VM does not require to be GPU-enabled. @@ -83,9 +83,7 @@ A standard project namespace has the following initial quota (subject to ongoing A project quota is the maximum proportion of the service available for use by that project. - This is a sum of all requested resources across all submitted jobs/pods/deployments within a project. - - Any submitted resource requests that would exceed the total project quota will be rejected. + Any submitted job requests that would exceed the total project quota will be queued. 
## Project Queues From c3f8020b2e907b8271ba53d3c6a04bf56c54cf7e Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Tue, 9 Apr 2024 16:13:33 +0100 Subject: [PATCH 7/8] Place error example within triangular brackets --- docs/services/gpuservice/faq.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 40692bb65..1d67da17f 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -12,7 +12,7 @@ Project Leads and Managers can access the kubeconfig file from the Project page ### Access to GPU Service resources in default namespace is 'Forbidden' -```Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` +``` cannot create resource "jobs" in API group "" in the namespace "default">``` Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue. 
From 6b0753ae263725623c16f64c57ffa2f328a79996 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Tue, 9 Apr 2024 16:22:05 +0100 Subject: [PATCH 8/8] Fixed code block formatting --- docs/services/gpuservice/faq.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 1d67da17f..c859e0fb9 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -12,7 +12,9 @@ Project Leads and Managers can access the kubeconfig file from the Project page ### Access to GPU Service resources in default namespace is 'Forbidden' -``` cannot create resource "jobs" in API group "" in the namespace "default">``` +```bash +Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default" +``` Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue.
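
The namespace handling described above can be sketched as a small shell helper. This is a minimal illustration, not part of the EIDF tooling: the function names `project_namespace`, `kubectl_ns_cmd` and `run_in_namespace` and the `DRY_RUN` switch are hypothetical, and the `eidf989` project code is only the example used in the service documentation. It assumes the namespace follows the usual "project code plus `ns`" convention, which the documentation describes as normal rather than guaranteed. By default the helper only prints the command it would run, so it can be tried on a machine without a kubeconfig file.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are hypothetical, not EIDF-provided.

# Derive the usual namespace from a project code, e.g. eidf989 -> eidf989ns.
# Assumption: the project follows the normal "<project code>ns" convention.
project_namespace() {
    printf '%sns\n' "$1"
}

# Build the kubectl command line for a namespace plus any further arguments.
kubectl_ns_cmd() {
    _ns="$1"
    shift
    printf 'kubectl -n %s %s\n' "$_ns" "$*"
}

# Print the command by default; run kubectl only when DRY_RUN=0 is set.
run_in_namespace() {
    _target_ns="$1"
    shift
    if [ "${DRY_RUN:-1}" = "1" ]; then
        kubectl_ns_cmd "$_target_ns" "$@"
    else
        kubectl -n "$_target_ns" "$@"
    fi
}

ns="$(project_namespace eidf989)"

# Quick access check, then a job submission into the project namespace.
run_in_namespace "$ns" get jobs
run_in_namespace "$ns" create -f myjobfile.yml
```

Setting `DRY_RUN=0` runs the commands against the cluster instead of printing them. Note that `kubectl create` reads a manifest through its `-f` flag, so the sketch passes the job file as `create -f myjobfile.yml`. Passing the namespace explicitly on every call is the point of the wrapper: it avoids falling back to the "default" namespace, which is what triggers the 'Forbidden' error.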