From b66904cdfa0b6f0b5175e34b205029f5041accda Mon Sep 17 00:00:00 2001
From: yuzhao
Date: Thu, 12 Oct 2023 15:44:31 +0100
Subject: [PATCH] the solution for insufficient shared memory size

---
 docs/services/gpuservice/faq.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md
index fe801ebf6..8169bfa27 100644
--- a/docs/services/gpuservice/faq.md
+++ b/docs/services/gpuservice/faq.md
@@ -29,3 +29,23 @@ error: error validating "myjobfile.yml": error validating data: the server does
 There may be an issue with the kubectl version that is being run. This can occur if installing in virtual environments or from packages repositories.
 
 The current version verified to operate with the GPU Service is v1.24.10. kubectl and the Kubernetes API version can suffer from version skew if not with a defined number of releases. More information can be found on this under the [Kubernetes Version Skew Policy](https://kubernetes.io/releases/version-skew-policy/).
+
+
+### Insufficient Shared Memory Size
+
+My shared memory (SHM) is very small, which causes `OSError: [Errno 28] No space left on device` when I train a model on multiple GPUs. How do I increase the SHM size?
+
+The default size of `/dev/shm` in a container is only 64 MB. You can mount a memory-backed `emptyDir` volume at `/dev/shm` to solve this problem:
+```yaml
+spec:
+  containers:
+    - name: [NAME]
+      image: [IMAGE]
+      volumeMounts:
+        - mountPath: /dev/shm
+          name: dshm
+  volumes:
+    - name: dshm
+      emptyDir:
+        medium: Memory
+```
\ No newline at end of file
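
One way to check the `/dev/shm` mount from this patch is a throwaway Pod that mounts the volume and prints the size of `/dev/shm`. This is a minimal sketch, not part of the patch: the Pod name `shm-check`, the container name, the `busybox` image, and the `8Gi` value are illustrative assumptions, and the GPU Service may require fields (for example resource requests) that are not shown here. `sizeLimit` is a standard `emptyDir` field, but whether it caps a memory-backed volume depends on the cluster version and configuration.

```yaml
# Hypothetical verification Pod; all names and sizes are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: shm-check
spec:
  restartPolicy: Never
  containers:
    - name: shm-check
      image: busybox
      # Print the size of the shared memory mount, then exit.
      command: ["df", "-h", "/dev/shm"]
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        # Assumed cap; files written here count against the container's memory limit.
        sizeLimit: 8Gi
```

If this is saved as `shm-check.yml`, running `kubectl apply -f shm-check.yml` followed by `kubectl logs shm-check` should show a `/dev/shm` filesystem larger than the 64 MB default.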