Hello, I'm trying to use the gpushare device plugin only to expose the gpu_mem resource of a k8s GPU node in MiB. I have all the NVIDIA components (drivers, nvidia-container-runtime, etc.) installed, and everything works fine except for one thing. For example, there is a pod YAML
Now the same Pod has been created successfully and has run to completion:
NAME            READY   STATUS      RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
gpu-test-bald   0/1     Completed   0          3m40s   10.62.97.59   gpu-node10   <none>           <none>
$ kubectl logs -f gpu-test-bald
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
So could you explain whether this behaviour of the NVIDIA_VISIBLE_DEVICES env var is correct? It seems like it is not.
k0nstantinv changed the title from "NVIDIA_VISIBLE_DEVICES wrong value is in OCI spec" to "NVIDIA_VISIBLE_DEVICES wrong value in OCI spec" on Nov 8, 2022
gpu-node10
I've noticed that NVIDIA_VISIBLE_DEVICES somehow became different, which causes an error during container creation. This exact error appears because the NVIDIA_VISIBLE_DEVICES env var gets an unacceptable value; I've handled it in the container OCI spec. Adding NVIDIA_VISIBLE_DEVICES=all to the Pod YAML fixes it, as described here.
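For illustration, the workaround of setting NVIDIA_VISIBLE_DEVICES=all in the Pod YAML might look roughly like the fragment below. This is only a sketch and not the original manifest: the container name, the image, and the aliyun.com/gpu-mem resource name and amount are assumptions; only the Pod name and the env var come from this report.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-bald
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vector-add                      # hypothetical container name
      image: example.com/cuda-vector-add:latest  # placeholder image, not from the report
      env:
        # Workaround: override the value that otherwise ends up in the OCI spec
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      resources:
        limits:
          aliyun.com/gpu-mem: 1024               # assumed resource name; amount in MiB is illustrative
```

Note that with the value "all" the container sees every GPU on the node, so this works around the error rather than explaining the value the plugin sets.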