How would I ensure that a specific task gets a specific GPU? #56
Comments
Some findings of interest. Following the instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html, assume the host reports the following GPU GUIDs:
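A listing of that shape can be produced with `nvidia-smi -L`; the output below is only a sketch, and the UUIDs are placeholders:

```sh
# List each GPU with its index, model name, and UUID (the "GUID" referred to below).
$ nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-ffffffff-1111-2222-3333-444444444444)
...
```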
If I run the container manually, pinned to one of those GUIDs, and then check which GPUs are visible inside it, I see only the GPU I asked for. Manually running the same exact container again, but with a different GUID, and checking again, shows only that other GPU.
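Per the container toolkit page linked above, a manual run of that shape looks roughly like the following (a sketch: the UUID is a placeholder and the CUDA base image merely stands in for whatever image is actually used):

```sh
# Expose exactly one GPU to the container via NVIDIA_VISIBLE_DEVICES,
# then list the devices visible inside it.
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
  nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi -L
# Only the GPU named above is listed; rerunning with a different UUID
# lists only that other device instead.
```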
Same works for any GUID I try, so it seems that everything works as expected when run manually. Now on the Nomad side: assume I have a task in my job set up in the following manner (other irrelevant stuff left out):

    task "llm-runner-1" {
      driver = "docker"

      config {
        image   = "myimage"
        ports   = ["api-1"]
        runtime = "nvidia"
      }

      service {
        name     = "llm-runner-1"
        port     = "api-1"
        provider = "nomad"
      }

      resources {
        cpu    = 3000
        memory = 3000

        device "nvidia/gpu" {
          count = 1
        }
      }

      env {
        NVIDIA_VISIBLE_DEVICES = "GPU-684586d6-bed0-e6a7-78e2-bf784635fd1b"
      }
    }
Then running the same check inside the Nomad-scheduled task, the GPU it ends up with is not the one set in `NVIDIA_VISIBLE_DEVICES`; it appears to be chosen at random. So I am not exactly sure where the wires are getting crossed. It does the same thing with or without specifying the ...
🤦 I apologize. This might have been less of a bug and more of a documentation issue. I discovered this: https://developer.hashicorp.com/nomad/docs/job-specification/device#affinity-towards-specific-gpu-devices and that was exactly what I was looking for, except I wanted a constraint and not an affinity. Note that a `constraint` works this way as well. I guess I was looking only at the ... Feel free to close this if you believe it's not a bug of any kind, or keep it open if you want for further work (i.e. if you feel the documentation and/or internals surrounding ... could be improved).
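For reference, the shape from that documentation page, adapted to a `constraint` as described, looks roughly like this (a sketch; `${device.ids}` and `set_contains` come from the linked Nomad docs, and the UUID is the one from the job spec above):

```hcl
device "nvidia/gpu" {
  count = 1

  # Hard requirement that this task land on one specific GPU, matched by
  # its device ID; swap constraint for affinity for a soft preference.
  constraint {
    attribute = "${device.ids}"
    operator  = "set_contains"
    value     = "GPU-684586d6-bed0-e6a7-78e2-bf784635fd1b"
  }
}
```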
Hi @BradyBonnette, looks like you don't need us anymore. I'm going to go ahead and close this issue; feel free to open it again if you feel like there is still something missing in the docs.
I am trying to run Nomad (v1.9.0) with this plugin (v1.1.0) in a multi-GPU setup (H100). The plugin is installed and runs as it should.
What I would like to do is create 8 separate tasks in Nomad (could be any number, but using 8 as an example) where each task gets a specific GPU and only that GPU. E.g.
Task 1 => GPU 0
Task 2 => GPU 1
Task 3 => GPU 2
...
Task 8 => GPU 7
According to the documentation, I can supply `NVIDIA_VISIBLE_DEVICES` as an `env {}` entry in the task, but doing so causes the GPU to be randomly selected instead of forcing the task to use that specific GPU. So for example, for a particular task, I would set `env { NVIDIA_VISIBLE_DEVICES = <GUID OF SPECIFIC DEVICE> }`, and each time the task was started, it would be placed on any of the other 7 devices randomly. I also tried setting a constraint, and that did not work either.
Is there something else I could try?
EDIT: Forgot to add that I traditionally accomplished this with the docker CLI + NVIDIA Container Toolkit by issuing something like `docker run --gpus '"device=0"' ...`.