How would I ensure that a specific task gets a specific GPU? #56

Closed
BradyBonnette opened this issue Oct 17, 2024 · 3 comments

Labels
question Further information is requested

Comments


BradyBonnette commented Oct 17, 2024

I am trying to run Nomad (v1.9.0) with this plugin (v1.1.0) in a multi-GPU setup (H100). The plugin is installed and runs as it should.

What I would like to do is create 8 separate tasks in Nomad (could be any number, but using 8 as an example) where each task gets a specific GPU and only that GPU. E.g.

Task 1 => GPU 0
Task 2 => GPU 1
Task 3 => GPU 2
...
Task 8 => GPU 7

According to the documentation, I can supply NVIDIA_VISIBLE_DEVICES in an env {} block in the task, but doing so causes the GPU to be selected at random instead of forcing the task to use that specific GPU. For example, for a particular task I would set env { NVIDIA_VISIBLE_DEVICES = <GUID OF SPECIFIC DEVICE> }, and each time the task started it would be placed on any of the other 7 devices at random.
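
For reference, the env {} form I am describing looks roughly like this (the value is just a placeholder for one of the GUIDs reported by nvidia-smi -L):

task "mytask" {
  env {
    NVIDIA_VISIBLE_DEVICES = "<GUID OF SPECIFIC DEVICE>"
  }
}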

I also tried setting a constraint such as:

task "mytask" {
  resources {
    device "nvidia/gpu" {
      count = 1
      constraint {
        attribute = "${device.attr.uuid}"
        value = "<GUID>"
      }
    }
  }
}

and that did not work either.

Is there something else I could try?

EDIT: Forgot to add that I traditionally accomplished this with the docker cli + nvidia container toolkit by issuing something like docker run --gpus '"device=0"' ....


BradyBonnette commented Oct 18, 2024

Some findings of interest.

Following the instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html

Assume I have the following GUIDs according to nvidia-smi -L on the host (i.e. outside of containers):

UUID: GPU-684586d6-bed0-e6a7-78e2-bf784635fd1b
UUID: GPU-49c859a6-c61a-0d40-adc6-8b463b489b00
UUID: GPU-13df2a8b-06ec-1168-9d95-d8a058bca48b
UUID: GPU-97518ddc-a1f1-ebaf-e0e9-18653b33fccc
UUID: GPU-00c18080-96ba-7cca-7a2a-c20ef3db6911
UUID: GPU-db76219f-0eaf-bd49-439e-7d82ae8aa526
UUID: GPU-ca3d6363-ee91-b096-d9b3-3e150c5c134c
UUID: GPU-675af948-8174-344c-ab1e-15075292349d

If I run the container manually like this:

sudo docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=GPU-684586d6-bed0-e6a7-78e2-bf784635fd1b myimage /bin/bash

and then run nvidia-smi -L inside the container, I see:

app@0b14ff4e14ef:~$ nvidia-smi -L
GPU 0: (UUID: GPU-684586d6-bed0-e6a7-78e2-bf784635fd1b)

Manually running the same exact container again, but with a different GUID using:

sudo docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=GPU-ca3d6363-ee91-b096-d9b3-3e150c5c134c myimage /bin/bash

and then running nvidia-smi -L inside that running container yields:

GPU 0: (UUID: GPU-ca3d6363-ee91-b096-d9b3-3e150c5c134c)

The same works for any GUID I try. It seems that everything works as expected when run manually.

Now on the Nomad side.

Assume I have a task in my job set up in the following manner (other irrelevant stuff left out):

    task "llm-runner-1" {
      driver = "docker"
      config {
        image = "myimage"
        ports = ["api-1"]
        runtime = "nvidia"
      }
      service {
        name = "llm-runner-1"
        port = "api-1"
        provider = "nomad"
      }
      resources {
        cpu    = 3000
        memory = 3000
        device "nvidia/gpu" {
          count = 1
        }
      }

      env {
        NVIDIA_VISIBLE_DEVICES = "GPU-684586d6-bed0-e6a7-78e2-bf784635fd1b"
      }
    }

Then, opening an exec session from Nomad for that particular runner:

[screenshot of the exec session omitted]

What docker inspect shows for that same exact running container:

[... omitted ...]
"NOMAD_TASK_DIR=/local",
"NOMAD_TASK_NAME=llm-runner-1",
"NVIDIA_VISIBLE_DEVICES=GPU-db76219f-0eaf-bd49-439e-7d82ae8aa526",
[... omitted ...]
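
For anyone reproducing this, something along these lines pulls just that variable out of docker inspect (<container-id> is a placeholder):

sudo docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' <container-id> | grep NVIDIA_VISIBLE_DEVICES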

So I am not exactly sure where the wires are getting crossed. It does the same thing with or without specifying runtime = "nvidia".

BradyBonnette commented

🤦

I apologize. This might have been less of a bug and more of a documentation issue.

I discovered this: https://developer.hashicorp.com/nomad/docs/job-specification/device#affinity-towards-specific-gpu-devices and that was exactly what I was looking for, except I wanted a constraint and not an affinity. Note that constraint works this way as well.
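
For anyone else who lands here, a sketch of the constraint form of that example, assuming the ${device.ids} attribute and set_contains operator described in the linked Nomad docs (the UUID below is just one of mine from above):

device "nvidia/gpu" {
  count = 1
  # Pin the assigned GPU to a specific device UUID
  constraint {
    attribute = "${device.ids}"
    operator  = "set_contains"
    value     = "GPU-684586d6-bed0-e6a7-78e2-bf784635fd1b"
  }
}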

I guess I was only looking at the nomad-device-nvidia documentation and never thought to check other sources.

Feel free to close this if you believe it's not a bug of any kind, or keep it open for further work (i.e. if you feel the documentation and/or internals surrounding NVIDIA_VISIBLE_DEVICES are misleading).

Juanadelacuesta added the question (Further information is requested) label on Oct 23, 2024
Juanadelacuesta (Member) commented

Hi @BradyBonnette, it looks like you don't need us anymore. I'm going to go ahead and close this issue; feel free to open it again if you feel like there is still something missing in the docs.
