Nomad plugin nvidia-gpu does not detect multi-instance GPUs #3
Hi @aneutron! MIG wasn't available when the Nvidia driver was first developed, and I'll be honest and say it hasn't seen a lot of investment as we don't have too many users making feature requests. That being said, let's dig into this idea a bit...
When a Nomad agent is restarted, the workloads are left running. Ignoring Nomad for a moment: if we enable MIG while a workload has the device mounted, what's the expected behavior there? If workloads are expected to stay running, do we need to update the agent's fingerprint of the GPU with the MIG option whether or not we restart Nomad? Are there security implications to using MIG (above and beyond the usual security implications of exposing the GPU to workloads)? What does this look like outside of Docker with our other task drivers?
As an aside, as of the 1.2.0 release, which should be coming this week or so, the Nvidia device driver is externalized rather than being bundled with Nomad (see hashicorp/nomad#10796). I'm going to move this over to https://github.com/hashicorp/nomad-device-nvidia/issues, which is where all the Nvidia GPU device driver work is done these days.
Hey @tgross, sorry for putting the issue on the wrong project. I'm not an expert on all MIG / CUDA matters myself, but I can perhaps offer some points to help reason about this issue (I'm basing most of this on what I understood from the documentation on NVIDIA's page for MIG):
Sorry for the stale bump. With the cost and scarcity of A100s/H100s, this is starting to become an issue where it's getting harder to avoid needing k8s or cloud containers in order to have more than one job assigned to a single GPU (e.g. https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html, last updated 2023-06-10). I appreciate that my comment doesn't add any value or support to helping this become a reality, but I would greatly appreciate this being given another look, as it's extremely wasteful to need to use a full GPU for a task that only needs 1-10 GB.
This is also one of the bigger blockers for us, and I've decided to take a stab at it, at least to experiment internally, but I'd appreciate some forward guidance on whether this is something that can eventually be upstreamed. One thing this PR doesn't include at all is support for enabling/disabling MIG while there are workloads running, or changing the MIG mode dynamically before fingerprinting. I think that is an extremely rare occurrence that would complicate the initial implementation a lot, and the value this adds on its own is already very big for us. We are going to slowly roll this out internally to see if there are any edge cases we haven't fixed, but if anyone wants to take a look at it and let me know whether it's the right direction, that'd be super helpful!
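For anyone wanting to try this out, here's a rough sketch of the static workflow that limitation implies, assuming MIG is configured before the Nomad client starts; the GPU index, MIG profile, systemd unit name, and node ID are placeholders, not anything the PR prescribes:
```sh
# Sketch only: set up MIG before the client fingerprints devices.
sudo systemctl stop nomad                          # assumed unit name; make sure no GPU jobs are running
sudo nvidia-smi -i 3 -mig 1                        # enable MIG mode on GPU 3 (may require a GPU reset)
sudo nvidia-smi mig -i 3 -cgi 3g.20gb,3g.20gb -C   # carve out two GPU instances plus compute instances
sudo systemctl start nomad                         # client fingerprints the MIG devices on startup
nomad node status -verbose <node-id>               # check which GPU/MIG devices were fingerprinted
```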
Hey @isidentical, thanks for the PR. I just noticed that this slipped through the cracks and nobody has reviewed it yet. I'll bump it on our internal channels so somebody takes a look soon. Sorry for the delay!
I think treating the device as modal (GPU or MIG) is fine as long as we clearly document that behavior. Any operator capable of altering a node's configuration should be capable of draining it. It seems like we can consider "graceful migration" a future enhancement.
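For example, a rough sketch of that modal, drain-first operator workflow (placeholder node ID, and assuming a systemd-managed client):
```sh
# Drain the node, change the MIG layout, then let it take work again.
nomad node drain -enable <node-id>     # migrate workloads off the node first
# ...reconfigure MIG here (nvidia-smi -mig / nvidia-smi mig)...
sudo systemctl restart nomad           # agent re-fingerprints the new GPU/MIG layout
nomad node drain -disable <node-id>    # stop draining so the node can be scheduled again
```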
I appreciate seeing this get some more love! Just wanted to touch base following my comment above (this time last year) and say we're still looking forward to seeing this become a reality.
Nomad version
Nomad v1.2.0-dev (6d35e2fb58663aa2ad8b8f47459eff342901e72a)
(The nightly build that fixed hashicorp/nomad#11342)
Operating system and Environment details
Issue
Hello again!
First of all, thank you very much for the amazing response time and the quick fix for the preemption problem in hashicorp/nomad#11342.
While I was testing your product, I wondered about its compatibility with NVIDIA's Multi-Instance GPU (MIG) feature.
In a nutshell, it allows us to physically partition a big GPU into more bite-sized GPUs. That can be immensely useful for numerous use cases (e.g. hosting multiple jobs on the same chonkster of a GPU).
When MIG is enabled on a GPU, you cannot use that GPU itself as a resource any more (i.e. you can only use the MIG instances created on it). For example, in my setup we have 4 A100 GPUs, of which GPU 3 has MIG enabled. I've gone ahead and created 2 half-GPUs (basically), and the nvidia-smi -L listing shows them as MIG devices under GPU 3.
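The listing looks roughly like the following (illustrative only: the UUIDs are placeholders, and the A100-SXM4-40GB model string and 3g.20gb profile are assumed for the example rather than copied from the node):
```sh
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 3g.20gb Device 0: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 3g.20gb Device 1: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
```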
Yet when I run nomad node status {my_node_id}, this is the output of the GPU resources part:
Now this is problematic for two reasons:
Now, MIG instances are usable (almost as fully-fledged GPUs, or more specifically as CUDA devices), so they're perfectly compatible with Docker workloads using the NVIDIA runtime (e.g. instead of using --gpus='"device=0"' you'd use --gpus='"device=0:0"' to reference the first MIG instance on the first GPU).
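For example (the CUDA image tag here is arbitrary and the MIG UUID is a placeholder):
```sh
# Expose physical GPU 0 to a container:
docker run --rm --gpus '"device=0"' nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi -L
# Expose the first MIG instance on GPU 3 instead (gpu-index:mig-index form):
docker run --rm --gpus '"device=3:0"' nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi -L
# MIG instances can also be referenced by the UUID reported by nvidia-smi -L:
docker run --rm --gpus '"device=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"' nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi -L
```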
Reproduction steps
Expected Result
Actual Result
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)