Don't install DCGM if the driver has been blacklisted #363
Conversation
Force-pushed from 0868bce to f8d6bd7
Thank you for the implementation!
Based on the code review (note: I haven’t tested it), I believe we’re at a good point to start adding unit tests.
Small suggestion, but LGTM.
LGTM if we follow Gabriel's suggestion on unit test.
LGTM
not merging yet because manual tests show that updating to this charm version forces the driver to be installed and loaded, despite the blacklisting
one possible hacky way to handle the upgrade issue would be to do something like this in
(basically a gpu-specific hardware redetection round)
We need to be very careful about this proposal. It essentially means running the detect function every time a hook is triggered.
I believe this issue occurs because the list of HWTool included the GPU prior to the charm upgrade. It's a bit of a chicken-and-egg problem: you'll need a way to clean the local state to prevent this from happening. Newly deployed units won't encounter this (if I'm correct). I have a couple of proposals:
Drive-by comment: how expensive is it to detect the tools? Perhaps we could build the tools list on every charm hook, rather than using stored state?
We did this in the past and encountered an issue where some unstable hardware gives you different results every time you re-run detection. I would suggest not digging into the same hole again.
OK, I've tested that at least the upgrade scenario from rev 84 (pre-DCGM) works fine. Further cleanups would require changing how and when we do hardware redetection, which is a much bigger endeavor than this change.
If the sysadmin wants to pass the gpu to a virtual instance via pci
passthrough, they will need to make the gpu unavailable to the host
system by blacklisting the kernel driver. On such a system DCGM would
not be able to function and should therefore not be deployed.
This commit makes the NVIDIA gpu verifier more strict by only marking
DCGM as an available tool if both an NVIDIA gpu is detected and the
kernel module is not blacklisted.
Fixes: #362
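The check described above could be sketched roughly as follows. This is a minimal illustration only, not the charm's actual code: the function names, and the assumption that blacklist entries live in `*.conf` files under `/etc/modprobe.d`, are mine.

```python
import re
from pathlib import Path


def is_module_blacklisted(module: str, modprobe_dir: str = "/etc/modprobe.d") -> bool:
    """Return True if any modprobe config file blacklists the given kernel module."""
    # A blacklist entry is a line of the form "blacklist <module>".
    pattern = re.compile(rf"^\s*blacklist\s+{re.escape(module)}\s*$")
    for conf in Path(modprobe_dir).glob("*.conf"):
        for line in conf.read_text().splitlines():
            if pattern.match(line):
                return True
    return False


def dcgm_available(nvidia_gpu_detected: bool) -> bool:
    # DCGM can only work when the GPU is visible to the host,
    # i.e. when the nvidia driver is not blacklisted.
    return nvidia_gpu_detected and not is_module_blacklisted("nvidia")
```

With this shape, a host prepared for PCI passthrough (driver blacklisted) would report DCGM as unavailable even when an NVIDIA GPU is physically present.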
Testing setup
Given a server with an NVIDIA gpu, drivers installed, and modules blacklisted via modprobe:
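For reference, blacklisting here means a config fragment like the following under `/etc/modprobe.d/` (the filename is just an example):

```
# /etc/modprobe.d/blacklist-nvidia.conf  (example filename)
blacklist nvidia
```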
Following the dev-environment.md file, set up a local controller and deploy an older stable charm (e.g. rev 84), setting up resources if necessary. Next, pack the charm from this commit, scp it to the server, and install it:
Expected result: