the NVIDIA gpu module blacklisting algorithm is not specific enough #379
Labels
bug
Something isn't working
Comments
Thank you for reporting us your feedback! The internal ticket has been created: https://warthogs.atlassian.net/browse/SOLENG-998.
ok after some more discussions the proposal here would be to:
aieri added a commit to aieri/hardware-observer-operator that referenced this issue on Dec 18, 2024
While we initially provided automatic installation of the NVIDIA driver as a convenience, we then ran into the complexity of dealing with users wanting to configure pci-passthrough and/or vgpu, and to possibly move across these configurations post-deployment (see canonical#362, canonical#379). After some more discussions, we agreed that deploying a gpu driver is not the responsibility of hardware-observer, but rather of the principal charm that needs to use the gpu (e.g. nova or kubernetes-worker). This commit therefore drops the functionality of automatically installing the driver and determining if it has been blacklisted for a simpler workflow of installing DCGM only if a driver is found to have been installed and loaded. Fixes: canonical#379
aieri added a commit that referenced this issue on Dec 19, 2024
While we initially provided automatic installation of the NVIDIA driver as a convenience, we then ran into the complexity of dealing with users wanting to configure pci-passthrough and/or vgpu, and to possibly move across these configurations post-deployment (see #362, #379). After some more discussions, we agreed that deploying a gpu driver is not the responsibility of hardware-observer, but rather of the principal charm that needs to use the gpu (e.g. nova or kubernetes-worker). This commit therefore drops the functionality of automatically installing the driver and determining if it has been blacklisted for a simpler workflow of installing DCGM only if a driver is found to have been installed and loaded. Fixes: #379
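The simplified workflow described in the commit message (install DCGM only when a driver is already installed and loaded) could be sketched roughly as below. This is an illustration only, not the code from the commit: the function names are hypothetical, and it assumes that "loaded" means the nvidia kernel module is listed in /proc/modules.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the module names listed in /proc/modules-style text.

    Each non-empty line starts with the module name, followed by size,
    reference count, dependants, state and load address.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def nvidia_driver_loaded(proc_modules_text):
    """True only if the exact module name 'nvidia' is loaded.

    An exact match is used so that related modules such as nvidiafb
    are not mistaken for the GPU driver itself.
    """
    return "nvidia" in loaded_modules(proc_modules_text)


def should_install_dcgm(proc_modules_path="/proc/modules"):
    """Hypothetical gate: install DCGM only when the driver is loaded."""
    path = Path(proc_modules_path)
    if not path.exists():
        # No module list available (e.g. not a Linux host): skip DCGM.
        return False
    return nvidia_driver_loaded(path.read_text())
```

With this shape the charm no longer needs to reason about blacklisting at all: if the principal charm has not installed and loaded a driver, DCGM is simply not deployed.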
In revision 146/147 we improved the NVIDIA gpu detection algorithm by checking for module blacklisting, and avoiding deploying dcgm in that case (see #363).
The check is too generic though, as nvidiafb is always blacklisted via /etc/modprobe.d/blacklist-framebuffer.conf, which is shipped by default via the kmod deb package.
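To illustrate what a more specific check might look like, here is a minimal sketch that matches blacklist entries by exact module name, so that an nvidiafb entry does not count as the nvidia driver being blacklisted. The function names are hypothetical and this is not the charm's actual implementation; it assumes the usual modprobe.d syntax of `blacklist <module>` lines, and it strips anything after a `#` as a simplification.

```python
import re
from pathlib import Path

# Matches lines of the form "blacklist <module>" in modprobe.d files.
BLACKLIST_RE = re.compile(r"^\s*blacklist\s+(\S+)")


def blacklisted_modules(conf_text):
    """Return the set of module names blacklisted in one config file."""
    modules = set()
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments (simplification)
        match = BLACKLIST_RE.match(line)
        if match:
            # modprobe treats '-' and '_' in module names as equivalent.
            modules.add(match.group(1).replace("-", "_"))
    return modules


def is_driver_blacklisted(conf_texts, driver="nvidia"):
    """True only if the exact driver module name is blacklisted."""
    return any(driver in blacklisted_modules(text) for text in conf_texts)


def read_modprobe_confs(directory="/etc/modprobe.d"):
    """Read every *.conf file in the given modprobe.d directory."""
    return [
        path.read_text(errors="replace")
        for path in sorted(Path(directory).glob("*.conf"))
    ]
```

For example, `is_driver_blacklisted(["blacklist nvidiafb\n"])` returns `False`, whereas a substring-style check for "nvidia" in the same text would report the driver as blacklisted.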