the NVIDIA gpu module blacklisting algorithm is not specific enough #379

aieri · 2024-12-17T17:17:53Z

In revisions 146/147 we improved the NVIDIA GPU detection algorithm by checking for module blacklisting and skipping the DCGM deployment in that case (see #363).

The check is too generic though, as nvidiafb is always blacklisted via /etc/modprobe.d/blacklist-framebuffer.conf, which is shipped by default via the kmod deb package.
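To illustrate the false positive, here is a minimal sketch, not the charm's actual code. It assumes the current check flags any blacklisted module whose name contains "nvidia"; the function names and the sample config content are hypothetical and only stand in for what modprobe.d might contain.

```python
# Illustrative sample only; a stand-in for a modprobe.d blacklist file.
SAMPLE_CONF = """\
# framebuffer drivers
blacklist nvidiafb
blacklist vesafb
"""


def blacklisted_modules(conf_text):
    """Return the module names listed in 'blacklist <module>' lines."""
    modules = set()
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        parts = line.split("#", 1)[0].split()
        if len(parts) == 2 and parts[0] == "blacklist":
            modules.add(parts[1])
    return modules


def generic_check(conf_text):
    """Assumed overly broad check: any module name containing 'nvidia'."""
    return any("nvidia" in name for name in blacklisted_modules(conf_text))


def specific_check(conf_text, driver_modules=("nvidia",)):
    """Narrower check: only exact matches on the driver module names count."""
    return bool(blacklisted_modules(conf_text) & set(driver_modules))
```

With the sample above, the broad check reports the driver as blacklisted because of nvidiafb, while the exact-name check does not.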


syncronize-issues-to-jira · 2024-12-17T17:18:02Z

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/SOLENG-998.

This message was autogenerated

aieri · 2024-12-17T19:00:45Z

OK, after some more discussions, the proposal here would be to:

- stop installing the NVIDIA driver via the hardware observer charm
- install DCGM only if a driver has been loaded
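The second point could look roughly like the sketch below. It is not taken from the charm: the function names are hypothetical, and it assumes that "a driver has been loaded" can be approximated by an exact `nvidia` entry in `/proc/modules`, where the first field of each line is the module name.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the names of loaded kernel modules (first field of each line)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def nvidia_driver_loaded(proc_modules_text):
    """True only for an exact 'nvidia' module, so nvidiafb does not match."""
    return "nvidia" in loaded_modules(proc_modules_text)


def should_install_dcgm(proc_modules_path="/proc/modules"):
    """Hypothetical gate: install DCGM only when the driver module is loaded."""
    path = Path(proc_modules_path)
    if not path.exists():
        return False
    return nvidia_driver_loaded(path.read_text())
```

Checking what is actually loaded sidesteps the blacklist parsing entirely, which fits the idea that the principal charm, not hardware-observer, owns the driver.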

While we initially provided automatic installation of the NVIDIA driver as a convenience, we then ran into the complexity of dealing with users wanting to configure pci-passthrough and/or vgpu, and to possibly move across these configurations post-deployment (see canonical#362, canonical#379). After some more discussions, we agreed that deploying a gpu driver is not the responsibility of hardware-observer, but rather of the principal charm that needs to use the gpu (e.g. nova or kubernetes-worker). This commit therefore drops the functionality of automatically installing the driver and determining if it has been blacklisted for a simpler workflow of installing DCGM only if a driver is found to have been installed and loaded. Fixes: canonical#379


aieri added the "bug" (Something isn't working) label Dec 17, 2024

aieri self-assigned this Dec 17, 2024

aieri mentioned this issue Dec 18, 2024

Stop managing the installation of the NVIDIA driver. #380

Merged

aieri closed this as completed in #380 Dec 19, 2024
