the NVIDIA gpu module blacklisting algorithm is not specific enough #379

Closed
aieri opened this issue Dec 17, 2024 · 2 comments · Fixed by #380

Comments

@aieri
Contributor

aieri commented Dec 17, 2024

In revision 146/147 we improved the NVIDIA GPU detection algorithm by checking for module blacklisting and skipping the DCGM deployment when a module is blacklisted (see #363).

The check is too generic, though: nvidiafb is always blacklisted via /etc/modprobe.d/blacklist-framebuffer.conf, which the kmod deb package ships by default.
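
To illustrate the problem, here is a minimal sketch in Python. It is not the charm's actual code: `generic_check` is a hypothetical stand-in for a check that treats any blacklisted module with "nvidia" in its name as a blacklisted driver, `specific_check` is one possible stricter alternative, and `SAMPLE` is an illustrative excerpt rather than the real file contents.

```python
def blacklisted_modules(conf_text):
    """Return the module names listed in `blacklist <module>` lines."""
    modules = set()
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            modules.add(parts[1])
    return modules


def generic_check(conf_text):
    """Hypothetical over-broad check: any blacklisted name containing 'nvidia'."""
    return any("nvidia" in name for name in blacklisted_modules(conf_text))


def specific_check(conf_text, driver_modules=frozenset({"nvidia"})):
    """Stricter check: only exact matches against the given driver module names."""
    return bool(blacklisted_modules(conf_text) & set(driver_modules))


# Illustrative excerpt only (assumed layout, not copied from the kmod package).
SAMPLE = """\
# framebuffer drivers
blacklist nvidiafb
blacklist vesafb
"""

if __name__ == "__main__":
    print("generic:", generic_check(SAMPLE))
    print("specific:", specific_check(SAMPLE))
```

With the sample above, the generic check reports a blacklisted NVIDIA module because of nvidiafb, while the exact-match check does not.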

@aieri aieri added the bug Something isn't working label Dec 17, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/SOLENG-998.

This message was autogenerated

@aieri aieri self-assigned this Dec 17, 2024
@aieri
Contributor Author

aieri commented Dec 17, 2024

OK, after some more discussion, the proposal here is to:

  • stop installing the nvidia driver via the hardware observer charm
  • install dcgm only if a driver has been loaded
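
A rough sketch of the second point, assuming the decision is based on whether a kernel module named `nvidia` appears in `/proc/modules`. The function names are hypothetical and the charm may detect a loaded driver differently.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the module names from /proc/modules-style text (first field per line)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def nvidia_driver_loaded(proc_modules_text):
    """True only if a module named exactly 'nvidia' is loaded."""
    return "nvidia" in loaded_modules(proc_modules_text)


def should_install_dcgm(proc_modules_path="/proc/modules"):
    """Install DCGM only when the NVIDIA driver module is currently loaded."""
    try:
        text = Path(proc_modules_path).read_text()
    except OSError:
        # If the file cannot be read, assume no driver is loaded.
        return False
    return nvidia_driver_loaded(text)


if __name__ == "__main__":
    print("install DCGM:", should_install_dcgm())
```

Matching the exact module name avoids the earlier problem: a loaded or blacklisted nvidiafb module does not affect the result.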

aieri added a commit to aieri/hardware-observer-operator that referenced this issue Dec 18, 2024
While we initially provided automatic installation of the NVIDIA driver
as a convenience, we then ran into the complexity of dealing with users
wanting to configure pci-passthrough and/or vgpu, and to possibly move
across these configurations post-deployment (see canonical#362, canonical#379).

After some more discussions, we agreed that deploying a gpu driver is
not the responsibility of hardware-observer, but rather of the principal charm
that needs to use the gpu (e.g. nova or kubernetes-worker).

This commit therefore drops the functionality of automatically
installing the driver and determining if it has been blacklisted, in
favor of a simpler workflow: DCGM is installed only if a driver is found
to have been installed and loaded.

Fixes: canonical#379
aieri added a commit that referenced this issue Dec 19, 2024