Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't install DCGM if the driver has been blacklisted #363

Merged
merged 7 commits into from
Dec 14, 2024

Conversation

aieri
Copy link
Contributor

@aieri aieri commented Dec 4, 2024

If the sysadmin wants to pass the gpu to a virtual instance via pci
passthrough, they will need to make the gpu unavailable to the host
system by blacklisting the kernel driver. On such a system DCGM would
not be able to function and should therefore not be deployed.

This commit makes the NVIDIA gpu verifier more strict by only marking
DCGM as an available tool if both an NVIDIA gpu is detected and the
kernel module is not blacklisted.

Fixes: #362

Testing setup

Given a server with an NVIDIA gpu, drivers installed, and modules blacklisted via modprobe:

$ cat /etc/modprobe.d/blacklist-nvidia.conf 
blacklist nvidia
blacklist nvidiafb
blacklist nouveau
blacklist nvidia_drm

Following the dev-environment.md file, set up a local controller deploy an older stable charm (e.g. rev 84), setting up resources if necessary.

Next pack the charm from this commit, scp it to the server, and install it:

$ juju refresh --switch ./hardware-observer_ubuntu-24.04-amd64_ubuntu-22.04-amd64_ubuntu-20.04-amd64.charm  hardware-observer

Expected result:

  • DCGM is not installed
  • the nvidia driver remains not loaded
  • charm remains in active idle

If the sysadmin wants to pass the gpu to a virtual instance via pci
passthrough, they will need to make the gpu unavailable to the host
system by blacklisting[0] the kernel driver. On such a system DCGM would
not be able to function and should therefore not be deployed.

This commit makes the NVIDIA gpu verifier more strict by only marking
DCGM as an available tool if both an NVIDIA gpu is detected *and* the
kernel module is not blacklisted.

Fixes: canonical#362

[0] https://wiki.debian.org/KernelModuleBlacklisting
@aieri aieri force-pushed the SOLENG-974-check-driver-blacklisting branch from 0868bce to f8d6bd7 Compare December 4, 2024 23:27
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jneo8 jneo8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the implementation!

Based on the code review (note: I haven’t tested it), I believe we’re at a good point to start adding unit tests.

src/hw_tools.py Outdated Show resolved Hide resolved
@aieri aieri marked this pull request as ready for review December 7, 2024 04:54
@aieri aieri requested a review from a team as a code owner December 7, 2024 04:54
Copy link
Member

@gabrielcocenza gabrielcocenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small suggestion, but LGTM.

tests/unit/test_hw_tools.py Show resolved Hide resolved
tests/unit/test_hw_tools.py Show resolved Hide resolved
Copy link
Contributor

@jneo8 jneo8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if we follow Gabriel's suggestion on unit test.

tests/unit/test_hw_tools.py Show resolved Hide resolved
tests/unit/test_hw_tools.py Show resolved Hide resolved
Copy link
Contributor

@Deezzir Deezzir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chanchiwai-ray chanchiwai-ray self-requested a review December 11, 2024 01:56
@aieri
Copy link
Contributor Author

aieri commented Dec 11, 2024

not merging yet because manual tests show that updating to this charm version forces the driver to be installed and loaded, despite the blacklisting

@aieri
Copy link
Contributor Author

aieri commented Dec 12, 2024

one possible hacky way to handle the upgrade issue would be to do something like this in _on_install_or_upgrade:

if HWTool.DCGM in self.stored_tools and not nvidia_gpu_verifier():
    self.stored_tools.remove(HWTool.DCGM)

(basically a gpu-specific hardware redetection round)

@jneo8
Copy link
Contributor

jneo8 commented Dec 12, 2024

one possible hacky way to handle the upgrade issue would be to do something like this in _on_install_or_upgrade

Need to be very careful about this proposal. This somehow means run the detect function every time when hook is triggered.

not merging yet because manual tests show that updating to this charm version forces the driver to be installed and loaded, despite the blacklisting

I believe this issue occurs because the list of HWTool included the GPU prior to the charm upgrade.

It’s a bit of a chicken-and-egg problem—you’ll need a way to clean the local state to prevent this from happening.

New deployed unit won't encounter this.(If I am correct)

I have couple proposals:

  • Ask people to redeploy the juju unit. Since the origin issue is a edge case and new deployment won't encounter this after merging. It can be a lowest effort option for us.
  • Provide --clean-resource argument to the re-detect action, which will remove the unused resource on the machine. User encounter this issue can be simple fixed by running the action.
  • Run re-detect every time on install/upgrade hook, this may be lowest priority option since changing the life-cycle make it become more not stable.

@samuelallan72
Copy link
Contributor

Drive by comment: how expensive is it to detect the tools? Perhaps we can build the tools list on every charm hook, rather than using stored state?

@jneo8
Copy link
Contributor

jneo8 commented Dec 12, 2024

Drive by comment: how expensive is it to detect the tools? Perhaps we can build the tools list on every charm hook, rather than using stored state?

We did this in the past and encounter an issue that some unstable hardware just you different result every time you run re-detect. I will suggestion not dig into the same hole again.

@aieri
Copy link
Contributor Author

aieri commented Dec 14, 2024

ok, I've tested that at least the upgrade scenario from rev 84 (pre-dcgm) works fine. Further cleanups would require changing how and when we do hw redetection, which is a much bigger endeavor than this change

@aieri aieri merged commit af58ad1 into canonical:main Dec 14, 2024
10 checks passed
@aieri aieri deleted the SOLENG-974-check-driver-blacklisting branch December 14, 2024 00:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed to communicate with NVIDIA driver. See more details in the logs
7 participants