Don't install DCGM if the driver has been blacklisted #363

aieri · 2024-12-04T23:26:43Z

If the sysadmin wants to pass the GPU to a virtual instance via PCI
passthrough, they will need to make the GPU unavailable to the host
system by blacklisting the kernel driver. On such a system DCGM would
not be able to function and should therefore not be deployed.

This commit makes the NVIDIA GPU verifier stricter by only marking
DCGM as an available tool if both an NVIDIA GPU is detected and the
kernel module is not blacklisted.
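
As a rough illustration of the stricter check described above, the sketch below combines GPU detection with a blacklist lookup. It is not the code from src/hw_tools.py: the function names, the choice to match only the `nvidia` module, and passing file contents in as arguments are all assumptions made for this example. It looks at `blacklist` directives in modprobe configuration and at the `modprobe.blacklist=` and `module_blacklist=` kernel parameters.

```python
from typing import Iterable, Set

# Assumption for this sketch: only the core "nvidia" module is checked.
# The real verifier may match a different set of module names.
NVIDIA_MODULE = "nvidia"


def blacklisted_in_modprobe(lines: Iterable[str]) -> Set[str]:
    """Collect module names from `blacklist <module>` directives."""
    found = set()
    for line in lines:
        stripped = line.strip()
        # Blank lines and lines starting with '#' are ignored by modprobe.
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            found.add(parts[1])
    return found


def blacklisted_in_cmdline(cmdline: str) -> Set[str]:
    """Collect modules named in blacklist-related kernel parameters."""
    found = set()
    for param in cmdline.split():
        key, _, value = param.partition("=")
        if key in ("modprobe.blacklist", "module_blacklist"):
            found.update(name for name in value.split(",") if name)
    return found


def dcgm_available(gpu_detected: bool, modprobe_lines: Iterable[str], cmdline: str) -> bool:
    """DCGM is usable only if a GPU is present AND the driver is not blacklisted."""
    blocked = blacklisted_in_modprobe(modprobe_lines) | blacklisted_in_cmdline(cmdline)
    return gpu_detected and NVIDIA_MODULE not in blocked
```

On a real host the inputs would come from the files under /etc/modprobe.d/ and from /proc/cmdline; taking them as arguments keeps the logic easy to unit test.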

Fixes: #362

Testing setup

Given a server with an NVIDIA gpu, drivers installed, and modules blacklisted via modprobe:

$ cat /etc/modprobe.d/blacklist-nvidia.conf 
blacklist nvidia
blacklist nvidiafb
blacklist nouveau
blacklist nvidia_drm

Following the dev-environment.md file, set up a local controller and deploy an older stable charm (e.g. rev 84), setting up resources if necessary.

Next pack the charm from this commit, scp it to the server, and install it:

$ juju refresh --switch ./hardware-observer_ubuntu-24.04-amd64_ubuntu-22.04-amd64_ubuntu-20.04-amd64.charm  hardware-observer

Expected result:

DCGM is not installed.
The NVIDIA driver remains unloaded.
The charm remains in active/idle.
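
One way to check the "driver remains unloaded" expectation is to look at the kernel's list of loaded modules. The helper below is a minimal sketch, not part of the charm: the function names are invented here, and it assumes that any module named `nvidia` or starting with `nvidia_` counts as the driver being loaded. It takes the lines of /proc/modules as input, where the first whitespace-separated field of each line is the module name.

```python
from typing import Iterable, Set


def loaded_modules(proc_modules_lines: Iterable[str]) -> Set[str]:
    """Return the module names listed in /proc/modules-style lines."""
    return {line.split()[0] for line in proc_modules_lines if line.strip()}


def nvidia_loaded(proc_modules_lines: Iterable[str]) -> bool:
    """True if any module that looks like part of the NVIDIA driver is loaded."""
    return any(
        name == "nvidia" or name.startswith("nvidia_")
        for name in loaded_modules(proc_modules_lines)
    )


# Usage on a real host (not executed here):
#   with open("/proc/modules") as f:
#       print("nvidia loaded:", nvidia_loaded(f))
```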

If the sysadmin wants to pass the gpu to a virtual instance via pci passthrough, they will need to make the gpu unavailable to the host system by blacklisting[0] the kernel driver. On such a system DCGM would not be able to function and should therefore not be deployed.

This commit makes the NVIDIA gpu verifier more strict by only marking DCGM as an available tool if both an NVIDIA gpu is detected *and* the kernel module is not blacklisted.

Fixes: canonical#362

[0] https://wiki.debian.org/KernelModuleBlacklisting

jneo8

Thank you for the implementation!

Based on the code review (note: I haven’t tested it), I believe we’re at a good point to start adding unit tests.

gabrielcocenza

Small suggestion, but LGTM.

jneo8

LGTM if we follow Gabriel's suggestion on the unit tests.

Deezzir

LGTM

aieri · 2024-12-11T03:17:45Z

Not merging yet because manual tests show that updating to this charm version forces the driver to be installed and loaded, despite the blacklisting.

aieri · 2024-12-12T02:05:46Z

One possible hacky way to handle the upgrade issue would be to do something like this in _on_install_or_upgrade:

if HWTool.DCGM in self.stored_tools and not nvidia_gpu_verifier():
    self.stored_tools.remove(HWTool.DCGM)

(basically a gpu-specific hardware redetection round)
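
To make the effect of that snippet concrete, here is a self-contained sketch of the same pruning step. Everything in it is a stand-in: the `HWTool` enum below is a two-member placeholder for the charm's real enum (`OTHER_TOOL` is hypothetical), the verifier is passed in as a callable, and `prune_dcgm` is a name made up for this example rather than something in the charm.

```python
from enum import Enum
from typing import Callable, Set


class HWTool(str, Enum):
    """Placeholder for the charm's HWTool enum; only DCGM matters here."""

    DCGM = "dcgm"
    OTHER_TOOL = "other"  # hypothetical member, for illustration only


def prune_dcgm(stored_tools: Set[HWTool], nvidia_gpu_verifier: Callable[[], bool]) -> Set[HWTool]:
    """Drop DCGM from previously stored tools if the GPU check no longer passes."""
    if HWTool.DCGM in stored_tools and not nvidia_gpu_verifier():
        return stored_tools - {HWTool.DCGM}
    return set(stored_tools)


# A unit upgraded from a revision that stored DCGM, on a host where the
# driver is now blacklisted (the verifier returns False):
remaining = prune_dcgm({HWTool.DCGM, HWTool.OTHER_TOOL}, lambda: False)
```

Note that only DCGM is re-evaluated; every other stored tool is left untouched, which is what makes this a GPU-specific redetection rather than a full one.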

jneo8 · 2024-12-12T06:14:48Z

> one possible hacky way to handle the upgrade issue would be to do something like this in _on_install_or_upgrade

We need to be very careful with this proposal. It effectively means running the detect function every time the hook is triggered.

> not merging yet because manual tests show that updating to this charm version forces the driver to be installed and loaded, despite the blacklisting

I believe this issue occurs because the list of HWTool included the GPU prior to the charm upgrade.

It’s a bit of a chicken-and-egg problem: you’ll need a way to clean the local state to prevent this from happening.

Newly deployed units won't encounter this (if I am correct).

I have a couple of proposals:

1. Ask people to redeploy the juju unit. Since the original issue is an edge case and new deployments won't encounter it after merging, this can be the lowest-effort option for us.
2. Provide a --clean-resource argument to the re-detect action, which will remove the unused resources on the machine. Users who encounter this issue can simply fix it by running the action.
3. Run re-detect every time on the install/upgrade hook. This may be the lowest-priority option, since changing the life-cycle makes it less stable.
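
The second proposal boils down to a set difference between what is stored and what is detected now. The sketch below shows that idea only; `plan_redetect`, its `clean` parameter, and the tool names are hypothetical, and it assumes that without the flag a redetection never removes tools from the stored state, which the thread does not confirm.

```python
from typing import Iterable, List, Set, Tuple


def plan_redetect(stored: Iterable[str], detected: Iterable[str], clean: bool) -> Tuple[Set[str], List[str]]:
    """Return (tools to keep in stored state, stale tools to uninstall)."""
    stored_set, detected_set = set(stored), set(detected)
    if clean:
        # Tools that were stored earlier but are no longer detected.
        stale = sorted(stored_set - detected_set)
        return detected_set, stale
    # Assumed default behaviour: keep everything, remove nothing.
    return stored_set | detected_set, []


# A unit that stored "dcgm" before the driver was blacklisted:
keep, remove = plan_redetect({"dcgm", "ipmi"}, {"ipmi"}, clean=True)
```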

samuelallan72 · 2024-12-12T06:47:44Z

Drive by comment: how expensive is it to detect the tools? Perhaps we can build the tools list on every charm hook, rather than using stored state?

jneo8 · 2024-12-12T06:56:53Z

> Drive by comment: how expensive is it to detect the tools? Perhaps we can build the tools list on every charm hook, rather than using stored state?

We did this in the past and encountered an issue where some unstable hardware gives a different result every time you run re-detect. I suggest we don't dig into the same hole again.

aieri · 2024-12-14T00:38:56Z

OK, I've tested that at least the upgrade scenario from rev 84 (pre-DCGM) works fine. Further cleanups would require changing how and when we do hardware redetection, which is a much bigger endeavor than this change.

aieri force-pushed the SOLENG-974-check-driver-blacklisting branch from 0868bce to f8d6bd7 on December 4, 2024 23:27

Deezzir mentioned this pull request Dec 5, 2024

Fix and restructure functional tests #343

Merged

Deezzir requested changes Dec 5, 2024

aieri added 2 commits December 4, 2024 19:12

Fix generator chain, cleanup nesting via any()

1268463

Additionally check if blacklisting has happened via kernel parameters

7886500

jneo8 reviewed Dec 6, 2024

aieri and others added 2 commits December 6, 2024 20:34

Add unit tests and docstrings

9764b48

Merge branch 'main' into SOLENG-974-check-driver-blacklisting

cc0876c

aieri commented Dec 7, 2024

aieri marked this pull request as ready for review December 7, 2024 04:54

aieri requested a review from a team as a code owner December 7, 2024 04:54

aieri requested review from Vultaire, Pjack, samuelallan72, jneo8, gabrielcocenza, rgildein and sbparke December 7, 2024 04:54

gabrielcocenza approved these changes Dec 9, 2024

jneo8 approved these changes Dec 10, 2024

Deezzir approved these changes Dec 10, 2024

chanchiwai-ray self-requested a review December 11, 2024 01:56

Merge branch 'main' into SOLENG-974-check-driver-blacklisting

4724626

Merge branch 'main' into SOLENG-974-check-driver-blacklisting

59ec5f1

aieri merged commit af58ad1 into canonical:main Dec 14, 2024
10 checks passed

aieri deleted the SOLENG-974-check-driver-blacklisting branch December 14, 2024 00:39

aieri mentioned this pull request Dec 17, 2024

the NVIDIA gpu module blacklisting algorithm is not specific enough #379

Closed
