
Allow /sbin/ldconfig (support for glibc extension for nvidia gpu-operator) #9339

Merged: 1 commit merged into siderolabs:main on Sep 20, 2024

Conversation

@jfroy (Contributor) commented Sep 18, 2024

Pull Request

What? (description)

This PR allows /sbin/ldconfig in extensions.

Why? (reasoning)

Allowing /sbin/ldconfig in extensions is a simple change that should have basically no bad side effects. It allows the NVIDIA container toolkit to work on Talos without modification.
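For illustration only, here is a minimal Go sketch of the kind of path allowlist this change relaxes. This is not the actual Talos validation code; the variable names and the /usr/local prefix rule are assumptions made for the example.

```go
// Hypothetical illustration, not the actual Talos validation code: the gist of
// the change is that the extension file allowlist now also accepts
// /sbin/ldconfig in addition to paths the validator already permitted.
package main

import (
	"fmt"
	"strings"
)

// allowedPrefixes and allowedFiles are illustrative stand-ins for the real
// allowlist used when validating system extension images.
var (
	allowedPrefixes = []string{"/usr/local/"}
	allowedFiles    = map[string]struct{}{
		"/sbin/ldconfig": {}, // newly permitted, needed by the glibc extension
	}
)

// pathAllowed reports whether an extension may ship a file at the given path.
func pathAllowed(path string) bool {
	if _, ok := allowedFiles[path]; ok {
		return true
	}

	for _, prefix := range allowedPrefixes {
		if strings.HasPrefix(path, prefix) {
			return true
		}
	}

	return false
}

func main() {
	fmt.Println(pathAllowed("/sbin/ldconfig"))        // true
	fmt.Println(pathAllowed("/usr/local/bin/nvidia")) // true
	fmt.Println(pathAllowed("/sbin/init"))            // false
}
```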

Original "Why?" that allowed containers with SYS_MODULE

The NVIDIA gpu-operator includes machinery to manage the driver on nodes via driver containers. These containers, usually provided by NVIDIA, include basically a giant shell script that will fetch, build, and load the kernel modules for the node's kernel. It will also unload these modules when the container terminates, and provide the userspace components by copying them to /run/nvidia/driver on the node.

More recently, driver containers have gained the ability to provide precompiled kernel modules on supported Linux distributions (Ubuntu today).
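As a rough picture of that flow, here is a sketch in Go. The real entrypoint is NVIDIA's shell script, not Go; the helper script path, module list, and directories below are illustrative assumptions, and the container would need CAP_SYS_MODULE plus a host mount at /run/nvidia/driver for any of this to work.

```go
// Illustrative translation of the driver-container flow described above.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// run executes a command, forwarding its output, and returns any error.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// 1. Build the kernel modules for the node's kernel, or unpack a
	//    precompiled set when the distribution supports it.
	if err := run("/usr/local/bin/build-or-unpack-modules.sh"); err != nil { // hypothetical helper
		log.Fatalf("preparing modules: %v", err)
	}

	// 2. Load the modules (requires CAP_SYS_MODULE).
	for _, mod := range []string{"nvidia", "nvidia_uvm", "nvidia_modeset"} {
		if err := run("modprobe", mod); err != nil {
			log.Fatalf("loading %s: %v", mod, err)
		}
	}

	// 3. Copy the userspace components to the host-visible directory.
	if err := run("cp", "-a", "/usr/local/nvidia/.", "/run/nvidia/driver/"); err != nil {
		log.Fatalf("installing userspace components: %v", err)
	}

	// 4. Wait for termination, then unload the modules again.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	<-sigs

	for _, mod := range []string{"nvidia_modeset", "nvidia_uvm", "nvidia"} {
		_ = run("rmmod", mod)
	}
}
```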

The intention here is for SideroLabs to build their own driver containers and let the gpu-operator manage them. It is not a clear-cut better approach than the existing extensions, but it better aligns with NVIDIA's general direction for supporting GPUs in Kubernetes and provides more flexibility for cluster operators by decoupling NVIDIA driver releases from Talos releases (even with the recent addition of LTS and production extension flavors).

Of course, producing driver containers will either need to be coupled with kernel builds to use the ephemeral module signing key, or SideroLabs may change its approach and e.g. retain the signing key for each Talos release to decouple NVIDIA driver builds from kernel builds. Another approach would be to use SYSTEM_EXTRA_CERTIFICATE (see torvalds/linux@c4c3610) to allow Image Factory to insert a signing certificate provided by the user, who could then build their own driver containers.

Acceptance

Please use the following checklist:

  • you linked an issue (if applicable)
  • you included tests (if applicable)
  • you ran conformance (make conformance)
  • you formatted your code (make fmt)
  • you linted your code (make lint)
  • you generated documentation (make docs)
  • you ran unit-tests (make unit-tests)

See make help for a description of the available targets.

@frezbo (Member) commented Sep 18, 2024

Adding /sbin/ldconfig is all right.

But we don't want to change the sys_module capability. Does the gpu-operator have support for ignoring driver loading and letting Talos still load the drivers? Or could the drivers still be provided as container images (pre-compiled kernel modules), with Talos loading them to a place where the operator expects them? That would be much better than reducing default security.

@jfroy (Contributor, author) commented Sep 18, 2024

> Adding /sbin/ldconfig is all right.
>
> But we don't want to change the sys_module capability. Does the gpu-operator have support for ignoring driver loading and letting Talos still load the drivers? Or could the drivers still be provided as container images (pre-compiled kernel modules), with Talos loading them to a place where the operator expects them? That would be much better than reducing default security.

I think both of those options are possible with caveats. Driver containers really are glorified shell scripts. You can see my prototype here: https://github.com/jfroy/gpu-driver-container/tree/main/talos. So presumably a driver container for Talos could simply check for the nvidia modules and not support loading or unloading.
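A minimal sketch of that "check, don't load" idea, assuming the container can read the host's /proc/modules; this is illustrative and not code from the prototype linked above.

```go
// Detect whether the nvidia module is already loaded (e.g. by a Talos
// extension) by scanning /proc/modules, and skip module management if so.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// moduleLoaded reports whether a kernel module with the given name is loaded.
func moduleLoaded(name string) (bool, error) {
	f, err := os.Open("/proc/modules")
	if err != nil {
		return false, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		// Each line starts with the module name followed by size, refcount, etc.
		if fields := strings.Fields(s.Text()); len(fields) > 0 && fields[0] == name {
			return true, nil
		}
	}

	return false, s.Err()
}

func main() {
	loaded, err := moduleLoaded("nvidia")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	if loaded {
		fmt.Println("nvidia module already loaded; skipping driver management")
		return
	}

	fmt.Println("nvidia module not loaded")
}
```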

The caveat is that, depending on the user configuration, driver containers can load the modules with parameters. That would break, or simply not work, on Talos if driver containers do not manage the kernel modules.

But maybe that is an acceptable first step. One possible second step from there would be to add a machined API to load/unload kernel modules, and then Talos driver containers could be granted the rights to invoke that API.
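Purely as illustration of that second step, here is one hypothetical shape such a machined API could take; nothing here exists in Talos today, and all names are made up for the sketch.

```go
// Hypothetical only: a possible shape for a machined module-management API
// that an authorized driver container could call instead of holding
// CAP_SYS_MODULE itself.
package moduleapi

// LoadModuleRequest mirrors "modprobe <name> <key>=<value> ...".
type LoadModuleRequest struct {
	Name       string
	Parameters map[string]string
}

// ModuleService would be exposed by machined and gated by Talos API access
// controls.
type ModuleService interface {
	Load(req LoadModuleRequest) error
	Unload(name string) error
}
```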

@frezbo (Member) commented Sep 18, 2024

> But maybe that is an acceptable first step. One possible second step from there would be to add a machined API to load/unload kernel modules, and then Talos driver containers could be granted the rights to invoke that API.

So, Talos already supports loading modules with kernel parameters. With the WIP SELinux work, maybe we could at some point somehow label the driver containers so they could only load modules. We can start with the ldconfig change only at first and see how we proceed.

@jfroy (Contributor, author) commented Sep 18, 2024

> But maybe that is an acceptable first step. One possible second step from there would be to add a machined API to load/unload kernel modules, and then Talos driver containers could be granted the rights to invoke that API.

> So, Talos already supports loading modules with kernel parameters. With the WIP SELinux work, maybe we could at some point somehow label the driver containers so they could only load modules. We can start with the ldconfig change only at first and see how we proceed.

Let me test/prototype this first. There may be logic in the operator where if a node has the kernel modules, no driver container will be run on that node. Because the kernel and userspace drivers are coupled, this is a likely assumption. Maybe machined can spawn a static pod on nodes for the driver container, or maybe gpu-operator can be patched. There is a lot of logic in the operator to sequence things properly, but generally speaking loading the kernel modules is the first step, so hopefully this won't be too complex.

I've been prototyping with driver CRDs as well (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html), which allow dynamic and per-node drivers. This of course depends on the ability to load and unload kernel modules. From an NVIDIA and cluster operator point of view, this is a nice feature, especially for developer clusters where the driver may need to be changed or tuned. From SideroLabs' (and, I suppose, also a cluster operator's) point of view, this is kind of a security issue: it cuts against the promise that Talos is an immutable distribution/system, though I'd argue that there are already many ways a cluster operator can mutate, tune, or configure Talos dynamically.

@andrewrynhard (Member) commented

@jfroy since this would be a pretty big change for us philosophically, I would appreciate it if we could hop on a call to chat through this some more. Would you be open to that?

@jfroy (Contributor, author) commented Sep 18, 2024

> @jfroy since this would be a pretty big change for us philosophically, I would appreciate it if we could hop on a call to chat through this some more. Would you be open to that?

Sure, reach out by email.

@jfroy changed the title from "Nvida gpu operator support" to "Allow /sbin/ldconfig (support for glibc extension for nvidia gpu-operator)" on Sep 19, 2024
This is specifically for the glibc extension to support nvidia container
toolkit.

Signed-off-by: Jean-Francois Roy <[email protected]>
Signed-off-by: Noel Georgi <[email protected]>
@jfroy (Contributor, author) commented Sep 20, 2024

As a note (which I guess you probably know), the validator needs a module update to import the new package from talos.

@frezbo (Member) commented Sep 20, 2024

> As a note (which I guess you probably know), the validator needs a module update to import the new package from talos.

Yeah, the next PR should be to the validator.

@frezbo (Member) commented Sep 20, 2024

/m

@talos-bot merged commit 6484581 into siderolabs:main on Sep 20, 2024
50 checks passed
@jfroy deleted the nvida-gpu-operator-support branch on September 21, 2024 05:56