Using runc with no pid namespace nor cgroups #5512

Open
kolyshkin opened this issue Nov 12, 2024 · 2 comments
Labels: area/rootless (rootless mode)

@kolyshkin
Contributor

When runc 1.2.x is used by rootless buildkit, runc emits the following warning (a reproducer is available in #5491):

Creating a rootless container with no cgroup and no private pid namespace. Such configuration is strongly discouraged (as it is impossible to properly kill all container's processes) and will result in an error in a future runc version.

To reiterate, such a config is unsound. Here's why.

The lack of a private pid namespace means that runc cannot rely on the kernel's special handling of pid 1 (when it is killed, all other processes in the same pid namespace are also killed). In that case, in order to kill a container, runc needs to freeze the cgroup (so that no new forks happen), collect all the pids from the cgroup, send SIGKILL to all of those processes, and thaw the cgroup (so that the processes can actually be killed). Or, if cgroup.kill is available, it is used instead (which basically tells the kernel to kill all processes in the cgroup and not allow any new forks).
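
The two kill paths described above can be sketched roughly as follows. This is an illustrative Python sketch against the cgroup v2 file interface (`cgroup.kill`, `cgroup.freeze`, `cgroup.procs`), not runc's actual code. The name `kill_cgroup` and its `send_signal` parameter are hypothetical, and the sketch assumes the freeze takes effect immediately, whereas a real implementation would have to wait for the cgroup to report itself frozen.

```python
import os
import signal
from pathlib import Path


def kill_cgroup(cgroup_dir, send_signal=os.kill):
    """Kill every process in a cgroup v2 directory (illustrative sketch).

    Returns the list of pids that were signalled by hand; the list is
    empty when the kernel did the work via cgroup.kill.
    `send_signal` is a hypothetical hook so the logic can be exercised
    without touching real processes.
    """
    cg = Path(cgroup_dir)

    kill_file = cg / "cgroup.kill"
    if kill_file.exists():
        # Preferred path: the kernel kills everything in the cgroup and
        # prevents new forks from escaping.
        kill_file.write_text("1")
        return []

    # Fallback path: freeze, signal every pid, then thaw.
    freeze_file = cg / "cgroup.freeze"
    freeze_file.write_text("1")  # assumption: no wait for the frozen state
    try:
        pids = [int(p) for p in (cg / "cgroup.procs").read_text().split()]
        for pid in pids:
            try:
                send_signal(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited before we got to it
    finally:
        # Thaw even on error, so that pending SIGKILLs can be delivered.
        freeze_file.write_text("0")
    return pids
```

Both paths need a cgroup directory to start from, which is exactly what is missing in the configuration discussed here.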

Now, if runc has neither a private pid namespace nor a cgroup for a container, there is no way to properly stop the container (that is, to kill all of its processes).
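
For illustration, the pid namespace half of this condition can be read off a container's OCI runtime config. The sketch below assumes the usual OCI layout (`linux.namespaces` entries with a `type` and an optional `path`) and treats a `pid` entry without a `path` as private. The function name is hypothetical, this is not runc's actual check, and the cgroup half cannot be decided from the config alone.

```python
def has_private_pid_namespace(spec):
    """Return True if the OCI runtime spec (a dict) asks for a new pid namespace.

    Assumption: a "pid" entry with no "path" means a fresh, private
    namespace; a "path" means joining an existing one; no entry at all
    means sharing the pid namespace of the runtime's caller.
    """
    namespaces = spec.get("linux", {}).get("namespaces", [])
    return any(
        ns.get("type") == "pid" and not ns.get("path")
        for ns in namespaces
    )
```

A config generated for `--oci-worker-no-process-sandbox` would presumably have no `pid` entry, so this would return `False` for it.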

Can you please fix this, or suggest how runc should handle such a case? I mean, I can think of other ways to figure out which processes belong to the container, but they are all rather expensive and not race-free.

@AkihiroSuda
Member

AkihiroSuda commented Nov 12, 2024

PIDNS is disabled only when buildkitd is executed with --oci-worker-no-process-sandbox:

#### About `--oci-worker-no-process-sandbox`
By adding `--oci-worker-no-process-sandbox` to the `buildkitd` arguments, BuildKit can be executed in a container without adding `--privileged` to `docker run` arguments.
However, you still need to pass `--security-opt seccomp=unconfined --security-opt apparmor=unconfined` to `docker run`.
Note that `--oci-worker-no-process-sandbox` allows build executor containers to `kill` (and potentially `ptrace` depending on the seccomp configuration) an arbitrary process in the BuildKit daemon container.
To allow running rootless `buildkitd` without `--oci-worker-no-process-sandbox`, run `docker run` with `--security-opt systempaths=unconfined`. (For Kubernetes, set `securityContext.procMount` to `Unmasked`.)
The `--security-opt systempaths=unconfined` flag disables the masks for the `/proc` mount in the container and potentially allows reading and writing dangerous kernel files, but it is safe when you are running `buildkitd` as non-root.

This was required to allow running the moby/buildkit image without --privileged, by eliminating the need to mount /proc for the executor containers (RUN instruction containers).

Since we now have the systempaths=unconfined option in both Kube and Docker, we can probably deprecate --oci-worker-no-process-sandbox in favor of it.
But I'm not sure whether systempaths=unconfined is really more secure than --privileged.

@AkihiroSuda
Member

BTW, having unkillable processes is not a serious concern for BuildKit, as long as the following conditions are satisfied:

Because all the RUN processes are killed on the death of the buildkitd process.
