Using runc with no pid namespace nor cgroups #5512

kolyshkin · 2024-11-12T01:57:12Z

When runc 1.2.x is used by rootless buildkit, runc emits the following warning (a reproducer is available in #5491):

Creating a rootless container with no cgroup and no private pid namespace. Such configuration is strongly discouraged (as it is impossible to properly kill all container's processes) and will result in an error in a future runc version.

To reiterate, such a configuration is unsound. Here is why.

The lack of a private pid namespace means that runc cannot rely on the kernel's special handling of pid 1 (when it is killed, all other processes in the same pid namespace are also killed). In that case, in order to kill a container, runc needs to collect all the pids from the cgroup, freeze the cgroup (so that no new forks happen), send SIGKILL to all processes, and thaw the cgroup (so that those processes can actually be killed). Or, if cgroup.kill is available, it is used instead (which basically tells the kernel to kill all processes in the cgroup and not allow any new forks).
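The two cgroup-based kill strategies described above can be sketched against the cgroup v2 interface files. This is only an illustration, not runc's actual implementation (which is written in Go): the function name `kill_cgroup`, its injectable `kill` parameter, and its return values are assumptions made up for the example, while `cgroup.kill`, `cgroup.freeze`, and `cgroup.procs` are the real cgroup v2 file names. The sketch also does not wait for the freeze to complete, which a real implementation would have to do.

```python
import os
import signal


def kill_cgroup(cgroup_dir, kill=os.kill):
    """Kill every process in a cgroup v2 directory (illustrative sketch).

    `kill` is injectable so the logic can be exercised without real processes.
    Returns the name of the strategy that was used.
    """
    kill_file = os.path.join(cgroup_dir, "cgroup.kill")
    if os.path.exists(kill_file):
        # Preferred path: the kernel kills everything in the cgroup itself.
        with open(kill_file, "w") as f:
            f.write("1")
        return "cgroup.kill"

    freeze_file = os.path.join(cgroup_dir, "cgroup.freeze")
    # Freeze first so that no process can fork while we walk the pid list.
    with open(freeze_file, "w") as f:
        f.write("1")
    try:
        with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
            pids = [int(line) for line in f if line.strip()]
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process is already gone
    finally:
        # Thaw so that the pending SIGKILLs can actually be delivered.
        with open(freeze_file, "w") as f:
            f.write("0")
    return "freeze"
```

Without a cgroup there is no `cgroup_dir` to pass in, and neither branch is available, which is exactly the problem this issue describes.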

Now, if runc has neither a private pid namespace nor a cgroup for a container, there is no way to properly stop the container (that is, to kill all of its processes).

Can you please fix this, or suggest how runc should handle such a case? I mean, I can think of other ways to figure out which processes belong to the container, but they are all rather expensive and not race-free.
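The comment does not name those other ways. One that comes to mind, purely as an assumption for illustration, is to snapshot `/proc` and walk the parent/child tree from the container's init process. The helper names below (`read_ppid`, `snapshot_parents`, `descendants`) are hypothetical. The sketch shows why such an approach is both expensive and racy: it reads one `stat` file per process on the system, processes can fork or exit between the snapshot and the kill, and orphans that were reparented away from the container's init are missed entirely.

```python
import os


def read_ppid(pid, proc_root="/proc"):
    """Return the parent pid recorded in <proc_root>/<pid>/stat."""
    with open(os.path.join(proc_root, str(pid), "stat")) as f:
        data = f.read()
    # Field 2 (comm) is parenthesised and may itself contain spaces or ')',
    # so split only what follows the last ')'.
    rest = data[data.rfind(")") + 2:].split()
    return int(rest[1])  # rest[0] is the state, rest[1] is the ppid


def snapshot_parents(proc_root="/proc"):
    """Map every visible pid to its parent pid (one file read per process)."""
    parents = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            parents[int(name)] = read_ppid(name, proc_root)
        except (FileNotFoundError, ProcessLookupError):
            continue  # the process exited while we were scanning
    return parents


def descendants(root_pid, parents):
    """Return all pids whose ancestor chain in the snapshot includes root_pid."""
    children = {}
    for pid, ppid in parents.items():
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return sorted(found)
```

A call such as `descendants(init_pid, snapshot_parents())` only reflects the moment of the snapshot, so anything forked afterwards escapes the kill.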

AkihiroSuda · 2024-11-12T03:01:56Z

PIDNS is disabled only when buildkitd is executed with --oci-worker-no-process-sandbox:

buildkit/docs/rootless.md

Lines 71 to 80 in 0655923

    
    #### About `--oci-worker-no-process-sandbox`

    By adding `--oci-worker-no-process-sandbox` to the `buildkitd` arguments, BuildKit can be executed in a container without adding `--privileged` to `docker run` arguments.

    However, you still need to pass `--security-opt seccomp=unconfined --security-opt apparmor=unconfined` to `docker run`.

    Note that `--oci-worker-no-process-sandbox` allows build executor containers to `kill` (and potentially `ptrace` depending on the seccomp configuration) an arbitrary process in the BuildKit daemon container.

    To allow running rootless `buildkitd` without `--oci-worker-no-process-sandbox`, run `docker run` with `--security-opt systempaths=unconfined`. (For Kubernetes, set `securityContext.procMount` to `Unmasked`.)

    The `--security-opt systempaths=unconfined` flag disables the masks for the `/proc` mount in the container and potentially allows reading and writing dangerous kernel files, but it is safe when you are running `buildkitd` as non-root.

This was required to allow running the moby/buildkit image without --privileged, by eliminating the need to mount /proc for the executor containers (RUN instruction containers).

Now that we have the systempaths=unconfined option in both Kube and Docker, we can probably deprecate --oci-worker-no-process-sandbox in favor of it.
But I'm not sure if systempaths=unconfined is really more secure than --privileged.

AkihiroSuda · 2024-11-12T03:07:06Z

BTW, having unkillable processes is not a serious concern for BuildKit, as long as the following conditions are satisfied:

- the buildkitd process is containerized with PIDNS, and
- buildkitd is running in the (pseudo-)daemonless mode

This is because all the RUN processes are killed on the death of the buildkitd process.

tonistiigi assigned AkihiroSuda Nov 12, 2024

tonistiigi added the area/rootless rootless mode label Nov 12, 2024
