When runc 1.2.x is used by rootless buildkit, runc spits the following warning (the reproducer can be taken from here: #5491):
Creating a rootless container with no cgroup and no private pid namespace. Such configuration is strongly discouraged (as it is impossible to properly kill all container's processes) and will result in an error in a future runc version.
To reiterate, such a config is unsound; here's why.
The lack of a private pid namespace means that runc cannot rely on the kernel's special handling of pid 1 (when a pid namespace's pid 1 is killed, all other processes in that pid namespace are also killed). In that case, in order to kill a container, runc needs to collect all the pids from the cgroup, freeze the cgroup (so no new forks happen), send SIGKILL to all processes, and thaw the cgroup (so that those processes can actually be killed). Or, if cgroup.kill is available, it is used (which basically tells the kernel to kill all processes in the cgroup and not allow any new forks).
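For illustration, the cgroup-based kill sequence described above could be sketched roughly as follows. This is a minimal Python sketch assuming a cgroup v2 directory layout (`cgroup.kill`, `cgroup.freeze`, `cgroup.procs`), not runc's actual implementation. The function name `kill_cgroup` and the injectable `send_signal` parameter are hypothetical. The sketch reads the pid list after freezing, which is an ordering choice of this example, and it does not wait for the freeze to complete the way real code would have to.

```python
import os
import signal
from pathlib import Path


def kill_cgroup(cgroup_dir, send_signal=os.kill):
    """Kill every process in a cgroup v2 directory (hypothetical helper).

    Returns the list of pids that were signalled manually, or an empty
    list when the kernel did the work via cgroup.kill.
    """
    cg = Path(cgroup_dir)

    kill_file = cg / "cgroup.kill"
    if kill_file.exists():
        # Preferred path: ask the kernel to kill the whole cgroup.
        kill_file.write_text("1")
        return []

    freeze_file = cg / "cgroup.freeze"
    # Freeze first so that no process can fork while we walk the list.
    # Simplification: real code must wait until the cgroup is actually frozen.
    freeze_file.write_text("1")
    try:
        pids = [int(p) for p in (cg / "cgroup.procs").read_text().split()]
        for pid in pids:
            try:
                send_signal(pid, signal.SIGKILL)
            except ProcessLookupError:
                # The process exited on its own; nothing left to kill.
                pass
    finally:
        # Thaw so the pending SIGKILLs can actually be delivered.
        freeze_file.write_text("0")
    return pids
```

Without a cgroup there is no `cgroup.procs` to enumerate, and without a private pid namespace there is no pid 1 whose death takes the rest down, so neither path of this sketch has anything to work with.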
Now, if runc has neither a private pid namespace nor a cgroup for a container, there is no way to properly stop the container (that is, to kill all of its processes).
Can you please fix this, or suggest how runc should handle such a case? I mean, I can think of other ways to figure out which processes belong to the container, but they are all rather expensive and not race-free.
By adding `--oci-worker-no-process-sandbox` to the `buildkitd` arguments, BuildKit can be executed in a container without adding `--privileged` to `docker run` arguments.
However, you still need to pass `--security-opt seccomp=unconfined --security-opt apparmor=unconfined` to `docker run`.
Note that `--oci-worker-no-process-sandbox` allows build executor containers to `kill` (and potentially `ptrace` depending on the seccomp configuration) an arbitrary process in the BuildKit daemon container.
To allow running rootless `buildkitd` without `--oci-worker-no-process-sandbox`, run `docker run` with `--security-opt systempaths=unconfined`. (For Kubernetes, set `securityContext.procMount` to `Unmasked`.)
The `--security-opt systempaths=unconfined` flag disables the masks for the `/proc` mount in the container and potentially allows reading and writing dangerous kernel files, but it is safe when you are running `buildkitd` as non-root.
This was required to allow running the moby/buildkit image without --privileged, by eliminating the need to mount /proc for the executor containers (RUN instruction containers).
Since we now have the systempaths=unconfined option in both Kube and Docker, we can probably deprecate --oci-worker-no-process-sandbox in favor of it.
But I'm not sure whether systempaths=unconfined is really more secure than --privileged.