Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libcontainer: remove all mount logic from nsexec #3985

Merged
merged 8 commits into from
Dec 18, 2023
Merged

libcontainer: remove all mount logic from nsexec #3985

merged 8 commits into from
Dec 18, 2023

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Aug 17, 2023

Split out from #3953 to be reviewed separately.

Draft until #4124 is merged.

Draft until #4003 is merged.

Draft until #3967 is merged.


With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.

This allows us to remove the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic. The
one downside of this is that the bind sourcefd feature now depends on
Linux 5.4.
(EDIT: We just pass a regular file descriptor, so there's no
change)

In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow.

This also includes a fix for a fairly sneaky bug caused by commit 9c44407
("Open bind mount sources from the host userns") and further exacerbated by
commit fda12ab ("Support idmap mounts on volumes"), where opening the
mount sources long before we do any of the container rootfs mounts results in
undesirable behaviour with regards to mount propagation and unconventional
mount entries that use the container filesystem as the source. The solution is
to have a thread that joins the container mount namespace and does on-demand
generation of mount fds when requested.

Fixes: fda12ab ("Support idmap mounts on volumes")
Fixes: 9c44407 ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai [email protected]

@cyphar cyphar added this to the 1.2.0 milestone Aug 17, 2023
@cyphar

This comment was marked as resolved.

@cyphar

This comment was marked as resolved.

@cyphar

This comment was marked as resolved.

@cyphar

This comment was marked as resolved.

@cyphar

This comment was marked as resolved.

@cyphar cyphar marked this pull request as ready for review August 24, 2023 06:12
@cyphar

This comment was marked as resolved.

@kolyshkin
Copy link
Contributor

@cyphar two issues:

  1. squash! libcontainer: remove all mount logic from nsexec probably shouldn be renamed (or merged into other commits).
  2. Not sure what's wrong with Cirrus CI, but it doesn't want to clone the repo.

@cyphar
Copy link
Member Author

cyphar commented Aug 26, 2023

  1. Will do, I wanted to make sure the rewrite-of-the-rewrite worked before doing the squash.
  2. I pushed the branch to origin by accident and Cirrus has saved the origin branch name and won't re-fetch the PR (probably because the commit is the same). I think a force-push will fix it...

@rata
Copy link
Member

rata commented Aug 28, 2023

It seems @cyphar is on vacations for two weeks, does someone wants to picks this up? Would that be okay?

@cyphar
Copy link
Member Author

cyphar commented Aug 28, 2023

I will push the version that is rebased on top of #3967 (which fixes the test errors here). But if anything else is needed I can't really guarantee that I'll be able to get back to it before I get back in two weeks. Feel free to carry this if necessary @kolyshkin.

@rata
Copy link
Member

rata commented Aug 28, 2023

@cyphar can we carry your other open PRs forward too? (if it happens to be needed)

@cyphar
Copy link
Member Author

cyphar commented Aug 28, 2023

Yeah, of course.

@rata
Copy link
Member

rata commented Aug 29, 2023

@kolyshkin can you carry this forward, then?

@lifubang
Copy link
Member

It's so big that we need time to review and do some corner tests.

@@ -5,6 +5,8 @@ import (
)

func TestParseCgroups(t *testing.T) {
// We don't need to use /proc/thread-self here because runc always runs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could remove these comments in this file and other files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, though the unfortunate thing is that knowing whether to use thread-self or not is a little bit delicate and so the comments (while repetitive) are kind of important.

@lifubang
Copy link
Member

I think there are many refactor codes not about mount logic, I think we can split this PR to some small PRs, then it will be easy to review. I will do it ASAP, if someone wants to do it, it will be appreciated.

@cyphar
Copy link
Member Author

cyphar commented Aug 31, 2023

Thanks @lifubang! Just be aware that while it would be nice to cherry-pick the thread-self patch, when I tried it became quite ugly to disentangle it from the mount changes.

@lifubang
Copy link
Member

Part I: #4003

@lifubang
Copy link
Member

when I tried it became quite ugly to disentangle it from the mount changes.

After #4003 merged, I'll see whether it's worth to do or not.

// We don't need to use /proc/thread-self here because the exe mm of a
// thread-group is guaranteed to be the same for all threads by definition.
// This lets us avoid having to do runtime.LockOSThread.
return os.StartProcess("/proc/self/exe", []string{"runc", "--help"}, &os.ProcAttr{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can’t be just /bin/true ? (when it exists)

Copy link
Member Author

@cyphar cyphar Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path doesn't actually matter, we never call execve(2). We can't use "" or some other dummy path because the Go runtime checks that the path exists and is executable. I would prefer to not add code to spawn a random external executable (even if we don't actually execute it), and if /bin/true doesn't exist (or is not executable for some reason) we would need a fallback anyway.

Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please consider to keep json field names consistent with Go variable names

@cyphar
Copy link
Member Author

cyphar commented Dec 13, 2023

@AkihiroSuda I'm fixing them up now 😸.

With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.

However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complicated (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.

This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).

In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/userns
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkSpawnProc
  BenchmarkSpawnProc-8        1670            770065 ns/op

Fixes: fda12ab ("Support idmap mounts on volumes")
Fixes: 9c44407 ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <[email protected]>
With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we
can now handle arbitrary mappings without issue, so remove the primary
artificial limit of mappings (must use the same mapping as the
container's userns) and add some tests.

We still only support idmap mounts for bind-mounts because configuring
mappings for other filesystems would require switching our entire mount
machinery to the new mount API. The current design would easily allow
for this but we would need to convert new mount options entirely to the
fsopen/fsconfig/fsmount API. This can be done in the future.

Signed-off-by: Aleksa Sarai <[email protected]>
The primary change is a switch to using /proc/thread-self, which is
needed for when we add a CLONE_FS thread to runc.

Signed-off-by: Aleksa Sarai <[email protected]>
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.

In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:

 * Checks that certain namespaces are supported by stat(2)ing
   /proc/self/ns/...

 * /proc/self/exe and /proc/self/cmdline are not thread-local.

 * While threads can be in different cgroups, we do not do this for the
   runc binary (or libcontainer) and thus we do not need to switch to
   the thread-local version of /proc/self/cgroups.

 * All of the CLONE_NEWUSER files are not thread-local because you
   cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
   is blocked for multi-threaded programs).

Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.

In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.

Yes this is all pretty ugly. I also wish it wasn't necessary.

Signed-off-by: Aleksa Sarai <[email protected]>
Our previous test for whether we can mount on top of /proc incorrectly
assumed that it would only be called with bind-mount sources. This meant
that having a non bind-mount entry for a pseudo-filesystem (like
overlayfs) with a dummy source set to /proc on the host would let you
bypass the check, which could easily lead to security issues.

In addition, the check should be applied more uniformly to all mount
types, so fix that as well. And add some tests for some of the tricky
cases to make sure we protect against them properly.

Fixes: 331692b ("Only allow proc mount if it is procfs")
Signed-off-by: Aleksa Sarai <[email protected]>
If a user specifies a configuration like "rro, rrw", we should have
similar behaviour to "ro, rw" where we clear the previous flags so that
the last specified flag takes precendence.

Fixes: 382eba4 ("Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...)")
Signed-off-by: Aleksa Sarai <[email protected]>
ridmap indicates that the id mapping should be applied recursively (only
really relevant for rbind mount entries), and idmap indicates that it
should not be applied recursively (the default). If no mappings are
specified for the mount, we use the userns configuration of the
container. This matches the behaviour in the currently-unreleased
runtime-spec.

This includes a minor change to the state.json serialisation format, but
because there has been no released version of runc with commit
fbf183c ("Add uid and gid mappings to mounts"), we can safely make
this change without affecting running containers. Doing it this way
makes it much easier to handle m.IsIDMapped() and indicating that a
mapping has been specified.

Signed-off-by: Aleksa Sarai <[email protected]>
Our previous implementation of idmapped mounts and bind-mount sources
would open all of the source paths before we did any mounts, meaning
that mounts using sources from inside the container rootfs would not be
correct.

This has been fixed with the new on-demand system, and so add some
regression tests.

Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar
Copy link
Member Author

cyphar commented Dec 14, 2023

@kolyshkin Did you want to re-review this, or can we merge it as-is?

@lifubang lifubang merged commit 371ff9c into opencontainers:main Dec 18, 2023
45 checks passed
@cyphar cyphar deleted the idmap-generic branch December 18, 2023 05:13
@cyphar cyphar mentioned this pull request Mar 14, 2024
@lifubang
Copy link
Member

lifubang commented Jul 3, 2024

// If m is nil, don't stat the filesystem. This is used for restore of a checkpoint.

  1. Maybe here it should be: If m.Source is nil......
  2. And maybe this line can be removed, because it seems like we don't care about it anymore.

@cyphar
Copy link
Member Author

cyphar commented Jul 4, 2024

@lifubang Yeah, that comment was obsoleted by cdff09a ("rootfs; fix 'can we mount on top of /proc' check") but I didn't notice this until I did the backport in #4334.

(We can remove it, hopefully this didn't break anything.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants