-
Notifications
You must be signed in to change notification settings - Fork 599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
criu: Add support for pidfds #2449
Conversation
acac25e
to
729891a
Compare
8f702f6
to
c455eaa
Compare
f0e574a
to
a5c9e56
Compare
@bsach64 The following blog post provides information on how to write good commit messages: https://cbea.ms/git-commit/ Similar to the Linux kernel, we use the commit history to make it easier for reviewers (and maintainers) to understand the problem that motivated particular patch and how this problem is being addressed. |
Hello @rst0git! |
cc230c5
to
e2746eb
Compare
I am having a hard time understanding why the Vagrant Fedora Rawhide and Vagrant Fedora Rawhide (No VDSO) fail on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Hey @bsach64, if you have a time and interest, we can discuss support for threaded pidfds. |
yeah, sorry, I just forgot to press an "approve" button earlier. As I was following a whole development process of this thing there is no doubt that it's approved/reviewed from my side :-) |
Sure Alex :) |
5578e26
to
e572f5b
Compare
56c8f7d
to
3b8efda
Compare
We only use the last pid from the list in NSpid entry (from /proc/<pid>/fdinfo/<pidfd>) while restoring pidfds. The last pid refers to the pid of the process in the most deeply nested pid namespace. Since CRIU does not currently support nested pid namespaces, this entry is the one we want. After Linux 6.9, inode numbers can be used to compare pidfds. pidfds referring to the same process will have the same inode numbers. We use inode numbers to restore pidfds that point to dead processes. Signed-off-by: Bhavik Sachdev <[email protected]>
3b8efda
to
7bb2ecd
Compare
LGTM. Good job. |
Process file descriptors (pidfds) were introduced to provide a stable handle on a process. They solve the problem of pid recycling. For a detailed explanation, see https://lwn.net/Articles/801319/ and http://www.corsix.org/content/what-is-a-pidfd Before Linux 6.9, anonymous inodes were used for the implementation of pidfds. So, we detect them in a fashion similiar to other fd types that use anonymous inodes by calling `readlink()`. After 6.9, pidfs (a file system for pidfds) was introduced. In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with 6.10. (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285) After this change, pidfs inodes have no file type in st_mode in userspace. We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9 Hence, check for pidfds occurs before the check for regular files. For pidfds that refer to dead processes, we lose the pid of the process as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1. So, we create a temporary process for each unique inode and open pidfds that refer to this process. After all pidfds have been opened we kill this temporary process. This commit does not include support for pidfds that point to a specific thread, i.e pidfds opened with `PIDFD_THREAD` flag. Fixes: checkpoint-restore#2258 Signed-off-by: Bhavik Sachdev <[email protected]>
Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same. Signed-off-by: Bhavik Sachdev <[email protected]>
Ensure `pidfd_send_signal()` syscall works as expected after C/R. Signed-off-by: Bhavik Sachdev <[email protected]>
Validate that pidfds can been used to send signals to different processes after C/R using the `pidfd_send_signal()` syscall. Signed-off-by: Bhavik Sachdev <[email protected]>
After, C/R of pidfds that point to dead processes their inodes might change. But if two pidfds point to same dead process they should continue to do so after C/R. This test ensures that this happens by calling `statx()` on pidfds after C/R and then comparing their inode numbers. Support for comparing pidfds by using `statx()` and inode numbers was introduced alongside pidfs. So if `f_type` of pidfd is not equal to `PID_FS_MAGIC` then we skip this test. signed-off-by: Bhavik Sachdev <[email protected]>
We get the read end of a pipe using `pidfd_getfd` and check if we can read from it after C/R. signed-off-by: Bhavik Sachdev <[email protected]>
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`. signed-off-by: Bhavik Sachdev <[email protected]>
@@ -54,6 +54,7 @@ TST_NOFILE := \ | |||
shm-mp \ | |||
ptrace_sig \ | |||
pidfd_self \ | |||
pidfd_dead \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this test fails in Fedora Rawhide:
======================= Run zdtm/static/pidfd_dead in h ========================
Start test
./pidfd_dead --pidfile=pidfd_dead.pid --outfile=pidfd_dead.out
Run criu dump
Run criu restore
=[log]=> dump/zdtm/static/pidfd_dead/168/1/restore.log
------------------------ grep Error ------------------------
b'(00.004951) 168: \t\t\tGoing to dup 1 into 2'
b'(00.004953) 168: \tReceive fd for 2'
b'(00.004961) 168: \t\tCreate fd for 3'
b'(00.005264) 168: \t\tCreate fd for 4'
b'(00.005486) 168: Error (criu/cr-restore.c:1261): 175 killed by signal 9: Killed'
b'(00.005504) Error (criu/cr-restore.c:2314): Restoring FAILED.'
------------------------ ERROR OVER ------------------------
############### Test zdtm/static/pidfd_dead FAIL at CRIU restore ###############
Test output: ================================
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to see why this happens!
Is there any way that I can replicate this test locally?
I did try running the tests on a VM with Fedora Rawhide 6.12.0-0.rc1.20241005git27cc6fdf7201.22.fc42.x86_64
where these tests end up passing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bsach64 There is a race condition between kill()
and waitpid()
in free_dead_pidfd()
.
You can replicate the error with the following change:
diff --git a/criu/pidfd.c b/criu/pidfd.c
index fdf5dec60..23062ae02 100644
--- a/criu/pidfd.c
+++ b/criu/pidfd.c
@@ -151,7 +151,7 @@ static int free_dead_pidfd(struct dead_pidfd *dead)
dead->pid);
goto err;
}
-
+ sleep(1);
if (waitpid(dead->pid, &status, 0) != dead->pid) {
pr_perror("Could not wait on temporary process with pid: %d",
dead->pid);
$ sudo python3 zdtm.py run -t zdtm/static/pidfd_dead
userns is supported
=== Run 1/1 ================ zdtm/static/pidfd_dead
====================== Run zdtm/static/pidfd_dead in uns =======================
Start test
./pidfd_dead --pidfile=pidfd_dead.pid --outfile=pidfd_dead.out
Run criu dump
Run criu restore
=[log]=> dump/zdtm/static/pidfd_dead/64/1/restore.log
------------------------ grep Error ------------------------
b'(00.016134) 1: No ipcns-sem-11.img image'
b'(00.020782) 1: net: Try to restore a link 10:1:lo'
b'(00.020820) 1: net: Restoring link lo type 1'
b'(00.024435) 1: net: \tRunning ip addr restore'
b'Error: ipv4: Address already assigned.'
b'Error: ipv6: address already assigned.'
b'(00.141528) 4: \t\tCreate fd for 3'
b'(00.141715) 1: `- render 1 iovs (0x7f8433c61000:8192...)'
b'(00.141790) 1: Restore via sigreturn'
b'(00.142590) 4: \t\tCreate fd for 4'
b'(00.143514) 4: Error (criu/cr-restore.c:1261): 18 killed by signal 9: Killed'
b'(00.143579) 4: Error (criu/pidfd.c:156): pidfd: Could not wait on temporary process with pid: 18: No child processes'
b'(00.143591) 4: Error (criu/pidfd.c:218): pidfd: Failed to delete dead_pidfd struct'
b"(00.143605) 4: Error (criu/pidfd.c:234): pidfd: Can't create pidfd 0x00000d NSpid: -1 flags: 0"
b'(00.143615) 4: Error (criu/files.c:1221): Unable to open fd=5 id=0xd'
b'(00.147752) uns: calling exit_usernsd (-1, 1)'
b'(00.147902) uns: daemon calls 0x479de0 (91, -1, 1)'
b'(00.147939) uns: `- daemon exits w/ 0'
b'(00.150476) uns: daemon stopped'
b'(00.150514) Error (criu/cr-restore.c:2314): Restoring FAILED.'
------------------------ ERROR OVER ------------------------
############### Test zdtm/static/pidfd_dead FAIL at CRIU restore ###############
Test output: ================================
<<< ================================
##################################### FAIL #####################################
We might need to add the temporary process PID (i.e., the processes created with create_tmp_process()
) in the list of task helpers (ta->helpers
). See collect_helper_pids()
in cr-restore.c
.
cc: @avagin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another solution might be to block SIGCHLD
?
See sysctl.c
line 271 https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/sysctl.c#L271
diff --git a/criu/pidfd.c b/criu/pidfd.c
index fdf5dec60..a63a9a4b8 100644
--- a/criu/pidfd.c
+++ b/criu/pidfd.c
@@ -27,6 +27,7 @@ struct dead_pidfd {
unsigned int ino;
int pid;
size_t count;
+ sigset_t oldmask;
mutex_t pidfd_lock;
struct hlist_node hash;
};
@@ -152,12 +153,14 @@ static int free_dead_pidfd(struct dead_pidfd *dead)
goto err;
}
+ sleep(1);
if (waitpid(dead->pid, &status, 0) != dead->pid) {
pr_perror("Could not wait on temporary process with pid: %d",
dead->pid);
goto err;
}
+ sigprocmask(SIG_SETMASK, &dead->oldmask, NULL);
if (!WIFSIGNALED(status)) {
pr_err("Expected temporary process to be terminated by a signal\n");
goto err;
@@ -180,6 +183,7 @@ static int open_one_pidfd(struct file_desc *d, int *new_fd)
{
struct pidfd_info *info;
struct dead_pidfd *dead = NULL;
+ sigset_t blockmask;
int pidfd;
info = container_of(d, struct pidfd_info, d);
@@ -199,6 +203,9 @@ static int open_one_pidfd(struct file_desc *d, int *new_fd)
BUG_ON(dead->count == 0);
dead->count--;
if (dead->pid == -1) {
+ sigemptyset(&blockmask);
+ sigaddset(&blockmask, SIGCHLD);
+ sigprocmask(SIG_BLOCK, &blockmask, &dead->oldmask);
dead->pid = create_tmp_process();
if (dead->pid < 0) {
mutex_unlock(&dead->pidfd_lock);
@@ -270,6 +277,7 @@ static int collect_one_pidfd(void *obj, ProtobufCMessage *msg, struct cr_img *i)
dead->ino = info->pidfe->ino;
dead->count = 1;
dead->pid = -1;
+ sigemptyset(&dead->oldmask);
mutex_init(&dead->pidfd_lock);
mutex_lock(dead_pidfd_hash_lock);
cc: @mihalicyn
A PR for support of process file descriptors.
Here's a article that talks about features of pidfds, which we might have to support (and test):
http://www.corsix.org/content/what-is-a-pidfd
To Do:
pid
ofpidfd
is part of the process tree we are dumping. So, that we can ensure it exists on restore.pidfds
that point to dead processes.If the process for which a pidfd is open dies, the Pid and NSPid entries change to -1. So, we don't have the Pid necessary to recreate the process. We cannot just create a random process, open a pidfd to it, and kill it since pidfds opened for a specific process have the same inode number used to compare them.
criu/include/magic.h
pidfd_getfd
Support for
PIDFD_THREAD
(allows creation of a pidfd that points to a specific thread) not a part of this PR.Known Limitations:
CRIU doesn't support nested PID namespaces. As such we cannot C/R pidfd's that point to PIDs in nested namespaces. If we introduce support for nested PID namespaces we will have to rework some of this code.
This is work being done as part of Google Summer of Code 2024 (https://summerofcode.withgoogle.com/programs/2024/projects/vgpqxIxx)
A huge thank you to Alex @mihalicyn for guiding me and making this a wonderful experience :)