Enable CUDA checkpointing with multiple processes #2470
 * before freezing; otherwise, the cuda-checkpoint tool may hang.
 */
if (opts.freeze_cgroup)
	ret = pause_devices_in_cgroup();
There is a race window here between pause_devices_in_cgroup and freeze_processes.
	run_plugins(PAUSE_DEVICES, pid);
	ptr = end_ptr;
} else {
	/* Move to the next line if the current line was invalid */
Report an error if the line is invalid.
}

/* Read cgroup.procs into a buffer */
while ((bytes_read = read(procs_fd, buffer, sizeof(buffer) - 1)) > 0) {
The last line in the buffer can be incomplete, and you need to handle that case properly.
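The two parsing comments above (incomplete trailing lines and invalid lines) could be addressed along the following lines. This is a minimal standalone sketch, not the code from the patch: `for_each_pid_in_fd`, `pid_cb_t`, `count_pid`, `struct pid_stats`, and the 64-byte buffer are hypothetical names and sizes chosen for illustration. In the patch, the per-PID action would presumably be the PAUSE_DEVICES hook.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-PID callback; stands in for running a plugin hook. */
typedef int (*pid_cb_t)(pid_t pid, void *arg);

/* Hypothetical helper state used by the example callback below. */
struct pid_stats {
	int count;
	pid_t last;
};

int count_pid(pid_t pid, void *arg)
{
	struct pid_stats *s = arg;

	s->count++;
	s->last = pid;
	return 0;
}

/* Parse one complete, NUL-terminated line as a positive PID. */
static int parse_pid_line(const char *line, pid_t *pid)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(line, &end, 10);
	if (errno || end == line || *end != '\0' || val <= 0)
		return -1;
	*pid = (pid_t)val;
	return 0;
}

static int handle_line(const char *line, pid_cb_t cb, void *arg)
{
	pid_t pid;

	if (parse_pid_line(line, &pid)) {
		/* Report invalid lines instead of silently skipping them. */
		fprintf(stderr, "Invalid line in procs file: '%s'\n", line);
		return -1;
	}
	return cb(pid, arg) ? -1 : 0;
}

/* Read newline-separated PIDs from fd, carrying partial lines across reads. */
int for_each_pid_in_fd(int fd, pid_cb_t cb, void *arg)
{
	char buf[64]; /* assumption: no single line reaches 63 bytes */
	size_t used = 0;
	ssize_t n;

	while ((n = read(fd, buf + used, sizeof(buf) - 1 - used)) > 0) {
		char *line = buf, *nl;

		used += (size_t)n;
		while ((nl = memchr(line, '\n', used - (size_t)(line - buf)))) {
			*nl = '\0';
			if (handle_line(line, cb, arg))
				return -1;
			line = nl + 1;
		}

		/* Keep the incomplete tail at the start of the buffer. */
		used -= (size_t)(line - buf);
		memmove(buf, line, used);
		if (used == sizeof(buf) - 1) {
			fprintf(stderr, "Line too long in procs file\n");
			return -1;
		}
	}
	if (n < 0) {
		perror("read");
		return -1;
	}
	if (used) {
		/* Final line without a trailing newline. */
		buf[used] = '\0';
		return handle_line(buf, cb, arg);
	}
	return 0;
}
```

The key point is that only text up to the last newline is parsed after each read; the remainder is moved to the front of the buffer and completed by the next read, so a PID split across two reads is never parsed as two different numbers.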
6043fc4 to 421e873
Any update?
When checkpointing a container, CRIU uses the cgroup path specified by the container runtime (e.g., runc) via the `--freeze-cgroup` option to pause all running processes in the container and obtain a consistent process tree. However, if the container has multiple processes with CUDA state, we need to "lock" these processes before freezing the cgroup; otherwise the cuda-checkpoint tool may hang.

To address this problem, this patch updates the collect_pstree function to run the CUDA plugin PAUSE_DEVICES hook for all processes in the container cgroup prior to freezing. In addition, this change introduces a mechanism to disable the use of freeze cgroups during process seizing, even if explicitly requested via the `--freeze-cgroup` option. The CUDA plugin is updated to utilize this new mechanism to ensure compatibility.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
421e873 to 76b2fa5
Closing in favour of #2475
When checkpointing a container with multiple CUDA processes, CRIU currently fails with the following error:
This error occurs because CRIU uses the cgroup specified by the container runtime to freeze all processes running in the container. However, if there are multiple processes that have CUDA state, we need to "lock" all these processes before freezing the cgroup; otherwise the `cuda-checkpoint` tool hangs.

To address this problem, this pull request updates the `collect_pstree` function to run the `PAUSE_DEVICES` hook for all processes in the container cgroup prior to freezing.

The error above can be replicated with the following example: