
Enable CUDA checkpointing with multiple processes #2470

Closed

Conversation

@rst0git (Member) commented on Aug 17, 2024

When checkpointing a container with multiple CUDA processes, CRIU currently fails with the following error:

cuda_plugin: pausing devices on pid 802521
Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
Error (cuda_plugin.c:141): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
Error (cuda_plugin.c:382): cuda_plugin: PAUSE_DEVICES failed with

This error occurs because CRIU uses the cgroup specified by the container runtime to freeze all processes running in the container. However, if there are multiple processes that have CUDA state, we need to "lock" all these processes before freezing the cgroup; otherwise the cuda-checkpoint tool hangs.

To address this problem, this pull request updates the collect_pstree function to run the PAUSE_DEVICES hook for all processes in the container cgroup prior to freezing.
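In outline, the new pause_devices_in_cgroup helper walks the cgroup.procs file of the freeze cgroup and runs the PAUSE_DEVICES hook for every PID it finds, before freeze_processes is called. The following is only a simplified sketch of that idea, not the code in this pull request: it uses stdio line parsing instead of the buffered read() loop the patch uses, and it assumes CRIU's internal run_plugins and pr_perror helpers.

/* Simplified sketch of the idea behind pause_devices_in_cgroup():
 * walk <freeze-cgroup>/cgroup.procs and run the PAUSE_DEVICES hook
 * for each task before the cgroup is frozen. */
static int pause_devices_in_cgroup(const char *freeze_cgroup)
{
	char path[PATH_MAX];
	FILE *f;
	int pid, ret = 0;

	snprintf(path, sizeof(path), "%s/cgroup.procs", freeze_cgroup);
	f = fopen(path, "r");
	if (!f) {
		pr_perror("Unable to open %s", path);
		return -1;
	}

	while (fscanf(f, "%d", &pid) == 1) {
		/* "Lock" the CUDA state of this task so that
		 * cuda-checkpoint does not hang after the freeze. */
		ret = run_plugins(PAUSE_DEVICES, pid);
		if (ret < 0)
			break;
	}

	fclose(f);
	return ret;
}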

The error above can be replicated with the following example:

echo -e "tcp-established\nghost-limit=100M" | sudo tee -a /etc/criu/runc.conf

sudo podman run -d --privileged --name=hpl --device nvidia.com/gpu=all --security-opt=label=disable \
        nvcr.io/nvidia/hpc-benchmarks:24.06 \
        mpirun --bind-to none -np 2 \
        ./hpl.sh --no-multinode --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-1GPU.dat

sudo podman container checkpoint -l -e /tmp/test.tar

 * before freezing; otherwise, the cuda-checkpoint tool may hang.
 */
if (opts.freeze_cgroup)
	ret = pause_devices_in_cgroup();

There is a race window between pause_devices_in_cgroup and freeze_processes.

criu/seize.c (outdated, resolved)
	run_plugins(PAUSE_DEVICES, pid);
	ptr = end_ptr;
} else {
	/* Move to the next line if the current line was invalid */

Report an error if the line is invalid.

}

/* Read cgroup.procs into a buffer */
while ((bytes_read = read(procs_fd, buffer, sizeof(buffer) - 1)) > 0) {

You must keep in mind that the last line in the buffer can be incomplete, and you need to handle that properly.
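Both review comments above point at the same parsing loop: a PID can straddle two read() calls, and a line that does not parse as a PID should be reported rather than skipped silently. The following is a hedged sketch of one way to handle that, not the code from this pull request (pr_err is CRIU's logging helper; the callback shape is an assumption).

/* Sketch only: parse cgroup.procs in fixed-size chunks while carrying an
 * incomplete trailing line over to the next read(). */
static int for_each_cgroup_pid(int procs_fd, int (*cb)(pid_t pid))
{
	char buffer[4096];
	size_t off = 0;
	ssize_t bytes_read;

	while ((bytes_read = read(procs_fd, buffer + off, sizeof(buffer) - off - 1)) > 0) {
		char *ptr = buffer, *nl;

		buffer[off + bytes_read] = '\0';

		/* Consume only complete lines (terminated by '\n'). */
		while ((nl = strchr(ptr, '\n'))) {
			char *end_ptr;
			long pid;

			*nl = '\0';
			pid = strtol(ptr, &end_ptr, 10);
			if (end_ptr == ptr || *end_ptr != '\0') {
				pr_err("Invalid line in cgroup.procs: %s\n", ptr);
				return -1;
			}
			if (cb((pid_t)pid))
				return -1;
			ptr = nl + 1;
		}

		/* Keep the incomplete tail for the next read(). */
		off = strlen(ptr);
		memmove(buffer, ptr, off);
	}

	return bytes_read < 0 ? -1 : 0;
}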

@avagin (Member) commented on Sep 5, 2024

Any update?

When checkpointing a container, CRIU uses the cgroup path specified by
the container runtime (e.g., runc) via the `--freeze-cgroup` option to
pause all running processes in the container and obtain a consistent
process tree. However, if the container has multiple processes with
CUDA state, we need to "lock" these processes before freezing the cgroup;
otherwise the cuda-checkpoint tool may hang.

To address this problem, this patch updates the collect_pstree function
to run the CUDA plugin PAUSE_DEVICES hook for all processes in the
container cgroup prior to freezing.

In addition, this change introduces a mechanism to disable the use of
freeze cgroups during process seizing, even if explicitly requested
via the --freeze-cgroup option.

The CUDA plugin is updated to utilize this new mechanism to ensure
compatibility.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
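The "mechanism to disable the use of freeze cgroups" mentioned in the commit message is not shown in this conversation, so the following is only an illustrative guess at its shape (all names here are assumptions, not necessarily what the follow-up change merged): a flag in criu/seize.c that a plugin can set from its init hook, consulted before the runtime-supplied freeze cgroup is used.

/* Illustrative sketch only; names are assumptions, not the merged API. */

#include <stdbool.h>

/* In criu/seize.c: a plugin can flip this flag during initialization to
 * opt out of cgroup-based freezing even when --freeze-cgroup was given. */
static bool freeze_cgroup_disabled;

void dont_use_freeze_cgroup(void)
{
	freeze_cgroup_disabled = true;
}

/* Consulted by the seize/freeze path to decide whether the freeze cgroup
 * supplied by the container runtime may be used. */
bool may_use_freeze_cgroup(const char *freeze_cgroup_path)
{
	return freeze_cgroup_path && !freeze_cgroup_disabled;
}

Under this assumption, the CUDA plugin would call dont_use_freeze_cgroup() from its initialization hook, which is what the commit message means by the plugin being updated to use the new mechanism.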
@rst0git (Member, Author) commented on Sep 12, 2024

Closing in favour of #2475
