
Enable CUDA checkpointing with multiple processes #2470

Closed

Conversation

@rst0git (Member) commented on Aug 17, 2024

When checkpointing a container with multiple CUDA processes, CRIU currently fails with the following error:

cuda_plugin: pausing devices on pid 802521
Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
Error (cuda_plugin.c:141): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
Error (cuda_plugin.c:382): cuda_plugin: PAUSE_DEVICES failed with

This error occurs because CRIU uses the cgroup specified by the container runtime to freeze all processes running in the container. However, if there are multiple processes that have CUDA state, we need to "lock" all these processes before freezing the cgroup; otherwise the cuda-checkpoint tool hangs.

To address this problem, this pull request updates the collect_pstree function to run the PAUSE_DEVICES hook for all processes in the container cgroup prior to freezing.
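In outline, the new pause_devices_in_cgroup helper walks the cgroup.procs file of the freeze cgroup and runs the PAUSE_DEVICES hook for every PID it finds, before freeze_processes is called. The following is only a simplified sketch of that idea, not the code in this pull request: it uses stdio line parsing instead of the buffered read() loop the patch uses, and it assumes CRIU's internal run_plugins and pr_perror helpers.

/* Simplified sketch of the idea behind pause_devices_in_cgroup():
 * walk <freeze-cgroup>/cgroup.procs and run the PAUSE_DEVICES hook
 * for each task before the cgroup is frozen. */
static int pause_devices_in_cgroup(const char *freeze_cgroup)
{
	char path[PATH_MAX];
	FILE *f;
	int pid, ret = 0;

	snprintf(path, sizeof(path), "%s/cgroup.procs", freeze_cgroup);
	f = fopen(path, "r");
	if (!f) {
		pr_perror("Unable to open %s", path);
		return -1;
	}

	while (fscanf(f, "%d", &pid) == 1) {
		/* "Lock" the CUDA state of this task so that
		 * cuda-checkpoint does not hang after the freeze. */
		ret = run_plugins(PAUSE_DEVICES, pid);
		if (ret < 0)
			break;
	}

	fclose(f);
	return ret;
}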

The error above can be replicated with the following example:

echo -e "tcp-established\nghost-limit=100M" | sudo tee -a /etc/criu/runc.conf

sudo podman run -d --privileged --name=hpl --device nvidia.com/gpu=all --security-opt=label=disable \
        nvcr.io/nvidia/hpc-benchmarks:24.06 \
        mpirun --bind-to none -np 2 \
        ./hpl.sh --no-multinode --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-1GPU.dat

sudo podman container checkpoint -l -e /tmp/test.tar

 * before freezing; otherwise, the cuda-checkpoint tool may hang.
 */
if (opts.freeze_cgroup)
	ret = pause_devices_in_cgroup();

There is a race window between pause_devices_in_cgroup and freeze_processes.

criu/seize.c (outdated, resolved)
	run_plugins(PAUSE_DEVICES, pid);
	ptr = end_ptr;
} else {
	/* Move to the next line if the current line was invalid */

Report an error if the line is invalid.

}

/* Read cgroup.procs into a buffer */
while ((bytes_read = read(procs_fd, buffer, sizeof(buffer) - 1)) > 0) {

You must keep in mind that the last line in the buffer can be incomplete, and you need to handle that properly.
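Both review comments above point at the same parsing loop: a PID can straddle two read() calls, and a line that does not parse as a PID should be reported rather than skipped silently. The following is a hedged sketch of one way to handle that, not the code from this pull request (pr_err is CRIU's logging helper; the callback shape is an assumption).

/* Sketch only: parse cgroup.procs in fixed-size chunks while carrying an
 * incomplete trailing line over to the next read(). */
static int for_each_cgroup_pid(int procs_fd, int (*cb)(pid_t pid))
{
	char buffer[4096];
	size_t off = 0;
	ssize_t bytes_read;

	while ((bytes_read = read(procs_fd, buffer + off, sizeof(buffer) - off - 1)) > 0) {
		char *ptr = buffer, *nl;

		buffer[off + bytes_read] = '\0';

		/* Consume only complete lines (terminated by '\n'). */
		while ((nl = strchr(ptr, '\n'))) {
			char *end_ptr;
			long pid;

			*nl = '\0';
			pid = strtol(ptr, &end_ptr, 10);
			if (end_ptr == ptr || *end_ptr != '\0') {
				pr_err("Invalid line in cgroup.procs: %s\n", ptr);
				return -1;
			}
			if (cb((pid_t)pid))
				return -1;
			ptr = nl + 1;
		}

		/* Keep the incomplete tail for the next read(). */
		off = strlen(ptr);
		memmove(buffer, ptr, off);
	}

	return bytes_read < 0 ? -1 : 0;
}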

@avagin (Member) commented on Sep 5, 2024

Any update?

When checkpointing a container, CRIU uses the cgroup path specified by
the container runtime (e.g., runc) via the `--freeze-cgroup` option to
pause all running processes in the container and obtain a consistent
process tree. However, if the container has multiple processes with
CUDA state, we need to "lock" these processes before freezing the cgroup;
otherwise the cuda-checkpoint tool may hang.

To address this problem, this patch updates the collect_pstree function
to run the CUDA plugin PAUSE_DEVICES hook for all processes in the
container cgroup prior to freezing.

In addition, this change introduces a mechanism to disable the use of
freeze cgroups during process seizing, even if explicitly requested
via the --freeze-cgroup option.

The CUDA plugin is updated to utilize this new mechanism to ensure
compatibility.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
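The "mechanism to disable the use of freeze cgroups" mentioned in the commit message is not shown in this conversation, so the following is only an illustrative guess at its shape (all names here are assumptions, not necessarily what the follow-up change merged): a flag in criu/seize.c that a plugin can set from its init hook, consulted before the runtime-supplied freeze cgroup is used.

/* Illustrative sketch only; names are assumptions, not the merged API. */

#include <stdbool.h>

/* In criu/seize.c: a plugin can flip this flag during initialization to
 * opt out of cgroup-based freezing even when --freeze-cgroup was given. */
static bool freeze_cgroup_disabled;

void dont_use_freeze_cgroup(void)
{
	freeze_cgroup_disabled = true;
}

/* Consulted by the seize/freeze path to decide whether the freeze cgroup
 * supplied by the container runtime may be used. */
bool may_use_freeze_cgroup(const char *freeze_cgroup_path)
{
	return freeze_cgroup_path && !freeze_cgroup_disabled;
}

Under this assumption, the CUDA plugin would call dont_use_freeze_cgroup() from its initialization hook, which is what the commit message means by the plugin being updated to use the new mechanism.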
@rst0git (Member, Author) commented on Sep 12, 2024

Closing in favour of #2475
