From cb6b91daa2c6c64b4e0997df0f9b2e96a4d75e9e Mon Sep 17 00:00:00 2001
From: Yanning Yang
Date: Thu, 5 Dec 2024 12:44:07 +0000
Subject: [PATCH] plugins/amdgpu: Update `README.md` and `criu-amdgpu-plugin.txt`

Signed-off-by: Yanning Yang
---
 Documentation/criu-amdgpu-plugin.txt |  1 +
 plugins/amdgpu/README.md             | 24 +++++++++++++++++++++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/criu-amdgpu-plugin.txt b/Documentation/criu-amdgpu-plugin.txt
index 68803f3dbc..fe76fc3bc6 100644
--- a/Documentation/criu-amdgpu-plugin.txt
+++ b/Documentation/criu-amdgpu-plugin.txt
@@ -15,6 +15,7 @@ Checkpoint / Restore inside a docker container
 Pytorch
 Tensorflow
 Using CRIU Image Streamer
+Parallel Restore
 
 DESCRIPTION
 -----------
diff --git a/plugins/amdgpu/README.md b/plugins/amdgpu/README.md
index 1078eafe6f..e64556b0ef 100644
--- a/plugins/amdgpu/README.md
+++ b/plugins/amdgpu/README.md
@@ -3,7 +3,8 @@ Supporting ROCm with CRIU
 
 _Felix Kuehling _
 _Rajneesh Bardwaj _
-_David Yat Sin _
+_David Yat Sin _
+_Yanning Yang _
 
 # Introduction
 
@@ -224,6 +225,27 @@ to resume execution on the GPUs.
 *This new plugin is enabled by the new hook `__RESUME_DEVICES_LATE` in our RFC
 patch series.*
 
+## Restoring BO content in parallel
+
+Restoring the BO content is an important part in the restore of GPU state and
+usually takes a significant amount of time. A possible location for this
+procedure is the `cr_plugin_restore_file` plugin. However, restoring in this
+plugin blocks the target process from performing other restore operations, which
+hinders further optimization of the restore process.
+
+Therefore, a new plugin that runs in the master restore process is introduced,
+and it interacts with the `cr_plugin_restore_file` plugin to complete the
+restore of BO content. Specifically, the target process only needs to send the
+relevant BOs to the master restore process, while this new plugin handles all
+the restore of buffer objects. Through this method, during the restore of the BO
+content, the target process can perform other restore operations, thus
+accelerating the restore procedure. It is an implementation of gCROP from the
+ACM SoCC'24 paper: [On-demand and Parallel Checkpoint/Restore for GPU
+Applications](https://dl.acm.org/doi/10.1145/3698038.3698510).
+
+*This new plugin is enabled by the new hook `__POST_FORKING` in our patch
+series.*
+
 ## Other CRIU changes
 
 In addition to the new plugins, we need to make some changes to CRIU itself to
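
The hand-off described under "Restoring BO content in parallel" can be illustrated with a small, self-contained sketch. This is not the plugin's code: the real plugin is part of CRIU and written in C, while the sketch below uses Python only to show the shape of the idea. The names `parallel_restore`, `recv_exact` and `BO_MSG`, the wire format, and the zero-filled "content" are all assumptions made for illustration. A forked "target" process announces its buffer objects over a socket pair and is then free to move on, while the "master" process does the actual content restore. It relies on `os.fork`, so it runs on POSIX systems only.

```python
import os
import socket
import struct

# Hypothetical wire format: a 32-bit BO id followed by a 32-bit size in bytes.
BO_MSG = struct.Struct("!II")


def recv_exact(sock, n):
    """Read exactly n bytes; return b"" if the peer closed the socket first."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return b""
        buf += chunk
    return buf


def parallel_restore(bos):
    """Fork a target that hands its BOs to the master, which restores them.

    bos is a list of (bo_id, size) pairs. Returns {bo_id: content} as seen by
    the master process.
    """
    master_sock, target_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Target process: only announce the BOs, do not restore their content.
        try:
            master_sock.close()
            for bo_id, size in bos:
                target_sock.sendall(BO_MSG.pack(bo_id, size))
            target_sock.close()
            # A real target would carry on with its other restore steps here.
        finally:
            os._exit(0)

    # Master process: close our copy of the target's end so that EOF is seen
    # once the target has sent everything.
    target_sock.close()
    restored = {}
    while True:
        msg = recv_exact(master_sock, BO_MSG.size)
        if not msg:
            break  # target closed its end: no more BOs to restore
        bo_id, size = BO_MSG.unpack(msg)
        # Stand-in for copying checkpointed content into the buffer object.
        restored[bo_id] = bytes(size)
    master_sock.close()
    os.waitpid(pid, 0)
    return restored


if __name__ == "__main__":
    print(sorted(parallel_restore([(1, 4096), (2, 8192)])))
```

The point of the split is that the expensive step (the stand-in `bytes(size)` copy) runs in the master, so the target is blocked only for as long as it takes to send the small descriptors.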