Distributed MultiThreaded Checkpointing (DMTCP) checkpoints a running program on Linux with no modifications to the program or OS. The program can later be restarted from a checkpoint.
To access dmtcp, load a dmtcp module. For example:
module load dmtcp/3.0.0
Here is a dummy example that prints increasing integers, one every 2 seconds. Copy it to a text file on Oscar and name it dmtcp_serial.c
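#include <stdio.h>
#include <unistd.h>
int main(int argc, char* argv[])
{
    int count = 1;
    /* Print an increasing counter forever, one value every 2 seconds. */
    while (1)
    {
        printf(" %2d\n", count++);
        fflush(stdout);  /* flush so each value appears immediately */
        sleep(2);
    }
    return 0;
}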
Compile this program by running
gcc dmtcp_serial.c -o dmtcp_serial
You should now have the following files in your directory:
- dmtcp_serial
- dmtcp_serial.c
The dmtcp_launch command launches a program and automatically checkpoints it. To specify the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_launch command.
Example: the following commands launch the program dmtcp_serial and checkpoint it every 8 seconds.
$ port=$(shuf -i 40000-60000 -n 1)
$ dmtcp_launch -p $port -i 8 ./dmtcp_serial
1
2
3
4
5
6
7
8
9
10
^C
[yliu385@node1317 interact]$ ll
total 2761
-rw------- 1 yliu385 ccvstaff 2786466 May 18 11:18 ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp
lrwxrwxrwx 1 yliu385 ccvstaff 60 May 18 11:18 dmtcp_restart_script.sh -> dmtcp_restart_script_24f183c2194a7dc4-40000-42af82ef922a7.sh
-rwxr--r-- 1 yliu385 ccvstaff 12533 May 18 11:18 dmtcp_restart_script_24f183c2194a7dc4-40000-42af82ef922a7.sh
-rwxr-xr-x 1 yliu385 ccvstaff 8512 May 18 08:36 dmtcp_serial
As shown in the example above, a checkpoint file (ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp) is created and can be used to restart the program.
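The launch example above draws its port at random with shuf, so two runs on the same node could occasionally pick a port that is already taken. The sketch below is one possible way to retry until a port looks free. The helper names port_in_use and pick_port are hypothetical, and the check relies on bash's /dev/tcp redirection, which only tests the local host at one instant, so another process may still claim the port afterwards.

```shell
#!/bin/bash
# Hypothetical helpers (not part of DMTCP): choose a random port in the
# 40000-60000 range used above, skipping ports that already accept connections.

port_in_use() {
    # Bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT.
    # The subshell succeeds only if something is listening on that port.
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

pick_port() {
    local port attempt
    for attempt in 1 2 3 4 5 6 7 8 9 10; do
        port=$(shuf -i 40000-60000 -n 1)
        if ! port_in_use "$port"; then
            echo "$port"
            return 0
        fi
    done
    return 1   # no free port found after 10 attempts
}
```

A run would then start with port=$(pick_port) in place of the plain shuf call, followed by the same dmtcp_launch command as above.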
The dmtcp_restart command restarts a program from a checkpoint and also automatically checkpoints the restarted program. To specify the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_restart command.
Example: the following commands restart the dmtcp_serial program from a checkpoint and checkpoint it every 12 seconds.
$ port=$(shuf -i 40000-60000 -n 1)
$ dmtcp_restart -p $port -i 12 ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp
9
10
11
12
13
14
15
^C
[yliu385@node1317 interact]$ dmtcp_restart -p $port -i 12 ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp
15
16
17
^C
It is desirable for a single job script to
- launch a program if there are no checkpoints, or
- automatically restart from a checkpoint if there are one or more checkpoints
The job script dmtcp_serial_job.sh below is an example that shows how to achieve this goal:
- If there is no checkpoint in the current directory, launch the program dmtcp_serial
- If one or more checkpoints exist in the current directory, restart the program dmtcp_serial from the latest checkpoint
#!/bin/bash
#SBATCH -n 1
#SBATCH -t 5:00
#SBATCH -J dmtcp_serial
module load dmtcp/3.0.0
# Newest checkpoint in the current directory (empty if there is none)
checkpoint_file=$(ls -t ckpt_*.dmtcp | head -n 1)
checkpoint_interval=8
port=$(shuf -i 40000-60000 -n 1)
if [ -z "$checkpoint_file" ]; then
    # No checkpoint yet: start the program from the beginning
    dmtcp_launch -p $port -i $checkpoint_interval ./dmtcp_serial
else
    # Resume from the newest checkpoint
    dmtcp_restart -p $port -i $checkpoint_interval "$checkpoint_file"
fi
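The job script above finds the newest checkpoint by piping ls output through head, which prints an error message when no checkpoint exists yet. As an alternative, the following sketch compares modification times directly in the shell. The function name latest_checkpoint is hypothetical and not part of DMTCP; it assumes checkpoint files match ckpt_*.dmtcp as in the listing shown earlier.

```shell
#!/bin/bash
# Hypothetical helper: print the most recently modified ckpt_*.dmtcp file
# in the given directory (default: current directory), or nothing if none exist.
latest_checkpoint() {
    local dir=${1:-.} newest="" f
    for f in "$dir"/ckpt_*.dmtcp; do
        [ -e "$f" ] || continue          # an unmatched glob stays literal; skip it
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f                    # keep the file with the newest mtime
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
    return 0
}
```

In the job script, checkpoint_file=$(latest_checkpoint .) could stand in for the ls pipeline; the rest of the if/else logic stays the same because an empty result still means "no checkpoint".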
Submit dmtcp_serial_job.sh and wait for the job to run until it times out. The beginning and end of the job output file are shown below:
$ head slurm-5157871.out -n 15
## SLURM PROLOG ###############################################################
## Job ID : 5157871
## Job Name : dmtcp_serial
## Nodelist : node1139
## CPUs : 1
## Mem/CPU : 2800 MB
## Mem/Node : 65536 MB
## Directory : /gpfs/data/ccvstaff/yliu385/Test/dmtcp/serial/batch_job
## Job Started : Wed May 18 09:38:39 EDT 2022
###############################################################################
ls: cannot access ckpt_*.dmtcp: No such file or directory
1
2
3
4
$ tail slurm-5157871.out
147
148
149
150
151
152
153
154
155
slurmstepd: error: *** JOB 5157871 ON node1139 CANCELLED AT 2022-05-18T09:43:58 DUE TO TIME LIMIT ***
Submit dmtcp_serial_job.sh again and wait for the job to run until it times out. The beginning of the job output file, shown below, demonstrates that the job restarts from the checkpoint of the previous job.
$ head slurm-5158218.out -n 15
## SLURM PROLOG ###############################################################
## Job ID : 5158218
## Job Name : dmtcp_serial
## Nodelist : node1327
## CPUs : 1
## Mem/CPU : 2800 MB
## Mem/Node : 65536 MB
## Directory : /gpfs/data/ccvstaff/yliu385/Test/dmtcp/serial/batch_job
## Job Started : Wed May 18 09:50:39 EDT 2022
###############################################################################
153
154
155
156
157
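The two output files above show the second job picking up at 153 after the first job ended at 155, that is, from the last checkpoint rather than from 1. A quick way to check this without reading the files by eye is sketched below. The function names are hypothetical, and the check assumes the program prints one bare integer per line, as dmtcp_serial does.

```shell
#!/bin/bash
# Hypothetical helpers for comparing two Slurm output files of dmtcp_serial.
# Counter lines are those containing only digits and optional whitespace,
# so prolog lines ("## Job ID : ...") and slurmstepd messages are ignored.

first_count() {
    awk '/^[[:space:]]*[0-9]+[[:space:]]*$/ { print $1; exit }' "$1"
}

last_count() {
    awk '/^[[:space:]]*[0-9]+[[:space:]]*$/ { n = $1 } END { if (n != "") print n }' "$1"
}

# Succeeds if the new output starts past 1 and no later than one step
# beyond the last value of the previous output.
resumed_from_checkpoint() {
    local prev_last new_first
    prev_last=$(last_count "$1")
    new_first=$(first_count "$2")
    [ -n "$prev_last" ] && [ -n "$new_first" ] &&
        [ "$new_first" -gt 1 ] && [ "$new_first" -le $((prev_last + 1)) ]
}
```

For the files shown above, resumed_from_checkpoint slurm-5157871.out slurm-5158218.out would succeed, since 153 lies between 2 and 156.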
The following example script
- creates a subdirectory for each task of a job array and saves each task's checkpoint in that task's own subdirectory when the job script is submitted for the first time
- restarts from the checkpoints in the task subdirectories when the job script is submitted for the second time or later
#!/bin/bash
#SBATCH -n 1
#SBATCH --array=1-4
#SBATCH -t 5:00
#SBATCH -J dmtcp_job_array
module load dmtcp/3.0.0
checkpoint_interval=8
# Derive a port in the range 40000-59999 from the job ID
port=$((SLURM_JOB_ID % 20000 + 40000))
task_dir=jobtask_$SLURM_ARRAY_TASK_ID
if [ ! -d "$task_dir" ]; then
    # First submission: create the task directory and launch inside it
    mkdir "$task_dir"
    cd "$task_dir"
    dmtcp_launch -p $port -i $checkpoint_interval ../dmtcp_serial
else
    cd "$task_dir"
    # Newest checkpoint in this task's directory (empty if there is none)
    checkpoint_file=$(ls -t ckpt_*.dmtcp | head -n 1)
    if [ -z "$checkpoint_file" ]; then
        # Directory exists but holds no checkpoint: launch from the beginning
        dmtcp_launch -p $port -i $checkpoint_interval ../dmtcp_serial
    else
        dmtcp_restart -p $port -i $checkpoint_interval "$checkpoint_file"
    fi
fi
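Unlike the serial job script, the job array script derives its port from SLURM_JOB_ID rather than drawing it at random. The arithmetic can be checked by hand; the small sketch below wraps it in a hypothetical function, port_for_job, and applies it to the two job IDs that appear in the output files earlier on this page.

```shell
#!/bin/bash
# Hypothetical helper reproducing the port formula from the job array script:
# the remainder modulo 20000 lies in 0-19999, so the result lies in 40000-59999.
port_for_job() {
    echo $(( $1 % 20000 + 40000 ))
}

port_for_job 5157871   # 5157871 % 20000 = 17871, prints 57871
port_for_job 5158218   # 5158218 % 20000 = 18218, prints 58218
```

Job IDs that differ by a multiple of 20000 map to the same port, so the formula makes collisions between concurrent jobs unlikely rather than impossible.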