Checkpoint Issues #12
By design, checkpointing trades speed for memory: it slows down the backward pass in exchange for much more memory capacity, by discarding activations during the forward pass and recomputing them during the backward pass.
If by "performance" you mean "speed", your second observation is unexpected. torchgpipe without checkpointing is identical to typical pipeline parallelism, not to GPipe. If you choose the same chunk size in both settings, the concurrency should not decrease. How did you choose the batch size and the number of chunks in both settings?
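For reference, a minimal sketch of the two modes being discussed, using torchgpipe's GPipe wrapper. The toy model, sizes, and two-partition balance below are made up for illustration, and it assumes at least two CUDA devices are visible:
import torch
from torch import nn
from torchgpipe import GPipe

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
)

# checkpoint='except_last' (the default) discards and later recomputes the
# activations of every micro-batch except the last one: less activation
# memory, slower backward pass.
# checkpoint='never' keeps all activations: faster backward pass, more
# memory, and the schedule behaves like plain pipeline parallelism.
model = GPipe(model,
              balance=[2, 2],            # layers per partition
              chunks=4,                  # micro-batches per mini-batch
              checkpoint='except_last')  # or 'never' / 'always'

x = torch.rand(64, 1024, device=model.devices[0])
model(x).sum().backward()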
For this, I chose a batch size of 60 with 480 data points. Then I used checkpointing never or except_last. I also added a few argument parameters to make this convenient:
#!/bin/bash
id=$1
chk=$2
dataset_size=480
epochs=10
exp_type=pipeline-${id}
version=6_checkpoint_${chk}_chunk_variation
batch_size=240
# number of chunks (micro-batches) passed to --chunks
for chunk_size in 10 20 40 60 120
do
echo "python3 main-micro.py ${exp_type} --batch_size ${batch_size} --chunks ${chunk_size} --dataset_size ${dataset_size} --save_file stats_${exp_type}_v${version}.csv --epochs ${epochs} --checkpointing ${chk}"
python3 main-micro.py ${exp_type} --batch_size ${batch_size} --chunks ${chunk_size} --dataset_size ${dataset_size} --save_file stats_micro_${exp_type}_v${version}.csv --epochs ${epochs} --checkpointing ${chk}
done
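One detail worth keeping in mind with this sweep (my arithmetic, not from the thread): since the mini-batch stays at 240, changing the value passed to --chunks changes the micro-batch size each partition processes at a time. In Python terms:
batch_size = 240
for chunks in (10, 20, 40, 60, 120):
    # GPipe splits each mini-batch into `chunks` micro-batches.
    print(f'chunks={chunks:3d} -> micro-batch size {batch_size // chunks}')
# chunks= 10 -> micro-batch size 24
# chunks= 20 -> micro-batch size 12
# chunks= 40 -> micro-batch size 6
# chunks= 60 -> micro-batch size 4
# chunks=120 -> micro-batch size 2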
One possibility came to my mind. When a process uses up almost all CUDA memory, …
Yes, I am heading that way, @sublee. I observed some overhead with smaller batch sizes.
https://github.com/kakaobrain/torchgpipe/blob/master/benchmarks/unet-speed/main.py
Here, when doing the speed benchmarks, why is a constant mini-batch size not used for pipelining? Shouldn't the variable be chunks?
The constant …
Here, the input to that line comes from these?
Those static methods return:
EXPERIMENTS: Dict[str, Experiment] = {
'baseline': Experiments.baseline,
'pipeline-1': Experiments.pipeline1,
'pipeline-2': Experiments.pipeline2,
'pipeline-4': Experiments.pipeline4,
'pipeline-8': Experiments.pipeline8,
}
...
f: Experiment = EXPERIMENTS[experiment]
try:
model, batch_size, _devices = f(model, devices)
...
input = torch.rand(batch_size, 3, 192, 192, device=in_device)
Yes, those values are different. |
Sorry for misunderstanding what "constant" means. We adjusted the batch sizes to maximize the throughput. You can find a similar explanation in v1 of the paper, section "4.2. Performance".
That is totally fine. I just wanted to learn why the numbers were chosen like that. |
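To make the distinction concrete, here is a self-contained sketch of the controlled comparison the question seems to imply: hold the mini-batch constant and vary only chunks, so that throughput differences come from the pipeline schedule alone. This uses a toy model rather than the unet-speed benchmark and assumes at least two CUDA devices:
import time
import torch
from torch import nn
from torchgpipe import GPipe

BATCH_SIZE = 240  # held constant, unlike the tuned per-experiment batch sizes

for chunks in (1, 4, 16):
    net = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                        nn.Linear(1024, 1024), nn.ReLU())
    net = GPipe(net, balance=[2, 2], chunks=chunks)
    x = torch.rand(BATCH_SIZE, 1024, device=net.devices[0])
    for d in net.devices:
        torch.cuda.synchronize(d)
    start = time.time()
    for _ in range(10):
        net(x).sum().backward()
    for d in net.devices:
        torch.cuda.synchronize(d)
    elapsed = time.time() - start
    print(f'chunks={chunks}: {BATCH_SIZE * 10 / elapsed:.1f} samples/sec')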
I tried the 'never' option for checkpointing. The idea was to see how the pipeline performs without checkpointing overhead.
What I observed is that the performance is consistent across pipeline parallelism of 2, 4, and 8. Another important observation is that the performance is much lower than the performance with checkpointing.
Is this expected, or are there other tuning parameters to get better performance?
I checked the backward time and the forward-to-backward time ratio.
Assuming the backward time increases with checkpointing, is that valid logic for your implementation?
Meaning, when I turn off checkpointing, should the pipeline performance improve?
Could you clarify the implementation details on this?
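For the forward-to-backward ratio specifically, a rough sketch of how one might time the two phases separately under checkpoint='never' versus checkpoint='except_last' (toy model and sizes, not the benchmark code; with checkpointing, activations are recomputed inside backward(), so the backward share should grow):
import time
import torch
from torch import nn
from torchgpipe import GPipe

def forward_backward_times(checkpoint, steps=20):
    net = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(),
                        nn.Linear(2048, 2048), nn.ReLU())
    net = GPipe(net, balance=[2, 2], chunks=8, checkpoint=checkpoint)
    x = torch.rand(240, 2048, device=net.devices[0])
    fwd = bwd = 0.0
    for _ in range(steps):
        for d in net.devices:
            torch.cuda.synchronize(d)
        t0 = time.time()
        out = net(x).sum()
        for d in net.devices:
            torch.cuda.synchronize(d)
        t1 = time.time()
        out.backward()
        for d in net.devices:
            torch.cuda.synchronize(d)
        t2 = time.time()
        fwd += t1 - t0
        bwd += t2 - t1
    return fwd, bwd

for mode in ('never', 'except_last'):
    fwd, bwd = forward_backward_times(mode)
    print(f'{mode}: forward {fwd:.3f}s, backward {bwd:.3f}s, backward/forward {bwd / fwd:.2f}')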