✨ DataParallel and DistributedDataParallel to speed up training. #43
DataParallel: multiple threads for a single machine with multiple GPUs. Here is the code (change .txt to .py): the kernel code is FullModel, which writes the loss function into the model to solve the memory usage imbalance problem (a sketch of such a wrapper follows below).

DistributedDataParallel: multiple processes for a single machine or multiple machines with multiple GPUs. Here is the code (change .txt to .py): the kernel code is also FullModel.
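The attached kernel code is not reproduced here, so the following is only a minimal sketch of what such a FullModel wrapper typically looks like; the class name matches the issue, but the constructor signature and return format are assumptions:

```python
import torch.nn as nn

class FullModel(nn.Module):
    """Wrap a network and its loss so the loss is computed on each GPU.

    With plain DataParallel the outputs of all replicas are gathered onto
    GPU 0 before the loss is computed, so GPU 0 runs out of memory first.
    Computing the loss inside forward() keeps that work distributed.
    """

    def __init__(self, model, loss_fn):
        super().__init__()
        self.model = model
        self.loss_fn = loss_fn

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        loss = self.loss_fn(outputs, targets)
        # Return the loss with a leading dimension so DataParallel can
        # concatenate the per-replica losses; the caller averages them.
        return loss.unsqueeze(0), outputs
```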
It is very easy to add DataParallel into the code, but DataParallel brings less speedup.
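For reference, wrapping the model in DataParallel is essentially a one-line change. This is a hedged sketch that reuses the FullModel sketch above; the stand-in model, loss, and tensors are placeholders, not the repository's code:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 10))           # stand-in model
criterion = nn.MSELoss()                         # stand-in loss

full_model = FullModel(net, criterion)           # loss lives inside the module
full_model = nn.DataParallel(full_model).cuda()  # replicate across all visible GPUs

optimizer = torch.optim.SGD(full_model.parameters(), lr=0.01)

inputs = torch.randn(32, 10).cuda()
targets = torch.randn(32, 10).cuda()

loss, outputs = full_model(inputs, targets)      # one loss value per GPU replica
loss = loss.mean()                               # average the per-replica losses
optimizer.zero_grad()
loss.backward()
optimizer.step()
```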
DistributedDataParallel is a little tricky to use because it needs to be started from the command line, but it gives a significant speedup with 4 GPUs on a single machine when GPU memory usage is high.
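A hedged sketch of the single-machine DistributedDataParallel setup follows; it uses the standard torch.distributed pattern (one process per GPU, NCCL backend), not necessarily the attached script, and the file name in the launch command is hypothetical:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher (torchrun) sets LOCAL_RANK for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    net = nn.Sequential(nn.Linear(10, 10)).cuda(local_rank)  # stand-in model
    net = DDP(net, device_ids=[local_rank])

    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

    # In real training, use a DistributedSampler so each process
    # sees a distinct shard of the dataset.
    inputs = torch.randn(32, 10).cuda(local_rank)
    targets = torch.randn(32, 10).cuda(local_rank)

    loss = criterion(net(inputs), targets)
    optimizer.zero_grad()
    loss.backward()   # gradients are all-reduced across processes here
    optimizer.step()

if __name__ == "__main__":
    main()
```

Launched from the command line with one process per GPU, e.g.:

```
torchrun --nproc_per_node=4 train_ddp.py
```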