
✨ DataParallel and DistributedDataParallel to speed up training. #43

Open

Yonv1943 opened this issue Feb 6, 2023 · 2 comments
Labels: enhancement (New feature or request)

Yonv1943 (Collaborator) commented Feb 6, 2023

DataParallel: multi-threaded, for a single machine with multiple GPUs.

DistributedDataParallel: multi-process, for one or more machines with multiple GPUs.

It is very easy to add DataParallel to the code, but DataParallel brings only a small speed-up.

DistributedDataParallel is a little tricky to use because it needs to be launched from the command line, but it gives a significant speed-up with 4 GPUs on a single machine when GPU memory usage is high.

Yonv1943 (Collaborator, Author) commented Feb 6, 2023

Code

Here is the code (the attached .txt files should be renamed to .py):

The kernel code of DataParallel (nn.DataParallel):

gpu_ids = (4, 5, 6, 7)
gpu_id = gpu_ids[0]  # the main device that scatters inputs and gathers outputs
device = torch.device(f"cuda:{gpu_id}" if (torch.cuda.is_available() and (gpu_id >= 0)) else "cpu")
...
model = nn.DataParallel(model, device_ids=gpu_ids)  # replicate the model over gpu_ids on every forward pass
...
data = data.to(device)      # the batch only needs to live on the main device;
target = target.to(device)  # DataParallel splits it across the GPUs automatically
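For reference, here is a minimal self-contained sketch of how the snippet above fits into one training step. The MLP, batch size, and random data are made up for illustration, and gpu_ids must match GPUs that actually exist on your machine:

import torch
import torch.nn as nn

gpu_ids = (4, 5, 6, 7)  # illustrative; adjust to your machine
gpu_id = gpu_ids[0]     # main device that gathers the outputs
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(28 ** 2, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
if torch.cuda.is_available():
    model = nn.DataParallel(model, device_ids=gpu_ids)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

data = torch.rand(512, 28 ** 2, device=device)        # fake batch
target = torch.randint(0, 10, (512,), device=device)  # fake labels

output = model(data)   # DataParallel scatters the batch across device_ids
loss = criterion(output, target)
optimizer.zero_grad()
loss.backward()        # gradients are reduced back onto the main device
optimizer.step()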

The kernel code of DistributedDataParallel (nn.parallel.DistributedDataParallel):

'''DistributedDataParallel'''
world_size = dist.get_world_size()  # num_global_GPUs = num_machines * num_GPUs_per_machine
rank_id = dist.get_rank()  # the global rank (GPU id) of this process
device = torch.device(f"cuda:{rank_id}" if torch.cuda.is_available() else "cpu")
...
model = DistributedDataParallel(model, device_ids=[rank_id])
...
dist_inp = data[rank_id::world_size].to(device)  # each rank keeps its own slice of the batch
dist_lab = target[rank_id::world_size].to(device)
...
if __name__ == '__main__':
    dist.init_process_group(backend='nccl', init_method='env://')  # backend in {'nccl', 'gloo', 'mpi'}
    run()
    dist.destroy_process_group()
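If it helps, here is a rough self-contained version of that skeleton that can be launched with the torch.distributed.run command shown below. The one-layer model, batch size, and random data are placeholders chosen only for illustration; LOCAL_RANK is read from the environment that the launcher sets for each worker:

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def run():
    world_size = dist.get_world_size()   # num_machines * num_GPUs_per_machine
    rank_id = dist.get_rank()            # global rank of this process
    local_rank = int(os.environ.get('LOCAL_RANK', rank_id))
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = nn.Linear(28 ** 2, 10).to(device)
    model = DistributedDataParallel(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = nn.CrossEntropyLoss()

    data = torch.rand(1024, 28 ** 2)       # fake global batch
    target = torch.randint(0, 10, (1024,))
    dist_inp = data[rank_id::world_size].to(device)   # each rank takes its own slice
    dist_lab = target[rank_id::world_size].to(device)

    loss = criterion(model(dist_inp), dist_lab)
    optimizer.zero_grad()
    loss.backward()   # DDP all-reduces the gradients across ranks here
    optimizer.step()


if __name__ == '__main__':
    dist.init_process_group(backend='nccl', init_method='env://')
    run()
    dist.destroy_process_group()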

The communication backend to be used by the current process

Backends
torch.distributed supports three built-in backends, each with different capabilities. The table below shows which functions are available for use with CPU / CUDA tensors. MPI supports CUDA only if the implementation used to build PyTorch supports it.

https://pytorch.org/docs/stable/distributed.html
[Image: table of backend support for NCCL / GLOO / MPI with CPU and CUDA tensors]
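If it helps, a common pattern (a general usage note, not something from this repo) is to pick NCCL when CUDA is available and fall back to Gloo on CPU:

import torch
import torch.distributed as dist

# NCCL is the recommended backend for CUDA tensors; Gloo handles CPU tensors.
backend = 'nccl' if torch.cuda.is_available() else 'gloo'
dist.init_process_group(backend=backend, init_method='env://')  # run under torch.distributed.run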

Run it:

  • DataParallel can be run directly.
  • DistributedDataParallel needs a special launch method:
https://pytorch.org/docs/stable/distributed.html#launch-utility

CUDA_VISIBLE_DEVICES="4,5,6,7" OMP_NUM_THREADS=4 \
 python -m torch.distributed.run --nproc_per_node 4 \
 DEMO_torch_nn_DistDataParallel.py

Why we use torch.distributed.run instead of torch.distributed.launch

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.  <-- note this line
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions
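With torch.distributed.run, the rank information therefore comes from environment variables instead of a --local_rank argument; a minimal sketch of reading them:

import os

# These variables are set by torch.distributed.run for every worker process.
local_rank = int(os.environ['LOCAL_RANK'])   # GPU index on this machine
global_rank = int(os.environ['RANK'])        # rank across all machines
world_size = int(os.environ['WORLD_SIZE'])   # total number of processes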

It is also possible to launch DistributedDataParallel with PyTorch's mp.spawn instead of python -m torch.distributed.run.

Why mp.spawn is slower than torch.distributed.launch for multi-GPU training:

pytorch/pytorch#47587

Output

The running output of DataParallel:

Columns: if_full_model, if_data_parallel, if_save_data_in_gpu, then the time used, the GPU memory usage of the main thread, and the GPU memory usage of the sub thread.

Experiment: the impact of turning DataParallel on or off with multiple GPU cards
0   0   1   TimeUsed  1    GPU_max_memo 68  0
0   1   1   TimeUsed 12    GPU_max_memo 68  3

Conclusion: not putting the data in GPU memory does reduce the memory footprint, but it increases the training time.
Experiment: not putting the data in GPU memory
0   1   0   ERROR: putting the data on the CPU without turning on full_model raises an error
1   1   0   TimeUsed 19    GPU_max_memo  4  3
1   0   0   TimeUsed 13    GPU_max_memo 13  0

Question: full_model has to be used, so does it affect training?
Conclusion: whether full_model is turned on or not, the impact on training time and memory usage is negligible.
Experiment: the effect of turning full_model on or off with a single card
0   0   1   TimeUsed  1    GPU_max_memo 68  0
1   0   1   TimeUsed  1    GPU_max_memo 68  0
Experiment: the effect of turning full_model on or off with multiple cards
0   1   1   TimeUsed 12    GPU_max_memo 68  3
1   1   1   TimeUsed 12    GPU_max_memo 68  3
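How the GPU_max_memo numbers above were collected is not shown in the issue; a minimal sketch of one common way to record peak GPU memory with PyTorch (the device and the GiB unit are assumptions):

import torch

device = torch.device('cuda:0')
torch.cuda.reset_peak_memory_stats(device)    # clear the peak-memory counter before the run

x = torch.rand(4096, 28 ** 2, device=device)           # stand-in for a training step
y = x @ torch.rand(28 ** 2, 10, device=device)

max_memo = torch.cuda.max_memory_allocated(device) / 2 ** 30  # peak allocation in GiB
print(f"GPU_max_memo {max_memo:.2f}")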

The running output of DistributedDataParallel:

I use the Fashion-MNIST data in this demo.
In order to simulate training with larger data and a larger neural network, I use data_repeat to enlarge the MNIST data.

data_repeat=N means repeating the MNIST input from 28**2 to 28**2 * N features. This operation enlarges both the input data and the MLP network (the input dimension becomes 28**2 * N).
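A minimal sketch of what such a data_repeat expansion could look like; the exact implementation is not shown in the issue, so the batch size and tensor shapes here are only for illustration:

import torch

data_repeat = 24
images = torch.rand(64, 1, 28, 28)                        # a Fashion-MNIST-like batch
inputs = images.view(64, 28 ** 2).repeat(1, data_repeat)  # shape: (64, 28**2 * data_repeat)
assert inputs.shape[1] == 28 ** 2 * data_repeat           # MLP input dimension grows accordingly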

data_repeat: int = 24
    world_size 4    rank_id 2
    world_size 4    rank_id 1
    world_size 4    rank_id 0
    world_size 4    rank_id 3
    Train Loop:
    repeat_times:    3   obj: 0.035    GPU_memo 15
    repeat_times:    7   obj: 0.026    GPU_memo 15
    repeat_times:   11   obj: 0.022    GPU_memo 15
    repeat_times:   15   obj: 0.020    GPU_memo 15
    repeat_times:   19   obj: 0.018    GPU_memo 15
    repeat_times:   23   obj: 0.016    GPU_memo 15
    repeat_times:   27   obj: 0.014    GPU_memo 15
    repeat_times:   31   obj: 0.014    GPU_memo 15
    TimeUsed  1   18
    TimeUsed  1   18
    TimeUsed  1   18
    TimeUsed  1   18

data_repeat: int = 64
    world_size 4    rank_id 2
    world_size 4    rank_id 3
    world_size 4    rank_id 1
    world_size 4    rank_id 0
    Train Loop:
    repeat_times:    3   obj: 0.049    GPUMemo 40
    repeat_times:    7   obj: 0.058    GPUMemo 40
    repeat_times:   11   obj: 0.050    GPUMemo 40
    repeat_times:   15   obj: 0.037    GPUMemo 40
    repeat_times:   19   obj: 0.033    GPUMemo 40
    repeat_times:   23   obj: 0.030    GPUMemo 40
    repeat_times:   27   obj: 0.029    GPUMemo 40
    repeat_times:   31   obj: 0.028    GPUMemo 40
    TimeUsed  2   MaxGPUMemo 47
    TimeUsed  2   MaxGPUMemo 47
    TimeUsed  2   MaxGPUMemo 47
    TimeUsed  2   MaxGPUMemo 47

YangletLiu added the enhancement label on Feb 6, 2023

Yonv1943 (Collaborator, Author) commented Feb 10, 2023

Code

Here is the code (the attached .py.txt file should be renamed to .py):

The kernel code:

import os
from typing import Tuple

import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel


def dist_init(rank_id: int, world_size: int):
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # 'localhost'
    os.environ['MASTER_PORT'] = '10086'  # any free port
    dist.init_process_group(backend='nccl', init_method='env://', rank=rank_id, world_size=world_size)
    # backend in {'nccl', 'gloo', 'mpi'}

def dist_close():
    dist.destroy_process_group()

def run__train_in_fashion_mnist(rank_id: int, world_size: int):
    dist_init(rank_id=rank_id, world_size=world_size)  # backend in {'nccl', 'gloo', 'mpi'}
    ...
    if if_dist_data_parallel:
        model = DistributedDataParallel(model, device_ids=[rank_id])
    ...
    dist_close()

def start_processes(gpu_ids: Tuple[int, ...]):
    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(map(str, gpu_ids))  # e.g. "4,5,6,7"
    world_size = len(gpu_ids)

    mp.set_start_method(method='spawn' if os.name == 'nt' else 'forkserver', force=True)
    processes = [mp.Process(target=run__train_in_fashion_mnist, args=(rank_id, world_size), daemon=True)
                 for rank_id in range(world_size)]
    [process.start() for process in processes]
    [process.join() for process in processes]


if __name__ == '__main__':
    GPU_IDs = (4, 5, 6, 7)

    # mp.spawn(fn=run__train_in_fashion_mnist, args=(len(GPU_IDs),), nprocs=len(GPU_IDs), join=True)
    start_processes(gpu_ids=GPU_IDs)
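For completeness, the commented-out mp.spawn alternative above would look roughly like this. mp.spawn passes the process index as the first argument to the target function, so the signature run__train_in_fashion_mnist(rank_id, world_size) already matches; this is only a sketch, not tested against the repo:

import os
import torch.multiprocessing as mp

if __name__ == '__main__':
    GPU_IDs = (4, 5, 6, 7)
    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(map(str, GPU_IDs))  # same mapping as start_processes
    world_size = len(GPU_IDs)
    # mp.spawn calls run__train_in_fashion_mnist(rank_id, world_size) for rank_id in 0..world_size-1
    mp.spawn(fn=run__train_in_fashion_mnist, args=(world_size,), nprocs=world_size, join=True)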
