Distributed Data Parallelism #402
Conversation
…mpute_outputs_and_loss which is incompatible with DDP
* delete useless files
* solve typeError
* change output_dir for get-labels
* solve issue 421
* review changes
* add caps_directory option in get-labels
* add block announce for clinicadl 1.3.0
* fix conflicts
* Fix missing mods parsing
* Fix output path
* Fix missing mods parsing
Nice work! Looks fine to me, only a few small comments.
However, it would be great to add more docstrings to the classes and methods implemented in the clinicadl/utils/maps_manager/cluster/ files, as that will make it easier in the future to understand what they do and what they are used for, and will ease maintenance.
…version with a packaging Version object
… subpackage works
DDP allows the use of multiple GPUs to compute a larger batch of data. This lets us increase the size of the model, increase the batch size, or increase the memory footprint of the data (for instance by removing downsampling).
DDP is more efficient and more flexible than DP. Both the use of two GPUs on the same node and the use of GPUs spread across two different nodes are covered by this PR.
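As a rough sketch of what this changes on the training side, here is hypothetical code (not the clinicadl implementation) wrapping a model in `DistributedDataParallel` with a `DistributedSampler`; it assumes the process group has already been initialized and that each process knows its local rank:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def train(model, dataset, epochs, local_rank):
    # Each process owns one GPU and sees a disjoint shard of the dataset,
    # so the effective batch size is per_gpu_batch_size * world_size.
    device = torch.device(f"cuda:{local_rank}")
    model = DDP(model.to(device), device_ids=[local_rank])

    sampler = DistributedSampler(dataset)  # shards indices across ranks
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()  # gradients are all-reduced across GPUs here
            optimizer.step()
```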
To use DDP, each process of clinicadl needs to know the world size, its rank, and the master address and master port. For now, the cluster resolver I suggest only supports the SLURM scheduler, but if this PR is successful, we will add other cluster resolvers in the future.
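For context, resolving the cluster under SLURM essentially means reading the environment variables the scheduler exports and initializing the process group from them. A minimal sketch follows; the function name is made up and this is not necessarily how the cluster/ subpackage does it, and it assumes MASTER_ADDR and MASTER_PORT were exported by the submission script:

```python
import os

import torch.distributed as dist


def init_from_slurm(backend: str = "nccl"):
    """Hypothetical resolver: read the SLURM environment and init the process group."""
    # SLURM exports these for every task of the job step.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index on the node

    # MASTER_ADDR / MASTER_PORT must be identical in every process; we assume
    # the submission script exported them (e.g. from the first node returned
    # by `scontrol show hostnames $SLURM_JOB_NODELIST`).
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
        rank=rank,
        world_size=world_size,
    )
    return rank, world_size, local_rank
```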
I also suggest the introduction of ZeRO (Zero Redundancy Optimizer). This technique from DeepSpeed (Microsoft) shards the optimizer states along the data-parallelism dimension. ZeRO is most effective with the DeepSpeed library, but that would add another dependency, so that's a discussion for another day. PyTorch has a small implementation of its own, which only covers the first stage of ZeRO (optimizer state sharding; stage 2 adds gradient sharding and stage 3 parameter sharding).

Also, unlike the DeepSpeed version, the PyTorch implementation of ZeRO increases the volume of communication required to synchronize devices. This is probably because PyTorch wanted to limit the amount of code one needs to change to add this feature: it did not discard the second part of the gradient all-reduce (an all-gather, the first part being a reduce-scatter), which is unnecessary with ZeRO. On the other hand, this does make using ZeRO painless: it only takes a few additional lines of code.

This feature reduces the memory footprint of the optimizer: the more GPUs you have, the less memory per GPU you need. With a 300M-parameter network and Automatic Mixed Precision, it reduced the amount of memory needed from 17.1 GB to 15.5 GB with 4 GPUs. So that's neat!
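For reference, the PyTorch version really is only a few lines; a minimal sketch, assuming `ddp_model` is a model already wrapped in `DistributedDataParallel`:

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer

# Instead of every GPU holding the full optimizer state, ZeroRedundancyOptimizer
# shards the state (e.g. the Adam moments) across ranks.
optimizer = ZeroRedundancyOptimizer(
    ddp_model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-4,
)

# The training loop is unchanged: backward() still all-reduces gradients, and
# each rank only updates (and stores state for) its own shard of the parameters.
```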