Distributed Data Parallelism #402

Merged: 128 commits merged into aramis-lab:dev on Sep 21, 2023
Conversation

@ncassereau (Contributor) commented Mar 20, 2023

DDP allows multiple GPUs to be used to process a larger batch of data. This lets us increase the size of the model, increase the batch size, or increase the memory footprint of our data, for instance by removing downsampling.
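To make the batch-size scaling concrete: under DDP every process loads a distinct shard of the dataset, so the effective batch size is the per-GPU batch size multiplied by the number of processes. A minimal sketch with PyTorch's `DistributedSampler` (the toy dataset and sizes below are illustrative, not taken from this PR):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for the real neuroimaging data.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32))

# Each rank gets a disjoint shard: with a per-GPU batch size of 8 and
# 4 processes, the effective batch size is 32. The sampler reads the
# rank and world size from the initialized process group.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle consistently across ranks
    for (batch,) in loader:
        pass  # forward/backward pass goes here
```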

DDP is more efficient and more flexible than DP. This PR covers both the use of several GPUs on the same node and GPUs spread across several nodes.

To use DDP, each clinicadl process needs to know the world size, its own rank, and the master address and port. For now, the cluster resolver I propose only supports the SLURM scheduler, but if this PR is successful we will add other cluster resolvers in the future.
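To illustrate what such a resolver has to provide under SLURM (the helper below is a hypothetical sketch, not the code of this PR): each process reads its rank and the world size from the environment variables SLURM exports, and all processes agree on the first node of the job as the rendezvous master before calling `torch.distributed.init_process_group`.

```python
import os
import subprocess
import torch.distributed as dist

def init_from_slurm(port: int = 29500) -> None:
    """Hypothetical SLURM resolver: derive DDP settings from SLURM's environment."""
    rank = int(os.environ["SLURM_PROCID"])        # global rank of this process
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of processes

    # SLURM_JOB_NODELIST is a compressed hostlist (e.g. "node[01-04]");
    # `scontrol show hostnames` expands it, and the first node serves
    # as the master address.
    hostnames = subprocess.run(
        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    os.environ["MASTER_ADDR"] = hostnames[0]
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
```

After initialization, the model is wrapped as usual with `torch.nn.parallel.DistributedDataParallel`, pinning each process to its local GPU (`SLURM_LOCALID`).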

I also suggest introducing ZeRO (Zero Redundancy Optimizer). This technique, from Microsoft's DeepSpeed, shards the optimizer states along the data-parallelism dimension. ZeRO is most effective with the DeepSpeed library, but that would add another dependency, so that's a discussion for another day. PyTorch has a small implementation of its own, which only covers the first stage of ZeRO (sharding the optimizer states; stage 2 adds gradient sharding and stage 3 parameter sharding). Also, unlike the DeepSpeed version, PyTorch's implementation of ZeRO does increase the volume of communication required to synchronize devices. This is probably because PyTorch wanted to limit the amount of code one would need to change to adopt the feature: they did not discard the second half of the gradient all-reduce (an all-gather, the first half being a reduce-scatter), which is unnecessary with ZeRO. On the other hand, it does make using ZeRO painless; it only takes a few additional lines of code.

This feature reduces the memory footprint of the optimizer: the more GPUs you have, the less memory each GPU needs. With a 300M-parameter network and Automatic Mixed Precision, it reduced the memory needed from 17.1 GB to 15.5 GB with 4 GPUs. So that's neat!
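For reference, PyTorch's stage-1 ZeRO is exposed as `torch.distributed.optim.ZeroRedundancyOptimizer`, and as said above it really is only a few lines. A sketch, assuming the process group was initialized as above and each process is pinned to one GPU (the model and hyperparameters are placeholders):

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

device = torch.device("cuda", torch.cuda.current_device())
model = DDP(torch.nn.Linear(128, 10).to(device))

# Drop-in replacement for torch.optim.Adam: the optimizer states (Adam's
# moment estimates) are sharded across ranks instead of replicated.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-4,
)

inputs = torch.randn(8, 128, device=device)
loss = model(inputs).sum()
loss.backward()   # gradients are still all-reduced in full (see above)
optimizer.step()  # each rank updates only its own shard of the states
```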

ncassereau and others added 30 commits March 20, 2023 14:41
* …mpute_outputs_and_loss which is incompatible with DDP
* delete useless files
* solve typeError
* change output_dir for get-labels
* solve issue 421
* review changes
* add caps_directory option in get-labels
* add block announce for clinicadl 1.3.0
* fix conflicts
* Fix missing mods parsing
* Fix output path
* Fix missing mods parsing
@ravih18 (Collaborator) left a comment


Nice work! Looks fine to me, only a few small comments.

However, it would be great to add more docstrings to the classes and methods implemented in the clinicadl/utils/maps_manager/cluster/ files: it will make it easier in the future to understand what they do and what they are used for, which will ease maintenance.

clinicadl/utils/cli_param/train_option.py (outdated, resolved)
clinicadl/utils/task_manager/task_manager.py (outdated, resolved)
clinicadl/utils/maps_manager/maps_manager.py (outdated, resolved)
@camillebrianceau camillebrianceau merged commit 1672205 into aramis-lab:dev Sep 21, 2023
Labels: enhancement (New feature or request)
6 participants