Distributed Data Parallelism #402
Conversation
…mpute_outputs_and_loss which is incompatible with DDP
* delete useless files
* solve typeError
* change output_dir for get-labels
* solve issue 421
* review changes
* add caps_directory option in get-labels
* add block announce for clinicadl 1.3.0
* fix conflicts
* Fix missing mods parsing
* Fix output path
* Fix missing mods parsing
Nice work! Looks fine to me, only a few small comments.
However, it would be great to add more docstrings to the classes and methods implemented in the clinicadl/utils/maps_manager/cluster/ files, as that will make it easier in the future to understand what they do and what they are used for, and will ease maintenance.
…version with a packaging Version object
… subpackage works
DDP allows the use of multiple GPUs to compute a larger batch of data. This lets us increase the size of the model, increase the batch size, or increase the memory footprint of the data (for instance by removing downsampling).
DDP is more efficient and more flexible than DP. Both the use of two GPUs on the same node and the use of GPUs spread across two different nodes are covered by this PR.
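As a rough sketch of what this changes on the training side, here is hypothetical code (not the clinicadl implementation) wrapping a model in `DistributedDataParallel` with a `DistributedSampler`; it assumes the process group has already been initialized and that each process knows its local rank:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def train(model, dataset, epochs, local_rank):
    # Each process owns one GPU and sees a disjoint shard of the dataset,
    # so the effective batch size is per_gpu_batch_size * world_size.
    device = torch.device(f"cuda:{local_rank}")
    model = DDP(model.to(device), device_ids=[local_rank])

    sampler = DistributedSampler(dataset)  # shards indices across ranks
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()  # gradients are all-reduced across GPUs here
            optimizer.step()
```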
To use DDP, each process of clinicadl needs to know the world size, its rank, and the master address and master port. For now, the cluster resolver I suggest only supports the SLURM scheduler, but if this PR is successful, we will add other cluster resolvers in the future.
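For context, resolving the cluster under SLURM essentially means reading the environment variables the scheduler exports and initializing the process group from them. A minimal sketch follows; the function name is made up and this is not necessarily how the cluster/ subpackage does it, and it assumes MASTER_ADDR and MASTER_PORT were exported by the submission script:

```python
import os

import torch.distributed as dist


def init_from_slurm(backend: str = "nccl"):
    """Hypothetical resolver: read the SLURM environment and init the process group."""
    # SLURM exports these for every task of the job step.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index on the node

    # MASTER_ADDR / MASTER_PORT must be identical in every process; we assume
    # the submission script exported them (e.g. from the first node returned
    # by `scontrol show hostnames $SLURM_JOB_NODELIST`).
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
        rank=rank,
        world_size=world_size,
    )
    return rank, world_size, local_rank
```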
I also suggest the introduction of ZeRO (Zero Redundancy Optimizer). This technique from DeepSpeed (Microsoft) shards the optimizer states along the data-parallelism dimension. ZeRO is most effective with the DeepSpeed library, but that would add another dependency, so that's a discussion for another day. PyTorch has a small implementation of its own, which only covers the first stage of ZeRO (optimizer state sharding; stage 2 adds gradient sharding and stage 3 parameter sharding).

Also, unlike the DeepSpeed version, the PyTorch implementation of ZeRO increases the volume of communication required to synchronize devices. This is probably because PyTorch wanted to limit the amount of code one needs to change to add this feature: it did not discard the second part of the gradient all-reduce (an all-gather, the first part being a reduce-scatter), which is unnecessary with ZeRO. On the other hand, this does make using ZeRO painless: it only takes a few additional lines of code.

This feature reduces the memory footprint of the optimizer: the more GPUs you have, the less memory per GPU you need. With a 300M-parameter network and Automatic Mixed Precision, it reduced the amount of memory needed from 17.1 GB to 15.5 GB with 4 GPUs. So that's neat!
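For reference, the PyTorch version really is only a few lines; a minimal sketch, assuming `ddp_model` is a model already wrapped in `DistributedDataParallel`:

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer

# Instead of every GPU holding the full optimizer state, ZeroRedundancyOptimizer
# shards the state (e.g. the Adam moments) across ranks.
optimizer = ZeroRedundancyOptimizer(
    ddp_model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-4,
)

# The training loop is unchanged: backward() still all-reduces gradients, and
# each rank only updates (and stores state for) its own shard of the parameters.
```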