Non-deterministic behavior on multi-GPU run #103
Comments
kalekundert added a commit to kalekundert/escnn that referenced this issue on Oct 31, 2024:
Fixes QUVA-Lab#103 Signed-off-by: Kale Kundert <[email protected]>
This bug is very similar to another one addressed by #93, so I just updated that PR to fix this bug as well. Specifically, I replaced all instances of the |
kalekundert added a commit to kalekundert/escnn that referenced this issue on Nov 25, 2024:
Fixes QUVA-Lab#103 Signed-off-by: Kale Kundert <[email protected]>
I have encountered a bug that invalidates multi-GPU training: because the initialization of the model is non-deterministic, the model stored on each GPU diverges from the others.
This happens for all `sampled_basis` layers. Specifically, in the init of `BlocksBasisExpansion` (in https://github.com/QUVA-Lab/escnn/blob/master/escnn/nn/modules/basismanager/basisexpansion_blocks.py), the layers are created while iterating over sets, which makes their creation order random. This happens independently on each GPU, so the models differ per GPU (with `in_reprs` being `{"irrep_1", "irrep_0", "regular"}` and `out_reprs` being `{"regular"}`).

Here’s one example of wrong ordering:
On one GPU I had
and on another one I had
This should be caught by Torch. Unfortunately, it only checks `parameters`; buffers, i.e., values that stay constant through training, are not checked. And since this `BlocksBasisExpansion` data is a buffer, it fails silently.

Examples of such layers:
module.enc.enc_obs.conv.13.basisexpansion.block_expansion_('regular', 'regular').sampled_basis
diff.enc_a.0.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis
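
The mechanism described above can be reproduced outside of escnn. The following is a minimal sketch, assuming that the set elements are plain strings (as in the `in_reprs` example) and that each GPU is driven by its own Python process with string hash randomization enabled, which is the CPython default. The names `CHILD` and `iteration_orders` are hypothetical and not part of escnn. Each child interpreter is started with a different `PYTHONHASHSEED` to stand in for a separate per-GPU process. Iterating over `sorted(...)` is shown as one possible way to get a stable order; it is not necessarily the change made in the linked PR.

```python
import os
import subprocess
import sys

# Child program: build the set from the issue, then print its raw
# iteration order followed by its sorted order.
CHILD = (
    'reprs = {"irrep_1", "irrep_0", "regular"}\n'
    'print(",".join(reprs))\n'
    'print(",".join(sorted(reprs)))\n'
)


def iteration_orders(seed):
    """Run CHILD in a fresh interpreter with the given PYTHONHASHSEED.

    Returns a tuple (raw set order, sorted order) as comma-joined strings.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    lines = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    ).stdout.splitlines()
    return lines[0], lines[1]


# Each seed stands in for one independent (per-GPU) process.
results = [iteration_orders(seed) for seed in range(1, 9)]
raw_orders = {raw for raw, _ in results}
sorted_orders = {srt for _, srt in results}

# The raw orders may differ between processes; the sorted order cannot.
print("raw set orders seen:", raw_orders)
print("sorted orders seen: ", sorted_orders)
```

If more than one raw order is printed, processes that build their submodules in set-iteration order would register them in different orders, whereas the sorted order is the same in every process.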