
Non-deterministic behavior on multi-GPU run #103

Open
lrenaux-bdai opened this issue Oct 31, 2024 · 1 comment
@lrenaux-bdai

I have encountered a bug that invalidates multi-GPU training: because model initialization is non-deterministic, the model replica stored on each GPU diverges from the others.

This happens for every sampled_basis layer. Specifically, it occurs in the init of BlocksBasisExpansion (https://github.com/QUVA-Lab/escnn/blob/master/escnn/nn/modules/basismanager/basisexpansion_blocks.py) at:

for i_repr in set(in_reprs):
    for o_repr in set(out_reprs):

which makes the creation order of the layers random, since it iterates over sets (here in_reprs is {"irrep_1", "irrep_0", "regular"} and out_reprs is {"regular"}). The iteration order is drawn independently in each GPU's process, so the per-GPU models end up differing.
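
For illustration, here is a minimal sketch (not escnn code) of why iterating over a set of strings differs between processes: unless PYTHONHASHSEED is fixed, Python randomizes string hashes per interpreter, so each GPU worker process can see a different set iteration order.

    # set_order_demo.py -- run this script several times; the printed order
    # will often change, because string hashes are randomized per process.
    in_reprs = ["irrep_1", "irrep_0", "regular"]
    out_reprs = ["regular"]
    for i_repr in set(in_reprs):
        for o_repr in set(out_reprs):
            print(i_repr, o_repr)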

Here’s one example of wrong ordering:
On one GPU I had

"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",

and on another one I had

"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",

This should be caught by Torch. Unfortunately, it only verifies that parameters are consistent across replicas; buffers, i.e., values that stay constant through training, are not checked. And since BlocksBasisExpansion registers its sampled bases as buffers, the mismatch fails silently.

Examples of such buffers:
module.enc.enc_obs.conv.13.basisexpansion.block_expansion_('regular', 'regular').sampled_basis
diff.enc_a.0.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis
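
As a workaround until a fix lands, one could run a hypothetical sanity check like the following (not part of escnn or torch) before wrapping the model in DistributedDataParallel, to confirm that every rank registered its buffers in the same order:

    import torch.distributed as dist

    def assert_buffer_order_matches(model):
        # Collect the buffer names, in registration order, from every rank.
        names = [name for name, _ in model.named_buffers()]
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, names)
        # If any rank saw a different order, the replicas are inconsistent.
        if any(other != names for other in gathered):
            raise RuntimeError("Buffer registration order differs across ranks")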

kalekundert added a commit to kalekundert/escnn that referenced this issue Oct 31, 2024
@kalekundert (Contributor)

This bug is very similar to another one addressed by #93, so I just updated that PR to fix this one as well. Specifically, I replaced every instance of the for x in set(y) pattern with for x in unique_ever_seen(y), where the unique_ever_seen() function (already part of the PR) iterates over the unique elements of a list while preserving the list's order.
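
For reference, a minimal sketch of an order-preserving unique iterator in the spirit of unique_ever_seen() (the actual implementation in the PR may differ):

    def unique_ever_seen(iterable):
        # Yield each element the first time it appears, preserving order.
        seen = set()
        for x in iterable:
            if x not in seen:
                seen.add(x)
                yield x

    # Deterministic on every process: ['irrep_1', 'irrep_0', 'regular']
    print(list(unique_ever_seen(["irrep_1", "irrep_0", "regular", "irrep_0"])))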

kalekundert added a commit to kalekundert/escnn that referenced this issue Nov 25, 2024