
Non-deterministic behavior on multi-GPU run #103

Open
lrenaux-bdai opened this issue Oct 31, 2024 · 1 comment
@lrenaux-bdai

I have encountered a bug that invalidates multi-GPU training: because model initialization is non-deterministic, the model replica stored on each GPU diverges from the others.

This happens for every sampled_basis layer. Specifically, it occurs in the init of BlocksBasisExpansion (https://github.com/QUVA-Lab/escnn/blob/master/escnn/nn/modules/basismanager/basisexpansion_blocks.py) at:

for i_repr in set(in_reprs):
    for o_repr in set(out_reprs):

which makes the creation order of the layers random, since it iterates over sets (here in_reprs is {"irrep_1", "irrep_0", "regular"} and out_reprs is {"regular"}). The iteration order is drawn independently in each GPU's process, so the per-GPU models end up differing.
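
For illustration, here is a minimal sketch (not escnn code) of why iterating over a set of strings differs between processes: unless PYTHONHASHSEED is fixed, Python randomizes string hashes per interpreter, so each GPU worker process can see a different set iteration order.

    # set_order_demo.py -- run this script several times; the printed order
    # will often change, because string hashes are randomized per process.
    in_reprs = ["irrep_1", "irrep_0", "regular"]
    out_reprs = ["regular"]
    for i_repr in set(in_reprs):
        for o_repr in set(out_reprs):
            print(i_repr, o_repr)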

Here’s one example of wrong ordering:
On one GPU I had

"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",

and on another one I had

"module.enc.enc_out.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('irrep_0', 'regular').sampled_basis",
"module.enc.enc_out.basisexpansion.block_expansion_('regular', 'regular').sampled_basis",

This should be caught by Torch. Unfortunately, it only verifies that parameters are consistent across replicas; buffers, i.e., values that stay constant through training, are not checked. And since BlocksBasisExpansion registers its sampled bases as buffers, the mismatch fails silently.

Examples of such buffers:
module.enc.enc_obs.conv.13.basisexpansion.block_expansion_('regular', 'regular').sampled_basis
diff.enc_a.0.basisexpansion.block_expansion_('irrep_1', 'regular').sampled_basis
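
As a workaround until a fix lands, one could run a hypothetical sanity check like the following (not part of escnn or torch) before wrapping the model in DistributedDataParallel, to confirm that every rank registered its buffers in the same order:

    import torch.distributed as dist

    def assert_buffer_order_matches(model):
        # Collect the buffer names, in registration order, from every rank.
        names = [name for name, _ in model.named_buffers()]
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, names)
        # If any rank saw a different order, the replicas are inconsistent.
        if any(other != names for other in gathered):
            raise RuntimeError("Buffer registration order differs across ranks")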

kalekundert added a commit to kalekundert/escnn that referenced this issue Oct 31, 2024
@kalekundert (Contributor)

This bug is very similar to another one addressed by #93, so I just updated that PR to fix this one as well. Specifically, I replaced every instance of the for x in set(y) pattern with for x in unique_ever_seen(y), where the unique_ever_seen() function (already part of the PR) iterates over the unique elements of a list while preserving the list's order.
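
For reference, a minimal sketch of an order-preserving unique iterator in the spirit of unique_ever_seen() (the actual implementation in the PR may differ):

    def unique_ever_seen(iterable):
        # Yield each element the first time it appears, preserving order.
        seen = set()
        for x in iterable:
            if x not in seen:
                seen.add(x)
                yield x

    # Deterministic on every process: ['irrep_1', 'irrep_0', 'regular']
    print(list(unique_ever_seen(["irrep_1", "irrep_0", "regular", "irrep_0"])))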

kalekundert added a commit to kalekundert/escnn that referenced this issue Nov 25, 2024