
No training when using 2 nodes and torchrun #60

Open
ElizavetaSedova opened this issue Dec 15, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@ElizavetaSedova

Thank you for adding the ability to use multiple nodes without Slurm! When I run training on one machine using torchrun, everything works without problems. But when I run it on two machines, the training freezes: the machines connect to each other and the model is loaded, but the training process does not continue. It gets stuck somewhere, and I'm now trying to figure out how to solve this. My launch command:

torchrun --master-addr [ip] \
--master-port [port] \
--node_rank 0 \
--nnodes 2 \
--nproc-per-node 2 \
-m dora run [ARGS]
ElizavetaSedova added the bug label Dec 15, 2023
@adefossez
Contributor

Are you changing the node id for the second command?
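
For reference (a sketch, not from the thread): the launch on the second machine should be identical except for the rank flag, keeping the same [ip], [port], and [ARGS] placeholders as above:

torchrun --master-addr [ip] \
--master-port [port] \
--node_rank 1 \
--nnodes 2 \
--nproc-per-node 2 \
-m dora run [ARGS]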

@ElizavetaSedova
Author

ElizavetaSedova commented Dec 15, 2023

@adefossez Yes, of course. I did some debugging and noticed that the time to complete each step increased greatly. Training is in fact happening, but it is very slow, whereas on one machine the process runs quickly. I expected that adding N nodes would reduce the time per epoch, but in the end it increased greatly.

@adefossez
Contributor

It will depend on the batch size, and on whether you are specifying a per-GPU or an overall batch size (I recommend the latter, so that the meaning of the XP doesn't change based on how many GPUs are used). What codebase is this for?
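
To make the difference concrete (a hypothetical illustration, not this codebase's actual flags): with an overall batch size, each GPU's share shrinks as the world size grows, so adding nodes splits the same work; with a per-GPU batch size, the effective global batch grows instead and each step gets heavier:

NNODES=2; NPROC_PER_NODE=2                 # matches the torchrun flags above
WORLD_SIZE=$((NNODES * NPROC_PER_NODE))
GLOBAL_BATCH=64                            # hypothetical overall batch size
echo "per-GPU batch: $((GLOBAL_BATCH / WORLD_SIZE))"  # 16 here, vs 32 on one 2-GPU node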

@adefossez
Contributor

It will also depend on whether you have a good interconnect between the nodes!
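
A generic way to check which transport NCCL actually picked between the nodes (standard NCCL environment variables, not specific to this project) is to rerun the same launch with debug logging:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
torchrun --master-addr [ip] --master-port [port] \
  --node_rank 0 --nnodes 2 --nproc-per-node 2 -m dora run [ARGS]
# The INIT/NET log lines show whether NCCL is using InfiniBand or falling
# back to TCP sockets; a socket fallback over a slow link would explain
# steps being much slower across two nodes than within one.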

@Tristan-Kosciuch

Check your network config; you might have a firewall, security group, etc. blocking access on the ports torchrun is using. If you get this working, please update us. I'm currently going through the same headache...
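
A quick reachability check from the second node, using plain netcat and the same [ip]/[port] placeholders as the launch command:

nc -zv [ip] [port]   # the rendezvous port on the master node must accept connections
# Note that the ranks also open extra connections on ephemeral ports once
# training starts, so a restrictive firewall can pass this check and still
# block the job.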

@ElizavetaSedova
Author

@Tristan-Kosciuch Yes, everything works for me now. To be honest, I don't remember what the problem was. I reinstalled the environment, updated all the libraries, and the problem was solved.
