
No training when using 2 nodes and torchrun #60

Open
ElizavetaSedova opened this issue Dec 15, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@ElizavetaSedova

Thank you for adding the ability to use multiple nodes without Slurm! When I run training on one machine using torchrun, everything works without problems. But when I run it on two machines, the training freezes: the machines connect to each other and the model is loaded, but the training process does not continue. It gets stuck somewhere, and I'm now trying to figure out how to solve this. My launch command:

torchrun --master-addr [ip] \
--master-port [port] \
--node_rank 0 \
--nnodes 2 \
--nproc-per-node 2 \
-m dora run [ARGS]
ElizavetaSedova added the bug label Dec 15, 2023
@adefossez
Contributor

Are you changing the node id for the second command?
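
For reference (a sketch, not from the thread): the launch on the second machine should be identical except for the rank flag, keeping the same [ip], [port], and [ARGS] placeholders as above:

torchrun --master-addr [ip] \
--master-port [port] \
--node_rank 1 \
--nnodes 2 \
--nproc-per-node 2 \
-m dora run [ARGS]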

@ElizavetaSedova
Author

ElizavetaSedova commented Dec 15, 2023

@adefossez Yes, of course. I did some debugging and noticed that the time to complete each step increased greatly. Training is in fact happening, but it is very slow, whereas on one machine the process runs quickly. I expected that adding N nodes would reduce the time per epoch, but in the end it increased greatly.

@adefossez
Contributor

It will depend on the batch size, and on whether you are specifying a per-GPU or an overall batch size (I recommend the latter, so that the meaning of the XP doesn't change based on how many GPUs are used). What codebase is this for?
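
To make the difference concrete (a hypothetical illustration, not this codebase's actual flags): with an overall batch size, each GPU's share shrinks as the world size grows, so adding nodes splits the same work; with a per-GPU batch size, the effective global batch grows instead and each step gets heavier:

NNODES=2; NPROC_PER_NODE=2                 # matches the torchrun flags above
WORLD_SIZE=$((NNODES * NPROC_PER_NODE))
GLOBAL_BATCH=64                            # hypothetical overall batch size
echo "per-GPU batch: $((GLOBAL_BATCH / WORLD_SIZE))"  # 16 here, vs 32 on one 2-GPU node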

@adefossez
Contributor

It will also depend on whether you have a good interconnect between the nodes!
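
A generic way to check which transport NCCL actually picked between the nodes (standard NCCL environment variables, not specific to this project) is to rerun the same launch with debug logging:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
torchrun --master-addr [ip] --master-port [port] \
  --node_rank 0 --nnodes 2 --nproc-per-node 2 -m dora run [ARGS]
# The INIT/NET log lines show whether NCCL is using InfiniBand or falling
# back to TCP sockets; a socket fallback over a slow link would explain
# steps being much slower across two nodes than within one.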

@Tristan-Kosciuch

Check your network config; you might have a firewall, security group, etc. blocking access on the ports torchrun is using. If you get this working, please update us. I'm currently going through the same headache...
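
A quick reachability check from the second node, using plain netcat and the same [ip]/[port] placeholders as the launch command:

nc -zv [ip] [port]   # the rendezvous port on the master node must accept connections
# Note that the ranks also open extra connections on ephemeral ports once
# training starts, so a restrictive firewall can pass this check and still
# block the job.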

@ElizavetaSedova
Author

@Tristan-Kosciuch Yes, everything works for me now. To be honest, I don't remember what the problem was. I reinstalled the environment, updated all the libraries, and the problem was solved.
