No training when using 2 nodes and torchrun #60
Comments
Did you change the node id for the second command?
@adefossez Yes, of course. I did some debugging and noticed that the time it takes to complete each step increased greatly. Training is in fact happening, but it is very slow, whereas if I run it on one machine the process goes quickly. I expected that adding N nodes would reduce the time to complete one epoch, but instead it increased greatly.
It will depend on the batch size, and whether you are specifying a per-GPU or an overall batch size (I recommend the latter, so that the meaning of the XP doesn't change based on how many GPUs are used). What codebase is this for?
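As a rough illustration of that overall-batch-size convention (the numbers and variable names below are made up for this sketch, not taken from any particular codebase): when the overall batch is fixed, the per-GPU batch simply shrinks as GPUs are added, so the effective batch seen by the optimizer, and hence the meaning of the XP, stays the same.

```bash
# Hypothetical numbers: an overall batch of 64 split over 2 nodes x 8 GPUs.
GLOBAL_BATCH=64
NNODES=2
NPROC_PER_NODE=8

# Each GPU processes GLOBAL_BATCH / (NNODES * NPROC_PER_NODE) samples per step,
# i.e. 64 / 16 = 4 here, so the optimizer still sees 64 samples per update
# regardless of how many GPUs are used.
PER_GPU_BATCH=$(( GLOBAL_BATCH / (NNODES * NPROC_PER_NODE) ))
echo "per-GPU batch size: ${PER_GPU_BATCH}"
```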
Also, it will depend on whether you have a good interconnect between the nodes!
Check your network config: you might have a firewall, security group, etc. blocking access on the ports torchrun is using. If you get this working, please update us; I'm currently going through the same headache...
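A quick way to rule the firewall or security group in or out, assuming a standard torchrun setup where node 0 hosts the rendezvous endpoint (the address, port, and interface name below are placeholders, not values from this issue):

```bash
# From the second node, check that node 0's rendezvous/master port is reachable
# (replace the address and port with whatever you pass to torchrun).
nc -zv 10.0.0.1 29500

# NCCL opens additional connections of its own between nodes; turning on its
# debug logging and pinning the network interface it should use often exposes
# firewall or interface-selection problems.
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0   # replace with the interface that routes between your nodes
```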
@Tristan-Kosciuch Yes, everything works for me now. To be honest, I don't remember what the problem was; I reinstalled the environment, updated all the libraries, and the problem was solved.
Thank you for adding the ability to use multi-node training without Slurm! When I run training on one machine using torchrun, everything works without problems, but when I run it on two machines the training freezes: the machines connect to each other and the model is loaded, but the training process does not continue; it gets stuck somewhere. I'm now trying to figure out how to solve this problem. My launch command:
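(The command itself was not captured above. For reference, a typical two-node torchrun launch of the kind being described looks roughly like the following; the script name, IP address, port, rendezvous id, and GPU counts are placeholders, not the actual command from this issue.)

```bash
# Node 0 (hosts the rendezvous endpoint; train.py, 10.0.0.1, 29500 and the
# per-node GPU count are placeholders):
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
    --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29500 --rdzv_id=my_job \
    train.py

# Node 1: identical command except for --node_rank.
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
    --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29500 --rdzv_id=my_job \
    train.py
```

If both nodes report joining the rendezvous but then hang, the rendezvous handshake itself has succeeded and the stall is typically somewhere later (for example NCCL initialization or data loading), which would match the "model loads, then nothing" symptom described here.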