Multi-node training #126
Can you please provide more info on the code you are using and the failure message? The code is expected to handle multi-node training with torch.distributed, so to me it looks like a configuration problem in your distributed setup.
Thanks for the help. I am sure I am just doing something stupid here. I submit to Slurm like this (everything works with a single node). The training script is the same as in the examples, except for these changes:

```bash
[ -z "${MASTER_PORT}" ] && MASTER_PORT=10087
export NCCL_ASYNC_ERROR_HANDLING=1
OPTION=""
tmp_dir=
torchrun --nproc_per_node=$n_gpu --master_port $MASTER_PORT --nnodes=$OMPI_COMM_WORLD_SIZE --node_rank=$OMPI_COMM_WORLD_RANK --master_addr=$MASTER_IP
rm -rf $tmp_dir
```

Below is the log output. It just hangs forever after this: `n_gpu per node 8`
I think the master address/IP is not set properly; each node sets itself as the master. See https://discuss.pytorch.org/t/distributed-training-on-slurm-cluster/150417/8
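For reference, a minimal sketch of how the Slurm batch script could derive a single shared master address from the allocation, assuming `scontrol` is available on the compute nodes; `your_training_script.py`, the resource counts, and the port are placeholders, not the project's actual configuration:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first node of the allocation as the rendezvous master,
# so all nodes agree on the same address instead of each pointing at itself.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=10087
export NCCL_ASYNC_ERROR_HANDLING=1
n_gpu=8

# One srun task per node; each task launches torchrun with its own node rank.
srun torchrun \
  --nproc_per_node=$n_gpu \
  --nnodes=$SLURM_NNODES \
  --node_rank=$SLURM_NODEID \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  your_training_script.py  # placeholder: same training script and arguments as in the single-node example
```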
I was able to successfully train multimer on a single node with multiple GPUs, but I have been having trouble modifying the training example to train on multiple nodes. Would it be possible to provide an example of multi-node training?
I'm not sure how to properly modify the torchrun command or the unicore arguments.