Update multi-node.qmd #1688
base: main
Added Distributed Finetuning For Multi-Node with Axolotl and Deepspeed
@muellerzr does this seem right?
This seems to assume that you have access to each node before your training starts. However, many cloud systems such as AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes that you can modify these variables. @shahdivax @winglian I would suggest a more automatic setup if you want this to work well for users.
This assumes that users are running EC2 instances on AWS. (I forgot to add that 😓) Edit: Added to the heading.
On Node 1 (server), run the finetuning process using Accelerate:

```bash
accelerate launch -m axolotl.cli.train examples/llama-2/qlora.yml
```

This will start the finetuning process across all nodes. You can check the different IP addresses before each step to verify that the training is running on every node.
To my knowledge this is not the case. You need to run `accelerate launch -m` on every server, or else it will sit there and never actually start.
When we tested, we started the run on a single node (the server) only, and it was able to use the resources of the other nodes. As proof, we could see the IPs of both machines on the left, and the total GPU count showed all the GPUs from all the nodes.
@muellerzr My guess is that this is specific to DeepSpeed, since the IP addresses are set in a hostfile. We should make clear that the command only needs to be run on the first node in that case. Most other setups, such as FSDP or plain multi-node DDP, will likely still need `accelerate launch` to be run on each node.
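To illustrate the distinction, here is a minimal sketch of the per-node launch that plain DDP or FSDP would likely require. The IP address, port, and GPU counts are placeholder assumptions, not values from this PR, and the helper `build_launch_cmd` is hypothetical. It only prints the command so it can be inspected before being run on each node.

```bash
#!/usr/bin/env bash
# Sketch: print the per-node `accelerate launch` command for a 2-node run.
# MAIN_IP, MAIN_PORT, NUM_MACHINES and GPUS_PER_NODE are hypothetical placeholders.
MAIN_IP="10.0.0.1"
MAIN_PORT=29500
NUM_MACHINES=2
GPUS_PER_NODE=8

build_launch_cmd() {
  local rank="$1"  # 0 on the main node, 1..N-1 on the other nodes
  # --num_processes is the total process count across all machines.
  echo "accelerate launch" \
    "--num_machines ${NUM_MACHINES}" \
    "--num_processes $((NUM_MACHINES * GPUS_PER_NODE))" \
    "--machine_rank ${rank}" \
    "--main_process_ip ${MAIN_IP}" \
    "--main_process_port ${MAIN_PORT}" \
    "-m axolotl.cli.train examples/llama-2/qlora.yml"
}

build_launch_cmd 0   # command for the main node
build_launch_cmd 1   # command for the second node
```

Only `--machine_rank` differs between nodes; every node points at the same main process IP and port.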
@winglian, @muellerzr That might be the case. DeepSpeed was a good option for us because we were running multi-node finetuning on EC2, which provides public IPs. With a hostfile it was easy to connect both machines and run the finetuning on the root node only, which in turn connected all the other instances (using the resources of all the nodes from a single node).
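For reference, a DeepSpeed hostfile lists one host per line together with the number of GPU slots on that host. A minimal sketch that writes one, assuming two nodes with 8 GPUs each; the addresses, slot count, and output path are placeholders rather than values from this PR:

```bash
#!/usr/bin/env bash
# Sketch: write a DeepSpeed-style hostfile ("<host> slots=<num_gpus>" per line).
# The IPs, slot count, and output path below are hypothetical placeholders.
HOSTFILE="${HOSTFILE:-./hostfile}"
NODES=("10.0.0.1" "10.0.0.2")
SLOTS=8

: > "$HOSTFILE"  # create the file, or truncate it if it already exists
for host in "${NODES[@]}"; do
  echo "${host} slots=${SLOTS}" >> "$HOSTFILE"
done

cat "$HOSTFILE"
```

The node that launches the job also needs passwordless SSH access to every host listed in the file.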
Co-authored-by: Wing Lian <[email protected]>
All the required changes are done, and I think this doc is now good to go.
Title: Distributed Finetuning For Multi-Node with Axolotl and Deepspeed
Description:
This PR introduces a comprehensive guide for setting up a distributed finetuning environment using Axolotl and Accelerate. The guide covers the following steps: