[Draft][Demo] auto tp training #5445
base: master
@inkcherry Is there a link to the demo code? I'm interested in the potential use case of this feature proposal.
This PR should be addressing this discussion. Link.
FYI: https://github.com/inkcherry/stanford_alpaca/tree/tp_demo
@inkcherry and @delock, please let us know any way we can help. Thanks!
It would be super helpful if autoTP training could be made to work with Domino #6733.
@delock and @inkcherry, is this still active work?
Hi @skyshine102 @delock @inkcherry, I am leading the Domino project and would like to collaborate, if possible, on this effort around autoTP training and decoupling TP from Megatron bindings.
Hi @GuanhuaWang, is Domino referring to this paper? https://arxiv.org/html/2409.15241v1 Thanks!
@delock Yes, @GuanhuaWang is the first author.
@GuanhuaWang, sure, I can help with rebasing the code recently. I think this PR still needs three things to be done:
@inkcherry, I just noticed that you copied parallel_state from megatron-core into your draft, but the DeepSpeed engine already has one (here). You could modify that file instead of copying. Currently the DeepSpeed engine supports DP/PP/EP/SP, and now TP. I suppose all of these process groups will be needed.
…-precision version before the rebase, but the grad norm differs (display issue)
Hi @inkcherry, we need to make sure this PR does not impact autoTP inference performance and compatibility. When your PR is stable, check with Guobing and @rogerxfeng8 for internal testing.
This is an experimental demo of autoTP training, not for review. Apologies for the somewhat rudimentary draft; I hope it helps explain the process.
Currently, I have tested pure TP (DP=1 cases), directly using the HF transformers Trainer. I trained Llama 7B (fine-tuned from pretrained weights) on 4 GPUs and 8 GPUs with pure TP and got a loss curve going from 1.6 to 0.3 (as expected). The main modifications are as follows:
Changes on the DS side in this demo:
1 Decoupling MPU and Megatron: I took Megatron's code directly and put it in the parallel_states.py file.
2 Adding backward code for the main replacement modules, LinearLayer & LinearAllreduce.
3 Adding the 'tensor_model_parallel' attribute to LinearLayer & LinearAllreduce, ensuring they are handled correctly in grad norm and other calculations.
5 Setting requires_grad=True for the weights and biases of LinearLayer & LinearAllreduce, ensuring they are captured in model_params by the transformers prepare_deepspeed logic and passed to the DS optimizer.
6 _broadcast_model: Due to some inconsistencies in group settings, the DP group used by _broadcast_model is not correct, so I bypass this logic directly (DP=1).
7 gradient_allreduce: disabled directly, for a reason similar to 6. 5&6 can be resolved by a unified group init function.
8 Adding the autotp_size config.
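To make item 2 concrete, here is a minimal single-process sketch of the math that the forward and backward passes have to reproduce. It is not the code in this PR. It assumes that LinearLayer shards the weight along the output dimension (no communication in forward) and that LinearAllreduce shards along the input dimension and sums the partial results. Plain Python lists stand in for tensors, list entries stand in for TP ranks, and every helper name is hypothetical.

```python
# Single-process simulation of tensor-parallel linear layers.
# Assumption: "LinearLayer" is column-sharded, "LinearAllreduce" is row-sharded.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def split_cols(m, parts):
    """Shard a matrix column-wise into `parts` equal blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Shard a matrix row-wise into `parts` equal blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def concat_cols(blocks):
    """Stitch column blocks back together (stand-in for an all-gather)."""
    return [sum((blk[r] for blk in blocks), []) for r in range(len(blocks[0]))]

def all_reduce_sum(per_rank):
    """Elementwise sum over ranks (stand-in for an all-reduce)."""
    total = per_rank[0]
    for t in per_rank[1:]:
        total = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(total, t)]
    return total

TP = 2
x = [[1.0, 2.0], [3.0, 4.0]]                            # (batch=2, in=2)
w1 = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]       # (in=2, hidden=4)
w2 = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]   # (hidden=4, out=2)

# Forward: the column-sharded layer needs no communication,
# the row-sharded layer all-reduces its partial outputs.
w1_shards = split_cols(w1, TP)
w2_shards = split_rows(w2, TP)
h_shards = [matmul(x, w) for w in w1_shards]
y_tp = all_reduce_sum([matmul(h, w) for h, w in zip(h_shards, w2_shards)])
y_ref = matmul(matmul(x, w1), w2)

# Backward: the forward all-reduce passes grad_y through unchanged, so each
# rank computes its own slice of grad_h locally.
grad_y = [[1.0, 1.0], [1.0, 1.0]]
grad_h_shards = [matmul(grad_y, transpose(w)) for w in w2_shards]
# Weight gradients of the column-sharded layer stay local to each rank.
grad_w1_shards = [matmul(transpose(x), g) for g in grad_h_shards]
# The input gradient of the column-sharded layer is a partial sum per rank,
# so it needs an all-reduce in backward.
grad_x_tp = all_reduce_sum(
    [matmul(g, transpose(w)) for g, w in zip(grad_h_shards, w1_shards)]
)

# Unsharded reference gradients for comparison.
grad_h_ref = matmul(grad_y, transpose(w2))
grad_x_ref = matmul(grad_h_ref, transpose(w1))
grad_w1_ref = matmul(transpose(x), grad_h_ref)
```

In this sketch the sharded results match the unsharded reference exactly, which is the property the added backward code has to preserve: an all-reduce in forward pairs with a pass-through in backward, and a communication-free forward pairs with an all-reduce on the input gradient.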
Currently, in this basic version, I ran two simple tests. With the same global batch size (gbs) and gradient accumulation steps (gas), TP reaches 70% of ZeRO-3's performance. However, TP hits OOM at a lower gbs threshold than ZeRO-3, and at that point ZeRO-3 performs better. This may be due to the dataloader or to missing optimizations from Megatron; I didn't analyze it further.
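For reference, the batch-size bookkeeping behind a "same gbs and gas" comparison can be sketched as below. It relies on the usual relation gbs = micro batch per replica × gas × DP size, with DP size = world size / TP size. The function names and the numbers are illustrative assumptions, not measurements from these tests, and the sketch does not diagnose the OOM.

```python
def data_parallel_size(world_size: int, tp_size: int) -> int:
    """Number of model replicas when each replica spans `tp_size` GPUs."""
    if tp_size <= 0 or world_size % tp_size != 0:
        raise ValueError("world_size must be a positive multiple of tp_size")
    return world_size // tp_size


def global_batch_size(micro_batch: int, gas: int, world_size: int, tp_size: int) -> int:
    """gbs = micro batch per replica * gradient accumulation steps * DP size."""
    return micro_batch * gas * data_parallel_size(world_size, tp_size)


# Illustrative numbers only: 8 GPUs, micro batch 4, gas 8.
pure_tp_gbs = global_batch_size(4, 8, world_size=8, tp_size=8)   # DP=1
zero3_gbs = global_batch_size(4, 8, world_size=8, tp_size=1)     # DP=8

# To match the DP=8 gbs with the same gas under pure TP, the single replica
# needs a micro batch that is 8 times larger.
matched_micro_batch = zero3_gbs // (8 * data_parallel_size(8, 8))
```

With DP=1, the single TP replica sees the whole global batch, so matching gbs and gas with a DP=8 run means a larger micro batch per replica.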
The benefit of doing this is to decouple TP from Megatron bindings and let users train directly with transformers + DS using TP plus other features. It can also be applied to other simple models (through module replacement). Additionally, because the autoTP inference code already exists, along with the integration between the ZeRO backend and transformers, not much additional logic is needed.
For better use:
The most basic step is compatibility with ZeRO DP. It could also be made compatible with more features (reusing the relevant DS logic for Megatron's TP), along with some performance and memory optimizations.