ORT training support stage3 #1439
@JingyaHuang could you please help take a look? Thanks a lot!
…ngwa/enable_stage3_ort
Thanks for updating the PR, please fix the style as well with make style.
And a question about the ZeRO-3 support: is it experimental and only available after the 1.17 release? Would it be possible to test it already? And any idea when ORT 1.17 will be released? Thanks!
Yeah, I will fix it later. As for 1.17, it will be released in the coming weeks. I am not sure of the concrete date, but we are striving to ship stage3 in this new release. Currently the feature is baking in the main branch (which already has version 1.17). Hope this makes things clear. Feel free to let me know if there are more questions.
…ngwa/enable_stage3_ort
Hi @pengwa, thanks for updating the branch and for the explanation. I did a quick test (training BERT on GLUE) with ZeRO-3 enabled. The training did not fail, but it popped up some error logs related to the tracing. Have you seen that before? (FYI, I tested with the ORT nightly build.)
Thanks @JingyaHuang for bringing this up! I believe this is related to an exporter issue in some PyTorch versions. Indeed, we need more version checks on the dependency libraries, including PyTorch, DeepSpeed, and ORT. Let me collect the versions first and update the check later.
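For illustration, a dependency version check of the kind mentioned above could look roughly like the following. This is a minimal sketch, not the check that was eventually written: the minimum versions in `MIN_VERSIONS` are hypothetical placeholders (the thread does not state them), the distribution names `torch`, `deepspeed`, and `onnxruntime-training` are assumed, and `parse_version` and `find_unsupported` are made-up helper names.

```python
from importlib import metadata

# Hypothetical minimum versions, for illustration only; the thread does not give them.
MIN_VERSIONS = {
    "torch": "2.0.0",
    "deepspeed": "0.9.0",
    "onnxruntime-training": "1.17.0",
}


def parse_version(text):
    """Reduce strings like '2.1.0+cu118' or '1.17.0.dev20231201' to a (major, minor, patch) tuple."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at non-numeric segments such as 'dev20231201'
        parts.append(int(digits))
    # Pad so that '1.17' and '1.17.0' compare equal.
    return tuple((parts + [0, 0, 0])[:3])


def find_unsupported(min_versions=MIN_VERSIONS, get_version=metadata.version):
    """Return {package: reason} for every dependency that is missing or too old."""
    problems = {}
    for name, minimum in min_versions.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if parse_version(installed) < parse_version(minimum):
            problems[name] = f"found {installed}, need >= {minimum}"
    return problems
```

Calling `find_unsupported()` before enabling stage3 would return an empty dict when all three libraries satisfy the assumed minimums, and a per-package reason otherwise. The simplified parser ignores pre-release ordering, so a real check would more likely rely on a dedicated version-parsing library.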
Hi @JingyaHuang, sorry for the delay. I have been focusing on optimizing performance and memory and could not spare time for stricter version controls, so I will hand this work over to my teammate. :) Let me close this one; we will create a new PR later.
@pengwa Sure, no probs!
The new version of ORT training has stage3 support.
This PR enables that support when stage3 is used.
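As a rough illustration of what "when stage3 is used" can mean in practice, the sketch below reads the ZeRO stage from a DeepSpeed-style configuration dictionary, where the stage lives under `zero_optimization.stage`. The helper names `zero_stage` and `should_enable_ort_stage3` are hypothetical, and this is not the code from the PR; how ORT is actually switched into stage3 mode is not shown here.

```python
def zero_stage(ds_config):
    """Return the ZeRO stage from a DeepSpeed-style config dict (0 if unset)."""
    return int(ds_config.get("zero_optimization", {}).get("stage", 0))


def should_enable_ort_stage3(ds_config):
    """Hypothetical gate: only turn on the ORT stage3 path for ZeRO stage 3."""
    return zero_stage(ds_config) == 3


# Example configs: one with ZeRO stage 3, one with stage 2, one without ZeRO.
stage3_config = {"zero_optimization": {"stage": 3}}
stage2_config = {"zero_optimization": {"stage": 2}}
plain_config = {"train_batch_size": 8}
```

With these example configs, only `stage3_config` passes the gate; the other two leave the stage3 path disabled.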
Fixes # (issue)
Before submitting