Add get optimizer method #5149
Conversation
@microsoft-github-policy-service agree
@@ -294,3 +294,22 @@ def build_extension(self):

    def export_envs(self):
        return []

    def get_optimizer(self, optimizer_name, cpu_optimization, model_parameters, **optimizer_parameters):
        from habana_frameworks.torch.hpex.optimizers import FusedAdamW
I am not convinced that this much code is needed.
It seems the goal is to replace AdamW on hpu accelerator as below:
from deepspeed.ops.adam import FusedAdam
with
from habana_frameworks.torch.hpex.optimizers import FusedAdamW
Is this correct?
@tjruwase Yes, but it also prepares the ground for other fused optimizers that are handled in this function, for example FusedLamb, OneBit, etc.
Did you have something else in mind?
@nelyahu, thanks for the clarification. I do agree that deferring to the accelerator to instantiate a fused optimizer when available is the way to go. My concern is that duplicating the optimizer selection logic of engine.py here will be difficult to maintain long term.
I think the complication is that engine.py mixes optimizer name refinement (especially for Adam variants) with optimizer instantiation. By name refinement, I mean that something like ADAM_OPTIMIZER, which is a user-facing config value, internally maps to one of many optimizers (e.g., torch.optim.Adam, DeepSpeedCPUAdam, etc.). So, my thought is to:
- Create an internal naming convention for the various optimizer instantiations. For example, _TORCH_ADAM_OPTIMIZER for torch.optim.Adam, _DS_FUSED_ADAM for deepspeed.ops.adam.FusedAdam, _DS_ADAM_OPTIMIZER for deepspeed.ops.adam.DeepSpeedCPUAdam.
- Create a function that maps an external name to an internal one based on other configuration values. For example, ADAM_OPTIMIZER would map to one of _TORCH_ADAM, _DS_FUSED_ADAM, _DS_CPUADAM.
Based on the above, the following changes could now be made:
- Simplify get_optimizer() by accepting an internal optimizer name and creating an optimizer if the name is one of the supported internal optimizer names. Also, cpu_optimization is no longer needed.
- Simplify _configure_basic_optimizer() to (i) convert to the internal name, (ii) call accelerator().get_optimizer() with the internal name to create the optimizer, and (iii) otherwise fall through to the existing optimizer creation logic (see the illustrative sketch after this comment).
Looking forward to your feedback. Thanks!
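For reference, here is a rough, illustrative sketch of the name-refinement flow proposed above. The internal names, the refine_optimizer_name helper, and the engine-side wrapper are hypothetical illustrations of the idea, not code from this PR or from DeepSpeed itself.

```python
# Hypothetical sketch only -- names and helpers are illustrative, not from the PR.
import torch

# 1) Internal naming convention for the concrete optimizer instantiations.
_TORCH_ADAM = "_TORCH_ADAM"
_DS_FUSED_ADAM = "_DS_FUSED_ADAM"
_DS_CPU_ADAM = "_DS_CPU_ADAM"


def refine_optimizer_name(external_name, torch_adam=False, cpu_offload=False):
    """Map a user-facing optimizer name (e.g. 'adam') to one internal name,
    based on other configuration values."""
    if external_name.lower() == "adam":
        if torch_adam:
            return _TORCH_ADAM
        if cpu_offload:
            return _DS_CPU_ADAM
        return _DS_FUSED_ADAM
    raise ValueError(f"Unsupported optimizer: {external_name}")


def _configure_basic_optimizer(accelerator, external_name, model_parameters,
                               torch_adam=False, cpu_offload=False,
                               **optimizer_parameters):
    # (i) convert the external name to an internal one
    internal_name = refine_optimizer_name(external_name, torch_adam, cpu_offload)
    # (ii) let the accelerator build the optimizer if it has a specialized one
    optimizer = accelerator.get_optimizer(internal_name, model_parameters,
                                          **optimizer_parameters)
    if optimizer is not None:
        return optimizer
    # (iii) otherwise fall through to the existing optimizer creation logic
    if internal_name == _TORCH_ADAM:
        return torch.optim.Adam(model_parameters, **optimizer_parameters)
    raise NotImplementedError(f"No generic builder for {internal_name}")
```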
@tjruwase I applied the changes you suggested; however, some optimizers still require specific code. Which of the commits do you think is best?
@ShellyNR, thanks for making the changes. This looks better to me.
When you say "some optimizers still require specific code", is that referring to _ADAM in hpu_accelerator, which maps to torch.optim.Adam()? If so, I think that is fine. My goal is to avoid duplication between the accelerator code and engine.py.
03ee38a to 1be755b
"_TORCH_ADAMW": lambda arg1, **arg2: torch.optim.AdamW(arg1, **arg2), | ||
"_CPU_ADAM": lambda arg1, **arg2: DeepSpeedCPUAdam(arg1, **arg2, adamw_mode=False), | ||
"_CPU_ADAMW": lambda arg1, **arg2: DeepSpeedCPUAdam(arg1, **arg2, adamw_mode=True), | ||
"_ADAM": lambda arg1, **arg2: FusedAdam(arg1, **arg2, adam_w_mode=False), |
Thanks for creating this dict. Based on this, I think the mappings of cuda-specific optimizers, such as Fused[Adam|Lamb|Lion], should move to cuda_accelerator.py.
What do you think?
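To illustrate that suggestion, here is a hypothetical sketch (not code from this PR) of how the cuda-specific fused mappings could live in cuda_accelerator.py, mirroring the dict shown in the diff above; the class name and the internal key names are assumptions.

```python
# Hypothetical sketch of cuda_accelerator.py -- class and key names are assumptions.

class CUDA_Accelerator:

    def get_optimizer(self, optimizer_name, model_parameters, **optimizer_parameters):
        # Import lazily so that non-CUDA environments never build/load these ops.
        from deepspeed.ops.adam import FusedAdam
        from deepspeed.ops.lamb import FusedLamb

        cuda_optimizers = {
            "_FUSED_ADAM": lambda params, **kw: FusedAdam(params, **kw, adam_w_mode=False),
            "_FUSED_ADAMW": lambda params, **kw: FusedAdam(params, **kw, adam_w_mode=True),
            "_FUSED_LAMB": lambda params, **kw: FusedLamb(params, **kw),
        }
        builder = cuda_optimizers.get(optimizer_name)
        if builder is None:
            return None  # not CUDA-specific; engine.py keeps its generic logic
        return builder(model_parameters, **optimizer_parameters)
```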
@ShellyNR, thanks for your work on this PR. I wanted to check whether you need more clarification to address my comments.
I'm closing this PR; a new PR that addresses this issue without using a get_optimizer method will be pushed soon.
Add the get_optimizer method to accelerators for use during optimizer configuration, instead of checking for the specific accelerator in the engine.py code.
In all accelerators other than hpu, the current implementation returns None so that the previous flow is not affected.
In hpu_accelerator.py, the method returns an optimizer if the configured optimizer has an hpu-specific implementation or requirements; otherwise, it returns None.
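For illustration, here is a minimal sketch of that contract, assuming the get_optimizer signature shown in the diff; the class skeletons and the optimizer-name check are simplified and not the PR's exact code.

```python
# Simplified sketch, not the PR's actual implementation.

class HPU_Accelerator:

    def get_optimizer(self, optimizer_name, cpu_optimization, model_parameters,
                      **optimizer_parameters):
        # CPU-offloaded optimization keeps the default DeepSpeed path.
        if cpu_optimization:
            return None
        if optimizer_name == "adamw":
            # HPU provides its own fused AdamW implementation.
            from habana_frameworks.torch.hpex.optimizers import FusedAdamW
            return FusedAdamW(model_parameters, **optimizer_parameters)
        return None  # no hpu-specific implementation; engine.py falls back


class CPU_Accelerator:

    def get_optimizer(self, optimizer_name, cpu_optimization, model_parameters,
                      **optimizer_parameters):
        # Accelerators without specialized optimizers return None so the
        # previous flow in engine.py is not affected.
        return None
```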