
[BUG] sok amp mode error #462

Open · Orca-bit opened this issue Oct 21, 2024 · 1 comment

Orca-bit commented Oct 21, 2024

Describe the bug

[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py", line 129, in <module>
[1,0]<stderr>:    trainer = Trainer(
[1,0]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 161, in __init__
[1,0]<stderr>:    self._embedding_optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
[1,0]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/keras/mixed_precision/loss_scale_optimizer.py", line 343, in __call__
[1,0]<stderr>:    raise TypeError(msg)
[1,0]<stderr>:TypeError: "inner_optimizer" must be an instance of `tf.keras.optimizers.Optimizer` or `tf.keras.optimizers.experimental.Optimizer`, but got: <sparse_operation_kit.optimizer.OptimizerWrapperV2 object at 0x7f1b15b44910>.
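
For context, the failure is the LossScaleOptimizer type check rejecting SOK's optimizer wrapper. A minimal sketch of the pattern that triggers it (the sok.OptimizerWrapper call is illustrative and may differ across SOK versions):

```python
import tensorflow as tf
import sparse_operation_kit as sok

# SOK wraps a TF optimizer in its own OptimizerWrapperV2, which is not a
# tf.keras.optimizers.Optimizer subclass (construction shown for illustration).
emb_optimizer = sok.OptimizerWrapper(tf.keras.optimizers.SGD(learning_rate=1.0))

# Keras type-checks inner_optimizer here and raises the TypeError above.
emb_optimizer = tf.keras.mixed_precision.LossScaleOptimizer(emb_optimizer)
```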


Environment:

  • Docker image: nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
  • TensorFlow: 2.12.0+nv23.6


kanghui0204 (Collaborator) commented:
The optimizer in SOK is not a TensorFlow optimizer, so you cannot wrap it with tf.keras.mixed_precision.LossScaleOptimizer. Instead, get the loss-scale value from the dense part's optimizer, scale the gradients by it, and then feed them into the SOK optimizer.
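
A minimal sketch of that approach, assuming a custom training loop where only the dense optimizer is wrapped in LossScaleOptimizer (the sok.OptimizerWrapper call, variable lists, and Adam settings are illustrative, not the benchmark's exact code):

```python
import tensorflow as tf
import sparse_operation_kit as sok

tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Wrap only the dense optimizer for loss scaling; keep the SOK one unwrapped.
dense_optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam(learning_rate=1e-3)
)
embedding_optimizer = sok.OptimizerWrapper(
    tf.keras.optimizers.Adam(learning_rate=1e-3)
)

@tf.function
def train_step(model, loss_fn, inputs, labels, dense_vars, emb_vars):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(inputs, training=True))
        # Scale the loss with the dense optimizer's dynamic loss scale.
        scaled_loss = dense_optimizer.get_scaled_loss(loss)
    scaled_grads = tape.gradient(scaled_loss, dense_vars + emb_vars)
    # Unscale every gradient by that same loss scale before applying it;
    # this is the "get the scale value from the dense part's optimizer" step.
    grads = dense_optimizer.get_unscaled_gradients(scaled_grads)
    dense_grads = grads[: len(dense_vars)]
    emb_grads = grads[len(dense_vars) :]
    dense_optimizer.apply_gradients(zip(dense_grads, dense_vars))
    embedding_optimizer.apply_gradients(zip(emb_grads, emb_vars))
    return loss
```

Note that LossScaleOptimizer only skips updates on non-finite gradients for the dense variables; if you rely on dynamic loss scaling, the embedding side would need an explicit finite-check before applying its gradients.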
