Fix attention mask handling in the Hybrid Engine Bloom flow #5101
Conversation
The BLOOM flow in Hybrid Engine applies the same transformation of the input mask that is already performed earlier in the transformers BloomModel::forward. This results in non-convergence of scores, specifically in DeepSpeed-Chat, on different accelerators, including CUDA and HPU. The fix removes the redundant second mask transformation and application, producing correct convergence.
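A minimal, hypothetical sketch (not the actual DeepSpeed internals) of why applying the same mask transformation twice breaks attention: the second inversion flips the mask's semantics, so real tokens end up masked and padding ends up attended.

```python
import torch

# 1 = attend, 0 = padding (the usual transformers convention)
attention_mask = torch.tensor([[1, 1, 1, 0]])

def to_boolean_mask(mask):
    # Typical transformation: True marks positions to be masked out
    return mask == 0

once = to_boolean_mask(attention_mask)  # [[False, False, False, True]] -> correct
twice = to_boolean_mask(once)           # [[True, True, True, False]]  -> real tokens now masked
print(once)
print(twice)
```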
Hello @deepcharm,

Thank you for the contribution. I've studied the code in BloomModel::forward and can see that the mask transformation in question is indeed already applied there. I think in this case the change makes sense, since we don't want to do the redundant mask processing operation; however, I just want to be careful that we don't change the behavior of other models.

One option is to add an optional config parameter in our inference config.py that will enable skipping of this operation. We can set this parameter to preserve the current behavior by default. If more models need this behavior, we can enable this in the corresponding model-specific container.

Please let me know if you have feedback or questions.

Thanks,
Lev
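A minimal sketch of the suggested config option. The field name matches the `invert_mask` parameter the PR ultimately adopted; the class name and dataclass form here are illustrative, not the actual DeepSpeedInferenceConfig implementation.

```python
from dataclasses import dataclass

@dataclass
class InferenceConfigSketch:
    # Defaulting to True preserves the existing mask-inversion behavior
    # for every model; a model-specific container (e.g. BLOOM's) can
    # disable it to skip the redundant transformation.
    invert_mask: bool = True
```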
Hi @lekurile,

Thanks for your feedback. I will implement the option that you have described and submit another patch.

Max
@deepcharm, thanks for improving the PR. Please ping us again when it is ready for review.
The BLOOM flow in Hybrid Engine applies the same transformation of the input mask that is already performed earlier in the transformers BloomModel::forward. This results in non-convergence of scores, specifically in DeepSpeed-Chat, on different accelerators, including CUDA and HPU. An optional config parameter invert_mask is introduced into DeepSpeedInferenceConfig (True by default), which enables skipping the invert operation for transformer implementations, such as BLOOM, that have already applied it.
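A hedged sketch of how the new flag could gate the second transformation inside the Hybrid Engine attention path. The function and parameter names here are illustrative, not the exact DeepSpeed internals.

```python
def prepare_attn_mask(attention_mask, config):
    """Gate the Hybrid Engine's mask inversion on the invert_mask flag."""
    if config.invert_mask:
        # Default path: convert a "1 = attend" padding mask into a
        # "True = masked" boolean mask before applying it to the scores.
        return attention_mask == 0
    # BLOOM path: transformers' BloomModel.forward has already performed
    # this transformation, so pass the mask through unchanged.
    return attention_mask
```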
@deepcharm Thank you! Looks good to me. Approved and running checks.
…t#5101)

The BLOOM flow in Hybrid Engine applies the same transformation of the input mask which is already performed earlier by the transformers BloomModel::forward. This results in the non-convergence of scores, specifically in DeepSpeed-Chat, on different accelerators, including CUDA and HPU. The fix removes the redundant mask transformation and application, producing correct convergence.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Lev Kurilenko <[email protected]>