Fix attention mask handling in the Hybrid Engine Bloom flow #5101

Merged

Conversation

deepcharm
Contributor

The Bloom flow in Hybrid Engine applies the same transformation of the input mask that is already performed earlier by the transformers BloomModel::forward.

This results in non-convergence of scores, specifically in DeepSpeed Chat, on different accelerators, including CUDA and HPU.

The fix removes the redundant mask transformation and application, producing correct convergence.

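To make the failure mode concrete, here is a toy sketch (not the actual transformers or DeepSpeed code) of why applying the same mask transformation twice corrupts attention. It assumes a 2D padding mask where 1 marks valid tokens and 0 marks padding, and a transformation that converts it into an additive mask (0.0 for positions to attend, a large negative value for masked ones):

```python
import torch

def to_additive_mask(mask: torch.Tensor) -> torch.Tensor:
    """Toy version of the transform: 1 (attend) -> 0.0, 0 (pad) -> large negative."""
    return (1.0 - mask.float()) * torch.finfo(torch.float32).min

# Batch of 1, sequence length 4; the last token is padding.
padding_mask = torch.tensor([[1, 1, 1, 0]])

once = to_additive_mask(padding_mask)   # what a BloomModel::forward-style transform produces
twice = to_additive_mask(once)          # redundant second pass, as in the buggy flow

print(once)   # valid tokens stay 0.0, only the padded position is hugely negative
print(twice)  # every position is now hugely negative, so attention is effectively broken
```

Applied once, the mask keeps valid positions at 0.0; applied a second time, the already-huge negative values feed back into the formula and the valid positions are pushed to large negatives as well, which is the kind of corruption that surfaces as non-converging scores.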
@lekurile lekurile self-requested a review February 9, 2024 17:20
@lekurile
Contributor

lekurile commented Feb 9, 2024

Hello @deepcharm,

Thank you for the contribution. I've studied the code in BloomModel::forward and can see that the call to _prepare_4d_causal_attention_mask is made there. I believe the referenced 1 - mask operation happens inside the _prepare_4d_causal_attention_mask function.

I think the change makes sense in this case, since we don't want to perform the redundant mask processing operation. However, I want to be careful that we don't fundamentally change the masking behavior in ds_attention.py, and that we account for model support beyond BLOOM, since I'm not sure whether all transformer model implementations do this mask processing outside the transformers block.

One option is to add an optional config parameter in our inference config.py that enables skipping this operation. We can set this parameter to False by default and to True only in the BLOOM container.

If more models need this behavior, we can enable it in the corresponding model-specific container.
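For illustration only, here is a minimal sketch of the kind of opt-out flag being suggested; the names below (InferenceMaskConfig, skip_mask_transform, build_bloom_container_config) are hypothetical and not the actual DeepSpeed config.py or container APIs:

```python
from dataclasses import dataclass

@dataclass
class InferenceMaskConfig:
    # False by default: every model keeps the existing mask-transformation behavior.
    skip_mask_transform: bool = False

def build_bloom_container_config() -> InferenceMaskConfig:
    # Only the BLOOM container opts out, because transformers' BloomModel::forward
    # has already transformed the mask before it reaches the inference attention path.
    return InferenceMaskConfig(skip_mask_transform=True)
```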

Please let me know if you have feedback or questions.

Thanks,
Lev

@deepcharm
Contributor Author

Hi @lekurile,

Thanks for your feedback. I will implement the option that you have described and submit another patch.

Max

@tjruwase
Contributor

@deepcharm, thanks for improving the PR. Please ping us again when it is ready for review.

deepcharm and others added 2 commits March 4, 2024 13:17
The BLOOM flow in Hybrid Engine applies the same transformation
of the input mask already performed earlier in the transformers
BloomModel::forward.

This results in the non-convergence of scores, specifically in
Deepspeed Chat on different accelerators, including CUDA and HPU.

An optional config parameter invert_mask is introduced into
DeepSpeedInferenceConfig (True by default), which enables skipping
the invert operation for some transformer implementations,
such as BLOOM.
@deepcharm
Contributor Author

Hi @lekurile, @tjruwase

As advised, I've added an optional config parameter invert_mask into DeepSpeedInferenceConfig (True by default),
which enables skipping the invert operation for some transformer implementations, such as BLOOM.

Kindly review the change. Thanks.
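As a rough sketch of how such a flag can gate the redundant operation: the class below is illustrative only and not the actual ds_attention.py code, while the field name and default follow the description above (invert_mask on DeepSpeedInferenceConfig, True by default, disabled for BLOOM):

```python
import torch

class MaskPreparer:
    """Illustrative only: shows the invert_mask gating pattern, not DeepSpeed internals."""

    def __init__(self, invert_mask: bool = True):
        # True (default): treat the input as a raw 0/1 mask that still needs inverting.
        # False (e.g. BLOOM): the mask was already transformed by the HF model's forward.
        self.invert_mask = invert_mask

    def prepare(self, attention_mask: torch.Tensor) -> torch.Tensor:
        if self.invert_mask:
            return (1.0 - attention_mask.float()) * torch.finfo(torch.float32).min
        return attention_mask  # pass through untouched for BLOOM-style flows
```

With invert_mask=False, the mask that BloomModel::forward already prepared is passed through untouched, which matches the behavior that restores convergence in DeepSpeed Chat.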

@lekurile
Contributor

lekurile commented Mar 4, 2024


@deepcharm Thank you! Looks good to me. Approved and running checks.

@lekurile lekurile enabled auto-merge March 12, 2024 22:57
@lekurile lekurile added this pull request to the merge queue Mar 12, 2024
Merged via the queue into microsoft:master with commit d9e12d3 Mar 13, 2024
12 checks passed
@deepcharm deepcharm deleted the fix-bloom-attention-mask-hybrid-engine branch March 14, 2024 16:42
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024