Calculate position ids in modeling utils for all generative models #30053

Closed
wants to merge 25 commits into from

Conversation

zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Apr 4, 2024

What does this PR do?

As discussed under this PR, position ids in some models are not calculated/inferred from the attention mask in forward, which gives incorrect positions when the inputs are left-padded.

For consistency and ease of maintenance, the logic for inferring position ids is moved to "modeling_utils.py", and all generative models call that method in their forward and prepare_inputs_for_generation. I added two tests to check whether model outputs are the same when position ids are passed by the user vs. inferred from input ids or embeds.
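
For reference, a minimal sketch of the cumsum-based inference (the helper name and exact slicing here are illustrative, not necessarily the code added in this PR):

```python
import torch

def infer_position_ids(attention_mask: torch.Tensor, past_length: int = 0) -> torch.Tensor:
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    position_ids = attention_mask.long().cumsum(-1) - 1
    # padded positions are never attended to, so their value is arbitrary (filled with 1 here)
    position_ids.masked_fill_(attention_mask == 0, 1)
    return position_ids[:, past_length:]

mask = torch.tensor([[0, 0, 1, 1, 1]])   # left-padded input
print(infer_position_ids(mask))          # tensor([[1, 1, 0, 1, 2]])
```

This way left padding no longer shifts the positions of the real tokens.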

Also Fixes #29149.

The newly added tests are passing, plus the slow tests on vision models (they still do not have GenerationTesterMixin).

Btw, I see that non-generative models already use a create_position_ids_from_input_ids method, which is copied separately into each model's file. The logic is a bit different from generative models because they start counting from "padding_idx" and not "0". Anyway, I guess it is still possible to merge that method with the one proposed here, to have one "get_position_id" for all models in "modeling_utils".
@gante WDYT ?

@zucchini-nlp zucchini-nlp requested a review from gante April 4, 2024 17:14
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp
Member Author

About the framework changes: I found that TF/Flax get position_ids in a slightly different way from the torch models. Those frameworks generate position ids in forward without taking the attention mask into account, which is the same thing we had in torch before these fixes.

I made TF and Flax work the same way as torch does now, with a cumsum over the attention mask, so that the cross-framework equivalence tests pass. I am not sure if we need a test similar to "test_position_ids" for TF/Flax. Tests should pass now; at least locally everything seemed okay.
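
For illustration, a small cross-framework check of that idea (a sketch only, not the PR's test code):

```python
import numpy as np
import torch
import tensorflow as tf
import jax.numpy as jnp

mask = np.array([[0, 0, 1, 1, 1]])  # left-padded example

# torch convention in this PR: cumsum - 1, padded slots filled with a dummy value
pt_ids = torch.tensor(mask).cumsum(-1) - 1
pt_ids = pt_ids.masked_fill(torch.tensor(mask) == 0, 1)

# tf: exclusive cumsum yields the same 0-based positions for attended tokens
tf_ids = tf.math.cumsum(tf.constant(mask), axis=-1, exclusive=True)

# flax/jax: same idea with jnp
fx_ids = jnp.where(jnp.asarray(mask) == 0, 1, jnp.cumsum(jnp.asarray(mask), axis=-1) - 1)

# all three agree wherever the mask is 1 (padded slots differ, but they are masked out anyway)
print(pt_ids.numpy() * mask)      # [[0 0 0 1 2]]
print(tf_ids.numpy() * mask)      # [[0 0 0 1 2]]
print(np.asarray(fx_ids) * mask)  # [[0 0 0 1 2]]
```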

Member

@gante gante left a comment

In general looks good, thank you for tackling this refactor 💪

A few notes:

  1. No TF function to infer the position IDs? 😢 TF feels neglected 💔
  2. There are CI errors in the model equivalence tests. Model equivalence is flaky by nature, so make sure you run it for all models with the flake finder locally!
  3. After you're happy with the changes, commit with [test_all] and tag me again. I've glanced over the model-level changes after the first few models; I'll do a final, more careful check after the full CI is green 🤗

Comment on lines +436 to +439
if length_diff < 0:
    position_ids = position_ids[:, :length_diff]
elif length_diff > 0:
    new_position_ids = torch.arange(position_ids[0, -1], new_length, device=position_ids.device).unsqueeze(0)
Member

Can you add a comment briefly explaining when each situation can be triggered, and why we want that operation? Our future selves will probably be happy with that comment

e.g. I'm assuming length_diff > 0 is used when candidates are proposed, and thus we want the corresponding position ids. But I'm not immediately seeing when length_diff < 0 can be triggered :)

Member

^ this function still needs better variable names and/or a docstring

src/transformers/models/codegen/modeling_codegen.py (outdated, resolved)
@@ -1189,6 +1189,66 @@ def test_assisted_decoding_matches_greedy_search(self):
for output in (output_greedy, output_assisted):
    self._check_outputs(output, input_ids, model.config, use_cache=True)

@is_flaky()
def test_assisted_decoding_position_ids(self):
Member

Like in the other PR where you added an assisted generation test: let's make this a parameterization of the original test, since it's a minor variation :)

tests/generation/test_utils.py (outdated, resolved)
tests/generation/test_utils.py (outdated, resolved)
@zucchini-nlp
Member Author

Okay, will work on it.

  1. TF has only 3 decoder-only models, so I thought we would not need it. Okay, I can add it in the same way.
  2. Those are the composite models for Flax; I think it needs a fix, but I could not find where yet.
  3. Okay :)

@zucchini-nlp
Member Author

@gante the comments are addressed now. TF cannot have the "get_position_ids" method in PretrainedModel because all input-related preparation in TF happens in a "keras.layers.Layer" class. I am not sure if we can or should move the position_id preparation into the "PretrainedModel", since there are only 3 TF models that needed the change.

Also, a note on Flax-based encoder-decoder models: the attention mask for the decoder part is overridden to be all ones, because when a decoder-only model is used as the decoder, the position ids are calculated differently (I mean only for the unattended part). In randomly initialized models this causes a logits mismatch, even though the attention masks out the unattended positions. In pre-trained models that does not happen 🤔


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@zucchini-nlp
Member Author

Hold it for a while, not stale

Member

@gante gante left a comment

A few more pattern fixes and this should be ready to go 🤞

Comment on lines +436 to +439
if length_diff < 0:
    position_ids = position_ids[:, :length_diff]
elif length_diff > 0:
    new_position_ids = torch.arange(position_ids[0, -1], new_length, device=position_ids.device).unsqueeze(0)
Member

^ this function still needs better variable names and/or a docstring

Comment on lines +617 to +626
seq_length = (
    inputs_embeds.shape[1] if inputs_embeds is not None and past_key_values is None else input_ids.shape[1]
)
if position_ids is None:
    device = input_ids.device if input_ids is not None else inputs_embeds.device
    position_ids = self.get_position_ids_from_attention_mask(
        attention_mask, past_length, seq_length=seq_length, device=device
    )
else:
    position_ids = position_ids[:, -seq_length:]
Member

I think we can remove all this code, actually 👀 I see the following cases:

  1. position_ids is None -> the forward pass correctly computes position_ids, due to the changes in this PR
  2. position_ids is not None -> the user has defined position_ids, and it's their own responsibility to pass them correctly

WDYT? (this logic would apply to all models, and would make maintenance easier for us 👼 )
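
A rough sketch of what the simplified call site could look like under that proposal (illustrative, not the PR's exact code):

```python
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, position_ids=None, **kwargs):
    # no slicing or recomputation here: forward infers position_ids when they are None (case 1),
    # and user-provided position_ids are passed through untouched (case 2)
    return {
        "input_ids": input_ids,
        "past_key_values": past_key_values,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
        "use_cache": kwargs.get("use_cache"),
    }
```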

@@ -593,6 +593,7 @@ def forward(
argument[len("decoder_") :]: value for argument, value in kwargs.items() if argument.startswith("decoder_")
}

print(self.encoder)
Member

Suggested change
print(self.encoder)

@@ -702,7 +709,9 @@ def prepare_inputs_for_generation(self, inputs, past_key_values=None, use_cache=
attention_mask = kwargs.get("attention_mask", None)

if attention_mask is not None and position_ids is None:
    position_ids = tf.math.cumsum(attention_mask, axis=-1, exclusive=True)
Member

this one should be correct, no? 🤔

(the same comment applies to other TF models)

Comment on lines +348 to +355
    position_ids = tf.cumsum(tf.cast(attention_mask, tf.int64), axis=-1) - 1
    # create ones tensor to match dtypes, otherwise we get errors
    ones_tensor = tf.ones_like(position_ids, dtype=tf.int64)
    position_ids = tf.where(attention_mask == 0, ones_tensor, position_ids)
    position_ids = position_ids[..., -input_shape[-1] :]
    position_ids = tf.reshape(position_ids, (-1, input_shape[-1]))
else:
    position_ids = tf.expand_dims(tf.range(past_length, input_shape[-1] + past_length), axis=0)
Member

Suggested change
    position_ids = tf.cumsum(tf.cast(attention_mask, tf.int64), axis=-1) - 1
    # create ones tensor to match dtypes, otherwise we get errors
    ones_tensor = tf.ones_like(position_ids, dtype=tf.int64)
    position_ids = tf.where(attention_mask == 0, ones_tensor, position_ids)
    position_ids = position_ids[..., -input_shape[-1] :]
    position_ids = tf.reshape(position_ids, (-1, input_shape[-1]))
else:
    position_ids = tf.expand_dims(tf.range(past_length, input_shape[-1] + past_length), axis=0)
position_ids = tf.math.cumsum(attention_mask, axis=-1, exclusive=True)

(see comment below)

Member

the same logic applies to other TF models

Comment on lines +304 to +310
# when model weights are random init masking with attn_mask still leads to logits
# mismatch, which does not happen if pre-trained models are used. That causes error in encoder-decoder models
# when decoder_only is used in as backbone (GPT2), because GPT prepares positions depending on attn mask (for torch)
# and as arange in flax. That's why we init attn mask with all `1`
if "decoder_attention_mask" in pt_inputs:
pt_inputs["decoder_attention_mask"] = torch.ones_like(pt_inputs["decoder_attention_mask"])
inputs_dict["decoder_attention_mask"] = jnp.ones_like(inputs_dict["decoder_attention_mask"])
Member

This change should no longer be needed, correct?

(as a general rule, we shouldn't fudge these equivalence tests :) )

Comment on lines +146 to +147
# make full attn mask since below we are preparing position ids assuming it's all ones
attention_mask = jnp.ones_like(attention_mask)
Member

the other way around: we should update the creation of position_ids (below) to match the mask

The same comment applies to other FLAX test changes
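
For instance, something along these lines on the Flax side (a sketch only, mirroring the torch convention):

```python
import jax.numpy as jnp

attention_mask = jnp.array([[0, 0, 1, 1, 1]])  # left-padded example
position_ids = jnp.cumsum(attention_mask, axis=-1) - 1
# padded slots are masked out anyway; fill them with a dummy value, as the torch helper does
position_ids = jnp.where(attention_mask == 0, 1, position_ids)
print(position_ids)  # [[1 1 0 1 2]]
```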

@@ -149,8 +149,8 @@ def check_use_cache_forward_with_attn_mask(self, model_class_name, config, input
)

past_key_values = model.init_cache(input_ids.shape[0], max_decoder_length)
-position_ids = jnp.broadcast_to(
-    jnp.arange(input_ids.shape[-1] - 1)[None, :], (input_ids.shape[0], input_ids.shape[-1] - 1)
+position_ids = model.get_position_ids_from_attention_mask(
Member

yes, like this!

Comment on lines +417 to +424
# when model weights are random init masking with attn_mask still leads to logits
# mismatch, which does not happen if pre-trained models are used. That causes error in encoder-decoder models
# when decoder_only is used in as backbone (GPT2), because GPT prepares positions depending on attn mask (for torch)
# and as arange in flax. That's why we init attn mask with all `1`
if "decoder_attention_mask" in pt_inputs:
pt_inputs["decoder_attention_mask"] = torch.ones_like(pt_inputs["decoder_attention_mask"])
inputs_dict["decoder_attention_mask"] = jnp.ones_like(inputs_dict["decoder_attention_mask"])

Member

Same comment as for this pattern above: we should remove this

@github-actions github-actions bot closed this Jun 27, 2024
@zucchini-nlp
Member Author

Will reopen this later as a new PR. It will need merge conflicts resolved, changes propagated to new models, and the PR review comments addressed.

@huggingface huggingface deleted a comment from github-actions bot Jun 28, 2024
Successfully merging this pull request may close these issues.

Generate: support passing position_ids