Cache: Bart and related architectures support Cache objects #28065

Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@amyeroberts this PR is not finalized, but I'd love to get an early review -- the failing tests are fixed by propagating the changes to models with the [...]. The key parts to review now are labeled as [...].
Impressive piece of work 🔥
I've just paid attention to the addition in `cache_utils` and the changes in BART. Just some nits and questions on my side for understanding, but overall the structure looks great! Would be good to get a second set of eyes from someone with more cache experience on this too.
```
`past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or
`config.use_cache=True`.

Two formats are allowed:
```
Is passing inputs in the legacy format discouraged? If both are allowed, then we should update the type hint to cover both; if the legacy format is deprecated, I'd reword this, as we don't want to encourage passing in the old format.
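For context, a minimal sketch (my own illustration, not code from this PR) of the two `past_key_values` formats for a plain decoder-only model, using the existing `DynamicCache` helpers; Bart's legacy tuples additionally carry cross-attention key/values per layer, which is what the new `DynamicCacheWithCrossAttention` is meant to absorb:

```python
# Illustration only: the two `past_key_values` formats for a decoder-only model.
# (Bart's legacy entries also contain cross-attention key/values per layer.)
import torch
from transformers.cache_utils import DynamicCache

batch, heads, seq_len, head_dim, num_layers = 1, 4, 3, 8, 2

# Legacy format: a tuple with one (key, value) pair of tensors per layer.
legacy = tuple(
    (
        torch.randn(batch, heads, seq_len, head_dim),
        torch.randn(batch, heads, seq_len, head_dim),
    )
    for _ in range(num_layers)
)

# Cache-object format: the same tensors wrapped behind a uniform interface.
cache = DynamicCache.from_legacy_cache(legacy)
print(cache.get_seq_length())        # 3
print(len(cache.to_legacy_cache()))  # 2 -> back to one entry per layer
```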
```python
is_cross_attention = key_value_states is not None

if is_cross_attention:
```
nit - this variable is only used once, and on the immediately following line - the comment provides enough context
```diff
- is_cross_attention = key_value_states is not None
- if is_cross_attention:
+ if key_value_states is not None:
```
```python
# Keep only the unprocessed tokens:
# 1 - If the length of the decoder_attention_mask exceeds the length of decoder_input_ids, then we are in a
# setting where some of the inputs are exclusivelly passed as part of the cache (e.g. when passing
```
ultranit
```diff
- # setting where some of the inputs are exclusivelly passed as part of the cache (e.g. when passing
+ # setting where some of the inputs are exclusively passed as part of the cache (e.g. when passing
```
```python
# Keep only the unprocessed tokens:
# 1 - If the length of the decoder_attention_mask exceeds the length of decoder_input_ids, then we are in a
# setting where some of the inputs are exclusivelly passed as part of the cache (e.g. when passing
# input_embeds as input)
```
Just for my own understanding, am I right in thinking that the reason they're exclusively part of the cache when input_embeds is passed is that any input_ids must have been generated?
```python
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# decoder_input_ids based on the past_length.
elif past_length < decoder_input_ids.shape[1]:
    decoder_input_ids = decoder_input_ids[:, past_length:]
```
And in this case - we're removing tokens that have already been seen, i.e. have been processed and are part of the cache?
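To make the two questions above concrete, here is a toy walk-through (illustrative only, with made-up shapes; not code from the PR) of how the cropping behaves:

```python
# Toy numbers: 4 tokens already live in the cache, the mask covers 8 positions,
# and decoder_input_ids carries 6 token ids.
import torch

past_length = 4
decoder_input_ids = torch.tensor([[10, 11, 12, 13, 14, 15]])  # 6 tokens
decoder_attention_mask = torch.ones(1, 8)                     # 8 known positions

if decoder_attention_mask.shape[1] > decoder_input_ids.shape[1]:
    # Case 1: part of the prompt was passed as embeddings, so it only exists in the
    # cache; keep the last (mask_len - past_length) ids, i.e. the unprocessed tail.
    decoder_input_ids = decoder_input_ids[:, -(decoder_attention_mask.shape[1] - past_length):]
elif past_length < decoder_input_ids.shape[1]:
    # Case 2: decoder_input_ids holds the whole sequence; drop the cached prefix.
    decoder_input_ids = decoder_input_ids[:, past_length:]
# Case 3: otherwise, assume decoder_input_ids already contains only new tokens.

print(decoder_input_ids)  # tensor([[12, 13, 14, 15]])
```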
```python
cache_length = past_length = past_key_values[0][0].shape[2]
max_cache_length = None

# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusivelly passed as part of the cache (e.g. when passing input_embeds as
# input)
if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
    input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
    input_ids = input_ids[:, past_length:]
# 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.

# If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
if (
    max_cache_length is not None
    and attention_mask is not None
    and cache_length + input_ids.shape[1] > max_cache_length
):
    attention_mask = attention_mask[:, -max_cache_length:]
```
By eye, this looks equivalent to the logic above, just with input_ids instead of decoder_input_ids -> can we abstract out the common logic here?
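One possible shape for such a shared helper -- purely a sketch, with a made-up name and signature rather than anything from this PR:

```python
def _crop_uncached_tokens(input_ids, attention_mask, past_length, cache_length, max_cache_length):
    """Drop token ids that are already represented in the cache (hypothetical helper)."""
    if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
        # Some inputs only exist in the cache (e.g. they were passed as embeddings).
        input_ids = input_ids[:, -(attention_mask.shape[1] - past_length):]
    elif past_length < input_ids.shape[1]:
        # input_ids holds the full sequence; keep only the uncached suffix.
        input_ids = input_ids[:, past_length:]
    # Otherwise, assume input_ids already contains only unprocessed tokens.

    # Crop the mask if appending the new tokens would exceed the maximum cache length.
    if (
        max_cache_length is not None
        and attention_mask is not None
        and cache_length + input_ids.shape[1] > max_cache_length
    ):
        attention_mask = attention_mask[:, -max_cache_length:]
    return input_ids, attention_mask
```

The decoder-only and encoder-decoder paths would then call it with `input_ids` and `decoder_input_ids` respectively.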
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Mr bot, this is not stale (on hold while the static cache is being worked on, as they will likely have overlapping changes and the static cache is more important)
Closing this PR; at this point it's easier to start from scratch.
What does this PR do?
This PR applies the changes to `Bart` so it supports the new `Cache` objects. In other words, it is akin to #26681 but for encoder-decoder models. The main changes are in two files:

- `cache_utils.py`: I've introduced `DynamicCacheWithCrossAttention`, which expands `DynamicCache` [the cache object equivalent to the previous `past_key_values` input/output] with the ability to hold a cross-attention cache. This design was intentional: most LLMs (and now even multimodal models) tend to be decoder-only, so this separation keeps the cache class for decoder-only models simpler. It also enables us to be more strict -- I've caught an unintended cache deletion in Whisper thanks to the increased specificity! (A rough sketch of the idea follows this list.)
- `modeling_bart.py`: These changes are the equivalent of the modeling changes in Generate: New `Cache` abstraction and Attention Sinks support #26681, but for encoder-decoder models.
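As a rough sketch of the idea (illustrative only -- the real class in this PR may differ, and the method names below beyond `DynamicCacheWithCrossAttention` and `DynamicCache` are assumptions), the cache keeps growing self-attention key/values per step like `DynamicCache`, while the cross-attention key/values are written once from the encoder states and then reused:

```python
from typing import List, Tuple

import torch
from transformers.cache_utils import DynamicCache


class DynamicCacheWithCrossAttention(DynamicCache):
    """Sketch: DynamicCache plus a per-layer cross-attention cache.

    The cross-attention key/values come from the encoder output, so they are
    written once per layer and never grow during decoding.
    """

    def __init__(self):
        super().__init__()
        self.cross_attention_key_cache: List[torch.Tensor] = []
        self.cross_attention_value_cache: List[torch.Tensor] = []

    def update_cross_attention(
        self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx: int
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Hypothetical method name: store the cross-attention states for `layer_idx`
        # on the first decoding step and return the cached states afterwards.
        if len(self.cross_attention_key_cache) <= layer_idx:
            self.cross_attention_key_cache.append(key_states)
            self.cross_attention_value_cache.append(value_states)
        return (
            self.cross_attention_key_cache[layer_idx],
            self.cross_attention_value_cache[layer_idx],
        )
```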
The remaining changes come from `make fix-copies` (plus a few manual changes like adding imports or updating docstrings), or are test upgrades for the new `DynamicCacheWithCrossAttention`.

The following tests were run locally -- this includes FA2 and some pretty challenging tests to ensure nothing was broken in the process:
RUN_SLOW=1 py.test tests/models/bart/test_modeling_bart.py -vv
RUN_SLOW=1 py.test tests/models/mbart/test_modeling_mbart.py -vv
RUN_SLOW=1 py.test tests/models/whisper/test_modeling_whisper.py -vv
👉 In any case, we should run the slow CI before merging!
Note on Whisper: same failures as in `main`.