Update Gemma to reflect upstream HF changes #596
Merged
Description
Fixes #594.
Since Gemma was merged on March 14th, there have been a number of upstream HuggingFace changes, which meant our activations/logits no longer matched that implementation well. This PR implements the changes described here; the changes include:
A demo showing this agreement can be found here and is summarised below. Note: there is currently a bug in various HF/transformers models where the default attention implementation does not apply causal masking. I opened an issue here, but it is already being addressed here. For this reason, when comparing activations/logits it is important to specify attn_implementation="eager" when using HF/transformers' from_pretrained method.
Tolerances for both the 2b and 7b models when running in float32 can be found below:
Tolerances when running in float16 are much worse, but I think this is commonly the case when adding models to TL with internal upcasting/downcasting. If that is not the case and these tolerances should also be similarly close, I can investigate further.
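For reference, a minimal sketch of the kind of float32 comparison described above, assuming the TransformerLens model name "gemma-2b" and using illustrative tolerances rather than the exact values reported in this PR:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformer_lens import HookedTransformer

model_name = "google/gemma-2b"
prompt = "The quick brown fox jumps over the lazy dog"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer(prompt, return_tensors="pt")["input_ids"]

# Eager attention is required so the HF reference applies causal masking correctly.
hf_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    attn_implementation="eager",
)

tl_model = HookedTransformer.from_pretrained("gemma-2b", dtype=torch.float32)

with torch.no_grad():
    hf_logits = hf_model(tokens).logits
    tl_logits = tl_model(tokens, return_type="logits")

# Compare the two implementations' logits on the same token ids.
max_diff = (hf_logits - tl_logits).abs().max()
print(f"Max absolute difference: {max_diff:.2e}")
# atol/rtol here are illustrative, not the tolerances measured in this PR.
torch.testing.assert_close(tl_logits, hf_logits, atol=1e-4, rtol=1e-4)
```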
Type of change
Please delete options that are not relevant.
Checklist: