Add diffllama #34083
base: main
Conversation
I am coding now, but it's my first time contributing to transformers or any other OSS project. I may ask you for some help.
Force-pushed from 765db6a to 269055e
I still have an error in modeling_diffllama.py at line 377 (apply_rotary_pos_emb): query_states must be torch.Size([2, 32, 10, 128]) but it is torch.Size([2, 64, 10, 64]). I need to change either query_states or cos & sin.
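For reference, here is a minimal shape sketch of that mismatch, assuming the usual [batch, num_heads, seq_len, head_dim] layout; the cos tensor below is just a random stand-in for the rotary table:

import torch

batch, seq_len = 2, 10
q_expected = torch.randn(batch, 32, seq_len, 128)  # shape apply_rotary_pos_emb expects
q_actual = torch.randn(batch, 64, seq_len, 64)     # shape produced by the halved heads

cos = torch.randn(seq_len, 128)  # rotary table built for head_dim=128
print((q_expected * cos).shape)  # broadcasts fine: torch.Size([2, 32, 10, 128])
# (q_actual * cos) raises a RuntimeError because 64 and 128 cannot broadcast,
# so either cos/sin must be rebuilt for head_dim=64, or query_states must keep
# the 32 x 128 layout and be split later (the approach suggested further down).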
Hey! I think this would be an awesome fit for modular transformers!
A bit of doc here: https://huggingface.co/docs/transformers/en/modular_transformers
This would help isolate the changes!
I've finished implementing the normal/eager attention, and I can run it with AutoModelForCausalLM.generate(). I've also adapted it to fit modular transformers.
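For reference, a minimal generation sketch; the checkpoint path and prompt are placeholders, not a released model:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/diffllama-checkpoint")
model = AutoModelForCausalLM.from_pretrained("path/to/diffllama-checkpoint")

inputs = tokenizer("Differential attention reduces", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))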
You don't need to divide by 2 if we use the same number of attention heads as Llama; instead you can just split in forward. Co-authored-by: Minho Ryu <[email protected]>
fit to changing the "num_heads // 2" place Co-authored-by: Minho Ryu <[email protected]>
new code is more meaningful than before Co-authored-by: Minho Ryu <[email protected]>
new code is more meaningful than before Co-authored-by: Minho Ryu <[email protected]>
fit to changing the "num_heads // 2" place Co-authored-by: Minho Ryu <[email protected]>
fix dividing by sqrt(self.head_dim) two times Co-authored-by: Minho Ryu <[email protected]>
fix dividing by sqrt(self.head_dim) two times Co-authored-by: Minho Ryu <[email protected]>
fit to changing the "num_heads // 2" place, and more visible Co-authored-by: Minho Ryu <[email protected]>
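Following the suggestion above (keep Llama's number of attention heads, split inside forward, and divide by sqrt(head_dim) only once), here is a minimal sketch of the differential-attention math; the function name, the plain-softmax formulation, the fixed scalar lambda, and the value layout (half the heads with doubled head dim, as in the Differential Transformer paper) are assumptions, not this PR's exact implementation:

import math
import torch
import torch.nn.functional as F

def differential_attention(q, k, v, lam):
    # q, k: [batch, num_heads, seq_len, head_dim] -- same head count as Llama,
    #       split into two halves here instead of halving the projections.
    # v:    [batch, num_heads // 2, seq_len, 2 * head_dim]
    # lam:  scalar weight of the second softmax map (causal mask omitted for brevity).
    q1, q2 = torch.chunk(q, 2, dim=1)
    k1, k2 = torch.chunk(k, 2, dim=1)
    scale = 1.0 / math.sqrt(q.size(-1))  # applied exactly once
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) * scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) * scale, dim=-1)
    return (a1 - lam * a2) @ v  # [batch, num_heads // 2, seq_len, 2 * head_dim]

b, h, s, d = 2, 32, 10, 128
out = differential_attention(
    torch.randn(b, h, s, d), torch.randn(b, h, s, d),
    torch.randn(b, h // 2, s, 2 * d), lam=0.5,
)
print(out.shape)  # torch.Size([2, 16, 10, 256])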
implemented flash and sdpa attention as well.
Co-authored-by: Minho Ryu <[email protected]>
Co-authored-by: Minho Ryu <[email protected]>
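Once the flash/SDPA variants are wired up, the backend is selected the standard way at load time; a hedged usage sketch with a hypothetical checkpoint path:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/diffllama-checkpoint",
    attn_implementation="sdpa",  # or "flash_attention_2" / "eager"
)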
All of your review comments have been implemented. I've tried the tests many times, but they don't pass. What should I do?
Hey! Sorry, we were all off for a week on a company-wide offsite! 🤗 @Cyrilvallez should be back on Monday!
I wonder whether this PR is still a work in progress, or whether most of the implementation has been finalized and it is waiting for the test coverage review?
BTW sorry for being late! Overall super good; what's left to do IMO is to use modular transformers (https://huggingface.co/docs/transformers/en/modular_transformers) to make it simpler, as a lot can inherit from Llama! Let me know if I can help!
Hey, sorry for the delay! You can simply put class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding): pass in the modular file. In your case, you will probably only need to rewrite the attention classes 😉
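A hedged sketch of what that modular file could look like; the import paths are the usual Llama module, but the DiffLlamaAttention body and the exact set of classes are assumptions:

# modular_diffllama.py (sketch): classes identical to Llama are subclassed with
# `pass`; the modular converter expands them into full standalone code.
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaRotaryEmbedding,
)


class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding):
    pass


class DiffLlamaAttention(LlamaAttention):
    # only the attention classes need a real rewrite (differential attention)
    def forward(self, *args, **kwargs):
        raise NotImplementedError("differential attention goes here")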
Are you still working on this PR, @weak-kajuma?
@Cyrilvallez Could you review again? I made the modular file.
Hey! A great first modular! But you can still cut a lot of code: the only difference here is the attention classes, so it's perfect for modular to pick up on everything else by itself!
LMK if you run into any issues.
You may need to rebase/merge on main.
@Cyrilvallez Could you review again? Modular transformers is very easy and good. I can also pass all the tests after merging the latest changes.
@Cyrilvallez any plans to review this PR?
Alright, very good! Final comments 🤗
class DiffLlamaRMSNorm(LlamaRMSNorm):
    pass


ALL_LAYERNORM_LAYERS.append(DiffLlamaRMSNorm)


class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding):
    pass


class DiffLlamaMLP(MistralMLP):
    pass
Should be removed!
If I remove DiffLlamaMLP, then AttributeError: 'DiffLlamaConfig' object has no attribute 'mlp_bias' occurs, so I cannot remove it.
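A hedged alternative to keeping DiffLlamaMLP would be to define the missing attribute on the config so the inherited Llama-style MLP finds it; the attribute names mirror LlamaConfig and the defaults below are assumptions:

from transformers import PretrainedConfig


class DiffLlamaConfig(PretrainedConfig):
    model_type = "diffllama"

    def __init__(self, hidden_size=4096, intermediate_size=11008, mlp_bias=False, **kwargs):
        # mlp_bias is the attribute the inherited MLP looks up; defining it here
        # would remove the AttributeError without a custom DiffLlamaMLP class.
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.mlp_bias = mlp_bias
        super().__init__(**kwargs)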
Force-pushed from 7b0da01 to b4ff5f3
What does this PR do?
This PR adds the code for DiffLlama, which is a Llama model with the Differential Transformer attention mechanism. Please refer to the Differential Transformer paper. @ArthurZucker