Add diffllama #34083

Open · wants to merge 57 commits into main
Conversation

@weak-kajuma

What does this PR do?

This PR adds the code for DiffLlama, which is the Llama model combined with the Differential Transformer attention. Please refer to the Differential Transformer paper. @ArthurZucker
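For context, a minimal sketch of the differential attention idea this PR implements (illustrative shapes only; lambda is a plain scalar here, whereas the paper reparametrizes it with learnable per-head parameters, and the post-attention normalization is omitted):

import math
import torch
import torch.nn.functional as F

bsz, num_heads, seq_len, head_dim = 2, 32, 10, 64    # illustrative sizes

q1, q2 = torch.randn(2, bsz, num_heads, seq_len, head_dim)
k1, k2 = torch.randn(2, bsz, num_heads, seq_len, head_dim)
v = torch.randn(bsz, num_heads, seq_len, 2 * head_dim)
lam = 0.8                                            # stands in for the learnable lambda

a1 = F.softmax(q1 @ k1.transpose(-1, -2) / math.sqrt(head_dim), dim=-1)
a2 = F.softmax(q2 @ k2.transpose(-1, -2) / math.sqrt(head_dim), dim=-1)
out = (a1 - lam * a2) @ v                            # subtract the second attention map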

@weak-kajuma (Author)

I'm working on the code now, but this is my first time contributing to transformers or any other OSS project, so I may need to ask for some help.

@weak-kajuma (Author)

I still have an error in modeling_diffllama.py at line 377, in apply_rotary_pos_emb: query_states must be torch.Size([2, 32, 10, 128]), but it is torch.Size([2, 64, 10, 64]). I need to change either query_states or cos & sin.
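A minimal sketch of the mismatch and one possible remedy (illustrative tensors only, not this PR's code): cos and sin from the Llama rotary embedding are built for head_dim = 128, while the queries here use twice the heads at half the head dimension. One option, similar to the "split in forward" direction taken in later commits, is to keep the Llama layout through apply_rotary_pos_emb and split afterwards:

import torch

bsz, seq_len = 2, 10
num_heads, head_dim = 32, 128                # layout expected by cos/sin

# keep the projection in the Llama layout ...
query_states = torch.randn(bsz, num_heads, seq_len, head_dim)

# ... apply_rotary_pos_emb(query_states, key_states, cos, sin) as in Llama ...

# ... and only then split into the two query groups of differential attention
q1, q2 = query_states.chunk(2, dim=-1)       # each torch.Size([2, 32, 10, 64])
assert q1.shape == (bsz, num_heads, seq_len, head_dim // 2)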

@ArthurZucker (Collaborator) left a comment

Hey! I think this would be an awesome fit for modular transformers!
A bit of doc here: https://huggingface.co/docs/transformers/en/modular_transformers

This would help isolate the changes!

@weak-kajuma (Author)

I've finished the normal/eager attention, and I can run generation with AutoModelForCausalLM.generate().
Next I'll adapt it for FlashAttention2 and SDPA attention.
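For reference, once all three backends are implemented they would be selected through the standard attn_implementation argument (the checkpoint path below is hypothetical, since no DiffLlama checkpoint exists yet):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/diffllama-checkpoint",       # hypothetical local checkpoint
    attn_implementation="sdpa",           # or "eager" / "flash_attention_2"
)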

@weak-kajuma (Author)

I've also adapted it to modular transformers.

weak-kajuma and others added 8 commits October 20, 2024 11:52 (each co-authored by Minho Ryu <[email protected]>):

- You don't need to divide by 2 if we use the same number of attention heads as Llama; instead you can just split in forward.
- Fit to changing the "num_heads // 2" place.
- New code is more meaningful than before.
- New code is more meaningful than before.
- Fit to changing the "num_heads // 2" place.
- Fix dividing by sqrt(self.head_dim) twice.
- Fix dividing by sqrt(self.head_dim) twice.
- Fit to changing the "num_heads // 2" place, and make it more readable.
@bzantium (Contributor) left a comment

Implemented flash and SDPA attention as well.

Review thread on src/transformers/models/diffllama/modeling_diffllama.py (outdated, resolved)
@weak-kajuma (Author)

@bzantium
I found that the attention was still mis-implemented relative to the paper as of e072544.
So I'll revert to e072544 and re-implement it with your suggested code style.

@weak-kajuma (Author)

All of your review comments are implemented. I've tried the test many times, but it still doesn't pass. What should I do?
To: @Cyrilvallez

@ArthurZucker (Collaborator)

Hey! Sorry we were all off for a week on a company-wide offsite! 🤗 @Cyrilvallez should be back on monday!

@ArthurZucker ArthurZucker self-requested a review November 15, 2024 22:01
@effortprogrammer

Is this PR still a work in progress, or has most of the implementation been finalized and it's now waiting for test coverage review?

@weak-kajuma weak-kajuma changed the title [WIP] Add diffllama [Request Reviews]Add diffllama Nov 20, 2024
@ArthurZucker (Collaborator) left a comment

BTW sorry for being late! Overall super good, what's left to do IMO is use modular transformers https://huggingface.co/docs/transformers/en/modular_transformers to make it simpler (as a lot can inherit from Llama)! Let me know if I can help!

@weak-kajuma weak-kajuma changed the title [Request Reviews]Add diffllama Add diffllama Nov 20, 2024
@Cyrilvallez (Member)

Hey, sorry for the delay!
In order to use modular transformers, you need to create a new file, modular_diffllama.py, in which you can use inheritance from the different Llama classes. Then, to automatically create the modeling_diffllama.py file, just use our CLI: python utils/modular_model_converter.py --files_to_parse src/transformers/models/diffllama/modular_diffllama.py from the root of the transformers repo 🤗
LMK if you need more guidance for this! You can find some modular examples, e.g. here.
Basically, for any class similar to a Llama class, you can directly inherit from it to avoid rewriting it; e.g. if DiffLlamaRotaryEmbedding is similar to LlamaRotaryEmbedding, you can use

class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding):
    pass

in the modular file. In your case, you will probably need to only rewrite the attention classes 😉
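For illustration, a minimal sketch of what such a modular_diffllama.py could look like (the class bodies are assumptions for this sketch, not the PR's actual file): everything that matches Llama is simply inherited, and only the attention class carries the differential-attention changes.

from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaForCausalLM,
    LlamaModel,
)


class DiffLlamaAttention(LlamaAttention):
    # override forward() with the differential attention computation
    def forward(self, *args, **kwargs):
        raise NotImplementedError  # omitted in this sketch


class DiffLlamaModel(LlamaModel):
    pass


class DiffLlamaForCausalLM(LlamaForCausalLM):
    pass

Running python utils/modular_model_converter.py --files_to_parse src/transformers/models/diffllama/modular_diffllama.py would then regenerate modeling_diffllama.py from this file.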

@effortprogrammer commented Nov 30, 2024

Are you still working on this PR, @weak-kajuma ?

@weak-kajuma (Author)

@Cyrilvallez Could you review again? I made modular_diffllama.py.

@Cyrilvallez (Member) left a comment

Hey! A great first modular! But you can still cut a lot of code: the only differences here are the attention classes, so modular can pick up everything else by itself!
LMK if you run into any issues.

@Cyrilvallez (Member)

You may need to rebase/merge on main though for modular to work perfectly, as you seem to be a bit far behind. If something does not work as expected after my comments, you should try that first 🤗

@weak-kajuma (Author)

@Cyrilvallez Could you review again? Modular transformers is very easy to use and works well. Also, all tests now pass after merging the latest changes.

@effortprogrammer

@Cyrilvallez any plans to review this PR?

@Cyrilvallez (Member) left a comment

Alright, very good! Final comments 🤗

Comment on lines 54 to 66:

class DiffLlamaRMSNorm(LlamaRMSNorm):
    pass


ALL_LAYERNORM_LAYERS.append(DiffLlamaRMSNorm)


class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding):
    pass


class DiffLlamaMLP(MistralMLP):
    pass
Reviewer (Member):

Should be removed!

@weak-kajuma (Author):

If I remove DiffLlamaMLP, I get AttributeError: 'DiffLlamaConfig' object has no attribute 'mlp_bias', so I cannot remove it.
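A hypothetical sketch of one way around this, not what the PR actually does: the error suggests the generated MLP reads config.mlp_bias (as LlamaMLP does), so defining that attribute on DiffLlamaConfig could also allow dropping the DiffLlamaMLP subclass. All names and defaults below are assumptions for illustration.

from transformers import PretrainedConfig


class DiffLlamaConfig(PretrainedConfig):
    model_type = "diffllama"

    def __init__(self, hidden_size=2048, intermediate_size=8192, mlp_bias=False, **kwargs):
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.mlp_bias = mlp_bias  # read by the Llama-style MLP when building its nn.Linear layers
        super().__init__(**kwargs)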

Review thread on src/transformers/models/diffllama/modular_diffllama.py (outdated, resolved)
Review thread on utils/check_config_docstrings.py (outdated, resolved)