improve error msg for packed being incompatible #2056

Open
felipemello1 opened this issue Nov 23, 2024 · 1 comment

Comments

@felipemello1
Contributor

felipemello1 commented Nov 23, 2024

When running gemma2 with packed=True, I got the error below. It should be more informative.

NotImplementedError: Block masks are not implemeted yet, use packed=False.

Also, if Gemma is not compatible with packed, we should fix it and/or remove the option from the config.

@Optimox
Contributor

Optimox commented Nov 29, 2024

Hello @felipemello1,

The reason it does not work is a bit involved. Block masks could work with gemma 2 without any problem; in theory, nothing specific prevents them from working. The problem comes from flex attention, which I did not manage to implement for gemma 2, so gemma 2 does not use flex attention at the moment.

However, torchtune automatically checks whether flex attention is available, and when it is, the datasets automatically create block masks specific to flex attention. Together these create the current incompatibility (it does not exist if your version of torch does not include flex attention).
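
To illustrate the mismatch, here is a small standalone sketch (made-up shapes, not torchtune code; it assumes a recent torch build with flex attention and a GPU). The packed dataset builds a flex-attention BlockMask that keeps attention causal and within each document; a layer wired for flex attention can consume it, while a layer that still calls scaled_dot_product_attention cannot:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

# Packed batch of 128 tokens: three documents of lengths 50, 30 and 48.
# document_ids[i] is the document that token i belongs to.
document_ids = torch.repeat_interleave(
    torch.arange(3, device="cuda"), torch.tensor([50, 30, 48], device="cuda")
)

def document_causal(b, h, q_idx, kv_idx):
    # Causal within a document, no attention across document boundaries.
    return (document_ids[q_idx] == document_ids[kv_idx]) & (q_idx >= kv_idx)

# This is the kind of mask a packed dataset produces when flex attention is available.
block_mask = create_block_mask(document_causal, B=None, H=None, Q_LEN=128, KV_LEN=128, device="cuda")

q = k = v = torch.randn(1, 1, 128, 64, device="cuda")

# An attention layer wired for flex attention consumes the BlockMask directly...
out = flex_attention(q, k, v, block_mask=block_mask)

# ...but a layer that still calls scaled_dot_product_attention cannot, because SDPA
# expects a boolean/float attn_mask tensor, not a BlockMask object. That mismatch is
# what currently surfaces as the NotImplementedError for gemma 2.
# torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=block_mask)  # would fail
```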

Here are a few directions I see to fix this problem:

  • make flex attention usage a parameter in each recipe and disable it for gemma 2 for now (a rough sketch of this idea follows the list): I feel like hiding from users which attention implementation will actually run is not ideal.
  • implement a working flex attention version for gemma 2, but this would probably have to wait for pytorch to solve the issue on their side.
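
To make the first option concrete, here is a very rough sketch (illustrative names only, not an actual torchtune API): the recipe exposes an explicit flag, and block-mask creation follows that flag instead of the automatic "is flex attention importable?" check, so gemma 2 configs could simply keep it off:

```python
import torch

def build_packed_attention_mask(document_ids: torch.Tensor, use_flex_attention: bool):
    """Build an intra-document causal mask for a packed batch (illustrative sketch).

    If use_flex_attention is False (e.g. in gemma 2 recipes), fall back to a dense
    boolean mask that scaled_dot_product_attention can consume directly.
    """
    seq_len = document_ids.numel()

    if use_flex_attention:
        from torch.nn.attention.flex_attention import create_block_mask

        def document_causal(b, h, q_idx, kv_idx):
            return (document_ids[q_idx] == document_ids[kv_idx]) & (q_idx >= kv_idx)

        return create_block_mask(
            document_causal, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len,
            device=str(document_ids.device),
        )

    # Dense fallback with the same semantics, materialized as a (seq_len, seq_len) bool mask.
    same_doc = document_ids[:, None] == document_ids[None, :]
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=document_ids.device)
    )
    return same_doc & causal
```

The default could stay on for models that already have a flex attention path and be forced off in the gemma 2 configs until the kernel issue is resolved.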

On a related topic, did you successfully fine-tune gemma 2? I have been running the same (custom) pretraining script with both gemma 2 2B and llama3.2 3B; it yields good generations with llama 3.2 and catastrophic ones with gemma 2...
